Fri. Feb 3rd, 2023
This data may contain malware.

With everyone from academics to Microsoft eyeing the prospect of storing data using DNA, it was probably inevitable that someone would start looking at the security implications. Apparently they are worse than most people expected. It appears to be possible to code computer malware into DNA and thus attack vulnerabilities on the computer that analyzes the sequence of that DNA.

The researchers found no actual vulnerability in DNA analysis software, but specifically created a version of certain software with an exploitable vulnerability to demonstrate that the risk is more than hypothetical. Still, an audit of some open source DNA analysis software shows that the academics who wrote it haven’t paid much attention to security best practices.

More like a virus than most

DNA sequencing involves determining the precise order of the bases that make up a DNA strand. While the process that generates the sequence generally combines biology and chemistry, once read, the sequence is usually stored as an ASCII string of As, Ts, Cs, and Gs. If that data is not handled properly, it could exploit vulnerable software to execute arbitrary code. And sequence data tends to pass through a lot of software, which finds overlapping sequences, matches them to known genomes, looks for key differences, and more.

To see if this threat was more than hypothetical, the researchers started with a very simple exploit, a buffer overflow: write more data than a chunk of memory can hold and redirect program execution to the excess. In this case, the excess contained an exploit that would use a function of the bash shell to connect to a remote server that the researchers monitored. If it worked, the server would have full shell access to the machine running the DNA analysis software.

However, actually implementing this in DNA proved to be a challenge. DNA rich in Gs and Cs forms a stronger double helix. Too many of them, and the strand won’t open easily for sequencing. Too few, and it will pop open when you don’t want it to. Repetitive DNA can form complex structures that get in the way of the enzymes we normally use to manipulate DNA. The computer code the researchers wanted to use, however, had many long runs of the same character, making for a repetitive sequence with very few Gs and Cs. The company they ordered DNA from couldn’t even synthesize it.
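The two constraints described above, GC balance and repetitiveness, are easy to check in software before ordering a strand. The following is a minimal sketch; the function names and the numeric thresholds are illustrative assumptions, not the rules of any particular synthesis company.

```python
from itertools import groupby


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def longest_run(seq: str) -> int:
    """Length of the longest stretch of a single repeated base."""
    return max((len(list(group)) for _, group in groupby(seq.upper())), default=0)


def looks_synthesizable(seq: str, min_gc: float = 0.3, max_gc: float = 0.7,
                        max_run: int = 8) -> bool:
    """Rough screen; the default thresholds are placeholders (assumptions)."""
    return min_gc <= gc_fraction(seq) <= max_gc and longest_run(seq) <= max_run
```

A sequence such as forty As in a row fails both checks: it contains no Gs or Cs and is one long repeat, which is the kind of problem the first version of the payload ran into.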

In the end, they had to completely redesign their malware so that its translation into nucleic acids produced a strand of DNA that could be synthesized and sequenced. Sequencing created another hurdle. The most common sequencing method is currently limited to reading a few hundred bases at a time. Since each base carries two bits of information, the malware must be extremely compact. That limits what can be done, and it explains why this particular payload just opened a remote connection.
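To make the two-bits-per-base idea concrete, here is a small sketch that packs ordinary bytes into bases and back. The specific mapping (A=00, C=01, G=10, T=11) and the function names are assumptions chosen for illustration; any fixed assignment of the four bases to the four two-bit values would work the same way.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # ValueError on non-ACGT
        out.append(byte)
    return bytes(out)
```

Under this scheme four bases carry one byte, so a 300-base read holds only 75 bytes. That is very little room for a payload, which is why it has to be so compact.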

Then there was the issue of running the malware. Since this was a proof of concept, the researchers made it easy for themselves: they modified an existing tool to create an exploitable vulnerability. They also changed the system configuration to make it easier to execute code at arbitrary memory locations (they made the stack executable and disabled memory address randomization). While that makes the test environment less realistic, the goal was simply to demonstrate that DNA-delivered malware was possible.

Once everything was in place, they ordered some DNA online and then sent it to a sequencing facility. When their sequences came back, they sent them through a software pipeline that included their vulnerable utility. Almost immediately, the computer running the software connected to their host, giving them access to the machine. The malware worked.

Semi-realism

Given how easy the authors have made it – one known vulnerability and some protections disabled – does this really pose a threat? There is good news and bad news here.

On the good side, there are the complications of translating computer instructions into DNA that can be synthesized and sequenced. In addition, most sequencing machines are limited in how long a sequence they can read. The machine used in this work reads a maximum of 300 bases, which equates to 600 bits, and most facilities keep reads shorter than that. Machines that read longer sequences are available, but they are also error-prone, and any errors will usually disable the malware.

But it’s also common for the software used to analyze DNA to look for places where two short sequences overlap and use that to build longer sequences. This has the potential to significantly expand the size of the malware, although less of the analysis software pipeline will be exposed to these longer, curated sequences.
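The overlap step can be sketched in a few lines. This is a deliberately naive illustration with an assumed function name and an arbitrary minimum overlap; real assemblers tolerate sequencing errors and work over millions of reads.

```python
from typing import Optional


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads if a suffix of `left` exactly matches a prefix of `right`.

    Tries the longest possible overlap first and returns None when no
    overlap of at least `min_overlap` bases exists.
    """
    longest = min(len(left), len(right))
    for size in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


# Two 8-base reads sharing "TGCA" become one 12-base sequence.
merged = merge_reads("ACGTTGCA", "TGCAGGTC")
```

The merged sequence is longer than either read, which is how an assembled sequence could carry more than a single read's worth of data.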

Similar issues exist with the way the malware is coded. While the authors used each base to encode two bits, DNA analysis software internally processes DNA in different ways. For example, if sequencing does not give a clear indication of what a base is, other characters can be used (for example, N for any base, or R for G or A). Any software dealing with these ambiguous bases must have a more complex coding scheme; many simply use ASCII characters.
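The ambiguity characters mentioned above come from the standard IUPAC nucleotide codes. The sketch below lists them and shows why two bits per symbol stop being enough once they are allowed; the helper names are hypothetical.

```python
# IUPAC nucleotide codes: each symbol maps to the bases it may stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def bits_per_symbol(alphabet_size: int) -> int:
    """Smallest number of bits that can distinguish `alphabet_size` symbols."""
    return max(1, (alphabet_size - 1).bit_length())


def is_unambiguous(seq: str) -> bool:
    """True when every symbol names exactly one base."""
    return all(len(IUPAC[symbol]) == 1 for symbol in seq.upper())
```

Four plain bases fit in `bits_per_symbol(4)`, which is 2 bits, while the 15 IUPAC symbols need `bits_per_symbol(15)`, which is 4 bits. A tool that stores sequences as 4-bit codes or as ASCII characters would therefore see a different byte pattern from the same DNA than a tool using a 2-bit packing.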

As a result, different pieces of software will be vulnerable to different malware encodings. While that means some software will be immune, DNA analysis pipelines are typically large, with a dozen or more pieces of software running in tandem. Chances are that at least one of them uses the same encoding as the malware.

Bad habits

The habits of the research community are also a significant source of vulnerability. Analysis software is generally not written with security in mind. Using the analysis tools of the Clang compiler and HP’s Fortify static code analyzer, the authors scoured a collection of open source DNA analysis software for potential vulnerabilities. They found widespread use of functions prone to buffer overflows (strcat, strcpy, sprintf, vsprintf, gets, and scanf), about two calls per 1,000 lines of code. “Our research suggests that DNA sequencing and analysis have received no significant or hostile pressure to date,” they conclude.
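A crude version of this kind of audit can be approximated with a text search. The sketch below is not how Clang's tools or Fortify work; it is a simple regular-expression count, with hypothetical function names, of calls to the six functions named above in a string of C source.

```python
import re

UNSAFE = ("strcat", "strcpy", "sprintf", "vsprintf", "gets", "scanf")
# \b keeps safer relatives such as fgets, snprintf, and sscanf from matching.
CALL = re.compile(r"\b(" + "|".join(UNSAFE) + r")\s*\(")


def count_unsafe_calls(source: str) -> dict:
    """Count apparent calls to each overflow-prone function in C source text."""
    counts = {name: 0 for name in UNSAFE}
    for match in CALL.finditer(source):
        counts[match.group(1)] += 1
    return counts


def calls_per_kloc(source: str) -> float:
    """Apparent unsafe calls per 1,000 lines of source."""
    lines = len(source.splitlines())
    if lines == 0:
        return 0.0
    return sum(count_unsafe_calls(source).values()) * 1000 / lines
```

Because it only looks at text, this approach also counts matches inside comments and string literals and says nothing about whether a given call is actually exploitable. It gives a rough density figure at best.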

The second problem is how easily malicious code can reach other machines via DNA. Sequencing machines have such a high capacity that samples from several labs are run simultaneously on one machine. As a result, some of the sequences from one sample end up mixed into the results of an unrelated one. When the researchers checked the results of another group whose sequencing ran at the same time, they found that the other group’s results included 27 instances of the malware.
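Checking a set of reads for a known sequence is straightforward, which is also how a lab could screen its own results for this kind of cross-sample leakage. The sketch below is an assumption about how such a count might be done, not the researchers' method. It checks both strands, since a read may come from either one.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_reads_containing(reads, marker: str) -> int:
    """Number of reads that contain `marker` on either strand."""
    forward = marker.upper()
    reverse = reverse_complement(marker)
    return sum(1 for read in reads
               if forward in read.upper() or reverse in read.upper())
```

Exact matching like this misses reads with sequencing errors, so a real screen would likely allow a few mismatches.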

Separately, many services allow you to simply send any DNA for sequencing, putting their software at risk. And many public repositories allow people to upload their sequence for analysis by others. So you don’t even need to synthesize DNA to have your exploit analyzed – you can simply upload the text of the sequence you designed to someone else’s data repository.

None of this means that a DNA-based exploit is just around the corner. But it’s a healthy warning that the research community and commercial DNA companies should try to improve their practices before this becomes a problem.

By akfire1
