A few years ago, a company called Oxford Nanopore announced that it was developing a radically different way to sequence DNA. The approach involved taking strands of the double helix and threading them through a protein pore. With a small current flowing through the pore, each of the four bases of DNA created a distinct (if small) change in that current as it passed through. These changes could be used to read the DNA base by base as it wound through the pore.
After several years of slow progress, Oxford Nanopore announced that its sequencing hardware would be as distinctive as its wetware: a USB device that would fit comfortably in a person’s hand. When the first devices reached users, it became clear that the hardware had both advantages and disadvantages. On the plus side, the device was fast and could be used without a large facility to support it. It could also read very long stretches of DNA at once. But the downside was significant: it made a lot of mistakes.
With a few years of experience, people are now beginning to learn how to get the most out of the devices, as evidenced by a new paper in which researchers use one to sequence a human genome. By exploiting the machine’s long reads — nearly 900,000 bases from a single DNA molecule in one case — the authors were able to get data from parts of the human genome that had previously resisted characterization. They were also able to distinguish between the two sets of chromosomes (one from mother, one from father) and locate sites of epigenetic control in many regions of the genome.
In light of all the different information it can provide, the machine’s error rate seems less of an issue.
Errors and corrections
We have DNA sequencing machines that make very few mistakes. Unfortunately, they can only read DNA in pieces of about 200 bases. Software must then recognize where these small pieces overlap and use the overlaps to build larger sequences. The process fails when DNA repeats itself or when very similar sequences appear in multiple parts of the genome – the software simply cannot tell what goes where.
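The overlap-and-merge idea can be sketched in a few lines of Python. This is a toy illustration, not the actual assembly software: it greedily merges whichever pair of reads shares the longest suffix-prefix overlap, and it is exactly this matching step that breaks down when a repeated sequence overlaps equally well in several places.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that matches a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def merge_reads(reads, min_len=3):
    """Greedy toy assembly: repeatedly merge the pair with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_i is None:
            break  # no overlaps left; the sequence stays fragmented
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Three short "reads" of one underlying sequence assemble into a single piece.
print(merge_reads(["ATTAGACCTG", "CCTGCCGGAA", "GCCGGAATAC"]))
```

Real assemblers use far more efficient overlap detection and account for sequencing errors, but the fragility is the same: identical repeats produce equally good overlaps at multiple positions, and the merge order becomes ambiguous.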
As we saw with the axolotl genome, it is possible to use longer, error-prone reads to sort out the mess. The high-accuracy machines provide the sequence, while the longer reads show how those sequences fit together into larger chunks. There will still be gaps, but there will be fewer of them, and more of the sequence will end up in large chunks rather than small fragments. While the axolotl genome relied on machines from Pacific Biosciences, the nanopore system should also work for this purpose.
Or at least it should. Part of the goal of the new paper was to confirm that it does, and much of the paper involves figuring out how to extract the best possible sequence from the authors’ nanopore hardware. For example, they tried two different software packages for interpreting the current data coming off the machine and found that a community-developed, open source package built around a neural network produced the best results. Combining nanopore reads with shorter, high-quality reads raised the overall accuracy of the genome assembly to 99.88 percent, showing that the approach works.
But the researchers went much further than that. On its own, the nanopore sequence had an accuracy of only 92 percent. Combining multiple reads of the same stretch of DNA from the machine raised the accuracy to over 97 percent. A separate software package could then examine positions where different reads disagreed and decide which were likely correct; this raised the accuracy to 99.44 percent. That’s not as good as the short, high-quality reads, but it’s close enough for many purposes. Adding the high-quality short reads on top of this raised the accuracy to 99.96 percent.
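A simplified way to see how stacking reads of the same stretch improves accuracy is per-position majority voting. This is only a stand-in for the consensus software the paper actually used (real tools must also handle inserted and deleted bases, which this sketch ignores):

```python
from collections import Counter

def consensus(reads):
    """Majority vote at each position across aligned reads of one stretch.

    Assumes all reads are the same length and already aligned. A random
    error in one read gets outvoted by the other reads at that position.
    """
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*reads)
    )

# Each read carries one error, but the errors land at different positions,
# so the vote recovers the underlying sequence.
print(consensus(["ACGTACGT", "ACGAACGT", "ACGTACCT"]))
```

Because an error only survives the vote when multiple reads happen to be wrong at the same position, accuracy climbs quickly as coverage increases, which is consistent with the jump from 92 percent for a single read to over 97 percent for combined reads.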
The nanopores also offered some very clear advantages. For example, the activity of genes can be changed by so-called epigenetic modifications – chemical alterations of some bases that do not change the DNA sequence itself. These modifications also subtly alter the current as a base passes through the pore, allowing the researchers to identify where they occur in the genome.
We also inherit two copies of each chromosome (the X and Y in males being the exception): one from mom and one from dad. While these copies differ, long stretches of the underlying DNA are identical, making it impossible to use short DNA reads to determine which copy a given read came from. As a result, although you can see where the differences are, you cannot tell which differences are inherited together on the same chromosome. The long nanopore reads make this possible.
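The geometry behind this can be illustrated with a small sketch (illustrative only, not the authors’ method): two variant sites can only be directly assigned to the same chromosome copy if a single read spans both of them, so the span of a read determines which sites it can link.

```python
from itertools import combinations

def linkable_pairs(variant_positions, read_length):
    """Pairs of variant sites that one read of `read_length` could span.

    Only when a single read covers two sites can we say directly that the
    variants it carries sit on the same chromosome copy.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(variant_positions), 2)
        if b - a < read_length
    ]

# 200-base reads cannot link variants at positions 100, 350, and 5,000...
print(linkable_pairs([100, 350, 5000], 200))
# ...but a 100,000-base nanopore read connects all three.
print(linkable_pairs([100, 350, 5000], 100_000))
```

Chaining such linked pairs across overlapping long reads is what lets the two inherited chromosome copies be reconstructed separately.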
Finally, the researchers worked to make the reads as long as possible. DNA is a long, thin molecule, and handling solutions of long DNA tends to break it into small fragments, since the movement of the fluid creates shear and stretching forces. With sufficient care, however, these forces can be minimized. When the authors took these precautions, typical read lengths on the nanopore machine skyrocketed to over 100,000 bases; one read reached 882,000.
That was long enough to close some of the gaps left by the original project that mapped the human genome. One gap was 50,000 bases long and contained a duplication of a gene. Another held eight copies of a repeated sequence in rapid succession. In the long run, this approach should make it possible to truly finish the genome.
However, the work identified some shortcomings on the software side. For example, a common file format used to store DNA data is not designed to handle sequences this long. As a result, some analysis software could not work with the nanopore reads at all. Because of these compatibility issues, the team had to rely on a very processor-intensive algorithm for some of their analyses.
The impressive results that came from this analysis suggest that it is worth getting the software up to date.
Nature Biotechnology, 2017. DOI: 10.1038/nbt.4060 (About DOIs).