Summary of "Scientists can now assemble entire genomes on their personal computers in minutes":
- The technique, described in a study published September 14 in the journal Cell Systems, allows for a more compact representation of genome data, inspired by the way in which words, rather than letters, offer condensed building blocks for language models.
- Third-generation sequencing technologies produce terabytes of high-quality genomic sequences, each tens of thousands of base pairs long, yet genome assembly using such an immense quantity of data has proved challenging.
- Building from the concept of a de Bruijn graph, a simple, efficient data structure used for genome assembly, the researchers developed a minimizer-space de Bruijn graph (mdBG), which uses short sequences of nucleotides called minimizers instead of single nucleotides.
- “Our minimizer-space de Bruijn graphs store only a small fraction of the total nucleotides, while preserving the overall genome structure, enabling them to be orders of magnitude more efficient than classical de Bruijn graphs,” says Berger.
- “We can also handle sequencing data with up to 4% error rates,” adds Berger.
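
The idea of a minimizer-space de Bruijn graph can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the hash-based selection rule, the parameter names (`l`, `k`, `density`), and the function names are all assumptions chosen for illustration. Each read is reduced to an ordered list of selected short substrings (minimizers), and the graph is then built over runs of `k` consecutive minimizers instead of runs of single nucleotides.

```python
import hashlib


def lmer_hash(lmer: str) -> int:
    """Map an l-mer to a pseudo-random 64-bit integer (illustrative hash choice)."""
    digest = hashlib.sha256(lmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def select_minimizers(seq: str, l: int, density: float) -> list[str]:
    """Keep, in order, every l-mer whose hash falls below density * 2**64.

    Assumption: a simple content-based selection rule, so the same l-mer is
    selected (or skipped) in every read in which it occurs.
    """
    threshold = density * 2**64
    lmers = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return [m for m in lmers if lmer_hash(m) < threshold]


def build_mdbg(reads: list[str], l: int = 4, k: int = 3,
               density: float = 0.3) -> dict[tuple, set[tuple]]:
    """Build a de Bruijn graph whose 'letters' are minimizers.

    Nodes are tuples of k-1 consecutive minimizers; each run of k consecutive
    minimizers adds an edge from its first k-1 to its last k-1 minimizers.
    """
    edges: dict[tuple, set[tuple]] = {}
    for read in reads:
        mins = select_minimizers(read, l, density)
        for i in range(len(mins) - k + 1):
            window = tuple(mins[i:i + k])
            edges.setdefault(window[:-1], set()).add(window[1:])
    return edges


if __name__ == "__main__":
    # Hypothetical toy reads; real reads are tens of thousands of bases long.
    reads = ["ACGTACGGTCAGTTACG", "GGTCAGTTACGATTC"]
    for node, successors in build_mdbg(reads, l=3, k=2, density=0.5).items():
        print(node, "->", sorted(successors))
```

With a `density` well below 1, only a fraction of positions in each read contribute a minimizer, which is how this representation stores far fewer symbols than a nucleotide-level graph. Setting `density=1.0` keeps every l-mer and is useful only for checking the construction by hand.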