Introduction
Current reference-based methods for human genome analysis, such as GATK, rely on short-read sequencing and cover only approximately 90% of the reference genome. While accurate for single-nucleotide variants and short indels within this mappable region, they struggle with de novo assembly, structural variant discovery (including large indels and copy number variations), and resolving phasing relationships. Third-generation sequencing technologies, including linked reads and long-read technologies, offer significant advantages over short-read sequencing for genome inference. Long reads can generate highly contiguous de novo genome assemblies. Nanopore sequencing, in particular, is attractive for de novo genome assembly because it produces high yields of very long (100+ kb) reads, offering the potential to assemble even the most challenging regions of the human genome, such as centromeric satellites, acrocentric short arms, ribosomal DNA arrays, and recent segmental duplications. However, previous attempts at de novo human genome assembly using nanopore sequencing required extensive computational resources (more than 150,000 CPU hours and weeks of wall-clock time), making the approach impractical for high-throughput applications. To address this limitation, the authors developed a toolkit for rapid and cost-effective de novo assembly and polishing of nanopore sequencing data. The toolkit combines nanopore and proximity-ligation (HiC) sequencing to improve both the speed and the accuracy of human genome sequencing.
Literature Review
The existing literature highlights the limitations of short-read sequencing in resolving complex genomic regions and the potential of long-read sequencing technologies to overcome them. Studies have demonstrated the use of long-read sequencing in reference-guided methods and de novo assembly, but these often require significant computational resources. Nanopore sequencing has been used for de novo assembly; however, its computational requirements have historically been a significant barrier to widespread adoption. The high computational cost of previous efforts shows the need for efficient and scalable de novo assembly tools for long-read data. This need motivated the development of the Shasta assembler and its associated polishing tools, which aim to provide a significantly faster and more cost-effective solution.
Methodology
The study sequenced 11 low-passage human cell lines from the 1000 Genomes Project and Genome in a Bottle (GIAB) sample collections, selected to maximize allelic diversity. PromethION nanopore sequencing and HiC Illumina sequencing were performed for each genome. Three flow cells were used per genome, with nuclease flushes every 20–24 h to prevent pore blockage. Basecalling was done using Guppy v.2.3.5 with the high-accuracy flipflop model. The Shasta assembler was developed to be significantly faster than existing assemblers such as Canu, employing run-length encoding (RLE) for efficient read storage and a modified MinHash scheme to identify overlapping reads. The assembler uses a marker representation of reads, which optimizes alignment computations. Shasta produces a marker graph, which is then simplified through a series of steps (approximate transitive reduction, pruning of short side branches, and removal of bubbles and superbubbles) to generate an assembly graph from which the final sequence is assembled. The base-level accuracy of the assemblies was improved using a deep neural network-based polishing pipeline consisting of two modules: MarginPolish and HELEN. MarginPolish uses a banded forward-backward algorithm on a pairwise hidden Markov model to generate pairwise alignment statistics and a weighted POA graph. HELEN, a multi-task recurrent neural network (RNN), uses these statistics to predict nucleotide bases and run lengths. The effectiveness of MarginPolish and HELEN was compared with the state-of-the-art nanopore assembly polishing workflow (four iterations of Racon polishing followed by Medaka). HiRise was used to scaffold the polished Shasta assemblies with HiC proximity-ligation data to achieve near chromosome-scale sequences. Comparative Annotation Toolkit (CAT) and BUSCO were used to evaluate the completeness and accuracy of the gene content of the assemblies.
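The run-length encoding mentioned above can be illustrated with a minimal Python sketch. It is not Shasta's implementation; the function names `rle_encode` and `rle_decode` are hypothetical and chosen only for illustration. The idea is that each run of identical bases collapses to a single base plus a repeat count, so homopolymer-length errors, which are common in nanopore reads, are moved out of the base sequence and into the counts.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(seq: str) -> Tuple[str, List[int]]:
    """Collapse each run of identical bases into (base, run length).

    Hypothetical helper for illustration; not part of Shasta.
    """
    bases = []
    lengths = []
    for base, run in groupby(seq):
        bases.append(base)
        lengths.append(sum(1 for _ in run))
    return "".join(bases), lengths


def rle_decode(bases: str, lengths: List[int]) -> str:
    """Expand an RLE pair back into the original sequence."""
    return "".join(base * count for base, count in zip(bases, lengths))


if __name__ == "__main__":
    collapsed, counts = rle_encode("GATTTACCA")
    print(collapsed, counts)  # GATACA [1, 1, 3, 1, 2, 1]
    print(rle_decode(collapsed, counts))  # GATTTACCA
```

Two reads that differ only in the length of a homopolymer run share the same collapsed sequence, which is one plausible reason the encoding helps when comparing noisy reads.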
Finally, structural variant analysis and MHC haplotype resolution were conducted. Detailed explanations of the algorithms underlying the Shasta assembler, MarginPolish polisher, and HELEN polisher are presented in the supplementary material.
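The overlap-detection step in the Methodology relies on a modified MinHash scheme whose details are left to the supplementary material. As a rough illustration of the general idea only, the sketch below implements plain, textbook MinHash over k-mers, not Shasta's modified variant. The function names, the k-mer length of 5, and the 64 hash functions are assumptions made for this example.

```python
import hashlib
from typing import List, Set


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def _hash(kmer: str, seed: int) -> int:
    """Deterministic 64-bit hash of a k-mer, salted by the seed."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(seq: str, k: int = 5, num_hashes: int = 64) -> List[int]:
    """For each seeded hash function, keep the minimum hash over all k-mers."""
    features = kmers(seq, k)
    if not features:
        raise ValueError("sequence is shorter than k")
    return [min(_hash(f, seed) for f in features) for seed in range(num_hashes)]


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots, an estimate of k-mer set similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


if __name__ == "__main__":
    read_a = "ACGTTGCATGCCGATAGCTAGGCTA"
    read_b = "GCATGCCGATAGCTAGGCTATTGAC"  # shares a long substring with read_a
    sig_a = minhash_signature(read_a)
    sig_b = minhash_signature(read_b)
    print(estimated_jaccard(sig_a, sig_b))
```

Pairs of reads whose estimated similarity exceeds some threshold would be treated as candidate overlaps and passed on to a more expensive alignment step; the threshold itself is a tuning choice not specified here.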
Key Findings
The study sequenced 11 human genomes with nanopore technology in just 9 days, generating 2.3 terabases of sequence data, and assembled each genome de novo. The average throughput per flow cell was 69 gigabases. The Shasta assembler significantly outperformed existing assemblers in speed and cost, producing a complete haploid human genome assembly in under 6 hours on a single commercial compute node at a cost of approximately US$70. The MarginPolish and HELEN polishing pipeline improved the base-level accuracy of the assemblies to more than 99.9% identity (QV = 30) using nanopore reads alone. Combining Shasta with HiC scaffolding resulted in near chromosome-level scaffolds for all 11 genomes. Comparisons with other assemblers (Wtdbg2, Flye, and Canu) showed that Shasta produced the most base-level accurate assemblies, with fewer disagreements and a higher ratio of mapped assembly sequence. Shasta assemblies, when polished with MarginPolish and HELEN, contained nearly all human protein-coding genes. The trio-binning approach proved useful for haplotype assembly, successfully assembling MHC haplogroups. The pipeline also identified structural variants successfully, demonstrating its ability to characterize structural variation. The MarginPolish and HELEN polishing pipeline outperformed the Racon/Medaka pipeline in both accuracy and cost. Finally, a comparison with a PacBio HiFi assembly showed that the Shasta assembly had a lower disagreement count and comparable NGAx, despite a slightly lower NG50.
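Two of the metrics quoted above can be made concrete with a short Python sketch. It uses the standard Phred-scale definition of QV and the usual definition of NG50 (the length of the contig at which the cumulative length of contigs, sorted longest first, reaches half of the assumed genome size). The function names and the toy contig lengths are illustrative assumptions, not values from the study.

```python
import math
from typing import Iterable


def identity_to_qv(identity: float) -> float:
    """Convert sequence identity (e.g. 0.999) to a Phred-scaled quality value."""
    error = 1.0 - identity
    if not 0.0 < error <= 1.0:
        raise ValueError("identity must be in [0, 1)")
    return -10.0 * math.log10(error)


def ng50(contig_lengths: Iterable[int], genome_size: int) -> int:
    """Contig length at which the cumulative length reaches half the genome size.

    Returns 0 if the contigs together cover less than half of genome_size.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


if __name__ == "__main__":
    # 99.9% identity corresponds to roughly QV 30 (one error per 1,000 bases).
    print(round(identity_to_qv(0.999), 2))
    # Toy example: contigs of 40, 30, 20, and 10 units in a genome of 100 units.
    print(ng50([40, 30, 20, 10], genome_size=100))  # 30
```

Unlike N50, NG50 is normalized by the genome size and not by the total assembly length, so assemblies of different total size can be compared on the same footing.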
Discussion
The results demonstrate the significant advances in speed and cost-effectiveness achieved by the newly developed toolkit. The ability to assemble a complete human genome de novo in under 6 hours and for approximately US$70 represents a substantial improvement over existing methods. The base-level accuracy of the Shasta assemblies, especially when combined with the MarginPolish and HELEN polishing pipeline, highlights the potential of this approach for various genomic applications. The findings support the use of nanopore sequencing as a powerful and efficient method for human genome sequencing, especially for de novo assemblies. The successful assembly of the MHC region and the accurate identification of structural variants underscore the robustness and broad applicability of this method. The integration of phasing into the assembly algorithm is a potential direction for future research, and the framework presented in this study provides a foundation for development in this area. Future improvements in basecalling and further development of the polishing algorithms could lead to even higher-quality genome assemblies.
Conclusion
This research presented Shasta, a novel de novo long-read assembler, and its associated polishing pipeline (MarginPolish and HELEN), which enable efficient and cost-effective de novo assembly of human genomes. The toolkit significantly improved the speed and accuracy of human genome assembly compared to existing methods. Future work could focus on integrating phasing into the assembly algorithm and further optimizing the polishing algorithms for even higher accuracy. This improved efficiency has considerable implications for high-throughput human genome sequencing and research.
Limitations
While the study achieved impressive results, it is important to acknowledge some limitations. The sample selection, although designed to maximize allelic diversity, might not fully capture the entire spectrum of human genetic variation. Further studies with a broader range of samples are needed for complete validation. Although Shasta produces high-quality assemblies, it is relatively conservative compared to competitors, prioritizing correctness over contiguity. The deep learning model used for polishing (HELEN) was trained on specific basecaller data and may require retraining for optimal performance with newer basecallers. Additionally, the accuracy of the structural variant calling remains dependent on the quality of the reference genome. Future improvements could also address the limitations of haplotype assembly in low-coverage regions.