Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes
K. Shafin, T. Pesout, et al.
Short-read, reference-based methods cover only about 90% of the human reference genome and are poorly suited to de novo assembly, structural variation discovery and phasing. Third-generation sequencing, particularly long reads such as those from Oxford Nanopore Technologies (ONT), can overcome these limitations and produce highly contiguous assemblies. However, prior nanopore-based human genome assemblies required extensive resources (>150,000 CPU hours and weeks of elapsed time), limiting throughput. The study asks whether a new toolkit can enable fast, cost-effective, accurate de novo assembly of human genomes from nanopore reads. To that end, the authors develop and benchmark Shasta, a rapid long-read assembler, together with the MarginPolish and HELEN polishers, and demonstrate near chromosome-level scaffolding with HiC data. Making large-scale human de novo assembly practical would improve analysis of complex genomic regions and structural variation.
Reference-based variant calling from short reads is accurate for small variants in mappable regions but inadequate for de novo assembly and for resolving structural variation. Long-read and linked-read technologies improve contiguity and structural variant detection and have enabled high-quality de novo assemblies. ONT nanopore sequencing provides very long reads (>100 kb) that promise to resolve difficult genomic regions (centromeres, acrocentric arms, rDNA arrays, recent segmental duplications). An earlier nanopore human genome assembly showed this was possible but was computationally prohibitive (53 MinION flow cells, >150,000 CPU hours, weeks of elapsed time). These constraints motivate faster assembly algorithms tailored to nanopore error profiles and efficient polishing pipelines.
Samples and sequencing: 11 low-passage human lymphoblastoid cell lines (offspring from parent-child trios, 1000 Genomes and GIAB collections) were selected to maximize allelic diversity. High molecular weight DNA was extracted (Qiagen Puregene, modified protocol) and size-selected (Circulomics Short-Read Eliminator). Nanopore libraries (SQK-LSK109) were prepared from 1 µg of input DNA. Each PromethION flow cell received two nuclease flushes (one every ~24 h) and was loaded with three sequential libraries of the same sample. Basecalling used Guppy v2.3.5 (high-accuracy flipflop model). HiC libraries were generated (DpnII digestion, biotin labeling, ligation, crosslink reversal), sonicated to ~400 bp, and sequenced on Illumina HiSeq X for scaffolding.
Sequencing runs: Up to 15 flow cells in parallel, 3 per genome. Across 11 genomes in 9 days: total 2.3 Tb, average 69 Gb per flow cell, per-genome coverage 48×–85×; read N50 28–51 kb; median/modal alignment identity to GRCh38 of 90%/93%; per genome ~55× coverage from >10-kb reads and ~6.5× from >100-kb reads.
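The read N50 quoted above follows the usual definition: the length L such that reads of length L or longer contain at least half of all sequenced bases. A minimal Python sketch of that calculation is given here; the function name `n50`, its `total` parameter and the toy read lengths are illustrative assumptions, not part of the study's tooling. Passing an estimated genome size as `total` gives the NG50 variant used later for contigs and scaffolds.

```python
def n50(lengths, total=None):
    """Return the length L such that sequences of length >= L
    together cover at least half of `total`.

    With total=None the sum of `lengths` is used (N50); passing an
    estimated genome size instead gives NG50.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    target = (sum(lengths) if total is None else total) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    # The sequences cover less than half of `total`; report 0 by convention.
    return 0


# Toy read lengths for illustration only (not data from the study).
reads = [2, 3, 4, 5, 6, 10]
print(n50(reads))            # 6: reads of length 10 and 6 reach 16 of 30 bases
print(n50(reads, total=40))  # 5: 10 + 6 + 5 = 21 reaches half of 40
```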
Assembler (Shasta): Designed for speed and cost efficiency on nanopore reads. Key innovations: run-length encoding (RLE) of reads to reduce homopolymer length error impact; marker representation (subset of k-mers in RLE) and alignment in marker space; modified MinHash (LowHash) on consecutive marker features (m=4) to find candidate overlaps while mitigating length bias; efficient banded marker-space alignments with heuristics (limits on maxSkip and maxTrim; filtering frequent markers). Graph construction: read graph (k-nearest neighbor by best alignments), marker graph (vertices are aligned markers; coverage filters), assembly graph (compaction of linear marker-edge paths). Graph simplification: approximate transitive reduction (bounded BFS), pruning of short leaves, and iterative bubble/superbubble removal with increasing length thresholds. Consensus sequence assembly from marker graph: trivial consensus at vertices; edge consensus via overlap agreement or constrained multiple sequence alignment (spoa) when necessary; final non-RLE sequence obtained by estimating repeat counts per position (mode across contributing reads). Implementation runs on a single large-memory node (1–2 TB RAM) with memory-mapped data structures and high parallel efficiency.
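The run-length ideas in this paragraph can be illustrated with a small Python sketch. It is a toy under stated assumptions, not Shasta's implementation: the function names, the hand-picked marker set and the example reads are hypothetical, and real reads would first need to be aligned in marker space before their run lengths could be compared position by position.

```python
from collections import Counter
from itertools import groupby


def rle_encode(seq):
    """Split a sequence into (collapsed bases, run lengths)."""
    bases, counts = [], []
    for base, run in groupby(seq):
        bases.append(base)
        counts.append(len(list(run)))
    return "".join(bases), counts


def rle_decode(bases, counts):
    """Expand collapsed bases back to a plain sequence."""
    return "".join(b * c for b, c in zip(bases, counts))


def markers(rle_bases, k, marker_kmers):
    """Return (position, k-mer) for each k-mer of the RLE sequence
    that belongs to the chosen marker subset."""
    found = []
    for i in range(len(rle_bases) - k + 1):
        kmer = rle_bases[i:i + k]
        if kmer in marker_kmers:
            found.append((i, kmer))
    return found


def consensus_counts(count_rows):
    """Per-position mode of run lengths across reads that share the
    same RLE bases (assumed already aligned column by column)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*count_rows)]


# Three toy reads that differ only in homopolymer lengths.
reads = ["GATTTACA", "GATTACA", "GATTTACCA"]
encoded = [rle_encode(r) for r in reads]
bases = encoded[0][0]                    # "GATACA" for all three reads
counts = consensus_counts([c for _, c in encoded])
print(rle_decode(bases, counts))         # GATTTACA
print(markers(bases, 3, {"ATA", "ACA"}))  # [(1, 'ATA'), (3, 'ACA')]
```

In this toy, all three reads collapse to the same RLE string even though their homopolymer lengths disagree, which is why alignment in RLE space is insensitive to the dominant nanopore error mode.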
Polishing pipeline: MarginPolish (pair-HMM forward-backward, banded on input alignments) computes posterior match/indel probabilities to build a weighted RLE partial order alignment (POA), prunes low-weight edits, locally enumerates haplotype alternatives within subgraphs, updates assembly with higher-likelihood sequences, and decodes run lengths using a Bayesian model trained per base on run-length confusion (RLE). HELEN uses a multi-task recurrent neural network (bidirectional GRUs) consuming image-like summaries from MarginPolish’s POA (92 features per position, run-length block encoding) to jointly predict base and run length; windowed sliding inference and stitching yield polished contigs. Polishing down-samples to 50× per chunk and parallelizes over 1-kb chunks with 50-bp overlaps.
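The chunked parallelization described above can be sketched as follows. This is a hypothetical illustration, not MarginPolish code: the function name `chunk_bounds` is made up, and it assumes that the 1-kb figure is the full chunk length and that consecutive chunks share exactly 50 bases, which the summary does not spell out.

```python
def chunk_bounds(contig_length, chunk_size=1000, overlap=50):
    """Yield half-open (start, end) intervals covering a contig.

    Consecutive intervals share `overlap` bases so that independently
    polished chunks can be stitched back together.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if contig_length <= 0:
        return
    step = chunk_size - overlap
    start = 0
    while True:
        end = min(start + chunk_size, contig_length)
        yield start, end
        if end == contig_length:
            break
        start += step


# A hypothetical 2,500-bp contig splits into three overlapping chunks.
print(list(chunk_bounds(2500)))  # [(0, 1000), (950, 1950), (1900, 2500)]
```

Because each chunk depends only on the reads aligned to it, the chunks can be polished independently and in any order, which is what makes the step easy to parallelize.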
Scaffolding: HiC proximity-ligation reads were used with HiRise to scaffold polished assemblies to near chromosome scale; misjoins were assessed by HiC signal (no contig breaks were required across the 11 assemblies).
Benchmarking and evaluations: Assemblers compared (Shasta, wtdbg2, Flye, Canu) on HG00733, HG002 (diploid), and CHM13 (haploid). QUAST-LG for disagreements and NGAx; BAC clone analysis targeted to segmental duplications; base-level error rates via Pomoxis assess_assembly against high-quality truths (GIAB/GRCh38-based for HG002, phased PacBio assembly for HG00733, T2T CHM13 chrX); MHC assembly continuity and haplotype analyses including trio-binning; gene completeness via Comparative Annotation Toolkit (GENCODE v30) and BUSCO; runtime and cost measured on AWS (Canu on NIH Biowulf with estimated cost). Short-read polishing test: Pilon with 10x Chromium reads on CHM13 chrX after MarginPolish+HELEN. Comparison with PacBio HiFi (Canu) CHM13 assembly.
Trio-binning: Parental k-mer sets (meryl) and read partitioning for HG00733 and HG002 to generate maternal/paternal read bins and assemblies for haplotype-specific assessments (e.g., MHC).
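The read partitioning step can be sketched in Python. This is a simplified illustration, not the study's pipeline: plain Python sets stand in for the meryl k-mer databases, the function names and toy sequences are hypothetical, reverse complements are ignored, and the raw-count comparison omits the normalization that a production trio-binning tool would apply.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    m = len(read_kmers & maternal_only)
    p = len(read_kmers & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unclassified"


k = 4
mother = kmers("ACGTACGTTT", k)   # toy parental sequences
father = kmers("ACGTACGAAA", k)
maternal_only = mother - father   # k-mers seen only in the mother
paternal_only = father - mother   # k-mers seen only in the father

for read in ["TACGTTT", "TACGAAA", "GTACG"]:
    print(read, bin_read(read, maternal_only, paternal_only, k))
# TACGTTT maternal
# TACGAAA paternal
# GTACG unclassified
```

Reads that carry no parent-specific k-mers stay unclassified, and each haplotype bin receives only part of the total coverage.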
- Throughput: 2.3 Tb nanopore data across 11 genomes in 9 days using PromethION; per genome 48×–85× coverage (avg ~63×); read N50 28–51 kb; ~55× coverage from reads >10 kb and ~6.5× from reads >100 kb.
- Speed and cost: Shasta assemblies completed in an average of 5.25 h per genome on AWS at ~$70; wtdbg2 was 7.5× slower and 3.7× more expensive; Flye was 11.9× slower and 9.9× more expensive. Canu required ~4–5 days and an estimated ~$18–19k compute cost per genome on a cluster.
- Contiguity: Before polishing, contig NG50 (Mb): Canu (HG00733 40.6; HG002 32.3; CHM13 79.5), Flye (25.2, 25.9, 35.3), Shasta (21.1, 20.2, 41.1), wtdbg2 (15.3, 13.7, 14.0). Shasta was more conservative but produced high mapped fraction and lowest unaligned sequence (on average only 0.38% unaligned vs ~1.6% others).
- Correctness: After discounting SV-enriched regions, Shasta had 1.2–2× fewer disagreements than Canu and wtdbg2 and a number comparable to Flye; on CHM13 chrX vs the T2T assembly, Shasta had 2–17× fewer disagreements than the other assemblers. Base-level accuracy (unpolished balanced error rate): Shasta diploid 0.98%, haploid 0.54%, outperforming the others.
- Polishing performance: On HG00733 (diploid), MarginPolish+HELEN achieved 0.388% balanced error (QV 24.12) vs Racon×4+Medaka 0.455% (QV 23.42). On CHM13 chrX (haploid), 0.064% (QV 31.92) vs 0.110% (QV 29.59). Errors dominated by indels, particularly homopolymers. Polishing runtime/cost: MarginPolish+HELEN ~$107 and ~29 h vs Racon+Medaka ~$621 and ~142 h per sample. Applying Pilon with 10x reads after MarginPolish+HELEN on CHM13 chrX improved QV from ~32 to ~36.
- MHC assemblies: CHM13 MHC spanned by a single contig in all assemblers; for HG00733 diploid, Shasta and Flye used two contigs; trio-binned maternal HG00733 spanned MHC in a single contig matching GL000255.1; paternal had breaks due to lower coverage.
- Gene content: CAT annotation showed nearly all coding genes present in polished Shasta assemblies: HG00733 99.23% genes found (152 missing) with 68.07% complete; CHM13 99.11% (175 missing) with 74.20% complete. Completeness was comparable to or better than other assemblers after HELEN; wtdbg2 had more missing genes (506) on HG00733.
- Scaffolding: HiC scaffolding (HiRise) introduced on average 891 joins per assembly, achieving near chromosome-scale scaffolds with median scaffold NG50 ~129.96 Mb; no contig breaks required by HiC misjoin detection; dot plots showed chromosome-scale colinearity (e.g., HG00733 vs GRCh38).
- SV extraction: From Shasta assembly graph (HG002), putative heterozygous SVs showed expected retrotransposon length spikes; F1 = 0.68 vs GIAB benchmark in high-confidence regions.
- Comparison to PacBio HiFi (CHM13, Canu): Shasta NG50 41.0 Mb vs HiFi 29.0 Mb; disagreements to GRCh38 lower for Shasta (1073 vs 8469; intersection 380 vs 594); NGAx similar (~20 Mb). HiFi assembly had higher base QV (41) than nanopore+Shasta polished (QV ~32).
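The QV figures in the results above relate to error rates through the standard Phred scaling, QV = -10 log10(error rate). A short Python sketch of the conversion follows; the function names are illustrative, and the assumption is that the summary's QVs use this standard definition. Small differences in the second decimal are expected because the listed error rates are rounded.

```python
import math


def qv(error_rate):
    """Phred-scaled quality value for a per-base error rate in (0, 1]."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in (0, 1]")
    return -10 * math.log10(error_rate)


def error_rate(qv_value):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv_value / 10)


# Balanced error rates quoted in the polishing results (as fractions).
for label, rate in [("HG00733 MarginPolish+HELEN", 0.00388),
                    ("HG00733 Racon x4 + Medaka", 0.00455),
                    ("CHM13 chrX Racon x4 + Medaka", 0.00110)]:
    print(f"{label}: QV {qv(rate):.2f}")

# QV 40 corresponds to one error per 10,000 bases.
print(error_rate(40))
```

On this scale every 10 QV units is a tenfold change in error rate, so moving from QV ~32 to QV 40 means cutting residual errors by roughly a factor of six.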
The toolkit demonstrates that nanopore-based de novo human genome assembly can be performed rapidly and inexpensively while maintaining high correctness. Shasta’s conservative assembly strategy yields fewer misassemblies and minimal unaligned sequence, trading some contiguity for correctness. MarginPolish and HELEN substantially improve base accuracy using only nanopore reads, achieving QV ≥30 on haploid assemblies and outperforming Racon+Medaka in both accuracy and efficiency. HiC scaffolding is effective and necessary to reach near chromosome-scale assemblies across diverse human samples without introducing detectable misjoins. Trio-binning enables haplotype-resolved assembly of complex loci like MHC, though reduced effective coverage can decrease contiguity and increase errors relative to chimeric diploid assemblies; higher coverage or merging strategies could mitigate this. Compared to PacBio HiFi, nanopore+Shasta delivers higher contiguity and fewer reference disagreements but lower base-level QV, suggesting complementary polishing with HiFi or 10x reads can achieve QV 40+. Overall, the results address the initial goal by reducing computational time and cost by orders of magnitude, enabling broader application of de novo assembly and improved access to complex genomic regions and structural variants.
This work introduces Shasta, a fast, memory-resident long-read assembler for nanopore reads built around run-length-encoded, marker-space alignment and graph simplification, and a polishing pipeline (MarginPolish and HELEN) that leverages pair-HMM posteriors and a multi-task RNN to correct base and run-length errors. Using one PromethION and HiC scaffolding, the team sequenced 11 human genomes in 9 days and assembled them into near chromosome-scale scaffolds, with high base accuracy from nanopore-only polishing. Shasta is significantly faster and cheaper than contemporary assemblers while yielding fewer misassemblies and higher base accuracy. The complete pipeline can produce a polished assembly in ~24 h for ~$180, and further parallelization could reduce turnaround toward a half-day. Future directions include integrating phasing into assembly to generate fully phased diploid assemblies, improving homopolymer error modeling, and incorporating complementary data types (e.g., PacBio HiFi, 10x) to reach QV 40+ across whole genomes. Real-time basecalling and streamlined computation could enable DNA-to-de novo assembly in under 96 hours, facilitating rapid genomic screening of difficult regions.
- Shasta’s conservative approach yields slightly lower contiguity than Canu and fewer assembled bases in duplicated regions; many BACs in duplicated regions were not contiguously assembled compared to Canu/Flye.
- Base errors remain dominated by indels, especially homopolymer run-length errors; despite RLE, residual errors and numerical/encoding artifacts can affect very long homopolymers.
- Trio-binned assemblies suffer reduced contiguity and higher base error due to halved coverage and unclassified reads; higher coverage is needed for optimal haplotype assemblies.
- The workflow requires large-memory compute nodes (1–2 TB RAM) for optimal Shasta performance, which may constrain some environments.
- Nanopore-polished assemblies have lower base QV than PacBio HiFi-based assemblies; additional polishing with complementary data is beneficial to reach QV ≥40.
- Chromosome-scale continuity still relies on additional HiC scaffolding; modeling suggests that ~30× coverage in ultra-long reads (>100 kb) would be needed for fully contiguous chromosome assemblies.