Strainberry: automated strain separation in low-complexity metagenomes using long reads

R. Vicedomini, C. Quince, et al.

Strainberry is a metagenome assembly pipeline developed by Riccardo Vicedomini, Christopher Quince, Aaron E. Darling, and Rayan Chikhi. It improves strain separation in low-complexity microbial communities, with results reported on both mock and real datasets.

Introduction

Metagenome sequencing enables reconstruction of microbial community genomes, but strain-level variation within species complicates assembly and downstream functional interpretation. Distinct strains can exhibit clinically and ecologically important phenotypic differences, yet most de novo metagenomic assemblers collapse conspecific strains into consensus sequences. The authors define strains operationally as haplotypes (contiguous nucleotide sequences co-observed at sufficient abundance) and formalize the strain separation problem as reconstructing partial or complete strain-specific sequences. Existing short-read approaches and reference-based profilers either reconstruct only core genes, require multiple samples, or do not produce assembled strain genomes. Long-read assemblers improve species-level contiguity but generally remain strain-oblivious. The study’s objective is to develop and evaluate a long-read-only, single-sample pipeline to perform automated strain separation in low-complexity metagenomes, improving strain-resolved assemblies over state-of-the-art long-read assemblers.

Literature Review

Prior de novo short-read methods include DESMAN (core gene haplotypes), STRONG (strain inference on SCGs via variational Bayesian methods), ConStrains (SNP-based strain models using marker genes), LSA (eigengenome partitioning across samples), OPERA-MS (hybrid assembly dependent on short-read pre-assembly), SAVAGE and VG-flow (viral quasispecies haplotypes). Reference-based tools such as StrainPhlAn and StrainEst profile strain-level variation by mapping to marker databases. These methods rely on short reads and high-fidelity k-mer detection or multi-sample designs and are not directly applicable to long-read-only data. Among long-read assemblers, metaFlye explicitly supports metagenomes and constructs repeat graphs; Canu, though not metagenome-specific, is often used; Lathe is a Nanopore-focused workflow combining assembly and extensive polishing. These can partially separate strains but often collapse closely related haplotypes or introduce duplications. A method is needed that leverages long-read variant calling and haplotype phasing to deliver de novo strain-resolved assemblies from single samples.

Methodology

Strainberry is an automated pipeline that takes as input a strain-oblivious long-read metagenome assembly (e.g., from Flye or Canu) and the corresponding long reads aligned to it. The pipeline iterates three main steps: (i) haplotype phasing and read separation, (ii) haplotype assembly, and (iii) strain-aware scaffolding, repeating for up to n−1 iterations where n is the detected number of conspecific strains, stopping when separation quality no longer improves.
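
The iteration control described above can be sketched as a small loop. This is only an illustration under stated assumptions, not Strainberry's actual code: `separate_once` and `quality` are hypothetical callbacks standing in for one round of phasing, assembly, and scaffolding and for a separation-quality score where higher is better.

```python
from typing import Callable, TypeVar

A = TypeVar("A")  # whatever object represents an assembly in this sketch


def iterate_separation(
    backbone: A,
    n_strains: int,
    separate_once: Callable[[A], A],  # hypothetical: one phasing/assembly/scaffolding round
    quality: Callable[[A], float],    # hypothetical: higher means better separation
) -> A:
    """Run at most n_strains - 1 rounds, keeping a round only if quality improves."""
    best, best_q = backbone, quality(backbone)
    for _ in range(max(n_strains - 1, 0)):
        candidate = separate_once(best)
        cand_q = quality(candidate)
        if cand_q <= best_q:
            break  # no improvement: keep the previous assembly and stop
        best, best_q = candidate, cand_q
    return best
```

With n detected strains the loop runs at most n−1 times, since each diploid round can split a collapsed sequence into at most two haplotypes.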

  1. Haplotype phasing and read separation: Long reads are aligned to the input assembly (minimap2). Single-nucleotide variants (SNVs) are called on contigs; regions with high SNV density indicate collapsed haplotypes. The diploid long-read SNV caller Longshot is used iteratively to call variants and tag reads by haplotype; HapCUT2 performs phasing to produce phasesets (intervals spanning phased SNVs). Although polyploid separation was explored (freebayes + WhatsHap Polyphase), the iterative diploid approach provided robust performance and runtime.
  2. Haplotype assembly: For each phaseset PS = (contig i, start s, end e), reads assigned to each haplotype are assembled independently using wtdbg2 with tuned parameters for low-coverage, short-interval assemblies (e.g., -e 5, -l 1000, -L 3000, -S 1, -R). Phasesets with SNV density below 0.1% are discarded to avoid spurious splitting of highly similar regions (>99.9% identity). Assembled haplotype contigs are aligned back to their backbone contig and trimmed to the phaseset interval [s,e]; unphased backbone segments >500 bp are retained. The result is a set of strain-aware contigs comprising haplotype-specific and shared (core) regions.
  3. Strain-aware scaffolding: Original reads are realigned to strain-aware contigs (minimap2 base-level alignment). A bi-directed scaffolding graph is built from unique, high-quality dovetail alignments that bridge contigs, constrained by mappings to the backbone assembly: edges are allowed between consecutive backbone-mapped contigs, adjacent haplotype contigs within a backbone contig, read-linked core–haplotype pairs, and contigs at backbone extremities. The graph is simplified by removing transitive edges and weak edges (supported by <10 reads or <90% of total outgoing support). Maximal unambiguous paths yield scaffolds; gap sizes are set by median read-supported distances (contigs are joined directly when the distance is negative).

Iterative improvement criterion: After each iteration, reads are mapped to the new scaffolds and a tentative phasing is evaluated. A new iteration proceeds only if the average Hamming rate (mismatch ratio over shared SNV sites between reads and closest haplotype) improves by at least 1% globally; assembly and scaffolding are applied only to sequences with local improvement.

Evaluation design: Datasets include two PacBio mock communities (Mock3, Mock9), 24 simulated/downsampled communities varying coverage, divergence, number of strains, and recombination rates, and two real datasets: NWC2 (whey starter culture; PacBio and Nanopore) with curated multi-technology references, and a Nanopore human stool microbiome (HSM) with a polished Lathe reference. Assemblies were generated with Flye (meta mode) and Canu where applicable; reads were aligned with minimap2; downsampling was applied to control coverage. Assemblies were assessed against references using MUMmer/dnadiff-derived metrics (NG50/N50, identity, duplication ratio, misassemblies), and for real data also with CheckM (completeness, contamination, strain heterogeneity). Runtime and memory were recorded on a shared compute node.

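
The numeric rules in steps 2 and 3 and the per-read Hamming rate can be illustrated with a few helper functions. This is a minimal sketch rather than the pipeline's implementation. All function names and data layouts are hypothetical, the phaseset interval is assumed to be inclusive, and an edge is assumed to be weak when its read support falls below 90% of the total support leaving its source contig.

```python
from statistics import median
from typing import Dict, List, Optional

MIN_SNV_DENSITY = 0.001    # 0.1% of positions in the phaseset
MIN_EDGE_READS = 10        # minimum reads supporting a scaffolding edge
MIN_EDGE_FRACTION = 0.90   # minimum share of total outgoing support


def keep_phaseset(n_snvs: int, start: int, end: int) -> bool:
    """Keep a phaseset only if its SNV density reaches 0.1% (inclusive interval assumed)."""
    length = end - start + 1
    return length > 0 and n_snvs / length >= MIN_SNV_DENSITY


def strong_edges(outgoing: Dict[str, int]) -> Dict[str, int]:
    """Drop edges with <10 supporting reads or <90% of the total outgoing support."""
    total = sum(outgoing.values())
    return {
        target: reads
        for target, reads in outgoing.items()
        if reads >= MIN_EDGE_READS and reads >= MIN_EDGE_FRACTION * total
    }


def gap_size(read_distances: List[int]) -> int:
    """Median read-supported distance; a negative median means a direct join (gap 0)."""
    return max(0, int(median(read_distances)))


def hamming_rate(read: Dict[int, str], haplotypes: List[Dict[int, str]]) -> Optional[float]:
    """Mismatch ratio over shared SNV sites against the closest haplotype.

    `read` and each haplotype map SNV position -> allele. Returns None when the
    read shares no SNV site with any haplotype.
    """
    rates = []
    for hap in haplotypes:
        shared = read.keys() & hap.keys()
        if shared:
            mismatches = sum(read[pos] != hap[pos] for pos in shared)
            rates.append(mismatches / len(shared))
    return min(rates) if rates else None
```

Averaging `hamming_rate` over all reads gives one quantity that could be compared between iterations. Whether the 1% improvement is meant as an absolute or a relative change is not specified here, so the comparison itself is left out.
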
Key Findings

Mock3 (B. cereus; E. coli K-12 and W): Flye collapsed the two E. coli strains (98.65% inter-strain identity). Canu partially separated them but missed 46% (K-12) and 11% (W) of the reference and showed high duplication (1.23 and 1.39). Strainberry separated both strains from either Flye or Canu backbones, achieving near-complete reference coverage and 99.97% ANI for both E. coli strains (with Flye), total length ~9.8 Mbp, and reduced duplication compared to Canu. Misassemblies on Flye vs Strainberry (on Flye backbone): Flye 2 inversions, 3 relocations; Strainberry 2 inversions, 6 relocations, reflecting assembly of both strains rather than major rearrangements.

Mock9 (9 genomes; two S. aureus strains; E. coli K-12, E. coli W, and S. sonnei similar): Flye produced a good strain-oblivious assembly, separating K. pneumoniae but collapsing others. Strainberry correctly separated the two S. aureus strains to near-complete coverage with 99.97% ANI and duplication ratio ~1.0; scaffolding improved NG50 to ~2.2 Mbp and ~2.7 Mbp. K. pneumoniae ANI improved slightly to 99.99% (from 99.95%). For E. coli K-12, E. coli W, and S. sonnei, Strainberry increased reference coverage and ANI with duplication ratios 1.08, 1.18, and 1.25, respectively. Canu alone showed high duplication and lower identity; Strainberry separation of Canu improved metrics but inherited duplication.

Coverage, divergence, multiplicity, recombination (simulated/downsampled): Quality plateaus after ~30x per-strain coverage; 10x is mediocre; near-complete (>95% coverage) and high identity (>99.8%) generally require ≥20x. Strains with divergence <~0.39% are difficult to separate; substantial improvement at 0.39% and near-complete at 0.50% divergence, with separated ANI >99.8%. Up to 5 conspecific strains can be separated, with performance decreasing at higher multiplicity (average coverage drops from ~95% at 3 strains to ~75% at 5 strains; ANI from ~99.9% to ~99.6%). Higher historical recombination rates slightly reduce separated ANI; Strainberry still separates strains where Flye remains strain-oblivious.

NWC2 (whey starter; PacBio and Nanopore): PacBio Flye (no polishing) missed 23.03% of L. helveticus NWC_2_3 and 19.99% of NWC_2_4; Strainberry scaffolds covered both almost fully (missing 9.87% and 0.64%). Canu covered both strains; Strainberry further reduced unaligned bases for NWC_2_3 from 7.65% to 5.26%. L. delbrueckii regions showed duplicated coverage across assemblies, suggesting an additional conspecific strain. Strainberry+Flye added ~20% genomic material for L. delbrueckii and L. helveticus, maintaining comparable or improved identity and fewer major misassemblies. CheckM showed higher completeness for L. helveticus with Strainberry+Flye; higher contamination was likely due to duplicated or closely related sequences. With Nanopore, trends were similar; sequence identity was lower (98.78–99.76% ONT vs 99.69–99.97% PacBio), but longer reads yielded 25–43% additional genomic material on L. delbrueckii and L. helveticus; polishing (Racon/Medaka/MarginPolish) improved identity.

HSM (human stool; Nanopore; Lathe reference): Strainberry increased total assembly length by 19% and duplication ratio by +0.365 over Lathe, indicating separated strain sequences; Flye also showed a slight duplication increase and more unaligned sequence (Flye 34.6 Mbp unaligned vs Strainberry 2.4 Mbp). Genome fraction relative to Lathe was ~90% (Flye) vs ~86% (Strainberry), consistent with lower coverage of separated sets and differing strategies. Two near-complete, low-contamination bins (Veillonella atypica and Eubacterium eligens) showed halved coverage after separation, indicating two strains; fragmentation increased post-separation. After Racon×4 + Medaka polishing, Strainberry achieved comparable completeness but notably higher contamination and strain heterogeneity, consistent with strain duplication. The V. atypica bin gained 86% additional genomic material and separated into two ~2 Mbp strain-level classifications (closest to V. atypica strains ACS-134-V-Col7a and ACS-049-V-Sch6); raw reads supported both strains (2,833 and 2,504 reads assigned at strain level). The E. eligens bin gained 118% additional genomic material. Several additional bins increased in size and exhibited finer strain-level classification.

Performance: Strainberry used <10 GB RAM and less wall time than Flye; on HSM, Flye took 6.5 h and 132 GB, Strainberry 4 h and 9.5 GB (12 threads). Canu required 2×–30× Flye’s time.
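
For readers unfamiliar with the contiguity figures quoted above, NG50 is the length of the contig at which the length-sorted contigs first cover half of the reference genome, while N50 uses half of the assembly's own total length instead. The following is a minimal sketch of the standard definitions, not the evaluation scripts used in the study; returning 0 when the assembly is too short is a convention chosen here for illustration.

```python
from typing import Iterable


def ng50(contig_lengths: Iterable[int], reference_size: int) -> int:
    """Length of the contig at which sorted contigs first cover half the reference."""
    half = reference_size / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0  # assembly covers less than half the reference (convention assumed here)


def n50(contig_lengths: Iterable[int]) -> int:
    """Same statistic, measured against the assembly's own total length."""
    lengths = list(contig_lengths)
    return ng50(lengths, sum(lengths))
```

Because NG50 is tied to the reference size, it stays comparable between assemblies whose total lengths differ, which matters here since strain separation adds sequence to the assembly.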

Discussion

Strainberry addresses the challenge of resolving conspecific strains in metagenomes using only long reads from a single sample. By leveraging long-read SNV calling, haplotype phasing, targeted assembly of phased read sets, and constrained scaffolding, the pipeline converts strain-oblivious assemblies into strain-aware drafts with high identity and reference coverage in low-complexity communities. Results on mock communities demonstrate accurate separation (often ~99.97% ANI) where conventional long-read assemblers collapse haplotypes, and real datasets show recovery of substantial additional strain-specific genomic material and refinement of strain diversity missed by polished metagenome references. The findings highlight trade-offs: PacBio data yield higher sequence identity, while Nanopore’s longer reads improve contiguity and enable recovery of additional material but require polishing to mitigate indels. Coverage and divergence govern separability: ≥20x per-strain coverage and ≥0.5% divergence support near-complete, high-identity reconstructions; separation may still occur at ≥10x but with fragmentation. The observed increases in duplication ratio and contamination/heterogeneity metrics post-separation reflect true multi-strain resolution, not assembly artifacts, underscoring the need to interpret evaluation metrics in a strain-aware context.
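
The coverage and divergence thresholds reported in this summary can be collected into a small lookup. This is a rough rule-of-thumb sketch only: the function name and outcome labels are hypothetical, the cut-offs are the approximate values quoted in the text (10x and 20x per-strain coverage, 0.39% and 0.5% divergence), and it is not a predictive model from the study.

```python
def expected_separation(per_strain_coverage: float, divergence_pct: float) -> str:
    """Map per-strain coverage (x) and inter-strain divergence (%) to a coarse outcome."""
    if per_strain_coverage < 10 or divergence_pct < 0.39:
        return "unlikely"        # below the ranges where separation was reported
    if per_strain_coverage >= 20 and divergence_pct >= 0.5:
        return "near-complete"   # >=20x and >=0.5% divergence
    return "partial"             # separation may occur, with fragmentation


if __name__ == "__main__":
    for cov, div in [(30, 1.35), (12, 0.6), (40, 0.2)]:
        print(cov, div, expected_separation(cov, div))
```

Real outcomes also depend on the number of conspecific strains, read length, and recombination, none of which this lookup considers.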

Conclusion

The study introduces Strainberry, a long-read-only, single-sample pipeline that automates strain separation and scaffolding to produce strain-aware metagenome assemblies in low-complexity settings. It surpasses strain-oblivious long-read assemblers by delivering near-complete, high-identity reconstructions for multiple strains, validated on mock datasets and real microbiomes. Practical guidance emerges: PacBio reads favor higher identity; Nanopore reads favor contiguity and added recovery, with polishing recommended. Effective separation generally requires ≥20x per-strain coverage and benefits from ≥0.5% inter-strain divergence; the method can handle up to five conspecific strains with some performance degradation at higher multiplicity. Future work should refine separation under lower coverage and divergence, improve modeling of highly variable and recombining regions, integrate plasmid resolution, and formalize relationships among read length, coverage, and haplotype resolution. Enhanced DNA extraction for higher molecular weight reads and hybrid strategies for polishing may further improve completeness and accuracy.

Limitations

Strainberry depends on an initial high-quality strain-oblivious assembly; poor backbones can limit separation quality. Separation quality degrades at low per-strain coverage (<~20x) and very low divergence (<~0.39%), leading to fragmented or missed separations. Long-read error profiles (especially Nanopore indels) reduce identity and completeness without polishing. Historical recombination increases SNV overdispersion and slightly reduces identity. Evaluation on real datasets is constrained by incomplete references and metrics (e.g., duplication ratio, contamination) that can conflate true multi-strain resolution with artifacts. Separation can increase fragmentation and apparent contamination/strain heterogeneity due to duplicated marker genes. The approach was not evaluated for plasmid resolution and may underperform in higher-complexity communities or for very low-abundance strains. Read length limitations from DNA extraction protocols may hinder full haplotype resolution.
