VeChat: correcting errors in long reads using variation graphs

X. Luo, X. Kang, et al.

VeChat, developed by Xiao Luo, Xiongbin Kang, and Alexander Schönhuth, is an open-source tool for correcting errors in long-read sequencing data. It uses variation graphs rather than consensus sequences as correction templates, which avoids the bias that consensus sequences introduce, substantially reduces error rates on both PacBio and Oxford Nanopore reads, and improves the quality of downstream genome assemblies.

Introduction

Third-generation sequencing (TGS) technologies such as PacBio and Oxford Nanopore provide long reads that enable applications including haplotype phasing, genome assembly, and complex variant calling. However, TGS reads have elevated error rates (5–15%), dominated by indels, which complicates downstream analyses and limits direct applicability of tools developed for low-error short reads. Existing error-correction strategies fall into hybrid (using short reads) and self-correction (using only long reads). Hybrid methods can suffer from mapping ambiguities in repetitive regions, PCR-induced biases, and practical constraints of multi-platform sequencing. Self-correction methods commonly rely on consensus sequences derived from multiple sequence alignments, de Bruijn graphs, or combinations thereof. A fundamental limitation of consensus-based templates is that they tend to mask true variants, especially those present in low-frequency haplotypes or strains in mixed samples (polyploid genomes, metagenomes, cancer), biasing correction toward the dominant allele. The research question addressed is whether a graph-based reference representation that captures alternative haplotypes can preserve haplotype-specific variation during error correction while reducing sequencing errors. The study introduces VeChat, a self-correction method that leverages variation graphs to perform haplotype-aware error correction for long reads, aiming to maintain genetic diversity and improve downstream analyses such as haplotype-aware assembly.

Literature Review

Prior self-correction methods for long reads can be grouped into: (1) multiple sequence alignment (MSA)-based approaches (e.g., Racon, Canu’s correction module, FLAS), (2) de Bruijn graph (DBG)-based approaches (e.g., Daccord using local DBGs), and (3) hybrids combining MSA and DBG ideas (e.g., LoRMA, CONSENT). All these methods ultimately produce a consensus sequence as a correction template, which cannot capture sequence ambiguity and thus favors default alleles in uncertain regions, potentially masking true haplotype-specific variants at low coverage or low frequency. Hybrid correction integrating short reads (e.g., proovread and related tools) can mitigate error but suffers from ambiguous mappings in repeats, reintroduces PCR-induced biases, and may be impractical due to multi-platform requirements. Variation graphs have been successfully applied to read mapping, variant calling, haplotype modeling, and assembly from mixed samples, but had not been applied to long-read error correction prior to this work.

Methodology

VeChat performs haplotype-aware self-correction via two iterative cycles operating on local windows of read piles represented as variation graphs. The workflow is as follows.

(1) Read overlap calculation: In cycle 1, all-vs-all overlaps among raw reads are computed using Minimap2 with seed-chaining (no base-level alignment). Overlaps shorter than 500 bp, self-overlaps, internal matches, and high-error overlaps are filtered. In cycle 2, overlaps among pre-corrected reads include base-level alignments, enabling overlap identity estimation; haplotype-aware filtering retains overlaps with identity ≥ δ (δ=0.99 for simulated data, 0.98 for real data) to keep reads from the same haplotype.

(2) Read alignment pile generation: For each target read r, overlapping reads are aligned to r using fast edit-distance alignment to build a pile. Dangling ends are trimmed.

(3) Window segmentation: The pile is segmented into non-overlapping 500 bp windows along the target read; the corresponding segment in r is the target subread. This reduces computational complexity and enables parallelism.

(4) Error correction per window:

Cycle 1 (graph pruning, the novel step): Subreads within a window are used to construct a variation graph G(V,E,P) using POA (SIMD-accelerated). Nodes represent nucleotides; edges capture observed adjacent nucleotide pairs in subreads; paths encode subreads. Because graphs built from noisy reads contain spurious nodes and edges, VeChat iteratively prunes edges using a frequent itemset model with Support and Confidence:
- For node v, weight w(v) is the expected number of reads supporting v: the sum over covering reads of 1 (FASTA) or 1−P_r (FASTQ, where P_r is the Phred-derived error probability).
- For edge e=(vi,vj), weight w(e) approximates the expected number of reads covering e: 0.5 * sum over reads covering (vi,vj) of w(vi)+w(vj).
- Support(e)=w(e).
- Confidence is asymmetric: Confidence(vi→vj)=w(vi,vj) / sum_{vk in S(vi)} w(vi,vk); Confidence(vj→vi)=w(vi,vj) / sum_{vk in P(vj)} w(vk,vj); Confidence(e)=max of the two.

Edges are pruned if Support(e) < s or Confidence(e) < c, with c=0.2 and s scaling with local coverage: s=0.2×C, where C is the average coverage in the subwindow. After each pruning round, subreads are realigned to the pruned graph (POA) to update weights; this is iterated (empirically, three iterations). The optimal alignment path of the target subread in the final pruned graph is the pre-corrected subread.

Cycle 2 (consensus): Using haplotype-filtered overlaps, VeChat constructs local variation graphs again but now derives a single consensus per window via the heaviest-bundle dynamic programming algorithm instead of iterative pruning, as overlapping reads are assumed to come from the same haplotype after cycle 1.

(5) Concatenation: Pre-corrected subreads are concatenated to reconstruct the full target read.

(6) Merging: The procedure is repeated for all reads; pre-corrected reads from cycle 1 feed into cycle 2, and cycle 2 outputs the final corrected reads.

Datasets and benchmarking: Simulated PacBio CLR and ONT reads were generated with PBSIM2 for polyploid E. coli mixtures (ploidy 2–4, 30×/haplotype, 10% error), and with CAMISIM for metagenomes (low complexity: 20 strains; high complexity: 100 strains). Real PacBio datasets included a yeast pseudo-diploid mixture, the NWC metagenome, and a Microbial 10-plex mock metagenome. Real ONT datasets included HG002 (diploid) and a human gut microbiome sample.

Competing tools: Racon, CONSENT, Canu, and Daccord, run with defaults.

Evaluation metrics: QUAST (error rate = mismatches + indels, haplotype coverage, N50/NGA50, misassemblies) where references were available; Merqury (QV, switch error, k-mer completeness) for reference-free evaluation. Runtime and memory were recorded on 48-core x86_64 machines.
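The Support/Confidence pruning rule from cycle 1 can be illustrated with a small sketch. This is not VeChat's implementation: the function names (`edge_weights`, `confidence`, `prune`), the representation of subreads as lists of integer node ids, and the toy graph are all hypothetical. It also assumes that each read contributes 0.5 × (its weight at vi + its weight at vj) to an edge it traverses, which is one reading of the edge-weight definition above, and it omits the POA realignment that VeChat performs between pruning rounds.

```python
from collections import defaultdict

def edge_weights(paths, error_probs=None):
    """Expected number of reads covering each edge.

    paths: one list of node ids per subread (its path through the graph).
    error_probs: optional per-base error probabilities parallel to paths
    (FASTQ case); if None, every base contributes weight 1 (FASTA case).
    """
    w = defaultdict(float)
    for r, path in enumerate(paths):
        for i in range(len(path) - 1):
            wi = 1.0 if error_probs is None else 1.0 - error_probs[r][i]
            wj = 1.0 if error_probs is None else 1.0 - error_probs[r][i + 1]
            w[(path[i], path[i + 1])] += 0.5 * (wi + wj)
    return dict(w)

def confidence(w):
    """Confidence(e) = max of the forward and backward confidences."""
    out_sum = defaultdict(float)  # total weight leaving each node
    in_sum = defaultdict(float)   # total weight entering each node
    for (u, v), x in w.items():
        out_sum[u] += x
        in_sum[v] += x
    return {(u, v): max(x / out_sum[u], x / in_sum[v]) for (u, v), x in w.items()}

def prune(w, coverage, s_frac=0.2, c=0.2):
    """Keep edges with Support >= s_frac * coverage and Confidence >= c."""
    s = s_frac * coverage
    conf = confidence(w)
    return {e for e, x in w.items() if x >= s and conf[e] >= c}

# Toy window with 10 subreads: a major haplotype (0->1->3), a minor
# haplotype (0->2->3), and a single erroneous read (0->4->3).
paths = [[0, 1, 3]] * 6 + [[0, 2, 3]] * 3 + [[0, 4, 3]]
weights = edge_weights(paths)
kept = prune(weights, coverage=len(paths))
print(sorted(kept))
```

In this toy window the minor-haplotype edges (support 3) survive the threshold s = 0.2 × 10 = 2, while the edges seen in only one read (support 1) are removed. Because Confidence takes the maximum of the two directions, the single-read edges still have confidence 1.0 through the node that only they touch, so here it is the Support threshold that removes them.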

Key Findings

- Simulated PacBio CLR polyploid genomes (Table 1): VeChat achieved the lowest error rates across ploidies with high haplotype coverage and favorable N50/NGA50 and misassembly counts. Diploid error rate 0.014% (CONSENT 0.194%, Racon 0.276%, Canu 0.308%, Daccord 0.423%); triploid 0.031% (vs 0.276–0.807%); tetraploid 0.074% (vs 0.275–0.833%). This corresponds to approximately 14–30× (ploidy 2), 9–26× (ploidy 3), and 4–11× (ploidy 4) lower error rates. Mismatch rates were 4–69× lower than alternatives.
- Simulated ONT polyploid genomes (Table 2): VeChat error rates were 0.022% (diploid), 0.090% (triploid), and 0.169% (tetraploid), delivering ~10–20×, ~3–9×, and ~2–5× reductions, respectively, with consistently lower mismatch rates (2–59× lower).
- Simulated PacBio CLR metagenomes (Table 3): Low complexity (20 genomes): VeChat error rate 0.036% vs 0.200–0.259% (6–7× lower). High complexity (100 genomes): 0.088% vs 0.274–0.354% for methods that ran (3–4× lower). Racon and Daccord failed on the high-complexity dataset; VeChat succeeded.
- Real PacBio datasets (Table 4): Yeast pseudo-diploid: VeChat 0.236% vs 0.503–0.947% (2–4× lower). NWC metagenome: VeChat 0.101% vs 0.140–0.787% (1.4–7.8× lower). Microbial 10-plex: VeChat 0.089% vs 0.297–0.495% (3.3–5.6× lower).
- Real ONT datasets: Evaluation with Merqury showed metric-specific biases favoring k-mer/DBG-based tools; nonetheless, VeChat achieved ~1.5× lower QV error rate and ~1.2× lower switch error than CONSENT on HG002 while losing some haplotype coverage. On the human gut microbiome, Daccord had the lowest QV error, but VeChat achieved comparable accuracy and substantially lower overall error in contexts with known ground truth (noting Merqury’s bias).
- Varying coverage (triploid E. coli, PacBio): VeChat maintained 2–47× lower error rates across 10×–50× per haplotype; its error rate improved from 0.311% (10×) to 0.017% (50×).
- Ultra-high coverage HIV (5 strains, ~1000×): VeChat was best on PacBio and second-best on ONT; it was effective but less so for ONT in this setting.
- Varying sequencing error rates (triploid E. coli, PacBio): VeChat achieved 10–93×, 9–26×, and 7–9× lower error rates at 5%, 10%, and 15% raw error, respectively; VeChat’s corrected error dropped from 0.091% (15%) to 0.009% (5%).
- Assembly improvements: Using VeChat-corrected reads substantially boosted (Hi)Canu and (meta)Flye assemblies. In diploid tests, VeChat+HiCanu increased haplotype coverage from 92% to 99.6% (+7.6 percentage points), with ~10× higher N50/NGA50, fewer misassemblies, and >90× lower error rate. Gains persisted and increased with ploidy.
- Runtime/memory: VeChat is slower (by ~0.6–7.1× depending on dataset) and uses more memory than alternatives, but uniquely handled large/high-complexity datasets where some competitors failed.
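The fold reductions quoted above are ratios of a competitor's error rate to VeChat's. As a minimal check, the snippet below recomputes them for the simulated PacBio CLR diploid case using only the Table 1 error rates given in this summary; the variable names are illustrative.

```python
# Error rates (%) for the simulated PacBio CLR diploid dataset (Table 1).
vechat = 0.014
others = {"CONSENT": 0.194, "Racon": 0.276, "Canu": 0.308, "Daccord": 0.423}

# Fold reduction = competitor error rate / VeChat error rate.
folds = {tool: rate / vechat for tool, rate in others.items()}

for tool, fold in folds.items():
    print(f"{tool}: {fold:.1f}x higher error than VeChat")
```

The smallest and largest ratios round to 14 and 30, matching the reported 14–30× range for ploidy 2.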

Discussion

The findings support the hypothesis that representing read variation with graphs during correction preserves haplotype-specific differences while reducing sequencing errors. By avoiding a single consensus template, VeChat prevents masking of true variants, especially in low-frequency haplotypes or mixed samples. Iterative pruning guided by Support and Confidence effectively removes edges induced by random sequencing errors, while retaining edges supported by recurrent, co-occurring variants. Empirically, VeChat achieves markedly lower mismatch rates, which matters because SNPs are key haplotype markers, while also reducing indel errors. The method’s benefits translate directly to downstream assembly: corrected reads retain the haplotype identity and low error rate required by assemblers like HiCanu, yielding assemblies with higher completeness (haplotype coverage), greater contiguity (N50/NGA50), fewer misassemblies, and lower base error. The evaluation also underscores metric dependencies: k-mer-based Merqury can bias comparisons toward DBG-centric tools, whereas reference-based QUAST exposes VeChat’s strengths more uniformly. Overall, the results indicate that integrating haplotype structure into error correction is not only beneficial but can be crucial for accurate analysis of polyploid genomes and metagenomes.

Conclusion

VeChat introduces a variation graph-based, haplotype-aware self-correction strategy for long reads that preserves genetic diversity while substantially reducing errors across technologies (PacBio, ONT), ploidies, and dataset complexities. It improves downstream haplotype-aware assemblies when used as a preprocessing step. As long-read technologies continue to lower raw error rates, VeChat’s relative advantages become even more evident, achieving sub-0.01% corrected error rates in favorable conditions. Future work includes optimizing computational performance (e.g., more efficient edit-distance and sequence-to-graph alignment routines), leveraging finer-grained components of off-the-shelf tools to reduce overhead, and designing pipelines that more tightly integrate correction with assembly for specific contexts (polyploid, cancer, metagenomes).

Limitations

VeChat incurs higher runtime and peak memory usage than competing tools and can be up to several-fold slower, especially on large or complex datasets. While effective at ultra-high coverage, it is not necessarily the most effective for ONT reads in such settings. Evaluation on real ONT data using Merqury is subject to metric-specific biases favoring DBG-based methods, complicating direct comparisons. Optimizing pipelines that combine correction and assembly for particular applications remains future work and was beyond the scope of this study.
