
Computer Science
StratoMod: predicting sequencing and variant calling errors with interpretable machine learning
N. Dwarshuis, P. Tonner, et al.
Discover StratoMod, an innovative interpretable machine-learning classifier developed by Nathan Dwarshuis, Peter Tonner, Nathan D. Olson, Fritz J. Sedlazeck, Justin Wagner, and Justin M. Zook. This groundbreaking research predicts germline variant calling errors by utilizing genomic context features, revolutionizing the assessment of variant calling accuracy and enhancing identification of clinically relevant variants.
~3 min • Beginner • English
Introduction
Sequencing technologies (short- and long-read) exhibit variable performance across genomic contexts such as homopolymers, tandem repeats, segmental duplications, and GC extremes. Bioinformatics tools (mappers, variant callers) also vary by context, necessitating informed tradeoffs among cost, time, compute, expertise, and accuracy for targeted regions. GIAB maintains stratifications and benchmarks that highlight context-specific performance and have enabled improved references and benchmarks (e.g., T2T and pangenome assemblies). However, stratifications describe categories rather than predict where errors will occur. Prior modeling (e.g., GATK VQSR, deep learning) often lacks interpretability, focuses on read-level features over genome context, and primarily targets precision (filtering false positives) rather than recall (predicting missed true variants). The research goal is to develop an interpretable model that predicts precision and recall as functions of genomic context for specific methods, enabling quantitative, explainable assessments across contexts and pipelines.
Literature Review
Existing evaluation frameworks use GIAB stratifications to summarize performance in low complexity, segmental duplications, hard-to-map, functional regions, GC, and other categories, revealing technology-dependent strengths (e.g., ONT in segmental duplications; Illumina in homopolymers). Variant quality modeling approaches include GATK VQSR (Gaussian mixture models) and deep learning callers that jointly use reference sequence and aligned-read features. Clinical lab methods also predict variants not requiring orthogonal confirmation. These methods improve precision but often lack interpretability and underemphasize genome-context features; importantly, they do not predict true variants likely to be missed. Interpretable generalized additive models, specifically Explainable Boosting Machines (EBMs), can capture univariate and pairwise effects with transparent contribution plots and have uncovered domain-dependent confounding in prior work, motivating their use to model genome-context-driven error propensities.
Methodology
Overview: StratoMod uses Explainable Boosting Machines (EBMs) to model the probability of correct calling (recall) or precision given genomic-context features. Separate models are trained for SNVs and INDELs using labels (TP, FP, FN) from comparisons of query callsets to benchmarks, followed by annotation with context features.
Data and benchmarks: Primary analyses used GIAB v4.2.1 benchmarks and a new draft HG002 T2T-HG002-Q100 assembly-based small-variant benchmark created with DeFrABB using dipcall on the Q100 diploid assembly aligned to GRCh38. Benchmark regions include high-confidence 1:1 aligned regions with exclusions for assembly gaps and large repeats; SVs and large tandem repeats overlapping SVs are excluded. Training focused on autosomes (chr1-22), with MHC removed and multi-allelics split; SVs (>50 bp) and genotype errors removed.
Labeling: Query VCFs (e.g., DeepVariant callsets for Illumina PCR-free WGS, PacBio HiFi, Element with BWA-MEM or VG) were compared to benchmarks using vcfeval (with refoverlap and retaining filtered records). Variants were labeled TP, FP, FN and converted to BED. For recall models, non-PASS variants were remapped (FP→TN; TP→FN), and only TP/FN were used (TP=1, FN=0). For precision models, FN removed and FP=0, TP=1. For Jaccard index models, FP and FN constitute errors against TP.
Feature engineering (22 main effects plus select interactions):
- Homopolymers (HOMOPOL): per-base imperfect and perfect lengths ≥4 bp from reference with 1 bp slop; includes length and imperfect fraction (max ~20%).
- Tandem repeats (TR): from UCSC simple repeats, merged with statistics (period, copy number, identity, indels), AT/GC content, length, count; 5 bp slop; unit size 1 excluded (handled as homopolymers).
- Segmental duplications (SEGDUP): from UCSC genomicSuperDups; number, size, and similarity (min/max/mean fracMatchIndel); no slop.
- RepeatMasker (REPMASK): classes SINE, LINE, LTR with lengths or binary indicators; no slop.
- Hard-to-map (MAP): GIAB GEM-based low-mappability stratifications at 100 bp and 250 bp as binary indicators.
- VCF-derived: VCF_indel_length (ALT-REF length difference; sign indicates insertion/deletion), VCF_input (categorical pipeline/method), with depth and VAF used only in a separate precision model.
Modeling: EBMs (interpretml ExplainableBoostingClassifier) with univariate and selected pairwise interactions. Interactions included VCF_input with all features and INDEL length with each homopolymer length. Training used random 80/20 splits. Outputs are logit scores decomposable by feature. Performance assessed via PR and ROC curves, plus calibration.
Use cases:
1) Predict recall for Illumina PCR-free and PacBio HiFi DeepVariant callsets using the HG002 Q100 benchmark; apply to pathogenic/likely pathogenic ClinVar variants to predict missed variants at a 90% recall threshold.
2) Compare linear (BWA-MEM) vs graph-based (VG/giraffe) mappers on Element DeepVariant callsets by modeling TP vs FN across hard-to-map contexts.
3) Track technology/software improvements: Ultima R2022 vs R2024 callsets (DeepVariant) and ONT guppy4+clair1 vs guppy5+clair3, focusing on homopolymer-driven error profiles.
Validation: Compare ClinVar model predictions to gnomAD v4 filtering by computing agreement plots between predicted error rates (1 − predicted Jaccard probability) and fractions of non-PASS variants, stratified by allele count and region types (e.g., segmental duplications).
Key Findings
- Interpretable context effects: Hard-to-map regions strongly predict increased FN rates for both SNVs and INDELs, with smaller increases for HiFi than Illumina, consistent with improved mapping of long reads. In LINEs, SNV errors increase with LINE length, with a spike around ~6 kb (full-length L1s); INDEL error increases with tandem repeat length, with HiFi less impacted for long repeats and long insertions.
- Model performance: On HG002 using the Q100 benchmark, SNV models achieved near 1.0 area under PR and ~0.96 ROC AUC for both Illumina and HiFi; INDEL models achieved PR ~0.84–0.86 and ROC AUC ~0.985. Similar performance observed using GIAB v4.2.1 and across genomes HG005/HG007.
- ClinVar predictions (90% recall threshold): HiFi predicted to miss far fewer pathogenic/likely pathogenic variants than Illumina. Disagreements show Illumina misses driven by segmental duplications, low mappability, tandem repeats, and large indels; HiFi misses concentrated in homopolymers. Genes with many predicted missed variants include PKD1, PMS2, NFI, CYP21A2, CHEK2, NEB, FLG, SDHA, ABCC6, SMN1, STRC, KMT2D, CYP11B1, CDKN1C, PIK3CA, PROS1, HYDIN, TUBB2B, and falsely duplicated GRCh38 genes (e.g., CBS). HiFi-DeepVariant had ≥5-fold fewer predicted misses in most of these genes except a subset (e.g., CBS, CYP21A2, NEB, ABCC6, HYDIN).
- Linear vs graph-based mapping: EBMs captured VG's advantage in hard-to-map regions: error increases less for VG in difficult-to-map regions, longer LINEs, and high-similarity segmental duplications (>92%). In discordant cases, VG often had higher predicted recall than BWA (sometimes by up to ~50%), especially in hard regions; examples show VG correctly mapping reads in segmental duplications by leveraging variation in the graph.
- Technology and software improvements:
• Ultima: R2024 vs R2022 shows substantial overall error reduction with improved homopolymer handling; newer iteration shows increasing advantage at longer homopolymers (especially INDELs).
• ONT pipeline: guppy5+clair3 markedly improves predicted recall over guppy4+clair1, especially for INDELs in long AT/GC homopolymers (>8–10 bp for AT; >6 bp for GC), consistent with reduced homopolymer miscounting.
- gnomAD validation: Agreement plots show reasonable correlation between StratoMod-predicted error rates and gnomAD non-PASS fractions, with StratoMod generally predicting fewer errors for low allele count variants (<10), consistent with gnomAD's conservative filtering of rare variants. Strong correlations within segmental duplications across a wide probability range. Differences largely attributed to lack of read-level features (e.g., strand bias) in StratoMod and differing modeling goals (individual callset correctness vs population-level filtering).
Discussion
StratoMod meets the need for an interpretable, context-aware model that predicts where variant calling will succeed or fail across pipelines. By quantifying how specific region types (e.g., homopolymers, tandem repeats, LINEs, and segmental duplications) and their attributes modulate error rates, StratoMod translates existing intuition into calibrated, actionable probabilities. This enables risk analysis for pipeline design, such as choosing long reads to improve recall in hard-to-map contexts while recognizing potential tradeoffs in homopolymers. The model also distinguishes mapper-specific effects, showing where graph-based mapping provides substantial gains, thereby informing both tool development and user choices based on resource constraints and target regions. Validation against gnomAD filtering supports the model’s external consistency while highlighting differences due to rare variant handling and absence of read-level features. Overall, StratoMod provides a transparent framework to predict recall (and precision/Jaccard when desired), explain error mechanisms, and guide technology and software selection for clinical and research applications.
Conclusion
This work introduces StratoMod, an interpretable EBM-based framework that predicts variant calling errors from genomic context, quantifies feature contributions, and compares pipelines and mappers. It precisely models recall for short- and long-read technologies, identifies challenging regions and genes where variants are likely to be missed, and demonstrates quantified advantages of graph-based mapping in hard-to-map regions. StratoMod also tracks improvements in emerging platforms and caller pipelines and aligns with external population filtering signals from gnomAD. The open-source pipeline enables users to train models tailored to their data and applications. Future directions include expanding to larger and more complex variants (SVs), enriching feature sets and interactions (with care to preserve interpretability), leveraging more comprehensive assembly-based benchmarks, and extending to additional data types (e.g., targeted/exome, somatic) with appropriate features and training data.
Limitations
- Scope: Focused on germline small variants; structural variants (>50 bp) not modeled. MHC and sex chromosomes were excluded; autosomes only.
- Generalizability: Feature profiles and performance depend on specific mappers/callers and training data; conclusions may not generalize across all pipelines without retraining.
- Features: A limited, carefully selected feature set was used to maintain interpretability; some error mechanisms (e.g., read-level artifacts like strand bias) are not modeled. Up to ~20% of false positives lack genome-context features; some variants intersect no features.
- Modeling: EBMs (tree-based GAMs) provide limited uncertainty quantification; only select pairwise interactions included to retain interpretability.
- Data: ClinVar includes a small fraction of somatic variants (estimated ~2%), though gnomAD-based validation used germline overlap. Training emphasized WGS; exome/targeted panels may require additional coverage/edge features and more data for robust training.
- Predictions vs reality: Thresholds (e.g., 90% recall) for categorizing missed variants are user-dependent; predicted recall is related but not identical to the probability of a false negative in any genome and depends on priors (variant presence/benchmark inclusion).
Related Publications
Explore these studies to deepen your understanding of the subject.