Computer Science
Accurate somatic variant detection using weakly supervised deep learning
K. Krishnamachari, D. Lu, et al.
Discover VarNet, a deep learning framework developed by Kiran Krishnamachari and colleagues for identifying somatic mutations in tumor samples. Trained with weak supervision on roughly 4.6 million pseudo-labeled candidate sites from real tumor genomes, the model aims to improve variant calling accuracy over conventional statistical and heuristic methods.
~3 min • Beginner • English
Introduction
Somatic mutation detection in tumor DNA sequencing is complicated by biological variation (e.g., tumor heterogeneity) and technical noise (e.g., sequencing errors). State-of-the-art somatic variant callers rely on statistical models of variant allele frequencies combined with heuristic filters crafted from expert knowledge. Machine learning provides a data-centric alternative that can learn from large volumes of sequencing data. Prior work has incorporated ML either as post-hoc scoring models or by learning from aggregated counts near candidate sites; deep learning on raw read alignments has shown promise in germline calling. The research question addressed here is whether an end-to-end deep learning model operating on rich image-like encodings of tumor and matched-normal alignments can accurately and robustly identify somatic SNVs and indels across diverse cancers and sequencing conditions, potentially reducing reliance on human-engineered features and heuristics. The authors present VarNet, trained with weak supervision from ensemble pseudo-labels across multiple cancer types, and evaluate its generalization on real and synthetic benchmarks.
Literature Review
Existing methods include Strelka2, which augments a probabilistic model with a machine learning-based confidence score; SMURF, an ensemble caller using features from multiple variant callers; and NeuSomatic, a deep learning model that predicts somatic variants from aggregated base/read counts in a small window around candidate sites. DeepVariant demonstrated that image-based deep learning on raw alignments can match expert visual review for germline variants. However, deep models operating directly on raw tumor-normal alignments for somatic calling have been less explored, particularly given deeper tumor coverage, intratumor heterogeneity, and the need to consider matched normal reads. The work positions VarNet relative to these by using larger context encodings of raw alignments and training on real tumor datasets with weak supervision.
Methodology
Training data: 356 matched tumor/normal whole genomes spanning seven cancer types (lung, sarcoma, colorectal, lymphoma, thyroid, liver, gastric), sequenced on Illumina HiSeq at 50–150× depth and processed with bcbio-nextgen. Reads aligned to GRCh37 with BWA-MEM, duplicates marked/removed, and post-processing (including local indel realignment) performed with GATK.
Weak supervision and pseudo-labels: High-confidence pseudo-labels generated using SMURF, an ensemble method combining outputs and features from MuTect2, Freebayes somatic, VarDict, and VarScan via a random forest. For training, class-balanced datasets were created: SNV model with ~2.5 million sites and indel model with ~2.1 million sites, each containing equal mutated and non-mutated sites. Mutated sites were SMURF positives; non-mutated sites were positions not called by SMURF but called by at least one of the four callers, to increase task difficulty and discriminative signal. Cancer-type representation was adjusted: downsampling overrepresented cancer types improved SNV generalization, whereas indel training benefited from using all available data due to lower indel frequency and noisier pseudo-labels.
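As a rough illustration of this labeling scheme, the sketch below pairs SMURF positives with hard negatives (sites called by at least one base caller but rejected by SMURF) and balances the two classes by subsampling; the data structures and function name are assumptions, not the authors' code.

```python
# A rough sketch, assuming each caller's output has already been loaded as a
# set of (chrom, pos, ref, alt) tuples; names like `smurf_calls` and
# `caller_calls` are illustrative only.
import random

def build_training_sites(smurf_calls, caller_calls, seed=0):
    """Pair SMURF positives with hard negatives called by >=1 base caller."""
    positives = set(smurf_calls)

    # Union of candidate sites emitted by the four base callers.
    union_calls = set().union(*caller_calls.values())

    # Hard negatives: called by at least one caller but rejected by SMURF.
    negatives = union_calls - positives

    # Class-balance by subsampling the larger class.
    random.seed(seed)
    n = min(len(positives), len(negatives))
    pos_sample = random.sample(sorted(positives), n)
    neg_sample = random.sample(sorted(negatives), n)

    labeled = [(site, 1) for site in pos_sample] + [(site, 0) for site in neg_sample]
    random.shuffle(labeled)
    return labeled

# Example usage (toy inputs):
# sites = build_training_sites(smurf_calls,
#                              {"mutect2": m2, "freebayes": fb,
#                               "vardict": vd, "varscan": vs})
```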
Input encoding: For each candidate site, tumor and matched-normal aligned reads are encoded as image-like tensors with multiple channels: base identity, base quality, mapping quality, strand bias, and a reference base channel. Deletions are encoded with a unique token; insertions are encoded in place with adjustments to the reference channel. Indels up to 35 bp are supported. Tumor and normal images are concatenated. Input tensor sizes: SNV (100, 70, 5) allowing up to 100 overlapping reads and a window that repeats the candidate position 5× to amplify signal; indel (140, 150, 5) allowing up to 140 reads and wider context (no candidate repetition due to variable indel length). If coverage exceeds limits, reads are randomly subsampled.
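A minimal sketch of this kind of read encoding is shown below, assuming pysam is available, `pos` is 0-based, and `ref_seq` holds the reference bases for the window; the normalization constants, channel order, and function name are assumptions. The real encoder additionally stacks the matched-normal image, encodes indel tokens, and repeats the candidate column for SNVs.

```python
import random
import numpy as np
import pysam

BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}

def encode_site(bam_path, chrom, pos, ref_seq, max_reads=100, width=70):
    """Encode reads overlapping `pos` into a (max_reads, width, 5) tensor."""
    half = width // 2
    start, end = pos - half, pos + half
    tensor = np.zeros((max_reads, width, 5), dtype=np.float32)

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        reads = [r for r in bam.fetch(chrom, start, end) if not r.is_unmapped]

    # Randomly subsample if coverage exceeds the image height.
    if len(reads) > max_reads:
        reads = random.sample(reads, max_reads)

    for row, read in enumerate(reads):
        for qpos, rpos in read.get_aligned_pairs():
            if qpos is None or rpos is None or not (start <= rpos < end):
                continue  # skip indel/clip columns and out-of-window bases
            col = rpos - start
            tensor[row, col, 0] = BASE_CODE.get(read.query_sequence[qpos].upper(), 0.0)
            tensor[row, col, 1] = read.query_qualities[qpos] / 40.0   # base quality
            tensor[row, col, 2] = read.mapping_quality / 60.0         # mapping quality
            tensor[row, col, 3] = 1.0 if read.is_reverse else 0.0     # strand
            tensor[row, col, 4] = BASE_CODE.get(ref_seq[col].upper(), 0.0)  # reference
    return tensor
```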
Models and training: SNV caller is a custom ConvNet with 10 convolutional blocks (Conv + ReLU + BatchNorm), two average-pooling layers, followed by dense layers (256, 128, 64 units) and a sigmoid output; ~3.5M trainable parameters. Indel caller uses InceptionV3 (~20M parameters). Both trained with Adam (lr=1e-4), mini-batch size 32, implemented in TensorFlow, trained on Nvidia Titan X GPUs.
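The tf.keras sketch below mirrors the described SNV architecture (ten Conv+ReLU+BatchNorm blocks, two average-pooling layers, dense 256/128/64 layers, sigmoid output) and the reported optimizer settings; the filter counts, kernel sizes, pooling positions, and flattening strategy are assumptions, so its parameter count will not match the reported ~3.5M.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_snv_model(input_shape=(100, 70, 5)):
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    filters = [32, 32, 64, 64, 64, 128, 128, 128, 128, 128]  # assumed
    for i, f in enumerate(filters):
        # Conv + ReLU + BatchNorm block
        x = layers.Conv2D(f, kernel_size=3, padding="same")(x)
        x = layers.ReLU()(x)
        x = layers.BatchNormalization()(x)
        if i in (3, 7):  # two average-pooling layers (assumed positions)
            x = layers.AveragePooling2D(pool_size=2)(x)
    x = layers.Flatten()(x)
    for units in (256, 128, 64):
        x = layers.Dense(units, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_snv_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="binary_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, batch_size=32, ...)
```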
Genome pre-filtering and post-processing: Prior to model inference, VarNet applies computationally efficient genome pre-filters to discard positions with very low likelihood of somatic mutation, reducing compute while maintaining high sensitivity. After calling, VarNet performs germline filtering without local re-assembly: scanning a ±10 bp window around each somatic call to identify suspected germline SNPs/indels in the normal; somatic calls overlapping or adjacent (≤1 bp) to these are filtered.
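A minimal sketch of the germline-proximity filter is given below, assuming suspected germline variants detected in the normal are available as (chrom, start, length) tuples; the function and variable names are illustrative.

```python
def filter_near_germline(somatic_calls, germline_variants, window=10, margin=1):
    """Drop somatic calls that overlap or lie within `margin` bp of a suspected
    germline SNP/indel found in a +/-`window` bp scan of the normal sample."""
    kept = []
    for chrom, pos in somatic_calls:
        suspect = False
        for g_chrom, g_start, g_len in germline_variants:
            if g_chrom != chrom:
                continue
            g_end = g_start + g_len - 1
            # Only germline variants inside the scan window are considered.
            if g_end < pos - window or g_start > pos + window:
                continue
            # Filter if the call overlaps or is adjacent (<= margin bp).
            if g_start - margin <= pos <= g_end + margin:
                suspect = True
                break
        if not suspect:
            kept.append((chrom, pos))
    return kept
```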
Benchmarks and simulations: Independent datasets processed similarly to training: ICGC Gold Set (CLL and MBL) downsampled from ~300× to ~100×; COLO829 reference set; SEQC2 breast cancer reference set (aligned to GRCh38, GATK4, evaluated in high-confidence regions); DREAM synthetic tumors (80× split tumor/normal, with in silico mutations). Tumor purity dilution experiments performed by mixing MBL tumor with matched-normal reads at varying proportions and downsampling tumor to 40×. Performance assessed using precision–recall curves by thresholding caller-specific scores and reporting F1 (harmonic mean of precision and recall).
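For reference, the sketch below computes the maximum F1 obtained by sweeping a score threshold over a caller's candidate calls against a truth set, which is how the reported best F1 values are derived; the input data structures are assumptions.

```python
def best_f1(scored_calls, truth_set):
    """scored_calls: {(chrom, pos, ref, alt): score}; truth_set: set of the same keys."""
    ranked = sorted(scored_calls.items(), key=lambda kv: kv[1], reverse=True)
    best, tp, fp = 0.0, 0, 0
    n_truth = len(truth_set)
    for key, _score in ranked:  # lower the threshold one call at a time
        if key in truth_set:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_truth if n_truth else 0.0
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best
```

Each (precision, recall) pair traced by this sweep corresponds to one point on the precision-recall curve.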
Model interpretability: Guided backpropagation used to compute pixel-wise importance in input channels, visualized as heatmaps at individual sites and averaged over many sites to elucidate features leveraged by the model.
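As a rough sketch of this analysis, the code below computes a gradient-based saliency map for one input image with tf.GradientTape and defines a guided-ReLU custom gradient implementing the guided-backpropagation rule; replacing every ReLU in the trained model with this activation (omitted here for brevity) would be needed for full guided backpropagation.

```python
import tensorflow as tf

@tf.custom_gradient
def guided_relu(x):
    y = tf.nn.relu(x)
    def grad(dy):
        # Guided-backprop rule: pass gradients only where both the incoming
        # gradient and the forward input are positive.
        return dy * tf.cast(dy > 0, dy.dtype) * tf.cast(x > 0, dy.dtype)
    return y, grad

def saliency_map(model, image):
    """Pixel-wise importance of one (H, W, C) input for the sigmoid output."""
    # For full guided backpropagation, first build a copy of `model` whose
    # ReLU activations are replaced with `guided_relu`.
    x = tf.convert_to_tensor(image[None, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        score = model(x, training=False)[:, 0]
    grads = tape.gradient(score, x)[0]
    return tf.abs(grads).numpy()  # heatmap over positions and channels
```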
Key Findings
Real tumor benchmarks (ICGC Gold Set, downsampled to ~100×):
- MBL: VarNet F1 = 0.84 (SNV) and 0.79 (indel); Strelka2 0.79 (SNV), 0.65 (indel); Mutect2 0.68 (SNV), 0.40 (indel).
- CLL: VarNet F1 = 0.87 (SNV) and 0.62 (indel); Strelka2 0.85 (SNV), 0.52 (indel). NeuSomatic showed inconsistent performance: SNV F1 = 0.43 (CLL) and 0.76 (MBL); indel F1 = 0.16 (CLL) and 0.22 (MBL).
Other reference datasets:
- COLO829: SNV — VarNet best F1 = 0.94; Indel — VarNet 0.63 vs Strelka2 0.76 and Mutect2 0.66.
- SEQC2: SNV — VarNet and Mutect2 tied at F1 = 0.92; Indel — Strelka2 0.74, VarNet 0.70, Mutect2 0.70.
Overall summary across real tumors: VarNet achieved highest average max F1 for SNVs (0.89), ahead of Strelka2 (0.85) and Mutect2 (0.74). For indels, VarNet averaged F1 = 0.69, ahead of Strelka2 (0.64) and Mutect2 (0.49).
Low VAF performance (ICGC CLL/MBL): For VAF < 0.3, VarNet average F1 = 0.70 vs Strelka2 0.49, Mutect2 0.31, Freebayes 0.08. Performance improved for all methods at higher VAF. Strelka2 and Freebayes had markedly lower F1 at VAF > 0.5 (29% and 35% reduction vs the 0.45–0.5 range), suggesting misclassification of high-VAF somatic mutations as germline; VarNet maintained high accuracy across VAFs.
Tumor purity and depth: Reducing depth to 40× had minor impact (~1% change). With decreasing purity in MBL: at 70% purity, VarNet F1 = 0.80 (Strelka2 0.76; Mutect2 0.58); VarNet recall decreased by ~7% from original while precision rose from 0.96 to 0.97. At 50% purity, VarNet F1 = 0.77 (Strelka2 0.73; Mutect2 0.54), with recall 0.64 and precision 0.97.
DREAM synthetic tumors (80×): SNV — VarNet average top F1 = 0.90 (Strelka2 0.86; Mutect2 0.81); NeuSomatic best on DREAM SNVs (F1 = 0.96) but inconsistent on real ICGC (avg F1 = 0.60). Indels — VarNet average F1 = 0.66 (Strelka2 0.68; Mutect2 0.81); Mutect2’s strong DREAM indel results contrasted with lower performance on real ICGC indels (F1 ≈ 0.41), indicating higher variance; VarNet’s indel performance was consistent across real (0.69) and synthetic (0.66) datasets; NeuSomatic had the lowest indel F1 on DREAM (avg 0.09).
Interpretability: Guided backpropagation heatmaps showed VarNet assigns highest importance to variant alleles at tumor candidate sites across channels (base identity, base/mapping quality, strand). Mapping quality activation concentrated near mutated sites for true mutations vs evenly distributed in non-mutated sites; upstream context positions contributed to predictions; reference base channel was less informative globally.
Discussion
VarNet addresses the challenge of accurately detecting somatic variants by learning directly from raw tumor and matched-normal alignments using a weakly supervised approach. Across multiple real tumor benchmarks, VarNet consistently outperformed or matched state-of-the-art callers, especially for SNVs, and maintained robust performance across low VAFs, reduced tumor purity, and lower read depth. Its consistent indel performance across real and synthetic data suggests improved generalization relative to models that overfit to synthetic training distributions. The interpretability analyses indicate that VarNet learns to focus on biologically and technically meaningful features (variant alleles at candidate sites, associated quality metrics, and local sequence/read context) without explicit feature engineering, effectively emulating expert visual review. Furthermore, on independent benchmarks VarNet surpassed SMURF, the ensemble method used to generate its pseudo-labels, highlighting the capacity of deep learning to learn beyond noisy labels given sufficient data. These results underscore the potential of scalable deep learning to reduce reliance on handcrafted heuristics in somatic variant calling.
Conclusion
The study introduces VarNet, an end-to-end deep learning framework that encodes raw tumor and matched-normal read alignments into image-like tensors and predicts somatic SNVs and indels with strong accuracy. Trained on 356 tumor-normal genomes with weak supervision from ensemble pseudo-labels, VarNet consistently achieved state-of-the-art or superior performance across diverse real and synthetic benchmarks, including challenging scenarios (low VAF, reduced purity, and lower depth). The approach demonstrates that learned representations can effectively replace human-engineered features and heuristic filters. Future work includes improving indel calling by leveraging larger and more diverse training datasets, employing self-training with VarNet-generated pseudo-labels, and extending the method to more challenging genomic regions and variant types.
Limitations
Indel detection remains more challenging than SNV calling for VarNet and other methods, with performance more sensitive to training data scale and label quality. The weakly supervised training relies on pseudo-labels from existing callers, which can be noisy—particularly for indels—potentially biasing the model. Although VarNet demonstrated robustness, performance can vary across datasets and genomic contexts, and additional high-quality training data would likely be needed to further improve indel accuracy and generalization. The study primarily benchmarked short variants; larger structural variants and complex indels are outside the current scope.