Introduction
Identifying somatic mutations from DNA sequencing of tumor samples is crucial for cancer research and precision oncology. Current methods often rely on statistical models of variant allele frequencies and on heuristic filters to minimize false positives, and they are largely built on human expert knowledge of DNA sequencing data and tumor biology. Machine learning offers a complementary, data-driven approach that can exploit the large volumes of next-generation sequencing data now available. Existing machine learning approaches to somatic variant calling include Strelka2, which uses a machine learning model to predict variant confidence; SMURF, an ensemble caller that applies machine learning to features from multiple variant callers; and NeuSomatic, which applies deep learning to aggregated base and read counts. DeepVariant, a germline variant caller, applies deep learning to images of aligned DNA reads, much as human experts review alignments manually. However, a deep learning approach that operates directly on raw DNA read alignments for somatic variant calling, and that accounts for the complexity of tumor sequencing data, intratumor heterogeneity, and matched normal reads, has not been extensively explored. This paper introduces VarNet, a deep learning approach for predicting somatic single nucleotide variants (SNVs) and insertions and deletions (indels) from tumor-normal sequencing data. VarNet generates image representations of aligned reads that incorporate base quality, mapping quality, and strand bias. Because large labeled datasets are scarce in cancer genomics, VarNet uses weakly supervised learning, generating high-confidence pseudo-labels across multiple cancer types and whole genomes. VarNet's performance is evaluated on real and synthetic benchmark datasets.
Literature Review
Several methods for somatic variant calling have been developed, each with its own strengths and weaknesses. Statistical methods combined with heuristic filters are widely used, but they rely heavily on human expertise and may not fully capture the complexity of the data. Machine learning offers a potential improvement by learning patterns directly from data. Strelka2 incorporates machine learning to improve its probabilistic model. SMURF uses an ensemble approach that combines multiple variant callers with machine learning. NeuSomatic applies deep learning to local read counts, but it cannot use the full context of the alignments. DeepVariant, designed for germline variants, shows the potential of image-based deep learning for variant calling, but it has not been adapted to the complexities of somatic variants. This study builds on these previous works, aiming to create a more accurate and robust somatic variant caller using a novel deep learning approach.
Methodology
VarNet was trained on data from over 300 matched tumor-normal whole-genome sequenced (WGS) samples across seven cancer types. In place of true ground-truth labels, pseudo-labels were generated using SMURF, an ensemble caller that combines predictions from four popular variant callers (Mutect2, Freebayes, VarDict, and VarScan). This weakly supervised approach reduces the need for extensive manual annotation. For training, an image-like representation was created for each candidate site. These representations incorporated raw alignment data: base, base quality, mapping quality, strand bias, and surrounding sequence context. A custom convolutional neural network (ConvNet) was used for SNV calling, and an InceptionV3 architecture was used for indel calling. The SNV model had approximately 3.5 million trainable parameters, whereas the InceptionV3 model for indels had 20 million. Both models were trained using the Adam optimizer. The input encoding for SNVs was a 100x70x5 tensor, representing 100 overlapping alignments, 70 base pairs of context (30 bp upstream and 40 bp downstream of the candidate site), and 5 input channels for base, base quality, mapping quality, strand bias, and reference base information. The candidate site was replicated five times in the SNV input to enhance the signal. Indels were encoded in a similar way, but with a 140x150x5 tensor to account for their variable length. Before data were passed to VarNet, a pre-filtering step removed positions that are extremely unlikely to be somatic mutations, which reduces computational cost. Finally, germline variant filtering was performed as post-processing, using a local 10 bp window to identify and remove germline SNPs or short indels.
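To make the SNV input encoding described above more concrete, the following is a minimal sketch of how a 100x70x5 tensor could be filled from aligned reads. It is not VarNet's actual implementation. The `Read` class, the `encode_site` function, the numeric base codes, the quality scaling, the channel order, and the choice of window (30 columns before the candidate site, with the site at column 30) are all assumptions made for illustration. Reads are assumed to contain no indels, and the five-fold replication of the candidate site is omitted because the summary does not say how it is laid out.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

# Tensor shape taken from the text: 100 reads x 70 columns x 5 channels.
MAX_READS, WIDTH, CHANNELS = 100, 70, 5
UPSTREAM = 30  # assumed: the candidate site sits at column index 30

# Hypothetical numeric codes for bases; 0.0 is reserved for "no coverage".
BASE_CODE = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}


@dataclass
class Read:
    """Simplified aligned read (hypothetical structure, no indels)."""
    start: int        # 0-based reference position of the first aligned base
    bases: str        # aligned bases
    quals: List[int]  # per-base Phred qualities
    mapq: int         # mapping quality
    reverse: bool     # True if aligned to the reverse strand


def encode_site(reads: List[Read], reference: str, site: int) -> np.ndarray:
    """Encode the alignments around `site` as a float32 tensor."""
    tensor = np.zeros((MAX_READS, WIDTH, CHANNELS), dtype=np.float32)
    left = site - UPSTREAM  # reference position of column 0
    for row, read in enumerate(reads[:MAX_READS]):
        for offset, (base, qual) in enumerate(zip(read.bases, read.quals)):
            ref_pos = read.start + offset
            col = ref_pos - left
            if not (0 <= col < WIDTH and 0 <= ref_pos < len(reference)):
                continue  # base falls outside the encoded window
            tensor[row, col, 0] = BASE_CODE.get(base, 0.0)                # read base
            tensor[row, col, 1] = min(qual, 60) / 60.0                    # base quality
            tensor[row, col, 2] = min(read.mapq, 60) / 60.0               # mapping quality
            tensor[row, col, 3] = -1.0 if read.reverse else 1.0           # strand
            tensor[row, col, 4] = BASE_CODE.get(reference[ref_pos], 0.0)  # reference base
    return tensor


# Usage with a toy reference and a single forward-strand read.
reference = "ACGT" * 50
reads = [Read(start=90, bases="ACGTACGTACGTACGT", quals=[30] * 16, mapq=60, reverse=False)]
tensor = encode_site(reads, reference, site=100)
print(tensor.shape)
```

Rows with no read stay all zero, so a network consuming this tensor can distinguish covered from uncovered positions. A real encoder would also need to handle indels, read ordering, and the matched normal sample, none of which this sketch attempts.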
Key Findings
VarNet was benchmarked on several publicly available datasets, including the ICGC Gold Set (CLL and MBL samples), COLO829 (melanoma), and SEQC2 (breast cancer). It consistently outperformed existing callers, including Strelka2, Mutect2, Freebayes, and NeuSomatic. For example, in the MBL sample, VarNet achieved F1 scores of 0.84 (SNV) and 0.79 (indel), exceeding Strelka2's 0.79 (SNV) and 0.65 (indel) and Mutect2's 0.68 (SNV) and 0.40 (indel). Similar improvements were observed in other datasets. Across all datasets, VarNet achieved an average maximum F1 score of 0.89 for SNV calling and 0.69 for indel calling. VarNet was also robust across different variant allele frequencies (VAFs): even at low VAFs (<0.3), it was more accurate than other callers. The impact of tumor purity and read depth was assessed by diluting the MBL sample with normal reads. Even at 50% tumor purity, VarNet maintained high accuracy, with a recall of 0.64 and a precision of 0.97. In the DREAM challenge, VarNet performed competitively with other methods, achieving an average F1 score of 0.90 for SNV calling and 0.66 for indel calling. Finally, the authors used guided backpropagation to visualize the features learned by VarNet. The analysis showed that VarNet identifies variant alleles at the candidate site while drawing on multiple positions and properties of the encoded alignments. Notably, in independent benchmarks VarNet outperformed the SMURF ensemble method that was used to generate its pseudo-labels.
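The F1 scores above are the harmonic mean of precision and recall. As a small worked example, the sketch below applies that standard formula to the precision and recall reported at 50% tumor purity. The resulting F1 value is derived here for illustration and is not a figure stated in the summary; the function name `f1_score` is our own.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the MBL sample diluted to 50% tumor purity.
f1_half_purity = f1_score(precision=0.97, recall=0.64)
print(round(f1_half_purity, 2))  # prints 0.77
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the high precision of 0.97 cannot fully compensate for the recall of 0.64.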
Discussion
VarNet's success stems from learning directly from raw alignment data rather than relying on human-engineered features, which mirrors how human experts manually review variant calls. Unlike NeuSomatic, VarNet is trained on real mutations from multiple cancer types and uses a larger sequence context in its input encoding, enabling it to capture more complex patterns in the data. Its robust performance across benchmark datasets, including those with low VAFs, low purity, and challenging genomic regions, demonstrates the effectiveness of the deep learning approach. VarNet's ability to outperform the ensemble method that generated its training labels suggests that deep learning can learn effectively from weakly supervised data and generalize well to unseen samples. However, indel calling remains a challenge, because indels are less frequent and harder to pseudo-label accurately. Future work could explore strategies such as self-training to improve indel calling accuracy.
Conclusion
This study introduces VarNet, a highly accurate deep learning method for somatic variant calling that directly learns from raw sequence alignment data. VarNet surpasses existing methods in accuracy and generalizability across various benchmark datasets, including those with challenging characteristics. Its success highlights the potential of deep learning in augmenting and potentially replacing human-engineered features and heuristic filters in somatic variant calling. Future research should focus on improving indel calling accuracy, perhaps through self-training or other strategies.
Limitations
Although VarNet demonstrates superior performance, it has several limitations. Weakly supervised learning from pseudo-labels generated by an ensemble method introduces noise into the training data, which may affect model performance. Performance on indels is weaker than on SNVs, mainly because indels are less frequent and harder to annotate accurately. The generalizability of VarNet to other cancer types or sequencing platforms requires further investigation. Finally, the computational cost of VarNet might be higher than that of some existing methods, although this can be mitigated with proper pre-filtering.
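The pre-filtering mentioned above can be pictured as a cheap per-site check that runs before the neural network. The summary does not give VarNet's actual criteria, so the sketch below is purely illustrative: the `SiteCounts` structure, the `passes_prefilter` function, and every threshold are hypothetical. It discards sites with little alternate-allele evidence in the tumor, or with so much in the matched normal that a somatic origin is unlikely.

```python
from dataclasses import dataclass


@dataclass
class SiteCounts:
    """Read counts at one candidate position (hypothetical structure)."""
    tumor_depth: int   # total reads covering the site in the tumor
    tumor_alt: int     # tumor reads supporting the alternate allele
    normal_depth: int  # total reads covering the site in the matched normal
    normal_alt: int    # normal reads supporting the alternate allele


def passes_prefilter(
    counts: SiteCounts,
    min_tumor_alt: int = 2,        # assumed threshold, not from the paper
    min_tumor_vaf: float = 0.01,   # assumed threshold, not from the paper
    max_normal_vaf: float = 0.10,  # assumed threshold, not from the paper
) -> bool:
    """Return True if the site should be passed on to the model."""
    if counts.tumor_depth == 0 or counts.tumor_alt < min_tumor_alt:
        return False  # no coverage or too few supporting reads in the tumor
    if counts.tumor_alt / counts.tumor_depth < min_tumor_vaf:
        return False  # tumor variant allele frequency too low
    normal_vaf = counts.normal_alt / counts.normal_depth if counts.normal_depth else 0.0
    return normal_vaf <= max_normal_vaf  # high normal VAF suggests a germline variant


# Usage: only the first site would be sent to the (much costlier) network.
sites = [
    SiteCounts(tumor_depth=60, tumor_alt=6, normal_depth=40, normal_alt=0),
    SiteCounts(tumor_depth=60, tumor_alt=0, normal_depth=40, normal_alt=0),
    SiteCounts(tumor_depth=60, tumor_alt=30, normal_depth=40, normal_alt=20),
]
kept = [s for s in sites if passes_prefilter(s)]
print(len(kept))  # prints 1
```

Since the vast majority of genomic positions show no alternate-allele evidence at all, a check of this kind removes most sites at negligible cost, which is why pre-filtering can offset the expense of model inference.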