Introduction
The accurate identification of somatic mutations is central to precision oncology, especially for determining tumor mutational burden (TMB). TMB, the number of nonsynonymous somatic mutations per megabase, is a predictor of response to immunotherapy with immune checkpoint inhibitors (ICIs). The FDA's approval of TMB as a biomarker for pembrolizumab across tumor types underscores the need for reliable TMB estimation using whole-exome sequencing (WES). Somatic mutations drive cancer development and progression through oncogene activation and tumor suppressor inactivation, while germline variants in genes such as BRCA1, BRCA2, and TP53 also contribute to cancer risk. However, matched-normal samples, which are essential for accurate somatic variant calling, are often unavailable in clinical practice for reasons such as failed quality control or lack of consent. Without a matched normal, somatic variant calling becomes harder, because the many rare germline variants in each patient are difficult to distinguish from genuine somatic mutations. Previous studies have shown that the absence of matched normals inflates TMB estimates significantly, with false positive rates reaching 67%. This inflation is particularly problematic for patients from minority groups, who are underrepresented in germline variant databases. Existing computational methods offer improvements but are challenged by the complexity of cancer genomes and the statistical properties of next-generation sequencing (NGS) data. Machine learning has demonstrated success in somatic variant calling with matched-normal samples. This study investigates whether supervised machine learning can effectively classify somatic and germline variants in tumor-only solid tumor samples.
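The TMB definition above reduces to a simple ratio. The following is a minimal sketch in Python; the function name, the parameter names, and the example numbers are hypothetical and are not taken from the study, and the size of the callable region depends on the capture kit and pipeline used.

```python
def tumor_mutational_burden(n_nonsynonymous: int, callable_bases: int) -> float:
    """Return nonsynonymous somatic mutations per megabase of callable sequence."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    if n_nonsynonymous < 0:
        raise ValueError("n_nonsynonymous must be non-negative")
    # Convert the callable region from bases to megabases, then divide.
    return n_nonsynonymous / (callable_bases / 1_000_000)


# Hypothetical sample: 350 nonsynonymous calls over a 35 Mb callable exome.
tmb = tumor_mutational_burden(n_nonsynonymous=350, callable_bases=35_000_000)
print(f"TMB = {tmb:.1f} mutations/Mb")
```

Because the numerator is a raw count, every germline variant misclassified as somatic raises the estimate directly, which is why false positives inflate tumor-only TMB.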
Literature Review
Several computational methods have been developed to address the challenges of tumor-only variant calling, employing sophisticated filtering or statistical inference. Bayesian methods like PureCN and SGZ infer somatic and germline probabilities by integrating global genomic properties (purity, ploidy, copy number) with variant allele frequencies (VAFs). However, the complexity of the cancer genome, including clonality and structural variations, limits the accuracy of such models. Recent advancements in machine learning have shown impressive speed and accuracy for somatic variant calling in matched-normal samples, bypassing explicit likelihood modeling by training classifiers on diverse datasets. This inspired the hypothesis that supervised machine learning can effectively classify mutations as somatic or germline in tumor-only samples.
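To illustrate how purity and copy number shape the expected VAF, here is a simplified sketch of the standard tumor/normal mixture formula. It is not the actual PureCN or SGZ model, which add read-depth likelihoods, priors, and segmentation. The function and parameter names are hypothetical, and the sketch assumes diploid normal cells and a heterozygous germline variant.

```python
def expected_vaf(purity: float, tumor_copies: int, mutated_copies: int,
                 germline: bool) -> float:
    """Expected variant allele frequency in a tumor/normal cell mixture.

    purity         : fraction of tumor cells in the sample (0 < purity <= 1)
    tumor_copies   : total copy number of the locus in tumor cells
    mutated_copies : copies carrying the variant allele in tumor cells
    germline       : True for a heterozygous germline variant, False for somatic
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    if not 0 <= mutated_copies <= tumor_copies:
        raise ValueError("mutated_copies must be between 0 and tumor_copies")
    # Average number of copies of the locus per cell (normal cells assumed diploid).
    total = purity * tumor_copies + 2 * (1 - purity)
    if total == 0:
        raise ValueError("locus has zero copies in the sample")
    # Variant-carrying copies: normal cells contribute one copy only if germline.
    normal_variant_copies = 1 if germline else 0
    variant = purity * mutated_copies + (1 - purity) * normal_variant_copies
    return variant / total


# Hypothetical copy-neutral locus at 60% purity.
print(expected_vaf(0.6, tumor_copies=2, mutated_copies=1, germline=True))
print(expected_vaf(0.6, tumor_copies=2, mutated_copies=1, germline=False))
```

In this idealized setting the germline variant sits near a VAF of 0.5 while the clonal somatic variant sits near 0.3. The gap shrinks as purity approaches 1 or when copy-number changes and subclonality are present, which is one reason such models lose accuracy on complex genomes.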
Methodology
This study used three high-performing machine learning models for tabular data: TabNet, XGBoost, and LightGBM. A training set was constructed using features derived exclusively from tumor-only variant calling, with somatic/germline truth labels determined by an independent pipeline using patient-matched normal samples. The feature set included 30 mutation-specific and copy-number-specific features such as germline database frequency, COSMIC counts, VAF, major allele frequency, trinucleotide context, and base substitution subtypes. The training set consisted of 105 tumor samples from seven distinct TCGA cancer subtypes, all sequenced at the Broad Institute using the Agilent Custom V2 exome-capture kit. A validation set of 45 samples from three different TCGA subtypes, sequenced with a different capture kit (SeqCap EZ HGSC VCRome) at Baylor College of Medicine, was used to guard against overfitting. Two blind holdout test sets were created. The first included 45 TCGA samples from three additional cancer subtypes sequenced at Washington University with the Nimblegen SeqCap EZ Exome v3 kit. The second comprised 23 samples from the Hugo et al. (2016) metastatic melanoma study. The performance of the trained models was assessed using AUC, MCC, TPR, TNR, PPV, and NPV. For comparison, a naive tumor-only approach was also run, consisting of a panel of normals and standard variant filtering techniques. TMB estimates from the naive and ML-based approaches were compared to matched-normal TMB to evaluate reliability, and the impact of racial bias in germline databases was also evaluated. Feature importance was analyzed to understand the models' decision-making, and multiple regression and covariance analysis were used to explore the reasons for variability in model performance. The runtime of the ML models was compared to that of PureCN.
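Most of the evaluation metrics listed above follow directly from the confusion matrix when "somatic" is treated as the positive class. The sketch below uses standard textbook definitions; the function name and the example counts are hypothetical and are not results from the study. AUC is omitted because it requires the classifier's continuous scores rather than hard calls.

```python
import math


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute TPR, TNR, PPV, NPV, balanced accuracy, and MCC from counts."""

    def ratio(num: float, den: float) -> float:
        # Return NaN instead of raising when a denominator is empty.
        return num / den if den else float("nan")

    tpr = ratio(tp, tp + fn)  # sensitivity: somatic variants recovered
    tnr = ratio(tn, tn + fp)  # specificity: germline variants rejected
    ppv = ratio(tp, tp + fp)  # precision: somatic calls that are truly somatic
    npv = ratio(tn, tn + fn)  # germline calls that are truly germline
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "TPR": tpr,
        "TNR": tnr,
        "PPV": ppv,
        "NPV": npv,
        "balanced_accuracy": (tpr + tnr) / 2,
        "MCC": mcc,
    }


# Hypothetical counts: somatic is the rare class, as in tumor-only calling.
for name, value in confusion_metrics(tp=80, fp=20, tn=880, fn=20).items():
    print(f"{name}: {value:.3f}")
```

MCC and balanced accuracy are useful here because germline variants typically far outnumber somatic ones, so plain accuracy would be dominated by the majority class.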
Key Findings
All three trained models demonstrated state-of-the-art performance on the holdout test datasets. LightGBM achieved the best AUC (0.949), MCC (0.766), PPV (0.886), and balanced accuracy (0.883) on the TCGA holdout test set. Adding a machine learning classifier significantly improved the concordance between matched-normal and tumor-only TMB estimates (R² increased from 0.006 to 0.71-0.76), with LightGBM showing the most substantial improvement. The study replicated the finding that tumor-only TMB estimates are inflated for Black patients compared to white patients due to racial biases in germline databases. However, XGBoost and LightGBM effectively eliminated this racial bias in tumor-only variant calling, showing no significant difference in corrected TMB between the two groups, whereas TabNet and PureCN exhibited some remaining bias. LightGBM was also 21.9 times faster than PureCN. Feature importance analysis revealed that germline database frequency was the most important feature for TabNet, while count (total number of variants) was most important for XGBoost and LightGBM, suggesting a reduced reliance on biased databases. The study also found that true TMB is strongly associated with positive predictive value (PPV), with lower-TMB samples showing lower PPV. The interpretable feature masks of TabNet highlighted the importance of COSMIC count and the overall mutation count in distinguishing between true and false positives. Sensitivity was best explained by the median VAF of true somatic mutations.
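The concordance figures above compare per-sample TMB from a tumor-only pipeline against matched-normal TMB. The sketch below shows one way such an R² could be computed, assuming it is the squared Pearson correlation; the source does not state which definition was used. The function name and the TMB values are hypothetical, chosen only to show how uneven germline contamination can pull R² down.

```python
def r_squared(x: list, y: list) -> float:
    """Squared Pearson correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when either sequence is constant")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical per-sample TMB values (mutations/Mb), for illustration only.
matched_normal = [1.2, 3.5, 8.0, 15.4, 2.1]
naive_tumor_only = [40.3, 22.7, 35.1, 41.0, 44.9]  # uneven germline leakage
ml_corrected = [1.9, 3.1, 9.2, 13.8, 3.0]

print("naive R^2:", round(r_squared(matched_normal, naive_tumor_only), 3))
print("ML-corrected R^2:", round(r_squared(matched_normal, ml_corrected), 3))
```

A high R² alone does not guarantee agreement in absolute TMB, since a uniformly inflated estimate can still correlate well with the truth, so a clinical cutoff would also need the two scales to match.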
Discussion
This study successfully demonstrates the potential of tabular machine learning for accurate and unbiased tumor-only variant calling. The developed models show superior performance compared to PureCN in both speed and accuracy, particularly in eliminating racial bias in TMB estimation. The findings highlight the importance of considering multiple informative features beyond germline database frequency for improved accuracy and fairness. The high concordance between matched-normal and ML-corrected TMB estimates makes these models valuable for clinical applications where matched normals are unavailable. The interpretability offered by TabNet's feature masks provides valuable insights into the models' decision-making process. The tissue-specific performance variations warrant further investigation to optimize models for specific cancer types. Although the models showed excellent generalization across different capture kits and cancer subtypes, a larger and more diverse training dataset could further enhance their performance and generalizability.
Conclusion
This research introduces a novel approach to tumor-only variant calling using tabular machine learning, significantly improving accuracy, speed, and fairness. The models successfully generalized across diverse cancer types and sequencing platforms, effectively mitigating the racial bias present in conventional tumor-only methods. Future research should focus on expanding the training dataset to encompass a wider range of cancer subtypes and genomic features, potentially incorporating additional data such as mutational signatures or RNA-seq data. The development of similar models trained on specific cancer subtypes may also lead to improvements in clinical applications. The framework established in this study offers a robust foundation for enhancing precision oncology and advancing equitable access to cancer care.
Limitations
The reported performance comes from models trained on a specific set of TCGA cohorts, which may limit generalizability to other datasets or sequencing technologies. Although the models performed well across different capture kits, training on an even wider range of sequencing platforms could further improve robustness. The study's reliance on a panel of normals for germline variant filtering might still introduce some bias. The interpretation of TabNet's feature masks, while insightful, has been debated within the community. Although XGBoost and LightGBM showed no significant racial bias in TMB estimation, some bias remains with TabNet and PureCN, indicating the need for continued refinement to eliminate residual disparity.