Fast, accurate, and racially unbiased pan-cancer tumor-only variant calling with tabular machine learning

R. T. McLaughlin, M. Asthana, et al.

This study explores the use of machine learning to improve the accuracy of somatic mutation identification in tumor-only sequencing, which in turn improves the tumor mutational burden estimates used to predict immunotherapy response. Conducted by R. Tyler McLaughlin and colleagues, the research reports state-of-the-art performance in separating somatic from germline variants.

Introduction

The study addresses the challenge of accurate somatic variant identification and tumor mutational burden (TMB) estimation in tumor-only whole-exome sequencing (WES), where matched-normal samples are unavailable. Tumor-only analyses suffer from high false positive rates because rare germline variants are miscalled as somatic, leading to inflated and biased TMB estimates that can affect immunotherapy decisions. Existing approaches include database filtering and statistical inference of tumor genomic state (e.g., ABSOLUTE, CLONET, PureCN, SGZ), but the complexities of tumor genomics and next-generation sequencing (NGS) make modeling difficult. Inspired by successful machine-learning approaches for matched-normal somatic calling, the authors hypothesize that supervised tabular ML methods can accurately classify somatic vs. germline variants in tumor-only data, improve TMB concordance with matched-normal results, and mitigate the racial bias introduced by underrepresentation in germline databases. The purpose is to develop fast, accurate, generalizable models for tumor-only variant calling across diverse cancer types and capture kits and to evaluate their clinical utility in TMB estimation.

Literature Review

The paper reviews limitations of tumor-only variant calling, noting high false positive rates and TMB inflation when relying solely on germline databases and panels of normals. Bayesian and copy-number-aware methods (PureCN, SGZ) infer purity, ploidy, and local copy number to estimate somatic probabilities without matched normals. However, the complexity of tumor clonality and sequencing error models makes further improvement challenging. Recent ML-based somatic callers for matched-normal data (e.g., methods leveraging gradient boosting and deep learning) show state-of-the-art accuracy and speed, motivating their application to tumor-only classification. The authors also highlight the underrepresentation of racial minorities in germline databases causing biased TMB inflation in tumor-only settings, necessitating approaches that rely less on biased germline frequency features.

Methodology

Data and cohorts: Training used 105 TCGA tumor samples (7 subtypes: BLCA, GBM, HNSC, LUAD, LUSC, OV, STAD) sequenced at the Broad Institute with Agilent Custom V2 exome-capture. Validation included 45 TCGA samples (COAD, DLBC, TGCT; 15 each) sequenced at Baylor using SeqCap EZ HGSC VCRome. Two holdout test sets were used: (1) 45 TCGA samples (BRCA, SARC, UCEC; 15 each) sequenced at Washington University with Nimblegen SeqCap EZ Exome v3, and (2) 23 metastatic melanoma samples from Hugo et al., sequenced with Nimblegen v3.

Pipelines and truth labels: FASTQs were aligned to hg38 with Sentieon BWA-MEM. Variant calling used Sentieon TNScope with a process-matched leave-one-out panel of normals (PoN) per capture kit. Variants were annotated with SnpSift (dbSNP151, COSMIC v85) and dbNSFP4.0 (aggregating population AF across 1000G, UK10K, ExAC, gnomAD into pop_max). Copy number was inferred using CNVkit with PoN references; segmentation used circular binary segmentation. For training labels, an independent matched-normal pipeline provided somatic vs. germline truth; variants passing in the matched-normal pipeline were considered somatic, and all other tumor-only variants were labeled germline.

Pre-filtering: To isolate coding mutations and reduce artifacts before ML classification, variants were filtered by population AF < 0.01 across 8 databases, coding consequence (missense, nonsense, frameshift_indel, inframe_indel), FPfilter == PASS, and TNScope filter == PASS. These removed variants did not count toward TNs, making specificity estimates conservative.

Features: Thirty tumor-only features were engineered, including: population frequency (pop_max), COSMIC count (max_cosmic_count), read-level features (t_alt_freq, t_maj_allele), mutational context (trinucleotide categories), ontology flags, and local copy-number-informed features. A key CNV-derived feature set summarized the VAF distribution of nearby heterozygous germline SNPs within similar copy-number segments as a 20-bin histogram (snp_vaf_bin_00 to snp_vaf_bin_19). The count of variants per sample (count) was included.

Models and training: Three tabular ML models were trained for binary classification (somatic vs. germline): TabNet (PyTorch implementation; n_d=24, n_a=24, n_steps=4, gamma=1.5, n_independent=2, n_shared=2, lambda_sparse=1e-4, momentum=0.3, clip_value=2; Adam lr=0.02; batch size 4000; 100 epochs; custom loss to maximize average precision; best-epoch selection via validation; categorical features one-hot encoded), XGBoost (v1.2.1, default parameters), and LightGBM (v3.3.2; objective=binary; num_iteration=10000; num_leaves=30; learning_rate=0.1; bagging_fraction=0.7; feature_fraction=0.7; bagging_frequency=5; bagging_seed=2018; verbosity=-1). An ensemble average of model posteriors was also evaluated.

Thresholding and evaluation: Posterior-probability thresholds for binary metrics (e.g., F1, MCC) were selected on training data and allowed to differ for SNVs vs. indels (e.g., TabNet optimal SNV cutoff ~0.508; indel ~0.1368). Performance was reported by AUC, MCC, TPR, TNR, PPV, NPV, balanced accuracy, and call rate. ROC and precision-recall curves used 500 posterior quantiles. Compute times for feature engineering and inference were benchmarked on single CPU cores for ML and 250 cores for PureCN.

Comparator: PureCN (v1.21.21) was run per documentation with NormalDB constructed from PoN VCFs; CNV segments from CNVkit; hg38 simple repeats blacklist; 250 parallel cores; post-optimization enabled. Outputs were filtered using the same criteria as ML methods before metric comparison.

TMB estimation: TMB was computed as coding nonsynonymous somatic mutations per megabase. To harmonize across capture kits (VCRome 33.0 Mb, Nimblegen v3 37.3 Mb, Agilent V2 63.5 Mb), counts were normalized by a constant factor (41 Mb, the patient-weighted average footprint). Naive tumor-only TMB used PoN plus population filtering and standard QC; ML-corrected TMB applied model-predicted somatic labels with posterior ≥0.5 (unless otherwise optimized) to naive calls. Concordance with matched-normal TMB was assessed via linear regression (R², slope).

Statistical analyses and interpretability: Linear regression and covariance analyses used R 3.5.2. TabNet feature masks were inspected to interpret per-variant decisions and to analyze differences among FP, TN, TP, and FN. Relationships between performance and predictors such as true TMB and median VAF of true somatic mutations (MVTSM) were modeled.
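
As a rough illustration of three steps described in this section (the 20-bin SNP VAF histogram, the ensemble average of model posteriors, and ML-corrected TMB), the following Python sketch may help. It is not the authors' code. The function names, the equal-width bins over [0, 1], and the normalization of bin counts to fractions are assumptions made for illustration; only the feature names snp_vaf_bin_00 to snp_vaf_bin_19, the 0.5 posterior cutoff, and the 41 Mb normalization constant come from the text.

```python
from statistics import mean

N_BINS = 20           # snp_vaf_bin_00 .. snp_vaf_bin_19, as named in the text
FOOTPRINT_MB = 41.0   # patient-weighted average capture footprint from the text


def snp_vaf_histogram(neighbor_vafs, n_bins=N_BINS):
    """Summarize VAFs of nearby heterozygous germline SNPs as a histogram.

    Assumption: equal-width bins over [0, 1], reported as fractions of SNPs.
    The source does not state the exact binning or normalization.
    """
    counts = [0] * n_bins
    for vaf in neighbor_vafs:
        # Clamp the index so that a VAF of exactly 1.0 lands in the last bin.
        idx = min(int(vaf * n_bins), n_bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return {
        f"snp_vaf_bin_{i:02d}": (c / total if total else 0.0)
        for i, c in enumerate(counts)
    }


def ensemble_posterior(model_posteriors):
    """Average the somatic posteriors that several models give one variant."""
    return mean(model_posteriors)


def ml_corrected_tmb(posteriors, threshold=0.5, footprint_mb=FOOTPRINT_MB):
    """Count variants called somatic and normalize to mutations per megabase."""
    n_somatic = sum(1 for p in posteriors if p >= threshold)
    return n_somatic / footprint_mb


if __name__ == "__main__":
    features = snp_vaf_histogram([0.48, 0.52, 0.51, 1.0])
    print({k: v for k, v in features.items() if v > 0})
    print(ensemble_posterior([0.9, 0.8, 0.7]))
    print(ml_corrected_tmb([0.9] * 82 + [0.1] * 100))
```

In a balanced, copy-number-neutral region the histogram mass would be expected to sit near 0.5, while allelic imbalance shifts it toward the edges, which is presumably why these bins give a classifier local context for judging a candidate variant's own VAF.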

Key Findings

- Model accuracy:
  - Training AUCs: TabNet 0.96, LightGBM 0.98, XGBoost 0.99.
  - Validation AUCs: ~0.91–0.92 for all three models, indicating a slight generalization drop vs. training.
  - Holdout TCGA test set (BRCA, SARC, UCEC; Nimblegen v3): Overall AUCs: TabNet 0.942, XGBoost 0.946, LightGBM 0.949; MCC: 0.762, 0.757, 0.766 respectively; LightGBM best overall (AUC 0.949; MCC 0.766; PPV 0.886; balanced accuracy 0.883); call rate 100% for ML (PureCN call rate 82.2%). SNVs outperformed indels across models; PureCN had strong indel specificity.
  - Metastatic melanoma holdout (n=23): Overall AUCs: TabNet 0.852, XGBoost 0.861, LightGBM 0.867; MCC: 0.55, 0.558, 0.57 respectively. For SNVs, ML AUCs 0.85–0.87 exceeded PureCN by ~3–4%; for indels, PureCN had substantially higher MCC (>24.4% over TabNet).
- TMB concordance:
  - Naive tumor-only vs matched-normal TMB had poor concordance (R²: train 0.156; validation 0.318; test 0.006), with slopes far <1 (0.148, 0.254, 0.016), indicating strong inflation in tumor-only TMB.
  - ML-corrected tumor-only TMB vs matched-normal showed large improvements on the test set: R²: TabNet 0.705, XGBoost 0.725, LightGBM 0.759, ensemble 0.774; slopes: 0.804, 0.717, 0.745, 0.770. Relative to naive, this corresponds to ~117–129-fold R² improvement and ~45–50-fold slope improvement.
- Racial bias mitigation in TMB:
  - True (matched-normal) TMB showed no difference between Black (n=12) and white (n=55) patients (p>0.05). Naive tumor-only TMB was heavily inflated for Black patients (median 30.36) vs white (11.15), p<1×10⁻⁹.
  - ML correction reduced or eliminated bias: XGBoost and LightGBM yielded non-significant differences (NS); LightGBM medians 1.76 (Black) vs 1.68 (white). TabNet and PureCN still showed small but significant residual differences (e.g., TabNet medians 3.43 vs 1.85; PureCN 2.41 vs 1.22), yet vastly improved versus naive (~19 mut/Mb inflation removed).
- Runtime and scalability:
  - LightGBM mean runtime 55.4 s on 1 CPU core vs PureCN 1214.2 s on 250 cores (≈21.9× faster despite far fewer resources).
- Generalization and call rate:
  - Models generalized across tissue types and capture kits (Agilent V2, VCRome, Nimblegen v3) with 100% call rates and stable AUCs; performance varied by subtype (e.g., UCEC highest PPV, BRCA highest TPR in TCGA test set).
- Feature importance and interpretability:
  - In tree models, the most important feature was count (variants per sample), with pop_max ranked lower (third); in TabNet, pop_max was most important. Other key features: t_maj_allele, max_cosmic_count, t_alt_freq, and snp_vaf_bin features (local CNV context). Lower reliance on population databases in tree models likely contributed to elimination of racial bias. TabNet attention masks highlighted max_cosmic_count and count as key discriminators between somatic/germline predictions.
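
The fold-improvement and speed-up figures quoted above follow from simple ratios of the reported numbers. The short Python sketch below reproduces that arithmetic; the variable and function names are illustrative, and the core-seconds ratio is a derived quantity not stated in the source (it assumes wall-clock time multiplied by core count is a fair proxy for total compute).

```python
# Test-set values copied from the findings above.
NAIVE = {"r2": 0.006, "slope": 0.016}
ML_CORRECTED = {
    "TabNet": {"r2": 0.705, "slope": 0.804},
    "XGBoost": {"r2": 0.725, "slope": 0.717},
    "LightGBM": {"r2": 0.759, "slope": 0.745},
    "ensemble": {"r2": 0.774, "slope": 0.770},
}


def fold_improvement(metric):
    """Ratio of each ML-corrected value to the naive value for one metric."""
    return {
        name: values[metric] / NAIVE[metric]
        for name, values in ML_CORRECTED.items()
    }


r2_fold = fold_improvement("r2")
slope_fold = fold_improvement("slope")

# Mean runtimes and core counts as reported.
LIGHTGBM_SECONDS, LIGHTGBM_CORES = 55.4, 1
PURECN_SECONDS, PURECN_CORES = 1214.2, 250

wall_clock_speedup = PURECN_SECONDS / LIGHTGBM_SECONDS
# Assumption: seconds x cores approximates total compute consumed.
core_seconds_ratio = (PURECN_SECONDS * PURECN_CORES) / (
    LIGHTGBM_SECONDS * LIGHTGBM_CORES
)

if __name__ == "__main__":
    print(f"R2 fold range: {min(r2_fold.values()):.1f} to {max(r2_fold.values()):.1f}")
    print(f"slope fold range: {min(slope_fold.values()):.1f} to {max(slope_fold.values()):.1f}")
    print(f"wall-clock speed-up: {wall_clock_speedup:.1f}x")
    print(f"core-seconds ratio: {core_seconds_ratio:.0f}x")
```

Because the naive test-set R² (0.006) is so close to zero, the fold change is very sensitive to rounding of that baseline; the absolute R² and slope values are the more robust comparison.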

Discussion

The findings demonstrate that supervised tabular ML classifiers can accurately distinguish somatic from germline variants in tumor-only WES across multiple cancer subtypes and capture kits, substantially improving TMB concordance with matched-normal pipelines and mitigating racial bias from underrepresented germline databases. LightGBM and XGBoost slightly outperformed TabNet in overall accuracy and in eliminating bias, likely due to reduced dependence on germline frequency features and better utilization of sample-level signals (e.g., total variant count) and local copy-number-informed VAF features. The approach yields major computational efficiency gains over Bayesian methods like PureCN and offers robust, high call-rate predictions suitable for clinical analyses. The improved TMB concordance (test-set R² up to 0.759 for a single model and 0.774 for the ensemble, with slopes of 0.72–0.80 versus 0.016 for the naive pipeline) addresses the key clinical need to harmonize TMB estimation in cohorts lacking matched normals, enabling more reliable biomarker use in immunotherapy contexts. Subtype-specific performance differences indicate that biological factors (true TMB, purity, CNV burden) have a greater impact on accuracy than model choice, suggesting performance could further improve with more diverse and subtype-balanced training data. Interpretability analyses with TabNet feature masks provide insights into model decision-making, supporting confidence in predictions while underscoring that population frequency, mutational context, and local CNV-aware VAF patterns jointly inform classification. While PureCN achieved better indel MCC, especially in melanoma, tabular ML performed best overall and for SNVs. Combining approaches or ensembling may yield further gains, particularly for indel-rich contexts. Overall, this work supports ML-corrected tumor-only variant calling as a practical alternative when matched normals are unavailable.

Conclusion

This study introduces fast, accurate, and generalizable tabular ML classifiers (LightGBM, XGBoost, TabNet) for tumor-only somatic variant retrieval and TMB estimation across diverse cancer types and WES capture kits. The models substantially improve concordance with matched-normal TMB, reduce the racial bias of naive tumor-only pipelines (to non-significant levels with LightGBM and XGBoost), and run far faster than the Bayesian comparator: LightGBM was about 22× faster in wall-clock time on a single CPU core than PureCN on 250 cores. Key contributions include a robust feature set integrating local CNV context, mutational spectra, and COSMIC information, demonstration of cross-kit/cross-subtype generalization, and interpretability via attention masks. Future work should: (1) expand training to more diverse subtypes, capture kits, and ancestries to enhance generalization; (2) integrate or ensemble ML with copy-number-aware Bayesian methods to improve indel performance; (3) refine features capturing tumor purity and clonality; and (4) validate in prospective clinical settings and across additional sequencing assays (e.g., panel-based testing).

Limitations

- Absence of matched normals inherently limits performance; in high-purity, copy-number-neutral tumors, somatic and germline VAFs can be indistinguishable.
- Training data were limited to seven TCGA subtypes from a single capture kit and sequencing center; broader diversity would likely improve generalization.
- Performance varies by tissue subtype, influenced by true TMB, purity, and CNV burden; models may underperform in low-TMB or high-purity settings where false positives persist.
- Indel classification lagged PureCN in MCC in some settings (e.g., melanoma), suggesting room for improvement or hybrid approaches.
- Reliance on tumor-only-derived features means purity and clonality are only indirectly inferred, potentially limiting sensitivity or PPV in certain genomic contexts.
- Although racial bias was largely eliminated by tree models, small residual biases were observed for TabNet and PureCN in this cohort; further validation across larger and more diverse populations is needed.
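
The first limitation can be made concrete with the standard purity and copy-number mixing model for expected VAF. The sketch below is a simplified textbook relationship, not the authors' method or any specific tool's implementation; the function name and arguments are illustrative, and it assumes a clonal variant, a diploid normal genome, and no sequencing noise.

```python
def expected_vaf(purity, tumor_cn, tumor_mult, germline_het=False, normal_cn=2):
    """Expected variant allele fraction in a bulk tumor sample.

    purity       fraction of cells in the sample that are tumor cells
    tumor_cn     total copy number at the locus in tumor cells
    tumor_mult   number of tumor copies carrying the variant allele
    germline_het True if normal cells also carry one variant copy
    normal_cn    total copy number in normal cells (assumed diploid)
    """
    normal_alt = 1 if germline_het else 0
    alt_copies = purity * tumor_mult + (1 - purity) * normal_alt
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return alt_copies / total_copies


if __name__ == "__main__":
    # Pure, copy-number-neutral tumor: somatic and germline look the same.
    print(expected_vaf(1.0, 2, 1), expected_vaf(1.0, 2, 1, germline_het=True))
    # At 50% purity, normal-cell reads dilute only the somatic variant.
    print(expected_vaf(0.5, 2, 1), expected_vaf(0.5, 2, 1, germline_het=True))
```

Under these assumptions a heterozygous somatic variant and a heterozygous germline variant both sit at a VAF of 0.5 when purity is 1.0 and copy number is 2, so VAF-based features carry no separating signal; at 50% purity the somatic variant drops to 0.25 while the germline variant stays at 0.5.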
