A machine learning algorithm with subclonal sensitivity reveals widespread pan-cancer human leukocyte antigen loss of heterozygosity

R. M. Pyke, D. Mellacheruvu, et al.

Rachel Marty Pyke and colleagues at Personalis, Inc. developed DASH, a machine learning algorithm that outperforms existing tools in detecting human leukocyte antigen loss of heterozygosity (HLA LOH) from tumor-normal sequencing data. The approach is validated by allele-specific digital PCR, and its pan-cancer application links HLA LOH to signatures of immune selection.

Introduction

Immune checkpoint blockade (ICB) demonstrates that the immune system can recognize tumor neoantigens presented by MHC, yet many patients fail to respond, motivating better biomarkers and a mechanistic understanding of resistance. HLA class I genes, along with B2M, are central to antigen presentation; somatic alterations and germline diversity in HLA influence tumor evolution and immunotherapy outcomes. Somatic HLA loss of heterozygosity (LOH) can reduce HLA diversity and facilitate immune escape, and has been linked to ICB response. However, accurate detection of HLA LOH from sequencing is challenging because of extreme HLA polymorphism, poor alignment to the reference genome, variable capture efficiency across alleles, and the need for allele-specific resolution; standard CNV tools struggle particularly at low tumor purity and with subclonal events. Existing validation approaches (flanking-region concordance and PCR targeting surrounding regions) neither confirm allele identity nor assess performance at low purity or subclonality. To address these gaps, the authors developed DASH, a machine learning algorithm tailored to HLA LOH with features capturing allele-specific depth and context, designed an orthogonal allele-specific digital PCR assay for validation, and assessed prevalence and selection signals across 610 tumors.

Literature Review

Prior work established the role of antigen presentation in tumor evolution and response to ICB, with frequent somatic mutations in HLA and B2M and impacts of HLA germline diversity on outcomes. Generic copy-number methods estimate purity and ploidy to interpret B-allele frequency and depth, but are unreliable in the HLA locus due to polymorphism and alignment issues and cannot identify the specific lost allele. Allele-specific alignment and graph-based methods improved HLA typing and somatic mutation calling, and some HLA LOH tools (e.g., LOHHLA) use allele-specific alignment but are limited by tumor-only approaches or reliance on standard CNV models that do not handle HLA-specific capture variability, low purity, or subclonality well. Validation has typically relied on flanking-region CNV concordance or PCR of regions surrounding HLA, which do not confirm which allele is lost and are insensitive to low purity/subclonality, limiting clinical utility. These gaps motivate a specialized, validated, allele-specific approach.

Methodology

Platform and cohort: Whole-exome sequencing on the ImmunoID NeXT Platform with augmented HLA capture was performed on paired tumor-normal samples. A training dataset comprised 279 patients across 15 tumor types; pan-cancer prevalence analyses used 610 patients. HLA typing was performed on normal samples; HLA somatic mutations were called with POLYSOLVER. Copy-number, purity, and ploidy were estimated with Sequenza.
Read processing and allele-specific mapping: Patient-specific HLA references were generated from typed alleles using an imputed HLA allele database. HLA reads (30 bp seed) were stringently mapped to patient-specific alleles with BWA; reads with >20% soft clipping or mismatches were excluded (except allowance for a single mismatch at somatic mutation sites). Depth was computed per position; positions of difference between homologous alleles (SNVs/indels) were identified to drive allele-specific features.
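The read-level filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the 20% threshold applies to the soft-clipped fraction of each read, that any mismatch otherwise excludes a read, and that mismatch positions and somatic mutation sites are supplied by the caller. The function names `soft_clip_fraction` and `keep_read` are hypothetical.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def soft_clip_fraction(cigar):
    """Fraction of read bases that are soft-clipped, from a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Operations that consume read bases: M, I, S, = and X.
    read_len = sum(n for n, op in ops if op in "MIS=X")
    clipped = sum(n for n, op in ops if op == "S")
    return clipped / read_len if read_len else 0.0

def keep_read(cigar, mismatch_positions, somatic_sites, max_clip=0.20):
    """Assumed filter: drop reads with >20% soft clipping or any mismatch,
    except a single mismatch located at a known somatic mutation site."""
    if soft_clip_fraction(cigar) > max_clip:
        return False
    if not mismatch_positions:
        return True
    return len(mismatch_positions) == 1 and mismatch_positions[0] in somatic_sites
```

For example, `keep_read("100M", [45], {45})` keeps a read whose only mismatch sits at a somatic site, whereas the same read with an empty set of somatic sites would be dropped.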
DASH features (per heterozygous HLA-A/B/C locus):
1. Adjusted b-allele frequency (tumor BAF normalized by normal BAF, aggregated by 150 bp bins).
2. Allele-specific tumor/normal sequencing depth ratio normalized by exome-wide T/N read ratio; the lower allele per bin is used, then median across bins.
3. Consistency of sequencing depth (fraction of bins where one allele consistently has lower depth across mismatch sites; 0.5–1 range).
4. Total sequencing depth ratio (sum of tumor/normal ratios of both alleles per bin, median across bins) to distinguish imbalance due to amplification vs deletion.
5. Patient-level features: tumor purity and tumor ploidy (from Sequenza).
6. Flanking-region deletion: binary indicator of a b-allele deletion within ±10 kb from Sequenza.
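The depth-based features in the list above (items 1 to 4) can be illustrated with a short sketch. Several details are assumptions made for illustration only: the input is taken to be per-bin depths for the two alleles, "normalized by normal BAF" is interpreted as an absolute difference between tumor and normal BAF, and consistency is simplified to a per-bin comparison. The function name `hla_locus_features` and the dictionary keys are hypothetical.

```python
from statistics import median

def hla_locus_features(bins, exome_tn_ratio):
    """Summarize one heterozygous HLA locus from per-bin allele depths.

    bins: list of dicts with tumor (t1, t2) and normal (n1, n2) depths for
    allele 1 and allele 2, already aggregated per bin. Depths are assumed
    to be positive in both samples.
    """
    adj_baf, lower, total, allele1_is_lower = [], [], [], []
    for b in bins:
        # Allele-specific tumor/normal ratios, normalized by the exome-wide ratio.
        r1 = (b["t1"] / b["n1"]) / exome_tn_ratio
        r2 = (b["t2"] / b["n2"]) / exome_tn_ratio
        tumor_baf = b["t1"] / (b["t1"] + b["t2"])
        normal_baf = b["n1"] / (b["n1"] + b["n2"])
        # Assumption: adjustment by the normal BAF is an absolute difference.
        adj_baf.append(abs(tumor_baf - normal_baf))
        lower.append(min(r1, r2))      # the lower allele in this bin
        total.append(r1 + r2)          # both alleles combined
        allele1_is_lower.append(r1 < r2)
    frac = sum(allele1_is_lower) / len(bins)
    return {
        "adjusted_baf": median(adj_baf),
        "lower_allele_ratio": median(lower),
        # Folded so the value lies between 0.5 and 1.
        "consistency": max(frac, 1 - frac),
        "total_depth_ratio": median(total),
    }
```

A total depth ratio well below 2 alongside a low lower-allele ratio points toward deletion, while imbalance with an elevated total ratio is more consistent with amplification of the other allele.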
Labels and training: 720 heterozygous loci were manually curated using depth, BAF, allele-specific ratios, flanking-region copy number, and contextualized by purity/ploidy; 19.6% were labeled HLA LOH. An XGBoost classifier (binary logistic; max_depth 5; eta 0.3; subsample 0.5; min_child_weight 2; 100 estimators) was trained with 10-fold cross-validation; a post-filter required lost-allele ratio <0.98 and adjusted BAF >0.02. Bi-allelic deletions were flagged if the higher-depth allele also showed strong depletion.
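The reported hyperparameters and post-filter can be written down compactly. The sketch below uses standard XGBoost parameter names but does not import or train a model. It also assumes that the post-filter is applied only to positive model predictions and that "lost-allele ratio" refers to the lower allele's normalized tumor/normal depth ratio. The function names are hypothetical.

```python
# Hyperparameters as reported in the text, using XGBoost's parameter names.
XGB_PARAMS = {
    "objective": "binary:logistic",
    "max_depth": 5,
    "eta": 0.3,
    "subsample": 0.5,
    "min_child_weight": 2,
}
N_ESTIMATORS = 100  # number of boosting rounds
N_FOLDS = 10        # cross-validation folds

def passes_post_filter(lost_allele_ratio, adjusted_baf):
    """Thresholds from the text: ratio below 0.98 and adjusted BAF above 0.02."""
    return lost_allele_ratio < 0.98 and adjusted_baf > 0.02

def final_loh_call(model_predicts_loh, lost_allele_ratio, adjusted_baf):
    """Assumed combination: a locus is called LOH only if the classifier
    predicts LOH and the post-filter thresholds are also met."""
    return model_predicts_loh and passes_post_filter(lost_allele_ratio, adjusted_baf)
```

Under this reading, the post-filter acts as a guard against positive calls at loci where the allelic imbalance is negligible.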
Benchmarking and robustness: DASH was compared with Sequenza and LOHHLA on ImmunoID NeXT data; performance was evaluated both with and without low-purity samples (tumor purity <20%). Downsampling experiments assessed performance across HLA depths (e.g., 200x to 10x), stratified by purity.
Cell line in silico mixtures: Four tumor-normal paired cell lines with HLA LOH were identified; for three of them (CRL-5911, CRL-5922, CRL-2314), tumor purities were simulated by mixing tumor and normal reads (10 replicates each). For subclonality assessment, HLA-mapping reads were mixed at specified fractions to emulate event clonality, while whole-exome features reflected the stated tumor purity.
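A read-level mixture of this kind can be sketched as below. This is an illustration under simplifying assumptions: reads are represented as list items, the tumor cell line is treated as pure, and the read proportion is used as a stand-in for tumor purity without correcting for ploidy or coverage differences. The function name `mix_reads` is hypothetical.

```python
import random

def mix_reads(tumor_reads, normal_reads, purity, n_total, seed=0):
    """Draw n_total reads, a `purity` share from tumor and the rest from normal."""
    rng = random.Random(seed)  # a fixed seed makes each replicate reproducible
    n_tumor = round(purity * n_total)
    n_normal = n_total - n_tumor
    # Sample without replacement from each read pool.
    return rng.sample(tumor_reads, n_tumor) + rng.sample(normal_reads, n_normal)

# Ten replicates at one simulated purity, each with a different seed.
tumor = [f"T{i}" for i in range(5000)]
normal = [f"N{i}" for i in range(5000)]
replicates = [mix_reads(tumor, normal, purity=0.25, n_total=1000, seed=s)
              for s in range(10)]
```

Varying the seed across replicates mirrors the use of repeated mixtures to estimate sensitivity and specificity at each simulated purity.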
Allele-specific digital PCR (dPCR) validation: Patient-specific, allele-discriminating primers and probes (multiplexed with an RNase P control) were designed for one cell line across purity mixtures and for 21 patients (8 predicted LOH by both tools, 1 predicted LOH by LOHHLA only, 12 predicted no LOH). Tumor and matched normal DNA were tested in triplicate; copy numbers were normalized to RNase P; one-sided t-tests compared tumor vs normal.
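The normalization and comparison step can be sketched as follows. The text does not state which t-test variant was used, so Welch's unequal-variance statistic is an assumption here, and the function names are hypothetical. Only the t statistic is computed; a p-value would require a t-distribution function from a statistics library.

```python
from math import sqrt
from statistics import mean, variance

def normalized_copies(allele_counts, rnasep_counts):
    """Allele copy number per replicate, relative to the RNase P control."""
    return [a / r for a, r in zip(allele_counts, rnasep_counts)]

def welch_t(tumor, normal):
    """Welch t statistic; strongly negative values support tumor < normal."""
    se = sqrt(variance(tumor) / len(tumor) + variance(normal) / len(normal))
    return (mean(tumor) - mean(normal)) / se

# Illustrative triplicates (made-up counts): one allele is roughly haploid
# relative to RNase P in normal DNA and reduced in tumor DNA.
tumor = normalized_copies([210, 220, 205], [1000, 1010, 990])
normal = normalized_copies([500, 515, 490], [1000, 1005, 995])
t_stat = welch_t(tumor, normal)
```

Because the alternative is one-sided (reduction in tumor), only a negative t statistic counts as evidence for allele loss; the retained allele is expected to show no such reduction.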
Quantitative immunopeptidomics: In six tumor/normal pairs (3 without HLA LOH, 3 with), purified HLA-B2M complexes, performed TMT-labeled LC-MS/MS, identified peptides (PEAKS), assigned predicted HLA binders (SHERPA, 0.1% rank), and compared log2 tumor/normal intensities across alleles (lost/retained/homozygous). Also attempted in one cell line pair.
Pan-cancer analyses: Applied DASH to 610 tumors across 15 types; computed HLA LOH frequencies, number of genes impacted, enrichment over genome-wide LOH fraction, and associations with mutation burden, predicted neoantigen burden, CD274 (PD-L1) expression, and MSI (MSIsensor).

Key Findings

Algorithm performance (279-patient cross-validation; 720 loci):
- DASH: 98.7% specificity, 92.9% sensitivity (F1=0.939; AUROC=0.939; AUPRC=0.940).
- LOHHLA: 94.3% specificity, 78.8% sensitivity (F1=0.777).
- Sequenza (baseline deletion detection, not allele-specific): 92.9% specificity, 95.0% sensitivity (F1=0.848).
- Excluding samples with tumor purity <20%: DASH 99.7% specificity, 100% sensitivity (F1=0.995; AUROC=0.990; AUPRC=0.992); LOHHLA 94.3% specificity, 91.8% sensitivity (F1=0.857).
- Most informative features: flanking-region deletion and adjusted BAF; all seven features (counting tumor purity and ploidy separately) contributed; no single feature matched the combined performance.
- Robustness to lower HLA depth: F1 decreased modestly (~0.06 drop from ImmunoID NeXT depth to 100x), with stronger drops at low purity.
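The metrics reported above follow from the confusion matrix of locus-level calls. The sketch below shows the standard definitions; the counts in the example are illustrative and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of true LOH loci that were called
    specificity = tn / (tn + fp)   # share of non-LOH loci correctly left uncalled
    precision = tp / (tp + fp)     # share of LOH calls that were correct
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Illustrative counts for an imbalanced set in which LOH is the minority class.
example = classification_metrics(tp=90, fp=5, tn=395, fn=10)
```

Because LOH loci are the minority class (19.6% of the curated set), F1 is a more demanding summary than specificity alone: a small number of false positives lowers precision noticeably.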
Limit of detection (cell line mixtures):
- Across three cell lines, >98% sensitivity for tumor purity >27% and >98% specificity across purities. LOHHLA had >97% sensitivity only above ~35% tumor purity.
- Subclonality heatmaps: Both tools maintained >99% specificity; DASH kept >97% sensitivity until the fraction of HLA LOH-derived reads fell below ~25% (e.g., 50% purity × 50% clonality), while LOHHLA missed most events below 80% clonality.
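The purity × clonality arithmetic above, and why a 25% fraction is hard to detect, can be made concrete with a small worked example. It assumes a diploid locus and LOH by loss of one copy (not copy-neutral LOH); the function names are hypothetical.

```python
def loh_read_fraction(purity, clonality):
    """Share of reads derived from cells carrying the HLA LOH event."""
    return purity * clonality

def expected_lost_allele_signal(f):
    """Expected signal for the lost allele when a fraction f of cells has
    lost one copy (assumes a diploid background and copy-loss LOH)."""
    depth_ratio = 1.0 - f          # lost-allele depth relative to normal
    baf = (1.0 - f) / (2.0 - f)    # lost-allele share of reads at the locus
    return depth_ratio, baf

f = loh_read_fraction(0.5, 0.5)    # 50% purity x 50% clonality
ratio, baf = expected_lost_allele_signal(f)
```

At f = 0.25 the lost allele still retains three quarters of its normal depth and its allele fraction shifts only from 0.50 to about 0.43, which illustrates why subclonal events in low-purity samples are difficult to separate from capture noise.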
Orthogonal allele-specific dPCR validation:
- Cell line CRL-5911: dPCR confirmed allele loss with sensitivity down to 10% tumor purity; normal controls showed expected haploid signal per allele (normalized to RNase P).
- 21 patient samples: For the 8 loci predicted as LOH by both tools, dPCR showed a significant tumor-specific reduction of the predicted lost alleles and no reduction of the retained alleles when specific primers were used. For the 12 predicted no-LOH cases, no significant reduction was seen. The one LOHHLA-only LOH call showed an apparent reduction of both alleles that was attributed to amplification of the RNase P reference, making it a likely false positive. Excluding training and ambiguous cases, both tools achieved 100% sensitivity and 100% specificity in this set.
Immunopeptidomics (functional readout): Across six patients, peptides predicted to bind lost alleles tended to show small, inconsistent increases in tumor vs normal; one dPCR-validated LOH case did not show the expected depletion, highlighting confounders (purity, allele expression, immune infiltration). The cell line pair exhibited very low tumor MHC expression, with ~10× higher peptide intensity in the normal sample, which limited functional validation.
Pan-cancer prevalence and selection signals (610 patients, 15 tumor types):
- Overall, 18% of patients had HLA LOH. Frequencies varied by tumor type (e.g., HNSCC ~40%, liver ~4%).
- Among fully heterozygous patients with HLA LOH, loss frequently spanned multiple genes: 76.4% lost all three (HLA-A/B/C), 21.1% lost one gene, 5.5% lost two genes; HLA-A and HLA-B were preferentially lost vs HLA-C.
- Patients with HLA LOH had significantly higher genome-wide LOH fractions (pan-cancer p=7.2e-20). However, HLA LOH frequencies exceeded expectations from genome-wide LOH rates across nearly all tumor types, indicating enrichment beyond random deletion; enrichment was notable in colorectal cancer and HNSCC.
- Associations: higher mutation burden (p=0.006) and higher predicted neoantigen burden (p=0.02) in HLA LOH cases; “goldilocks” pattern with highest HLA LOH at intermediate mutation burdens. Positive correlations with CD274 (PD-L1) expression (p=0.006) and MSI percentage (p=0.006). More neoantigens were predicted to bind lost alleles than retained (p=0.01), consistent with immune selection.

Discussion

DASH addresses long-standing challenges in accurately detecting allele-specific HLA LOH by leveraging features tailored to HLA biology and sequencing idiosyncrasies, leading to higher sensitivity and specificity than existing tools, especially at low tumor purity and for subclonal deletions. Orthogonal allele-specific dPCR demonstrates that precise allele loss can be validated sensitively and robustly, supporting potential clinical translation. Applying DASH pan-cancer reveals that HLA LOH is common and enriched beyond background genome-wide LOH, consistent with selection; its associations with mutation/neoantigen burden, PD-L1 expression, MSI, and preferential loss of alleles presenting more neoantigens further support HLA LOH as an immune-evasion mechanism. Functional immunopeptidomics showed limited and inconsistent peptide-level effects in bulk tumors, likely due to biological and technical confounders, suggesting that peptide-level validation may require refined approaches (e.g., higher purity, cell-type-resolved analyses). Overall, findings support incorporating HLA LOH into immuno-oncology biomarker frameworks and therapeutic design (e.g., neoantigen targeting).

Conclusion

The study introduces and validates DASH, a machine learning method for allele-specific detection of HLA LOH from paired tumor-normal sequencing, with superior performance to existing approaches and sensitivity to subclonal events and low tumor purity. A novel patient-specific allele-discriminating dPCR provides robust orthogonal validation and potential clinical utility. Pan-cancer analysis shows widespread, enriched HLA LOH with signatures of immune selection and associations with key immuno-oncology features. Future work should expand validation in larger cohorts, enhance functional readouts (e.g., improved immunopeptidomics or single-cell methods), and develop complementary methods to detect other allelic-imbalance mechanisms such as single-allele amplifications and allele-specific expression to create a more complete view of antigen-presentation escape.

Limitations

Training labels relied on manual curation of 720 heterozygous loci; larger cohorts and expanded orthogonal validation would increase confidence and generalizability.
Immunopeptidomics showed limited sensitivity in bulk samples due to low tumor purity, variable allele expression, immune infiltration, and technical factors; further method development is needed.
HLA LOH detection alone does not capture all immune escape via allelic imbalance; detection of single-allele amplifications and allele-specific expression remains to be developed.
Performance was assessed primarily on the ImmunoID NeXT platform with boosted HLA coverage; while downsampling suggests reasonable robustness, performance may vary on lower-depth or differently designed exomes.
