
Expanding drug targets for 112 chronic diseases using a machine learning-assisted genetic priority score
R. Chen, A. Duffy, et al.
ML-GPS is a machine learning-assisted genetic priority score developed by Robert Chen and colleagues to support drug target discovery for chronic diseases. The authors report that it improves prediction accuracy over the original score and identifies thousands of additional gene-phenotype pairs as candidate drug targets.
Introduction
Chronic non-communicable diseases are major causes of morbidity and mortality, yet many lack effective treatments, partly due to limitations of preclinical models and high failure rates of drugs without target evidence. Since the first GWAS in 2005, large-scale biobank studies have uncovered disease-associated variants; clinical genetics resources (ClinVar, OMIM) have further informed drug discovery and precision medicine. Genetic support is associated with higher drug development success, and a prior Genetic Priority Score (GPS) framework showed substantial enrichment for drug indications and clinical advancement among top-prioritized gene–disease pairs. However, standard binary case/control phenotyping can suffer from underdiagnosis, misclassification, and lack of severity stratification, reducing power. Recent work has leveraged machine learning to model continuous disease representations, identifying additional variants. This study introduces ML-GPS, which uses machine learning-derived continuous disease probabilities and integrates multi-spectrum genetic associations and curated evidence to prioritize gene–disease targets, aiming to improve accuracy, coverage, and granularity over the original GPS.
Literature Review
The paper situates its work within several strands of prior research: (1) Biobank-scale genetics demonstrating the utility of human genetic associations for target discovery, with 63% of FDA-approved drugs (2013–2022) supported by genetics and genetics-supported mechanisms being 2.6× more likely to succeed. (2) The original GPS framework combining clinical variants and association signals, where the top 0.2% of scores showed a 9.9-fold increased likelihood of having a drug indication and an 8.5-fold increased likelihood of advancing from phase I to phase IV. (3) Emerging machine learning approaches generating continuous disease phenotypes (e.g., in cardiovascular, pulmonary, psychiatric domains) that reveal additional genetic associations beyond binary diagnoses. These advances suggest that enriching phenotype modeling and integrating diverse genetic evidence can improve target prioritization, motivating ML-GPS to incorporate predicted phenotypes, continuous feature encoding, and non-linear modeling.
Methodology
Study design: The authors selected chronic disease phenotypes (phecodes) in UK Biobank (UKB) and trained machine learning (ML) models to predict diagnoses using comprehensive phenotypic data. They then conducted genetic association analyses across common, rare, and ultra-rare variants using three phenotype representations—observed case/control (P), binarized ML probabilities (B), and continuous ML probabilities (C). Finally, they integrated genetic association features with curated genetic evidence via gradient boosting to construct ML-GPS, trained on Open Targets drug indication data and externally tested in SIDER. They also developed a direction-of-effect (DoE) extension to predict activator vs inhibitor mechanisms.
Phenotype selection and ML modeling: From 3,612 phecodes (phecodeX), the team filtered to chronic, non-communicable disease phecodes, excluding acute/traumatic/environmental/non-specific categories after manual review, yielding 386 phecodes. Preliminary LightGBM models using age, sex, and 72 labs/vitals identified 112 phecodes with AUROC ≥ 0.70 and AUPRC above prevalence. Final models incorporated 239 features: demographics; labs/vitals; lifestyle; medication usage (ATC classes); and 50-dimensional embeddings of inpatient ICD-10 diagnostic histories. Models were trained on 183,021 UKB participants with GP records and used to generate predictions for 213,584 without GP records; total sample n = 396,605 European-ancestry participants after standard QC. LightGBM parameters (classification): boosting_type=goss, up to 1000 iterations, learning_rate=0.01, num_leaves up to 800, min_data_in_leaf=100, early_stopping=10; benchmarking included XGBoost and Random Forest as sensitivity analyses.
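The preliminary screening rule described above (AUROC ≥ 0.70 and AUPRC above prevalence) can be illustrated with a short sketch. The function name `passes_screen` and the three example phecodes with their metrics are hypothetical; only the thresholds and the cohort sizes come from the text.

```python
def passes_screen(auroc: float, auprc: float, prevalence: float) -> bool:
    """Screening rule from the text: AUROC >= 0.70 and AUPRC above prevalence."""
    return auroc >= 0.70 and auprc > prevalence


# Hypothetical metrics for three phecodes (illustrative values, not study data).
candidates = {
    "phecode_A": {"auroc": 0.85, "auprc": 0.12, "prevalence": 0.03},
    "phecode_B": {"auroc": 0.66, "auprc": 0.09, "prevalence": 0.02},  # fails AUROC
    "phecode_C": {"auroc": 0.74, "auprc": 0.01, "prevalence": 0.04},  # fails AUPRC
}
kept = [name for name, metrics in candidates.items() if passes_screen(**metrics)]
print(kept)

# Cohort sizes reported in the text: training set plus prediction-only set.
n_with_gp_records = 183_021
n_without_gp_records = 213_584
n_total = n_with_gp_records + n_without_gp_records
print(n_total)
```

Requiring AUPRC to exceed prevalence is a natural check because a classifier that ranks at random has an expected AUPRC roughly equal to the prevalence of the positive class.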
Phenotype encodings for genetics: For each phecode, three encodings: P (observed diagnoses), B (binarized ML probabilities using F1-optimal threshold), and C (continuous probabilities). Predicted probabilities were analyzed for calibration and association with disease severity and mortality to support their clinical relevance.
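As a rough illustration of how the B encoding relates to P and C, the sketch below picks the threshold that maximizes F1 against the observed labels and then binarizes the probabilities. The helper names and the toy data are assumptions for illustration; the summary does not state which data split the authors used to choose the threshold.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_optimal_threshold(y_true, probs):
    """Scan observed probabilities as candidate cut-offs; ties keep the lowest."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(probs)):
        f1 = f1_score(y_true, [1 if p >= t else 0 for p in probs])
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Toy data (hypothetical): observed diagnoses and model probabilities.
observed = [0, 0, 1, 0, 1, 1]                      # P encoding
probs = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]       # C encoding
threshold, best_f1 = f1_optimal_threshold(observed, probs)
binarized = [1 if p >= threshold else 0 for p in probs]  # B encoding
print(threshold, round(best_f1, 3), binarized)
```

In this toy example the B encoding labels one participant as a case who has no observed diagnosis, which is the kind of reclassification that could reflect underdiagnosis.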
Genetic analyses (allelic series): Genome-wide common variant association (MAF ≥ 0.01), exome-wide rare variant single-variant tests (0.0001 ≤ MAF < 0.01; missense or LOF), and gene-level ultra-rare variant burden tests (MAF < 0.0001; deleterious missense or LOF). Standard QC and regenie-based tests were used; LD clumping defined independent loci (genome-wide threshold 5×10⁻⁸; secondary threshold 0.04). eQTL mapping (GTEx v8) supported direction inference. Rare/ultra-rare analyses used burden/SKAT/SKAT-O/ACAT with Bonferroni thresholds for significance.
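The allelic-series frequency bins can be expressed as a small classifier. This is a minimal sketch: the MAF cut-offs come from the text, while the function names, the family-wise α of 0.05, and the test count in the example are assumptions. The annotation filters (missense, LOF, deleteriousness) are not modeled here.

```python
def variant_class(maf: float) -> str:
    """Assign a variant to a frequency bin using the cut-offs from the text."""
    if not 0.0 < maf <= 0.5:
        raise ValueError("minor allele frequency must be in (0, 0.5]")
    if maf >= 0.01:
        return "common"
    if maf >= 0.0001:
        return "rare"
    return "ultra-rare"


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold; alpha = 0.05 is an assumed default."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


for maf in (0.25, 0.01, 0.005, 0.0001, 0.00005):
    print(maf, variant_class(maf))

# Hypothetical example: 20,000 gene-level burden tests.
print(bonferroni_threshold(20_000))
```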
Feature construction for ML-GPS: Thirteen features per gene–phecode pair: nine allelic-series features (common/rare/ultra-rare for P, B, C) encoded continuously (e.g., −log10 p-values, counts), plus four curated evidence features (EVA-ClinVar, HGMD, OMIM, L2G). Gradient boosting (LightGBM) models were trained to predict whether a gene–phecode has at least one drug indication in Open Targets (training/holdout) and evaluated in SIDER (external test). ElasticNet logistic regression served as a comparator.
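The thirteen-feature layout can be sketched as follows. The feature names, the `build_row` helper, and the choice to encode a missing association as 0.0 are assumptions made for illustration; the text only specifies nine allelic-series features (three variant classes by three encodings) plus four curated sources, with continuous encodings such as −log10 p-values.

```python
import math
from itertools import product

VARIANT_CLASSES = ("common", "rare", "ultra_rare")
ENCODINGS = ("P", "B", "C")
CURATED = ("EVA-ClinVar", "HGMD", "OMIM", "L2G")

# 3 variant classes x 3 phenotype encodings = 9 allelic-series features.
ALLELIC_FEATURES = [f"{v}_{e}" for v, e in product(VARIANT_CLASSES, ENCODINGS)]
FEATURE_NAMES = ALLELIC_FEATURES + list(CURATED)


def build_row(p_values: dict, curated: dict) -> list:
    """Build one gene-phecode feature row.

    Association p-values become -log10(p); missing entries become 0.0
    (an assumption of this sketch, not a documented choice of the authors).
    """
    row = [
        -math.log10(p_values[name]) if name in p_values else 0.0
        for name in ALLELIC_FEATURES
    ]
    row += [float(curated.get(name, 0.0)) for name in CURATED]
    return row


# Hypothetical gene-phecode pair with two association signals and OMIM support.
row = build_row({"common_P": 0.05, "rare_B": 1e-6}, {"OMIM": 1})
print(len(FEATURE_NAMES))
print(dict(zip(FEATURE_NAMES, (round(x, 2) for x in row))))
```

Keeping the sub-threshold `common_P` signal (p = 0.05) as a small positive value, rather than zeroing it out, is what a continuous encoding allows and a binary significant/not-significant encoding would discard.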
Direction-of-effect modeling: Multi-class one-vs-rest LightGBM predicted probabilities for activator vs inhibitor vs no indication, using mechanism-of-action labels from ChEMBL/DrugBank/Open Targets. Variant-level features were split by predicted functional consequence (GOF/LOF/neutral via VEP and LoGoF/LogFunc), and common/L2G features were split by concordant vs discordant GWAS-eQTL effect directions. Class weighting upweighted activator indications due to imbalance.
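Two ideas from this paragraph lend themselves to a short sketch: splitting a signal by whether GWAS and eQTL effect directions agree, and weighting classes to offset imbalance. Both helpers are hypothetical. In particular, the inverse-frequency ("balanced") weighting shown here is a common default swapped in for illustration; the summary says only that activator indications were upweighted, not how.

```python
from collections import Counter


def direction_concordance(gwas_beta: float, eqtl_beta: float) -> str:
    """Compare the signs of a GWAS effect and an eQTL effect for one variant."""
    if gwas_beta == 0 or eqtl_beta == 0:
        return "undetermined"
    return "concordant" if (gwas_beta > 0) == (eqtl_beta > 0) else "discordant"


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


print(direction_concordance(0.2, 0.5))    # same sign
print(direction_concordance(0.2, -0.5))   # opposite sign

# Hypothetical label distribution with activators as the rarest class.
labels = ["none"] * 90 + ["inhibitor"] * 8 + ["activator"] * 2
weights = balanced_class_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
```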
Drug and evidence datasets: Open Targets (v23.12) and SIDER (v4.1) provided drug indications, mechanisms, and gene targets; anti-infectives (ATC J) were excluded. Final datasets comprised 112,274 gene–phecode pairs in Open Targets (4,116 with indications) and 58,674 in SIDER (1,833 with indications). Curated evidence from EVA-ClinVar, HGMD Professional (2023.3), OMIM, and L2G was harmonized to phecodes via ontology mappings.
Evaluation: Performance assessed by AUPRC, odds ratios (ORs) per SD increase in score for any and phase-specific indications, and enrichment across score percentiles. SHAP analyses quantified feature contributions. Calibration and inflation metrics (λ) were evaluated for genetic tests. Precision/recall trade-offs were provided for thresholds. Pathway enrichment used hallmark gene sets with single-sample GSEA across phenotypes.
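Two of the metrics named above can be written out in a few lines. This is a minimal sketch under stated assumptions: AUPRC is approximated by average precision without special handling of tied scores, and the percentile enrichment is shown as a plain 2×2 odds ratio, which is a simpler stand-in for whatever regression-based ORs the authors computed. All example numbers are hypothetical.

```python
def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive (ties not handled)."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def odds_ratio(top_pos, top_neg, rest_pos, rest_neg):
    """Odds of an indication in the top-scoring group relative to the rest."""
    return (top_pos * rest_neg) / (top_neg * rest_pos)


# Toy ranking: positives at ranks 1 and 3 of 5.
ap = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.2, 0.1])
print(round(ap, 4))

# Hypothetical counts: 10 of 100 top-percentile pairs have an indication,
# versus 30 of 900 among the remaining pairs.
or_top = odds_ratio(10, 90, 30, 870)
print(round(or_top, 3))
```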
Key Findings
- ML models for 112 chronic disease phecodes achieved strong discrimination and calibration: median AUROC 0.85 (IQR 0.83–0.87) and median Brier score ~0.01. Incorporating broader feature sets improved AUROC and AUPRC over preliminary models. Predicted probabilities correlated with disease severity and all-cause mortality in most phenotypes, suggesting they capture risk/underdiagnosis.
- Predicted phenotypes (B, C) identified substantially more genes with significant associations than observed phenotypes (P) across common, rare, and ultra-rare analyses. Median inflation factors (λ) were reported as acceptable: for common variants, P 1.03, B 1.06, C 1.34; for rare variants, P 1.04, B 1.04, C 1.03; for ultra-rare variants, P 0.76, B 0.89, C 1.03.
- Curated evidence features were strongly associated with drug indications (Open Targets): EVA-ClinVar OR ~6.61; HGMD OR ~4.86; OMIM OR ~12.03; L2G OR ~6.68. For common variant features, ORs were P 7.56, B 6.28, C 3.19; rare: P ~16.46, B ~15.62, C ~8.75; ultra-rare: P ~6.87, B ~8.66, C ~4.02 (no significant differences among P/B/C for rare/ultra-rare after certain comparisons). Even after removing P-identified genes, B and C remained significantly associated with indications, increasing coverage.
- Architecture and encoding improvements: Gradient boosting (GB) with continuous encoding outperformed logistic regression (LR) and binary encodings. In Open Targets, AUPRC increased from ~0.049 (LR) to ~0.063 (GB with continuous features); in SIDER, from ~0.050 to ~0.066. ORs per SD for the best model (L2G + Clinical + PBC) were 1.26 for any indication and 1.41 for Phase IV indications in Open Targets. Pairs in the top percentile (99th–100th) had ORs of 6.49 (Open Targets) and 7.38 (SIDER).
- Coverage and predictions: ML-GPS produced scores for 2,362,636 gene–phecode pairs (26,035 genes; 112 phecodes). Predicted phenotypes notably expanded coverage; the top 1% of scores supported 15,077 gene–phecode pairs previously lacking support. Among top-score pairs (>99th percentile), a higher proportion had drug indications compared to a GPS-like model (L2G + Clinical + P). SHAP indicated rare B and rare C features, and ultra-rare B, as most informative; OMIM contributed little beyond HGMD and ClinVar due to redundancy.
- Direction-of-effect (DoE): For activator indications, AUPRCs were 0.0185 (Open Targets) and 0.022 (SIDER); for inhibitor indications, 0.052 (Open Targets) and 0.056 (SIDER). With the best feature set (L2G + Clinical + PBC), top-percentile predictions showed roughly 6–7-fold higher odds of an indication for both activator and inhibitor classes.
- Case studies: ML-GPS highlighted known and plausible targets: LRRK2 for Parkinson’s disease; olaparib-related cardiovascular targets; and known relationships such as PCSK9–hyperlipidemia, ACE–chronic kidney disease (CKD), GUCY1A1–hypertensive heart disease, NPC1L1–hypercholesterolemia, and ADRB1–CKD, each with the correct DoE prediction. It also nominated targets lacking current indications (e.g., LDLR activation for dyslipidemia; WNT16 activation for fracture/osteoporosis; TMPRSS6 inhibition in certain contexts). Pathway enrichment across multiple hallmark gene sets was consistent with disease biology.
- Overall, integrating predicted phenotypes (B, C) and continuous encodings significantly improved prioritization performance and target coverage over approaches relying solely on observed phenotypes and binary features.
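To put the reported AUPRC values in context, it helps to compare them with the share of gene–phecode pairs that have an indication, since a random ranking has an expected AUPRC close to that share. The sketch below uses only the counts and AUPRCs quoted in this summary. One caveat is an assumption worth stating: the reported AUPRCs may have been computed on holdout subsets whose prevalence differs from the full-dataset figures used here, so the ratios are indicative only.

```python
# Counts and AUPRCs as quoted in this summary (LR = logistic regression,
# GB = gradient boosting with continuous features).
datasets = {
    "Open Targets": {"pairs": 112_274, "indicated": 4_116,
                     "auprc_lr": 0.049, "auprc_gb": 0.063},
    "SIDER": {"pairs": 58_674, "indicated": 1_833,
              "auprc_lr": 0.050, "auprc_gb": 0.066},
}

summary = {}
for name, d in datasets.items():
    prevalence = d["indicated"] / d["pairs"]       # approximate random-ranking AUPRC
    summary[name] = {
        "prevalence": prevalence,
        "gain_over_lr": d["auprc_gb"] / d["auprc_lr"],
        "lift_over_prevalence": d["auprc_gb"] / prevalence,
    }

for name, s in summary.items():
    print(
        f"{name}: prevalence={s['prevalence']:.4f}, "
        f"GB/LR={s['gain_over_lr']:.2f}, "
        f"GB/prevalence={s['lift_over_prevalence']:.2f}"
    )
```

Under these assumptions, the gradient boosting model improves on logistic regression by roughly 29% (Open Targets) and 32% (SIDER) in relative AUPRC, and sits at about 1.7 and 2.1 times the full-dataset prevalence, respectively.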
Discussion
The study addresses key limitations of binary case/control phenotyping by introducing ML-derived continuous disease probabilities that better capture risk, severity, and underdiagnosis. Incorporating genetic associations from these predicted phenotypes uncovered additional gene–disease signals across the allelic spectrum, improving the coverage of potential drug targets. Continuous feature encoding enabled the model to leverage sub-threshold associations, and gradient boosting captured non-linear interactions among evidence sources, as supported by SHAP analyses. The resulting ML-GPS achieved higher precision–recall performance and stronger enrichment for drug indications across percentiles than GPS-like logistic models and expanded the pool of high-confidence targets significantly. Importantly, predicted phenotypes complemented rather than replaced observed phenotypes; both contributed distinct signals. The DoE extension further refined prioritization by suggesting activator versus inhibitor mechanisms, aiding therapeutic directionality. Examples spanning lipid metabolism, cardiovascular, renal, and neurological diseases illustrate the model’s ability to recover known relationships and highlight promising, tractable targets lacking current indications. Enrichment of disease-relevant pathways among top-scoring pairs supports biological validity and potential translational impact.
Conclusion
ML-GPS, a machine learning-assisted evolution of GPS, integrates genetic associations derived from observed and predicted phenotypes with curated genetic evidence using gradient boosting and continuous encoding. Applied to 112 chronic disease phecodes in UK Biobank, ML-GPS substantially improves prediction performance and expands coverage of potential drug targets, identifying thousands of high-priority gene–disease pairs, including many without prior support. The framework also infers direction-of-effect, guiding activator versus inhibitor strategies. Future work should extend ML-GPS to additional and more diverse biobanks to enhance generalizability across ancestries, integrate richer phenotyping modalities (e.g., imaging, audiometry, ophthalmology), refine functional consequence predictions, and harmonize drug datasets. These advances may further improve target identification and de-risk drug development for chronic diseases.
Limitations
- Cohort and ancestry: Genetic analyses were limited to UK Biobank participants of European ancestry, potentially restricting generalizability and missing ancestry-specific signals.
- Phenotype scope and selection: Semi-subjective manual screening may have excluded valid chronic disease phecodes; several rare or sensorimotor phenotypes were excluded due to insufficient predictive performance driven by limited phenotypic data.
- Data modality gaps: Lack of comprehensive imaging, audiometric, and ophthalmic data in most participants constrained ML phenotype accuracy for certain diseases.
- Drug data and ontology mapping: Differences and potential inconsistencies between Open Targets and SIDER, reliance on ICD-10/phecode mapping, and harmonization across ontologies may introduce misclassification.
- Absence of a feature is not negative evidence: The absence of a genetic feature does not constitute evidence against a target; LOF/GOF and deleteriousness predictions carry uncertainty.
- Class imbalance: Greater prevalence of inhibitor vs activator indications required weighting; residual bias may persist despite mitigation.
- Comparative constraints: Direct comparison with the original GPS is limited by different phenotype definitions (child vs parent phecodes).