Introduction
Chronic non-communicable diseases are a leading cause of global morbidity and mortality, yet many lack effective treatments. High clinical trial failure rates, partly attributable to insufficient target evidence, highlight the need for improved drug discovery strategies. Genome-wide association studies (GWAS) in large biobank datasets have identified many disease-associated variants. However, these studies typically rely on binary case/control models, which are susceptible to underdiagnosis and misdiagnosis and fail to capture disease risk stratification. Machine learning offers a promising alternative: models can generate continuous representations of disease that help identify additional disease-associated variants. This study builds on the Genetic Priority Score (GPS) framework, which combines clinical variants with genetic associations. The original GPS demonstrated efficacy in identifying drug indications and improving clinical trial success rates, but its reliance on binary case/control models limited its power to detect disease-associated variants. The machine learning-assisted version (ML-GPS), developed for 112 chronic disease phenotypes, aims to improve target discovery by generating continuous disease representations and integrating multiple sources of genetic evidence to assign a probability of a drug indication to each gene-phenotype pair.
Literature Review
The literature extensively demonstrates the importance of genetic evidence in drug discovery. A significant proportion of newly approved drugs are supported by genetic evidence, and genetics-supported mechanisms show higher success rates in clinical trials. The original GPS framework successfully leveraged this evidence, but its reliance on binary case-control models presented limitations. Recent studies have shown the utility of machine learning in generating continuous representations of diseases, identifying more disease-associated variants and potentially enhancing target discovery. This existing literature sets the stage for the development and validation of a more sophisticated approach, ML-GPS.
Methodology
The study employed a multi-stage approach to construct ML-GPS. First, gradient boosting models (LightGBM) were trained on UK Biobank data to predict 112 chronic disease phenocodes. These models used a comprehensive set of features, including age, sex, laboratory and vital measurements, lifestyle factors, medication usage, and diagnostic history, and produced a continuous probability of disease presence (0 to 1) for each participant. Second, genome-wide association analyses (GWAS) were performed on common, rare, and ultra-rare variants for each gene-phenocode pair, using both observed and predicted phenotypes. These associations were encoded continuously (as -log₁₀(p-value) and number of variants) rather than with the binary encoding of the original GPS. Third, these genetic associations were integrated with existing evidence (EVA-ClinVar, HGMD, OMIM, L2G) using gradient boosting (LightGBM) to construct ML-GPS. The model was trained on Open Targets and externally tested on SIDER, predicting the probability of a drug indication for 2,362,636 gene-phenotype pairs. Various model architectures and feature inputs were compared to optimize ML-GPS performance. The study also included a Shapley Additive exPlanations (SHAP) analysis to assess feature importance. Finally, a directional version of ML-GPS (ML-GPS DoE) was developed to predict the direction of drug effect (activator or inhibitor).
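The difference between the binary and continuous encodings can be illustrated with a minimal sketch. It is not the study's pipeline: the function name, the gene and phenocode labels, the significance threshold of 5e-8, and the choices to summarize a pair by its smallest p-value and to count only variants below the threshold are all assumptions made for illustration.

```python
import math

def encode_association(p_values, threshold=5e-8):
    """Encode the association results for one gene-phenotype pair.

    Returns a continuous encoding (-log10 of the smallest p-value and a
    count of variants below the threshold) alongside a binary flag of the
    kind a case/control-style score would use. The threshold and the
    summary choices are illustrative assumptions.
    """
    if not p_values:
        # No tested variants: no evidence under either encoding.
        return {"neg_log10_p": 0.0, "n_variants": 0, "binary": 0}
    best_p = min(p_values)
    return {
        "neg_log10_p": -math.log10(best_p),
        "n_variants": sum(1 for p in p_values if p < threshold),
        "binary": int(best_p < threshold),
    }

# Hypothetical gene-phenotype pairs with made-up p-values.
pairs = {
    ("GENE_A", "phenocode_1"): [1e-12, 3e-9, 2e-8],
    ("GENE_B", "phenocode_1"): [4e-6],
    ("GENE_C", "phenocode_1"): [],
}
features = {pair: encode_association(p) for pair, p in pairs.items()}

for pair, row in features.items():
    print(pair, row)
```

In this toy data, GENE_B and GENE_C both receive a binary flag of 0, but the continuous encoding still separates a sub-threshold signal from no signal at all, which is the kind of information a downstream gradient boosting model can use.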
Key Findings
The inclusion of predicted phenotypes significantly improved ML-GPS performance compared to using only observed phenotypes. Predicted phenotypes identified substantially more genetic associations across the allele frequency spectrum. ML-GPS substantially increased coverage of drug targets, with the top 1% of scores supporting 15,077 gene-phenotype pairs that previously lacked support. Analysis of the top-scoring gene-phenotype pairs recovered many well-known target-disease relationships and highlighted promising targets without indicated drugs for diseases such as Parkinson’s disease and cardiovascular disease, including LRRK2 inhibitors for Parkinson’s disease and olaparib for cardiovascular disease. Comparisons between ML-GPS and a logistic regression model approximating the original GPS demonstrated significant improvements in the area under the precision-recall curve (AUPRC) and in odds ratios for drug indications. SHAP analysis revealed that genetic associations with predicted phenotypes (B and C) were among the most important features in ML-GPS predictions. Incorporating both the continuous and the binarized model probabilities significantly increased the number of genes associated with drug indications. ML-GPS DoE also showed promising results in predicting the direction of drug effect (activator or inhibitor).
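The two evaluation metrics named above can be sketched in a few lines. This is an illustrative calculation on made-up data, not the study's evaluation code: AUPRC is approximated here by average precision, and the score threshold and the Haldane-Anscombe correction (adding 0.5 to each cell of the 2x2 table) are assumptions introduced for the example.

```python
def average_precision(labels, scores):
    """Approximate AUPRC as average precision over the score ranking."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank pairs from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / n_pos

def odds_ratio(labels, scores, threshold):
    """Odds of a drug indication at or above the threshold vs. below it."""
    a = b = c = d = 0
    for label, score in zip(labels, scores):
        if score >= threshold:
            if label:
                a += 1  # high score, indicated
            else:
                b += 1  # high score, not indicated
        elif label:
            c += 1      # low score, indicated
        else:
            d += 1      # low score, not indicated
    # Assumed Haldane-Anscombe correction to avoid division by zero.
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

# Hypothetical gene-phenotype pairs: 1 = known drug indication.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

ap = average_precision(labels, scores)
or_top = odds_ratio(labels, scores, threshold=0.65)
print(f"average precision = {ap:.3f}, odds ratio = {or_top:.2f}")
```

AUPRC is a common choice when positives are rare, as drug indications are among millions of gene-phenotype pairs, because it is not inflated by the large number of true negatives.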
Discussion
The findings demonstrate that ML-GPS effectively integrates multiple sources of genetic evidence with machine learning to enhance drug target discovery for chronic diseases. The improvements in prediction performance and the identification of novel candidate targets highlight its potential to accelerate drug development. Using predicted phenotypes mitigates the limitations of relying solely on observed phenotypes and improves the identification of relevant genetic associations. The continuous feature encoding and gradient boosting architecture allow a more nuanced representation of the data and can capture complex relationships between features and drug indications. The recovery of known target-disease relationships alongside promising novel targets further supports the approach. ML-GPS DoE adds value by predicting the direction of drug effect, providing more complete information for drug development.
Conclusion
ML-GPS represents a significant advancement in drug target identification for chronic diseases. By incorporating predicted phenotypes, continuous feature encoding, and gradient boosting, ML-GPS improves on the original GPS approach. Future research should focus on expanding its application to additional biobank datasets, including those with diverse ancestries, to enhance generalizability and address limitations related to sample representation. Further investigation into the identified novel targets and their potential clinical utility is also warranted. The integration of additional data types, such as genomic regulatory information, could further refine predictions.
Limitations
The study primarily used data from the UK Biobank, predominantly from participants of European ancestry, limiting the generalizability to other populations. The phenocode selection process involved some subjectivity, potentially excluding relevant phenotypes. The study relied on existing drug databases, which may have limitations in terms of accuracy and completeness. Furthermore, the model's performance is contingent upon the quality of the predicted phenotypes and the available genetic and clinical data.