Introduction
Drug discovery hinges on identifying "druggable" targets: proteins amenable to modulation by therapeutic agents. The concept of the "druggable genome," introduced in 2002, refers to the subset of genes encoding proteins that can bind modulators and elicit a therapeutic effect. Estimates suggest that only a small percentage of human genes (~22%) are druggable using conventional methods, and even fewer have clear disease associations. Previous druggability prediction methods have relied on the targets of approved drugs or on broad gene classifications. Advances in large-scale genomics and systems biology, however, offer the potential for improved predictive accuracy by drawing on diverse data sources such as population-based intolerance metrics, tissue expression data, and protein structures. This study aims to improve the understanding of druggability by identifying the gene properties that most influence binding affinity and therapeutic outcome, and to facilitate target identification by prioritizing genes whose druggability profiles resemble those of successful drug targets.
Druggability must be distinguished from ligandability. Ligandability refers to a protein's ability to bind a modulator, whereas druggability also requires that binding lead to a therapeutic effect, which brings in considerations such as the cellular accessibility of the modulator. This study prioritizes genes likely to be druggable, even when multiple targets show similar ligandability. Predicting druggable genes with standard machine learning (ML) is challenging because of the limited number of known druggable targets, class imbalance, and the lack of reliable negative samples. To address these challenges, the researchers developed DrugnomeAI.
Literature Review
Several resources exist for assessing target druggability, including curated databases like Open Targets and TractaViewer, which integrate tractability data from various sources. Computational tools for predicting target druggability have also been developed. These tools utilize features derived from gene-level annotations, protein sequences, and systems-level data. Examples include TargetDB (using a random forest model), DrugMiner (employing a neural network), and a hybrid deep learning model by Yu et al. (combining convolutional and recurrent neural networks). Other approaches include bagging ensembles of support vector machines and decision tree-based meta-classifiers. Some tools focus on specific diseases or therapeutic areas (e.g., oncology), while others concentrate on predicting druggability for particular drug types (e.g., kinase inhibitors). Methods also exist for predicting druggability at binding sites. This work builds upon existing methods by integrating a comprehensive set of druggability-associated features and providing both disease-agnostic and domain-specific models to identify potential targets for various therapeutic areas and drug modalities. A particular focus is on PROTACs, a novel drug modality that shows potential advantages over traditional small-molecule therapeutics and could expand the space of druggable targets. Notably, this study aims to provide the first machine learning model for predicting the druggability of genes for PROTAC-based therapeutics.
Methodology
DrugnomeAI is built on the mantis-ml framework and employs stochastic semi-supervised learning. The framework addresses the scarcity of positive samples by incorporating unlabeled data: it iteratively creates balanced datasets of positive (known druggable) genes and unlabeled genes, and trains an ensemble of machine learning models on each dataset using stratified 10-fold cross-validation. The final druggability score for each gene is the average of its predictions across all models and iterations. DrugnomeAI integrates 324 gene-level features from 15 diverse data sources. These include generic features from mantis-ml (e.g., genic intolerance, GWAS data) and druggability-specific features from resources such as Pharos, STRING, CTD, InterPro, Reactome, and DGIdb. The druggability-specific features capture protein-protein interactions, pathway associations, chemical-gene interactions, protein domains, and disease annotations. Several classifiers (Random Forest, Extra Trees, Support Vector Machine, Gradient Boosting) and feature sets were tested. Gradient Boosting consistently outperformed the other classifiers, and the "Pharos + InterPro" feature set (combining features from Pharos and InterPro) offered a good balance between performance and low redundancy. Feature selection with the Boruta algorithm identified protein-protein interaction features as the most important predictors of druggability. Multiple DrugnomeAI models were trained using different combinations of labeled datasets from Pharos and Triage, which provide varying levels of evidence for known drug targets. This approach gives a holistic view of druggability by incorporating diverse perspectives on a gene's potential as a drug target.
The authors also created specialized models for different drug modalities (small molecules, monoclonal antibodies, PROTACs) and therapeutic areas (oncology and non-oncology) by training on datasets of genes already known to be amenable to each modality or area. The hyperparameters of the Gradient Boosting classifier were optimized using grid search.
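The iterative balanced-sampling procedure described in this section can be illustrated with a minimal, dependency-free sketch. It is not the mantis-ml or DrugnomeAI implementation: the function names, the toy nearest-centroid base learner (used here in place of the Gradient Boosting ensemble), and all default parameters are assumptions made for illustration only. The sketch shows the general pattern: draw a random unlabeled subset the same size as the positive set, split both classes into stratified folds, collect out-of-fold predictions, and average each gene's predictions over all iterations.

```python
import random
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_predict(train_X, train_y, test_X):
    """Hypothetical stand-in base learner (NOT Gradient Boosting).

    Scores each test vector by its relative closeness to the centroid of
    the positive training examples; 1.0 means "at the positive centroid".
    """
    pos = centroid([x for x, y in zip(train_X, train_y) if y == 1])
    neg = centroid([x for x, y in zip(train_X, train_y) if y == 0])
    scores = []
    for x in test_X:
        d_pos, d_neg = sq_dist(x, pos), sq_dist(x, neg)
        total = d_pos + d_neg
        scores.append(0.5 if total == 0 else d_neg / total)
    return scores


def stochastic_pu_scores(features, positives, n_iter=10, n_folds=10, seed=0):
    """Average out-of-fold scores over repeated balanced resampling.

    features:  dict mapping gene name -> list of numeric features
    positives: set of gene names labeled as known druggable
    Returns a dict gene -> mean score; genes that were never sampled
    in any iteration are omitted.
    """
    rng = random.Random(seed)
    pos = sorted(g for g in positives if g in features)
    unlabeled = sorted(g for g in features if g not in positives)
    if n_folds < 2 or len(pos) < 2 or len(unlabeled) < 2:
        raise ValueError("need >= 2 folds, positives, and unlabeled genes")

    collected = {g: [] for g in features}
    for _ in range(n_iter):
        # Balanced dataset: all positives plus an equally sized random
        # draw of unlabeled genes, which are treated as provisional negatives.
        sampled = rng.sample(unlabeled, min(len(pos), len(unlabeled)))

        # Stratified folds: shuffle each class separately, deal round-robin.
        folds = [[] for _ in range(n_folds)]
        for label, genes in ((1, pos), (0, sampled)):
            shuffled = genes[:]
            rng.shuffle(shuffled)
            for i, gene in enumerate(shuffled):
                folds[i % n_folds].append((gene, label))

        # Cross-validation: every sampled gene gets one out-of-fold score.
        for k, test in enumerate(folds):
            if not test:
                continue
            train = [item for j, fold in enumerate(folds) if j != k for item in fold]
            preds = fit_predict(
                [features[g] for g, _ in train],
                [y for _, y in train],
                [features[g] for g, _ in test],
            )
            for (gene, _), p in zip(test, preds):
                collected[gene].append(p)

    return {g: mean(v) for g, v in collected.items() if v}
```

Because unlabeled genes are resampled in every iteration, an unlabeled gene that resembles the known positives can still receive a high average score even though it is sometimes used as a provisional negative during training. In practice the base learner would be replaced by a real classifier, and the number of iterations chosen large enough that every unlabeled gene is sampled at least once.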
Key Findings
DrugnomeAI demonstrated high predictive performance, achieving an AUC of 0.97 with the Tier 1 labeled dataset and 0.99 with the Tclin dataset. The top 5% of genes ranked by DrugnomeAI were significantly enriched for genes already selected for clinical development programs (p < 1×10⁻³⁰⁸), which strongly supports its predictive ability. Protein-protein interaction network features were consistently identified as top predictors across all models. The top-ranked genes were also significantly enriched among genes reaching genome-wide significance in large-scale phenome-wide association studies (PheWAS) of the UK Biobank (p = 1.7×10⁻⁵ for binary traits, p = 1.6×10⁻⁷ for quantitative traits). The top 5% of genes from the models trained on the Tclin and Tier 1 datasets overlapped substantially (63%), and a large proportion of them had progressed to phase IV clinical trials (51% for the Tclin-based model, 35% for the Tier 1-based model). A significant number of genes among the top 5% of predictions lacked prior clinical evidence but were nevertheless supported by multiple types of non-clinical evidence (e.g., genetic associations, animal models, pathway involvement, literature), strengthening their potential relevance for drug development. The modality-specific models revealed novel genes among the top-ranked predictions, particularly for the antibody and PROTAC modalities. The antibody-specific model appropriately underrepresented intracellular proteins, which are typically difficult to target with antibodies. The oncology-specific models highlighted features distinct from those of the non-oncology models, indicating differences in target selection criteria between these therapeutic areas.
Comparison with other published druggability prediction methods showed that DrugnomeAI consistently achieved stronger enrichment in independent validation datasets of known drug targets, indicating better performance in drug target identification. A stepwise hypergeometric test further confirmed significant enrichment of DrugnomeAI predictions in multiple validation datasets.
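The enrichment results above rest on the hypergeometric test, which asks how likely it is to see at least the observed number of known targets among the top-ranked genes if the ranking were random. The sketch below computes that upper-tail probability exactly with the standard library. It assumes that "stepwise" means repeating the test at successive rank cutoffs down the list; the function names and the `step` parameter are hypothetical and are not taken from DrugnomeAI or mantis-ml.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_known, n_top, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_genes:   total number of ranked genes (population size)
    n_known:   number of known targets in the population
    n_top:     number of genes above the rank cutoff (draws)
    n_overlap: known targets observed above the cutoff
    """
    upper = min(n_known, n_top)
    tail = sum(
        comb(n_known, i) * comb(n_genes - n_known, n_top - i)
        for i in range(n_overlap, upper + 1)
    )
    # Exact integer arithmetic until the final division.
    return tail / comb(n_genes, n_top)


def stepwise_enrichment(ranked_genes, known_targets, step=1):
    """Run the test at every `step`-th rank cutoff of a ranked gene list.

    Returns a list of (cutoff, hits, p_value) tuples, assuming the
    "stepwise" test is a repeated test at growing cutoffs.
    """
    n_genes = len(ranked_genes)
    known = set(known_targets) & set(ranked_genes)
    hits = 0
    results = []
    for rank, gene in enumerate(ranked_genes, start=1):
        hits += gene in known  # True counts as 1
        if rank % step == 0:
            p = hypergeom_enrichment_p(n_genes, len(known), rank, hits)
            results.append((rank, hits, p))
    return results
```

For example, with 10 ranked genes of which 3 are known targets, finding all 3 in the top 3 positions gives a probability of 1/120. Tail probabilities smaller than the smallest representable double-precision value come out as 0.0 in this sketch, which is why extremely small p-values are usually reported as an upper bound.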
Discussion
DrugnomeAI offers a valuable tool for prioritizing drug targets by providing comprehensive druggability predictions across the human exome. The model's high predictive accuracy, validated against known drug targets and PheWAS results, demonstrates its effectiveness in identifying potential candidates. The identification of protein-protein interaction features as key determinants of druggability highlights the importance of considering systems-level properties. The availability of specialized models tailored to different drug modalities and therapeutic areas makes DrugnomeAI versatile and broadly applicable. The discovery of novel, highly ranked genes with strong non-clinical evidence further expands the potential target space. Although the models are trained on existing knowledge of successful drug targets, the promising targets that lack prior clinical evidence show that DrugnomeAI can also identify novel, potentially druggable targets that have not yet been explored clinically. This capability could accelerate drug discovery and lead to new therapeutic strategies.
Conclusion
DrugnomeAI represents a significant advancement in predicting drug target druggability. Its high predictive power, integration of diverse data sources, and availability of specialized models make it a valuable tool for drug discovery. The identification of novel, potentially druggable targets highlights the method's ability to expand the known druggable genome. Future directions include incorporating higher-resolution information from protein-protein interaction networks using graph neural networks, and extending the model to predict druggability at the binding site level.
Limitations
Like any data-driven approach, DrugnomeAI may overlook understudied genes with limited annotations. The model's performance is heavily influenced by the training data, which primarily comprises genes with existing drugs or clinical trials. Druggability is not an intrinsic property of a gene; it also depends on in vivo pharmacodynamic and pharmacokinetic factors, as well as safety and regulatory considerations. Integrating additional information on these factors could therefore improve the model's accuracy.