Introduction
Machine learning (ML) has revolutionized early drug discovery, enabling virtual screening of molecules for target-based activity and ADME/Tox properties. Traditional ML methods such as Support Vector Regression (SVR), random forests, and deep neural networks often require large datasets for optimal performance. The emergence of large language models (LLMs) such as MolBART, together with few-shot learning classifier (FSLC) methods, offers potential advantages, particularly in data-scarce scenarios. However, the 'no free lunch' theorem suggests that no single ML algorithm universally outperforms the others. This research investigates the relative strengths of classical ML, LLMs, and FSLCs across datasets of varying sizes and chemical diversities in order to establish a model selection heuristic for drug discovery. The study addresses the challenge of choosing an appropriate ML algorithm for a given drug discovery task, given the varying characteristics of available datasets. This is crucial for efficient and cost-effective preclinical drug development, allowing researchers to select the most appropriate method based on the specific dataset at hand rather than relying on general assumptions or trial and error.
Literature Review
The paper reviews traditional ligand-based modeling approaches in drug discovery, which train ML algorithms on 2D structural fingerprints or physicochemical descriptors. These methods often require substantial data, limiting their use with datasets containing only a handful of molecules. The rise of transfer learning and multi-task output techniques, accelerated by advances in LLMs, provides an alternative for larger datasets. The use of SMILES strings as input for Seq2Seq and Transformer models is highlighted: the linear encoding of 2D structural information is analogous to the word and sentence structure handled in natural language processing. However, the need for large training sets (millions of compounds) often limits the applicability of transformer-based architectures in drug discovery. Pre-training LLMs on large datasets and fine-tuning them for specific targets, as well as using FSLC methods for smaller datasets, are noted as potential strategies to address these limitations. The literature suggests a gap in direct comparisons between these methods across a wide range of dataset sizes, motivating the current study.
Methodology
The study evaluates three ML methods – Support Vector Regression (SVR), a pre-trained MolBART transformer model, and a few-shot learning classifier (FSLC) – using datasets of varying sizes and chemical diversities. For classical ML modeling with SVR, extended-connectivity fingerprints (ECFP6) were used as 2D molecular features, with a nested 5-fold cross-validation strategy for hyperparameter optimization. The MolBART model, a pre-trained large language model, was fine-tuned on individual target datasets from ChEMBL. The FSLC model was developed to handle very small datasets, employing a prototypical network architecture trained on a subset of kinase datasets. Dataset diversity was quantified as the area under the Cumulative Scaffold Frequency Plot (CSFP) curve, reflecting the distribution of Murcko scaffolds within a dataset. The performance of each model type was assessed using R-squared (R²) values and other relevant classification metrics. A Fast Interpretable Greedy-Tree Sums (FIGS) model was then trained to predict the optimal ML model type (MolBART, SVR, or FSLC) from dataset size and diversity. This model was trained on 95 kinase datasets and tested on a hold-out set to evaluate its predictive performance. Finally, the three ML methods were applied to five kinases implicated in Alzheimer's disease (GSK3β, ABL1, FYN, CDK5, and MARK1) with widely varying dataset sizes and complexities, using ChEMBL and BindingDB as data sources. For MARK1, a high-throughput screen of FDA-approved compounds was conducted to create a prospective test set and validate the model's performance in a data-scarce setting.
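The scaffold-based diversity metric can be sketched in a few lines of Python. This is a plausible reconstruction rather than the authors' code: it assumes the CSFP ranks Murcko scaffolds by frequency and plots the cumulative fraction of molecules against the fraction of scaffolds considered, so that an area near 1.0 indicates a few dominant scaffolds (low diversity) and an area near 0.5 indicates one scaffold per molecule (maximum diversity). Scaffolds are represented here as plain identifier strings; in practice they would be Murcko-scaffold SMILES generated with a cheminformatics toolkit such as RDKit.

```python
from collections import Counter

def csfp_auc(scaffolds):
    """Area under a Cumulative Scaffold Frequency Plot (CSFP).

    `scaffolds` holds one Murcko-scaffold identifier per molecule.
    Scaffolds are ranked by frequency (most common first); the curve's
    y-value after the k-th scaffold is the cumulative fraction of
    molecules covered, and each scaffold occupies an equal x-step of
    1 / n_scaffolds. Returns a value in (0.5, 1.0]: ~1.0 means a few
    scaffolds dominate (low diversity), ~0.5 means every molecule has
    its own scaffold (high diversity).
    """
    counts = sorted(Counter(scaffolds).values(), reverse=True)
    n_mols = sum(counts)
    cumulative, area = 0, 0.0
    for c in counts:
        cumulative += c
        area += cumulative / n_mols  # height of this step
    return area / len(counts)        # each step is 1/n_scaffolds wide
```

For example, `csfp_auc(["A", "A", "B"])` exceeds `csfp_auc(["A", "B", "C"])`, reflecting the lower scaffold diversity of the first set.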
Key Findings
The study revealed a clear relationship between dataset characteristics and optimal model type. For datasets with fewer than 50 molecules, the FSLC consistently outperformed both MolBART and SVR. In the 50-240 molecule range, the superior model depended on dataset diversity: with high diversity, MolBART performed best, benefiting from its transfer learning capabilities. For larger datasets (>240 molecules), SVR with ECFP6 features consistently outperformed the other methods. Correlation analysis between MolBART R², SVR R², dataset size, and molecular diversity showed that MolBART's performance was relatively insensitive to dataset size, illustrating the benefits of transfer learning. Increasing diversity, by contrast, degraded SVR performance while leaving MolBART relatively unaffected. A FIGS model trained to predict the optimal model type from dataset size and diversity achieved an area under the ROC curve of 0.74, suggesting that this heuristic approach can be reasonably successful at predicting which ML algorithm is preferable. Applying the models to the five kinases associated with Alzheimer's disease demonstrated the general utility of the "Goldilocks" heuristic. While larger datasets favored SVR, the small MARK1 dataset was best served by FSLC and MolBART; this was validated by a high-throughput screen of FDA-approved compounds. Five novel MARK1 inhibitors were identified through the experimental screening (baricitinib, AT9283, ON123300, upadacitinib, and tofacitinib citrate). These molecules were structurally diverse relative to the MARK1 training set, highlighting the models' ability to identify novel chemical structures.
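The size-dependent findings above amount to a simple decision rule, which can be sketched as follows. The 50- and 240-molecule cut-offs are the ones reported in the findings; the `high_diversity` boundary is a hypothetical placeholder (the summary does not restate the paper's exact diversity threshold), and `diversity` is assumed to be a score where higher values mean a more diverse dataset.

```python
def suggest_model(n_molecules, diversity,
                  small_max=50, mid_max=240, high_diversity=0.5):
    """Goldilocks heuristic for picking an ML model type.

    n_molecules : size of the training dataset
    diversity   : scaffold-diversity score (higher = more diverse);
                  the 0.5 threshold is illustrative, not from the paper
    """
    if n_molecules < small_max:
        return "FSLC"          # data-scarce: few-shot learning wins
    if n_molecules <= mid_max:
        # mid-sized: transfer learning pays off only on diverse sets
        return "MolBART" if diversity >= high_diversity else "SVR"
    return "SVR"               # large datasets: classical ML prevails
```

For instance, a 30-molecule set maps to FSLC regardless of diversity, while a diverse 100-molecule set maps to MolBART.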
Discussion
The findings support the existence of a 'Goldilocks zone' for ML model selection in drug discovery: the optimal model depends on dataset size and diversity. Few-shot learning excels in data-scarce scenarios, while large language models leverage transfer learning to outperform classical methods on moderately sized, diverse datasets; classical ML methods ultimately prevail on large datasets. The successful prediction of the optimal model type from dataset size and diversity demonstrates the generalizability of this heuristic across a range of datasets. The identification of novel MARK1 inhibitors underscores the practical applicability of the proposed heuristic and the ability of different ML methods to complement each other. The study's limitations (a focus on kinases and similarity between targets) are acknowledged, and future work could apply this paradigm to a broader range of targets and incorporate additional molecular descriptors.
Conclusion
This study establishes a practical heuristic for selecting ML models in drug discovery based on dataset size and diversity. Few-shot learning is ideal for small datasets, LLMs excel in small-to-medium sized, diverse datasets, and classical methods outperform others with large datasets. This 'Goldilocks paradigm', validated by experimental identification of novel kinase inhibitors, offers guidance for efficient and effective model selection in drug discovery projects. Future research should explore this paradigm with more diverse targets and molecular descriptors.
Limitations
The study primarily focuses on kinase inhibitors, which may limit the generalizability of the findings to other target classes. The datasets used are predominantly from ChEMBL and BindingDB, and other sources could offer a more comprehensive representation. Experimental error associated with different assays in the compiled datasets may also impact the model performance and needs to be considered. Furthermore, the study does not explore all possible ML methods and molecular descriptors. Future work should address these limitations by testing this paradigm with additional target classes, datasets, ML approaches, and molecular descriptors.