Testing the predictive power of reverse screening to infer drug targets, with the help of machine learning

Chemistry

A. Daina and V. Zoete

This research, conducted by Antoine Daina and Vincent Zoete, explores the potential of ligand-based reverse screening to predict the macromolecular targets of small-molecule drugs. With a machine-learning model that ranks the correct target first for 51.2% of molecules in a large external dataset, the study underscores the approach's promise for drug discovery.

Introduction
The study addresses how reliably ligand-based reverse screening can predict primary and secondary macromolecular targets of small molecules, a central need in modern drug discovery involving polypharmacology, specificity, and repurposing. While prior landmark work (e.g., Shoichet group) demonstrated target/off-target prediction with experimental validation, and many computational tools and ML comparisons have since emerged, actual predictive ability has often been under-evaluated due to difficulties assembling appropriate external test sets. This work proposes and conducts a large-scale, application-oriented external validation to quantify the predictive power of a similarity-based, machine-learning reverse screening method for estimating drug targets.
Literature Review
Earlier efforts established ligand-based reverse screening and validated predictions experimentally for a limited set of drugs and targets (e.g., Lounkine et al.). Numerous computational approaches and reviews have compared machine-learning algorithms and calibration methods for bioactivity/target prediction, highlighting concerns about black-box models and overfitting. Prior studies quantified the molecular similarity principle and related descriptors (e.g., ES5D, FP2) and developed web tools such as SEA and SwissTargetPrediction. However, comprehensive external validation on large, chemically distinct datasets and analysis of performance dependence on bioactivity knowledge volume and diversity have been lacking; this study fills that gap.
Methodology
Training and model: A binary logistic regression model combines 3D shape similarity (ElectroShape 5D, ES5D) and 2D chemical structure similarity (FP2 fingerprints). Training data were mined from ChEMBL v25: 501,959 unique compounds with bioactivity against 3,669 protein targets (actives at ≤10 μM; inactives at ≥100 μM; the gray area in between excluded). For each molecule, 20 low-energy conformers were generated and each encoded as an 18-dimensional ES5D vector; the chemical structure was encoded as a 1024-bit FP2 fingerprint. Pairwise comparisons produced two similarity matrices: a 3D-Score (the highest Manhattan-based similarity across the 20×20 conformer pairs) and a 2D-Score (the Tanimoto coefficient of the FP2 fingerprints). To account for molecular size effects, the data were split into 51 subsets by number of heavy atoms (≤10; 11–59 per integer; ≥60). For each subset, logistic regression learned a constant C and coefficients c1 (for the 3D-Score) and c2 (for the 2D-Score). The coefficient curves versus size were smoothed with third-degree polynomials to yield the final size-dependent coefficients. Internal robustness was assessed via 10-fold cross-validation, reporting MCC, precision, and recall; lower performance was observed for very small ligands owing to poorer specificity and class sparsity.

External validation and testing workflow: An external test set was constructed from Reaxys (version 03.2019) by applying filters analogous to those used for training: 364,201 small molecules (5–80 heavy atoms) active at ≤10 μM on 1,180 human protein targets shared with ChEMBL and confirmed not to overlap with the training/screening data. The screening library comprised the active fraction of ChEMBL molecules targeting human proteins: 405,544 actives on 2,069 proteins. For each test compound, ES5D and FP2 were compared against all known actives for each of the 2,069 targets to select the most similar active in 3D and in 2D (possibly different molecules).
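The two similarity scores described above can be sketched in a few lines. This is illustrative only: fingerprints are represented here as Python integers, and the exact function the authors use to turn the ES5D Manhattan distance into a similarity is an assumption (1/(1+d) is a common choice, not taken from the paper).

```python
# Sketch of the 2D-Score and 3D-Score described above.
# Assumptions (not from the paper): fingerprints are 1024-bit Python ints,
# and the Manhattan distance is mapped to a similarity via 1/(1+d).

def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient between two bit-vector fingerprints (2D-Score)."""
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 0.0

def es5d_similarity(conformers_a, conformers_b) -> float:
    """3D-Score: best Manhattan-based similarity over all conformer pairs.

    Each conformer is an 18-dimensional ES5D vector; with 20 conformers per
    molecule this scans the 20x20 pair grid and keeps the maximum.
    """
    best = 0.0
    for va in conformers_a:
        for vb in conformers_b:
            d = sum(abs(x - y) for x, y in zip(va, vb))  # Manhattan distance
            best = max(best, 1.0 / (1.0 + d))            # assumed transform
    return best
```

In practice the fingerprints and conformers would come from a cheminformatics toolkit; only the pairwise-maximum logic matters for the method.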
The two maximal similarity values per target were fed into the size-appropriate logistic equation to compute a probability used as a ranking score, yielding a ranked list of 2,069 targets per query. Predictive ability was quantified by the rank positions of the known experimental targets.

Applicability domain and chemical diversity: Physicochemical descriptor distributions (MW, WLOGP, TPSA, rotatable bonds, fCsp3, HBA, HBD) showed strong overlap between training and test sets (negative Z-factors), indicating similar applicability domains. Chemical diversity was characterized using Murcko frameworks and Oprea scaffolds. The training set contained 25,046 Murcko and 38,896 Oprea scaffolds; the test set, 21,820 and 33,754, respectively, with less than half overlapping, indicating chemical distinctness. A Distinct test subset of 32,748 molecules (9%) was defined as having scaffolds absent from training under both definitions. Performance analyses were stratified by molecule size, target data volume, and target ligand scaffold diversity.
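The scoring-and-ranking step can be sketched as follows. The coefficient values are placeholders: in the real workflow, C, c1, and c2 come from the fitted, polynomial-smoothed size-dependent curves, selected by the query's heavy-atom count.

```python
import math

def target_probability(score_3d, score_2d, C, c1, c2):
    """Binary logistic combination of the two maximal similarity values.

    C, c1, c2 are the size-dependent intercept and coefficients for the
    query's heavy-atom class (placeholder values in this sketch).
    """
    z = C + c1 * score_3d + c2 * score_2d
    return 1.0 / (1.0 + math.exp(-z))

def rank_targets(per_target_scores, coeffs):
    """Rank all screened targets by logistic probability for one query.

    per_target_scores: {target_id: (max_3d_score, max_2d_score)}
    Returns target ids sorted from most to least probable.
    """
    C, c1, c2 = coeffs
    return sorted(per_target_scores,
                  key=lambda t: target_probability(*per_target_scores[t], C, c1, c2),
                  reverse=True)

def top_k_success(ranked_lists, known_targets, k):
    """Fraction of queries whose experimental target appears in the top k."""
    hits = sum(1 for ranks, known in zip(ranked_lists, known_targets)
               if any(t in known for t in ranks[:k]))
    return hits / len(ranked_lists)
```

Applying `top_k_success` at k = 1, 2, 5, 15, 25, 50, 100 reproduces the shape of the evaluation reported in the Key Findings below (the values themselves, of course, depend on the real data).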
Key Findings
- Overall predictive performance on 364,201 external actives (1,180 targets):
  - Top-1: 51.2% (186,392 molecules)
  - Top-2: 56.7% (206,571)
  - Top-5: 65.0% (236,551)
  - Top-15: 72.9% (265,357)
  - Top-25: 76.6% (279,289)
  - Top-50: 81.3% (296,904)
  - Top-100: 85.9% (312,775)
  These rates far exceed random expectation (~0.1% Top-1 among 2,069 targets).
- Stability across molecular sizes: success rates are broadly constant across heavy-atom classes, with noisier performance for very small molecules and slight improvement for larger ones; performance is most stable for 20–40 heavy atoms.
- Distinct test set (32,748 molecules chemically distinct from training):
  - Top-1: 36.2% (11,843)
  - Top-2: 42.1% (13,794)
  - Top-5: 51.8% (16,964)
  - Top-15: 62.1% (20,337)
  - Top-25: 67.0% (21,957)
  - Top-50: 74.4% (24,365)
  - Top-100: 80.0% (26,198)
- Single-target subset (300,978 molecules with exactly one reported target):
  - Top-1: 49.7% (149,476)
  - Top-2: 55.0% (165,871)
  - Top-5: 62.7% (188,953)
  - Top-15: 70.5% (212,237) (~55-fold enrichment)
  - Top-25: 74.0% (223,152)
  - Top-50: 79.3% (238,575)
  - Top-100: 83.4% (252,582) (~10-fold enrichment)
- Dependence on target data volume (ChEMBL actives per target):
  - 0–10 actives (11% of targets): Top-1 11.4%; Top-100 19.8% (~2.3-fold enrichment)
  - 11–100 actives (36%): Top-1 40.0%; Top-100 72.5% (~8.5-fold enrichment)
  - >1000 actives (13%): Top-1 56.3%; Top-100 93.9% (~11-fold enrichment)
- Dependence on target ligand scaffold diversity:
  - At rank 15, success is ~65% for targets with 11–50 Murcko or Oprea scaffolds (~36% of proteins).
  - It drops to ~55% for targets with ≤10 scaffolds and rises to ~80% for those with >500 scaffolds (1–4% of proteins).
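The fold-enrichment figures above compare observed Top-k success against what random target picking would achieve. A simplified version of that calculation is sketched below, assuming one known target per query drawn without replacement from the 2,069 screened targets; the paper's exact enrichment definition may differ, so the numbers here are not expected to match the reported values.

```python
def random_topk_expectation(k, n_targets=2069, n_known=1):
    """Chance that at least one of k randomly picked targets is a known one.

    Picks are without replacement from n_targets, of which n_known are true
    targets of the query (simplifying assumption: n_known = 1).
    """
    p_miss = 1.0
    for i in range(k):
        p_miss *= (n_targets - n_known - i) / (n_targets - i)
    return 1.0 - p_miss

def enrichment(observed_rate, k, n_targets=2069):
    """Fold enrichment of an observed Top-k success rate over random."""
    return observed_rate / random_topk_expectation(k, n_targets)
```

Under this simplified model, random Top-1 success is 1/2069 (about 0.05%), so any Top-1 rate above a fraction of a percent already represents substantial enrichment.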
Discussion
The study demonstrates that a size-aware logistic model combining 3D shape and 2D structure similarity can accurately infer protein targets by reverse screening at large scale. The high Top-1 and Top-15 success rates on a vast, chemically diverse external set indicate strong enrichment over random and practical utility for guiding experimental efforts, including phenotypic hit deconvolution, polypharmacology profiling, and repurposing. Performance is robust across molecular sizes and remains solid even for chemotypes absent from training (Distinct test set), underscoring generalization beyond the training chemical space. Crucially, predictive ability improves with the breadth and diversity of known ligands per target, quantifying the value of expanding high-quality, diverse bioactivity data resources. The findings validate application-oriented benchmarking using large, non-overlapping datasets and support deploying such models in accessible web tools for routine use.
Conclusion
This work offers the first large-scale, application-oriented external validation quantifying the predictive power of ligand-based reverse screening for target inference. Using a ChEMBL-trained logistic model (ES5D + FP2) and a large external Reaxys test set, the method achieves 51% Top-1 and ~73% Top-15 accuracy, with robust performance across sizes and acceptable accuracy for novel scaffolds. Performance scales with the number and scaffold diversity of known ligands per target, emphasizing the importance of curating large, diverse, high-quality bioactivity datasets. The authors recommend that future development and benchmarking adopt application-oriented validation with large, high-quality, non-overlapping datasets and continue enriching databases in both quantity and diversity. Further improvements can come from expanding and cleaning bioactivity data, balanced similarity thresholds in practical tools, and continued integration into fast, user-friendly web platforms.
Limitations
- The external test set is restricted to targets shared with the ChEMBL training set (1,180 human proteins), a small fraction of all human targets with reported bioactivity, potentially limiting generalizability beyond these proteins.
- Only active compounds (≤10 μM) were included in the Reaxys test set; negative/ambiguous data were not used for testing.
- Performance decreases for small ligands (fewer heavy atoms) and for targets with few known ligands or low scaffold diversity, reflecting limited similarity partners for reverse screening.
- Similarity-based methods rely on the availability and quality of known actives; noisy or sparse data can hamper prediction.
- The logistic output is used as a ranking score rather than a calibrated probability of activity in this study.
- Chemical diversity, while substantial, still shows partial overlap between training and test sets; truly novel target classes or modalities are not assessed.