Introduction
Predicting the primary and secondary macromolecular targets of therapeutic compounds is crucial in drug research. Concepts like polypharmacology, specificity, and drug repurposing rely heavily on accurate target prediction. Early-stage drug discovery involves evaluating massive numbers of molecules, making fast and robust bioinformatic models essential. While previous studies explored target and bioactivity prediction, the predictive ability of ligand-based reverse screening has been largely overlooked due to the difficulty of creating suitable external test sets. This study addresses this methodological gap by evaluating the predictive power of ligand-based reverse screening using an extensive, diverse, curated external bioactivity dataset.
Literature Review
Earlier research on target and bioactivity prediction, notably the work of Shoichet and colleagues on ligand-based reverse screening, demonstrated the potential of computational methods. However, these studies often lacked comprehensive external validation. Subsequent research focused on comparing machine-learning algorithms and stratifying bioactivity knowledge, but the actual predictive power remained under-investigated due to challenges in building suitable external test sets. This research aims to fill this gap by providing a robust evaluation using a large external dataset.
Methodology
The study employed a logistic model combining shape and chemical similarity, trained on data from the ChEMBL database. The model builds on the Similarity Principle, which the authors had quantified in earlier work. The training data consisted of 501,959 compounds with experimental bioactivity against 3,669 protein targets. Molecular descriptors, including ElectroShape 5D vectors (shape) and FP2 fingerprints (chemical structure), were used to represent the compounds. Pairwise comparisons generated similarity matrices, and 51 subsets were created based on the number of heavy atoms in the molecules. A binary logistic model was trained for each subset to predict target probability. The model's robustness was confirmed via 10-fold cross-validation using the Matthews correlation coefficient (MCC). External validation was performed using a Reaxys dataset of 364,201 small molecules, selected to be independent of the training set. The Reaxys molecules were reverse-screened against the ChEMBL screening set (a subset of ChEMBL compounds active on human proteins that are also present in Reaxys) to identify the most similar known actives. For each protein target, the highest similarity values (3D-Score and 2D-Score) were input into the logistic equation to calculate the probability that the query molecule acts on that target. Predictive ability was assessed by ranking the predicted targets by probability and comparing them with the known experimental targets. The study also examined the physicochemical and chemical spaces of the training and test sets, using several molecular descriptors and two scaffold definitions (Murcko and Oprea), to verify that the test set lies within the model's applicability domain and to assess chemical diversity.
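The scoring step described above can be illustrated with a minimal sketch. It assumes the logistic equation is a standard two-variable form, P = 1 / (1 + exp(-(a0 + a1 * 3D-Score + a2 * 2D-Score))). The function names, the data layout, and the coefficient values are hypothetical placeholders, not the fitted parameters from the study, which trained one model per heavy-atom subset.

```python
import math


def target_probability(score_3d, score_2d, intercept, w_3d, w_2d):
    """Two-variable binary logistic model (assumed form, hypothetical names)."""
    z = intercept + w_3d * score_3d + w_2d * score_2d
    return 1.0 / (1.0 + math.exp(-z))


def rank_targets(similarities, actives_by_target, coefficients):
    """Rank targets for one query molecule by predicted probability.

    similarities: known-active id -> (3D-Score, 2D-Score) against the query
    actives_by_target: target id -> list of known-active ids
    coefficients: (intercept, w_3d, w_2d), placeholder values in this sketch
    """
    intercept, w_3d, w_2d = coefficients
    ranked = []
    for target, actives in actives_by_target.items():
        # Keep the highest 3D and 2D similarity over the target's known actives.
        best_3d = max(similarities[a][0] for a in actives)
        best_2d = max(similarities[a][1] for a in actives)
        prob = target_probability(best_3d, best_2d, intercept, w_3d, w_2d)
        ranked.append((target, prob))
    # Highest probability first.
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


if __name__ == "__main__":
    # Toy similarity values and coefficients, for illustration only.
    sims = {"a1": (0.9, 0.8), "a2": (0.3, 0.2), "a3": (0.5, 0.1)}
    actives = {"T1": ["a1", "a2"], "T2": ["a3"]}
    for target, prob in rank_targets(sims, actives, (-4.0, 3.0, 3.0)):
        print(target, round(prob, 3))
```

In this sketch the best 3D and best 2D similarities for a target may come from different known actives. Whether the study pairs them this way is not stated in this summary, so treat that as an assumption as well.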
Key Findings
The study found that the machine-learning model assigned the highest probability to a correct target for 51.2% of the test compounds. This success rate far exceeded the 0.1% expected by random chance. The success rate rose to 72.9% when the top 15 predicted proteins were considered. Predictive ability remained relatively consistent across molecular sizes, with the most stable performance observed for molecules with 20–40 heavy atoms. Even when only compounds with exactly one experimental target were considered, the success rate at the highest probability was 49.7%. A subset of the test set (32,748 molecules, termed the 'Distinct test set'), containing compounds with scaffolds not found in the training set, still showed a 36.2% success rate at the highest probability and 51.8% within the top 5 ranks, demonstrating the model's robustness to chemical novelty. The study also found a strong correlation between predictive performance and the amount of bioactivity knowledge available for each protein target: targets with more known active compounds and greater chemical diversity among their ligands showed markedly higher success rates. This underlines the importance of populating databases with diverse, high-quality bioactivity data.
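The success rates reported above can be read as a top-k hit rate over the test compounds. The following sketch assumes that a compound counts as a success when at least one of its experimentally known targets appears among its k highest-ranked predictions. The function name and the toy data are hypothetical, and the study's exact scoring rule may differ.

```python
def top_k_success_rate(predictions, known_targets, k):
    """Fraction of compounds with a known target among their top-k predictions.

    predictions: compound id -> list of target ids, best-ranked first
    known_targets: compound id -> set of experimentally known target ids
    """
    if not predictions:
        raise ValueError("predictions must not be empty")
    hits = 0
    for compound, ranked in predictions.items():
        # A hit is any overlap between the top-k predictions and the known targets.
        if set(ranked[:k]) & known_targets[compound]:
            hits += 1
    return hits / len(predictions)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    predictions = {
        "c1": ["T1", "T2", "T3"],
        "c2": ["T2", "T1", "T3"],
        "c3": ["T3", "T2", "T1"],
        "c4": ["T2", "T3", "T1"],
    }
    known = {"c1": {"T1"}, "c2": {"T1"}, "c3": {"T1", "T2"}, "c4": {"T9"}}
    for k in (1, 2, 3):
        print(k, top_k_success_rate(predictions, known, k))
```

Under this definition the rate can only stay equal or increase as k grows, which matches the rise from the top-ranked prediction to the top 15 reported in the study.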
Discussion
The high predictive accuracy demonstrated in this study supports the use of ligand-based reverse screening for identifying likely drug targets. The findings are particularly relevant for applications such as deconvolution of phenotypic screens, the design of polypharmacology panels, and drug repurposing. The consistent performance across molecular sizes and the robustness to chemical novelty further strengthen the method's applicability. The observed relationship between predictive performance and the extent of bioactivity data highlights the importance of continued efforts to expand and diversify high-quality bioactivity databases. The methodology presented offers a practical way to assess the predictive ability of target prediction methods and provides insight into the factors that influence their performance.
Conclusion
This study provides the first large-scale validation of the predictive power of ligand-based reverse screening for drug target identification. The high success rate, even for chemically novel compounds, demonstrates the method's robustness and practical utility. Future research could focus on refining the model by incorporating additional data types or improving the representation of molecular features. Expanding and diversifying bioactivity databases will also be crucial for enhancing the method's predictive power and broadening its applicability.
Limitations
The study's reliance on existing databases introduces potential biases related to data coverage and quality. While efforts were made to ensure the independence of the training and test sets, some overlap might still exist at a deeper chemical level beyond the scaffold analysis. The use of a single machine-learning model may limit the generalizability of the findings. Further research may reveal whether other algorithms could achieve similar or improved results.