Introduction
De novo drug design is a crucial aspect of pharmaceutical research, aiming to discover novel drug candidates by computationally designing molecules with desired properties. Traditional high-throughput screening methods, while effective, are limited by the vastness of chemical space and the potential for high false positive rates. Generative deep learning models, such as chemical language models (CLMs), offer a promising alternative by generating focused virtual libraries of compounds on demand. CLMs learn from textual representations of molecules, typically SMILES strings, enabling them to generate new molecules that share structural features with known bioactive compounds. However, most previous CLM applications primarily focused on structural information and often lacked the ability to directly incorporate bioactivity data into the molecule generation process. This study addresses this gap by developing a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate bespoke molecules. The goal is to develop a method that can not only generate novel molecular structures but also predict their bioactivity, thus accelerating the drug discovery process and improving the efficiency of hit and lead identification.
Literature Review
Previous research has explored the use of generative deep learning models for de novo drug design, demonstrating success in creating focused virtual chemical libraries. CLMs, specifically, have shown promise in generating molecules with desired properties. Studies have reported the successful de novo design of bioactive molecules, including inhibitors of the vascular endothelial growth factor receptor 2 (VEGFR-2) and modulators of nuclear hormone receptors. These models are typically pretrained on large datasets of molecules to learn the SMILES grammar and feature distribution, then fine-tuned by transfer learning on a smaller set of molecules representing the target chemical space. However, these methods often do not integrate bioactivity data directly into the generation process, so separate bioactivity prediction and filtering steps are needed afterwards. This study aims to improve upon existing methods by incorporating bioactivity information directly into the CLM, resulting in a more targeted and efficient de novo design process. It also addresses a limitation of existing approaches: pretraining only on bioactivity data from databases like ChEMBL may not capture the physicochemical properties of approved drugs. This work therefore pretrains the CLM on patented molecules, hypothesizing that these are more likely to be developed into marketable drugs.
Methodology
The study employed a two-step process using distinct CLMs. The first CLM was responsible for de novo molecular generation, while the second CLM acted as a bioactivity predictor to refine the generated library. For molecule generation, a long short-term memory (LSTM) based CLM was pretrained on 839,674 molecules from the US patent database, aiming to capture structural features of approved drugs. Transfer learning was then performed using 46 PI3Kγ inhibitors (IC50 ≤ 100 nM) from the Drug Target Commons (DTC) database. Nucleus (top-p) sampling, which restricts each sampling step to the smallest set of most probable SMILES characters whose cumulative probability reaches a chosen threshold, was used to improve the quality and novelty of generated molecules. The bioactivity prediction CLM was also pretrained on a large dataset, exploring both autoregressive and ELECTRA pretraining strategies. The ELECTRA method aimed to improve the model's ability to distinguish subtle structural changes, which can drastically affect bioactivity. The resulting model, termed E-CLM, was then combined with a feedforward layer to perform ordinal classification, assigning each generated molecule to one of three bioactivity categories: inactive, moderately active, or highly active. To enhance confidence, a deep ensemble of 100 E-CLMs was used, with majority voting determining the final prediction. Commercially available compounds from the generated library were initially screened for PI3Kγ binding, followed by synthesis and testing of top-ranked molecules and their derivatives.
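The nucleus sampling step can be illustrated with a minimal sketch. It is not the study's implementation: the function names `nucleus` and `nucleus_sample`, the threshold value, and the toy next-character distribution are all assumptions made for illustration, and a real CLM would supply the probabilities from its softmax output.

```python
import random


def nucleus(probs, top_p):
    """Return the smallest set of tokens, ranked by probability,
    whose cumulative probability reaches top_p (hypothetical helper)."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    return kept


def nucleus_sample(probs, top_p, rng=random):
    """Sample one token from the nucleus of `probs` (token -> probability)."""
    kept = nucleus(probs, top_p)
    tokens = [t for t, _ in kept]
    weights = [p for _, p in kept]
    # random.choices treats the weights as relative, so the kept
    # probabilities are effectively renormalised within the nucleus.
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy next-character distribution (assumed values, not from the study).
next_char = {"C": 0.5, "c": 0.3, "N": 0.15, "(": 0.05}
kept_tokens = [t for t, _ in nucleus(next_char, top_p=0.85)]
sampled = nucleus_sample(next_char, top_p=0.85, rng=random.Random(0))
```

With this toy distribution and a threshold of 0.85, the low-probability character `(` falls outside the nucleus and can never be sampled, whereas temperature sampling would still select it occasionally.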
Key Findings
The study successfully generated a focused virtual chemical library of PI3Kγ ligands using a CLM-based approach. Compared with temperature sampling, nucleus sampling significantly improved the quality and novelty of the generated molecules, as measured by validity, uniqueness, novelty, and Fréchet ChemNet Distance. The hybrid CLM classifier, particularly the E-CLM model, effectively predicted the bioactivity of the generated molecules, outperforming the standard CLM in identifying highly active molecules while minimizing false positives. Deep ensemble learning further enhanced the prediction confidence, leading to the identification of several highly active compounds. Testing of commercially available compounds revealed a new PI3Kγ ligand (compound 1) with sub-micromolar activity. Furthermore, synthesis and testing of two top-ranked molecules (17 and 20) and their derivatives confirmed potent nanomolar inhibitory activity against PI3Kγ. Docking studies suggested key interactions with the kinase hinge residues Glu880 and Val882. In cell-based assays using human medulloblastoma cells, the most potent compounds (18 and 22) repressed PI3K-AKT signaling to a degree comparable to the pan-PI3K inhibitor copanlisib, without significant cytotoxicity. The study also demonstrated scaffold hopping, as many of the top-ranked molecules exhibited novel scaffolds compared to known PI3Kγ inhibitors. Pretraining on patented molecules, rather than solely on molecules with known bioactivity, may have improved the algorithm's efficacy, although this could not be proven definitively.
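Two of the library-quality metrics named above, uniqueness and novelty, can be sketched with plain set operations. The definitions below are common ones and are an assumption here, since exact definitions vary between papers. The sketch also assumes the SMILES strings are already valid and canonicalized; checking validity or computing the Fréchet ChemNet Distance requires a cheminformatics toolkit and is left out. The example molecules are arbitrary.

```python
def uniqueness(generated):
    """Fraction of generated SMILES strings that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Fraction of distinct generated SMILES absent from the training set."""
    unique = set(generated)
    if not unique:
        return 0.0
    return len(unique - set(training)) / len(unique)


# Arbitrary toy data, assumed to be canonical SMILES.
generated = ["CCO", "CCO", "c1ccccc1", "CCN"]
training = ["CCO"]

u = uniqueness(generated)          # 3 distinct strings out of 4
n = novelty(generated, training)   # 2 of the 3 distinct strings are new
```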
Discussion
The results of this study demonstrate the successful application of hybrid CLMs for de novo drug design, bridging the gap between structure-based and activity-based approaches. The integration of bioactivity data directly into the CLM generation process led to a more efficient and targeted search for novel ligands. The use of nucleus sampling and deep ensemble learning improved the quality and confidence of predictions, enabling the identification of highly active PI3Kγ inhibitors with novel scaffolds. The successful identification and validation of commercially available and subsequently synthesized potent PI3Kγ inhibitors demonstrate the practical applicability of the method. The findings showcase the method's potential for accelerating drug discovery by reducing the reliance on extensive experimental screening and improving hit-to-lead optimization. The success in both hit-finding and hit-expansion reinforces the versatility of the approach.
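The deep ensemble step comes down to majority voting over the class labels predicted by the individual models. The following is a minimal sketch of that idea, not the study's code: the function name `majority_vote`, the returned agreement fraction, and the rule of breaking ties toward the less active class are assumptions made for illustration.

```python
from collections import Counter

# The three ordinal classes, ordered from least to most active.
LABELS = ("inactive", "moderately active", "highly active")


def majority_vote(votes):
    """Return (winning label, fraction of models that voted for it)."""
    if not votes:
        raise ValueError("votes must not be empty")
    unknown = set(votes) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    counts = Counter(votes)
    top = max(counts.values())
    # Assumed tie-break: prefer the less active class, the cautious choice.
    winner = next(label for label in LABELS if counts[label] == top)
    return winner, top / len(votes)


# Hypothetical predictions from 100 ensemble members for one molecule.
votes = (["highly active"] * 70
         + ["moderately active"] * 20
         + ["inactive"] * 10)
label, agreement = majority_vote(votes)
```

The agreement fraction gives a rough confidence signal that could be used to rank molecules before purchase or synthesis.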
Conclusion
This study successfully demonstrated a novel computational pipeline for de novo drug design using hybrid CLMs. The integration of both structural and bioactivity information improved the efficiency and accuracy of the process. The findings highlight the potential of this approach to accelerate drug discovery and aid in both hit-finding and hit-to-lead expansion. Future studies should investigate the generalizability of the method to other drug targets and further explore the effect of different pretraining strategies and data augmentation techniques. The strategy of validating de novo designs using commercially available molecules before initiating synthesis is also suggested for future studies.
Limitations
While the study demonstrated the effectiveness of the method for PI3Kγ, further research is needed to assess its generalizability across different drug targets and target families. The reliance on commercially available compounds for initial validation might limit the exploration of chemical space. The accuracy of the bioactivity prediction model, especially for structurally similar molecules, could be further improved. The study also notes limitations inherent in deep learning methods, such as decreased performance on out-of-domain data. Despite these limitations, the study provided substantial proof-of-concept evidence supporting the effectiveness of the proposed method.