Introduction
Chemical space is vast, with an estimated 10^60 possible small molecules, and this presents a significant challenge in drug discovery. Chemical latent spaces, which project compound structures into mathematical spaces based on molecular features, are crucial for exploring this space and generating novel drug candidates. Natural compounds, which often have complex structures and high biological activity, are a rich source of potential drug leads. This study aims to develop a deep-learning method that can handle the complexity and size of natural compound datasets and generate novel, optimized drug candidates. Existing deep-learning models, including SMILES-based and graph-based VAEs, have difficulty with large, complex molecules and often produce invalid outputs. SMILES-based methods struggle to generate valid SMILES strings and require post-processing filtering. Graph-based methods generate valid structures but often cannot handle large molecules. Flow-based models guarantee 100% reconstruction accuracy but can suffer from high dimensionality and instability with large datasets. A method that can handle large, complex, and heterogeneous datasets is therefore needed for efficient drug discovery.
Literature Review
Several deep learning models have been developed for constructing chemical latent spaces. Variational Autoencoders (VAEs) are the most prominent and fall into SMILES-based and graph-based approaches. Early SMILES-based models such as CVAE generated many invalid chemical structures. Subsequent models such as GVAE and SD-VAE improved validity by enforcing SMILES grammar. More recent chemical language models (CLMs) built on recurrent networks such as LSTMs improved validity further, but they do not construct an explicit latent space. Graph-based models such as CG-VAE and JT-VAE represent molecules as graphs, which improves validity, but they struggle with large structures. HierVAE addressed this by handling large molecules in fragment units, but it still could not fully capture the stereochemistry and diverse internal structures of natural compounds. Flow-based models such as MoFlow guarantee 100% reconstruction accuracy but often produce high-dimensional latent spaces and can suffer from training instability. In short, existing methods cannot effectively handle large, complex, and heterogeneous structures such as natural compounds, which often have novel structures and high biological activity.
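For orientation, VAE models of the kind surveyed above are typically trained by maximizing the evidence lower bound (ELBO). The expression below is the standard textbook form, not a loss taken from any of the cited models. The symbols are notation introduced here for illustration: x is an input molecule, z its latent vector, q_phi the encoder, p_theta the decoder, and p(z) the prior.

```latex
\mathcal{L}(\theta, \phi; x)
  = \mathbb{E}_{q_\phi(z \mid x)}\bigl[\log p_\theta(x \mid z)\bigr]
  - D_{\mathrm{KL}}\bigl(q_\phi(z \mid x) \,\|\, p(z)\bigr)
```

The first term rewards accurate reconstruction of the input, and the second keeps the encoder's distribution close to the prior, which is what makes the latent space continuous enough to sample from and explore.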
Methodology
The researchers developed NP-VAE (Natural Product-oriented Variational Autoencoder), a graph-based VAE designed to handle large, complex molecules, including natural products with chirality. NP-VAE decomposes compound structures into fragment units and converts them into tree structures, which it encodes using Extended Connectivity Fingerprints (ECFP) and Tree-LSTM. The model has 12 million parameters and incorporates improvements over existing methods such as JT-VAE and HierVAE. The study had two main objectives: (1) to use NP-VAE to construct a highly interpretable chemical latent space that covers large molecular structures, drawing on DrugBank and a large collection of natural compounds; and (2) to generate novel compound structures optimized for target functions by exploring that latent space. The model was trained on a combined dataset of approximately 30,000 compounds from DrugBank and a project dataset rich in natural compounds. A restricted dataset was also created for comparison with existing methods. Model performance was evaluated using reconstruction accuracy, validity, uniqueness, novelty, logP, QED, SAscore, Filters, SNN, molecular weight, NP-likeness, and measures of fragment, scaffold, internal, and physicochemical diversity. Interpolation and Bayesian optimization were then used to explore the latent space and generate novel compound structures with optimized properties. Docking analysis with EGFR was performed to assess the potential efficacy of generated compounds as molecular-targeted drugs.
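Three of the evaluation metrics listed above (validity, uniqueness, novelty) can be illustrated with a minimal sketch. The definitions used here are the ones commonly found in molecular-generation benchmarks and are an assumption, since the summary does not state the exact formulas used in the study. The function name `generation_metrics` and the `is_valid` callback are hypothetical; in practice the callback would wrap a cheminformatics toolkit, and the strings would need to be canonicalized first.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Compute validity, uniqueness and novelty for generated molecule strings.

    Assumed definitions (not taken from the paper):
      validity   = valid molecules / all generated molecules
      uniqueness = distinct valid molecules / valid molecules
      novelty    = distinct valid molecules absent from training / distinct valid
    Strings are compared as-is, so they are assumed to be canonical already.
    """
    generated = list(generated)
    training_set = set(training)

    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - training_set

    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }
```

A call might look like `generation_metrics(samples, training_smiles, is_valid=my_parser_check)`, where `my_parser_check` is whatever validity test the user's toolkit provides. Passing the validity check as a parameter keeps the sketch free of third-party dependencies.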
Key Findings
NP-VAE outperformed existing methods. Its reconstruction accuracy on test compounds (0.813) was higher than that of HierVAE (0.799), JT-VAE (0.585), CG-VAE (0.424), and CVAE (0.215), and 100% of the structures it generated were chemically valid. As a generative model, NP-VAE performed stably across the evaluation metrics. MoFlow achieved 100% 2D reconstruction accuracy, but NP-VAE excelled in 3D reconstruction accuracy (0.853), which reflects its ability to handle stereochemistry. NP-VAE outperformed HierVAE and MoFlow on several key metrics, including QED (a measure of drug-likeness) and Filters (which eliminates undesired structures), indicating that its generated compounds are more drug-like. Incorporating NP-likeness scores during training produced a chemical latent space in which compounds clustered according to their NP-likeness, allowing targeted generation of natural product-like compounds. Interpolation between two chosen molecules in the latent space produced novel structures whose properties and structural similarity changed smoothly along the path. Bayesian optimization further improved the QED scores of generated compounds. Docking analysis of compounds generated from the latent space around existing drugs (e.g., Gefitinib, Osimertinib) revealed novel compounds with better binding scores than those drugs. The docking poses suggested interactions with key residues on EGFR, consistent with previous research on similar drugs. NP-VAE took longer to train than non-VAE methods but was faster than comparable VAE methods. The study also noted a tradeoff between NP-likeness and synthetic accessibility, which remains an important issue for future research.
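The interpolation experiment described above can be sketched as a walk between two latent vectors. This sketch assumes straight-line (linear) interpolation, which the summary does not confirm, and the function name `interpolate_latent` is hypothetical. Decoding each intermediate vector back into a molecule would require the trained NP-VAE decoder, which is not shown here.

```python
from typing import List, Sequence


def interpolate_latent(
    z_start: Sequence[float],
    z_end: Sequence[float],
    steps: int,
) -> List[List[float]]:
    """Return `steps` evenly spaced points on the line from z_start to z_end.

    Assumption: linear interpolation in latent space. The first and last
    points equal the two endpoints, so `steps` must be at least 2.
    """
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    if steps < 2:
        raise ValueError("steps must be at least 2 to include both endpoints")

    path = []
    for i in range(steps):
        t = i / (steps - 1)  # t runs from 0.0 (start) to 1.0 (end)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

Each point on the returned path would then be passed to a decoder to obtain a candidate structure. If the latent space is well organized, neighbouring points should decode to structurally similar molecules, which is the smooth transition the study reports.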
Discussion
NP-VAE addresses the limitations of existing methods by handling large, complex molecular structures and generating chemically valid, functionally optimized compounds. Its performance in reconstruction accuracy and in the generative model evaluations demonstrates its effectiveness for exploring the chemical latent space and designing novel drug candidates. The ability to incorporate functional information, such as NP-likeness, supports targeted generation of compounds with desired properties. Interpolation and Bayesian optimization further extend what NP-VAE can do in drug discovery. By exploring the latent space around known drugs, the study found novel compounds with better docking scores than those drugs, which highlights the potential of NP-VAE as an in silico drug discovery tool. The docking analysis also provides insight into the potential binding mechanisms of the generated compounds.
Conclusion
This study introduces NP-VAE, a VAE model for handling large and complex molecular structures, including natural compounds. NP-VAE constructs a chemical latent space that incorporates both structural and functional information, enabling efficient exploration and generation of novel drug candidates. Its performance in reconstruction accuracy, validity, and other generative model metrics, together with the successful use of interpolation and Bayesian optimization, establishes NP-VAE as a valuable tool in drug discovery. Future research should focus on improving the synthetic accessibility of generated compounds and on applying NP-VAE to other drug targets and therapeutic areas.
Limitations
Although NP-VAE improves on existing methods, some limitations remain. The computational cost of training is high, particularly compared with non-VAE methods. Some generated compounds, especially those with high NP-likeness scores, may be difficult to synthesize, which could hinder experimental validation. Further research is needed on strategies for improving the synthetic accessibility of generated molecules while preserving their desirable properties. The amount of available training data is also limited, which may affect how well the model generalizes.