Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity

Chemistry

T. Ochiai, T. Inukai, et al.

Discover the groundbreaking NP-VAE, a novel deep-learning approach that excels in managing large molecular structures, particularly natural compounds with chirality. This innovative method not only achieves remarkable reconstruction accuracy but also enables the generation of novel compounds with enhanced functions. Research conducted by Toshiki Ochiai, Tensei Inukai, Manato Akiyama, Kairi Furui, Masahito Ohue, Nobuaki Matsumori, Shinsuke Inuki, Motonari Uesugi, Toshiaki Sunazuka, Kazuya Kikuchi, Hideaki Kakeya, and Yasubumi Sakakibara.

Introduction
The chemical space of potential small molecules is astronomically large (estimated at ~10^60 for molecules ≤500 Da), making structural diversity within compound libraries critical for discovering new drugs. Chemical latent spaces project molecular structures into mathematical spaces based on features (e.g., fingerprints), enabling representation and exploration of library diversity as well as generation of novel structures. Natural products, often complex, chiral, and structurally distinct from conventional drugs, frequently exhibit high bioactivity and have historically yielded many therapeutics. State-of-the-art generative AI can leverage heterogeneous data from approved-drug databases and natural-product libraries to design novel structures within the design-make-test-analyze (DMTA) cycle. Prior VAE-based molecular models include SMILES-based approaches (CVAE, GVAE, SD-VAE, and chemical language models built on RNNs), which suffer from invalid string generation and, in many language-model-based variants, lack explicit, navigable latent spaces. Graph-based VAEs (CG-VAE, JT-VAE) generate valid graphs; JT-VAE improves reconstruction by using junction-tree representations but is designed for small molecules. HierVAE scales to larger structures via motif-level generation but struggles with stereochemistry and with highly diverse, complex natural-product-like molecules. Flow-based models (e.g., MoFlow) guarantee perfect reconstruction via invertible transformations but yield high-dimensional latent spaces and unstable training for large inputs, hindering exploration. Consequently, existing models have difficulty with large, complex, and heterogeneous molecules, particularly natural products. This study introduces NP-VAE, a graph-based VAE designed to handle such complexity, including chirality, and to integrate functional information into the latent space. The aims are: (1) to construct an interpretable chemical latent space encompassing large and natural-product-like molecules for comprehensive analysis; and (2) to generate novel structures optimized for target properties by navigating the latent space, including docking-based in silico validation.
Literature Review
SMILES-based VAEs such as CVAE pioneered latent-space construction but suffer from invalid SMILES outputs that require post hoc filtering. Grammar-informed models (GVAE, SD-VAE) improve validity by enforcing syntactic constraints. Chemical language models (CLMs) built on RNNs/LSTMs and transfer learning can capture SMILES grammar but often do not form explicit, invertible latent spaces suitable for exploration. Graph-based VAEs (CG-VAE, JT-VAE) represent molecules as graphs; JT-VAE uses junction trees for higher reconstruction accuracy but targets small molecules and struggles with larger, more complex structures. HierVAE extends scalability via hierarchical motifs and performs well on large, repeating structures, but it handles stereochemistry and the diverse, complex architectures typical of natural products inadequately. Flow-based models (normalizing flows; MoFlow) ensure 100% reconstruction via invertible mappings but produce latent spaces whose dimensionality equals that of the input, limiting efficient exploration, and exhibit training instability (e.g., gradient explosion) as input size grows. Collectively, these methods leave gaps in handling large, chiral, heterogeneous molecules and in integrating functional properties into latent representations.
Methodology
Model: NP-VAE (Natural-Product Compound Variational Autoencoder) is a graph-based VAE (≈12M parameters) combining: (1) preprocessing that decomposes molecules into chemically meaningful fragments (e.g., aromatic rings, amide and carboxyl groups), forming a loop-free tree representation; (2) fragment-level features via ECFP fingerprints with chirality; (3) a Child-Sum Tree-LSTM that encodes the fragment tree; and (4) a decoder that sequentially generates the fragment tree (depth-first) and assembles the 2D structure, followed by stereochemistry assignment using a predicted ECFP with chirality.
- Preprocessing: Molecules are fragmented and mapped to a tree T whose nodes are substructures; edges connect neighboring fragments. Each node stores its substructure's ECFP. The mapping simplifies large molecules and extracts meaningful physicochemical features; chirality is preserved.
- Encoder: A Tree-LSTM computes node embeddings; in parallel, an ECFP for the whole molecule is transformed via fully connected layers. The root embedding and global features are combined, and the latent vector z is sampled via reparameterization from learned μ and σ (a fragment-fingerprinting sketch appears at the end of this section).
- Decoder: Seven procedures: (i) root-label prediction (initial fragment); (ii) topological prediction (add child vs. backtrack); (iii) bond prediction (bond type between fragments); (iv) label prediction (child-fragment identity); (v) z-update at each step using current node features; (vi) conversion of the generated fragment tree into a unique 2D molecular graph; and (vii) chirality assignment by predicting a target ECFP (sigmoid output) and selecting the stereoisomer whose ECFP is closest to the prediction.
- Learning: A multi-task loss combines cross-entropy terms for the decoder tasks (root, topology, label, bond), KL divergence, binary cross-entropy for the chirality/3D ECFP prediction, and MSE for property prediction when functional information is included (a loss-composition sketch appears at the end of this section). During training, even if decoding diverges, teacher forcing with the correct structure allows backpropagation through the tree to stabilize learning. A property-prediction head f(z) is trained jointly (e.g., on NP-likeness or activity values), incorporating functional gradients into the latent space.
Data and baselines:
- Datasets: A drug-and-natural-product dataset (~30,000 compounds) combining ~10,000 DrugBank entries and ~20,000 natural-product-rich structures from a project collection. A restricted dataset was derived by filtering to <100 heavy atoms and removing compounds that caused HierVAE errors, enabling fair baseline comparisons. The size ranges highlight the large-molecule coverage (up to 551 heavy atoms; MW up to 8272) versus typical ZINC small molecules (up to ~38 heavy atoms; MW ≤500).
- Baselines: For reconstruction/generalization, CVAE, CG-VAE, JT-VAE, and HierVAE, as reported in prior work. For generative benchmarking, HierVAE (VAE), MoFlow (flow-based), and SM-RNN (a CharRNN over SMILES).
Evaluation protocols:
- Generalization (reconstruction/validity): Evaluated on the St. John et al. dataset split (76k train / 5k validation / 5k test), following prior HierVAE work. Reconstruction accuracy was estimated by Monte Carlo sampling (10 encodings × 10 decodings per test molecule; 100 samples each); validity by decoding 100 times from each of 1,000 random latent samples and checking validity with RDKit (a protocol sketch appears at the end of this section).
- Generative benchmarks: Five sets of 5,000 latent vectors were sampled from N(0,1) per model, and metrics from the MOSES/GuacaMol suites were computed: Uniqueness, Novelty, logP, QED, SAscore, Filters, SNN, MolWt, NP-likeness, fragment/scaffold distribution similarity (Frag/Scaf), internal diversity (IntDiv), and physicochemical KL divergence (Phys div). 2D and 3D reconstruction accuracies are reported for models with encoders; only NP-VAE supports 3D via chirality.
Latent space analyses and optimization:
- Two latent spaces were trained on the full drug-and-natural-product dataset: (a) structure only; (b) structure plus functional information (NP-likeness). They were visualized via t-SNE, and the correlation between latent distance and NP-likeness difference was quantified.
- Interpolation: Linear scans between the latent vectors of two drugs generate smooth transitions in structure and NP-likeness (see the interpolation sketch at the end of this section).
- Bayesian optimization: TPE-based search around a seed compound to maximize QED under similarity constraints, exploring narrow versus wider neighborhoods (see the optimization sketch at the end of this section).
Docking and assay:
- EGFR assay: Membrane fraction from A431 cells; inhibition of EGF-stimulated phosphorylation was measured with [γ-32P]ATP, with AG1478 as the positive control.
- Docking: Schrödinger Glide SP against EGFR (PDB 2ITY, chain A). The protein was prepared with the Protein Preparation Wizard and ligands with LigPrep (tautomers, 3D); a 20 Å cubic grid was centered on the original ligand, and the best Glide SP score was recorded. NP-VAE-generated compounds (from the vicinity of Gefitinib/Osimertinib) were compared to SMILES-RNN-based REINVENT v3.0 virtual compounds (rewarded by QEPPI).
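NP-VAE's exact fragmentation rules are not reproduced here. As an illustrative stand-in, the sketch below uses RDKit's BRICS decomposition and computes chirality-aware Morgan (ECFP-style) fingerprints for each fragment, the kind of per-node feature the encoder consumes; the function name and parameter choices are assumptions for illustration only.

```python
# Illustrative sketch only: NP-VAE's own fragmentation rules (aromatic rings,
# amide/carboxyl groups, etc.) are not reproduced here; BRICS serves as a
# stand-in decomposition to show chirality-aware fragment fingerprints.
from rdkit import Chem
from rdkit.Chem import AllChem, BRICS

def fragment_fingerprints(smiles, radius=2, n_bits=2048):
    """Decompose a molecule and return chirality-aware ECFP bit vectors per fragment."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    # BRICS returns fragment SMILES with dummy atoms marking the cut points.
    fragments = sorted(BRICS.BRICSDecompose(mol)) or [Chem.MolToSmiles(mol)]
    fps = []
    for frag_smiles in fragments:
        frag = Chem.MolFromSmiles(frag_smiles)
        if frag is None:
            continue
        # useChirality=True folds stereochemical information into the fingerprint,
        # mirroring NP-VAE's use of ECFP-with-chirality node features.
        fp = AllChem.GetMorganFingerprintAsBitVect(
            frag, radius, nBits=n_bits, useChirality=True)
        fps.append((frag_smiles, fp))
    return fps

# Example with a small chiral molecule (L-alanine).
for frag, fp in fragment_fingerprints("C[C@H](N)C(=O)O"):
    print(frag, fp.GetNumOnBits())
```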
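The multi-task objective described under Learning could be composed as in the following PyTorch sketch. The term weights, head names, and tensor shapes are illustrative assumptions, not NP-VAE's released code.

```python
# Minimal PyTorch sketch of the loss composition described above; weights and
# shapes are illustrative assumptions rather than NP-VAE's actual implementation.
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def np_vae_style_loss(decoder_logits, decoder_targets,
                      ecfp_logits, ecfp_target,
                      prop_pred, prop_target,
                      mu, logvar, beta=1.0, gamma=1.0, delta=1.0):
    """Multi-task objective: decoder cross-entropies + KL + BCE (chirality ECFP) + MSE (property)."""
    # Cross-entropy over the decoder's classification heads
    # (root label, topology, fragment label, bond type).
    recon = sum(F.cross_entropy(logits, target)
                for logits, target in zip(decoder_logits, decoder_targets))
    # KL divergence between q(z|x) = N(mu, sigma^2) and the prior N(0, I).
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))
    # Binary cross-entropy against the chirality-aware ECFP bit vector.
    ecfp_bce = F.binary_cross_entropy_with_logits(ecfp_logits, ecfp_target)
    # MSE for the property-prediction head (e.g., NP-likeness), when used.
    prop_mse = F.mse_loss(prop_pred, prop_target)
    return recon + beta * kl + gamma * ecfp_bce + delta * prop_mse
```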
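The Monte Carlo reconstruction and validity protocol might be scripted as below, assuming a trained model exposing hypothetical encode/decode methods; canonical-SMILES comparison is used here as one reasonable criterion for an exact 2D reconstruction.

```python
# Sketch of the reconstruction/validity protocol; `model.encode`/`model.decode`
# and the latent dimensionality are hypothetical stand-ins.
import numpy as np
from rdkit import Chem

def reconstruction_accuracy(model, test_smiles, n_encode=10, n_decode=10):
    hits, total = 0, 0
    for smiles in test_smiles:
        target = Chem.MolToSmiles(Chem.MolFromSmiles(smiles))  # canonical form
        for _ in range(n_encode):
            z = model.encode(smiles)            # stochastic (reparameterized) encoding
            for _ in range(n_decode):
                out = model.decode(z)
                mol = Chem.MolFromSmiles(out) if out else None
                hits += int(mol is not None and Chem.MolToSmiles(mol) == target)
                total += 1
    return hits / total

def validity(model, latent_dim=256, n_points=1000, n_decode=100):
    valid, total = 0, 0
    for _ in range(n_points):
        z = np.random.randn(latent_dim)         # sample from the N(0, I) prior
        for _ in range(n_decode):
            out = model.decode(z)
            valid += int(bool(out) and Chem.MolFromSmiles(out) is not None)
            total += 1
    return valid / total
```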
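Linear interpolation between two compounds' latent vectors is straightforward once such a model is available; the sketch below again assumes hypothetical encode/decode interfaces.

```python
# Sketch of the linear latent-space scan described above; `model.encode` and
# `model.decode` are hypothetical interfaces of a trained NP-VAE-like model.
import numpy as np

def interpolate_latent(model, smiles_a, smiles_b, n_steps=10):
    """Decode molecules along the straight line between two latent vectors."""
    z_a = np.asarray(model.encode(smiles_a))  # latent vector of the first compound
    z_b = np.asarray(model.encode(smiles_b))  # latent vector of the second compound
    molecules = []
    for alpha in np.linspace(0.0, 1.0, n_steps):
        z = (1.0 - alpha) * z_a + alpha * z_b  # linear scan through latent space
        molecules.append(model.decode(z))      # e.g., a NAD derivative toward Sorafenib
    return molecules
```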
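For the TPE-based local optimization, a minimal sketch using Optuna (one concrete TPE implementation; the paper does not prescribe a library) could maximize QED around a seed while enforcing a Tanimoto similarity constraint. The latent dimensionality, search radius, and trial count are assumptions.

```python
# Sketch of TPE-based local optimization around a seed compound; the model
# interfaces, latent_dim, radius, and n_trials are illustrative assumptions.
import numpy as np
import optuna
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, QED

def optimize_around_seed(model, seed_smiles, radius=1.0, min_similarity=0.6,
                         latent_dim=256, n_trials=200):
    z_seed = np.asarray(model.encode(seed_smiles))
    seed_fp = AllChem.GetMorganFingerprintAsBitVect(
        Chem.MolFromSmiles(seed_smiles), 2, nBits=2048)

    def objective(trial):
        # Perturb each latent coordinate within a neighborhood of the seed.
        offset = np.array([trial.suggest_float(f"d{i}", -radius, radius)
                           for i in range(latent_dim)])
        smiles = model.decode(z_seed + offset)
        mol = Chem.MolFromSmiles(smiles) if smiles else None
        if mol is None:
            return -1.0  # penalize invalid decodes
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
        if DataStructs.TanimotoSimilarity(seed_fp, fp) < min_similarity:
            return -1.0  # outside the allowed similarity range
        return QED.qed(mol)  # maximize oral drug-likeness

    study = optuna.create_study(direction="maximize",
                                sampler=optuna.samplers.TPESampler())
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```

Tightening min_similarity (e.g., 0.6) keeps characteristic motifs of the seed, while relaxing it (e.g., 0.2) permits larger structural changes, mirroring the narrow-versus-wide neighborhood searches described above.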
Key Findings
- Generalization (St. John dataset test set): NP-VAE achieved the highest reported 2D reconstruction accuracy among VAEs while maintaining perfect validity: 0.813 reconstruction / 1.000 validity. Baselines: HierVAE 0.799/1.000; JT-VAE 0.585/1.000; CG-VAE 0.424/1.000; CVAE 0.215/0.931.
- Generative performance (restricted dataset; sampled from the N(0,1) prior):
  - 2D/3D reconstruction (training-set projection): NP-VAE 2D 0.871, 3D (with chirality) 0.853; HierVAE 2D 0.438; MoFlow 2D 1.000 (by invertibility; no 3D); SM-RNN N/A.
  - Diversity/quality metrics (mean±sd): Uniqueness 0.981±0.003; Novelty 0.983±0.002; QED 0.670±0.025 (highest among compared models); SAscore 2.378±0.551 (lower indicates easier synthesis); Filters 0.824±0.002 (highest pass rate); logP 2.255±1.795; SNN 0.484±0.003; MolWt 255.98±85.64; NP-likeness −0.758±0.832; Frag 0.951±0.001; Scaf 0.444±0.015; IntDiv 0.877±0.001; Phys div 0.676±0.007. SM-RNN showed low novelty (~0.223), indicating memorization.
- Large-molecule handling: The drug-and-natural-product dataset includes compounds with up to 551 non-H atoms and MW up to 8272, far exceeding typical ZINC-like small molecules; NP-VAE trained and generated successfully on such data where existing methods could not.
- Functional latent space: Incorporating NP-likeness into training produced a latent space with clearer gradients/clustering by NP-likeness (Pearson r between latent distance and NP-likeness difference increased from 0.14 to 0.19) and tighter clustering of anticancer drug classes, especially molecular-targeted drugs.
- Discovery/annotation: Yessotoxin, located near molecular-targeted drugs in the functional latent space, exhibited weak EGFR inhibitory activity (>80% inhibition at 100 µg/mL vs. the EGF-stimulated control), supporting the hypothesis suggested by latent-space proximity.
- Interpolation: Linear scans between a NAD derivative and Sorafenib yielded structurally gradual transitions with decreasing NP-likeness toward the targeted drug, illustrating continuous, interpretable latent trajectories.
- Bayesian optimization around seeds: With similarity ≥0.6, mean QED improvement was 0.046 (sd 0.074), preserving characteristic motifs (e.g., the Octreotide and Paclitaxel cases); with similarity ≥0.2, mean QED improvement was 0.538 (sd 0.022), allowing larger structural changes.
- Docking-guided design (EGFR): ~10,000 NP-VAE compounds were generated near Gefitinib/Osimertinib; ~5,700 surpassed Osimertinib's docking score and ~1,600 surpassed Gefitinib's. Top-scoring compounds often retained kinase-inhibiting heterocycles (e.g., quinazoline/pyrimidine) and introduced additional moieties that improved predicted binding. Docking poses showed interactions consistent with known EGFR inhibitor contacts (e.g., Met793; additional contacts at Arg841/Asn842 and Met790/Cys797/Ala743).
- Computational cost per training epoch (restricted dataset; Tesla P100, 16 GB): NP-VAE 1233.7±3.84 s; HierVAE 1895.5±51.73 s; MoFlow 305.1±0.61 s; SM-RNN 48.8±0.12 s.
Discussion
NP-VAE addresses two core challenges in de novo molecular design for complex chemotypes: (1) representing and generating large, heterogeneous, chiral molecules (e.g., natural products), and (2) constructing a navigable latent space enriched with functional information. By fragmenting molecules into chemically meaningful units and encoding them via a Tree-LSTM with ECFP (including chirality), NP-VAE achieves high 2D and 3D reconstruction, indicating a faithful, continuous representation suitable for exploration. Compared with prior VAEs, NP-VAE improves generalization reconstruction accuracy and, unlike SMILES language models, supplies an explicit latent space that supports interpolation, property-guided optimization, and local exploration around known drugs. Incorporating functional properties (e.g., NP-likeness) into training shapes the latent space to cluster molecules by desired attributes, facilitating targeted searches for candidates. The discovery that Yessotoxin sits near molecular-targeted drugs and exhibits EGFR inhibition exemplifies the model’s potential for annotation and hypothesis generation from latent proximity. Optimization experiments demonstrate that local Bayesian exploration can tune oral drug-likeness (QED) while preserving scaffold features when constrained, or achieve larger gains when broader similarity ranges are permitted. Docking studies around EGFR inhibitors show that latent-space sampling near known actives yields many candidates with improved predicted binding, often recapitulating key interactions while adding novel contacts, suggesting utility for lead diversification and potency improvement. Overall, NP-VAE combines structural and functional learning to enable both statistical analyses of complex libraries and practical generation of candidate structures for downstream in silico screening.
Conclusion
The study presents NP-VAE, a variational autoencoder tailored for large and complex molecular structures, including natural products, with explicit handling of chirality. NP-VAE constructs an interpretable chemical latent space that integrates structural and functional information, achieves superior reconstruction and robust generative performance across benchmarks, and enables latent-space exploration to generate novel molecules with desired properties. Applications demonstrated include interpolation between drugs, Bayesian optimization to enhance drug-likeness while controlling similarity, and docking-driven identification of EGFR inhibitor candidates with improved predicted binding. The framework supports comprehensive analysis of heterogeneous compound libraries and facilitates in silico selection of drug candidates. Future work includes validating synthesizability and bioactivity experimentally, refining property-conditioning and optimization strategies, and extending to additional targets and property types.
Limitations
- Synthetic accessibility: High NP-likeness correlates with higher SAscore (more difficult synthesis), so many generated natural-product-like structures may be challenging to synthesize. The proposed strategy of searching latent space near complex natural products for simpler, smaller, equally bioactive analogs requires further experimental validation.
- Computational efficiency: Although faster per epoch than HierVAE, NP-VAE remains slower than non-VAE baselines owing to its sequential LSTM components, which can become a bottleneck for very large datasets.
- Benchmark constraints: Some existing models could not be run on the full large-molecule dataset, necessitating a restricted dataset for head-to-head comparisons; cross-model comparisons are therefore partly limited to smaller molecules.
- Validation scope: Many findings rely on in silico metrics (e.g., QED, docking). Broader experimental validation beyond the EGFR assay of Yessotoxin is needed to confirm predicted activities and developability.