Autonomous design of new chemical reactions using a variational autoencoder

R. Tempke and T. Musho

This study by Robert Tempke and Terence Musho examines bias in current chemical reaction datasets and introduces AGoRaS, a variational autoencoder that generates 7 million synthetic reactions from roughly 7,000 training examples, broadening the diversity and applicability of data for molecular research.

Introduction
The study addresses the pervasive problem of inherited bias in chemical reaction datasets, which are often derived from published, optimized reactions and thus sparsely cover the continuous reaction space. The purpose is to create a method that can generate robust, less-biased synthetic datasets spanning wider regions of chemical reaction space to improve machine learning training and exploratory synthesis. The central hypothesis is that a variational autoencoder (VAE) can learn a compact latent representation—a "custom chemical compression intelligence"—from SMILES-encoded reactions, enabling efficient sampling of the latent space to autonomously generate new, chemically meaningful and balanced reactions beyond those observed in the training data. This approach contrasts with retrosynthetic methods by generatively proposing new reactions and species rather than decomposing known ones.
Literature Review
The paper surveys dataset bias in chemical reaction prediction (e.g., Jia et al.; Griffiths et al.; Kovács et al.), emphasizing how publication-driven datasets skew machine learning outcomes and limit generalizability. Prior work in reaction prediction includes machine learning for multistep mechanisms (Kayala & Baldi) and retrosynthesis with transformers and symbolic AI (Schwaller et al.; Segler et al.), but these approaches can inherit biases and often require manual curation. Generative models in other domains, notably GANs for medical records (MedGAN) and multi-categorical data, show promise for synthetic data generation but GANs are data- and tuning-intensive, and lack a usable latent space for sampling. VAEs have been shown to model long sequences and impute missing data, suggesting suitability for generating discrete and continuous chemical representations. Additional related efforts include latent space enrichment for molecular property prediction (Iovanac & Savoie), unsupervised feature construction for activation energy prediction (Zhang et al.), and Bayesian optimization for autonomous experiment design (Shields et al.), though these do not leverage a learned latent space to generate full reaction equations. The review motivates a VAE-based generative approach to create diverse, unbiased reaction datasets and enable targeted reaction discovery.
Methodology
Overview: The AGoRaS (Autonomous Generation of Reaction Systems) VAE framework learns from SMILES-encoded, balanced gas-phase reaction equations and generates new reactions by probabilistically sampling a continuous latent space. The workflow comprises data collection, species standardization to SMILES, reaction balancing checks, embedding and training, latent space sampling, and rigorous post-generation validation with cheminformatics and semi-empirical calculations.

Data collection: Reactions were sourced from the open-access NIST Chemical Kinetics Database. Two CSVs were used: (1) species metadata (~15,000 IDs with multiple nomenclatures) and (2) reactions defined by species IDs.

Conversion to SMILES: The multiple naming conventions were programmatically queried and converted to SMILES using the CADD Group Chemoinformatics Tools and PubChem. Entries with conflicting or failed conversions were discarded, and a tertiary validity check with RDKit ensured chemical stability/feasibility.

Preprocessing and filtering: The remaining species were mapped back to reaction equations; any reaction with a missing SMILES was removed, and equations were checked for atom balance. Reactions with more than three species on either side were removed to keep sequence lengths comparable and to facilitate VAE convergence. Character-level embedding was chosen over word-level embedding to allow generation of novel species; a universal character alphabet was constructed from all equations.

Model architecture (AGoRaS VAE): The input is a zero-padded character-level vector of length n (the maximum sequence length). A TensorFlow embedding layer projects inputs to a higher-dimensional space (embedding dimension 500). The encoder, a BiLSTM with recurrent dropout of 0.2, outputs a mean and log-variance for each latent dimension, and a reparameterization sampler draws from the Gaussian latent variables (latent dimension 350). The decoder uses a RepeatVector to feed an LSTM that outputs a length-n sequence; a softmax activation produces character probabilities. The loss combines sequence-to-sequence reconstruction with KL divergence (KL weight 0.1), optimized with Adam (learning rate 1e-5) at batch size 25 for 500 epochs. Hydrogens are implicit in SMILES. The data were split 70% training, 20% validation, and 10% test; reconstruction of validation sequences assessed the model's encoding/decoding fidelity.

Latent space sampling: After training, the encoder was detached and the decoder was driven with random samples from the learned Gaussian latent distributions to generate candidate reactions. Sampling was conducted extensively to explore the near-continuous latent space.

Post-generation validation pipeline: (1) duplicate removal; (2) atom-balance check; (3) RDKit filtering for chemical validity; (4) semi-empirical thermochemistry via a scripted Pipeline Pilot protocol using the VAMP package. Geometry optimization used NDDO/PM6 Hamiltonians with automatic spin-state selection (RHF, UHF, A-UHF) and Pulay/DIIS convergence (tolerance 2×10^-4), escalating to less restrictive settings when needed. Standard-state thermodynamic properties (enthalpy, entropy, total dipole) were computed per species, and reaction properties were then derived as ΔG_reaction = ΣG_products − ΣG_reactants with G = H − TS (as implemented); ΔS_reaction and Δμ (dipole difference) were computed analogously. Reactions with |ΔG| > 5 eV were discarded over stability concerns.
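To make the architecture concrete, below is a minimal TensorFlow/Keras sketch of a sequence-to-sequence VAE of the shape described above. The hyperparameters stated in the paper summary (embedding 500, latent 350, recurrent dropout 0.2, KL weight 0.1, Adam at 1e-5, batch 25, 500 epochs) are used directly; `vocab_size`, `max_len`, and the 256-unit LSTM width are illustrative assumptions, and the exact layer wiring of AGoRaS may differ.

```python
# Minimal sketch of a seq2seq VAE of the described shape; vocab_size,
# max_len, and the 256-unit LSTM width are assumptions, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 60    # size of the universal character alphabet (assumed)
max_len = 100      # n: length of the zero-padded equations (assumed)
emb_dim, latent_dim = 500, 350

# Encoder: character IDs -> embedding -> BiLSTM -> (mean, log-variance)
enc_in = layers.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(enc_in)
x = layers.Bidirectional(layers.LSTM(256, recurrent_dropout=0.2))(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
def sample_z(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: repeat z over n steps -> LSTM -> per-character softmax
dec_in = layers.Input(shape=(latent_dim,))
h = layers.RepeatVector(max_len)(dec_in)
h = layers.LSTM(256, return_sequences=True)(h)
dec_out = layers.TimeDistributed(
    layers.Dense(vocab_size, activation="softmax"))(h)
decoder = Model(dec_in, dec_out)   # kept standalone for later sampling

vae_out = decoder(z)
vae = Model(enc_in, vae_out)

# Loss: per-sequence reconstruction + KL divergence, KL weighted by 0.1
recon = tf.reduce_sum(
    tf.keras.losses.sparse_categorical_crossentropy(enc_in, vae_out),
    axis=-1)
kl = -0.5 * tf.reduce_sum(
    1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
vae.add_loss(tf.reduce_mean(recon + 0.1 * kl))
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5))
# vae.fit(train_seqs, batch_size=25, epochs=500)

# After training: detach the encoder and decode random latent draws, e.g.
#   probs = decoder.predict(tf.random.normal((1000, latent_dim)))
# then take the argmax over the character axis and strip the padding.
```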
Visualization and analyses: Histograms compared the generated and training distributions of ΔG, ΔS, and Δμ, using a random subset of 7000 generated reactions to match the 7000 training reactions. t-SNE mapped the high-dimensional SMILES sequence embeddings to 2D to visualize coverage, with marker size proportional to ΔG. Targeted searches filtered for reactions containing CO2 or CH4 and examined their ΔG distributions.
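As a rough illustration of the t-SNE coverage plot, the sketch below projects fixed-length reaction encodings (e.g., the same zero-padded integer vectors used for the VAE) to 2D with scikit-learn and scales the generated-reaction markers by |ΔG|. The function name, feature choice, and plotting details are assumptions, not the authors' exact procedure.

```python
# Assumed setup: train_vecs and gen_vecs are fixed-length numeric encodings
# of reaction SMILES; gen_delta_g holds per-reaction ΔG values in eV.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_coverage(train_vecs, gen_vecs, gen_delta_g):
    """Overlay training and generated reactions in a shared 2D embedding."""
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
        np.vstack([train_vecs, gen_vecs]))
    n = len(train_vecs)
    plt.scatter(xy[:n, 0], xy[:n, 1], s=10, alpha=0.5, label="training")
    plt.scatter(xy[n:, 0], xy[n:, 1],            # marker size ~ |ΔG|
                s=5 + 20 * np.abs(np.asarray(gen_delta_g)),
                alpha=0.3, label="generated")
    plt.legend()
    plt.title("t-SNE of reaction encodings")
    plt.show()
```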
Key Findings
- Scale of generation: From a core of ~7000 balanced gas-phase reactions (~2000 unique molecular species), AGoRaS generated 7,000,000 validated reactions (stopping criterion ≈1000× the training set size).
- Novelty and diversity: ~20,000 new molecular species were produced, including subspecies and completely novel species, often larger than those in the training data.
- Thermodynamics: Approximately 97% of generated reactions had ΔG within ±5 eV, despite no thermodynamic labels being used during training. Generated ΔG and ΔS histograms reproduced the tri-modal features inherited from the training data but, crucially, filled in between peaks and extended beyond the training range, indicating broader coverage of reaction space.
- Entropy trends: Generated reactions exhibited flat ΔS distributions in the extreme regions (< −400 eV and > 400 eV), consistent with the creation of molecular species larger than any in the training set on the reactant or product side.
- Dipole differences: Both the original and generated datasets peaked near zero Δμ, as expected for mostly neutral molecules. The generated set also contained cases with large negative Δμ (reactants with larger dipoles), suggesting the model can position ionic-tendency atoms farther from charge centers than seen in training.
- Latent space coverage: t-SNE visualizations showed that generated reactions filled gaps between training clusters and extended into new regions, with broader ΔG variation.
- Targeted generation utility: For reactions containing CO2, the training set held ~150 unique reactions versus ~6000 in the generated set; for CH4, ~700 (training) versus ~91,000 (generated). Filtering to |ΔG| ≤ 5 eV left substantially more thermodynamically plausible candidates for downstream analysis (see the filter sketch after this list).
- Example generated reactions with semi-empirical ΔG (eV): CO2 + 2H2 → CH2O + H2O (ΔG = 0.204); CH4 + 2H2O → 4H2 + CO2 (ΔG = 0.047); C3H6O3 → C2H6O + CO2 (ΔG = −0.486).
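The targeted search described above reduces to a simple filter over the generated set. Below is a minimal sketch assuming a hypothetical record layout of (reactant SMILES list, product SMILES list, ΔG in eV) per reaction; the helper names are illustrative, and only the RDKit calls are real API.

```python
# Hypothetical record layout: each reaction is a tuple of
# (reactant_smiles_list, product_smiles_list, delta_g_in_ev).
from rdkit import Chem

def contains_species(reaction, target_smiles):
    """True if the canonical target species appears on either side."""
    target = Chem.CanonSmiles(target_smiles)
    reactants, products, _ = reaction
    return any(Chem.CanonSmiles(s) == target for s in reactants + products)

def plausible_candidates(reactions, target_smiles, g_cut=5.0):
    """Filter to reactions containing the target with |ΔG| <= g_cut (eV)."""
    return [r for r in reactions
            if abs(r[2]) <= g_cut and contains_species(r, target_smiles)]

# Example: CO2 chemistry -> plausible_candidates(generated, "O=C=O")
#          CH4 chemistry -> plausible_candidates(generated, "C")
```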
Discussion
The findings support the hypothesis that a VAE can learn a smooth, informative latent representation of reactions from SMILES sequences and generate chemically meaningful, balanced reactions across a broader region of reaction space than is present in biased training data. AGoRaS reproduced key statistical features of the training set (e.g., tri-modal ΔG and ΔS distributions) while filling in gaps and extending beyond the observed ranges, thus mitigating sparsity and inherited bias. The ability to propose larger and more complex species, indicated by the entropy trends and validated via RDKit and semi-empirical calculations, shows the model's capacity to extrapolate beyond the molecular sizes seen during training. The predominance of near-zero Δμ confirms physical plausibility, while occasional large dipole differences reveal model-driven exploration of charge distributions not emphasized in the training data. t-SNE visualizations corroborate the improved coverage of, and extension beyond, the training distribution in the high-dimensional space. Practically, the approach enables targeted searches (e.g., CO2 or CH4 chemistry) with dramatically larger sets of plausible reactions for thermodynamic screening, potentially accelerating hypothesis generation and experimental planning. Whereas retrosynthesis tools navigate known chemistry, AGoRaS complements them by probing novel regions informed by learned chemical patterns.
Conclusion
AGoRaS demonstrates an autonomous, VAE-based method to generate vast numbers of balanced, chemically plausible reaction equations from limited, biased datasets. Trained on ~7000 gas-phase reactions, it produced 7,000,000 reactions and ~20,000 new species, many larger and more diverse than those in the training set. The generated reactions maintained physical plausibility (e.g., ~97% within |ΔG| ≤ 5 eV) and expanded coverage by filling gaps and extending beyond the training distribution, as evidenced by thermodynamic histograms and t-SNE mapping. The framework supports targeted reaction discovery, substantially increasing candidate reactions for specific species like CO2 and CH4. Future directions include: (1) iterative debiasing by analyzing residual biases and retraining with augmented synthetic data; (2) applying higher-fidelity quantum methods (e.g., DFT) to prioritize promising generated reactions; (3) extending beyond gas-phase to other phases and broader chemistries; (4) relaxing sequence-length constraints to include more complex stoichiometries; (5) integrating synthesizability metrics and experimental feedback to refine generative proposals.
Limitations
- Experimental validation: Generated reactions and species have not been experimentally synthesized; some molecules proposed by generative models may be impractical to make. Semi-empirical validation adds confidence but is not definitive.
- Computational approximations: Semi-empirical methods (e.g., VAMP with NDDO/PM6) trade accuracy for scale; more precise methods (e.g., DFT) were not feasible for 7 million reactions but would be needed for high-confidence screening.
- Residual bias: Synthetic data may inherit distributional biases from the training set; although AGoRaS fills gaps and extends coverage, complete debiasing is not guaranteed without iterative refinement.
- Dataset and scope constraints: Training was restricted to gas-phase reactions and to equations with ≤3 species per side to stabilize sequence lengths, potentially limiting complexity. Filtering removed reactions with |ΔG| > 5 eV, excluding extreme thermodynamics.
- Representation limits: Character-level SMILES encoding may miss stereochemistry or specific chemical constraints; implicit hydrogens and RDKit validation mitigate but do not eliminate representation issues.