Autonomous design of new chemical reactions using a variational autoencoder

R. Tempke and T. Musho

This study by Robert Tempke and Terence Musho examines bias in current chemical reaction datasets and introduces AGoRaS, a variational autoencoder that generates 7 million synthetic reactions from roughly 7,000 training examples, broadening the diversity and applicability of data for molecular research.

Introduction
The study addresses the pervasive problem of inherited bias in chemical reaction datasets, which are often derived from published, optimized reactions and thus sparsely cover the continuous reaction space. The purpose is to create a method that can generate robust, less-biased synthetic datasets spanning wider regions of chemical reaction space to improve machine learning training and exploratory synthesis. The central hypothesis is that a variational autoencoder (VAE) can learn a compact latent representation—a "custom chemical compression intelligence"—from SMILES-encoded reactions, enabling efficient sampling of the latent space to autonomously generate new, chemically meaningful and balanced reactions beyond those observed in the training data. This approach contrasts with retrosynthetic methods by generatively proposing new reactions and species rather than decomposing known ones.
Literature Review
The paper surveys dataset bias in chemical reaction prediction (e.g., Jia et al.; Griffiths et al.; Kovács et al.), emphasizing how publication-driven datasets skew machine learning outcomes and limit generalizability. Prior work in reaction prediction includes machine learning for multistep mechanisms (Kayala & Baldi) and retrosynthesis with transformers and symbolic AI (Schwaller et al.; Segler et al.), but these approaches can inherit biases and often require manual curation. Generative models in other domains, notably GANs for medical records (MedGAN) and multi-categorical data, show promise for synthetic data generation but GANs are data- and tuning-intensive, and lack a usable latent space for sampling. VAEs have been shown to model long sequences and impute missing data, suggesting suitability for generating discrete and continuous chemical representations. Additional related efforts include latent space enrichment for molecular property prediction (Iovanac & Savoie), unsupervised feature construction for activation energy prediction (Zhang et al.), and Bayesian optimization for autonomous experiment design (Shields et al.), though these do not leverage a learned latent space to generate full reaction equations. The review motivates a VAE-based generative approach to create diverse, unbiased reaction datasets and enable targeted reaction discovery.
Methodology
Overview: The AGoRaS (Autonomous Generation of Reaction Systems) VAE framework learns from SMILES-encoded, balanced gas-phase reaction equations and generates new reactions by probabilistically sampling a continuous latent space. The workflow comprises data collection, species standardization to SMILES, reaction balancing checks, embedding and training, latent space sampling, and rigorous post-generation validation with cheminformatics and semi-empirical calculations.

Data collection: Reactions were sourced from the open-access NIST Chemical Kinetics Database. Two CSVs were used: (1) species metadata (~15,000 IDs with multiple nomenclatures) and (2) reactions defined by species IDs.

Conversion to SMILES: The multiple naming conventions were programmatically queried and converted to SMILES using the CADD Group Chemoinformatics Tools and PubChem. Entries with conflicting or failed conversions were discarded, and a tertiary validity check with RDKit ensured chemical stability/feasibility.

Preprocessing and filtering: The remaining species were mapped back to reaction equations; any reaction with a missing SMILES was removed, and equations were checked for atom balance. Reactions with more than three species on either side were removed to keep sequence lengths comparable and to facilitate VAE convergence. Character-level embedding was chosen over word-level embedding to allow generation of novel species; a universal character alphabet was constructed from all equations.

Model architecture (AGoRaS VAE): The input is a zero-padded character-level vector of length n (the maximum sequence length). A TensorFlow embedding layer projects inputs to a higher-dimensional space (embedding dimension 500). The encoder, a BiLSTM with recurrent dropout of 0.2, outputs a mean and log-variance for each latent dimension, and a reparameterization sampler draws from the Gaussian latent variables (latent dimension 350). The decoder uses a RepeatVector to feed an LSTM that outputs a length-n sequence; a softmax activation produces character probabilities. The loss combines sequence-to-sequence reconstruction with KL divergence (KL weight 0.1), optimized with Adam (learning rate 1e-5) at batch size 25 for 500 epochs. Hydrogens are implicit in SMILES. The data were split 70% training, 20% validation, and 10% test; reconstruction of validation sequences assessed the model's encoding/decoding fidelity.

Latent space sampling: After training, the encoder was detached and the decoder was driven with random samples from the learned Gaussian latent distributions to generate candidate reactions. Sampling was conducted extensively to explore the near-continuous latent space.

Post-generation validation pipeline: (1) duplicate removal; (2) atom-balance check; (3) RDKit filtering for chemical validity; (4) semi-empirical thermochemistry via a scripted Pipeline Pilot protocol using the VAMP package. Geometry optimization used NDDO/PM6 Hamiltonians with automatic spin-state selection (RHF, UHF, A-UHF) and Pulay/DIIS convergence (tolerance 2×10^-4), escalating to less restrictive settings when needed. Standard-state thermodynamic properties (enthalpy, entropy, total dipole) were computed per species, and reaction properties were then derived as ΔG_reaction = ΣG_products − ΣG_reactants with G = H − TS (as implemented); ΔS_reaction and Δμ (dipole difference) were computed analogously. Reactions with |ΔG| > 5 eV were discarded over stability concerns.
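To make the architecture concrete, below is a minimal TensorFlow/Keras sketch of a sequence-to-sequence VAE of the shape described above. The hyperparameters stated in the paper summary (embedding 500, latent 350, recurrent dropout 0.2, KL weight 0.1, Adam at 1e-5, batch 25, 500 epochs) are used directly; `vocab_size`, `max_len`, and the 256-unit LSTM width are illustrative assumptions, and the exact layer wiring of AGoRaS may differ.

```python
# Minimal sketch of a seq2seq VAE of the described shape; vocab_size,
# max_len, and the 256-unit LSTM width are assumptions, not from the paper.
import tensorflow as tf
from tensorflow.keras import layers, Model

vocab_size = 60    # size of the universal character alphabet (assumed)
max_len = 100      # n: length of the zero-padded equations (assumed)
emb_dim, latent_dim = 500, 350

# Encoder: character IDs -> embedding -> BiLSTM -> (mean, log-variance)
enc_in = layers.Input(shape=(max_len,), dtype="int32")
x = layers.Embedding(vocab_size, emb_dim, mask_zero=True)(enc_in)
x = layers.Bidirectional(layers.LSTM(256, recurrent_dropout=0.2))(x)
z_mean = layers.Dense(latent_dim)(x)
z_log_var = layers.Dense(latent_dim)(x)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)
def sample_z(args):
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(sample_z)([z_mean, z_log_var])

# Decoder: repeat z over n steps -> LSTM -> per-character softmax
dec_in = layers.Input(shape=(latent_dim,))
h = layers.RepeatVector(max_len)(dec_in)
h = layers.LSTM(256, return_sequences=True)(h)
dec_out = layers.TimeDistributed(
    layers.Dense(vocab_size, activation="softmax"))(h)
decoder = Model(dec_in, dec_out)   # kept standalone for later sampling

vae_out = decoder(z)
vae = Model(enc_in, vae_out)

# Loss: per-sequence reconstruction + KL divergence, KL weighted by 0.1
recon = tf.reduce_sum(
    tf.keras.losses.sparse_categorical_crossentropy(enc_in, vae_out),
    axis=-1)
kl = -0.5 * tf.reduce_sum(
    1.0 + z_log_var - tf.square(z_mean) - tf.exp(z_log_var), axis=-1)
vae.add_loss(tf.reduce_mean(recon + 0.1 * kl))
vae.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5))
# vae.fit(train_seqs, batch_size=25, epochs=500)

# After training: detach the encoder and decode random latent draws, e.g.
#   probs = decoder.predict(tf.random.normal((1000, latent_dim)))
# then take the argmax over the character axis and strip the padding.
```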
Visualization and analyses: Histograms compared the generated and training distributions of ΔG, ΔS, and Δμ, using a random subset of 7000 generated reactions to match the 7000 training reactions. t-SNE mapped the high-dimensional SMILES sequence embeddings to 2D to visualize coverage, with marker size proportional to ΔG. Targeted searches filtered for reactions containing CO2 or CH4 and examined their ΔG distributions.
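As a rough illustration of the t-SNE coverage plot, the sketch below projects fixed-length reaction encodings (e.g., the same zero-padded integer vectors used for the VAE) to 2D with scikit-learn and scales the generated-reaction markers by |ΔG|. The function name, feature choice, and plotting details are assumptions, not the authors' exact procedure.

```python
# Assumed setup: train_vecs and gen_vecs are fixed-length numeric encodings
# of reaction SMILES; gen_delta_g holds per-reaction ΔG values in eV.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_coverage(train_vecs, gen_vecs, gen_delta_g):
    """Overlay training and generated reactions in a shared 2D embedding."""
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(
        np.vstack([train_vecs, gen_vecs]))
    n = len(train_vecs)
    plt.scatter(xy[:n, 0], xy[:n, 1], s=10, alpha=0.5, label="training")
    plt.scatter(xy[n:, 0], xy[n:, 1],            # marker size ~ |ΔG|
                s=5 + 20 * np.abs(np.asarray(gen_delta_g)),
                alpha=0.3, label="generated")
    plt.legend()
    plt.title("t-SNE of reaction encodings")
    plt.show()
```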
Key Findings
- Scale of generation: From a core of ~7000 balanced gas-phase reactions (~2000 unique molecular species), AGoRaS generated 7,000,000 validated reactions (stopping criterion ≈1000× the training set size).
- Novelty and diversity: ~20,000 new molecular species were produced, including subspecies and completely novel species, often larger than those in the training data.
- Thermodynamics: Approximately 97% of generated reactions had ΔG within ±5 eV, despite no thermodynamic labels being used during training. Generated ΔG and ΔS histograms reproduced the tri-modal features inherited from the training data but, crucially, filled in between peaks and extended beyond the training range, indicating broader coverage of reaction space.
- Entropy trends: Generated reactions exhibited flat ΔS distributions in the extreme regions (< −400 eV and > 400 eV), consistent with the creation of molecular species larger than any in the training set on the reactant or product side.
- Dipole differences: Both the original and generated datasets peaked near zero Δμ, as expected for mostly neutral molecules. The generated set also contained cases with large negative Δμ (reactants with larger dipoles), suggesting the model can position ionic-tendency atoms farther from charge centers than seen in training.
- Latent space coverage: t-SNE visualizations showed that generated reactions filled gaps between training clusters and extended into new regions, with broader ΔG variation.
- Targeted generation utility: For reactions containing CO2, the training set held ~150 unique reactions versus ~6000 in the generated set; for CH4, ~700 (training) versus ~91,000 (generated). Filtering to |ΔG| ≤ 5 eV left substantially more thermodynamically plausible candidates for downstream analysis (see the filter sketch after this list).
- Example generated reactions with semi-empirical ΔG (eV): CO2 + 2H2 → CH2O + H2O (ΔG = 0.204); CH4 + 2H2O → 4H2 + CO2 (ΔG = 0.047); C3H6O3 → C2H6O + CO2 (ΔG = −0.486).
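The targeted search described above reduces to a simple filter over the generated set. Below is a minimal sketch assuming a hypothetical record layout of (reactant SMILES list, product SMILES list, ΔG in eV) per reaction; the helper names are illustrative, and only the RDKit calls are real API.

```python
# Hypothetical record layout: each reaction is a tuple of
# (reactant_smiles_list, product_smiles_list, delta_g_in_ev).
from rdkit import Chem

def contains_species(reaction, target_smiles):
    """True if the canonical target species appears on either side."""
    target = Chem.CanonSmiles(target_smiles)
    reactants, products, _ = reaction
    return any(Chem.CanonSmiles(s) == target for s in reactants + products)

def plausible_candidates(reactions, target_smiles, g_cut=5.0):
    """Filter to reactions containing the target with |ΔG| <= g_cut (eV)."""
    return [r for r in reactions
            if abs(r[2]) <= g_cut and contains_species(r, target_smiles)]

# Example: CO2 chemistry -> plausible_candidates(generated, "O=C=O")
#          CH4 chemistry -> plausible_candidates(generated, "C")
```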
Discussion
The findings support the hypothesis that a VAE can learn a smooth, informative latent representation of reactions from SMILES sequences and generate chemically meaningful, balanced reactions across a broader region of reaction space than is present in biased training data. AGoRaS reproduced key statistical features of the training set (e.g., tri-modal ΔG and ΔS distributions) while filling in gaps and extending beyond the observed ranges, thus mitigating sparsity and inherited bias. The ability to propose larger and more complex species, indicated by the entropy trends and validated via RDKit and semi-empirical calculations, shows the model's capacity to extrapolate beyond the molecular sizes seen during training. The predominance of near-zero Δμ confirms physical plausibility, while occasional large dipole differences reveal model-driven exploration of charge distributions not emphasized in the training data. t-SNE visualizations corroborate the improved coverage of, and extension beyond, the training distribution in the high-dimensional space. Practically, the approach enables targeted searches (e.g., CO2 or CH4 chemistry) with dramatically larger sets of plausible reactions for thermodynamic screening, potentially accelerating hypothesis generation and experimental planning. Whereas retrosynthesis tools navigate known chemistry, AGoRaS complements them by probing novel regions informed by learned chemical patterns.
Conclusion
AGoRaS demonstrates an autonomous, VAE-based method to generate vast numbers of balanced, chemically plausible reaction equations from limited, biased datasets. Trained on ~7000 gas-phase reactions, it produced 7,000,000 reactions and ~20,000 new species, many larger and more diverse than those in the training set. The generated reactions maintained physical plausibility (e.g., ~97% within |ΔG| ≤ 5 eV) and expanded coverage by filling gaps and extending beyond the training distribution, as evidenced by thermodynamic histograms and t-SNE mapping. The framework supports targeted reaction discovery, substantially increasing candidate reactions for specific species like CO2 and CH4. Future directions include: (1) iterative debiasing by analyzing residual biases and retraining with augmented synthetic data; (2) applying higher-fidelity quantum methods (e.g., DFT) to prioritize promising generated reactions; (3) extending beyond gas-phase to other phases and broader chemistries; (4) relaxing sequence-length constraints to include more complex stoichiometries; (5) integrating synthesizability metrics and experimental feedback to refine generative proposals.
Limitations
- Experimental validation: Generated reactions and species have not been experimentally synthesized; some molecules proposed by generative models may be impractical to make. Semi-empirical validation adds confidence but is not definitive.
- Computational approximations: Semi-empirical methods (e.g., VAMP with NDDO/PM6) trade accuracy for scale; more precise methods (e.g., DFT) were not feasible for 7 million reactions but would be needed for high-confidence screening.
- Residual bias: Synthetic data may inherit distributional biases from the training set; although AGoRaS fills gaps and extends coverage, complete debiasing is not guaranteed without iterative refinement.
- Dataset and scope constraints: Training was restricted to gas-phase reactions and to equations with ≤3 species per side to stabilize sequence lengths, potentially limiting complexity. Filtering removed reactions with |ΔG| > 5 eV, excluding extreme thermodynamics.
- Representation limits: Character-level SMILES encoding may miss stereochemistry or specific chemical constraints; implicit hydrogens and RDKit validation mitigate but do not eliminate representation issues.