Diffusion-based generative AI for exploring transition states from 2D molecular graphs

S. Kim, J. Woo, et al.

TSDiff is a generative model that predicts transition state geometries directly from 2D molecular graphs, offering high accuracy and efficiency in characterizing reaction pathways. The research was conducted by Seonghwan Kim, Jeheon Woo, and Woo Youn Kim.

Introduction
Transition states (TSs) govern the kinetics and mechanisms of chemical reactions, but are difficult to observe directly and expensive to compute reliably. Conventional TS search methods (single-ended and double-ended algorithms) and recent ML approaches typically require carefully prepared 3D geometries of reactants and products, with appropriate orientations, making them sensitive to input conformations and alignment. The research question addressed here is whether TS geometries can be accurately and efficiently predicted directly from 2D molecular reaction graphs, bypassing 3D input preparation, while also enabling exploration of multiple TS conformations corresponding to distinct reaction pathways. The study introduces TSDiff, a stochastic diffusion-based generative model conditioned on 2D reaction graphs, to reduce user effort, mitigate input sensitivity, and improve the discovery of favorable low-barrier pathways.
Literature Review
The paper reviews two principal families of TS optimization: single-ended methods (e.g., Berny optimization, AFIR, ADDF, single-ended GSM) that start from a single 3D structure, and double-ended methods (e.g., NEB, double-ended GSM) that rely on both reactant and product 3D geometries. While widely used, these methods can be computationally costly and sensitive to initial structures, with frequent convergence challenges. Recent ML approaches have targeted barrier height prediction and, more relevantly, direct TS geometry prediction. Prior TS geometry ML models generally require aligned 3D reactant and product conformations and have shown promising results on datasets like Grambow’s gas-phase reactions and on specific reaction classes. A concurrent diffusion model (OA-ReactDiff) predicts the highest-energy NEB image but still uses the double-ended 3D setup. Across domains, ML models using 3D geometries are known to be input-sensitive. This motivates learning from 2D reaction information to avoid conformer/orientation dependence and enable broader, more robust TS exploration.
Methodology
TSDiff is a conditional generative diffusion model that learns the distribution of 3D TS geometries given 2D reaction information, encoded as a condensed reaction graph built from SMARTS/SMILES.

Input construction: Molecular graphs for reactants (G_R) and products (G_P) are derived from atom and bond information, and extended edges (within a 3-hop graph distance) augment the connectivity features. A condensed reaction graph G_rxn is formed by combining G_R and G_P via atom mapping, encoding bond changes and atomic features (e.g., aromaticity, formal charge, hybridization, valency, chirality, ring membership). During training and inference, a geometric reaction graph is created by attaching noisy 3D positions to nodes and connecting atom pairs within a radial cutoff, integrating bond, graph-distance, and spatial-distance features as edge inputs.

Learning and architecture: The forward diffusion adds Gaussian noise to TS geometries across discrete timesteps with noise schedules α_t and β_t. The model learns the reverse process p_θ(C_{t-1} | C_t, G_rxn) as a Gaussian with a predicted mean; training minimizes KL divergences, which is equivalent to a score-matching loss. To ensure SE(3) invariance/equivariance, the score is predicted in pairwise-distance space and mapped back to Cartesian coordinates via the chain rule. The denoiser is a GNN built from seven modified SchNet layers, with message passing based on radial basis kernels and edge-feature modulation. At inference, the model iteratively denoises over 5000 timesteps, starting from Gaussian noise, to yield TS geometries.

Training and data: The ωB97X-D3/def2-TZVP Grambow gas-phase organic reaction dataset (11,959 reactions after excluding two with non-reactive N2) was split 8:1:1; reverse reactions were included for data augmentation (19,132 training points). An ensemble of eight independently trained models was used; each trained for ~22 hours on a single RTX 2080 Ti GPU.
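The forward-noising and reverse-denoising scheme described above can be sketched in a few lines. This is a minimal illustration on a toy 1-D "geometry" rather than 3-D TS coordinates; the linear beta schedule, the 1000-step count, and the plain-noise parameterization are illustrative assumptions, not the paper's exact settings (TSDiff uses 5000 steps and a SchNet-based GNN denoiser operating in distance space).

```python
import math
import random

T = 1000  # toy number of diffusion timesteps (the paper uses 5000)

# Linear beta schedule (assumed for illustration); alpha_t = 1 - beta_t,
# and alpha_bar_t is the cumulative product used in the closed-form q(C_t|C_0).
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
prod = 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)

def forward_noise(x0, t):
    """q(C_t | C_0): corrupt a clean geometry with Gaussian noise at step t."""
    eps = random.gauss(0.0, 1.0)
    xt = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps  # (noisy geometry, true noise) pairs supervise the denoiser

def reverse_step(xt, t, eps_hat):
    """One step of p_theta(C_{t-1} | C_t): Gaussian with mean computed
    from the model's predicted noise eps_hat."""
    coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
    mean = (xt - coef * eps_hat) / math.sqrt(alphas[t])
    sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise at the final step
    return mean + sigma * random.gauss(0.0, 1.0)

# Inference starts from pure Gaussian noise and applies reverse_step T times,
# with eps_hat supplied by the learned, reaction-graph-conditioned denoiser.
x_noisy, eps_true = forward_noise(1.5, t=500)
```

In TSDiff the same recursion runs over atomic coordinates, with the score predicted in pairwise-distance space and mapped back to Cartesian coordinates for SE(3) invariance.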
At inference, multiple samples can be drawn per reaction graph to explore TS conformational diversity.

Evaluation and validation: Generative performance was assessed by coverage (COV) and matching (MAT) metrics based on the mean absolute error of interatomic distances (D-MAE). Chemical validity was evaluated with DFT-based saddle-point optimizations (Berny algorithm) and intrinsic reaction coordinate (IRC) calculations at the ωB97X-D3/def2-TZVP level in ORCA. Success criteria were convergence to a saddle point with a single imaginary frequency (below -100 cm^-1) and IRC connectivity to the intended reactants and products (checked for connectivity consistency with Open Babel). Additional analyses included clustering of generated conformations and identification of alternative pathways with different reaction coordinates.
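The D-MAE, COV, and MAT metrics above can be sketched as follows. Geometries are assumed to be lists of (x, y, z) atom positions with consistent atom ordering; the aggregation conventions shown (COV as the fraction of references matched within δ by at least one sample, MAT as the mean best D-MAE) follow the standard conformer-generation definitions and are an assumption about the paper's exact implementation.

```python
import itertools
import math

def d_mae(geom_a, geom_b):
    """Mean absolute error over all interatomic distances (in Å).

    Both geometries must list atoms in the same order.
    """
    pairs = list(itertools.combinations(range(len(geom_a)), 2))
    def dist(g, i, j):
        return math.dist(g[i], g[j])
    return sum(abs(dist(geom_a, i, j) - dist(geom_b, i, j))
               for i, j in pairs) / len(pairs)

def cov_and_mat(references, samples_per_ref, delta=0.1):
    """COV: fraction of references matched within delta by some sample.
    MAT: mean over references of the closest sample's D-MAE."""
    best = [min(d_mae(ref, s) for s in samples)
            for ref, samples in zip(references, samples_per_ref)]
    cov = sum(b < delta for b in best) / len(best)
    mat = sum(best) / len(best)
    return cov, mat

# Toy check: one reference TS with two generated samples, the first of which
# reproduces the reference exactly, so the best D-MAE is 0 and COV is 1.0.
ref = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
perturbed = [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0), (0.0, 1.2, 0.0)]
cov, mat = cov_and_mat([ref], [[ref, perturbed]])
# cov == 1.0 and mat == 0.0, since the first sample matches the reference
```

Drawing more samples per reaction can only lower each reference's best D-MAE, which is why the reported COV and MAT improve monotonically from 1 to 100 samples.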
Key Findings
- Generative coverage and accuracy: With the ensemble TSDiff, COV (δ = 0.1 Å) and MAT improved with more samples per reaction. 1 sample: COV 49.1%, MAT 0.137 Å; 3 samples: 67.3%, 0.096 Å; 5: 75.2%, 0.079 Å; 10: 84.0%, 0.063 Å; 100: 91.7%, 0.045 Å. At δ = 0.2 Å, COV rose to 73.9%, 87.5%, 92.6%, 95.6%, and 97.7% for 1, 3, 5, 10, and 100 samples, respectively.
- Accuracy versus prior ML models: Without conformer matching (single sample), TSDiff achieved a D-MAE of 0.137 Å, outperforming most prior models (Makoś 0.170 Å, Jackson 0.244 Å, Pattanaik 0.225 Å) and approaching Choi (0.095 Å). With conformer matching after saddle-point optimization, TSDiff achieved D-MAEs of 0.063 Å (1 sampling; 53.2% coverage of test reactions) and 0.067 Å (8 samplings; 84.6% coverage), both better than all reported baselines.
- Chemical validity and success rates: For 1197 test reactions, 8 generated geometries per reaction (9576 total) yielded 9289 successful saddle points (97.0%). IRC validation on one random TS per reaction succeeded for 998 reactions (83.4%). After excluding 95 reactions whose reference TS failed the IRC check (refined set: 1102 reactions), saddle-point success was 97.4% (8588/8816) and single-sample IRC success was 90.6%. Over five sampling rounds, refined-set success rates reached 99.9% (saddle point) and 98.5% (IRC), and coverage of correct TSs reached 98.6% (1087/1102 reactions).
- Diversity of TS conformations: For a single reaction sampled 100 times, all samples optimized to saddle points, yielding nine distinct TS conformations; the average D-MAE between generated and optimized structures was 0.045 Å.
- Discovery of alternative and improved pathways: Across the 1197 test reactions, optimization produced 3316 unique TS conformations, of which 2303 corresponded to saddle points different from the references. Among IRC-validated TSs, 309 had TS energies more than 0.1 kcal mol^-1 lower than the references. After further reactant optimization from the IRC endpoints, 513 pathways exhibited lower barrier heights than the reference. A case study showed a barrier 6.4 kcal mol^-1 lower due to significant conformational differences (e.g., a chair versus boat ring).
- Multiple reaction coordinates: TSDiff identified distinct pathways with different bond formation/breaking sequences or different migrating atoms despite sharing the same reaction graph, demonstrating its capacity to sample diverse, mechanistically distinct TSs.
- Efficiency: Although diffusion inference (5000 denoising steps) is iterative, per-reaction inference takes only seconds and is negligible compared to DFT optimization times.
Discussion
The study demonstrates that learning TS geometries directly from 2D reaction graphs effectively addresses input sensitivity and preparation burdens associated with 3D conformations and alignment. TSDiff’s stochastic diffusion formulation captures the multimodal distribution of TS conformations, enabling sampling of diverse, chemically valid TSs that often include lower-barrier pathways than those in reference datasets. High saddle-point and IRC validation success rates show TSDiff’s reliability as an initial TS guesser, reducing quantum-chemical trial-and-error. Its superior accuracy compared to prior 3D-input ML models, combined with the ability to uncover different reaction coordinates and conformers, supports its relevance for kinetics modeling, mechanism elucidation, and reaction discovery. The generative diversity also facilitates downstream selection via clustering, focusing expensive quantum calculations on representative candidates.
Conclusion
TSDiff, a diffusion-based generative model conditioned on 2D reaction graphs, accurately and efficiently predicts TS geometries without requiring 3D reactant/product inputs or alignment. It outperforms prior ML approaches in D-MAE accuracy, achieves high DFT validation success, and systematically explores multiple TS conformations and distinct reaction coordinates, frequently identifying lower-barrier pathways than references. These capabilities promise substantial reductions in user effort and computational cost for TS exploration and mechanistic analysis. Future work includes expanding training to larger and more diverse reaction datasets, especially inorganic and transition-metal reactions, integrating more physics-based constraints, and further optimizing sampling/denoising for faster inference.
Limitations
- Domain coverage: The current work focuses on organic gas-phase reactions (C, H, O, N) due to the availability of large, high-quality datasets; the lack of extensive inorganic/transition-metal TS datasets limits immediate generalization to those domains.
- Reference data constraints: The benchmark dataset provides one reference TS per reaction, so evaluating generative coverage of unseen conformers/pathways is constrained, and some reference TSs failed IRC checks, complicating comparisons.
- Computational aspects: Diffusion inference requires many denoising steps (5000), which, although much cheaper than DFT, is more than typical feed-forward ML predictors.
- Implicit assumptions: Inputs limited to 2D reaction graphs may under-specify certain stereochemical or long-range effects; quantum-chemical validation remains necessary to confirm chemical validity.