Introduction
Transition states (TSs) are central to understanding chemical reactions and reaction networks because each TS is the highest-energy point along a reaction pathway. Computational methods for locating TSs exist, but they are computationally expensive and often require precise chemical knowledge to construct an initial structure. Because these methods are highly sensitive to the initial structure, even experienced chemists may need many attempts to find a suitable starting point for quantum chemical calculations. This trial-and-error process, combined with the high computational cost, is a significant bottleneck. Machine learning (ML) offers a potential solution because it can learn patterns from data to predict complex properties. Several ML models have been proposed for TS structure prediction, but many are limited to specific reaction types or cannot accurately predict bond formation and breakage, which are crucial to TS characterization. This study aims to overcome these limitations by developing a more general and accurate ML model that predicts TS structures for diverse organic reactions.
Literature Review
Existing ML models for TS prediction include TSGen, which uses a graph neural network and internal optimization, and TSNet, which employs a tensor-field network. Although both models mathematically satisfy the requirements for determining TS structures, they often show limited performance and have difficulty predicting rare events such as bond formation or breakage. Models specialized for particular reaction types have also been proposed, but they have limited applicability outside their domain. Developing a general-purpose ML model that predicts TS structures accurately across diverse reaction types remains a significant challenge.
Methodology
The proposed ML model predicts the interatomic distances of a TS structure from three input structures: the reactant, the product, and their linear interpolation. Each structure is represented by atomic pair features that combine atomic numbers and interatomic distances. These features are processed by a novel "pair sequence interaction" (PSI) layer, which combines transformer encoders and bidirectional gated recurrent units (GRUs) to capture both intra- and inter-molecular interactions while maintaining permutation invariance and size extensivity. The model outputs interatomic distances normalized relative to the interpolated structure. These distances then serve as the target of a nonlinear optimization that generates the 3D atomic positions of the TS structure by minimizing the difference between the predicted and reconstructed interatomic distances. An ensemble approach that combines the predictions of multiple trained models (90 in this study) was used to improve robustness. Test-time augmentation (TTA), which also evaluates the reaction in the reversed direction, was employed to further enhance accuracy and eliminate directional bias. Accuracy was evaluated with the molecular mean absolute error (MAE) and the molecular mean absolute percentage error (MAPE). Quantum chemical calculations (ωB97X-D3/def2-TZVP) were performed to validate the predicted TS structures: saddle point optimizations and frequency calculations were checked for convergence, and intrinsic reaction coordinate (IRC) calculations verified the connectivity to reactants and products. To explore multiple reaction paths, normal mode sampling (NMS) was used to generate various reactant and product conformations, followed by ML inference and quantum chemical refinement. Clustering was employed to reduce the number of quantum chemical calculations required.
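The step that turns predicted interatomic distances into 3D coordinates can be illustrated with a small least-squares fit. The following is a minimal sketch, not the study's implementation: the function names, the unweighted residual, and the use of SciPy's `least_squares` solver are all assumptions made for illustration. In the described workflow the interpolated structure would be a natural starting geometry; here a perturbed methane-like geometry stands in for it.

```python
import numpy as np
from scipy.optimize import least_squares


def pairwise_distances(coords):
    """Return the interatomic distances for all unique atom pairs (i < j)."""
    i, j = np.triu_indices(len(coords), k=1)
    return np.linalg.norm(coords[i] - coords[j], axis=1)


def reconstruct_coordinates(target_distances, initial_coords):
    """Fit 3D coordinates whose pairwise distances match the targets.

    Assumption: plain unweighted least squares on the distance residuals.
    The source only states that the difference between predicted and
    reconstructed distances is minimized.
    """
    n_atoms = initial_coords.shape[0]

    def residuals(flat_coords):
        coords = flat_coords.reshape(n_atoms, 3)
        return pairwise_distances(coords) - target_distances

    result = least_squares(residuals, initial_coords.ravel())
    return result.x.reshape(n_atoms, 3)


# Illustrative input: a methane-like geometry (arbitrary length units).
reference = np.array([
    [0.00, 0.00, 0.00],
    [0.63, 0.63, 0.63],
    [-0.63, -0.63, 0.63],
    [-0.63, 0.63, -0.63],
    [0.63, -0.63, -0.63],
])
predicted_distances = pairwise_distances(reference)  # stand-in for ML output

# Start from a perturbed geometry, as a placeholder for an interpolated structure.
start = reference + 0.05 * np.sin(np.arange(15)).reshape(5, 3)
fitted = reconstruct_coordinates(predicted_distances, start)
print(np.max(np.abs(pairwise_distances(fitted) - predicted_distances)))
```

Because distances are invariant to rotation and translation, the fitted coordinates are determined only up to a rigid-body motion. Comparing structures therefore calls for comparing distances, or aligning the geometries first.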
Key Findings
The proposed model is more accurate than TSGen and TSNet, especially for atomic pairs undergoing bond formation or breakage. On the test set, the ensemble prediction achieved a molecular MAPE of 3.407% and a molecular MAE of 10.70 pm, significantly outperforming the comparison models (TSGen: MAPE 7.738%, MAE 22.46 pm; TSNet: MAPE 9.229%, MAE 24.37 pm). TTA reduced the errors slightly (MAPE 3.404%, MAE 10.69 pm), and nonlinear optimization improved them further to a MAPE of 3.083% and an MAE of 9.53 pm. Unlike the comparison models, the proposed model maintains high accuracy even for rare atomic pairs in the chemical bonding region. Quantum chemical validation showed a high success rate (93.8%) for saddle point optimizations, with 88.8% of the converged structures having energy errors below 0.1 kcal/mol. Exploring multiple reaction paths with NMS and ensemble prediction identified four distinct TS conformations for a challenging reaction, demonstrating the model's ability to uncover alternative reaction pathways. Even the cases with large energy errors led to the correct reactant and product structures in IRC calculations, which points to possible inaccuracies in the reference database rather than a failure of the ML model. The model's high performance is maintained even when it is trained on a significantly smaller dataset (25% of the original).
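The two error metrics reported above can be made concrete with a short sketch. The exact definition of the "molecular" MAE and MAPE is not given here, so the code assumes one plausible reading: errors are first averaged over the atom pairs of each structure and then averaged over structures. The function names and the toy distances are hypothetical.

```python
import numpy as np


def molecular_errors(predicted, reference):
    """Return (MAE, MAPE in percent) over the atom pairs of one structure."""
    predicted = np.asarray(predicted, dtype=float)
    reference = np.asarray(reference, dtype=float)
    abs_err = np.abs(predicted - reference)
    mae = abs_err.mean()
    mape = 100.0 * (abs_err / reference).mean()
    return mae, mape


def dataset_errors(structures):
    """Average the per-structure errors over (predicted, reference) pairs.

    Assumption: every structure carries equal weight regardless of its
    number of atom pairs.
    """
    per_structure = np.array([molecular_errors(p, r) for p, r in structures])
    mean_mae, mean_mape = per_structure.mean(axis=0)
    return mean_mae, mean_mape


# Toy interatomic distances in pm for two small structures.
toy_set = [
    ([110.0, 190.0], [100.0, 200.0]),
    ([150.0, 150.0, 300.0], [150.0, 150.0, 300.0]),
]
mae, mape = dataset_errors(toy_set)
print(f"molecular MAE = {mae:.2f} pm, molecular MAPE = {mape:.2f} %")
```

Averaging per structure first keeps large molecules, which contribute many atom pairs, from dominating the reported figure. A pooled average over all pairs would weight the data differently.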
Discussion
The results demonstrate the effectiveness of the proposed ML model for accurate and efficient prediction of TS structures in general organic reactions. The model's superior performance, particularly for bond formation/breakage prediction, significantly improves upon existing methods. The high success rate of quantum chemical optimization further validates the accuracy and reliability of the model's predictions. The capability to explore multiple reaction paths highlights the potential of this approach for broader applications in chemical reaction pathway discovery. The use of TTA and ensemble methods significantly mitigates the variance and limitations of a single model. While the study employed a well-curated database, the robustness of the model suggests that it can be adapted to other reaction types and datasets with potentially similar success.
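How TTA and ensembling reduce the bias and variance of a single model can be shown with a toy example. This is a sketch under stated assumptions: each "model" is a hypothetical callable that maps reactant and product distances to TS distances, and predictions are combined by a plain arithmetic mean. The source does not specify the combination rule or the model interface.

```python
import numpy as np


def ensemble_tta_predict(models, reactant, product):
    """Average forward and reversed-direction predictions over all models.

    Assumption: a TS is shared by the forward and reverse reaction, so a
    model evaluated with reactant and product swapped should predict the
    same distances. Averaging both directions removes directional bias.
    """
    predictions = []
    for model in models:
        predictions.append(model(reactant, product))  # forward direction
        predictions.append(model(product, reactant))  # reversed direction
    return np.mean(predictions, axis=0)


# Hypothetical stand-ins for trained models, each biased toward its first input.
def toy_model_a(first, second):
    return 0.6 * first + 0.4 * second


def toy_model_b(first, second):
    return 0.7 * first + 0.3 * second


reactant_d = np.array([100.0, 250.0, 180.0])  # toy distances in pm
product_d = np.array([220.0, 110.0, 180.0])

forward = ensemble_tta_predict([toy_model_a, toy_model_b], reactant_d, product_d)
backward = ensemble_tta_predict([toy_model_a, toy_model_b], product_d, reactant_d)
print(np.allclose(forward, backward))
```

By construction the combined prediction is identical for both reaction directions, which is the sense in which TTA eliminates directional bias. The ensemble average additionally smooths out differences between individually trained models.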
Conclusion
This study presents a novel ML model for predicting TS structures, showcasing significant improvements in accuracy and generalizability compared to existing methods. The model's high performance in predicting bond formation/breakage, coupled with its efficient exploration of multiple reaction paths, opens new avenues for accelerating reaction mechanism elucidation and catalyst design. Future research could focus on expanding the database to include more diverse reaction types and exploring the applicability of this approach to more complex reaction systems. Improving methods for sampling initial configurations for bimolecular reactions would also be beneficial.
Limitations
While the model shows high accuracy, the reliance on a pre-existing database with a potentially limited representation of chemical space needs to be considered. The accuracy of the reference data itself is a limitation, as highlighted by some inconsistencies found during IRC validation. Although the proposed model successfully identified multiple reaction paths, a comprehensive exploration of the entire reaction space is still computationally challenging. Further research is needed to explore strategies for more efficient and exhaustive exploration of chemical reaction pathways.