
Chemistry
Machine learning assisted prediction of organic salt structure properties
E. P. Shapera, D. Bučar, et al.
This study by Ethan P. Shapera, Dejan-Krešimir Bučar, Rohit P. Prasankumar, and Christoph Heil introduces a machine learning approach that predicts the properties of relaxed crystal structures from their unrelaxed forms. The models can screen organic salt crystal structures and help identify promising candidates for crystal structure prediction workflows.
Introduction
Organic crystals are fundamental to many products, including pharmaceuticals, pesticides, and pigments, and hold promise in emerging technologies such as thin-film semiconductors, catalysts, and optoelectronics. A key challenge in materials design is engineering molecular crystals with specific properties, which involves identifying suitable molecular targets and predicting and controlling their crystal structures. Crystal Structure Prediction (CSP) of organic solids relies on advanced algorithms and considerable computational resources. The conformational flexibility of organic molecules leads to numerous polymorphs with closely spaced energy minima, which makes CSP exceptionally complex. As algorithms improve, increasingly complex organic systems become amenable to investigation, as highlighted by the recent Seventh CSP Blind Test. Current CSP approaches often require tens to hundreds of thousands of force field, molecular dynamics, or density functional theory (DFT) calculations to relax structures and evaluate their energies and properties. The large number of structures and their marginally different energies make identifying plausible structures challenging. Although organic crystal structure prediction has advanced, further improvements are needed in versatility, accuracy, affordability, and routine usability. This research develops a machine learning approach that reduces the number of energy calculations and thereby lowers the computational cost of CSP, which can otherwise exceed 100,000 CPU hours. Machine learning is a rapidly evolving tool for predicting organic crystal structure properties, but its success depends on the quality of the training data, the crystal representation, and the choice of algorithm. Existing studies have shown success but often face limitations, such as training data containing crystals that are not in energetic local minima or a reliance on computationally expensive ab initio methods for relaxation.
Literature Review
Numerous studies have successfully predicted properties of organic crystals using various machine learning approaches. However, common limitations include training data containing crystals that are not in energetic local minima, and the need to relax generated structures using computationally expensive ab initio methods like DFT. Related work by Honrao et al. employed support vector regression models for binary Al-Ni and Cd-Te systems using radial and angular distribution functions. Gibson et al. used crystal graph convolutional neural networks to predict formation energies from inorganic structures in the Materials Project Database. This work builds upon these efforts by introducing two key advancements: a crystal graph singular value representation for significantly reducing the number of descriptors, and the use of random forest models, known for their efficiency and reduced hyperparameter tuning compared to neural networks.
Methodology
This study develops a machine learning approach to predict the properties of DFT-relaxed organic crystals using randomly generated unrelaxed crystal structures. This allows for the downselection of promising structures for further, more expensive DFT calculations. The approach is broadly applicable and makes no assumptions about chemical composition, structure range, structure generation methods, or structure optimization techniques. The method involves generating random crystal structures using the AIRSS software package, specifying the numbers of protonated and unprotonated organic molecules and chloride ions (or other anions). Structures are initialized with volumes smaller than close-packed volumes to allow expansion during relaxation. The conjugate gradient algorithm in VASP (version 5.4.4) is used for simultaneous optimization of unit cell shape, molecular positions, orientations, and geometries. Three sets of descriptors are compiled for each crystal structure: crystal graph singular values, Coulomb matrix eigenvalues, and crystal structure parameters. The crystal graph singular values represent local chemical environments as graphs. A block diagonal matrix B is constructed from the crystal graph matrices for each atom, and its singular values are used as descriptors. This approach significantly reduces the number of descriptors compared to the full crystal graph representation. The Coulomb matrix captures Coulomb interactions between nuclei, and its eigenvalues are used as descriptors. Crystal structure descriptors include unit cell edge lengths and angles. Model extension is tested by incrementally adding structures from different chemical systems to the 1,3,5-triazine HCl training set. The models are evaluated using nested cross-validation, with the accuracy assessed using mean absolute error (MAE), mean absolute fractional error (MAFE), Spearman coefficient (ρ), and average precisions (AP). 
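The two matrix-based descriptor sets described above can be illustrated with a minimal NumPy sketch. It rests on several assumptions that are not confirmed by this summary: the Coulomb matrix is taken in its common non-periodic form (diagonal 0.5·Z^2.4, off-diagonal Z_i·Z_j/r_ij), the per-atom crystal graph matrices are passed in as ready-made arrays, and the function names and the fixed descriptor length `n_keep` are hypothetical.

```python
import numpy as np

def coulomb_matrix_eigenvalues(charges, positions):
    """Eigenvalues (descending) of a non-periodic Coulomb matrix.

    Assumed form: M_ii = 0.5 * Z_i**2.4 and M_ij = Z_i * Z_j / |R_i - R_j|.
    Periodic images are ignored in this sketch.
    """
    z = np.asarray(charges, dtype=float)
    r = np.asarray(positions, dtype=float)
    # Pairwise distances between all nuclei.
    dist = np.linalg.norm(r[:, None, :] - r[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # placeholder to avoid division by zero
    m = np.outer(z, z) / dist
    np.fill_diagonal(m, 0.5 * z ** 2.4)
    # The matrix is symmetric, so eigvalsh applies.
    return np.sort(np.linalg.eigvalsh(m))[::-1]

def block_diagonal_singular_values(blocks, n_keep):
    """Singular values of a block diagonal matrix B built from per-atom blocks.

    `blocks` stands in for the per-atom crystal graph matrices (hypothetical
    input). The result is truncated or zero-padded to `n_keep` entries so that
    every structure yields a descriptor vector of the same length.
    """
    blocks = [np.atleast_2d(np.asarray(b, dtype=float)) for b in blocks]
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    big = np.zeros((rows, cols))
    i = j = 0
    for b in blocks:
        big[i:i + b.shape[0], j:j + b.shape[1]] = b
        i += b.shape[0]
        j += b.shape[1]
    s = np.linalg.svd(big, compute_uv=False)  # returned in descending order
    out = np.zeros(n_keep)
    k = min(n_keep, s.size)
    out[:k] = s[:k]
    return out
```

Keeping only singular values, rather than every matrix entry, is what shrinks the descriptor count: a structure with many atoms contributes one number per retained singular value instead of one per graph edge.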
Random forest models are used due to their efficiency and minimal hyperparameter tuning.
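The regression metrics named in this section can be written down compactly. The sketch below is illustrative only: it assumes MAFE means the average of |predicted − true| / |true|, which this summary does not spell out, and it computes the Spearman coefficient from simple ranks without tie handling. Function names are hypothetical.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

def mafe(y_true, y_pred):
    """Mean absolute fractional error (assumed definition, see lead-in)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))

def spearman_rho(y_true, y_pred):
    """Spearman rank correlation; assumes no tied values in either input."""
    # A double argsort converts values to 0-based ranks.
    rank_true = np.argsort(np.argsort(y_true)).astype(float)
    rank_pred = np.argsort(np.argsort(y_pred)).astype(float)
    # Pearson correlation of the ranks is the Spearman coefficient.
    return float(np.corrcoef(rank_true, rank_pred)[0, 1])
```

For a screening step, the Spearman coefficient is arguably the most relevant of the three, because it measures whether structures are ordered correctly rather than how close each predicted value is.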
Key Findings
Descriptor selection was optimized by minimizing the training error of the random forest regressor models. An initial decision tree regressor was fit, and features with Gini importances greater than 0.1 times the average Gini importance were retained. The optimal number of descriptors varied with the target quantity (volume, enthalpy, phase), ranging from 45 to 274. The strongest correlations were observed between descriptors and DFT-calculated volumes. The models showed consistent performance in the interpolative regime, with testing and validation error distributions closely matching the fitting error distributions. The volume model achieved MAEs of 45 ų (fitting), 50 ų (validation), and 49 ų (testing). The enthalpy per atom model had MAEs of 0.044 eV/atom (fitting), 0.048 eV/atom (validation), and 0.047 eV/atom (testing), with significantly lower MAFE values than the volume model. Restricting the evaluation to the 1000 lowest-enthalpy structures in the testing set further improved the model's accuracy. The metal vs. insulator phase model performed well for the semiconductor/insulator phase but struggled with the minority metal phase, reflecting class imbalance. A comparison with CGCNN showed that the random forest regressor achieved lower MAE values with fewer structures and significantly faster fitting times. Model extension tests showed that adding 2,000 to 10,000 structures from new chemical systems improved model performance, although the Spearman coefficient decreased in the extrapolative regime, especially for piperidine HCl. The study highlighted the tradeoff between the number of added structures and model accuracy. Overall, the models perform well in the interpolative regime, while results in the extrapolative regime were mixed.
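The descriptor selection rule, keeping features whose importance exceeds 0.1 times the average importance, can be sketched as follows. This is an illustration, not the authors' code: the function name is hypothetical, and the importance vector is assumed to come from an already fitted decision tree (for example, the `feature_importances_` attribute of a scikit-learn tree, which is an assumption about tooling).

```python
import numpy as np

def select_descriptors(importances, factor=0.1):
    """Return indices of descriptors with importance > factor * mean importance.

    `importances` is assumed to be a 1-D array of per-descriptor importances
    taken from a fitted decision tree regressor.
    """
    importances = np.asarray(importances, dtype=float)
    threshold = factor * importances.mean()
    return np.flatnonzero(importances > threshold)

# Hypothetical importances for five descriptors (mean = 0.2, threshold = 0.02).
example = [0.5, 0.3, 0.15, 0.04, 0.01]
kept = select_descriptors(example)  # indices 0 to 3 pass, index 4 is dropped
```

Because the threshold scales with the mean importance, the number of retained descriptors adapts to each target quantity, which is consistent with the reported range of 45 to 274.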
Discussion
The developed machine learning models effectively predict properties of DFT-relaxed organic salt crystals based on unrelaxed structures. The use of crystal graph singular values and random forest algorithms significantly accelerates model construction and improves efficiency compared to existing methods like CGCNN. The models, while not stand-alone CSP algorithms, serve as valuable filtering steps in CSP workflows. The ability to rank structures based on predicted volume and enthalpy allows for the prioritization of structures likely to relax into favorable configurations. The consistent performance in the interpolative regime demonstrates the model's reliability within the training data space. However, the inconsistent performance in the extrapolative regime highlights the limitations of directly extrapolating to new chemical compositions. The decreased Spearman coefficients indicate a reduced ability to rank structures accurately outside the training data distribution. This limitation can be addressed in future work by expanding the initial training set and using advanced transfer learning techniques.
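The filtering role described above can be made concrete with a small sketch: rank unrelaxed structures by predicted enthalpy, pass only the lowest k to DFT relaxation, and check afterwards how many of the truly lowest-enthalpy structures were captured. The function names, the choice of k, and the recall check are illustrative assumptions and do not come from the study.

```python
import numpy as np

def shortlist(pred_enthalpy, k):
    """Indices of the k structures with the lowest predicted enthalpy."""
    order = np.argsort(np.asarray(pred_enthalpy, dtype=float))
    return order[:k]

def recall_of_lowest(true_enthalpy, pred_enthalpy, k):
    """Fraction of the k truly lowest-enthalpy structures found in the shortlist."""
    true_low = set(np.argsort(np.asarray(true_enthalpy, dtype=float))[:k].tolist())
    picked = set(shortlist(pred_enthalpy, k).tolist())
    return len(true_low & picked) / k

# Hypothetical per-atom enthalpies (eV/atom) for five candidate structures.
predicted = [0.30, -0.10, 0.20, -0.50, 0.00]
relaxed = [0.25, -0.20, 0.10, -0.40, 0.05]
candidates = shortlist(predicted, k=2)
```

A check of this kind depends only on rank order, which is why a drop in the Spearman coefficient in the extrapolative regime matters more for screening than a modest increase in MAE.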
Conclusion
This study presents a novel machine learning approach for accelerating CSP of organic salts. The use of crystal graph singular values and random forest models significantly reduces computational cost and improves efficiency. The models serve as a powerful filtering step in CSP workflows, enabling the efficient selection of promising candidates for further investigation. Future research should focus on expanding the training set to include a broader range of chemical systems and molecular flexibilities, improving extrapolation capabilities, and exploring integration with other CSP methods.
Limitations
The accuracy of the model is limited in the extrapolative regime, especially when predicting the properties of structures from new chemical systems substantially different from the initial training data. The model assumes that experimentally observable polymorphs are primarily determined by thermodynamic considerations, neglecting kinetic factors and solvent effects. The study primarily focused on small, rigid organic molecules; further investigation is needed to assess the model's performance with larger, more flexible molecules. Class imbalance in the phase prediction model affected the accuracy of metal phase prediction.