Chemistry
Machine learning assisted prediction of organic salt structure properties
E. P. Shapera, D. Bučar, et al.
This research by Ethan P. Shapera, Dejan-Krešimir Bučar, Rohit P. Prasankumar, and Christoph Heil introduces a machine learning approach that predicts the properties of crystal structures after relaxation directly from their unrelaxed forms. The models screen organic salt crystal structures efficiently, facilitating the identification of promising candidates for crystal structure prediction workflows.
~3 min • Beginner • English
Introduction
The study aims to accelerate crystal structure prediction (CSP) for organic salts by predicting properties of DFT-relaxed structures directly from unrelaxed, randomly generated crystal structures. Organic crystals underpin many applications (pharmaceuticals, agrochemicals, pigments, electronics), but CSP is challenging due to conformational flexibility and dense polymorphic landscapes with small energy differences. Contemporary CSP typically requires tens to hundreds of thousands of expensive relaxations (force fields, MD, or DFT), making it difficult and costly to identify plausible structures. The authors propose ML models that map unrelaxed structures to relaxed properties—volume, enthalpy per atom, and electronic phase (metal vs. semiconductor/insulator)—to filter out thermodynamically unfavorable candidates before expensive calculations. The approach is intended to reduce computational cost and improve workflow efficiency without imposing chemical or structural assumptions, aiming for broad applicability to organic (and potentially inorganic) crystals. The work focuses on small-ring organic molecules (1,2,3-triazine, 1,2,4-triazine, 1,3,5-triazine, pyridine, thiophene, piperidine) forming salts with HCl (and HBr in one extension), exploring performance both within the base chemical space and when extending to related compositions.
Literature Review
Prior ML efforts for predicting properties of organic crystals have used diverse representations and models, but face trade-offs: training sets may include non-minimum structures or require costly relaxations. Honrao et al. used support vector regression with radial/angular distribution function features to predict properties for binary inorganic systems (Al–Ni, Cd–Te) from unrelaxed structures. Gibson et al. trained crystal graph convolutional neural networks (CGCNN) on inorganic structures to predict formation energies of relaxed structures. Xie and Grossman introduced crystal graph descriptors enabling CGCNNs with high accuracy and interpretability; these have been widely used but produce very high-dimensional representations. The present study extends this literature with two contributions: (1) a compact crystal-graph singular value (CGSV) representation that reduces descriptor count by orders of magnitude compared to full crystal graphs, and (2) the use of random forests, which require fewer hyperparameters and avoid iterative backpropagation, offering faster training while maintaining accuracy. The work also positions the models as filters within CSP workflows rather than standalone predictors, aligning with recent trends toward integrating ML with force-field and ab initio methods.
Methodology
Data generation: Random crystal structures were generated using AIRSS. Each input specified counts of protonated and unprotonated organic molecules, with the number of Cl− (or other anions) matching the number of protonated molecules. Molecules, held rigid and planar ("2D-fixed"), were arranged in approximately close-packed, symmetry-unconstrained unit cells initialized with volumes below the close-packed volume to encourage expansion upon relaxation. Structure relaxations were performed with VASP 5.4.4 using the conjugate gradient algorithm, optimizing cell shape, molecular positions/orientations, and intramolecular geometries simultaneously. While the ML method is general, each trained model depends on the specific structure generation and relaxation protocols.
Targets: Models predict (i) relaxed unit cell volume V, (ii) enthalpy per atom h (used to compute relative enthalpic stability ΔH), and (iii) electronic phase (metal vs. semiconductor/insulator) of the relaxed structure.
Datasets: Base models used four CSP runs for 1,3,5-triazine HCl (total 89,949 randomly generated structures). Model extension was tested by incorporating data from additional CSP runs: Extension A (1,3,5-triazine HBr; 1,2,3-triazine HCl; 1,2,4-triazine HCl), Extension B (pyridine HCl), and Extension C (thiophene HCl, piperidine HCl). Thiophene HCl was included as a likely high-energy, non-protonating case; piperidine HCl introduces conformational flexibility (only one conformer used to generate starting structures).
Descriptors: Three sets were compiled per structure. (1) Crystal Graph Singular Values (CGSV): For each atom, construct a local crystal graph matrix C_i (12×41 per atom per prior work), assemble block-diagonal B = C_1 ⊕ C_2 ⊕ … ⊕ C_a, and take the singular values of B as descriptors; this reduces representation from thousands to <300 values per structure. (2) Coulomb matrix eigenvalues: C_ij = 2Z_i Z_j / r_ij for i≠j and 2Z_i^2 for i=j; descriptors are sorted eigenvalues plus counts of positive/negative eigenvalues, trace Tr(C)=Σ 2Z_i^2, and determinant det(C). Tr(C) uniquely encodes stoichiometry and serves as a categorical composition identifier. (3) Basic crystallographic parameters: cell edges (a,b,c) and angles (α,β,γ). The descriptors remain applicable to 3D and flexible molecules despite the current focus on rigid, flat molecules.
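The Coulomb-matrix descriptor set above is fully specified by the formulas given, so it can be sketched directly. The function below is an illustrative implementation (the function name and return layout are ours, not the authors'), using the paper's convention C_ij = 2 Z_i Z_j / r_ij off-diagonal and 2 Z_i^2 on the diagonal:

```python
import numpy as np

def coulomb_descriptors(Z, R):
    """Coulomb-matrix descriptors for one structure.

    Z : (n,) atomic numbers
    R : (n, 3) Cartesian coordinates (Å)
    Returns the sorted eigenvalues plus the summary quantities used in the
    paper: counts of positive/negative eigenvalues, trace, and determinant.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    dists = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    C = np.zeros((n, n))
    off = ~np.eye(n, dtype=bool)
    C[off] = 2.0 * np.outer(Z, Z)[off] / dists[off]   # C_ij = 2 Z_i Z_j / r_ij
    np.fill_diagonal(C, 2.0 * Z**2)                    # C_ii = 2 Z_i^2
    eig = np.sort(np.linalg.eigvalsh(C))[::-1]         # C is symmetric
    return {
        "eigenvalues": eig,
        "n_positive": int(np.sum(eig > 0)),
        "n_negative": int(np.sum(eig < 0)),
        "trace": float(np.trace(C)),        # = Σ 2 Z_i^2, encodes stoichiometry
        "determinant": float(np.linalg.det(C)),
    }
```

Note that the trace depends only on the atomic numbers present, which is why it can serve as the categorical composition identifier described above.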
Modeling: Random forest regressors/classifiers (scikit-learn style) were used. Descriptor selection proceeded by first fitting a depth-20 decision tree regressor on 506 initial descriptors, ranking features by Gini importance, and keeping those with importance > 0.1×average importance. Hyperparameters (e.g., maximum tree depth, minimum samples for split) were selected via nested cross-validation. Classifier models for phase underwent grid search over max depth (8–20) and min samples split (5–20), optimizing mean Average Precision (mAP) on precision-recall curves; best found: max depth 10, min split 10.
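The two-stage modeling recipe above (rank the 506 descriptors with a depth-20 decision tree, keep those above 0.1× the mean importance, then fit a random forest) maps directly onto scikit-learn. The sketch below uses synthetic data and illustrative hyperparameters, not the authors' actual datasets:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 50))   # stand-in for the descriptor matrix
y = 3.0 * X[:, 0] + X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.1, size=2000)

# Stage 1: rank descriptors with a depth-20 decision tree and keep those
# whose importance exceeds 0.1x the average importance.
tree = DecisionTreeRegressor(max_depth=20, random_state=0).fit(X, y)
imp = tree.feature_importances_
keep = np.where(imp > 0.1 * imp.mean())[0]

# Stage 2: fit a random forest on the retained descriptors only.
rf = RandomForestRegressor(n_estimators=100, max_depth=20,
                           min_samples_split=10, random_state=0)
rf.fit(X[:, keep], y)
```

In practice the depth and split thresholds would be chosen by the nested cross-validation described above rather than fixed as here.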
Validation: Nested cross-validation with separate fitting, validation, and testing sets; 10 random fitting/validation splits. Metrics: For regressors, MAE, MAFE (mean absolute fractional error), and Spearman correlation ρ; for classifiers, per-class Average Precision (AP) and mAP. Descriptor count optimization evaluated error vs. number of retained descriptors and ρ (e.g., volume model minimized test MAE at ~70 descriptors).
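The regression metrics named above have simple closed forms; the sketch below (with made-up volume values in Å^3) shows how MAE, MAFE, and Spearman ρ would be computed:

```python
import numpy as np
from scipy.stats import spearmanr

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_pred - y_true)))

def mafe(y_true, y_pred):
    """Mean absolute fractional error: mean of |y_pred - y_true| / |y_true|."""
    return float(np.mean(np.abs(y_pred - y_true) / np.abs(y_true)))

# Illustrative DFT-relaxed volumes vs. ML predictions
y_true = np.array([500.0, 480.0, 650.0, 530.0])
y_pred = np.array([540.0, 470.0, 600.0, 555.0])

rho, _ = spearmanr(y_true, y_pred)  # rank correlation: how well the model orders candidates
```

Spearman ρ is the key metric for screening: a model can have modest MAE yet still rank candidates correctly, which is all the downselection step requires.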
Extension protocol: From the four 1,3,5-triazine HCl runs, data were split into fitting/validation/testing. For extension, N structures from a new CSP run were added to the fitting set with weighting (weight 4 for base, 1 for added run), and remaining extension structures went to testing. Performance was reported separately on base vs. added-system test sets.
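The 4:1 weighting in the extension protocol maps onto scikit-learn's `sample_weight` argument. The sketch below uses synthetic arrays purely for illustration of the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X_base, y_base = rng.normal(size=(800, 20)), rng.normal(size=800)  # base CSP runs
X_new, y_new = rng.normal(size=(200, 20)), rng.normal(size=200)    # N added structures

X_fit = np.vstack([X_base, X_new])
y_fit = np.concatenate([y_base, y_new])
# Weight 4 for base-system structures, 1 for the added run.
w = np.concatenate([np.full(len(y_base), 4.0), np.ones(len(y_new))])

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_fit, y_fit, sample_weight=w)
```

The weighting keeps the large base dataset dominant while still letting the added chemistry shift split decisions where it matters.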
Key Findings
- Descriptor efficiency: CGSVs reduce descriptor count from thousands (e.g., 5412 for a 1 f.u. CH4N3Cl cell) to fewer than 300. Feature selection typically retained 60–70 descriptors for volume, ~110–122 for enthalpy, and ~252–274 for phase (Table 1), relatively insensitive to which extension set was added, except thiophene/piperidine enthalpy models needing fewer descriptors (56 and 45).
- Correlation structure: Of 506 candidate descriptors, 380 had |ρ|≥0.8 with V_DFT; only 6 had |ρ|≤0.2. For h and phase, no descriptors had |ρ|≥0.8, explaining the need for more features and yielding lower ρ than for volume.
- Base 1,3,5-triazine HCl models:
• Volume regressor: MAE ≈ 45 Å^3 (fit), 50 Å^3 (val), 49 Å^3 (test); test ρ ≈ 0.95; MAFE ≈ 0.098. Error distributions are non-Gaussian but consistent across splits.
• Enthalpy per atom regressor: MAE ≈ 0.044 eV/atom (fit), 0.048 (val), 0.047 (test); MAFE ≈ 0.0070 (fit), 0.0077 (val), 0.0076 (test); ρ ≈ 0.88 (fit), 0.87 (val). On the 1000 lowest-h test samples, MAE improves to 0.026 eV/atom and MAFE to 0.0040. Slight bias toward overestimation of h (ML > DFT in 57–58% of samples).
• Phase classifier: High AP for semiconductor/insulator (0.94 fit/val; 0.97 test) but low for metal (0.59 fit; 0.24 val; 0.26 test); mAP: 0.77 (fit), 0.59 (val), 0.61 (test). Class imbalance likely contributes to lower performance on metallic class.
- Training set size: For volume, testing MAE/MAFE and ρ stabilized for N_fitting from ~5,000 to 70,000; increasing N_fitting reduced overfitting (fit vs. test gap) while slightly increasing fit MAE (a regularization-like effect) and aligning correlations.
- Comparison with CGCNN: Random forests achieved lower MAE than CGCNN across N_fitting; CGCNN matched RF accuracy only beyond ~70,000 samples and required two orders of magnitude more training time. Both scaled roughly linearly with number of fitting samples, but RF remained substantially faster; RF results were averaged over 10 forests, whereas CGCNN was a single model.
- Extension to new chemistries (10,000 added structures; metrics on added-system test sets):
• Volume (Table 2): MAE ~16–19 Å^3 (A/B), 18 Å^3 (thiophene), 16 Å^3 (piperidine); MAFE ~0.090–0.100 for A/B, 0.095 (thiophene), 0.046 (piperidine). Spearman ρ decreased notably: ~0.72–0.73 (A/B), 0.62 (thiophene), 0.40 (piperidine).
• Enthalpy (Table 3): MAE typically 0.035–0.044 eV/atom for A/B, 0.065 (thiophene), 0.015 (piperidine); MAFE ~0.0057–0.0072 (A/B), 0.012 (thiophene), 0.0028 (piperidine). Spearman ρ dropped to 0.41–0.49 (A/B) and 0.32–0.38 (C).
• Phase (Table 4): mAP decreased relative to base; testing mAP ~0.40–0.44 (A), 0.30 (pyridine), 0.48 (thiophene), 0.22 (piperidine).
- Data requirement for extension: For 1,2,3-triazine HCl, MAE/MAFE converged with as few as ~2,000 added samples; Spearman ρ required ~10,000 to approach ~0.73, with limited gains beyond that. Overall, extending models to new chemistries was feasible with 2,000–10,000 structures, but ranking ability (ρ) remained the main challenge.
- Practical utility: Even with modest absolute accuracy (especially for volume), the models effectively filter high-volume or high-enthalpy candidates, improving CSP efficiency by downselecting promising starting structures before costly DFT relaxations.
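The minority-metal weakness noted for the phase classifier above is a classic class-imbalance problem. The paper does not report a mitigation, but one standard scikit-learn option would be class weighting, sketched here on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.05).astype(int)  # ~5% minority "metal" class

# class_weight='balanced' reweights samples inversely to class frequency,
# so the rare metallic structures contribute more to each split decision.
clf = RandomForestClassifier(n_estimators=100, max_depth=10,
                             min_samples_split=10,
                             class_weight="balanced", random_state=0)
clf.fit(X, y)
```

Whether this recovers useful metal-class AP on the real data is an open question; the class weights only rebalance the loss, they cannot add information the descriptors lack.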
Discussion
The research question—whether properties of relaxed crystal structures can be predicted directly from unrelaxed structures to accelerate CSP—was addressed by combining compact crystal representations (CGSV) with efficient random forest models. The approach accurately ranks and estimates relaxed volumes and enthalpies within the base chemical space, showing close agreement between fitting, validation, and testing performance and enabling effective screening. The phase classifier is useful for identifying semiconductor/insulator outcomes but struggles on minority metallic cases due to class imbalance. In extrapolation to new chemistries, absolute errors (MAE/MAFE) remained comparable to base performance, but rank correlations dropped substantially, limiting reliable prioritization across new chemical spaces. Nonetheless, adding modest data (2,000–10,000 structures) per new system provided workable extensions, suggesting a scalable path: train a large base model, then cheaply adapt to related systems. The method outperforms CGCNN at small-to-moderate data sizes and trains far faster, making it attractive as a filter in CSP workflows or alongside ML-trained force fields. Overall, the models provide a practical, low-cost screening tool that reduces the number of expensive DFT relaxations by removing structures likely to relax into high-volume or high-enthalpy configurations.
Conclusion
This work introduces a fast, scalable ML framework for predicting relaxed properties (volume, enthalpy per atom, and electronic phase) of organic salt crystals from unrelaxed structures. Key contributions include a compact crystal graph singular value representation and random forest models that achieve strong interpolative performance and train orders of magnitude faster than CGCNNs. The models effectively filter candidate structures, enabling rapid downselection prior to costly DFT relaxations. Extensions to related chemistries are feasible with 2,000–10,000 additional structures, though ranking fidelity (Spearman ρ) degrades, particularly for more distant chemical systems and for minority metallic phases. Future work should broaden the base training chemical space, apply transfer learning strategies, address class imbalance for phase prediction, and include flexible molecules and multiple conformers to improve generality. Integrating these models into state-of-the-art CSP workflows or force-field-based pipelines could substantially reduce computational cost while maintaining coverage of plausible polymorphs.
Limitations
- Extrapolation: Significant drops in Spearman correlation when extending to new chemistries limit ranking reliability; particularly weak for piperidine HCl and thiophene HCl.
- Phase classification imbalance: Poor performance on the minority metallic class reduces reliability for metal/insulator discrimination without balanced data.
- Dependence on generation/relaxation protocols: Trained models are tied to how structures are generated (AIRSS settings, conformers) and relaxed (DFT setup), which may hinder transferability across workflows.
- Thermodynamic focus: The approach assumes thermodynamics dominates polymorph observability; kinetic factors may prevent experimentally relevant forms from being identified by enthalpy/volume alone.
- Limited molecular flexibility: Current datasets emphasize small, rigid, planar molecules; flexible molecules with multiple conformers require broader sampling and may exhibit rapidly varying, discontinuous structure–property relationships.
- Accuracy ceiling for volume: Volume predictions, while useful for screening, are not sufficiently accurate to replace explicit relaxations.
- Data requirements for extension: Although reduced compared to full retraining, 2,000–10,000 relaxed structures per new chemistry are still needed to achieve useful performance.