Introduction
Determining the synthesizability of materials is a major challenge in materials science. Online databases such as the Materials Project, AFLOW, OQMD, and C2DB provide thermodynamic stability information (energy above the convex hull, Eh), but they lack a readily available filter for vibrational stability. A vibrationally unstable material has imaginary phonon modes, which means its structure does not sit at a minimum of the potential energy surface, regardless of its thermodynamic stability. Calculating vibrational spectra with density functional theory (DFT) for every material in these databases is computationally prohibitive. This study addresses the gap by developing a machine learning (ML) model that rapidly predicts vibrational stability. Previous work has focused on predicting vibrational properties of *stable* materials, and only limited efforts have been made to predict stability itself, especially across a broad range of materials. The authors use a newly generated dataset of ~3100 materials to train and validate their ML model, with the goal of providing a much-needed filter for high-throughput screening of synthesizable materials.
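To make the stability criterion concrete, the following is a minimal sketch, assuming a toy 2x2 mass-weighted force-constant (dynamical) matrix at a single wave vector. The matrices, the function names, and the tolerance are illustrative assumptions and do not come from the study, which inspects the full vibrational density of states. The eigenvalues of the dynamical matrix are the squared phonon frequencies, so a negative eigenvalue corresponds to an imaginary frequency, that is, a direction along which the energy decreases.

```python
import numpy as np

def signed_frequencies(dynamical_matrix):
    """Return phonon frequencies; imaginary modes are reported as negative numbers."""
    D = np.asarray(dynamical_matrix, dtype=float)
    omega_squared = np.linalg.eigvalsh(D)  # eigenvalues of a symmetric matrix, ascending
    return np.sign(omega_squared) * np.sqrt(np.abs(omega_squared))

def is_vibrationally_stable(dynamical_matrix, tol=1e-8):
    """True if no eigenvalue (omega^2) is negative beyond a small numerical tolerance."""
    D = np.asarray(dynamical_matrix, dtype=float)
    return bool(np.all(np.linalg.eigvalsh(D) > -tol))

# Hypothetical toy matrices (arbitrary units), chosen only for illustration.
D_minimum = [[2.0, -1.0], [-1.0, 2.0]]  # all omega^2 > 0: a true energy minimum
D_saddle = [[1.0, 2.0], [2.0, 1.0]]     # one omega^2 < 0: an imaginary mode

for name, D in [("minimum", D_minimum), ("saddle", D_saddle)]:
    print(name, signed_frequencies(D), is_vibrationally_stable(D))
```

Reporting imaginary frequencies as negative numbers is a common plotting convention; it is used here only to keep the output real-valued.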
Literature Review
Several online materials databases exist, containing vast numbers of hypothetical materials. Thermodynamic stability (a low Eh value) is the most commonly used filter, but it does not guarantee vibrational stability. Previous studies have used machine learning to predict vibrational properties (such as entropy) of *stable* materials. Predicting vibrational *stability* itself remains a challenge, however, because the DFT calculations needed to generate training labels are computationally expensive. The authors cite existing datasets of vibrational spectra that cover only a limited number of materials, highlighting the need for a larger, more comprehensive dataset to train an accurate ML model.
Methodology
The authors created a dataset of 3112 materials (including metals, semiconductors, and insulators) from the Materials Project, restricted to materials with 4 or fewer atoms in the unit cell (and, for 4-atom materials, a bandgap >0.5 eV). Vibrational stability was determined using the finite difference method with DFT calculations (VASP) to obtain the force matrix and the vibrational density of states (VDOS). A significant density of imaginary phonons in the VDOS indicates vibrational instability. The calculated vibrational stability labels were made publicly available in a GitHub repository. Material features were then extracted and used to train machine learning models. These features included symmetry functions, basic atomic properties descriptors (BACD), and rapid one-shot ab initio descriptors (ROSA). Random Forest (RF) and Gradient Boosting (GB) classifiers were used. To address the class imbalance (fewer unstable than stable materials), the authors employed SMOTE (Synthetic Minority Over-sampling Technique) and mixup to generate synthetic samples of the unstable class, applied only within the training folds. Fivefold stratified cross-validation was used to train and evaluate the model. The model’s calibration was assessed by comparing the predicted class distribution to the true class distribution. Feature importance analysis was performed to identify the features most influential in predicting stability.
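The fold-wise augmentation described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' pipeline: the data are random toy values, the helper names are hypothetical, and a within-class mixup (convex combinations of pairs of minority samples with a Beta-distributed weight) stands in for the SMOTE and mixup steps. The point it shows is that synthetic samples are added to each training fold only, while the held-out fold keeps its original class ratio.

```python
import numpy as np

def stratified_folds(y, n_splits, rng):
    """Yield (train_idx, test_idx) pairs that keep the class ratio in every fold."""
    y = np.asarray(y)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_splits  # deal samples out round-robin
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

def mixup_minority(X_min, n_new, rng, alpha=0.4):
    """Create n_new synthetic rows as convex combinations of minority-class pairs."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    lam = rng.beta(alpha, alpha, size=(n_new, 1))  # mixing weight in [0, 1]
    return lam * X_min[i] + (1.0 - lam) * X_min[j]

def augment_training_fold(X, y, rng, minority=1):
    """Oversample the minority class until both classes have equal counts."""
    X_min = X[y == minority]
    n_new = int(np.sum(y != minority) - len(X_min))
    if n_new <= 0 or len(X_min) < 2:
        return X, y
    X_syn = mixup_minority(X_min, n_new, rng)
    y_syn = np.full(n_new, minority, dtype=y.dtype)
    return np.vstack([X, X_syn]), np.concatenate([y, y_syn])

# Toy data (assumption): 100 samples, 5 features, label 1 = unstable minority class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 20 + [0] * 80)

fold_summaries = []
for train_idx, test_idx in stratified_folds(y, n_splits=5, rng=rng):
    X_train, y_train = augment_training_fold(X[train_idx], y[train_idx], rng)
    # A classifier would be fit on (X_train, y_train) here and scored on the
    # untouched X[test_idx], y[test_idx].
    fold_summaries.append((int((y_train == 1).sum()), int((y_train == 0).sum()),
                           int((y[test_idx] == 1).sum()), len(test_idx)))
print(fold_summaries)
```

Augmenting before splitting would leak interpolated copies of test-fold samples into training, which is why the oversampling sits inside the loop.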
Key Findings
The initial random forest (RF) model, trained without synthetic data, achieved an f1-score of only about 0.53 for the unstable class because of the class imbalance. Adding synthetic samples of the unstable class, generated with SMOTE and mixup, improved performance markedly: the average recall for the unstable class increased from 0.42 to 0.68, and the average f1-score rose from 0.53 to 0.63. The mean AUC across the five folds reached 0.73, indicating reasonable overall discrimination between the two classes. The model was well calibrated, with less than a 5% difference between the predicted and true class distributions. Feature importance analysis revealed that BACD and ROSA features, along with SG features, were the most significant predictors of vibrational stability. Using only the top 30 features produced results comparable to using all 1147 features, suggesting that these top features capture most of the predictive information. Evaluation at different confidence levels showed that recall, precision, and f1-score for the unstable class all improved at higher confidence levels (above 0.65).
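The confidence-level evaluation mentioned above can be illustrated with a small sketch. It assumes that "confidence" means the predicted probability of the predicted class, and that metrics are computed only on samples meeting the confidence cutoff; the summary does not spell out the exact definition, so treat this as one plausible reading. The function name and the toy labels and probabilities are hypothetical.

```python
def unstable_metrics_at_confidence(y_true, p_unstable, min_confidence):
    """Precision, recall and f1 for the unstable class (label 1), computed only on
    samples whose predicted-class probability is at least min_confidence."""
    tp = fp = fn = kept = 0
    for truth, p in zip(y_true, p_unstable):
        confidence = max(p, 1.0 - p)  # probability assigned to the predicted class
        if confidence < min_confidence:
            continue  # low-confidence prediction: excluded from the evaluation
        kept += 1
        predicted = 1 if p >= 0.5 else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"kept": kept, "precision": precision, "recall": recall, "f1": f1}

# Toy example (made-up values): 1 = unstable, 0 = stable.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
p_unstable = [0.90, 0.70, 0.45, 0.55, 0.20, 0.10, 0.40, 0.30]

for cutoff in (0.50, 0.65):
    print(cutoff, unstable_metrics_at_confidence(y_true, p_unstable, cutoff))
```

In this toy data the metrics rise at the higher cutoff, but fewer samples are retained; the `kept` count makes that trade-off visible.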
Discussion
This work demonstrates that machine learning can predict the vibrational stability of inorganic materials, a task previously limited by computational cost. The improved performance after incorporating synthetic data highlights the importance of addressing class imbalance in ML models. The most significant features (BACD, ROSA, SG) provide insight into the structural and electronic properties related to vibrational stability. Because the model is well calibrated, the predicted shares of stable and unstable materials closely track the true class distribution. Evaluating model performance at different confidence levels allows a more nuanced interpretation of results and better-informed decisions. The model’s speed is a significant advantage over first-principles calculations for high-throughput screening.
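The calibration check described in this summary, comparing the predicted class distribution with the true one, reduces to a short calculation. The sketch below uses made-up labels and a hypothetical function name; it is not the authors' code. Note that agreement between the two class shares is a statement about aggregate proportions and does not by itself show that individual predictions are correct.

```python
def class_distribution_gap(y_true, y_pred, label=1):
    """Return (true share, predicted share, absolute gap) for one class label."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    true_share = sum(1 for t in y_true if t == label) / n
    pred_share = sum(1 for p in y_pred if p == label) / n
    return true_share, pred_share, abs(pred_share - true_share)

# Toy labels (assumption): 25 of 100 materials truly unstable, 22 predicted unstable.
y_true_toy = [1] * 25 + [0] * 75
y_pred_toy = [1] * 22 + [0] * 78

true_share, pred_share, gap = class_distribution_gap(y_true_toy, y_pred_toy)
print(f"true={true_share:.2f} predicted={pred_share:.2f} gap={gap:.2f}")
```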
Conclusion
This study presents a robust machine learning workflow for predicting the vibrational stability of inorganic materials. The model, trained on a diverse dataset and improved through data augmentation techniques, achieves reasonable accuracy in identifying vibrationally unstable materials. This work demonstrates the potential of this ML model as an efficient screening tool for material databases, enabling more informed material selection and potentially accelerating the discovery of new, synthesizable materials. Future work could involve expanding the training dataset to include materials with larger unit cells and a broader range of chemical compositions to further enhance the model’s generalizability and predictive capabilities. Exploring different ML algorithms and feature engineering techniques could also lead to further performance improvements.
Limitations
The current ML model was trained on a dataset sampled by unit-cell size (4 or fewer atoms), which may limit its generalizability to materials outside this size range. The dataset's statistical distribution might also differ from that of the larger set of materials in the Materials Project, so extrapolating the model's predictions to arbitrary materials could give less accurate results. Expanding the training set to include larger unit cells and more diverse chemical compositions would be necessary to address this limitation and improve the model's extrapolation capabilities.