Introduction
The rapid advancement of machine learning (ML) in materials science has yielded impressive results in predicting material properties, often achieving accuracy comparable to density functional theory (DFT) calculations. However, a significant concern arises from the inherent redundancy present in many materials databases such as the Materials Project and OQMD. Built through a historical "tinkering" approach to material design, these databases contain numerous highly similar materials. This redundancy biases standard random train/test splitting, inflating performance metrics so that they no longer reflect a model's true generalization capability, particularly its ability to extrapolate to out-of-distribution (OOD) samples, which is crucial for materials discovery. The problem is analogous to protein function prediction in bioinformatics, where tools such as CD-HIT are routinely used to remove sequence redundancy before model evaluation. Some studies have examined the extrapolation performance of ML models in materials science, shown that traditional cross-validation overestimates performance, and proposed alternative evaluation schemes such as leave-one-(material)-cluster-out cross-validation (LOCO CV) and k-fold forward cross-validation (FCV). Nevertheless, a systematic method for reducing dataset redundancy to make performance evaluation more objective has been lacking. This paper addresses that gap by introducing MD-HIT, a novel algorithm designed to reduce redundancy in materials datasets, thereby enabling a more accurate and reliable assessment of ML model performance, particularly for extrapolation to OOD samples.
Literature Review
Recent literature showcases remarkable achievements in ML-based material property prediction. Studies report DFT-level accuracy for various properties, sometimes with surprisingly small training datasets. For instance, impressive results have been reported for formation energy prediction, with mean absolute errors (MAEs) claimed to be on par with or below the errors of DFT calculations themselves. Similar success has been observed for band gap prediction and other properties using diverse ML techniques such as graph neural networks (GNNs) and deep transfer learning. However, these successes often rely on datasets with significant redundancy. Previous research has begun to address this issue by investigating the extrapolation performance of ML models and proposing novel cross-validation strategies. These studies underscore the tendency for ML models to exhibit high interpolation performance (accurate prediction within the distribution of the training data) but poor extrapolation performance (accurate prediction outside the training data distribution). One study specifically highlighted the problem of overrepresented material types leading to redundancy and poor OOD performance, suggesting pruning algorithms to improve data efficiency. Other approaches focus on adaptive sampling strategies to generate more informative training datasets. While these works acknowledge the importance of data efficiency and redundancy, they lack a general-purpose solution applicable across diverse material properties, and they do not fully address the overestimation of performance caused by redundancy in standard model evaluation.
Methodology
To address the issue of dataset redundancy and its impact on the performance evaluation of ML models for material property prediction, this paper introduces MD-HIT, a novel redundancy reduction algorithm with two versions: MD-HIT-composition and MD-HIT-structure. MD-HIT-composition reduces redundancy based on material composition, using composition similarity metrics that include linear measures (e.g., mendeleev, petti, atomic, mod_petti), chemically derived measures (e.g., oliynyk, oliynyk_sc, jarvis, jarvis_sc, magpie, magpie_sc), and machine-learned measures (e.g., cgcnn, element, mat2vec, matscholar, megnet16). The algorithm follows a greedy incremental approach, similar in spirit to the CD-HIT tool used in bioinformatics: starting from a seed material, it iterates over the remaining materials and discards as redundant any material that falls within a specified distance threshold of an already selected material. Composition distances are computed with the Element Movers Distance (ElMD) package, which offers a choice of similarity measures; in this study, Mendeleev similarity and MatScholar similarity are primarily employed. MD-HIT-structure addresses redundancy in datasets of crystal structures. Because structures with varying numbers of atoms cannot be compared directly, two fixed-length structure descriptors are used: smoothed X-ray diffraction (XRD) features (computed with Pymatgen's XRDCalculator module, smoothing the XRD pattern and sampling 900 evenly distributed points between 0 and 90 degrees) and OrbitalFieldMatrix (OFM) features, which encode electronic structure information. The algorithm then operates like MD-HIT-composition, iteratively selecting structures whose Euclidean distance to all previously selected structures exceeds the threshold. The Materials Project database, containing crystal structures and their properties, serves as the source dataset.
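The greedy incremental selection described above can be sketched in a few lines. This is a minimal illustration over generic feature vectors with a Euclidean distance, not the paper's actual implementation (which uses ElMD, XRD, or OFM distances); the function name and interface are assumptions for illustration.

```python
import numpy as np

def md_hit_reduce(features, threshold):
    """Greedy incremental redundancy reduction in the spirit of MD-HIT.

    features: (n_samples, n_dims) array of material descriptors
              (e.g., composition or structure feature vectors).
    threshold: minimum distance a material must keep from every
               already selected material in order to be retained.
    Returns the indices of the non-redundant subset.
    """
    selected = [0]  # seed material: the first entry in the dataset
    for i in range(1, len(features)):
        # distances from candidate i to all materials kept so far
        d = np.linalg.norm(features[selected] - features[i], axis=1)
        if d.min() >= threshold:
            selected.append(i)  # far enough from every kept material
        # otherwise the candidate is redundant and is discarded
    return selected
```

Raising the threshold discards more near-duplicates and yields a smaller, less redundant dataset, which is how the differently sized datasets in this study are produced.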
The algorithms generate non-redundant datasets by varying the distance threshold for both the composition- and structure-based approaches. Composition datasets are generated with the Mendeleev and MatScholar distances at several thresholds, yielding datasets of different sizes; structure datasets are created analogously with the XRD and OFM distances. Two state-of-the-art composition-based algorithms, Roost and CrabNet, and two structure-based algorithms, ALIGNN and DeeperGATGNN, are then evaluated on the generated datasets for formation energy and band gap prediction, with performance assessed using MAE and R². To investigate OOD performance, an EmbeddingOOD test set is created from latent representations obtained with a pretrained Roost model, and OOD performance is compared for models trained on redundant versus non-redundant datasets. Additionally, parity plots and an analysis of prediction error as a function of sample density are used to further elucidate the effect of redundancy removal. Random Forest (RF) models are also evaluated on the same datasets.
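The construction of an embedding-based OOD test set can be sketched as follows. The paper builds its EmbeddingOOD set from pretrained Roost latent representations; the selection rule below (taking the candidates farthest from their nearest training embedding) is one plausible criterion, shown purely for illustration, and the function name and interface are assumptions.

```python
import numpy as np

def select_ood(train_emb, cand_emb, n_ood):
    """Pick the n_ood candidates farthest from the training distribution.

    train_emb: (n_train, d) latent representations of training samples.
    cand_emb:  (n_cand, d) latent representations of candidate samples.
    OOD-ness proxy: Euclidean distance to the nearest training embedding.
    Returns indices of the n_ood most distant candidates.
    """
    # pairwise distances: (n_cand, n_train)
    d = np.linalg.norm(cand_emb[:, None, :] - train_emb[None, :, :], axis=2)
    nearest = d.min(axis=1)          # distance to closest training sample
    return np.argsort(nearest)[::-1][:n_ood]  # most distant first
```

Samples selected this way lie far from the training data in representation space, so performance on them probes extrapolation rather than interpolation.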
Key Findings
The study's key findings demonstrate the significant impact of dataset redundancy on the performance evaluation of ML models for material property prediction. Results consistently show that removing redundancy lowers the metrics obtained on a randomly split test set (R² values decrease and MAE values increase), because the models are no longer flattered by highly similar samples appearing in both the training and test sets. This performance degradation is not a sign of model failure but a more realistic reflection of true predictive capability on novel, unseen materials. The different distance metrics (Mendeleev and MatScholar for composition, XRD and OFM for structure) affected model performance to varying degrees, with some metrics showing stronger correlations between performance and threshold value than others. For band gap prediction, the large number of zero-band-gap samples in the dataset can produce inconsistent trends in MAE, highlighting the complexity of this particular prediction task. Crucially, the study showcases the benefit of redundancy reduction for OOD prediction: models trained on non-redundant data outperform models trained on redundant data on an OOD test set, and parity plots visually confirm the improved prediction accuracy for OOD samples. An analysis of prediction error versus sample density further shows that lower prediction errors are consistently associated with higher sample density, indicating that redundancy allows models to 'cheat' by overfitting to densely populated regions of the feature space. Finally, the deep learning models (Roost, CrabNet, DeeperGATGNN) and Random Forest (RF) models showed contrasting behavior when evaluated on the MatscholarOOD test set.
The deep learning models performed worse when trained on non-redundant data, likely due to the discrepancy between the feature space used for OOD sample selection and the latent representation space the models learn. However, the RF models showed improved OOD performance when trained on non-redundant data, because they operate directly in the feature space used for OOD sample selection, without an intermediate latent representation.
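The error-versus-density analysis above can be sketched as follows. The density proxy (inverse mean distance to the k nearest training samples) and the equal-count binning are assumptions made for illustration; the paper's exact density estimate is not specified here.

```python
import numpy as np

def density_vs_error(train_feat, test_feat, abs_err, k=2, n_bins=3):
    """Relate per-sample prediction error to local sample density.

    train_feat: (n_train, d) training feature vectors.
    test_feat:  (n_test, d) test feature vectors.
    abs_err:    (n_test,) absolute prediction errors on the test set.
    Density proxy: inverse of the mean distance from each test sample
    to its k nearest training samples.
    Returns the mean absolute error in each density bin,
    ordered from low to high density.
    """
    d = np.linalg.norm(test_feat[:, None, :] - train_feat[None, :, :], axis=2)
    knn_mean = np.sort(d, axis=1)[:, :k].mean(axis=1)
    density = 1.0 / (knn_mean + 1e-12)
    order = np.argsort(density)            # low -> high density
    bins = np.array_split(order, n_bins)   # equal-count density bins
    return [float(np.mean(abs_err[b])) for b in bins]
```

On a redundant dataset this analysis shows errors shrinking sharply in the densest bins, which is exactly the 'cheating' behavior that redundancy removal is meant to expose.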
Discussion
The findings of this study highlight the crucial role of dataset redundancy control in achieving objective performance evaluation of ML models for material property prediction. The consistently observed performance decrease upon redundancy removal indicates that previous reports of high accuracy may have been overly optimistic due to this bias. This has important implications for the materials science community, as it emphasizes the need to develop models with robust extrapolation capabilities rather than simply optimizing for interpolation within densely populated regions of the materials space. The improved OOD performance of models trained on non-redundant datasets underscores the importance of addressing this issue for materials discovery and design. The contrasting behavior of deep learning and RF models with respect to OOD performance suggests that the choice of model and feature space must be considered carefully when evaluating the impact of redundancy reduction, and that the selection of OOD samples itself is critical. This research provides a valuable framework for more rigorous benchmarking and evaluation of ML methods in materials science and suggests that using MD-HIT or similar methods can lead to more robust and reliable models for material discovery.
Conclusion
This research introduces MD-HIT, a novel algorithm for reducing redundancy in materials datasets, leading to more realistic and reliable evaluations of ML models for material property prediction. Results consistently demonstrate that while redundancy removal lowers apparent in-distribution performance, it significantly improves out-of-distribution performance, a crucial factor for materials discovery efforts. Future research could focus on developing more efficient redundancy reduction algorithms, exploring alternative distance metrics, and investigating the optimal balance between data efficiency and model generalization. The open-source availability of MD-HIT facilitates wider adoption and further advances in the field.
Limitations
The current implementation of MD-HIT algorithms is relatively slow, particularly for large datasets. Further optimization is needed to improve computational efficiency. The selection of the seed material in the MD-HIT algorithms might influence the composition of the non-redundant dataset and therefore the final results. Future work could investigate the effects of different seed material choices. The definition and selection of OOD samples requires careful consideration and may affect the interpretation of results. Different methods of choosing OOD samples might lead to varying conclusions.