Introduction
Machine learning (ML) is increasingly used in materials science, but data scarcity remains a challenge in many areas. The emergence of large datasets from high-throughput density functional theory (DFT) calculations, such as the Open Catalyst datasets (over 260 million data points), nonetheless signals a shift towards "big data" for certain material properties. While substantial effort has been devoted to collecting vast amounts of data, the information richness of these datasets has received far less attention. This matters because current acquisition strategies, which often rely on exhaustive enumeration or random sub-sampling of chemical combinations and structural prototypes, may be inefficient due to unrecognized redundancy; redundancy present in existing datasets can also propagate into future ones, hindering efficient data acquisition.

The sheer volume of data also poses significant challenges for ML model development, demanding computational resources that are inaccessible to most researchers and motivating data reduction techniques that improve training efficiency. Research in other fields, such as image and natural language processing, has shown that effective models can be trained on smaller, carefully selected subsets of data, yet the presence and extent of redundancy in materials science datasets remain largely unexplored. Closing this gap can yield smaller, more efficient benchmark datasets, reducing training costs and accelerating model development; understanding and eliminating redundancy can also strengthen active learning algorithms, which are increasingly used in ML-driven materials discovery.

This study systematically investigates data redundancy across multiple large materials datasets, evaluating the impact of reduced training set sizes on the performance of various ML models and employing a pruning algorithm to identify informative data. It also compares uncertainty-based active learning strategies with the pruning algorithm to optimize data acquisition and model development.
Literature Review
The authors review existing literature on large materials datasets generated by high-throughput DFT calculations, highlighting databases like JARVIS, Materials Project (MP), and OQMD. They cite studies that utilized these databases for materials property prediction. The literature also supports the use of active learning techniques in materials discovery, emphasizing the need for efficient data acquisition strategies. The authors note the lack of prior research specifically addressing data redundancy in materials science datasets, setting the stage for their investigation.
Methodology
The study employs a standard hold-out method to evaluate ML model performance. Each dataset is randomly split into a training pool and a hold-out test set; in-distribution (ID) performance is assessed on the hold-out test set, while out-of-distribution (OOD) performance is evaluated on data from a newer version of the same database. Three widely used DFT databases (JARVIS, MP, and OQMD) are used, each in two release versions to enable the OOD assessment. Two conventional ML models (XGBoost and random forests) and a state-of-the-art graph neural network (ALIGNN) are employed to ensure a model-agnostic evaluation, with the root mean square error (RMSE) as the primary performance metric.

A pruning algorithm progressively reduces the training set size from 100% down to 5% of the pool, systematically removing data points while monitoring the impact on model performance in order to identify the informative data. The study defines a quantitative threshold, a 10% relative increase in RMSE, to determine how much of the data is redundant.

In addition to the pruning algorithm, uncertainty-based active learning strategies are used to select informative data, with the models' prediction uncertainty guiding the selection of the next data points. Three active learning algorithms are considered, each employing a different uncertainty measure: the width of the prediction intervals for random forest and XGBoost models, and query by committee (QBC) based on the disagreement between random forest and XGBoost predictions. The performance of models trained on data selected by these active learning algorithms is compared with that of models trained on data selected by the pruning algorithm and by random sampling.
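To make the pruning-and-threshold procedure concrete, here is a minimal Python sketch. It is not the authors' exact algorithm: the residual-based ranking heuristic, the placeholder random data, and all hyperparameters below are illustrative assumptions standing in for a proper materials featurization and the paper's actual pruning scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 20))   # placeholder features (assumed pre-computed descriptors)
y = rng.random(2000)         # placeholder property values

# Standard hold-out split: training pool vs. ID test set.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def rmse(model, X, y):
    pred = model.predict(X)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Baseline: model trained on 100% of the pool.
full = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_pool, y_pool)
baseline = rmse(full, X_test, y_test)

# Illustrative pruning heuristic: points the full model already fits well are
# treated as redundant; large-residual points as informative.
residuals = np.abs(y_pool - full.predict(X_pool))
order = np.argsort(residuals)[::-1]          # most informative first

for frac in (1.0, 0.5, 0.25, 0.13, 0.05):    # pruning levels, 100% down to 5%
    keep = order[: max(1, int(frac * len(y_pool)))]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_pool[keep], y_pool[keep])
    score = rmse(model, X_test, y_test)
    within = score <= 1.10 * baseline        # the 10% relative-RMSE threshold
    print(f"{frac:5.0%} of pool  RMSE={score:.4f}  within threshold: {within}")
```

The smallest fraction that stays within the threshold gives the estimated informative share of the pool; everything beyond it counts as redundant under this definition.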
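The QBC strategy can likewise be sketched in a few lines: disagreement between a random forest and an XGBoost regressor decides which pool points to acquire next. The batch size, seed-set size, and hyperparameters are illustrative assumptions, and because the labels already exist in a pre-computed DFT database, the "acquisition" here simply grows the labeled subset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def qbc_round(X_pool, y_pool, labeled, batch=200):
    """One acquisition round: fit the two committee members on the labeled
    subset, then pick the pool points where their predictions disagree most."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    xgb = XGBRegressor(n_estimators=200, learning_rate=0.1, verbosity=0)
    rf.fit(X_pool[labeled], y_pool[labeled])
    xgb.fit(X_pool[labeled], y_pool[labeled])
    disagreement = np.abs(rf.predict(X_pool) - xgb.predict(X_pool))
    disagreement[labeled] = -np.inf          # never re-select labeled points
    return np.argsort(disagreement)[-batch:]

# Usage: start from a small random seed set and grow it round by round.
rng = np.random.default_rng(0)
X_pool = rng.random((5000, 20))              # placeholder featurized materials
y_pool = rng.random(5000)                    # placeholder property values
labeled = rng.choice(len(y_pool), size=200, replace=False)
for _ in range(5):
    new = qbc_round(X_pool, y_pool, labeled)
    labeled = np.concatenate([labeled, new])
```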
Key Findings
The study finds substantial redundancy across multiple large materials datasets. Using a 10% RMSE increase as the threshold for significant performance degradation, it shows that only a small fraction of the data is truly informative. For example, for formation energy prediction with random forest models, only 13% of JARVIS18 data and 17% of MP18 and OQMD data are considered informative. Similar results are observed for other models and properties, indicating that a large portion of the data is redundant, and the redundant data are largely associated with over-represented material types.

Models trained on pruned datasets show in-distribution performance comparable to those trained on much larger datasets. The performance on unused data further confirms the redundancy: the RMSE on unused data is often lower than the RMSE on the ID test set once a sufficient amount of informative data is included in the training set. However, out-of-distribution performance is significantly impacted by training set size, highlighting the importance of data diversity and the limitations of relying solely on ID performance for evaluating model robustness.

The pruning algorithm effectively identifies informative materials, showing good transferability across different ML architectures but limited transferability across different material properties. Uncertainty-based active learning, particularly the QBC algorithm, proves highly effective in identifying informative data, requiring only 30-35% of the data to achieve performance comparable to the pruning algorithm and significantly outperforming random sampling.
Discussion
The findings challenge the prevailing "bigger is better" mentality in materials data acquisition, emphasizing the importance of information richness over sheer data volume. The study demonstrates that data redundancy is prevalent in existing datasets, primarily due to over-representation of certain material types. The results highlight the need for more sophisticated data acquisition strategies that focus on diversity and information content, such as uncertainty-based active learning. The limited transferability of pruned datasets across different material properties suggests that optimizing data selection for multiple properties simultaneously may be more effective than optimizing for each property individually. This research provides crucial insights for efficient data acquisition and the development of more robust ML models for materials science, shifting the paradigm from systematic high-throughput studies to targeted data acquisition strategies. The identified redundancy may also reflect biases in existing datasets introduced by the methods used to create them; it is therefore important to explore more diverse datasets or to develop methods that mitigate these biases.
Conclusion
This research reveals significant data redundancy in large materials datasets, demonstrating the potential for substantial data reduction without sacrificing in-distribution prediction accuracy. The study advocates a shift in focus from data volume to information richness in materials data acquisition and model development, presenting uncertainty-based active learning as an efficient alternative to collecting massive datasets. Future work could focus on more advanced active learning strategies that handle high-dimensional data, multi-task learning scenarios, and the incorporation of domain knowledge. The study also argues for developing methods that purposefully seek out materials on which existing models fail catastrophically, so that databases are expanded in a genuinely informative way.
Limitations
The study focuses on specific material properties (formation energy, band gap, and bulk modulus) and ML models. The findings may not generalize perfectly to other properties or model architectures. The definition of "informative data" based on a 10% RMSE increase threshold is somewhat arbitrary and may influence the results. The active learning algorithms employed might still introduce biases in the selected datasets. Further investigation is needed to understand the full implications of data redundancy across a broader range of materials and ML techniques.