Introduction
Machine learning (ML) is increasingly used in materials science, but data scarcity remains a challenge in many areas. The emergence of large datasets from high-throughput density functional theory (DFT) calculations, such as the Open Catalyst datasets (over 260 million data points), nonetheless signals a shift towards "big data" for certain material properties. While substantial effort has been devoted to collecting vast amounts of data, the information richness of these datasets has received far less attention. This matters because current acquisition strategies, which often rely on exhaustive enumeration or random sub-sampling of chemical combinations and structural prototypes, may be inefficient due to unrecognized redundancy; redundancy present in existing datasets can also propagate into future ones, hindering efficient data acquisition.

The sheer volume of data also poses significant challenges for ML model development, demanding computational resources that are inaccessible to most researchers and motivating data reduction techniques that improve training efficiency. Research in other fields, such as image and natural language processing, has shown that effective models can be trained on smaller, carefully selected subsets of data, yet the presence and extent of redundancy in materials science datasets remain largely unexplored. Closing this gap can yield smaller, more efficient benchmark datasets, reducing training costs and accelerating model development; understanding and eliminating redundancy can also strengthen active learning algorithms, which are increasingly used in ML-driven materials discovery.

This study systematically investigates data redundancy across multiple large materials datasets, evaluating the impact of reduced training set sizes on the performance of various ML models and employing a pruning algorithm to identify informative data. It also compares uncertainty-based active learning strategies with the pruning algorithm to optimize data acquisition and model development.
Literature Review
The authors review existing literature on large materials datasets generated by high-throughput DFT calculations, highlighting databases like JARVIS, Materials Project (MP), and OQMD. They cite studies that utilized these databases for materials property prediction. The literature also supports the use of active learning techniques in materials discovery, emphasizing the need for efficient data acquisition strategies. The authors note the lack of prior research specifically addressing data redundancy in materials science datasets, setting the stage for their investigation.
Methodology
The study employs a standard hold-out method to evaluate ML model performance. Each dataset is randomly split into a training pool and a hold-out test set; in-distribution (ID) performance is assessed on the hold-out test set, while out-of-distribution (OOD) performance is evaluated on data from a newer version of the same database. Three widely used DFT databases (JARVIS, MP, and OQMD) are used, each in two release versions to enable the OOD assessment. Two conventional ML models (XGBoost and random forests) and a state-of-the-art graph neural network (ALIGNN) are employed to ensure a model-agnostic evaluation, with the root mean square error (RMSE) as the primary performance metric.

A pruning algorithm progressively reduces the training set size from 100% down to 5% of the pool, systematically removing data points while monitoring the impact on model performance in order to identify the informative data. The study defines a quantitative threshold, a 10% relative increase in RMSE, to determine how much of the data is redundant.

In addition to the pruning algorithm, uncertainty-based active learning strategies are used to select informative data, with the models' prediction uncertainty guiding the selection of the next data points. Three active learning algorithms are considered, each employing a different uncertainty measure: the width of the prediction intervals for random forest and XGBoost models, and query by committee (QBC) based on the disagreement between random forest and XGBoost predictions. The performance of models trained on data selected by these active learning algorithms is compared with that of models trained on data selected by the pruning algorithm and by random sampling.
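To make the pruning-and-threshold procedure concrete, here is a minimal Python sketch. It is not the authors' exact algorithm: the residual-based ranking heuristic, the placeholder random data, and all hyperparameters below are illustrative assumptions standing in for a proper materials featurization and the paper's actual pruning scheme.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((2000, 20))   # placeholder features (assumed pre-computed descriptors)
y = rng.random(2000)         # placeholder property values

# Standard hold-out split: training pool vs. ID test set.
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

def rmse(model, X, y):
    pred = model.predict(X)
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Baseline: model trained on 100% of the pool.
full = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_pool, y_pool)
baseline = rmse(full, X_test, y_test)

# Illustrative pruning heuristic: points the full model already fits well are
# treated as redundant; large-residual points as informative.
residuals = np.abs(y_pool - full.predict(X_pool))
order = np.argsort(residuals)[::-1]          # most informative first

for frac in (1.0, 0.5, 0.25, 0.13, 0.05):    # pruning levels, 100% down to 5%
    keep = order[: max(1, int(frac * len(y_pool)))]
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_pool[keep], y_pool[keep])
    score = rmse(model, X_test, y_test)
    within = score <= 1.10 * baseline        # the 10% relative-RMSE threshold
    print(f"{frac:5.0%} of pool  RMSE={score:.4f}  within threshold: {within}")
```

The smallest fraction that stays within the threshold gives the estimated informative share of the pool; everything beyond it counts as redundant under this definition.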
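The QBC strategy can likewise be sketched in a few lines: disagreement between a random forest and an XGBoost regressor decides which pool points to acquire next. The batch size, seed-set size, and hyperparameters are illustrative assumptions, and because the labels already exist in a pre-computed DFT database, the "acquisition" here simply grows the labeled subset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

def qbc_round(X_pool, y_pool, labeled, batch=200):
    """One acquisition round: fit the two committee members on the labeled
    subset, then pick the pool points where their predictions disagree most."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    xgb = XGBRegressor(n_estimators=200, learning_rate=0.1, verbosity=0)
    rf.fit(X_pool[labeled], y_pool[labeled])
    xgb.fit(X_pool[labeled], y_pool[labeled])
    disagreement = np.abs(rf.predict(X_pool) - xgb.predict(X_pool))
    disagreement[labeled] = -np.inf          # never re-select labeled points
    return np.argsort(disagreement)[-batch:]

# Usage: start from a small random seed set and grow it round by round.
rng = np.random.default_rng(0)
X_pool = rng.random((5000, 20))              # placeholder featurized materials
y_pool = rng.random(5000)                    # placeholder property values
labeled = rng.choice(len(y_pool), size=200, replace=False)
for _ in range(5):
    new = qbc_round(X_pool, y_pool, labeled)
    labeled = np.concatenate([labeled, new])
```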
Key Findings
The study finds substantial redundancy across multiple large materials datasets. Using a 10% RMSE increase as the threshold for significant performance degradation, it shows that only a small fraction of the data is truly informative. For example, for formation energy prediction with random forest models, only 13% of JARVIS18 data and 17% of MP18 and OQMD data are considered informative. Similar results are observed for other models and properties, indicating that a large portion of the data is redundant, and the redundant data are largely associated with over-represented material types.

Models trained on pruned datasets show in-distribution performance comparable to those trained on much larger datasets. The performance on unused data further confirms the redundancy: the RMSE on unused data is often lower than the RMSE on the ID test set once a sufficient amount of informative data is included in the training set. However, out-of-distribution performance is significantly impacted by training set size, highlighting the importance of data diversity and the limitations of relying solely on ID performance for evaluating model robustness.

The pruning algorithm effectively identifies informative materials, showing good transferability across different ML architectures but limited transferability across different material properties. Uncertainty-based active learning, particularly the QBC algorithm, proves highly effective in identifying informative data, requiring only 30-35% of the data to achieve performance comparable to the pruning algorithm and significantly outperforming random sampling.
Discussion
The findings challenge the prevailing "bigger is better" mentality in materials data acquisition, emphasizing the importance of information richness over sheer data volume. The study demonstrates that data redundancy is prevalent in existing datasets, primarily due to over-representation of certain material types. The results highlight the need for more sophisticated data acquisition strategies that focus on diversity and information content, such as uncertainty-based active learning. The limited transferability of pruned datasets across different material properties suggests that optimizing data selection for multiple properties simultaneously may be more effective than optimizing for each property individually. This research provides crucial insights for efficient data acquisition and the development of more robust ML models for materials science, shifting the paradigm from systematic high-throughput studies to targeted data acquisition strategies. The identified redundancy may also reflect biases in existing datasets introduced by the methods used to create them; it is therefore important to explore more diverse datasets or to develop methods that mitigate these biases.
Conclusion
This research reveals significant data redundancy in large materials datasets, demonstrating the potential for substantial data reduction without sacrificing in-distribution prediction accuracy. The study advocates a shift in focus from data volume to information richness in materials data acquisition and model development, presenting uncertainty-based active learning as an efficient alternative to collecting massive datasets. Future work could focus on more advanced active learning strategies that handle high-dimensional data, multi-task learning scenarios, and the incorporation of domain knowledge. The study also argues for developing methods that purposefully seek out materials on which existing models fail catastrophically, so that databases are expanded in a genuinely informative way.
Limitations
The study focuses on specific material properties (formation energy, band gap, and bulk modulus) and ML models. The findings may not generalize perfectly to other properties or model architectures. The definition of "informative data" based on a 10% RMSE increase threshold is somewhat arbitrary and may influence the results. The active learning algorithms employed might still introduce biases in the selected datasets. Further investigation is needed to understand the full implications of data redundancy across a broader range of materials and ML techniques.