This study reveals substantial redundancy in large materials datasets: up to 95% of the data can be removed from machine learning training sets with minimal impact on in-distribution prediction performance. The redundant data corresponds to over-represented material types and does not improve out-of-distribution performance. Uncertainty-based active learning can efficiently construct smaller but still informative datasets. The findings challenge the "bigger is better" approach and emphasize information richness over data volume for efficient data acquisition and machine learning.
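As a rough illustration of the uncertainty-based active learning idea mentioned in the summary, the sketch below grows a small training subset by repeatedly adding the pool entries on which a model ensemble disagrees most. It is a toy stand-in, not the study's actual models or pipeline: the bootstrap ridge-regression ensemble, the function names (`fit_ridge`, `ensemble_predict`, `active_learning_select`), and all parameter values are assumptions made for illustration.

```python
import numpy as np

def fit_ridge(X, y, alpha=1e-3):
    """Closed-form ridge regression; returns the weight vector."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def ensemble_predict(X_train, y_train, X_query, n_models=10, rng=None):
    """Bootstrap ensemble: mean and standard deviation of predictions on X_query."""
    rng = np.random.default_rng() if rng is None else rng
    preds = []
    for _ in range(n_models):
        # Resample the training set with replacement for each ensemble member.
        idx = rng.integers(0, len(X_train), size=len(X_train))
        w = fit_ridge(X_train[idx], y_train[idx])
        preds.append(X_query @ w)
    preds = np.asarray(preds)
    return preds.mean(axis=0), preds.std(axis=0)

def active_learning_select(X_pool, y_pool, n_init=10, batch=5, budget=50, seed=0):
    """Return indices of a subset built by adding the most uncertain pool points."""
    rng = np.random.default_rng(seed)
    # Hypothetical starting point: a small random seed set.
    selected = list(rng.choice(len(X_pool), size=n_init, replace=False))
    while len(selected) < budget:
        remaining = np.setdiff1d(np.arange(len(X_pool)), selected)
        _, std = ensemble_predict(
            X_pool[selected], y_pool[selected], X_pool[remaining], rng=rng
        )
        take = min(batch, budget - len(selected))
        # Ensemble spread serves as the uncertainty estimate; take the largest.
        picked = remaining[np.argsort(std)[-take:]]
        selected.extend(picked.tolist())
    return np.array(selected)
```

Calling `active_learning_select(X, y, budget=50)` on a feature matrix `X` and target vector `y` returns 50 row indices. A model trained on that subset can then be compared with one trained on the full pool to gauge how much of the data is redundant.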
Publisher
Nature Communications
Published On
Nov 10, 2023
Authors
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers
Tags
data redundancy
machine learning
active learning
materials datasets
prediction performance