Exploiting redundancy in large materials datasets for efficient machine learning with less data

Engineering and Technology

K. Li, D. Persaud, et al.

Research by Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, and Jason Hattrick-Simpers shows that up to 95% of the data in large materials datasets can be removed from machine learning training with little impact on in-distribution prediction accuracy. The study challenges the "bigger is better" assumption by showing that a smaller, more informative dataset can match a much larger one.

Abstract
Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the "bigger is better" mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
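To make the idea of uncertainty-based active learning concrete, here is a minimal sketch of one common variant. It is not the authors' implementation: the linear model, the bootstrap ensemble used to estimate uncertainty, and all names (`fit_line`, `fit_ensemble`, `select_informative_subset`) are assumptions chosen for illustration. The sketch assumes a pool of already labeled `(x, y)` pairs and repeatedly moves the points with the largest ensemble disagreement into the training set until a size budget is reached.

```python
import random
import statistics


def fit_line(points):
    """Least-squares fit y = a*x + b for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    if var_x == 0:
        # Degenerate sample (all x equal): fall back to a constant model.
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in points) / var_x
    return a, mean_y - a * mean_x


def fit_ensemble(train, n_models, rng):
    """Fit one line per bootstrap resample of the training set."""
    return [fit_line([rng.choice(train) for _ in train]) for _ in range(n_models)]


def uncertainty(models, x):
    """Ensemble disagreement at x: std. dev. of the member predictions."""
    return statistics.pstdev([a * x + b for a, b in models])


def select_informative_subset(pool, n_initial, n_total, batch_size,
                              n_models=20, seed=0):
    """Grow a training set by adding the most uncertain pool points.

    Hypothetical helper for illustration; assumes labels in `pool` are known.
    """
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    train, remaining = pool[:n_initial], pool[n_initial:]
    while len(train) < n_total and remaining:
        models = fit_ensemble(train, n_models, rng)
        # Rank the remaining points from most to least uncertain.
        ranked = sorted(remaining, key=lambda p: uncertainty(models, p[0]),
                        reverse=True)
        k = min(batch_size, n_total - len(train))
        train.extend(ranked[:k])
        remaining = ranked[k:]
    return train


if __name__ == "__main__":
    # Toy pool: a noisy linear relationship with 200 labeled points.
    rng = random.Random(1)
    pool = [(float(i), 2.0 * i + rng.gauss(0, 1)) for i in range(200)]
    subset = select_informative_subset(pool, n_initial=5, n_total=20, batch_size=5)
    print(f"kept {len(subset)} of {len(pool)} points")
```

In a realistic setting the line fit would be replaced by whatever property-prediction model is in use, and the uncertainty estimate by that model's own measure; the loop structure (train, score the pool, add the most uncertain points) is the part that carries over.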
Publisher
Nature Communications
Published On
Nov 10, 2023
Authors
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers
Tags
data redundancy
machine learning
active learning
materials datasets
prediction performance