Abstract
This study reveals significant redundancy in large materials datasets: up to 95% of the data can be removed from machine learning training sets with minimal impact on in-distribution prediction performance. The redundant data is linked to over-represented material types and does not improve out-of-distribution performance. Uncertainty-based active learning efficiently builds smaller yet informative datasets. The research challenges the "bigger is better" approach, emphasizing information richness over data volume for efficient data acquisition and machine learning.
Publisher
Nature Communications
Published On
Nov 10, 2023
Authors
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers
Tags
data redundancy
machine learning
active learning
materials datasets
prediction performance