Exploiting redundancy in large materials datasets for efficient machine learning with less data

Engineering and Technology

Discover research by Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, and Jason Hattrick-Simpers revealing that up to 95% of a materials dataset can be eliminated without sacrificing prediction accuracy. The study challenges conventional wisdom by demonstrating that less can indeed be more when it comes to machine learning datasets.

Abstract
This study reveals significant redundancy in large materials datasets, showing that up to 95% of data can be removed from machine learning training with minimal impact on in-distribution prediction performance. Redundant data is linked to over-represented material types and doesn't improve out-of-distribution performance. Uncertainty-based active learning efficiently creates smaller, informative datasets. The research challenges the "bigger is better" approach, emphasizing information richness over data volume for efficient data acquisition and machine learning.
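The uncertainty-based active learning mentioned in the abstract can be sketched as an iterative loop: train a model on a small labelled pool, score the unlabelled candidates by predictive uncertainty, and acquire the most uncertain ones. The sketch below uses synthetic data and the spread of per-tree predictions in a random forest as the uncertainty proxy; these choices are illustrative assumptions, not the models or datasets used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a materials dataset: 500 samples, 10 features.
X = rng.normal(size=(500, 10))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=500)

# Start from a small randomly labelled pool; the rest are candidates.
labelled = list(range(20))
pool = [i for i in range(500) if i not in labelled]

for _ in range(5):  # five acquisition rounds
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[labelled], y[labelled])
    # Uncertainty proxy: std. dev. of per-tree predictions on the pool.
    per_tree = np.stack([t.predict(X[pool]) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    # Acquire the 10 most uncertain candidates.
    picked = np.argsort(uncertainty)[-10:]
    for j in sorted(picked, reverse=True):
        labelled.append(pool.pop(j))
```

After five rounds the labelled set has grown from 20 to 70 samples, chosen for informativeness rather than at random, which is the mechanism by which a small curated subset can match the performance of a much larger one.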
Publisher
Nature Communications
Published On
Nov 10, 2023
Authors
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers
Tags
data redundancy
machine learning
active learning
materials datasets
prediction performance