Exploiting redundancy in large materials datasets for efficient machine learning with less data

Engineering and Technology

Discover research by Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, and Jason Hattrick-Simpers revealing that up to 95% of a materials dataset can be eliminated without sacrificing prediction accuracy. The study challenges conventional wisdom by demonstrating that less can indeed be more when it comes to machine learning datasets.

Abstract
This study reveals significant redundancy in large materials datasets, showing that up to 95% of data can be removed from machine learning training with minimal impact on in-distribution prediction performance. Redundant data is linked to over-represented material types and doesn't improve out-of-distribution performance. Uncertainty-based active learning efficiently creates smaller, informative datasets. The research challenges the "bigger is better" approach, emphasizing information richness over data volume for efficient data acquisition and machine learning.
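The uncertainty-based active learning mentioned in the abstract can be sketched as an iterative loop: train a model on a small labelled pool, score the unlabelled candidates by predictive uncertainty, and acquire the most uncertain ones. The sketch below uses synthetic data and the spread of per-tree predictions in a random forest as the uncertainty proxy; these choices are illustrative assumptions, not the models or datasets used in the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for a materials dataset: 500 samples, 10 features.
X = rng.normal(size=(500, 10))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=500)

# Start from a small randomly labelled pool; the rest are candidates.
labelled = list(range(20))
pool = [i for i in range(500) if i not in labelled]

for _ in range(5):  # five acquisition rounds
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[labelled], y[labelled])
    # Uncertainty proxy: std. dev. of per-tree predictions on the pool.
    per_tree = np.stack([t.predict(X[pool]) for t in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    # Acquire the 10 most uncertain candidates.
    picked = np.argsort(uncertainty)[-10:]
    for j in sorted(picked, reverse=True):
        labelled.append(pool.pop(j))
```

After five rounds the labelled set has grown from 20 to 70 samples, chosen for informativeness rather than at random, which is the mechanism by which a small curated subset can match the performance of a much larger one.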
Publisher
Nature Communications
Published On
Nov 10, 2023
Authors
Kangming Li, Daniel Persaud, Kamal Choudhary, Brian DeCost, Michael Greenwood, Jason Hattrick-Simpers
Tags
data redundancy
machine learning
active learning
materials datasets
prediction performance