MD-HIT: Machine learning for material property prediction with dataset redundancy control

Computer Science

Q. Li, N. Fu, et al.

Qin Li, Nihang Fu, Sadman Sadeed Omee, and Jianjun Hu address the problem of redundancy in materials datasets. Their MD-HIT algorithm reduces the number of highly similar materials in a dataset, so that machine learning performance on formation energy and band gap prediction is evaluated more realistically.

Abstract
Materials datasets often contain many redundant (highly similar) materials due to historical tinkering in materials design. This redundancy biases performance evaluation under random splits, overestimating predictive accuracy and harming out-of-distribution (OOD) generalization. Inspired by CD-HIT in bioinformatics, we propose MD-HIT, a redundancy reduction algorithm for material datasets, with composition- (MD-HIT-composition) and structure-based (MD-HIT-structure) variants. On formation energy and band gap prediction tasks using composition-only and structure-based models, we show that enforcing redundancy control reduces apparent test performance compared to highly redundant settings but yields evaluations that better reflect true predictive capability.
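The redundancy-control idea in the abstract can be illustrated with a minimal sketch. It assumes a CD-HIT-style greedy filter: each material is represented by a feature vector, and a material is kept only if it lies at least a chosen distance from every material already kept. The function name `greedy_redundancy_filter`, the Euclidean distance, the threshold value, and the toy vectors are all illustrative assumptions, not the authors' implementation or the MD-HIT API.

```python
import math


def greedy_redundancy_filter(vectors, threshold):
    """Greedy, CD-HIT-style filter (illustrative sketch only).

    Walks through `vectors` in order and keeps an item only if its
    Euclidean distance to every previously kept item is >= `threshold`.
    Returns the indices of the kept items.
    """
    kept = []
    for i, v in enumerate(vectors):
        if all(math.dist(v, vectors[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Hypothetical composition-like feature vectors (element fractions).
samples = [
    (0.50, 0.50, 0.00),
    (0.51, 0.49, 0.00),  # near-duplicate of sample 0
    (0.00, 0.50, 0.50),
    (0.33, 0.33, 0.34),
    (0.01, 0.50, 0.49),  # near-duplicate of sample 2
]

kept = greedy_redundancy_filter(samples, threshold=0.1)
print(kept)  # [0, 2, 3]
```

With a threshold of 0.1, the two near-duplicates are dropped and three mutually dissimilar samples remain. Splitting the filtered set into training and test data then avoids the case where a test material has an almost identical twin in the training set, which is the source of the overestimated accuracy the abstract describes.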
Publisher
npj Computational Materials
Published On
Oct 18, 2024
Authors
Qin Li, Nihang Fu, Sadman Sadeed Omee, Jianjun Hu
Tags
materials datasets
machine learning
redundancy reduction
MD-HIT
performance evaluations
formation energy
band gap prediction