Materials datasets often contain many redundant (highly similar) materials, which inflates the measured performance of machine learning (ML) models. This paper surveys cases of overestimated ML performance in materials science and proposes MD-HIT, a redundancy reduction algorithm. Applied to formation energy and band gap prediction, MD-HIT shows that controlling redundancy yields lower but more realistic performance estimates.
Publisher
npj Computational Materials
Published On
Oct 18, 2024
Authors
Qin Li, Nihang Fu, Sadman Sadeed Omee, Jianjun Hu
Tags
materials datasets
machine learning
redundancy reduction
MD-HIT
performance evaluations
formation energy
band gap prediction