
Computer Science
MD-HIT: Machine learning for material property prediction with dataset redundancy control
Q. Li, N. Fu, et al.
Discover how Qin Li, Nihang Fu, Sadman Sadeed Omee, and Jianjun Hu tackle the challenge of redundancy in materials datasets. Their innovative MD-HIT algorithm offers a fresh perspective on machine learning performance evaluations in materials science, ensuring more realistic outcomes in formation energy and band gap predictions.
Introduction
The paper addresses a pervasive issue in materials informatics: redundancy in large materials databases (e.g., Materials Project, OQMD) leads to information leakage under random train/test splits and inflated performance claims for ML models. Many reported near-DFT accuracies rely on test samples highly similar to training data, obscuring models’ ability to extrapolate to genuinely novel materials—a key goal in materials discovery. The study’s purpose is to quantify and mitigate redundancy-driven overestimation by introducing MD-HIT algorithms to construct non-redundant datasets for objective evaluation, particularly emphasizing extrapolative/OOD performance. The authors hypothesize that controlling dataset redundancy will decrease in-distribution (ID) metrics but improve the realism of evaluations and may enhance performance on true OOD samples.
Literature Review
Prior works have highlighted overestimation and limited extrapolative power in materials ML. Meredig et al. introduced LOCO-CV to assess extrapolation across material clusters, showing models struggle to generalize across families. Xiong et al. proposed K-fold forward CV (FCV), revealing weak performance when extrapolating across property ranges; similar observations were made for thermal conductivity prediction. Omee et al. benchmarked OOD performance of structure-GNNs, finding significant degradation out of distribution. On data efficiency and redundancy, Magar and Farimani proposed adaptive sampling using high-error samples to build smaller, more informative training sets; Li et al. reported that large fractions of data (up to 95%) can be pruned with minor impact on random-split performance but little benefit for OOD generalization. However, these approaches are property-specific and use varying thresholds, complicating standardized benchmarks. In bioinformatics, CD-HIT enforces sequence-identity thresholds (e.g., 95%) to limit redundancy; analogous controls have been lacking in materials science. This study fills that gap with property-agnostic redundancy control for both composition and structure datasets and rigorous OOD definitions.
Methodology
MD-HIT algorithms adopt a greedy, incremental selection inspired by CD-HIT. Starting from a seed material (default H2O), candidates are sorted by number of atoms and added to the non-redundant set only if their distance to every current representative exceeds a threshold; otherwise they are discarded as redundant.
- Composition similarity (MD-HIT-composition): Distances are computed with the ElMD package across linear, chemically derived, and machine-learned descriptors. Defaults: Mendeleev similarity and MatScholar embedding distance (Euclidean in MatScholar feature space). Other options include petti, atomic, mod_petti, oliynyk, oliynyk_sc, jarvis, jarvis_sc, magpie, magpie_sc, cgcnn, element, mat2vec, and megnet16.
- Structure similarity (MD-HIT-structure): Two fixed-length descriptors define distances between crystals with varying atom counts: (i) XRD-based features from Pymatgen’s XRDCalculator, Gaussian-smoothed and sampled into 900-dimensional vectors, compared by Euclidean distance; (ii) 1024-dimensional OrbitalFieldMatrix (OFM) electronic-structure descriptors, also compared by Euclidean distance.
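The greedy selection loop described above can be sketched as follows; the function name and the use of plain Euclidean distance over precomputed fixed-length descriptors (e.g., MatScholar embeddings or 900-d XRD vectors) are illustrative, not the authors’ exact implementation:

```python
import numpy as np

def md_hit_select(features, threshold, seed_index=0):
    """Greedy CD-HIT-style redundancy reduction (sketch).

    features: (n_samples, d) array of fixed-length material descriptors,
    assumed already sorted by number of atoms (as in the paper).
    A candidate joins the non-redundant set only if its Euclidean distance
    to every current representative exceeds `threshold`; otherwise it is
    discarded as redundant.
    """
    selected = [seed_index]
    for i in range(len(features)):
        if i == seed_index:
            continue
        # distance from candidate i to all current representatives
        dists = np.linalg.norm(features[selected] - features[i], axis=1)
        if np.all(dists > threshold):
            selected.append(i)
    return selected
```

Raising the threshold keeps fewer, more mutually distant representatives, which is exactly the knob the paper sweeps (e.g., MatScholar 0.05–0.4, XRD 0.5–0.9).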
Dataset preparation: 125,619 MP structures (89,354 unique compositions) were collected; polymorphs were averaged for properties except formation energy (minimum used). Formulas with >50 atoms were removed, yielding 86,741 composition entries. Redundancy-controlled datasets were generated at multiple thresholds: Mendeleev (e.g., 0.5–3.0), MatScholar (e.g., 0.05–0.4), XRD (0.5–0.9), OFM (0.15–0.7). t-SNE maps (using MatScholar descriptors) visualize how redundancy removal reduces crowding while preserving broad coverage.
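The polymorph-collapsing step can be illustrated with a minimal sketch (function and property names are hypothetical): properties are averaged across polymorphs sharing a composition, except formation energy, where the minimum (most stable phase) is kept.

```python
from collections import defaultdict

def aggregate_polymorphs(records, property_name):
    """Collapse polymorphs sharing a formula into one entry per composition.

    records: list of (formula, property_value) pairs.
    Formation energy keeps the minimum over polymorphs; all other
    properties are averaged (mirroring the paper's preparation protocol).
    """
    groups = defaultdict(list)
    for formula, value in records:
        groups[formula].append(value)
    agg = min if property_name == "formation_energy" else (lambda v: sum(v) / len(v))
    return {formula: agg(values) for formula, values in groups.items()}
```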
Evaluation tasks and models: Two properties—formation energy (eV/atom) and band gap (eV). Composition-only models: Roost and CrabNet. Structure-based models: ALIGNN and DeeperGATGNN. Datasets are split 8:1:1 into train/val/test. For OOD analysis, two OOD sets were defined:
- MatscholarOOD: 1000 samples drawn from sparse (low-density) regions of MatScholar feature space; used for RF baselines and initial deep-model tests, though not truly OOD with respect to deep models’ latent spaces.
- EmbeddingOOD: 1000 samples far from their three nearest neighbors in a Roost latent space (pretrained encoder), giving an OOD definition relevant to deep models.
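The EmbeddingOOD construction, selecting the samples farthest from their nearest neighbors in a model’s latent space, can be sketched as below; this is a naive O(n²) version with illustrative names, assuming `latent` holds pretrained-encoder embeddings (e.g., from Roost):

```python
import numpy as np

def select_ood(latent, k=3, n_ood=1000):
    """Return indices of the n_ood samples whose mean distance to their
    k nearest neighbors in latent space is largest (sparsest samples)."""
    diff = latent[:, None, :] - latent[None, :, :]
    dmat = np.linalg.norm(diff, axis=-1)     # pairwise Euclidean distances
    np.fill_diagonal(dmat, np.inf)           # ignore self-distance
    knn = np.sort(dmat, axis=1)[:, :k]       # k nearest-neighbor distances
    score = knn.mean(axis=1)                 # sparsity score per sample
    return np.argsort(score)[-n_ood:]        # indices of the sparsest samples
```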
Metrics: MAE and R².
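For reference, the two reported metrics have the standard definitions:

```python
import numpy as np

def mae_r2(y_true, y_pred):
    """Mean absolute error and coefficient of determination (R^2)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mae = np.abs(y_true - y_pred).mean()
    ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
    return mae, 1.0 - ss_res / ss_tot
```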
Bias checks: For OFM-based band gap datasets, the fraction of near-zero band gap samples (<0.01 eV) was quantified to explain counterintuitive trends. Density–error analyses (50 bins) were conducted using model latent-space densities to relate MAE to local sample density. Code and non-redundant datasets are available at https://github.com/usccolumbia/MD-HIT.
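The density–error analysis can be sketched as quantile binning of samples by local latent-space density, reporting the mean absolute error per bin (names are illustrative; the paper uses 50 bins):

```python
import numpy as np

def mae_by_density(density, abs_err, n_bins=50):
    """Bin samples by local density (quantile edges) and return the
    mean absolute error within each bin, low density to high density."""
    edges = np.quantile(density, np.linspace(0.0, 1.0, n_bins + 1))
    bin_idx = np.clip(np.searchsorted(edges, density, side="right") - 1,
                      0, n_bins - 1)
    return np.array([
        abs_err[bin_idx == b].mean() if np.any(bin_idx == b) else np.nan
        for b in range(n_bins)
    ])
```

A downward-sloping curve (higher MAE in low-density bins) is the pattern the paper reports across models and test sets.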
Key Findings
- Redundancy control lowers apparent ID performance: As Mendeleev or MatScholar thresholds increase (fewer, less redundant samples), Roost and CrabNet show decreasing R² and increasing MAE for formation energy; for band gap, R² generally decreases while MAE trends can be noisier due to outliers and class imbalance (many zero-gap samples).
- Structure-based trends mirror composition-based: With higher XRD or OFM thresholds, ALIGNN and DeeperGATGNN exhibit lower R² and higher MAE for formation energy. For band gap with OFM thresholds, R² decreases as expected, but MAE paradoxically decreases due to selection bias toward near-zero band gap samples.
- Quantified band gap class imbalance under OFM control: Whole dataset includes 48.64% near-zero band gap samples; MD-HIT-OFM selects 64.09% (τ=0.15), 67.81% (0.2), 84.52% (0.45), and 92.43% (0.7), explaining lower MAEs at higher thresholds.
- OOD performance improves with non-redundant training (deep model): On EmbeddingOOD for formation energy, Roost trained on non-redundant data (Roost_nr) outperforms redundant-trained (Roost_red):
• Roost_nr: MAE 0.3679 eV/atom, R² 0.6998 (random-split: MAE 0.1994, R² 0.8544)
• Roost_red: MAE 0.4089 eV/atom, R² 0.5318 (random-split: MAE 0.1278, R² 0.9168)
This corresponds to ~10.03% MAE improvement and 31.6% R² improvement for OOD when training on non-redundant data.
- RF models benefit on MatScholar-defined OOD: For 1000 MatScholar-OOD samples:
• RF_nr: random-split MAE 0.4330, R² 0.6382; OOD MAE 0.6427, R² 0.2668.
• RF_red: random-split MAE 0.4014, R² 0.7094; OOD MAE 0.7562, R² 0.0527.
Non-redundant training yields substantially better OOD R² (0.2668 vs 0.0527) and lower OOD MAE.
- Error–density relationship: Across models and test sets, MAE decreases with increasing local sample density in latent space; OOD MAEs exhibit larger variance. Models trained on non-redundant data show more consistent performance between ID and OOD density curves.
- MatScholarOOD not truly OOD for deep models: Deep models’ latent spaces differ from MatScholar, explaining why non-redundant deep models did not outperform on MatScholarOOD, while RF (using MatScholar features directly) did.
Discussion
Findings confirm that redundancy in materials datasets inflates ID performance under random splits by populating test sets with near-duplicates of training samples. Enforcing distance-based redundancy control produces more realistic estimates and shifts attention from interpolation to extrapolation, aligning evaluation with materials discovery goals. The observed gains on true OOD (EmbeddingOOD) for models trained on non-redundant data demonstrate that pruning redundancy can improve generalization to novel regions by preventing models from overfitting to dense clusters. The analysis of OFM band gap bias underscores the need to monitor label and class distributions when applying redundancy filters; seed selection and thresholding can inadvertently skew datasets. The density–error analyses provide mechanistic evidence that lower local density correlates with higher errors, reinforcing the rationale for redundancy-aware benchmarking. In practice, redundancy thresholds can be tuned as hyperparameters via validation to balance data coverage, bias, and evaluation realism.
Conclusion
The paper introduces MD-HIT-composition and MD-HIT-structure, greedy redundancy-reduction algorithms for materials datasets using composition and structure distances, respectively. Applying MD-HIT across formation energy and band gap tasks with state-of-the-art composition-only (Roost, CrabNet) and structure-based (ALIGNN, DeeperGATGNN) models shows consistent degradation of ID metrics with reduced redundancy, offering more realistic performance estimates. Crucially, training on non-redundant data improves OOD generalization for both deep and classical models when OOD is defined in the models’ relevant feature spaces. The authors release code and non-redundant datasets to facilitate standardized, redundancy-aware benchmarking. Future work should further quantify the linkage between redundancy level and generalization, refine OOD definitions, mitigate selection biases (e.g., seed choice and class imbalance), and optimize the MD-HIT implementation for scalability.
Limitations
- Potential selection bias: Choice of seed material (e.g., SrTiO3) and distance thresholds can bias property distributions (e.g., increased near-zero band gap prevalence at high OFM thresholds), impacting MAE trends.
- OOD definition sensitivity: OOD sets defined in MatScholar space are not necessarily OOD in deep models’ latent spaces, complicating cross-model comparisons.
- Property-agnostic thresholds lack a universally accepted standard; threshold tuning is required per task/dataset.
- Computational efficiency: The current multi-threaded MD-HIT implementation remains slow for very large datasets and may require further optimization.
- Averaging polymorph properties (except formation energy) may obscure structure-specific effects for certain properties.