Efficient data-driven exploration of the stable material space

Chemistry

H. Wang, S. Botti, et al.

H.-C. Wang, S. Botti, and M.A.L. Marques present a data-mining method for scanning the vast landscape of stable materials. The approach integrates component prediction with chemical similarity to identify thousands of previously unknown stable compounds.

Introduction

The study addresses the challenge of exploring the vast composition and structure space of inorganic materials, where exhaustive searches are infeasible. Prior strategies have used chemical constraints to filter unlikely compounds and machine learning models trained on DFT data to predict stability and properties. Building on component substitution concepts and inspired by data-mined ionic substitutions, the authors propose scanning the space of possible stable materials using a data-driven chemical similarity approach. The core hypothesis is that elements with high chemical similarity can substitute within known crystal prototypes to yield new stable compounds. This approach aims to efficiently generate candidate materials across dimensionalities (including bulk and reduced-dimensional systems) while mitigating combinatorial explosion.

Literature Review

Seminal prior work used machine learning to predict stability via ionic substitutions, with DFT validation a posteriori (Hautier et al., 2011). Other studies have combined DFT with machine learning to assess thermodynamic stability across families (e.g., perovskites) and intermetallics. Databases such as AFLOW and Materials Project enable high-throughput exploration. Chemical similarity scales derived from data mining (a modified Pettifor scale) provide statistically grounded element replaceability beyond heuristic periodic table proximity. Prior high-throughput efforts have also targeted Heuslers and perovskites. Three known limitations of such workflows are relevant here. Band gaps calculated with PBE are systematically underestimated. Identifying the magnetic ground state typically requires larger supercells and higher accuracy than standard high-throughput workflows provide. Hardness models based on bond strength and graph Laplacians extend earlier formulations but can overestimate hardness for certain molecular crystals.

Methodology
  • Candidate generation via chemical similarity: Starting from an initial set of elemental, binary, ternary, quaternary, and quinary compounds from a materials database, the approach generates new compositions by substituting elements according to a data-mined chemical similarity scale (ref. 33). An element replaceability threshold of approximately 5% is adopted as a compromise between discovery rate and total yield: higher thresholds (e.g., ~20%) increase the fraction of stable hits per iteration but reduce the total number of stable compounds by a factor of ~3.
  • Iterative transmutation: Multiple iterations (three reported) of substitutions produce successive generations of candidate structures. Before calculation, entries already present in the database (same composition and space group) are filtered; however, geometry relaxations can change space group, potentially coinciding with database entries post-relaxation.
  • DFT calculations and stability: For each new candidate, geometry optimization and total energies are computed; thermodynamic stability is assessed via the distance to the convex hull (E_hull), constructed from compounds in the database only (so previously unknown stable structures can appear with negative E_hull until the hull is recomputed including them). Calculations that failed to converge are excluded.
  • Property evaluation: Electronic band gaps are computed using the PBE functional (noting typical ~45% underestimation). Magnetic calculations start from an initial ferromagnetic configuration, consistent with common high-throughput practice, so the true ground state (FM, AFM, or ferrimagnetic) is not resolved. Mechanical hardness is estimated using the model of Zhang et al., a Laplacian-matrix-based extension of bond-strength models (Šimůnek–Vackář); bulk and shear moduli are also computed (PBE) to put the hardness values in context.
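
The candidate-generation and filtering steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `REPLACEABILITY` table, its numbers, and the function names `substitutes` and `generate_candidates` are all hypothetical, and a real workflow would use the data-mined similarity scale and full crystal structures rather than bare compositions.

```python
from itertools import product

# Hypothetical replaceability table: REPLACEABILITY[a][b] is an assumed
# fraction of cases in which element a can be replaced by element b.
# The numbers are illustrative placeholders, not values from the paper.
REPLACEABILITY = {
    "Na": {"K": 0.12, "Li": 0.08, "Cs": 0.03},
    "Cl": {"Br": 0.15, "F": 0.04},
}


def substitutes(element, threshold):
    """Return the element itself plus every replacement scoring >= threshold."""
    swaps = [other for other, score in REPLACEABILITY.get(element, {}).items()
             if score >= threshold]
    return [element] + sorted(swaps)


def generate_candidates(composition, space_group, known, threshold=0.05):
    """Enumerate substituted compositions for one prototype.

    composition: dict mapping element symbol -> count in the formula unit
    known: set of (frozenset of (element, count) pairs, space group) keys
           describing entries already present in the database
    """
    counts = list(composition.values())
    options = [substitutes(element, threshold) for element in composition]
    candidates = []
    for combo in product(*options):
        if len(set(combo)) < len(combo):
            continue  # two sites would collapse onto the same element
        key = (frozenset(zip(combo, counts)), space_group)
        if key in known:
            continue  # same composition and space group already in the database
        candidates.append((dict(zip(combo, counts)), space_group))
    return candidates


# Toy database: rock-salt NaCl and KCl (space group 225) are already known.
known = {
    (frozenset({("Na", 1), ("Cl", 1)}), 225),
    (frozenset({("K", 1), ("Cl", 1)}), 225),
}
loose = generate_candidates({"Na": 1, "Cl": 1}, 225, known, threshold=0.05)
strict = generate_candidates({"Na": 1, "Cl": 1}, 225, known, threshold=0.10)
print(len(loose), "candidates at 5%;", len(strict), "candidates at 10%")
```

Feeding the surviving candidates back in as new starting entries would correspond to the next iteration. In this toy setting, raising the threshold shrinks the candidate pool, which mirrors the trade-off between hit rate and total yield described above.
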
Key Findings
  • Scale of exploration: 189,981 new structures (not in the database) were calculated across three iterations: Loop 1: 59,853; Loop 2: 50,917; Loop 3: 79,211. Composition breakdown (Totals): elemental 459 (0.24%), binary 26,571 (14.0%), ternary 139,550 (73.5%), quaternary 22,725 (12.0%), quinary 676 (0.36%).
  • Prototype prevalence: First generation included 713 distinct prototypes; the most common was cubic full-Heusler (ABC2), totaling 10,653 systems. Second generation remained Heusler-dominant (4,238 systems). Third generation’s most common prototype shifted to the hexagonal ZrNiAl–Fe2P structure (5,009 compounds).
  • Stability distribution: The E_hull histogram for all 189,981 compounds shows an enhanced fraction with negative distances (relative to a hull built from the existing database only) and a rapid decay for positive distances, indicating enrichment in likely stable materials. A large peak at zero corresponds to substitutions that reproduce database materials (often via relaxation to more symmetric space groups present in the database).
  • Stable set size: The stable set (on the convex hull) contains 18,479 compounds.
  • Band gaps: 4,840 systems have PBE gaps > 0.1 eV, representing 26.1% of stable systems. Because PBE underestimates gaps, ~5–10% false negatives are anticipated (≈250–500 semiconductors likely misclassified as metals).
    • Largest PBE gaps (~7.8 eV) occur in tetragonal ternary fluorides LiLnF4 (Ln = Tm, Dy, Ho, Tb, Er, Sm, Nd, Pr; decreasing gap in that order). Nonmetal element occurrence in semiconductors/insulators is dominated by O and F, followed by other halogens and chalcogens; among metals, heavy alkali metals Cs, Rb, K are most common in semiconductors/insulators.
  • Magnetism: 22.6% of stable systems are magnetic (4,187 out of 18,479), with counts rapidly decreasing at higher magnetization. The most common metallic elements in magnetic systems are Pu, Eu, Gd, Mn, Fe, Np, Ge, Ce, Ni, Co; the most common nonmetals are O and F. The highest magnetization is ~0.2 μB/Å^3 for cubic SnGd3. Magnetic systems span 253 prototypes; the cubic double-perovskite is most common (215 magnetic systems).
  • Magnetic semiconductors: 884 systems (4.8% of the stable set). This is close to the 5.9% expected if magnetism and a finite gap were statistically independent (22.6% × 26.1%), suggesting little correlation between magnetism and semiconducting behavior. In the Materials Project database, a similar lack of correlation is observed (expected 14.4% vs. observed 12.1%). The most common prototype among magnetic semiconductors is the cubic double-perovskite (75 systems).
  • Hardness and elastic moduli: Most systems are soft; a handful are hard/superhard (H_V > 40 GPa). Reported examples with PBE bulk (B) and shear (G) moduli (GPa): VRu2Sn (H_V 41.5; B 210.8; G 85.8), CrGeRu2 (H_V 58.3; B 235.3; G 117.3), MnH2 (H_V 64.4; B 133.6; G 49.6). Several oxides show likely overestimated hardness due to model limitations and unusual bonding.
  • Threshold choice: A ~20% replaceability threshold would maximize the fraction of stable hits per iteration but reduce the total number of stable compounds by ~3×; a ~5% threshold provides a better overall yield.
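
The counts reported in this list can be cross-checked with a short script. It uses only the numbers quoted above; the variable names are arbitrary, and ratios recomputed from the counts may differ from the quoted percentages in the last digit because of rounding.

```python
# Counts quoted in the Key Findings list.
per_loop = {"Loop 1": 59_853, "Loop 2": 50_917, "Loop 3": 79_211}
per_arity = {
    "elemental": 459,
    "binary": 26_571,
    "ternary": 139_550,
    "quaternary": 22_725,
    "quinary": 676,
}
total = sum(per_loop.values())

stable = 18_479
with_gap = 4_840            # PBE gap > 0.1 eV
magnetic = 4_187
magnetic_with_gap = 884

# Marginal fractions of the stable set.
p_gap = with_gap / stable
p_mag = magnetic / stable

# If magnetism and a finite gap were independent, the joint fraction
# would be the product of the two marginal fractions.
expected_joint = p_gap * p_mag
observed_joint = magnetic_with_gap / stable

print(f"total structures: {total}")
for name, count in per_arity.items():
    print(f"{name:>10}: {100 * count / total:.2f}%")
print(f"expected joint fraction: {100 * expected_joint:.1f}%")
print(f"observed joint fraction: {100 * observed_joint:.1f}%")
```

Both breakdowns sum to the same total, and the product of the two marginal fractions reproduces the independence estimate quoted for magnetic semiconductors.
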
Discussion

The non-Gaussian E_hull distribution with enrichment near and below the hull, contrasted with systematic substitution baselines, supports the effectiveness of data-mined chemical similarity for targeting likely stable compositions. The iterative substitution strategy efficiently explores prototype spaces (e.g., Heuslers, ZrNiAl–Fe2P) and uncovers a substantial stable set (18,479 compounds), including thousands of semiconductors and magnetic materials. Band-gap trends align with electronegativity expectations (fluorides/oxides/chlorides exhibiting larger gaps). The independence between magnetism and band-gap presence indicates that co-discovery of magnetic semiconductors scales with the product of individual probabilities; nevertheless, the identified 884 candidates provide fertile ground for discovering transparent ferromagnets/antiferromagnets with potentially high critical temperatures, pending refined magnetic ground-state determination and improved gap predictions. Mechanical screening highlights a few promising hard/superhard candidates (e.g., VRu2Sn, CrGeRu2, MnH2), while also illustrating limitations for oxides and molecular crystals. Overall, the approach balances breadth and hit-rate by tuning the replaceability threshold and demonstrates broad applicability across composition and prototype spaces.

Conclusion

The study introduces a data-driven exploration of inorganic materials via chemical similarity-guided component substitution. Applied at scale, it generated 189,981 new candidate structures, yielding 18,479 predicted-stable compounds, 4,840 semiconductors/insulators (Eg > 0.1 eV), 4,187 magnetic systems, and 884 magnetic semiconductors. It identified expected and unexpected prototype distributions (e.g., Heuslers, ZrNiAl–Fe2P) and highlighted several potential hard/superhard materials. The method’s success, evidenced by stability distributions enriched near/under the convex hull, underscores the value of data-mined chemical similarity for efficient materials discovery. Future work should: (i) refine magnetic phase identification (include AFM/ferrimagnetic configurations and larger supercells), (ii) improve band-gap accuracy (e.g., beyond-PBE methods) and reassess properties, (iii) recompute convex hulls including newly predicted stable phases, and (iv) further validate hardness predictions, particularly for oxides and molecular crystals.

Limitations
  • Band-gap underestimation: PBE gaps are underestimated by ~45% on average, leading to ~5–10% false negatives for semiconductors.
  • Magnetic states: Calculations initialize ferromagnetically and do not systematically resolve AFM/ferrimagnetic ground states. Energy differences among magnetic orders can be on the order of meV/atom, and resolving them accurately requires larger supercells and higher computational cost.
  • Convex hull reference: Hulls are built from the existing database; newly discovered stable structures may initially appear with negative E_hull until hulls are updated.
  • Hardness model: The Zhang et al. Laplacian-based model can overestimate hardness for molecular crystals and certain oxides with unusual oxidation/bonding; oxides’ stability may also be overestimated in standard GGA/GGA+U frameworks.
  • Computational failures: Some calculations did not converge and were excluded, potentially biasing composition/prototype coverage.
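
The convex-hull caveat in this list can be made concrete with a toy binary A–B system. The sketch below is a minimal illustration, assuming made-up formation energies (eV/atom) and hypothetical helper names `lower_hull` and `hull_energy`; it is not the workflow used in the study. It shows a new phase sitting at a negative distance from a hull built only from database entries, and at zero distance once the hull is recomputed with that phase included.

```python
def lower_hull(points):
    """Lower convex hull of (x, energy) points, x = fraction of element B."""
    best = {}
    for x, energy in points:
        best[x] = min(energy, best.get(x, energy))  # lowest energy per composition
    hull = []
    for p in sorted(best.items()):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()  # middle point lies on or above the chord, so drop it
            else:
                break
        hull.append(p)
    return hull


def hull_energy(x, hull):
    """Linearly interpolate the hull energy at composition x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x1 <= x <= x2:
            t = (x - x1) / (x2 - x1)
            return y1 + t * (y2 - y1)
    raise ValueError("composition outside the range covered by the hull")


# Illustrative database: elements A and B plus one known compound AB.
database = [(0.0, 0.0), (0.5, -0.40), (1.0, 0.0)]
candidate = (0.25, -0.30)  # hypothetical new A3B phase

# Distance to a hull built from database entries only: negative.
hull_db = lower_hull(database)
e_hull_before = candidate[1] - hull_energy(candidate[0], hull_db)

# Distance after the hull is recomputed with the new phase included: zero.
hull_new = lower_hull(database + [candidate])
e_hull_after = candidate[1] - hull_energy(candidate[0], hull_new)

print(f"E_hull vs database-only hull: {e_hull_before:+.3f} eV/atom")
print(f"E_hull vs updated hull:       {e_hull_after:+.3f} eV/atom")
```

For multicomponent systems the same idea applies in higher dimensions, where a dedicated phase-diagram library would normally be used instead of this one-dimensional scan.
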