Machine-learned metrics for predicting the likelihood of success in materials discovery

Y. Kim, E. Kim, et al.

This paper by Yoolhee Kim, Edward Kim, Erin Antono, Bryce Meredig, and Julia Ling presents two metrics for assessing design spaces in materials discovery: the predicted fraction of improved candidates (PFIC) and the cumulative maximum likelihood of improvement (CMLI). Both are computed from the initial training data and are intended to indicate, with high precision, whether a design space is likely to contain improved materials.

Introduction

The study addresses the underexplored question of selecting high-quality design spaces for materials discovery, i.e., determining whether a given "haystack" contains enough potential "needles" (improved materials) to warrant investment. While prior work has emphasized model accuracy for predicting material properties, the authors hypothesize that the inherent quality of the design space—quantified by the fraction of improved candidates (FIC)—critically determines discovery success and the number of experiments required. They aim to develop predictive, machine-learned metrics that estimate design space quality a priori, enabling researchers to prioritize projects likely to yield improvements efficiently.

Literature Review

Previous research has highlighted the role of machine learning in navigating vast materials spaces and optimizing experimental campaigns. Model accuracy has been a common success proxy (e.g., Ward et al., 2016; Ward et al., 2018), but recent studies (Jia et al., 2019; Kauwe et al., 2019) indicate that the underlying difficulty and bias in chemical and materials spaces strongly influence discovery outcomes, sometimes allowing random strategies to perform comparably to human-guided ones. Work by Meredig et al. (2018) and others has shown that training/test selection and clustered, human-biased data distributions can significantly affect generalization and discovery practicality. Despite these insights, a general quantitative framework to assess design space quality in terms of discovery likelihood remained lacking, motivating the metrics proposed here.

Methodology

The authors simulate sequential learning (active learning) for materials discovery using benchmark datasets spanning computational and experimental sources: Materials Project, Harvard Clean Energy Project, Melting Points, Superconductors, UCSB Thermoelectrics, and Strehlow & Cook.

A novel k-means clustering-based data initialization partitions datasets into training sets and design spaces to mimic realistic, clustered experimental histories. Clusters are ranked by best-performing candidate; subsets are assigned to training versus design clusters, and training clusters are partially subsampled so the design space includes both interpolative and extrapolative candidates. Difficulty is modulated by varying the fraction of improved clusters placed in the design space and by whether strong candidates from training clusters are held out.

Machine learning models: all simulations use the open-source lolo random forest, fitting a linear model at each leaf to enable extrapolation beyond the training range. Default hyperparameters are used with minimum 20 samples per leaf; number of trees equals training size, max depth 30, and L2-regularized leaf linear models. Uncertainty estimates combine jackknife-based methods with an explicit bias model.

Acquisition functions for simulated sequential learning include maximum likelihood of improvement (MLI) and maximum expected improvement (MEI). For each dataset, 50–100 different initializations are tested; for each split, 20 stochastic trials are run, each up to 50 iterations or until an improved candidate is found.

Design space quality metric (ground truth): FIC, the fraction of design-space candidates outperforming the best training point (for maximize/minimize/tune objectives as appropriate).
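
The ground-truth FIC defined above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function name `fraction_improved`, its arguments, and the strict inequalities are assumptions, as is the reading of the "tune" objective as "strictly closer to a target value than the best training point".

```python
from typing import Optional, Sequence


def fraction_improved(
    train_values: Sequence[float],
    design_values: Sequence[float],
    objective: str = "maximize",
    target: Optional[float] = None,
) -> float:
    """Fraction of design-space candidates that beat the best training point."""
    if len(train_values) == 0 or len(design_values) == 0:
        raise ValueError("train_values and design_values must be non-empty")

    if objective == "maximize":
        best = max(train_values)
        n_improved = sum(1 for v in design_values if v > best)
    elif objective == "minimize":
        best = min(train_values)
        n_improved = sum(1 for v in design_values if v < best)
    elif objective == "tune":
        if target is None:
            raise ValueError("objective='tune' requires a target value")
        # Assumption: "improved" means strictly closer to the target
        # than the closest training point.
        best_gap = min(abs(v - target) for v in train_values)
        n_improved = sum(1 for v in design_values if abs(v - target) < best_gap)
    else:
        raise ValueError(f"unknown objective: {objective!r}")

    return n_improved / len(design_values)


# Hypothetical toy values, chosen only to exercise the three objectives.
train = [1.0, 2.0, 3.0]
design = [2.5, 3.5, 4.0, 1.0]
fic_max = fraction_improved(train, design, "maximize")
fic_min = fraction_improved(train, design, "minimize")
fic_tune = fraction_improved(train, design, "tune", target=2.6)
```

In a real campaign the design-space values are unknown before the experiments are run, so this quantity is only available in retrospective simulations such as the ones described here.
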
Predictive metrics computed from initial training only (without iterative acquisition):
  • PFIC: predicted fraction of improved candidates, defined as the fraction of design-space points whose predicted performance exceeds the best training performance (for maximization; inverted appropriately for minimization). This requires a model capable of extrapolation (here, linear-at-leaf random forests).
  • CMLI: cumulative maximum likelihood of improvement for the top n candidates (n=10 in benchmarks), defined as 1 minus the product over the top-n candidates of (1 - L(x_i)), where L(x_i) is the predicted probability (under the model’s predictive mean and uncertainty, assumed normal) that candidate i exceeds the best training performance. Independence of top-n events is assumed, acknowledging potential overestimation.

Both metrics are evaluated across data splits; correlations with true FIC, precision/recall tradeoffs, and ROC/AUC are analyzed.

A combined evaluation system is proposed: compute PFIC (extrapolative model) and CMLI (uncertainty-calibrated model), then threshold at user-chosen t_PFIC and t_CMLI to designate high-, low-, or unknown-quality design spaces.
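
The two metric definitions above can be sketched as follows for a maximization objective. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the model's predictive means and standard deviations are taken as given inputs, and the "top n" candidates are assumed to be the n with the highest individual likelihood of improvement.

```python
import math
from typing import Sequence


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pfic(pred_means: Sequence[float], best_train: float) -> float:
    """Fraction of design points predicted to exceed the best training value."""
    return sum(1 for m in pred_means if m > best_train) / len(pred_means)


def likelihood_of_improvement(mean: float, std: float, best_train: float) -> float:
    """P(candidate > best_train) under a normal predictive distribution."""
    if std <= 0.0:
        # Degenerate case: no predictive uncertainty.
        return 1.0 if mean > best_train else 0.0
    return 1.0 - normal_cdf((best_train - mean) / std)


def cmli(
    pred_means: Sequence[float],
    pred_stds: Sequence[float],
    best_train: float,
    n: int = 10,
) -> float:
    """1 - prod(1 - L(x_i)) over the n candidates with the highest L(x_i)."""
    likelihoods = sorted(
        (
            likelihood_of_improvement(m, s, best_train)
            for m, s in zip(pred_means, pred_stds)
        ),
        reverse=True,
    )[:n]
    prob_no_improvement = 1.0
    for p in likelihoods:
        # Independence of the top-n events is assumed, as in the text.
        prob_no_improvement *= 1.0 - p
    return 1.0 - prob_no_improvement


# Hypothetical toy predictions: two candidates predicted exactly at the
# best training value, each with unit predictive standard deviation.
toy_pfic = pfic([1.0, 2.0, 3.0, 4.0], best_train=2.5)
toy_cmli = cmli([0.0, 0.0], [1.0, 1.0], best_train=0.0, n=2)
```

In the toy case each candidate has a 0.5 likelihood of improvement, so the chance that at least one improves is 1 - 0.5 × 0.5 = 0.75 under the independence assumption. If the two predictions were strongly correlated, the true probability would be closer to 0.5, which is the overestimation the text refers to.
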

Key Findings
  • Sequential learning success (fewer iterations to first improvement) is strongly and monotonically related to true design space quality (FIC): higher FIC design spaces require fewer iterations and exhibit lower variance; lower FIC spaces require many more iterations and show high variance. Sequential learning generally outperforms random selection (baseline expected iterations ≈ 1/FIC), with an observed outlier (formation energy) showing bimodality due to local poor-performance neighbors.
  • Predictive metric correlations with true FIC across 10 properties and 6 datasets (50–100 initializations each, 10 trials per initialization): PFIC Pearson r ≈ 0.38; Top-10 CMLI Pearson r ≈ 0.31; others were lower (e.g., PFIC−10σ: 0.20; ratio of extrapolated/all: 0.29; top-5 CMLI: 0.27; predicted fraction of improved interpolated candidates: 0.11).
  • PFIC performance: ROC AUC ≈ 0.62 for classifying high-quality design spaces (FIC > 0.04). Precision-recall tradeoff shows that increasing t_PFIC improves precision but reduces recall; example threshold t_PFIC = 0.2 is highlighted.
  • CMLI performance: ROC AUC ≈ 0.65 for classifying low-quality design spaces (FIC < 0.04). Precision-recall tradeoff shows ability to flag some low-quality spaces; example threshold t_CMLI = 0.7 is highlighted.
  • Combined design space evaluation system (example thresholds t_PFIC = 0.2, t_CMLI = 0.7) acts as a high-precision, low-recall classifier: High-quality identification precision ≈ 0.94, recall ≈ 0.06; Low-quality identification precision ≈ 0.96, recall ≈ 0.23. Many design spaces fall into an "unknown" category, but those flagged as high or low quality are usually correct.
  • Model quality had a weaker relationship to iterations-to-improvement compared with design space quality (per supplementary analyses), underscoring the dominant role of FIC in discovery difficulty.
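
Two of the quantities in the findings above lend themselves to a short Python sketch: the random-selection baseline of roughly 1/FIC iterations, and the combined thresholding system. Both functions are illustrative assumptions rather than the authors' code. The summary does not spell out the decision rule, so the direction of each comparison (PFIC above t_PFIC means high quality, CMLI below t_CMLI means low quality) and the order in which the two checks are applied are assumed here.

```python
def expected_random_iterations(fic: float) -> float:
    """Expected number of random draws until the first improved candidate.

    Uses the geometric approximation 1/FIC, which treats every draw as an
    independent trial with success probability FIC.
    """
    if not 0.0 < fic <= 1.0:
        raise ValueError("fic must be in (0, 1]")
    return 1.0 / fic


def classify_design_space(
    pfic_value: float,
    cmli_value: float,
    t_pfic: float = 0.2,
    t_cmli: float = 0.7,
) -> str:
    """Label a design space as 'high', 'low', or 'unknown' quality.

    Assumed rule: a PFIC above t_pfic flags a discovery-rich space;
    otherwise a CMLI below t_cmli flags a discovery-poor space;
    anything else is left unlabeled.
    """
    if pfic_value > t_pfic:
        return "high"
    if cmli_value < t_cmli:
        return "low"
    return "unknown"


# At the FIC = 0.04 cutoff used in the benchmarks, random selection
# would need about 25 draws on average under this approximation.
baseline = expected_random_iterations(0.04)

# Hypothetical metric values, chosen only to hit each branch.
labels = [
    classify_design_space(0.35, 0.90),
    classify_design_space(0.05, 0.40),
    classify_design_space(0.10, 0.85),
]
```

The default thresholds mirror the example values t_PFIC = 0.2 and t_CMLI = 0.7 highlighted above. Tightening either threshold trades recall for precision, which is why many design spaces end up in the "unknown" category.
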
Discussion

The findings support the central hypothesis that design space quality (FIC) is a key determinant of discovery difficulty in sequential learning, often outweighing the impact of model accuracy. By introducing PFIC and CMLI, the work provides practical tools to estimate design space quality from initial training data, usable in principle with any model that supplies extrapolative predictions (for PFIC) and uncertainty estimates (for CMLI). This enables researchers and organizations to triage projects, allocate resources, and balance portfolios by predicted difficulty and likelihood of success. PFIC helps surface discovery-rich spaces with many potential improvements, while CMLI flags discovery-poor spaces likely to be unproductive. The combined system identifies promising or unpromising design spaces with high precision, even though recall is limited, thereby reducing wasted experimental effort and accelerating ML-driven materials development.

Conclusion

This work introduces two machine-learned, predictive metrics—PFIC and CMLI—that estimate design space quality a priori and correlate with the true fraction of improved candidates. Through realistic clustering-based data initialization and extensive simulated sequential learning across diverse datasets, the study shows that these metrics can identify both discovery-rich and discovery-poor design spaces and, when combined, provide a high-precision framework for design space selection. The approach advances materials informatics from predicting properties to predicting project difficulty and likelihood of success. Future research directions include: exploring additional predictive metrics; directly mapping metrics to expected iterations-to-improvement; updating metrics throughout sequential learning; studying the effects of model class, accuracy, and uncertainty calibration; and extending to multi-objective optimization.

Limitations

The evaluation relies on in silico simulations and specific datasets; generalization to experimental practice, while plausible, remains to be fully validated. The CMLI calculation assumes independence among top-n candidates and normally distributed uncertainties, which can overestimate improvement likelihood in correlated or non-Gaussian settings. Reported classifier performance is moderate (AUC ~0.62–0.65) with high precision but low recall, leaving many design spaces as unknown quality. Results depend on data initialization choices (clustering, splits) and the specific modeling approach (random forest with linear leaves and jackknife+bias uncertainties); alternative models may change performance. The current framework addresses single-objective optimization; multi-objective scenarios are not handled. True design space FIC is unknown in practice, and thresholds (e.g., 4% improvement rate) are application-dependent.
