A general approach for determining applicability domain of machine learning models


L. E. Schultz, Y. Wang, et al.

Discover a new, general method for determining where machine-learning predictions are trustworthy by measuring feature-space dissimilarity with kernel density estimation. This approach identifies chemically dissimilar groups, links high dissimilarity to large prediction errors and unreliable uncertainty estimates, and includes automated tools for setting dissimilarity thresholds for in-domain versus out-of-domain decisions. Research conducted by Lane E. Schultz, Yiqi Wang, Ryan Jacobs, and Dane Morgan.

Introduction
The study addresses the challenge of determining when predictions from machine learning (ML) models can be trusted, especially when inputs differ from the training distribution. In materials science, ML usage has grown rapidly since ~2015, and reliable deployment requires quantifying prediction quality: (i) accuracy (low residuals), (ii) calibrated uncertainty, and (iii) the ability to classify inputs as in-domain (ID) or out-of-domain (OD). While domain adaptation can sometimes adjust models to shifted distributions, it is often costly, limited in scope, and may not generalize to unknown domains. The authors therefore focus on domain classification: given a trained property predictor M_prop and a test feature vector x, predict whether x is ID or OD. They frame this as a supervised learning problem requiring ground-truth ID/OD labels, despite there being no universal definition of domain. Existing domain methods (thresholded descriptors, single continuous regions, convex-hull-based boundaries) can be complex, may exclude valid disjoint ID regions, or may include sparse regions. The paper proposes a simple, general approach grounded in kernel density estimation (KDE) to quantify dissimilarity in feature space, hypothesizing that low-density regions correspond to OD behavior and degraded model reliability. The study defines several plausible ground truths (chemistry-based, residual-based, RMSE-based, and uncertainty-calibration-based) and evaluates whether a KDE-derived dissimilarity can predict these labels across multiple models and datasets.
Literature Review
Prior applicability-domain approaches include: (1) per-feature thresholding based on error reduction (effective but complex, and sensitive to feature prioritization and threshold choices); (2) defining a single connected low-error region that balances coverage against error (can miss multiple disjoint reliable regions); and (3) reducing features to five dimensions with PCA, constructing a convex-hull boundary from a low-error percentile, and measuring distance to the hull (can include large sparse regions as ID). Distance-based methods (e.g., nearest-neighbor distances) correlate with errors but suffer from non-uniqueness in distance definitions and in aggregation across many points, and they often ignore data sparsity. KDE and density-based methods naturally account for sparsity and complex topologies; Gaussian process methods share some benefits but can be costlier to fit. Prior KDE use in materials showed that many supposed extrapolations were actually interpolations and that residuals grow in low-density regions, but it did not directly classify ID/OD. For uncertainty, extensive work on calibration and evaluation exists for regression (e.g., ensemble-based calibration, z-score diagnostics), with metrics such as sharpness and dispersion; this work employs a miscalibration area comparing the empirical z CDF to a standard normal. Overall, the literature suggests potential but also limitations in prior applicability-domain methods, motivating a simple, fast, and general KDE-based classifier.
Methodology
Model components and notation: M_prop is a regression model predicting y from X; M_unc calibrates prediction uncertainties from ensembles; M_disc computes a dissimilarity score d(x) from training features using KDE; M_dom is a domain classifier that labels points ID or OD based on a threshold d*. Training data are denoted In-The-Bag (ITB); evaluation uses Out-Of-Bag (OOB) data from cross-validation and cluster-based splits.

Property error metrics: (1) the per-point normalized absolute residual E^{|y-ŷ|/MAD_y} = |y−ŷ|/MAD_y; and (2) the group-wise normalized RMSE E^{RMSE/σ_y} = RMSE/σ_y, with RMSE computed over a set of points. A naïve predictor that always outputs the mean yields both metrics equal to 1.0, forming baseline cutoffs.

Uncertainty model and metric: Ensembles produce a mean prediction ŷ and an uncalibrated spread σ_u; calibration following Palmer et al. yields σ_c. Repeated 5-fold CV provides residuals and σ_u for calibration. Uncertainty quality is assessed by the miscalibration area E^{area} = ∫ |Φ_emp(z) − Φ(z; 0, 1)| dz with z = (y−ŷ)/σ_c, comparing the empirical CDF of z to the standard normal CDF; lower is better. A naïve baseline with ŷ = mean(y) and σ = σ_y sets the E_c^{area} threshold on evaluation sets.

KDE-based dissimilarity (M_disc): Fit a KDE on standardized X_ITB using scikit-learn with StandardScaler, an Epanechnikov kernel, and a bandwidth estimated automatically (sklearn.cluster.estimate_bandwidth). Convert log-likelihoods to likelihoods and define d(x) = 1 − KDE(x)/max_a KDE(a), with the maximum taken over ITB points. Thus d ∈ [0, 1], with 0 at the densest region and 1 in zero-density regions.

Domain classifier (M_dom): Predict ID if d < d*, else OD. Train by selecting the single cutoff d* that maximizes F1 (F1_max) on OOB labels from the ground truths (chemistry, E^{|y-ŷ|/MAD_y}, E^{RMSE/σ_y}, E^{area}). Precision-recall curves and AUC-Baseline (improvement over a naïve always-ID predictor) quantify performance. If all OOB points are OD, set d* < 0; if all are ID, set d* > 1.

Ground-truth definitions: Four label sources: (i) chemical intuition (E^{chem}), with curated ID (chemically similar) versus OD (chemically dissimilar) groups; (ii) per-point residuals E^{|y-ŷ|/MAD_y} with a cutoff of 1.0 (better than naïve = ID); (iii) grouped RMSE E^{RMSE/σ_y} with a cutoff of 1.0 on d-binned OOB data; and (iv) uncertainty miscalibration E^{area} with a cutoff equal to the baseline miscalibration for the set. Labels: ID if E < E_c, OD otherwise. Note that the RMSE and area metrics require binning OOB points by d into N_bins (10 used) for robust statistics; per-point residuals do not.

Data splitting to generate OOB: Repeated 5-fold CV and Bootstrapped Leave-One-Cluster-Out (BLOCO). For BLOCO, the data are pre-clustered with agglomerative clustering, and one cluster at a time (with bootstrapped replicates) is held out as OOB. Nested CV calibrates uncertainties on inner folds while M_prop, M_unc, and M_disc are trained on the outer train splits. For the chemical assessment, Leave-One-Out CV is used on the Original set with fixed non-Original OOB. Class imbalance in the chemical labels is addressed by resampling to balance ID/OD for evaluation.

Models and datasets: M_prop types include Random Forest (RF), bagged support vector regressor (BSVR), bagged neural network (BNN), and bagged ordinary least squares (BOLS). Datasets: (1) Diffusion activation energies (DFT) with elemental features; (2) Reactor pressure vessel (RPV) Fluence embrittlement with composition and irradiation features; (3) Steel Strength yield strength with elemental-fraction features; (4) Superconductor T_c with three classes (Cuprates, Iron-Based, Low-T_c); (5) the synthetic Friedman function. Features were generated and selected via MAST-ML where applicable.
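A minimal sketch of the KDE-based dissimilarity and single-threshold domain classifier described above, written with scikit-learn; the function names (fit_dissimilarity, dissimilarity, choose_threshold) and the candidate-threshold grid are illustrative assumptions rather than the authors' implementation:

```python
# Sketch only: assumes numeric feature matrices X_itb (training/ITB) and X_oob
# (held-out/OOB) and boolean ground-truth labels is_id for the OOB points.
import numpy as np
from sklearn.cluster import estimate_bandwidth
from sklearn.metrics import f1_score
from sklearn.neighbors import KernelDensity
from sklearn.preprocessing import StandardScaler


def fit_dissimilarity(X_itb):
    """Fit M_disc: a KDE on standardized in-the-bag (ITB) features."""
    scaler = StandardScaler().fit(X_itb)
    Xs = scaler.transform(X_itb)
    bandwidth = estimate_bandwidth(Xs)  # automatic bandwidth, as described above
    kde = KernelDensity(kernel="epanechnikov", bandwidth=bandwidth).fit(Xs)
    max_likelihood = np.exp(kde.score_samples(Xs)).max()  # densest ITB point
    return scaler, kde, max_likelihood


def dissimilarity(X, scaler, kde, max_likelihood):
    """d(x) = 1 - KDE(x) / max_a KDE(a): 0 at the densest ITB region, 1 where density is zero."""
    likelihood = np.exp(kde.score_samples(scaler.transform(X)))
    return 1.0 - likelihood / max_likelihood


def choose_threshold(d_oob, is_id):
    """Select the cutoff d* that maximizes F1 for the ID class on OOB labels.

    Edge cases from the text: if every OOB point is OD, d* is set below 0;
    if every OOB point is ID, d* is set above 1.
    """
    if not np.any(is_id):
        return -0.01
    if np.all(is_id):
        return 1.01
    candidates = np.unique(d_oob)
    f1 = [f1_score(is_id, d_oob < c) for c in candidates]
    return candidates[int(np.argmax(f1))]
```

At inference, a new point x would then be labeled ID when d(x) < d* and OD otherwise. For the residual-based ground truth, the OOB labels could be formed as, e.g., `is_id = np.abs(y_oob - y_pred_oob) / mad_y < 1.0`, where `mad_y` (a hypothetical variable name) is the MAD of the training targets.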
Implementation details: scikit-learn for models and KDE, Keras for neural networks, UMAP for visualization, and MAST-ML for feature engineering and selection. Time complexity was measured empirically: the end-to-end assessment scales roughly as O(n^2) in the number of data points (because of the full workflow) and approximately linearly in the number of features (RF case); KDE queries backed by KD-trees or Ball trees scale as O(log n) to O(n log n), but the pipeline includes additional steps beyond the KDE itself.
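As a companion sketch, the miscalibration area E^{area} defined above can be computed by comparing the empirical CDF of z = (y − ŷ)/σ_c to the standard normal CDF; the integration grid and trapezoidal rule below are illustrative choices, not necessarily those used by the authors:

```python
import numpy as np
from scipy.stats import norm


def miscalibration_area(y, y_pred, sigma, n_grid=1000):
    """E^area: area between the empirical CDF of z = (y - y_pred) / sigma and
    the standard normal CDF. Lower is better; 0 means perfect calibration."""
    z = (np.asarray(y) - np.asarray(y_pred)) / np.asarray(sigma)
    grid = np.linspace(z.min() - 1.0, z.max() + 1.0, n_grid)
    empirical_cdf = np.searchsorted(np.sort(z), grid, side="right") / z.size
    return np.trapz(np.abs(empirical_cdf - norm.cdf(grid)), grid)
```

Calling the same function with ŷ = mean(y) and σ = σ_y would give the naïve-baseline area E_c^{area} used as the cutoff for the uncertainty-based ID/OD labels.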
Key Findings
- KDE dissimilarity d correlates strongly with domain-relevant behavior across datasets and models: as d increases, chemical dissimilarity grows, residuals increase, RMSE rises, and uncertainty calibration deteriorates.
- Chemical domain (A^{chem}): Violin plots show that ID groups have lower d than OD groups. Reported F1_max values range from 0.71 to 1.00 across materials (e.g., Fluence achieves near-perfect separation with d_c ≈ 0.99; Steel Strength achieves precision 1.00 and recall 0.98 at F1 ≈ 0.99). For Diffusion, a conservative threshold d_c = 0.45 attains high precision (~0.95 target) at lower recall; F1_max occurs at d_c just below 1.0 with precision ~0.64 and recall ~0.80.
- Residual-based per-point assessment (A^{|y-ŷ|/MAD_y}): AUC-Baseline is positive for nearly all model–dataset pairs, indicating added information over a naïve baseline. Nearly all F1_max values exceed 0.7 (18/20 ≥ 0.7), with many optimal thresholds at d = 1.00, which effectively filters zero-density points that concentrate large residuals. Example ranges include precision ~0.71–0.99 and recall ~0.68–1.00 depending on dataset and model.
- Grouped RMSE assessment (A^{RMSE/σ_y}): Very strong separability; most pairs reach F1_max = 1.00 (16/20 perfect, the remainder >0.7 except a few lower values for Superconductor with BSVR/RF). Bins at higher d reliably exceed the baseline threshold (ID vs OD split), enabling clear domain thresholds (e.g., d_c ≈ 0.85 in one example cleanly separates ID and OD bins).
- Uncertainty quality assessment (A^{area}): E^{area} generally increases with d. AUC-Baseline is positive in 18/20 cases; about half of the combinations achieve precision and recall of 1.00, with F1_max ≥ 0.7 in 11/20 cases (four slightly below). Two combinations (Steel Strength with BNN; Diffusion with BOLS) underperform the naïve baseline, reflecting dependence on the quality of uncertainty estimation.
- Failure/caution scenarios demonstrated: (i) unlearnable targets (shuffled y) lead to all-OD labels, preventing a meaningful d*; (ii) uncalibrated uncertainties (σ_u) degrade A^{area}; (iii) BLOCO can fail if clusters are not sufficiently distinct (FWODC), breaking OOB label generation and the d–error monotonicity; (iv) high-dimensional feature spaces can reduce KDE effectiveness: empirically, A^{|y-ŷ|/MAD_y} F1_max declines modestly as the feature count grows to 1000, and ≤50 features performed best.
- Practical scalability: The end-to-end assessment scales roughly as O(n^2) with the number of data points (due to the full pipeline) and approximately linearly with the number of features (RF case).
Discussion
The results validate the core hypothesis that KDE-derived density in feature space is an effective proxy for a model’s domain of applicability. Across diverse materials datasets and model classes, higher dissimilarity (lower density) is consistently associated with poor predictive accuracy and degraded uncertainty calibration. This enables a simple domain classifier, M_dom, based on a single threshold d*, to flag likely OD inputs at inference using only features. Compared to prior methods requiring complex region construction, convex hulls, or multiple thresholds, KDE naturally handles disjoint ID regions and accounts for data sparsity. The quantitative gains over naïve baselines (positive AUC-Baseline values and high F1_max across assessments) indicate that d captures much of the privileged information embedded in chemical intuition, residual magnitudes, and uncertainty validity. The approach further provides actionable thresholds (e.g., d near 1.0 to discard zero-density points, dataset-specific d_c for precision-recall trade-offs) that can serve as automated guardrails. While binning and uncertainty calibration choices can influence outcomes, and high-dimensional feature spaces may degrade KDE efficacy, the method remains robust in typical tabular regression applications in materials science, and is readily deployable (including via MAST-ML).
Conclusion
This work introduces a general, simple, and effective approach to determine the applicability domain of ML regression models using a KDE-based dissimilarity measure and a single learned threshold for ID/OD classification. The method consistently correlates feature-space density with chemical similarity, prediction errors, and uncertainty calibration quality across multiple datasets (Diffusion, Fluence, Steel Strength, Superconductor, Friedman) and model types (RF, BSVR, BNN, BOLS), delivering strong F1 scores and improvements over naïve baselines. The approach is fast to fit, topology-agnostic, and practical for inference-time screening of new inputs, with code and tools provided (including MAST-ML integration). Future directions include: (i) exploring more expressive M_dom models leveraging d and additional features; (ii) optimizing or adapting KDE bandwidths and addressing high-dimensional settings (e.g., via feature selection or dimensionality reduction); (iii) automating bin selection or devising bin-free strategies to avoid potential leakage in grouped metrics; (iv) extending the approach to unstructured data by operating in learned latent spaces; and (v) investigating domain thresholds under application-specific precision–recall requirements.
Limitations
- No universal ground truth for ID/OD: chemical labels rely on expert curation; residual-, RMSE-, and uncertainty-based labels depend on baseline cutoffs and, for grouped metrics, on binning, which can introduce minor data leakage and sensitivity to the bin count.
- Dependence on uncertainty quality: A^{area} performance can degrade if uncertainty calibration is poor (e.g., using σ_u instead of σ_c), limiting utility for uncertainty-based domain assessments.
- Splitting strategy sensitivity: BLOCO requires sufficiently distinct clusters; if clustering fails (e.g., FWODC), OOB labels and d–error trends degrade.
- High dimensionality: KDE can struggle in very high-dimensional feature spaces; empirical results show performance dropping with many features, recommending feature selection (≈50 or fewer relevant features when feasible).
- Scope: Demonstrations focus on tabular regression problems; extension to unstructured data requires defining an appropriate feature (latent) space. Overall pipeline scalability (approximately O(n^2) with dataset size) may limit very large-scale assessments.