Quantifying Distribution Shifts and Uncertainties for Enhanced Model Robustness in Machine Learning Applications

Computer Science


V. Flovik

This study by Vegard Flovik examines the challenges posed by distribution shifts in machine learning. Using synthetic data generated with the van der Waals equation, the research evaluates methods for quantifying distribution shift and model uncertainty, highlighting the Mahalanobis distance as a practical per-sample indicator for improving model robustness and flagging unreliable predictions.

Introduction
The paper addresses the challenge of distribution shift (differences in statistical properties between training and test data) that undermines model reliability and generalization in real-world applications, e.g., wind turbines, medical imaging, autonomous driving, and recommender systems. It reviews prior evidence that distributional disparities degrade performance and outlines adaptation approaches (adversarial training, discrepancy-based methods, data augmentation) and uncertainty quantification tools (Bayesian methods, conformal prediction). The study's purpose is to systematically quantify distribution shifts and their impact on accuracy and uncertainty using synthetic datasets. It poses three research questions: (1) How can distribution shift and data similarity be quantified? (2) How well do models generalize across datasets with different distributions? (3) What uncertainties arise from train–test distribution differences, and how can they be quantified? To answer them, the paper proposes a controlled synthetic-data framework based on the van der Waals equation to evaluate similarity metrics (KL divergence, Jensen–Shannon distance, Mahalanobis distance) and to relate them to prediction accuracy and uncertainty.
Literature Review
The paper situates its work within literature on dataset shift and domain adaptation, highlighting empirical evidence that distribution mismatch harms generalization (Shimodaira; Liu et al.). It surveys adaptation techniques such as adversarial domain training (Ganin et al.) and discrepancy-based approaches (Long et al.), and data augmentation strategies like mixup (Zhang et al.) and GANs (Goodfellow et al.). For uncertainty quantification, it contrasts Bayesian methods and conformal prediction (Gawlikowski et al.; Brown and White), and emphasizes the importance of evaluating uncertainty under dataset shift (Ovadia et al.). Visualization and divergence-based similarity measures (t-SNE; KL/JS divergence) are commonly used to understand shifts and their effects.
Methodology
Study design: two controlled synthetic experiments using gas laws to modulate distribution properties while measuring similarity and uncertainty.

Data generation:
- Equation: van der Waals, (P + aN^2/V^2)(V − Nb) = NRT, with R = 0.0821 L·atm/(mol·K). Gas-specific constants (a, b) taken from standard references for: ideal gas (a = b = 0), H2, He, Ne, Ar, Xe, N2, O2, CO2, CH4.
- Features: temperature (T), volume (V), and moles (N) drawn from Gaussian distributions; the target, pressure (P), is computed via the equation.

Experiment 1 (changes in feature–target correlations / target drift):
- Purpose: assess how deviations from ideal-gas behavior (via different a, b) affect similarity and model accuracy when training on ideal-gas data.
- Inputs: identical feature distributions for all gases: T ~ N(μT = 300 K, σT = 25 K), V ~ N(μV = 50 L, σV = 5 L), N ~ N(μN = 15 mol, σN = 1 mol).
- Datasets: one training set (ideal gas) and multiple test sets (real gases with different a, b). An in-distribution subset of the ideal-gas data is held out for evaluation. Approximately 10,000 samples were generated (noted in the discussion of finite-sample variability).
- Similarity metrics: KL divergence and Jensen–Shannon (JS) distance computed between the training (ideal-gas) pressure distribution and that of each gas.
- Evaluation: train a regression model on ideal-gas data; evaluate on each gas with mean absolute percentage error (MAPE) and analyze correlation with KL/JS.

Experiment 2 (covariate shift / feature-distribution drift):
- Purpose: assess how shifts in the feature distributions (T, V, N) affect accuracy and uncertainty when the feature–target relation remains ideal-gas consistent (a = b = 0 in both train and test).
- Dataset 1 (train / in-distribution subset): T ~ N(273 K, 50 K), V ~ N(10 L, 1 L), N ~ N(10 mol, 1 mol).
- Dataset 2 (test / out-of-distribution): T ~ N(300 K, 50 K), V ~ N(9 L, 1.5 L), N ~ N(11 mol, 1 mol).
- Similarity metrics: KL and JS between train and each test set; additionally, the Mahalanobis distance of each test point to the training feature distribution (using the training mean and covariance).
- Uncertainty: Monte Carlo Dropout at inference with N = 100 stochastic forward passes; collect the predictive mean and standard deviation per sample; analyze the relationships among prediction error, predictive uncertainty, and Mahalanobis distance. The 95th percentile of the training Mahalanobis distribution is used as a cutoff to flag extrapolation.

Model architecture:
- Input layer matching the feature shape.
- Dense layers: 64, 64, and 32 neurons with ELU activations.
- Custom dropout layer (rate 0.1) kept active during inference to support MC Dropout.
- Output layer: single linear neuron for regression.
- Implementation code: https://github.com/veflo/uncert_quant

Analysis:
- Visualizations (t-SNE; distributions of features/pressure; predicted vs. actual).
- Correlation analyses: KL/JS vs. MAPE; Mahalanobis distance vs. MAPE; MC Dropout predictive standard deviation vs. MAPE.
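The data-generation step described above can be sketched as follows: solve the van der Waals equation for P and draw (T, V, N) from the stated Gaussians. This is a minimal illustration, not code from the paper's repository; the CO2 constants (a ≈ 3.6 L²·atm/mol², b ≈ 0.043 L/mol) are approximate textbook values used only as an example of a non-ideal gas.

```python
import numpy as np

R = 0.0821  # gas constant, L·atm/(mol·K)

def vdw_pressure(T, V, N, a=0.0, b=0.0):
    """Van der Waals equation solved for pressure:
    P = NRT/(V - Nb) - a*N^2/V^2. Setting a = b = 0 recovers
    the ideal gas law P = NRT/V."""
    return N * R * T / (V - N * b) - a * N**2 / V**2

rng = np.random.default_rng(42)
n = 10_000  # roughly the sample size used in the paper

# Experiment 1 feature distributions, identical for every gas
T = rng.normal(300, 25, n)  # temperature, K
V = rng.normal(50, 5, n)    # volume, L
N = rng.normal(15, 1, n)    # moles, mol

P_ideal = vdw_pressure(T, V, N)                # training target (ideal gas)
P_co2 = vdw_pressure(T, V, N, a=3.6, b=0.043)  # approximate CO2 constants
```

Repeating the last line with each gas's (a, b) pair yields the family of test sets whose pressure distributions are then compared against the ideal-gas training distribution.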
Key Findings
Experiment 1 (target drift via gas-specific deviations from ideal behavior):
- Pressure distributions across gases differ subtly to substantially; t-SNE visualization showed no strong clustering, consistent with modest deviations for many gases.
- Quantitative similarity correlated with accuracy: higher KL divergence generally aligned with higher MAPE; JS distance also correlated, but less linearly (likely due to its bounded [0, 1] range).
- Representative results (KL, JS, MAPE) relative to the ideal-gas training distribution:
  • Ideal gas (held-out subset): KL 0.10, JS 0.14, MAPE 0.48.
  • Neon: KL 0.14, JS 0.17, MAPE 0.54 (close to the training distribution, low error).
  • Nitrogen: KL 0.33, JS 0.26, MAPE 0.73; Oxygen: KL 0.57, JS 0.33, MAPE 0.86; Argon: KL 0.52, JS 0.32, MAPE 0.84; Helium: KL 0.55, JS 0.33, MAPE 0.76; Hydrogen: KL 0.31, JS 0.25, MAPE 0.64; Methane: KL 1.75, JS 0.49, MAPE 1.56.
  • Carbon dioxide: KL 5.49, JS 0.73, MAPE 3.30; Xenon: KL 6.11, JS 0.75, MAPE 3.55 (largest shifts and highest errors).
- A clear near-linear trend was observed between KL divergence and prediction error (MAPE); the trend with JS distance was weaker.

Experiment 2 (covariate shift in feature distributions):
- Train vs. test distribution differences (ideal gas in both) yielded: in-distribution subset KL 0.26, JS 0.16, MAPE 0.26; out-of-distribution KL 1.20, JS 0.47, MAPE 0.60.
- Prediction deviations were most pronounced in low- and high-pressure regions that were sparsely represented or absent in the training data.
- Mahalanobis distance correlated strongly with prediction error: samples beyond the 95th percentile of the training Mahalanobis distribution showed increased errors and uncertainty.
- MC Dropout predictive standard deviation correlated with error, but less consistently than Mahalanobis distance; Mahalanobis provided a computationally efficient per-sample reliability indicator.
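The MC Dropout mechanism behind these uncertainty estimates can be sketched framework-free. The sketch below uses random placeholder weights standing in for the trained 64-64-32 ELU network described in the methodology (the weights are not trained parameters); only the mechanism, dropout kept active across repeated forward passes, mirrors the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random placeholder weights for a 3 -> 64 -> 64 -> 32 -> 1 network
# matching the architecture described above (not trained parameters).
sizes = [3, 64, 64, 32, 1]
weights = [rng.normal(0, 0.1, (m, k)) for m, k in zip(sizes[:-1], sizes[1:])]

def elu(x, alpha=1.0):
    """ELU activation used by the paper's hidden layers."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0)) - 1))

def forward(x, rate=0.1):
    """One stochastic forward pass: dropout stays active at inference."""
    h = x
    for W in weights[:-1]:
        h = elu(h @ W)
        mask = rng.random(h.shape) > rate  # drop units with probability `rate`
        h = h * mask / (1.0 - rate)        # inverted-dropout rescaling
    return h @ weights[-1]                 # linear output neuron

def mc_dropout_predict(x, n_passes=100):
    """Predictive mean and std over N stochastic passes (N = 100 in the paper)."""
    preds = np.stack([forward(x) for _ in range(n_passes)])
    return preds.mean(axis=0), preds.std(axis=0)

x = rng.normal(size=(5, 3))  # five (T, V, N) feature vectors
mean, std = mc_dropout_predict(x)
```

The per-sample standard deviation is the quantity correlated with MAPE in the findings above: larger spread across the stochastic passes signals less reliable predictions.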
Overall: Divergence-based measures (KL, JS) track dataset-level distribution shifts and anticipate accuracy degradation; Mahalanobis distance provides an effective per-point indicator of extrapolation risk and elevated error/uncertainty, complementing MC Dropout.
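A common way to estimate the dataset-level divergences reported above is to bin the two pressure samples on a shared grid and compare the normalized histograms. The binned estimator and the placeholder pressure samples below are assumptions for illustration, not the paper's exact procedure; base-2 logs keep the JS distance bounded in [0, 1], matching the bounded range discussed above.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) in bits."""
    return np.sum(p * np.log2(p / q))

def js_distance(p, q):
    """Jensen-Shannon distance (square root of the base-2 JS
    divergence), bounded in [0, 1]."""
    m = 0.5 * (p + q)
    return np.sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))

def histogram_divergences(x_train, x_test, bins=100, eps=1e-10):
    """Bin both 1-D samples on a shared range, normalize, and
    return (KL, JS) between the resulting histograms."""
    lo = min(x_train.min(), x_test.min())
    hi = max(x_train.max(), x_test.max())
    h_tr, _ = np.histogram(x_train, bins=bins, range=(lo, hi))
    h_te, _ = np.histogram(x_test, bins=bins, range=(lo, hi))
    p = h_tr / h_tr.sum() + eps  # eps avoids log(0) in empty bins
    q = h_te / h_te.sum() + eps
    return kl_divergence(p, q), js_distance(p, q)

rng = np.random.default_rng(0)
p_train = rng.normal(7.4, 0.8, 10_000)  # stand-in "pressure" samples
p_test = rng.normal(6.5, 1.0, 10_000)   # shifted distribution
kl, js = histogram_divergences(p_train, p_test)
```

Note that, as the Limitations section points out, finite samples give non-zero divergences even when two sets are drawn from the same distribution, which is why the held-out in-distribution subsets above do not score exactly zero.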
Discussion
The findings address the research questions by demonstrating that: (1) Distribution shift can be quantified effectively using KL divergence and JS distance at the dataset level, and Mahalanobis distance at the per-sample level; (2) Generalization degrades predictably as datasets diverge from the training distribution—larger KL/JS values correspond to higher MAPE in Experiment 1; (3) Uncertainty increases with distributional departure—both predictive spread from MC Dropout and per-sample Mahalanobis distance rise with error, with Mahalanobis showing stronger correlation and offering practical thresholds (e.g., 95th percentile) to distinguish interpolation vs. extrapolation regimes. These results underscore the value of integrating distribution shift monitoring with uncertainty estimation to guide deployment decisions, trigger safeguards, and inform data collection or model updates in dynamic environments. KL/JS serve as early warning signals for domain drift, while Mahalanobis distance provides actionable, per-input reliability assessment, complementing MC Dropout’s epistemic uncertainty.
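The per-sample extrapolation flag discussed above can be sketched directly from the training mean and covariance. The feature distributions below follow the Experiment 2 setup, and the 95th-percentile cutoff mirrors the threshold described in the paper; everything else (sample sizes, seeds) is illustrative.

```python
import numpy as np

def mahalanobis_distances(X, X_train):
    """Mahalanobis distance of each row of X to the training
    feature distribution (training mean and covariance)."""
    mu = X_train.mean(axis=0)
    cov = np.cov(X_train, rowvar=False)
    cov_inv = np.linalg.inv(cov)
    diff = X - mu
    # Per-row quadratic form diff @ cov_inv @ diff
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(1)
# Experiment 2 feature distributions for (T, V, N)
X_train = rng.normal([273, 10, 10], [50, 1.0, 1], size=(10_000, 3))  # dataset 1
X_test = rng.normal([300, 9, 11], [50, 1.5, 1], size=(10_000, 3))    # dataset 2

d_train = mahalanobis_distances(X_train, X_train)
cutoff = np.percentile(d_train, 95)  # 95th-percentile extrapolation threshold
flagged = mahalanobis_distances(X_test, X_train) > cutoff
```

Test points with `flagged == True` lie outside the bulk of the training distribution and, per the findings above, tend to show elevated error and uncertainty; the check costs only a mean, a covariance, and one quadratic form per sample, which is what makes it attractive as an online reliability monitor.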
Conclusion
The study presents a controlled synthetic framework to quantify distribution shifts and their impact on model performance and uncertainty. Experiment 1 shows that divergence measures (KL, JS) correlate with accuracy degradation when feature–target relations change across gases. Experiment 2 shows that under covariate shift, per-point Mahalanobis distance effectively flags extrapolation with increased error and complements MC Dropout uncertainty estimates. Together, these tools offer practical mechanisms for robust monitoring and governance of ML systems under shift. Future directions include: extending evaluation to diverse real-world datasets and tasks; comparing additional per-sample OOD/uncertainty methods (e.g., energy scores, deep ensembles, conformal prediction); studying how bounded vs. unbounded divergences affect scaling of accuracy–shift relationships; exploring adaptive thresholds or online monitoring; and investigating model complexity effects on uncertainty–shift correlations.
Limitations
- Synthetic, physics-based data generation (van der Waals and ideal gas) may not capture the complexities, noise, and confounders of real-world domains, potentially limiting external validity.
- A single, relatively simple neural network architecture and one dropout rate (0.1) were evaluated; results may vary with model class, capacity, and hyperparameters.
- Divergence computations and t-SNE visualizations were based on finite samples (≈10,000), leading to non-zero divergences even for in-distribution subsets and potential estimator sensitivity.
- Similarity metrics were applied to marginal/aggregate distributions (e.g., pressure), which may miss higher-order or conditional structure differences.
- MC Dropout used a fixed number of forward passes (N = 100), trading precision for compute; no comparison to deep ensembles or exact Bayesian methods was conducted.
- The Mahalanobis approach assumes approximately Gaussian feature distributions and relies on a well-conditioned covariance estimate; performance may degrade with non-Gaussian features or in high-dimensional settings without appropriate regularization.