Robust Counterfactual Explanations in Machine Learning: A Survey


J. Jiang, F. Leofante, et al.

Counterfactual explanations promise actionable algorithmic recourse, but recent work highlights serious robustness failures. This survey, by Junqi Jiang, Francesco Leofante, Antonio Rago, and Francesca Toni, reviews the fast-growing literature on robust CEs, analyses the different notions of robustness studied, and discusses existing solutions and their limitations.

Introduction
The paper investigates how to ensure counterfactual explanations (CEs) remain valid and trustworthy under changing conditions. CEs are widely used post-hoc explanations that suggest minimal changes to an input to alter a model’s outcome, providing actionable recourse in high-stakes contexts (e.g., finance, healthcare). However, recent studies show popular CE methods can yield explanations that are fragile—sometimes indistinguishable from adversarial examples—raising concerns about their reliability and justifiability. This survey aims to systematically categorise and analyse robustness notions for CEs, evaluate associated metrics and algorithms, and highlight limitations and research opportunities to guide the development of robust CE methods.
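To make the basic CE setup concrete, the following is a minimal sketch, not taken from the paper, of a Wachter-style gradient search for a counterfactual on a toy logistic-regression model: it minimises a prediction loss towards the desired class plus a distance penalty to the original input. The model weights, feature meanings, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def find_counterfactual(x, w, b, target=1, lam=0.1, lr=0.05, steps=500):
    """Gradient-based search for a counterfactual of a logistic-regression
    model p(y=1|x) = sigmoid(w.x + b): minimise cross-entropy towards the
    target class plus a squared-distance penalty to the original input."""
    x_cf = x.copy().astype(float)
    for _ in range(steps):
        p = sigmoid(w @ x_cf + b)
        grad_pred = (p - target) * w          # gradient of the prediction loss
        grad_dist = 2.0 * lam * (x_cf - x)    # gradient of the distance penalty
        x_cf -= lr * (grad_pred + grad_dist)
    return x_cf

# toy usage: an applicant just below the decision boundary (illustrative values)
w = np.array([1.5, -0.8])
b = -0.2
x = np.array([0.1, 0.4])
x_cf = find_counterfactual(x, w, b)
print("original score:", sigmoid(w @ x + b))         # below 0.5 (rejected)
print("counterfactual score:", sigmoid(w @ x_cf + b)) # above 0.5 (accepted)
print("suggested change:", x_cf - x)
```

The robustness question surveyed in the paper is precisely what happens to such a CE when the model, the input, or the user's implementation of the suggested change shifts slightly.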
Literature Review
The survey situates CEs within XAI and recourse literature (e.g., Wachter et al., Tolomei et al., Karimi et al.) and reviews key CE properties (actionability, causality, diversity, plausibility). Prior work has highlighted robustness issues, including connections to adversarial examples and fairness concerns. The paper identifies four distinct robustness notions studied in the literature: (i) robustness against Model Changes (MC) (e.g., retraining, parameter shifts), (ii) robustness against Model Multiplicity (MM) (consistency across multiple near-optimal models), (iii) robustness against Noisy Execution (NE) (small deviations when users implement recourse), and (iv) robustness against Input Changes (IC) (similar inputs should receive similar CEs). For each category, the survey summarises problem formulations, robustness metrics (e.g., VaR, Δ-robustness, Counterfactual Stability, Invalidation Rate), and solution strategies (robust optimisation, verification, probabilistic modelling, training-time interventions). It also connects robustness to related areas such as adversarial robustness and fairness, and points out gaps including limited benchmarks and user studies.
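As an illustration of one of these metrics, the sketch below checks a simplified, linear-model version of a Δ-robustness-style condition: a CE remains valid if the worst-case classifier score over all bounded parameter shifts stays positive. The closed form used here holds only for this linear, box-bounded setting and is my own simplification for illustration, not the general definition used in the surveyed work.

```python
import numpy as np

def delta_robust_linear(x_cf, w, b, delta):
    """Check whether counterfactual x_cf stays on the positive side of a
    linear classifier sign(w.x + b) for every parameter shift with
    |dw_i| <= delta and |db| <= delta.

    Worst case over the box of shifts:
        min (w + dw).x_cf + (b + db) = w.x_cf + b - delta * (||x_cf||_1 + 1)
    """
    worst_case_score = w @ x_cf + b - delta * (np.abs(x_cf).sum() + 1.0)
    return worst_case_score > 0.0

# illustrative model and candidate counterfactual
w, b = np.array([1.5, -0.8]), -0.2
x_cf = np.array([0.9, 0.1])
print(delta_robust_linear(x_cf, w, b, delta=0.05))  # True: valid under small shifts
print(delta_robust_linear(x_cf, w, b, delta=0.6))   # False: large shifts can invalidate it
```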
Methodology
The authors conducted a systematic search on Google Scholar for technical papers from 2017 onward (motivated by early CE works in 2017). Exact-match keyword patterns included combinations of: robust/robustness, consistent/consistency, stable/stability with counterfactual explanation(s), counterfactuals, recourse, algorithmic recourse. They also expanded coverage by examining citations to influential early robust CE works (e.g., Pawelczyk et al., 2020; Upadhyay et al., 2021; Slack et al., 2021). Identified works were categorised by robustness type (MC, MM, NE, IC), model class targeted (e.g., SVM, linear, neural, tree, differentiable, model-agnostic), model access (white-box, gradients, predictions), computational method (e.g., GD, RO, MIP, formal verification, argumentation, GA, SAT, data augmentation), types of guarantees (deterministic, probabilistic, linear-model-only), and additional CE properties (actionability, causality, diversity, plausibility). The survey then reviewed problem definitions, metrics, algorithms, and theoretical results within each robustness category.
Key Findings
- Robustness taxonomy: four main notions emerge: MC (robustness to retraining/parameter changes), MM (robustness across multiple near-optimal models), NE (robustness to user-implementation noise on CEs), and IC (consistency of CEs for similar inputs).
- Metrics: for MC, validity after retraining (VaR) is common; Δ-robustness certifies validity under bounded parameter shifts; Counterfactual Stability (CS) captures stability of class scores around CEs. For NE, the Invalidation Rate (IR) quantifies label changes under noise (see the sketch after this list); verification-based local robustness (ε-balls) captures worst-case perturbations. For IC, local instability measures expected distances between CEs for similar inputs (extended to sets when diversity is considered).
- Algorithms for MC: (i) robust-optimisation min–max formulations against bounded parameter changes (e.g., Upadhyay et al.; MIP-based exact/guaranteed methods for piecewise-linear NNs); (ii) increasing class scores and controlling local Lipschitzness or neighbourhood stability; (iii) probabilistic modelling of distribution/model shifts (e.g., KDE, Gaussian mixture ambiguity); (iv) training for robustness (data augmentation with CEs, joint training of predictors and recourse models, boundary-aware surrogate training). Some methods provide deterministic or probabilistic guarantees.
- Algorithms for MM: in fixed-prediction settings, plausible (on-manifold) CEs tend to be more robust across alternative models but at higher cost; exact product constructions enable guaranteed multi-model CEs for ReLU networks (with NP-completeness results). In pending-prediction settings, argumentative ensembling selects a consistent subset of models and their valid CEs under properties such as non-emptiness, model agreement, and counterfactual validity.
- Algorithms for NE: robust optimisation in input space (worst-case noise on the CE), verification-based local robustness checks via formal methods, novel loss terms approximating or upper-bounding IR (differentiable surrogates, Monte Carlo), and Bayesian hierarchical approaches that output distributions over robust CEs. MIP encodings yield certified robust regions for NNs and trees.
- Algorithms for IC: methods promoting plausibility (on-manifold CEs) reduce local instability; Boolean-satisfiability approaches produce plausible and more stable CEs. Adversarial analyses expose CE instability and propose heuristics (randomised initialisation, fewer features, smaller models). Diversity-based formulations guarantee that sets of CEs for similar inputs contain similar members even when single-instance robustness is impossible.
- Trade-offs: robustness often increases CE cost (moving away from decision boundaries, higher class scores), with better-understood guarantees for linear models and less clarity for non-linear models. Some robust methods can in practice find less costly CEs than non-robust baselines for NNs.
- Interplay among robustness notions and with adversarial robustness/fairness: MC and MM can align for trees; MC and NE may be orthogonal (linear models). Adversarially robust training can improve CE robustness. Robustness relates to fairness (e.g., similar individuals receiving similar CEs, and implications for equalising recourse).
- Field-wide gaps: the lack of standardised benchmarks and comprehensive baselines complicates empirical comparison, and the absence of user studies limits understanding of the impact of robustness on justifiability and user trust.
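To illustrate the NE metric referenced above, here is a minimal sketch, not taken from any surveyed method, that estimates the Invalidation Rate of a candidate CE by Monte Carlo sampling of Gaussian implementation noise; the noise scale, model, and candidate points are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def invalidation_rate(x_cf, predict, sigma=0.05, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the Invalidation Rate (IR): the probability
    that a counterfactual x_cf, once perturbed by small implementation
    noise, no longer receives the desired (positive) prediction."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_samples, x_cf.size))
    labels = predict(x_cf + noise)           # labels of the noisy executions
    return float(np.mean(labels != 1))       # fraction that was invalidated

# toy model: logistic regression, positive class when score > 0.5
w, b = np.array([1.5, -0.8]), -0.2
predict = lambda X: (sigmoid(X @ w + b) > 0.5).astype(int)

x_near = np.array([0.30, 0.20])   # CE close to the decision boundary
x_far = np.array([0.90, 0.10])    # CE further inside the positive region
print("IR near boundary:", invalidation_rate(x_near, predict))   # noticeably > 0
print("IR far from boundary:", invalidation_rate(x_far, predict)) # close to 0
```

The surveyed NE methods typically replace such a sampling estimate with differentiable surrogates or upper bounds on IR so that it can be optimised directly when generating the CE.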
Discussion
By organising robustness of counterfactual explanations into four principled categories and surveying corresponding metrics, algorithms, and guarantees, the paper clarifies how to design CEs that remain valid under model shifts (MC, MM), withstand user implementation noise (NE), and maintain consistency across similar individuals (IC). This taxonomy highlights when robustness can be enforced via optimisation, verification, probabilistic modelling, or training-time strategies, and when guarantees (deterministic or probabilistic) are available. The analysis also surfaces critical trade-offs (e.g., cost vs robustness), shows that linear-model intuitions (class-score sufficiency) do not straightforwardly generalise to deep models, and identifies cross-links to adversarial robustness and fairness. These insights directly address the central question of making CEs trustworthy in realistic settings and inform practitioners on method selection and expected limitations.
Conclusion
The survey provides the first comprehensive, fine-grained synthesis of robust CE methods, categorising robustness into MC, MM, NE, and IC; detailing metrics like VaR, Δ-robustness, CS, and IR; and reviewing algorithmic families (robust optimisation, verification, probabilistic modelling, and training for robustness), along with available guarantees. It identifies key open directions: (i) deeper theoretical and empirical understanding of robustness–cost trade-offs, especially for non-linear models; (ii) exploring unifying frameworks and relationships among robustness notions and with adversarial robustness; (iii) integrating robustness with fairness objectives; (iv) developing standardised benchmarks and libraries for robust CE evaluation; and (v) conducting user studies to assess how robustness affects perceived justification and trust. These avenues can guide more reliable, user-aligned CE methods in high-stakes applications.
Limitations
- Coverage relies on keyword-based Google Scholar searches and citation chaining from selected influential works starting from 2017; some relevant studies may be missed.
- The survey synthesises heterogeneous settings, models, and evaluation protocols; the lack of standardised benchmarks across the field limits direct empirical comparability.
- The work does not include user studies; insights about the impact of robustness on user understanding and trust remain indirect.
- Many reported guarantees are model- or assumption-specific (e.g., linear models, piecewise-linear networks), limiting generalisability to complex real-world systems.