Evaluation of post-hoc interpretability methods in time-series classification

Medicine and Health

H. Turbé, M. Bjelogrlic, et al.

This research, conducted by Hugues Turbé, Mina Bjelogrlic, Christian Lovis, and Gianmarco Mengaldo, presents a framework for evaluating post-hoc interpretability in time-series classification. With new metrics and a synthetic dataset, the study quantifies how well interpretability methods capture the features a model actually uses, aiming to fortify trust in applications like healthcare.
Introduction

The paper addresses the question of which post-hoc interpretability method best reflects the features actually used by a neural network for time-series classification. Time-series data are pervasive across scientific, economic and biomedical domains, and neural networks achieve strong performance but are often treated as black boxes. Numerous post-hoc interpretability approaches exist and can yield markedly different relevance maps for the same model and sample, creating uncertainty for practitioners, especially in high-risk areas like healthcare. The authors focus on post-hoc interpretability (as opposed to model transparency), aiming to quantify how well attribution methods identify and weigh the input time steps that drive predictions. The study emphasizes the importance of rigorous, quantitative, model-agnostic evaluation, independent of human judgment, to support trustworthy deployment under emerging regulatory requirements.

Literature Review

Early evaluations of interpretability relied on heuristic comparisons with human expectations or domain experts, assuming models use the same features humans do, an assumption later called into question. Saliency methods were shown to be sometimes model-independent, undermining their faithfulness. Occlusion-based evaluations, which compare drops in model score when relevant features are corrupted, were introduced next, but corruption can induce distribution shift that confounds the results. ROAR retrains models on datasets with occluded features to maintain distributional similarity, but it then no longer assesses the original model's feature usage and instead reflects dataset properties such as feature redundancy. In time-series settings, prior work adapted image and NLP methods and relied on occlusion-induced score drops; a recent benchmark attempted to preserve the data distribution but relied on static discriminative properties and on the human-judgment-style assumption that all provided discriminative information is used, and used exclusively, limiting realism for temporal dependencies. These limitations motivate a new evaluation framework for time-series classification that avoids human judgment, retraining, and distribution shift while handling temporal and multivariate dependencies.

Methodology

Overview: The authors propose a model-agnostic evaluation framework for time-series classification. It introduces two quantitative metrics for relevance identification, AUCS_top (the area under the top-corruption score-drop curve) and F1S (a modified F1 score combining top- and bottom-corruption performance), together with a qualitative analysis of relevance attribution based on curves of the adjusted score drop versus the time-series information content (TIC). Six attribution methods (DeepLift, GradShap, Integrated Gradients, KernelShap, DeepLiftShap, Shapley Sampling) are evaluated across three neural architectures (Bi-LSTM, CNN, Transformer) on a new synthetic dataset and two real-world datasets (FordA and ECG). Code and data are publicly available.

Tackling distribution shift without retraining: To eliminate distribution shift between training and evaluation when corrupting time steps, models are trained with random perturbations applied per batch: blocks of consecutive time steps are replaced by samples from N(0,1), matching the input normalization. The corrupted fraction γ ~ U(0, 0.8) and block size β ~ U(1, 7) are resampled for each batch, in the spirit of DropBlock and random cropping. At evaluation, time steps are corrupted with the same N(0,1) process, preserving the train–test corruption distribution and avoiding retraining (unlike ROAR).

Attribution methods and baselines: Six additive attribution methods implemented via Captum are studied: DeepLift, GradShap, Integrated Gradients (IG), KernelShap, DeepLiftShap, and Shapley Sampling. For methods requiring a single-sample baseline (IG, DeepLift, Shapley Sampling, KernelShap), the per-time-step mean across training samples is used; for methods requiring a baseline distribution (GradShap, DeepLiftShap), 50 random test samples are used.
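As a concrete illustration of the corruption scheme above, here is a minimal PyTorch sketch; the tensor layout (batch, features, time), the block-count heuristic, and the name corrupt_blocks are assumptions rather than the authors' implementation:

```python
import torch

def corrupt_blocks(x: torch.Tensor, gamma_max: float = 0.8, beta_max: int = 7) -> torch.Tensor:
    """Replace random blocks of consecutive time steps with N(0, 1) noise.

    x is assumed to have shape (batch, features, time); gamma (corrupted
    fraction) and beta (block length) are resampled once per call, i.e.
    once per batch, mirroring the training-time perturbation above.
    """
    b, m, t = x.shape
    gamma = torch.rand(1).item() * gamma_max          # gamma ~ U(0, 0.8)
    beta = int(torch.randint(1, beta_max + 1, (1,)))  # beta ~ U(1, 7)
    x = x.clone()
    for _ in range(int(gamma * t / beta)):            # blocks may overlap
        start = int(torch.randint(0, t - beta + 1, (1,)))
        # draws from N(0, 1) match the normalization of the inputs
        x[:, :, start:start + beta] = torch.randn(b, m, beta)
    return x
```

And a sketch of the two baseline choices just described, using Captum's IntegratedGradients and GradientShap classes (the model signature and variable names are assumptions):

```python
from captum.attr import IntegratedGradients, GradientShap

def relevance_maps(model, x, train_mean, test_samples, target):
    """Illustrative baselines: train_mean is the per-time-step mean over
    training samples (single-sample baseline, as for IG, DeepLift, Shapley
    Sampling, and KernelShap); test_samples holds 50 random test samples
    (baseline distribution, as for GradShap and DeepLiftShap)."""
    r_ig = IntegratedGradients(model).attribute(x, baselines=train_mean, target=target)
    r_gs = GradientShap(model).attribute(x, baselines=test_samples, target=target)
    return r_ig, r_gs
```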

Relevance identification protocol: For a sample X ∈ R^{M×T}, an attribution scheme A yields a relevance map R = (r_{m,t}); only the positive relevance set R+ = {r_{m,t} > 0} is used. Ranking by relevance defines two corruption strategies:
  • Top-k: progressively corrupt the highest-relevance time steps with N(0,1) noise, thresholded at the (1−k) quantile of R+.
  • Bottom-k: progressively corrupt the time steps with the lowest positive relevance, thresholded at the k quantile of R+.

Let N = M×T be the total number of time steps and Ñ the fraction of corrupted time steps relative to N. Using post-softmax outputs S, the normalized change in score is S_A(k) = (S(X) − S(X^c)) / S(X), where X^c denotes the corrupted sample. S_A versus Ñ curves are computed for the top and bottom corruption sequences. Metrics:
  • AUCS_top: area under the top S_A-curve (with an added point at (Ñ = 1, S = S(X)) to normalize across methods with differing numbers of positive-relevance steps), measuring how effectively an attribution method isolates the most important time steps.
  • F1S: a harmonic-mean-style combination of top and bottom performance, F1S = [AUCS_top · (1 − AUCS_bottom)] / [AUCS_top + (1 − AUCS_bottom)] (a computational sketch of these quantities follows the attribution protocol below).

Relevance attribution protocol (qualitative): Evaluate how well the attributed relevance magnitudes reflect relative contributions by comparing the adjusted normalized drop in score δ_A(k) to the time-series information content TIC(k):
  • TIC(k): fraction of total positive relevance contained in the corrupted top-k set.
  • Adjusted drop: δ_A(k) = [S(X) − S(X^c)] / [S(X) − S(X_−)], where X_− is the sample with all positive-relevance steps corrupted. Under linear additivity, δ_A(k) should match TIC(k), giving an information ratio IR = δ_A(k)/TIC(k) ≈ 1; plotting δ_A(k) against TIC(k) next to the unit-slope reference reveals under- or overestimation.
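To make these definitions concrete, here is a minimal NumPy sketch of the metric computations, assuming the corrupted scores for each corruption level have already been collected; the function names are illustrative and not taken from the released code:

```python
import numpy as np

def normalized_drop(s_orig: float, s_corrupt: np.ndarray) -> np.ndarray:
    """S_A(k) = (S(X) - S(X^c)) / S(X) for an array of corruption levels."""
    return (s_orig - s_corrupt) / s_orig

def area_under_s_curve(frac_corrupted: np.ndarray, s_curve: np.ndarray) -> float:
    """Trapezoidal area of an S_A-curve over the corrupted-fraction axis."""
    return float(np.sum((s_curve[1:] + s_curve[:-1]) / 2.0 * np.diff(frac_corrupted)))

def f1_s(auc_top: float, auc_bottom: float) -> float:
    """F1S exactly as stated above, from the top and bottom areas."""
    return (auc_top * (1.0 - auc_bottom)) / (auc_top + (1.0 - auc_bottom))

def adjusted_drop(s_orig: float, s_corrupt: np.ndarray, s_minus: float) -> np.ndarray:
    """delta_A(k) = (S(X) - S(X^c)) / (S(X) - S(X_-)), to plot against TIC(k)."""
    return (s_orig - s_corrupt) / (s_orig - s_minus)
```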
Datasets and preprocessing:
  • Synthetic (new): six features, 500 time steps (Δt = 2 ms). Each feature carries a baseline sine (amplitude 0.5; frequency ~ U(2,5)). Two randomly chosen features additionally contain compact-support (100-step) sine bursts at random positions with frequencies f1, f2 ~ discrete U(10,50); each remaining feature may contain a square wave (p = 0.5; frequency ~ U(10,50)). The binary label is y = 1 if f1 + f2 ≥ τ and 0 otherwise, with τ = 60 chosen to balance the classes. This construction enforces temporal and cross-feature dependencies with known discriminative regions and tunable complexity (see the generator sketch at the end of this section).
  • FordA (UCR/UEA archive): Univariate, binary anomaly classification; predefined train (n=3,601) and test (n=1,320) splits.
  • ECG (PhysioNet/CinC 2020, CPSC subset): 12-lead ECG classified as Right Bundle Branch Block (RBBB) versus non-RBBB (5,020 negatives; 1,857 positives). Preprocessing: signals are denoised via empirical mode decomposition (EMD), removing modes with mean frequency below 0.7 Hz; high-frequency noise is reduced by wavelet-coefficient thresholding (universal threshold). Beats centered on R-peaks (−0.35 s to +0.55 s) are extracted with BioSPPy, and the per-lead average beat is used for training.

Models: Three architectures (Bi-LSTM, CNN, Transformer) are trained with the perturbation scheme described above; hyperparameters and classification performance are reported in the supplementary materials.

Evaluation: For each method–model–dataset combination, the top and bottom S_A–Ñ curves, AUCS_top, and F1S are computed, along with δ_A(k)–TIC(k) curves for attribution calibration. Random relevance serves as a baseline.
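For reference, here is a minimal NumPy sketch of the synthetic-sample construction; the burst and square-wave amplitudes and the exact sampling conventions are assumptions where the description above is silent:

```python
import numpy as np

def make_sample(rng: np.random.Generator):
    """Generate one synthetic sample (6 features x 500 steps) and its label."""
    n_feat, n_steps, dt = 6, 500, 0.002            # 500 steps at dt = 2 ms
    t = np.arange(n_steps) * dt
    x = np.zeros((n_feat, n_steps))
    for m in range(n_feat):                         # baseline sines
        x[m] = 0.5 * np.sin(2 * np.pi * rng.uniform(2, 5) * t)

    # two randomly chosen features carry the discriminative 100-step bursts
    carriers = rng.choice(n_feat, size=2, replace=False)
    f1, f2 = rng.integers(10, 51, size=2)           # f1, f2 ~ discrete U(10, 50)
    for m, f in zip(carriers, (f1, f2)):
        start = rng.integers(0, n_steps - 100 + 1)
        window = t[start:start + 100]
        x[m, start:start + 100] += np.sin(2 * np.pi * f * window)  # amplitude assumed 1

    # the remaining features may carry a distractor square wave (p = 0.5)
    for m in range(n_feat):
        if m not in carriers and rng.random() < 0.5:
            x[m] += 0.5 * np.sign(np.sin(2 * np.pi * rng.uniform(10, 50) * t))

    return x, int(f1 + f2 >= 60)                    # y = 1 iff f1 + f2 >= tau = 60
```

A call such as make_sample(np.random.default_rng(0)) yields one labeled sample; the class balance follows from τ = 60 lying at the midpoint of the possible f1 + f2 range.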
Key Findings

  • Across datasets and architectures, Shapley Sampling generally achieves the highest relevance-identification performance as measured by AUCS_top and F1S, except for the CNN trained on the ECG dataset, where DeepLiftShap performs best.
  • Example metrics from Table 2:
      • Transformer, Synthetic: Shapley AUCS_top = 0.943, F1S = 0.303; IG AUCS_top = 0.929, F1S = 0.301; GradShap AUCS_top = 0.943, F1S = 0.301.
      • Transformer, FordA: Shapley AUCS_top = 0.650, F1S = 0.245 (best).
      • Transformer, ECG: Shapley AUCS_top = 0.619, F1S = 0.228 (best).
      • CNN, ECG: DeepLiftShap AUCS_top = 0.465, F1S = 0.200 (best); Shapley AUCS_top = 0.341, F1S = 0.142.
      • Bi-LSTM, Synthetic: Shapley AUCS_top = 0.554, F1S = 0.210 (best); IG AUCS_top = 0.480, F1S = 0.196 (second best).
  • The distribution-preserving corruption scheme ensures observed score drops are not attributable to distribution shift. Corrupting top-relevance steps yields consistently larger performance degradation than random corruption baselines, validating metric sensitivity.
  • Relevance attribution (δ_A vs TIC) indicates methods are generally not well calibrated to reflect relative contribution magnitudes: curves often deviate from the unit-slope theoretical line, suggesting attributions act primarily as rankings rather than proportional contributions.
  • Synthetic dataset performance trends mirror those on FordA and ECG, supporting its utility as a proxy for real-world multivariate time-series classification with tunable complexity.
  • Clinical case (ECG, RBBB): Shapley highlights a compact, clinically meaningful region and reveals the model relies predominantly on a single lead for RBBB prediction, offering actionable insights and potential detection of biases or spurious correlations.
Discussion

The proposed evaluation framework quantitatively answers which interpretability method most closely matches the model's effective use of time-series inputs. By training with randomized block corruptions and evaluating with a matching corruption distribution, the framework avoids human judgment, retraining, and distribution shift, enabling faithful comparison via AUCS_top and F1S. Results consistently rank Shapley Sampling as the top performer for relevance identification across most scenarios, although method rankings vary with architecture and dataset (e.g., DeepLiftShap excels for the CNN on ECG). The adjusted-drop versus TIC analysis shows that current methods often miscalibrate attribution magnitudes, implying their strength lies in ranking influential time steps rather than quantifying absolute contributions. The synthetic dataset reproduces real-world challenges (temporal and cross-feature dependencies) and, with known discriminative regions, enables controlled evaluation. The application to RBBB classification demonstrates operational value: validated methods can surface compact, clinically relevant regions, expose model reliance on specific leads, and aid in bias detection, insights pertinent to regulated, high-stakes deployments.

Conclusion

This work introduces a comprehensive, model-agnostic framework to evaluate post-hoc interpretability methods for time-series classification, addressing key shortcomings of prior evaluations (dependence on human judgment, retraining confounds, and distribution shift). Two metrics, AUCS_top and F1S, quantify relevance identification, while adjusted-drop versus TIC curves qualitatively assess relevance attribution. A new synthetic dataset with tunable complexity and known discriminative temporal regions complements the real datasets. Empirically, Shapley Sampling generally provides the most faithful relevance identification, with architecture- and dataset-dependent alternatives (e.g., DeepLiftShap for the CNN on ECG; IG competitive on the Bi-LSTM and Transformer). Future research directions include developing efficient approximations to computationally intensive methods such as Shapley Sampling, creating quantitative calibration tests for relevance attribution, extending the framework to broader tasks and modalities, and integrating evaluation outputs into regulatory and clinical workflows for trustworthy AI deployment.

Limitations

  • Computational cost: Shapley Sampling, the best performer in many settings, is the most computationally intensive among tested methods, potentially limiting scalability.
  • Attribution calibration: The evaluation of relevance attribution is qualitative; observed deviations from the theoretical line indicate current methods may not provide proportionally calibrated contributions.
  • Scope: Only six additive attribution methods and three neural architectures were evaluated; results may not generalize to other methods or model families (e.g., non-additive explanations or hybrid architectures).
  • Design choices: The corruption scheme (Gaussian N(0,1), γ and β ranges) and focus on positive relevance may influence outcomes; these were empirically selected and may require adaptation for other data normalizations or domains.
  • Dataset dependence: Rankings vary across datasets and architectures (e.g., CNN on ECG), indicating context sensitivity and limiting universal conclusions.