Evaluating explainability for graph neural networks

C. Agarwal, O. Queen, et al.

As the use of Graph Neural Networks (GNNs) expands in critical applications, evaluating the quality and reliability of their explanations becomes vital. This paper introduces SHAPEGGEN, a versatile synthetic graph data generator that produces benchmark datasets with ground-truth explanations, paving the way for rigorous assessments. The research was conducted by Chirag Agarwal, Owen Queen, Himabindu Lakkaraju, and Marinka Zitnik.

Introduction
The study addresses the problem of reliably evaluating explanations produced by graph neural networks (GNNs), which are increasingly applied in high-stakes domains such as criminal justice, molecular chemistry, and biology. Despite numerous GNN explanation methods, the field lacks standardized, trustworthy evaluation strategies and datasets with dependable ground-truth explanations. Existing benchmarks can suffer from pitfalls including non-unique (redundant) rationales, weak predictors that rely on spurious reasoning, and trivial ground truths that can be recovered by simple baselines. These issues hinder accurate assessment of post hoc explainers, especially when model rationales differ from the assumed ground truth. The authors aim to develop general-purpose data resources and a benchmarking framework that produce reliable, diverse, and application-relevant evaluations of GNN explainability. They introduce SHAPEGGEN, an XAI-ready synthetic graph generator that produces graphs with ground-truth explanations, and integrate it into GRAPHXAI, a comprehensive library that also includes real-world datasets, GNN models, visualization, and a suite of metrics for accuracy, faithfulness, stability, and fairness. The overarching research question is how to construct datasets and evaluation tools that enable robust, fair, and faithful benchmarking of GNN explainers across diverse graph properties and tasks.
Literature Review
Prior work has proposed various GNN explanation methods, including gradient-based (e.g., Grad, GradCAM, Guided Backprop, Integrated Gradients), perturbation-based (e.g., GNNExplainer, PGExplainer, SubgraphX), and surrogate-based (e.g., PGMExplainer). However, evaluation typically relies on specific real-world or synthetic datasets whose ground-truth explanations may be limited or unreliable, as highlighted by Faber et al. Limitations include multiple valid rationales for labels, models that learn different rationales than the encoded ground truth, and trivial explanations recoverable by simple baselines. Existing graph ML benchmarks and libraries (e.g., OGB, GNNMark, GraphGT, MalNet, GRB, TDC; DIG, PyG, DGL) primarily target predictive performance rather than explanation quality, often lacking ground-truth explanations. This motivates a broader ecosystem focused on evaluating GNN explainers with reliable ground truth and comprehensive metrics.
Methodology
Framework and datasets: The authors present GRAPHXAI, a benchmarking framework integrating SHAPEGGEN-generated synthetic datasets and several real-world datasets. GRAPHXAI implements an Explanation class, base explainer interfaces, GNN predictors (GIN and GCN), visualization utilities, and metrics for explanation accuracy, faithfulness, stability, and fairness.

SHAPEGGEN generator: SHAPEGGEN produces synthetic graphs with controllable properties and ground-truth explanations. A graph G = (V, E, X) is assembled from Ns subgraphs that exhibit preferential attachment. Each subgraph starts from a user-specified motif S (e.g., triangle, house) and is expanded by a Poisson-distributed number of added nodes attached with degree-proportional (preferential) attachment. Subgraphs are then connected with probability p, subject to constraints ensuring that each node's neighborhood contains between 1 and K motifs; this naturally defines a classification task based on motif counts in the L-hop neighborhood. Labels are assigned as the number of motifs in a node's 1-hop neighborhood minus one. (A simplified code sketch of this procedure follows below.)

Node features: A latent variable model (inspired by MADELON/make_classification) creates n total features, of which ni are informative features correlated with the labels and the rest are redundant or noise. Parameters include the class separation s (signal strength), the number of clusters per class c, and an optional protected binary feature whose correlation with the labels is controlled by a flip probability φ. Node features are further optimized toward a desired level of homophily or heterophily via an objective, governed by a homophily coefficient η, that encourages or discourages feature similarity between connected nodes with the same or different labels.
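To make the generation procedure above concrete, here is a minimal sketch in Python. It is an illustration only, not the GRAPHXAI implementation: it assumes triangle motifs, uses networkx and numpy, and the names (make_subgraph, shapeggen_sketch, lam, p_connect) are invented for this example. The real generator supports arbitrary motifs such as the house motif, enforces that every neighborhood contains between 1 and K motifs rather than clamping labels, and additionally attaches node features and ground-truth masks.

```python
# Minimal sketch of a SHAPEGGEN-style generator (assumption: triangle motifs only).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def make_subgraph(motif_size=3, lam=4):
    """One subgraph: a motif seed (triangle) expanded by a Poisson-distributed
    number of nodes attached with degree-proportional (preferential) attachment."""
    g = nx.complete_graph(motif_size)               # triangle motif as the seed
    motif_nodes = set(g.nodes)
    for _ in range(rng.poisson(lam)):
        degrees = np.array([g.degree(n) for n in g.nodes], dtype=float)
        target = rng.choice(list(g.nodes), p=degrees / degrees.sum())
        g.add_edge(max(g.nodes) + 1, target)        # new node joins by preferential attachment
    return g, motif_nodes

def shapeggen_sketch(num_subgraphs=20, p_connect=0.05):
    graph, motif_of = nx.Graph(), {}                # motif_of: node -> motif index
    anchors = []
    for i in range(num_subgraphs):
        sub, motif_nodes = make_subgraph()
        offset = graph.number_of_nodes()
        sub = nx.relabel_nodes(sub, {n: n + offset for n in sub.nodes})
        graph.update(sub)
        motif_of.update({n + offset: i for n in motif_nodes})
        anchors.append(offset)                      # node `offset` belongs to motif i
    # Connect subgraphs with probability p (the real generator also enforces
    # that each neighborhood contains between 1 and K motifs).
    for i in range(num_subgraphs):
        for j in range(i + 1, num_subgraphs):
            if rng.random() < p_connect:
                graph.add_edge(anchors[i], anchors[j])
    # Label = number of distinct motifs in the 1-hop neighborhood, minus one
    # (clamped at 0 here, since this sketch skips the neighborhood constraints).
    labels = {}
    for v in graph.nodes:
        hood = set(graph.neighbors(v)) | {v}
        labels[v] = max(len({motif_of[u] for u in hood if u in motif_of}) - 1, 0)
    return graph, labels

graph, labels = shapeggen_sketch()
print(graph.number_of_nodes(), "nodes; label counts:", np.bincount(list(labels.values())))
```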
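The node-feature model described above can be sketched in a similar spirit. The snippet below is a simplified stand-in, not SHAPEGGEN's code: it uses scikit-learn's make_classification (which the paper cites as inspiration) to draw label-correlated features and then appends a protected attribute flipped with probability φ; the function name make_node_features and the exact parameter mapping are assumptions, and the homophily optimization via η is omitted.

```python
# Sketch of a make_classification-style feature model with a protected attribute.
import numpy as np
from sklearn.datasets import make_classification

def make_node_features(node_labels, n_features=10, n_informative=4,
                       class_sep=1.0, clusters_per_class=2, flip_prob=0.1, seed=0):
    """Features whose informative dimensions correlate with the node labels,
    plus a protected binary attribute flipped with probability `flip_prob` (phi)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(node_labels)
    n_classes = int(y.max()) + 1
    # Draw a pool of class-conditional samples, then pick one per node according to
    # its label, so the informative dimensions carry class signal and the rest are noise.
    X_pool, y_pool = make_classification(
        n_samples=4 * len(y), n_features=n_features,
        n_informative=n_informative, n_redundant=n_features - n_informative,
        n_classes=n_classes, n_clusters_per_class=clusters_per_class,
        class_sep=class_sep, flip_y=0.0, random_state=seed)
    X = np.stack([X_pool[rng.choice(np.flatnonzero(y_pool == label))] for label in y])
    # Protected attribute: a binarized copy of the label, flipped with probability phi.
    protected = (y > 0).astype(float)
    flips = rng.random(len(y)) < flip_prob
    protected[flips] = 1.0 - protected[flips]
    return np.column_stack([X, protected])

features = make_node_features(np.random.default_rng(1).integers(0, 3, size=50))
print(features.shape)   # (50, 11): 10 generated features + 1 protected attribute
```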
Ground-truth explanations: For each target node's L-hop enclosing subgraph, SHAPEGGEN provides (a) node masks marking motif nodes as important, (b) node feature masks marking the informative features, and (c) edge masks marking edges that connect motif nodes (and the central node) as important. These masks serve as ground truth for evaluating node-, edge-, and feature-based explainers.

Explainers and predictors: Eight explainers are benchmarked: gradient-based (Grad, GradCAM, GuidedBP, Integrated Gradients), perturbation-based (GNNExplainer, PGExplainer, SubgraphX), and surrogate-based (PGMExplainer). A 3-layer GIN and a GCN serve as predictors. Hyperparameters include a hidden size of 16, ReLU activations, Adam optimizers (typical learning rates of 1e-2 for GIN and 3e-2 for GCN), and training for up to 1000–1500 epochs. Explanations select the top-k (k = 25%) nodes, edges, or features.

Evaluation metrics: GRAPHXAI implements the following metrics (GEA and GEF are sketched in code at the end of this section):
- Graph Explanation Accuracy (GEA): the Jaccard index between the predicted and ground-truth masks, accounting for multiple valid ground truths by taking the maximum over the ground-truth set.
- Graph Explanation Faithfulness (GEF): unfaithfulness measured as 1 − exp(−KL(f(S) || f(Ŝ))), where Ŝ is the masked subgraph retaining the top-k elements of the explanation; lower is better (more faithful).
- Graph Explanation Stability (GES): instability measured as the maximum cosine distance between explanations for a graph and for its small perturbations within a ball that preserves model behavior.
- Counterfactual Fairness Mismatch (GECF): the distance between explanations on the original and counterfactual graphs (protected attribute flipped); lower indicates better preservation of counterfactual fairness where appropriate.
- Group Fairness Mismatch (GEGF): the difference in statistical parity between predictions using all features and predictions using only the essential features identified by the explanation; lower indicates better preservation of group fairness.

Datasets: SHAPEGGEN variants include SG-BASE (homophilic house motifs), SG-HETEROPHILIC (heterophilic ground truths), SG-SMALLEX (triangle motifs, i.e., smaller explanations), SG-UNFAIR (strong unfairness induced via the protected feature), and feature-information variants (SG-MOREINFORM and SG-LESSINFORM, compared against SG-BASE). Real-world datasets with ground-truth explanations include MUTAG, Benzene, Fluoride Carbonyl, and Alkane Carbonyl; additional node classification graphs (German Credit, Recidivism, Credit Defaulter) lack ground-truth explanations. Standard splits (e.g., 70/5/25 for SHAPEGGEN) are used, with performance averaged over test sets.
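As referenced above, the following is a minimal numpy sketch of the GEA and GEF definitions. It is not the GRAPHXAI API: the function names (explanation_accuracy, explanation_unfaithfulness, mask_subgraph) and the predict_proba callable are placeholders, and how the masked subgraph Ŝ is built in practice depends on the model and explainer; GES, GECF, and GEGF follow analogous recipes based on cosine distance and statistical parity.

```python
# Hedged sketch of the GEA and GEF definitions stated above.
import numpy as np

def explanation_accuracy(pred_mask, ground_truth_masks):
    """GEA: Jaccard index between binary masks, taking the maximum over all
    valid ground-truth masks to allow for non-unique explanations."""
    pred = np.asarray(pred_mask, dtype=bool)
    scores = []
    for gt in ground_truth_masks:
        gt = np.asarray(gt, dtype=bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        scores.append(1.0 if union == 0 else inter / union)
    return max(scores)

def explanation_unfaithfulness(predict_proba, subgraph, explanation_scores, top_k=0.25):
    """GEF: 1 - exp(-KL(f(S) || f(S_hat))), where S_hat keeps only the top-k
    fraction of nodes ranked by the explanation. Lower is more faithful."""
    scores = np.asarray(explanation_scores, dtype=float)
    k = max(1, int(np.ceil(top_k * len(scores))))
    keep = np.argsort(-scores)[:k]                      # indices of the top-k nodes
    p = predict_proba(subgraph)                         # f(S): class probabilities
    q = predict_proba(mask_subgraph(subgraph, keep))    # f(S_hat): masked subgraph
    kl = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))  # KL divergence
    return 1.0 - np.exp(-kl)

def mask_subgraph(subgraph, keep):
    """Toy masking: zero out the features of nodes outside `keep`
    (the construction of S_hat in practice is model- and explainer-dependent)."""
    masked = subgraph.copy()
    drop = np.setdiff1d(np.arange(masked.shape[0]), keep)
    masked[drop] = 0.0
    return masked

# Tiny usage example with a stand-in "model" over a node-feature matrix.
if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(10, 4))    # 10 nodes, 4 features
    def predict_proba(features):                          # softmax over summed features
        logits = np.array([features.sum(), -features.sum()])
        e = np.exp(logits - logits.max())
        return e / e.sum()
    gea = explanation_accuracy([1, 1, 0, 0], [[1, 0, 0, 0], [1, 1, 0, 1]])
    gef = explanation_unfaithfulness(predict_proba, X, np.random.rand(10))
    print(f"GEA={gea:.2f}  GEF={gef:.2f}")
```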
Key Findings
- Benchmarking across SHAPEGGEN and real-world datasets shows that no single explainer dominates universally. On SHAPEGGEN node classification datasets, SubgraphX outperforms the others on average, providing 145.95% higher explanation accuracy and 64.80% lower unfaithfulness than the other methods. Gradient-based methods (Grad, GradCAM, GuidedBP) are often next best: Grad yields the second-lowest unfaithfulness and GradCAM the second-highest accuracy. PGExplainer produces the most stable explanations (35.35% less instability than the average of the other methods).
- Node explanation masks tend to be more reliable than edge or node-feature masks, and explainers achieve better faithfulness on synthetic than on real-world graphs.
- Homophily vs. heterophily: explanations are 55.98% more faithful (lower GEF) on homophilic ground truths than on heterophilic ones, indicating that current explainers struggle on heterophilic and attributed graphs.
- Explanation size: across the eight explainers, larger ground-truth explanations substantially reduce faithfulness; the average GEF for large explanations is 0.7476. Explanations for small (triangle) motifs are 59.98% less unfaithful on average than those for large (house) motifs. The Grad explainer achieves 9.33% lower unfaithfulness on large explanations than the other methods.
- Fairness: on SG-UNFAIR, many explainers do not preserve counterfactual fairness. For weakly unfair ground truths, explainers often show high mismatch (undesirable), while for strongly unfair settings many fail to reflect the induced unfairness. GradCAM and PGExplainer better preserve counterfactual fairness under weakly unfair settings; PGMExplainer performs best under strongly unfair settings.
- Node feature information: unfaithfulness increases as the proportion of informative features decreases. Gradient-based feature explanations are the most faithful among the tested methods for node features, although the overall gains are small.
- Real-world molecular datasets: Integrated Gradients achieves the lowest unfaithfulness across MUTAG, Benzene, and Fluoride Carbonyl; stability and fairness metrics are not applied because plausible perturbations are difficult to generate and protected attributes are absent.
Discussion
The work demonstrates that reliable evaluation of GNN explanations requires datasets with trustworthy, non-trivial ground truths and metrics that consider model behavior, robustness, and fairness. SHAPEGGEN enables generating diverse synthetic graphs that avoid common pitfalls in existing benchmarks and provide comprehensive ground-truth masks for nodes, edges, and features. Empirical results reveal systematic weaknesses of current explainers: reduced faithfulness on heterophilic graphs and on tasks with larger explanation subgraphs, and limited ability to preserve fairness. These findings directly address the research question by highlighting conditions under which explainers succeed or fail and by providing a standardized, extensible framework (GRAPHXAI) to evaluate new methods. The significance lies in enabling principled development and comparison of GNN explainers across accuracy, faithfulness, stability, and fairness dimensions, thereby improving the reliability of explainability in high-stakes applications.
Conclusion
This paper introduces SHAPEGGEN, a flexible synthetic graph dataset generator that provides robust ground-truth explanations, and GRAPHXAI, a general-purpose benchmarking framework for GNN explainability. Together, they fill a critical gap by enabling reliable, diverse, and fairness-aware evaluation of GNN explanations. Experiments show that current explainers struggle on heterophilic graphs, larger explanation subgraphs, and fairness preservation, and are sensitive to the informativeness of node features. Future work includes expanding GRAPHXAI with additional datasets (including non–scale-free generators), explanation methods, evaluation metrics (including for self-explaining GNNs), and visualization tools, fostering reproducible and comprehensive assessment of GNN explainability methods.
Limitations
- Stability and fairness metrics are not evaluated on the molecular datasets, owing to the difficulty of generating plausible molecular perturbations and the absence of protected attributes.
- Experiments primarily use two predictor architectures (GIN, GCN) and top-k selection at 25%; broader model classes and selection strategies may affect results.
- Although SHAPEGGEN aims to mimic real-world properties, synthetic datasets may not capture all the complexities of domain-specific graphs.
- The current GRAPHXAI release emphasizes molecular chemistry among its real-world datasets; coverage of other domains and additional random-graph models is planned.
- Ground-truth fairness is simulated via a protected-feature proxy; real-world protected attributes and biases can be more nuanced.