Evaluating explainability for graph neural networks

C. Agarwal, O. Queen, et al.

As the use of Graph Neural Networks (GNNs) expands in critical applications, evaluating the quality and reliability of their explanations becomes vital. This paper introduces SHAPEGGEN, a versatile synthetic graph data generator that produces benchmark datasets with ground-truth explanations, paving the way for rigorous assessments. The research was conducted by Chirag Agarwal, Owen Queen, Himabindu Lakkaraju, and Marinka Zitnik.

Introduction
The study addresses the problem of reliably evaluating explanations produced by graph neural networks (GNNs), which are increasingly applied in high-stakes domains such as criminal justice, molecular chemistry, and biology. Despite numerous GNN explanation methods, the field lacks standardized, trustworthy evaluation strategies and datasets with dependable ground-truth explanations. Existing benchmarks can suffer from pitfalls including non-unique (redundant) rationales, weak predictors that rely on spurious reasoning, and trivial ground truths that can be recovered by simple baselines. These issues hinder accurate assessment of post hoc explainers, especially when model rationales differ from the assumed ground truth. The authors aim to develop general-purpose data resources and a benchmarking framework that produce reliable, diverse, and application-relevant evaluations of GNN explainability. They introduce SHAPEGGEN, an XAI-ready synthetic graph generator that produces graphs with ground-truth explanations, and integrate it into GRAPHXAI, a comprehensive library that also includes real-world datasets, GNN models, visualization, and a suite of metrics for accuracy, faithfulness, stability, and fairness. The overarching research question is how to construct datasets and evaluation tools that enable robust, fair, and faithful benchmarking of GNN explainers across diverse graph properties and tasks.
Literature Review
Prior work has proposed various GNN explanation methods, including gradient-based (e.g., Grad, GradCAM, Guided Backprop, Integrated Gradients), perturbation-based (e.g., GNNExplainer, PGExplainer, SubgraphX), and surrogate-based (e.g., PGMExplainer). However, evaluation typically relies on specific real-world or synthetic datasets whose ground-truth explanations may be limited or unreliable, as highlighted by Faber et al. Limitations include multiple valid rationales for labels, models that learn different rationales than the encoded ground truth, and trivial explanations recoverable by simple baselines. Existing graph ML benchmarks and libraries (e.g., OGB, GNNMark, GraphGT, MalNet, GRB, TDC; DIG, PyG, DGL) primarily target predictive performance rather than explanation quality, often lacking ground-truth explanations. This motivates a broader ecosystem focused on evaluating GNN explainers with reliable ground truth and comprehensive metrics.
Methodology
Framework and datasets: The authors present GRAPHXAI, a benchmarking framework integrating SHAPEGGEN-generated synthetic datasets and several real-world datasets. GRAPHXAI implements an Explanation class, base explainer interfaces, GNN predictors (GIN and GCN), visualization utilities, and metrics for explanation accuracy, faithfulness, stability, and fairness.

SHAPEGGEN generator: SHAPEGGEN produces synthetic graphs with controllable properties and ground-truth explanations. A graph G = (V, E, X) is assembled from Ns subgraphs that exhibit preferential attachment. Each subgraph starts from a user-specified motif S (e.g., triangle, house) and is expanded by a Poisson-distributed number of added nodes attached with degree-proportional (preferential) attachment. Subgraphs are then connected with probability p, subject to constraints ensuring that each node's neighborhood contains between 1 and K motifs; this naturally defines a classification task based on motif counts in the L-hop neighborhood. Labels are assigned as the number of motifs in a node's 1-hop neighborhood minus one. (A simplified code sketch of this procedure follows below.)

Node features: A latent variable model (inspired by MADELON/make_classification) creates n total features, of which ni are informative features correlated with the labels and the rest are redundant or noise. Parameters include the class separation s (signal strength), the number of clusters per class c, and an optional protected binary feature whose correlation with the labels is controlled by a flip probability φ. Node features are further optimized toward a desired level of homophily or heterophily via an objective, governed by a homophily coefficient η, that encourages or discourages feature similarity between connected nodes with the same or different labels.
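To make the generation procedure above concrete, here is a minimal sketch in Python. It is an illustration only, not the GRAPHXAI implementation: it assumes triangle motifs, uses networkx and numpy, and the names (make_subgraph, shapeggen_sketch, lam, p_connect) are invented for this example. The real generator supports arbitrary motifs such as the house motif, enforces that every neighborhood contains between 1 and K motifs rather than clamping labels, and additionally attaches node features and ground-truth masks.

```python
# Minimal sketch of a SHAPEGGEN-style generator (assumption: triangle motifs only).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def make_subgraph(motif_size=3, lam=4):
    """One subgraph: a motif seed (triangle) expanded by a Poisson-distributed
    number of nodes attached with degree-proportional (preferential) attachment."""
    g = nx.complete_graph(motif_size)               # triangle motif as the seed
    motif_nodes = set(g.nodes)
    for _ in range(rng.poisson(lam)):
        degrees = np.array([g.degree(n) for n in g.nodes], dtype=float)
        target = rng.choice(list(g.nodes), p=degrees / degrees.sum())
        g.add_edge(max(g.nodes) + 1, target)        # new node joins by preferential attachment
    return g, motif_nodes

def shapeggen_sketch(num_subgraphs=20, p_connect=0.05):
    graph, motif_of = nx.Graph(), {}                # motif_of: node -> motif index
    anchors = []
    for i in range(num_subgraphs):
        sub, motif_nodes = make_subgraph()
        offset = graph.number_of_nodes()
        sub = nx.relabel_nodes(sub, {n: n + offset for n in sub.nodes})
        graph.update(sub)
        motif_of.update({n + offset: i for n in motif_nodes})
        anchors.append(offset)                      # node `offset` belongs to motif i
    # Connect subgraphs with probability p (the real generator also enforces
    # that each neighborhood contains between 1 and K motifs).
    for i in range(num_subgraphs):
        for j in range(i + 1, num_subgraphs):
            if rng.random() < p_connect:
                graph.add_edge(anchors[i], anchors[j])
    # Label = number of distinct motifs in the 1-hop neighborhood, minus one
    # (clamped at 0 here, since this sketch skips the neighborhood constraints).
    labels = {}
    for v in graph.nodes:
        hood = set(graph.neighbors(v)) | {v}
        labels[v] = max(len({motif_of[u] for u in hood if u in motif_of}) - 1, 0)
    return graph, labels

graph, labels = shapeggen_sketch()
print(graph.number_of_nodes(), "nodes; label counts:", np.bincount(list(labels.values())))
```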
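The node-feature model described above can be sketched in a similar spirit. The snippet below is a simplified stand-in, not SHAPEGGEN's code: it uses scikit-learn's make_classification (which the paper cites as inspiration) to draw label-correlated features and then appends a protected attribute flipped with probability φ; the function name make_node_features and the exact parameter mapping are assumptions, and the homophily optimization via η is omitted.

```python
# Sketch of a make_classification-style feature model with a protected attribute.
import numpy as np
from sklearn.datasets import make_classification

def make_node_features(node_labels, n_features=10, n_informative=4,
                       class_sep=1.0, clusters_per_class=2, flip_prob=0.1, seed=0):
    """Features whose informative dimensions correlate with the node labels,
    plus a protected binary attribute flipped with probability `flip_prob` (phi)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(node_labels)
    n_classes = int(y.max()) + 1
    # Draw a pool of class-conditional samples, then pick one per node according to
    # its label, so the informative dimensions carry class signal and the rest are noise.
    X_pool, y_pool = make_classification(
        n_samples=4 * len(y), n_features=n_features,
        n_informative=n_informative, n_redundant=n_features - n_informative,
        n_classes=n_classes, n_clusters_per_class=clusters_per_class,
        class_sep=class_sep, flip_y=0.0, random_state=seed)
    X = np.stack([X_pool[rng.choice(np.flatnonzero(y_pool == label))] for label in y])
    # Protected attribute: a binarized copy of the label, flipped with probability phi.
    protected = (y > 0).astype(float)
    flips = rng.random(len(y)) < flip_prob
    protected[flips] = 1.0 - protected[flips]
    return np.column_stack([X, protected])

features = make_node_features(np.random.default_rng(1).integers(0, 3, size=50))
print(features.shape)   # (50, 11): 10 generated features + 1 protected attribute
```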
Ground-truth explanations: For each target node's L-hop enclosing subgraph, SHAPEGGEN provides (a) node masks marking motif nodes as important, (b) node feature masks marking the informative features, and (c) edge masks marking edges that connect motif nodes (and the central node) as important. These masks serve as ground truth for evaluating node-, edge-, and feature-based explainers.

Explainers and predictors: Eight explainers are benchmarked: gradient-based (Grad, GradCAM, GuidedBP, Integrated Gradients), perturbation-based (GNNExplainer, PGExplainer, SubgraphX), and surrogate-based (PGMExplainer). A 3-layer GIN and a GCN serve as predictors. Hyperparameters include a hidden size of 16, ReLU activations, Adam optimizers (typical learning rates of 1e-2 for GIN and 3e-2 for GCN), and training for up to 1000–1500 epochs. Explanations select the top-k (k = 25%) nodes, edges, or features.

Evaluation metrics: GRAPHXAI implements the following metrics (GEA and GEF are sketched in code at the end of this section):
- Graph Explanation Accuracy (GEA): the Jaccard index between the predicted and ground-truth masks, accounting for multiple valid ground truths by taking the maximum over the ground-truth set.
- Graph Explanation Faithfulness (GEF): unfaithfulness measured as 1 − exp(−KL(f(S) || f(Ŝ))), where Ŝ is the masked subgraph retaining the top-k elements of the explanation; lower is better (more faithful).
- Graph Explanation Stability (GES): instability measured as the maximum cosine distance between explanations for a graph and for its small perturbations within a ball that preserves model behavior.
- Counterfactual Fairness Mismatch (GECF): the distance between explanations on the original and counterfactual graphs (protected attribute flipped); lower indicates better preservation of counterfactual fairness where appropriate.
- Group Fairness Mismatch (GEGF): the difference in statistical parity between predictions using all features and predictions using only the essential features identified by the explanation; lower indicates better preservation of group fairness.

Datasets: SHAPEGGEN variants include SG-BASE (homophilic house motifs), SG-HETEROPHILIC (heterophilic ground truths), SG-SMALLEX (triangle motifs, i.e., smaller explanations), SG-UNFAIR (strong unfairness induced via the protected feature), and feature-information variants (SG-MOREINFORM and SG-LESSINFORM, compared against SG-BASE). Real-world datasets with ground-truth explanations include MUTAG, Benzene, Fluoride Carbonyl, and Alkane Carbonyl; additional node classification graphs (German Credit, Recidivism, Credit Defaulter) lack ground-truth explanations. Standard splits (e.g., 70/5/25 for SHAPEGGEN) are used, with performance averaged over test sets.
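As referenced above, the following is a minimal numpy sketch of the GEA and GEF definitions. It is not the GRAPHXAI API: the function names (explanation_accuracy, explanation_unfaithfulness, mask_subgraph) and the predict_proba callable are placeholders, and how the masked subgraph Ŝ is built in practice depends on the model and explainer; GES, GECF, and GEGF follow analogous recipes based on cosine distance and statistical parity.

```python
# Hedged sketch of the GEA and GEF definitions stated above.
import numpy as np

def explanation_accuracy(pred_mask, ground_truth_masks):
    """GEA: Jaccard index between binary masks, taking the maximum over all
    valid ground-truth masks to allow for non-unique explanations."""
    pred = np.asarray(pred_mask, dtype=bool)
    scores = []
    for gt in ground_truth_masks:
        gt = np.asarray(gt, dtype=bool)
        union = np.logical_or(pred, gt).sum()
        inter = np.logical_and(pred, gt).sum()
        scores.append(1.0 if union == 0 else inter / union)
    return max(scores)

def explanation_unfaithfulness(predict_proba, subgraph, explanation_scores, top_k=0.25):
    """GEF: 1 - exp(-KL(f(S) || f(S_hat))), where S_hat keeps only the top-k
    fraction of nodes ranked by the explanation. Lower is more faithful."""
    scores = np.asarray(explanation_scores, dtype=float)
    k = max(1, int(np.ceil(top_k * len(scores))))
    keep = np.argsort(-scores)[:k]                      # indices of the top-k nodes
    p = predict_proba(subgraph)                         # f(S): class probabilities
    q = predict_proba(mask_subgraph(subgraph, keep))    # f(S_hat): masked subgraph
    kl = np.sum(p * np.log((p + 1e-12) / (q + 1e-12)))  # KL divergence
    return 1.0 - np.exp(-kl)

def mask_subgraph(subgraph, keep):
    """Toy masking: zero out the features of nodes outside `keep`
    (the construction of S_hat in practice is model- and explainer-dependent)."""
    masked = subgraph.copy()
    drop = np.setdiff1d(np.arange(masked.shape[0]), keep)
    masked[drop] = 0.0
    return masked

# Tiny usage example with a stand-in "model" over a node-feature matrix.
if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(10, 4))    # 10 nodes, 4 features
    def predict_proba(features):                          # softmax over summed features
        logits = np.array([features.sum(), -features.sum()])
        e = np.exp(logits - logits.max())
        return e / e.sum()
    gea = explanation_accuracy([1, 1, 0, 0], [[1, 0, 0, 0], [1, 1, 0, 1]])
    gef = explanation_unfaithfulness(predict_proba, X, np.random.rand(10))
    print(f"GEA={gea:.2f}  GEF={gef:.2f}")
```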
Key Findings
- Benchmarking across SHAPEGGEN and real-world datasets shows that no single explainer dominates universally. On SHAPEGGEN node classification datasets, SubgraphX outperforms the others on average, providing 145.95% higher explanation accuracy and 64.80% lower unfaithfulness than the other methods. Gradient-based methods (Grad, GradCAM, GuidedBP) are often next best: Grad yields the second-lowest unfaithfulness and GradCAM the second-highest accuracy. PGExplainer produces the most stable explanations (35.35% less instability than the average of the other methods).
- Node explanation masks tend to be more reliable than edge or node-feature masks, and explainers achieve better faithfulness on synthetic than on real-world graphs.
- Homophily vs. heterophily: explanations are 55.98% more faithful (lower GEF) on homophilic ground truths than on heterophilic ones, indicating that current explainers struggle on heterophilic and attributed graphs.
- Explanation size: across the eight explainers, larger ground-truth explanations substantially reduce faithfulness; the average GEF for large explanations is 0.7476. Explanations for small (triangle) motifs are 59.98% less unfaithful on average than those for large (house) motifs. The Grad explainer achieves 9.33% lower unfaithfulness on large explanations than the other methods.
- Fairness: on SG-UNFAIR, many explainers do not preserve counterfactual fairness. For weakly unfair ground truths, explainers often show high mismatch (undesirable), while for strongly unfair settings many fail to reflect the induced unfairness. GradCAM and PGExplainer better preserve counterfactual fairness under weakly unfair settings; PGMExplainer performs best under strongly unfair settings.
- Node feature information: unfaithfulness increases as the proportion of informative features decreases. Gradient-based feature explanations are the most faithful among the tested methods for node features, although the overall gains are small.
- Real-world molecular datasets: Integrated Gradients achieves the lowest unfaithfulness across MUTAG, Benzene, and Fluoride Carbonyl; stability and fairness metrics are not applied because plausible perturbations are difficult to generate and protected attributes are absent.
Discussion
The work demonstrates that reliable evaluation of GNN explanations requires datasets with trustworthy, non-trivial ground truths and metrics that consider model behavior, robustness, and fairness. SHAPEGGEN enables generating diverse synthetic graphs that avoid common pitfalls in existing benchmarks and provide comprehensive ground-truth masks for nodes, edges, and features. Empirical results reveal systematic weaknesses of current explainers: reduced faithfulness on heterophilic graphs and on tasks with larger explanation subgraphs, and limited ability to preserve fairness. These findings directly address the research question by highlighting conditions under which explainers succeed or fail and by providing a standardized, extensible framework (GRAPHXAI) to evaluate new methods. The significance lies in enabling principled development and comparison of GNN explainers across accuracy, faithfulness, stability, and fairness dimensions, thereby improving the reliability of explainability in high-stakes applications.
Conclusion
This paper introduces SHAPEGGEN, a flexible synthetic graph dataset generator that provides robust ground-truth explanations, and GRAPHXAI, a general-purpose benchmarking framework for GNN explainability. Together, they fill a critical gap by enabling reliable, diverse, and fairness-aware evaluation of GNN explanations. Experiments show that current explainers struggle on heterophilic graphs, larger explanation subgraphs, and fairness preservation, and are sensitive to the informativeness of node features. Future work includes expanding GRAPHXAI with additional datasets (including non–scale-free generators), explanation methods, evaluation metrics (including for self-explaining GNNs), and visualization tools, fostering reproducible and comprehensive assessment of GNN explainability methods.
Limitations
- Stability and fairness metrics are not evaluated on the molecular datasets, owing to the difficulty of generating plausible molecular perturbations and the absence of protected attributes.
- Experiments primarily use two predictor architectures (GIN, GCN) and top-k selection at 25%; broader model classes and selection strategies may affect results.
- Although SHAPEGGEN aims to mimic real-world properties, synthetic datasets may not capture all the complexities of domain-specific graphs.
- The current GRAPHXAI release emphasizes molecular chemistry among its real-world datasets; coverage of other domains and additional random-graph models is planned.
- Ground-truth fairness is simulated via a protected-feature proxy; real-world protected attributes and biases can be more nuanced.