Hallucination Detection in LLMs Using Spectral Features of Attention Maps


J. Binkowski, D. Janiak, et al.

Discover LapEigvals, a new spectral approach that treats attention maps as graph adjacency matrices and uses the top-k Laplacian eigenvalues to detect LLM hallucinations. Experiments show state-of-the-art performance among attention-based methods, with strong robustness and generalization.
Introduction

The paper addresses the detection of hallucinations in LLM outputs, i.e., content that is nonsensical or unfaithful to the source. Because eliminating hallucinations entirely is considered infeasible, reliable detection is essential in safety-critical applications. The authors hypothesize that hallucinations arise from disruptions in the information flow through attention (e.g., bottlenecks), and that these disruptions can be captured by spectral properties of attention graphs. They investigate whether spectral features of attention maps provide stronger hallucination signals than prior attention-based metrics, aiming to build a robust, generalizable detection probe.

Literature Review

The related work contrasts black-box and white-box hallucination detection. Black-box methods analyze only the generated text, including retrieval-based factuality checking (e.g., Li et al., 2024) and self-consistency methods such as SelfCheckGPT (Manakul et al., 2023). White-box methods leverage internal model signals: SAPLMA (Azaria & Mitchell, 2023) trains linear probes on hidden states; INSIDE (Chen et al., 2024) computes consistency across multiple responses via eigenvalues of the hidden-state covariance; Semantic Entropy (Farquhar et al., 2024) measures uncertainty via response clustering; and Semantic Entropy Probes (Kossen et al., 2024) predict such expensive metrics directly from hidden states. Orgad et al. (2025) showed that truthfulness information is localized in specific tokens and highlighted the limited cross-dataset generalization of probes. Attention-only methods include the lookback ratio (Chuang et al., 2024a) and AttentionScore (Sriramanan et al., 2024), an unsupervised log-determinant-based metric with strong layer dependence. The present work differs by interpreting attention as a directed graph, using the Laplacian's top-k eigenvalues as features, aggregating across all layers, heads, and tokens, and training a supervised probe to detect hallucinations.

Methodology

Core idea: interpret each attention map A^(l,h) (a non-negative, row-stochastic, lower-triangular T×T matrix) as the weighted adjacency matrix of a directed graph over tokens. Define a Laplacian L^(l,h) = D^(l,h) − A^(l,h), where D^(l,h) is the diagonal matrix of normalized out-degrees: d_ii = (1/(T−i)) · sum_{u=i}^{T−1} a_{ui}^(l,h). This bounds every entry of L in [−1, 1], and because L is lower-triangular, its eigenvalues are exactly its diagonal entries. For each layer and head, extract the diagonal of L, sort it, and keep the k largest eigenvalues; concatenating across all layers and heads yields a feature vector z ∈ R^{L·H·k}. Because z is high-dimensional, PCA reduces it to 512 dimensions (or fewer in per-layer experiments) before classification. A logistic regression probe (scikit-learn, max_iter=2000, class_weight="balanced") then predicts hallucination vs. non-hallucination.
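To make the construction concrete, below is a minimal sketch of the feature extraction and probe, assuming attention maps come from a HuggingFace Transformers forward pass with output_attentions=True over the full sequence; the function names are illustrative, not the authors' released code.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def lap_top_k_eigvals(A, k):
    # A: non-negative, row-stochastic, lower-triangular attention map (T x T).
    T = A.shape[0]
    col_sums = np.tril(A).sum(axis=0)    # sum_{u >= i} a_{u,i}
    denom = T - np.arange(T)             # column i has T - i entries
    d = col_sums / denom                 # normalized out-degrees, each in [0, 1]
    eigvals = d - np.diag(A)             # diag(L) for L = D - A; L is lower-triangular,
                                         # so its diagonal entries are its eigenvalues
    return np.sort(eigvals)[::-1][:k]    # k largest (requires T >= k)

def lap_eigvals_features(attentions, k=50):
    # attentions: per-layer tuple of torch tensors with shape [1, H, T, T].
    feats = [
        lap_top_k_eigvals(head.float().cpu().numpy(), k)
        for layer_attn in attentions
        for head in layer_attn[0]
    ]
    return np.concatenate(feats)         # z in R^{L * H * k}

# Probe with the settings quoted above; Z_train stacks one feature vector z
# per example and y_train holds the binary hallucination labels.
probe = make_pipeline(
    PCA(n_components=512),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)
# probe.fit(Z_train, y_train)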

Motivational analysis: Using Llama-3.1-8B on TriviaQA (with similar analyses across 7 datasets and 5 LLMs), a two-sided Mann–Whitney U test compared, for each (layer, head) pair, the distributions of AttentionScore values and of Laplacian eigenvalues between hallucinated and non-hallucinated examples. A higher fraction of heads showed significant differences (p < 0.05) for Laplacian eigenvalues (e.g., 91%) than for AttentionScore (80%), suggesting a stronger predictive signal in the Laplacian spectrum.
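A hedged sketch of such a per-head test using SciPy, where per_head_halluc and per_head_ok each hold, for every (layer, head) pair, an array with one scalar feature per example (illustrative names, not the paper's code):

from scipy.stats import mannwhitneyu

def fraction_significant(per_head_halluc, per_head_ok, alpha=0.05):
    # Fraction of (layer, head) pairs whose hallucinated vs. non-hallucinated
    # feature distributions differ under a two-sided Mann-Whitney U test.
    n_sig = sum(
        mannwhitneyu(h, o, alternative="two-sided").pvalue < alpha
        for h, o in zip(per_head_halluc, per_head_ok)
    )
    return n_sig / len(per_head_halluc)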

Datasets and labeling: Hallucination detection datasets are constructed from QA benchmarks by labeling incorrect LLM answers as hallucinations. Datasets: NQ-Open (validation, 3,610), TriviaQA (validation, 7,983), CoQA (dev, 5,928), SQuADv2 rc.nocontext (9,960), HaluEvalQA (QA subset, 10,000), TruthfulQA (generation split, 817), and GSM8K (test, 1,319, evaluated with exact match). Labeling primarily uses an llm-as-judge approach with gpt-4o-mini to assess correctness and refusals; on Llama-3.1-8B, its agreement with gpt-4.1 was within accepted ranges. Train/test splits are stratified by label (80/20), discarding examples labeled "rejected".
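A minimal sketch of the stratified split on stand-in data (the real Z would stack the spectral feature vectors; sizes and random_state here are illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 512))   # stand-in feature matrix
y = rng.integers(0, 2, size=1000)  # stand-in binary labels (1 = hallucination)

Z_train, Z_test, y_train, y_test = train_test_split(
    Z, y, test_size=0.2, stratify=y, random_state=0  # 80/20, stratified by label
)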

Models and generation: Five open LLMs: Llama-3.1-8B, Llama-3.2-3B, Phi-3.5-mini-instruct, Mistral-Nemo-Instruct-2407, Mistral-Small-24B-Instruct-2501. Decoding temperatures {0.1, 1.0}. Prompts include few- and zero-shot variants; GSM8K uses a structured output prompt.
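A hedged sketch of capturing attention maps during generation with HuggingFace Transformers, matching this setup; the checkpoint id and prompt are placeholders:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"   # one of the five evaluated models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tok("Q: Who wrote Hamlet?\nA:", return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=1.0,                   # the paper evaluates {0.1, 1.0}
    output_attentions=True,            # expose per-layer attention maps
    return_dict_in_generate=True,
)
# out.attentions is a tuple over generation steps; each step holds a tuple of
# per-layer attention tensors of shape [batch, heads, query_len, key_len].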

Baselines: (1) AttentionScore (unsupervised) as in Sriramanan et al., plus AttnLogDet (supervised variant using per-head log-determinants as features). (2) AttnEigvals: eigenvalues of the raw attention maps (not Laplacian) as features. Evaluation uses AUROC on test sets; additional precision/recall reported. Ablations vary k∈{5,10,20,50,100}, per-layer vs all-layers, temperature, cross-dataset generalization, and prompt variations. Implementation uses HuggingFace Transformers, PyTorch, scikit-learn; attention maps are captured during inference; PCA and probes are trained on CPU.
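For contrast with the spectral features, a sketch of per-head log-determinant features in the spirit of AttnLogDet (for a lower-triangular attention map the determinant is the product of the diagonal, so the log-determinant is a sum of log diagonal entries), together with the AUROC evaluation; this is an interpretation of the baseline, not the authors' code:

import numpy as np
from sklearn.metrics import roc_auc_score

def logdet_features(attentions, eps=1e-12):
    # One log-determinant per (layer, head); eps guards against log(0).
    feats = [
        np.log(np.diag(head.float().cpu().numpy()) + eps).sum()
        for layer_attn in attentions
        for head in layer_attn[0]
    ]
    return np.asarray(feats)           # vector in R^{L * H}

# Test-set AUROC, as reported for every method:
# auroc = roc_auc_score(y_test, probe.predict_proba(Z_test)[:, 1])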

Key Findings
  • Spectral signal strength: Across layers/heads, Laplacian eigenvalues showed stronger statistical separation between hallucinated and non-hallucinated examples than AttentionScore (e.g., for Llama-3.1-8B/TriviaQA, 91% vs 80% of heads with p<0.05).
  • Overall efficacy: LapEigvals achieved the best AUROC among attention-based methods on 6/7 datasets, consistently across 5 LLMs (3B–24B). Selected test AUROCs (temperature = 1.0):

      Model              CoQA   GSM8K  HaluEvalQA  NQ-Open  SQuADv2  TriviaQA  TruthfulQA
      Llama-3.1-8B       0.830  0.872  0.874       0.827    0.791    0.889     0.829
      Llama-3.2-3B       0.812  0.870  0.828       0.693    0.757    0.832     0.787
      Phi-3.5            0.821  0.885  0.836       0.826    0.795    0.872     0.777
      Mistral-Nemo       0.835  0.890  0.833       0.795    0.812    0.865     0.828
      Mistral-Small-24B  0.861  0.925  0.882       0.791    0.820    0.876     0.748

    TruthfulQA is the notable exception, where LapEigvals was often second-best, potentially due to the dataset's small size and class imbalance.
  • Importance of Laplacian transform: Using eigenvalues from raw attention matrices (AttnEigvals) underperforms LapEigvals, indicating the Laplacian transformation is crucial to reveal information-flow features related to hallucinations.
  • Supervision vs unsupervision: Unsupervised AttentionScore performs poorly overall; its supervised analog (AttnLogDet) improves but remains inferior to spectral methods (AttnEigvals, LapEigvals).
  • Ablations:
    – Top-k eigenvalues: performance generally increases with larger k, but LapEigvals is less sensitive to k and outperforms AttnEigvals even at small k (e.g., k=5).
    – All layers vs. per layer: aggregating features across all layers outperforms the best single layer, suggesting hallucination signals are distributed across layers.
    – Temperature: higher sampling temperature increases AUROC for all methods; LapEigvals remains best across temperatures, an effect possibly linked to properties of the softmax and to differing characteristics of the resulting hallucinations.
    – Cross-dataset generalization: robustness comparable to or better than baselines; all methods drop on TruthfulQA and GSM8K (domain shift and class imbalance). LapEigvals often attains the highest test AUROC in cross-dataset scenarios.
    – Prompt robustness: across four prompts, LapEigvals consistently outperforms baselines, and its lower variance indicates higher robustness.
  • Additional metrics: Precision/recall patterns vary by dataset; LapEigvals shows higher recall on CoQA, GSM8K, and TriviaQA, and higher precision on HaluEvalQA, NQ-Open, SQuADv2, and TruthfulQA.

Discussion

The results support the hypothesis that hallucinations correlate with disruptions in information flow, which can be captured by spectral properties of attention graphs. Transforming attention maps into a Laplacian emphasizes global structure (normalized out-degrees minus self-attention), better surfacing bottleneck or over-squashing signatures. The top-k eigenvalues from all heads and layers provide a compact yet informative representation for a lightweight probe. Consistent gains across LLM architectures, datasets, temperatures, and prompts, together with reduced generalization gaps, indicate that spectral features carry a robust, model-internal signal for hallucination detection. The superiority of LapEigvals over raw attention eigenvalues and log-determinant features underscores the importance of the Laplacian transform. Nonetheless, cross-dataset generalization challenges persist, especially in small and imbalanced (TruthfulQA) or domain-shifted (GSM8K) settings, suggesting future work on broader-domain training or self-supervised objectives.

Conclusion

The paper introduces LapEigvals, a supervised hallucination detection method that interprets attention maps as graphs and uses the top-k eigenvalues of their Laplacians as features for a logistic regression probe. It achieves state-of-the-art performance among attention-based methods on most evaluated datasets and LLMs, is robust to the choice of k, prompt, and decoding temperature, and shows competitive cross-dataset generalization. Future directions include self-supervised learning on attention-derived graph features to improve generalization, deeper theoretical analysis of information flow and spectral signatures in transformers, and architecture-agnostic probes that handle varying layer/head configurations.

Limitations
  • Supervised method: Requires labeled hallucination vs non-hallucination examples; llm-as-judge may introduce label noise and potential overfitting.
  • Architecture dependence: Features depend on the number of layers and heads; probes are not directly transferable to LLMs with differing architectures.
  • Minimum length requirement: Computing top-k Laplacian eigenvalues requires sequences with at least k tokens (e.g., k=100 needs ≥100 tokens).
  • Access to internals: Requires attention maps, so not applicable to closed LLMs without white-box access.
  • Scope and risks: Evaluated only on selected open LLMs and English QA tasks; applying the method to new domains, languages, or tasks without further validation carries risk.