Introduction
Large Language Models (LLMs) such as ChatGPT and Gemini demonstrate impressive capabilities in reasoning and question answering. A significant limitation, however, is their tendency to "hallucinate", generating false or unsubstantiated information. This unreliability poses serious risks across applications, from citing legal precedents to making medical diagnoses, and attempts to improve truthfulness through supervision or reinforcement learning have so far had limited success. The challenge is to develop a general method for detecting hallucinations that works reliably even on novel, unseen questions, some of which may lack definitive answers even for human experts.

This research addresses the challenge by focusing on a specific type of hallucination termed 'confabulations': fluent but arbitrary and incorrect generations, in the sense that the answer changes with seemingly irrelevant factors such as the random seed. For instance, an LLM might answer a medical question about Sotorasib's target correctly on one sampling run and incorrectly on another, despite identical prompts. Confabulations differ from other LLM errors, such as consistently wrong answers stemming from flawed training data or deliberate falsehoods produced to achieve a specific reward, and the authors argue that lumping these distinct error types under the umbrella term "hallucination" is unhelpful. The work aims to advance LLM oversight by identifying confabulations, making it easier for users to recognize when extra caution is necessary. Crucially, the method does not guarantee factuality: it flags potential confabulations while leaving systematic errors outside its scope.
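To make the notion of "arbitrary" concrete, the following minimal sketch resamples the same prompt several times and measures how often the answers agree. It is an illustration, not code from the paper; `sample_answer` is a hypothetical stand-in for any LLM call that samples at a non-zero temperature.

```python
from collections import Counter

def answer_stability(prompt, sample_answer, n_samples=10):
    """Fraction of sampled answers that agree with the most common answer."""
    # `sample_answer` is a hypothetical LLM call sampling at temperature > 0.
    answers = [sample_answer(prompt, temperature=1.0) for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples
```

A low stability score, with answers scattered across many alternatives, is the fingerprint of a confabulation; a consistently wrong answer would score high and falls outside the scope of this method.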
Literature Review
Prior work on LLM uncertainty has largely concentrated on simpler settings such as classifiers and regressors, yet the most promising applications of LLMs involve free-form generation, a domain where existing uncertainty methods prove inadequate. The term 'hallucination' itself, as applied to machine-learning systems, has been debated for its potential anthropomorphism. While acknowledging the need for care with such metaphors, the authors note that its widespread adoption highlights a significant phenomenon requiring precise definition. This research aims to contribute a more precise understanding of confabulations within the broader landscape of LLM errors.
Methodology
The core of the proposed method is quantifying the 'semantic entropy' of LLM generations. Unlike naive entropy measures, which treat every distinct word sequence as a different answer, semantic entropy operates over the meanings of sentences; high semantic entropy signifies high uncertainty about what the answer means. The process involves several steps. First, multiple possible answers to a given question are sampled from the model. These answers are then clustered algorithmically by semantic equivalence, determined via bidirectional entailment: sentences A and B belong to the same cluster if A entails B and B entails A. Entailment is judged using both general-purpose LLMs and dedicated natural language inference (NLI) tools, and the authors provide evaluations of these entailment judges in the supplementary materials. Confabulations are flagged when semantic entropy is high. For longer passages, the text is first decomposed into smaller factual claims (factoids); questions are generated for each claim, answers are resampled, and semantic entropy is computed at the factoid level. The approach requires no prior domain knowledge, works across diverse models and domains, and needs no task-specific training data.

The researchers evaluate the method extensively across several datasets: TriviaQA (trivia knowledge), SQuAD 1.1 (general knowledge), BioASQ (life sciences), NQ-Open (open-domain questions from Google Search), SVAMP (mathematical word problems), and a custom FactualBio dataset of biographies. They use a range of LLMs (LLaMA 2 Chat, Falcon Instruct, Mistral Instruct, and GPT-4) with varying parameter counts. For GPT-4, a discrete approximation of the semantic entropy estimator is used because output token probabilities were unavailable at the time of the study. The authors compare against several baselines: naive entropy, a supervised embedding regression model (evaluated both in- and out-of-distribution), and a P(True) method in which the model itself judges whether a proposed answer is true given alternative answers. For paragraph-length biographies, the approach is adapted to factual claims within the text, generating questions for each claim and assessing semantic entropy across the resampled answers.
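The following is a minimal sketch of the pipeline described above, not the authors' implementation. It assumes `answers` is a list of (text, log_probability) pairs sampled from the model for one question, and `entails(a, b)` is a hypothetical wrapper around an NLI model or an LLM prompt that returns True when a entails b.

```python
import math

def cluster_by_meaning(answers, entails):
    """Greedily group sampled answers into clusters of equivalent meaning,
    using bidirectional entailment against one representative per cluster."""
    clusters = []  # each cluster is a list of (text, logprob) pairs
    for text, logprob in answers:
        for cluster in clusters:
            representative = cluster[0][0]
            if entails(text, representative) and entails(representative, text):
                cluster.append((text, logprob))
                break
        else:
            clusters.append([(text, logprob)])
    return clusters

def semantic_entropy(answers, entails):
    """Entropy over meaning clusters rather than over distinct word sequences.
    (The paper length-normalises sequence log-probabilities before this step.)"""
    clusters = cluster_by_meaning(answers, entails)
    cluster_mass = [sum(math.exp(lp) for _, lp in c) for c in clusters]
    total = sum(cluster_mass)
    probs = [m / total for m in cluster_mass]
    return -sum(p * math.log(p) for p in probs)

def discrete_semantic_entropy(answers, entails):
    """Approximation used when token probabilities are unavailable (e.g. GPT-4):
    estimate each cluster's probability by its share of the sampled answers."""
    clusters = cluster_by_meaning(answers, entails)
    n = sum(len(c) for c in clusters)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```

In use, answers whose entropy exceeds a chosen threshold would be flagged as likely confabulations; for paragraph-length text, the same computation is applied to the resampled answers of each factoid question.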
Key Findings
The results demonstrate the superior performance of semantic entropy in detecting confabulations compared with the baselines. Across datasets and LLMs, semantic entropy consistently achieves higher AUROC (area under the receiver operating characteristic curve) and AURAC (area under the rejection accuracy curve) scores. AUROC measures how well a score distinguishes correct from incorrect answers; AURAC reflects the improvement in accuracy obtained when the score is used to filter out the questions most likely to produce confabulations. Semantic entropy substantially outperforms naive entropy, which ignores semantic equivalence. The supervised embedding regression model also underperforms, particularly when there is a distribution shift between training and deployment data. The discrete approximation of semantic entropy, used for GPT-4, performs comparably to the standard estimator, demonstrating its robustness.

The analysis of the FactualBio dataset (paragraph-length biographies generated by GPT-4) shows that semantic entropy adapts well to longer texts, outperforming a simple self-check baseline and an adapted P(True) method. Table 1 in the paper gives illustrative examples in which semantic entropy correctly identifies, or misses, confabulations, highlighting how the semantic clustering shapes the result. Figure 2 shows the improvement in AUROC and AURAC for semantic entropy over the baselines across multiple datasets and LLMs, and Figure 3 shows its superior performance on the biography dataset. Averaged over 30 combinations of tasks and models, semantic entropy reaches an AUROC of 0.790, compared with 0.691 for naive entropy, 0.698 for P(True), and 0.687 for embedding regression. These results indicate that a substantial portion of LLM errors are confabulations, a class of errors that semantic entropy effectively detects. Notably, the performance of semantic entropy remains stable across model sizes and families.
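As a rough illustration of what these two metrics measure (not the paper's evaluation code), the sketch below assumes `uncertainty` holds a score such as semantic entropy for each question and `correct` holds 1/0 labels for whether the model's answer was judged correct; the particular sweep of rejection fractions is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(uncertainty, correct):
    """How well the uncertainty score separates incorrect from correct answers.
    The 'positive' class is 'incorrect', since high uncertainty should flag errors."""
    incorrect = 1 - np.asarray(correct)
    return roc_auc_score(incorrect, np.asarray(uncertainty))

def aurac(uncertainty, correct, rejection_fractions=np.linspace(0.0, 0.95, 20)):
    """Average accuracy on the questions kept after rejecting the most uncertain
    fraction, swept over a range of rejection fractions (an approximate area)."""
    order = np.argsort(uncertainty)                   # most confident answers first
    correct = np.asarray(correct, dtype=float)[order]
    accuracies = []
    for frac in rejection_fractions:
        kept = max(1, int(round(len(correct) * (1 - frac))))
        accuracies.append(correct[:kept].mean())
    return float(np.mean(accuracies))
```

A higher AURAC means that refusing to answer the questions flagged as most uncertain buys a larger gain in accuracy on the questions that remain.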
Discussion
The success of semantic entropy in detecting confabulations suggests that LLMs may possess a better sense of their own knowledge limitations than previously thought; the problem is that they do not always express this uncertainty effectively. The approach addresses a significant source of LLM errors that will likely persist even as models improve, focusing on situations that lie beyond the reach of human supervision. The method's domain-agnostic nature suggests potential applicability to other language tasks such as abstractive summarization, and future extensions could combine it with methods such as cross-examination for enhanced oversight. The study acknowledges its limitations: it does not directly tackle situations where LLMs are confidently incorrect due to flawed training, systematic reasoning errors, or deliberate user misdirection. The authors advocate addressing these distinct error types separately.
Conclusion
This research introduces a novel method for detecting a critical class of LLM errors—confabulations—using semantic entropy. The method’s effectiveness across various LLMs and datasets demonstrates its practical value in improving LLM reliability. Future research could explore extensions to different types of input variations and the application of semantic entropy to other language generation tasks. The study’s success underscores the importance of considering semantic meaning when evaluating and improving LLMs.
Limitations
While the method effectively detects confabulations, it does not address all types of LLM errors, such as those stemming from biased or flawed training data or deliberate misinformation. The effectiveness of semantic clustering relies on the chosen method for determining entailment, and the accuracy of this step can affect the overall performance. The evaluation relies on comparing generated text to reference answers, which may not fully capture the nuances of open-ended conversational settings.