Introduction
Large Language Models (LLMs) such as ChatGPT and Gemini demonstrate impressive capabilities in reasoning and question answering. A significant limitation, however, is their tendency to "hallucinate", generating false or unsubstantiated information. This unreliability poses serious risks across applications, from citing legal precedents to making medical diagnoses, and attempts to improve truthfulness through supervision or reinforcement learning have so far had limited success. The challenge is to develop a general method for detecting hallucinations that works reliably even on novel, unseen questions, some of which may lack definitive answers even for human experts.

This research addresses the challenge by focusing on a specific type of hallucination termed 'confabulations': fluent but arbitrary and incorrect generations, in the sense that the answer changes with seemingly irrelevant factors such as the random seed. For instance, an LLM might answer a medical question about Sotorasib's target correctly on one sampling run and incorrectly on another, despite identical prompts. Confabulations differ from other LLM errors, such as consistently wrong answers stemming from flawed training data or deliberate falsehoods produced to achieve a specific reward, and the authors argue that lumping these distinct error types under the umbrella term "hallucination" is unhelpful. The work aims to advance LLM oversight by identifying confabulations, making it easier for users to recognize when extra caution is necessary. Crucially, the method does not guarantee factuality: it flags potential confabulations while leaving systematic errors outside its scope.
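To make the notion of "arbitrary" concrete, the following minimal sketch resamples the same prompt several times and measures how often the answers agree. It is an illustration, not code from the paper; `sample_answer` is a hypothetical stand-in for any LLM call that samples at a non-zero temperature.

```python
from collections import Counter

def answer_stability(prompt, sample_answer, n_samples=10):
    """Fraction of sampled answers that agree with the most common answer."""
    # `sample_answer` is a hypothetical LLM call sampling at temperature > 0.
    answers = [sample_answer(prompt, temperature=1.0) for _ in range(n_samples)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / n_samples
```

A low stability score, with answers scattered across many alternatives, is the fingerprint of a confabulation; a consistently wrong answer would score high and falls outside the scope of this method.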
Literature Review
Prior work on LLM uncertainty has largely concentrated on simpler settings such as classifiers and regressors, yet the most promising applications of LLMs involve free-form generation, a domain where existing uncertainty methods prove inadequate. The term 'hallucination' itself, as applied to machine-learning systems, has been debated for its potential anthropomorphism. While acknowledging the need for care with such metaphors, the authors note that its widespread adoption highlights a significant phenomenon requiring precise definition. This research aims to contribute a more precise understanding of confabulations within the broader landscape of LLM errors.
Methodology
The core of the proposed method is quantifying the 'semantic entropy' of LLM generations. Unlike naive entropy measures, which treat every distinct word sequence as a different answer, semantic entropy operates over the meanings of sentences; high semantic entropy signifies high uncertainty about what the answer means. The process involves several steps. First, multiple possible answers to a given question are sampled from the model. These answers are then clustered algorithmically by semantic equivalence, determined via bidirectional entailment: sentences A and B belong to the same cluster if A entails B and B entails A. Entailment is judged using both general-purpose LLMs and dedicated natural language inference (NLI) tools, and the authors provide evaluations of these entailment judges in the supplementary materials. Confabulations are flagged when semantic entropy is high. For longer passages, the text is first decomposed into smaller factual claims (factoids); questions are generated for each claim, answers are resampled, and semantic entropy is computed at the factoid level. The approach requires no prior domain knowledge, works across diverse models and domains, and needs no task-specific training data.

The researchers evaluate the method extensively across several datasets: TriviaQA (trivia knowledge), SQuAD 1.1 (general knowledge), BioASQ (life sciences), NQ-Open (open-domain questions from Google Search), SVAMP (mathematical word problems), and a custom FactualBio dataset of biographies. They use a range of LLMs (LLaMA 2 Chat, Falcon Instruct, Mistral Instruct, and GPT-4) with varying parameter counts. For GPT-4, a discrete approximation of the semantic entropy estimator is used because output token probabilities were unavailable at the time of the study. The authors compare against several baselines: naive entropy, a supervised embedding regression model (evaluated both in- and out-of-distribution), and a P(True) method in which the model itself judges whether a proposed answer is true given alternative answers. For paragraph-length biographies, the approach is adapted to factual claims within the text, generating questions for each claim and assessing semantic entropy across the resampled answers.
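The following is a minimal sketch of the pipeline described above, not the authors' implementation. It assumes `answers` is a list of (text, log_probability) pairs sampled from the model for one question, and `entails(a, b)` is a hypothetical wrapper around an NLI model or an LLM prompt that returns True when a entails b.

```python
import math

def cluster_by_meaning(answers, entails):
    """Greedily group sampled answers into clusters of equivalent meaning,
    using bidirectional entailment against one representative per cluster."""
    clusters = []  # each cluster is a list of (text, logprob) pairs
    for text, logprob in answers:
        for cluster in clusters:
            representative = cluster[0][0]
            if entails(text, representative) and entails(representative, text):
                cluster.append((text, logprob))
                break
        else:
            clusters.append([(text, logprob)])
    return clusters

def semantic_entropy(answers, entails):
    """Entropy over meaning clusters rather than over distinct word sequences.
    (The paper length-normalises sequence log-probabilities before this step.)"""
    clusters = cluster_by_meaning(answers, entails)
    cluster_mass = [sum(math.exp(lp) for _, lp in c) for c in clusters]
    total = sum(cluster_mass)
    probs = [m / total for m in cluster_mass]
    return -sum(p * math.log(p) for p in probs)

def discrete_semantic_entropy(answers, entails):
    """Approximation used when token probabilities are unavailable (e.g. GPT-4):
    estimate each cluster's probability by its share of the sampled answers."""
    clusters = cluster_by_meaning(answers, entails)
    n = sum(len(c) for c in clusters)
    probs = [len(c) / n for c in clusters]
    return -sum(p * math.log(p) for p in probs)
```

In use, answers whose entropy exceeds a chosen threshold would be flagged as likely confabulations; for paragraph-length text, the same computation is applied to the resampled answers of each factoid question.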
Key Findings
The results demonstrate the superior performance of semantic entropy in detecting confabulations compared with the baselines. Across datasets and LLMs, semantic entropy consistently achieves higher AUROC (area under the receiver operating characteristic curve) and AURAC (area under the rejection accuracy curve) scores. AUROC measures how well a score distinguishes correct from incorrect answers; AURAC reflects the improvement in accuracy obtained when the score is used to filter out the questions most likely to produce confabulations. Semantic entropy substantially outperforms naive entropy, which ignores semantic equivalence. The supervised embedding regression model also underperforms, particularly when there is a distribution shift between training and deployment data. The discrete approximation of semantic entropy, used for GPT-4, performs comparably to the standard estimator, demonstrating its robustness.

The analysis of the FactualBio dataset (paragraph-length biographies generated by GPT-4) shows that semantic entropy adapts well to longer texts, outperforming a simple self-check baseline and an adapted P(True) method. Table 1 in the paper gives illustrative examples in which semantic entropy correctly identifies, or misses, confabulations, highlighting how the semantic clustering shapes the result. Figure 2 shows the improvement in AUROC and AURAC for semantic entropy over the baselines across multiple datasets and LLMs, and Figure 3 shows its superior performance on the biography dataset. Averaged over 30 combinations of tasks and models, semantic entropy reaches an AUROC of 0.790, compared with 0.691 for naive entropy, 0.698 for P(True), and 0.687 for embedding regression. These results indicate that a substantial portion of LLM errors are confabulations, a class of errors that semantic entropy effectively detects. Notably, the performance of semantic entropy remains stable across model sizes and families.
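As a rough illustration of what these two metrics measure (not the paper's evaluation code), the sketch below assumes `uncertainty` holds a score such as semantic entropy for each question and `correct` holds 1/0 labels for whether the model's answer was judged correct; the particular sweep of rejection fractions is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc(uncertainty, correct):
    """How well the uncertainty score separates incorrect from correct answers.
    The 'positive' class is 'incorrect', since high uncertainty should flag errors."""
    incorrect = 1 - np.asarray(correct)
    return roc_auc_score(incorrect, np.asarray(uncertainty))

def aurac(uncertainty, correct, rejection_fractions=np.linspace(0.0, 0.95, 20)):
    """Average accuracy on the questions kept after rejecting the most uncertain
    fraction, swept over a range of rejection fractions (an approximate area)."""
    order = np.argsort(uncertainty)                   # most confident answers first
    correct = np.asarray(correct, dtype=float)[order]
    accuracies = []
    for frac in rejection_fractions:
        kept = max(1, int(round(len(correct) * (1 - frac))))
        accuracies.append(correct[:kept].mean())
    return float(np.mean(accuracies))
```

A higher AURAC means that refusing to answer the questions flagged as most uncertain buys a larger gain in accuracy on the questions that remain.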
Discussion
The success of semantic entropy in detecting confabulations suggests that LLMs may possess a better sense of their own knowledge limitations than previously thought; the problem is that they do not always express this uncertainty effectively. The approach addresses a significant source of LLM errors that will likely persist even as models improve, focusing on situations that lie beyond the reach of human supervision. The method's domain-agnostic nature suggests potential applicability to other language tasks such as abstractive summarization, and future extensions could combine it with methods such as cross-examination for enhanced oversight. The study acknowledges its limitations: it does not directly tackle situations where LLMs are confidently incorrect due to flawed training, systematic reasoning errors, or deliberate user misdirection. The authors advocate addressing these distinct error types separately.
Conclusion
This research introduces a novel method for detecting a critical class of LLM errors—confabulations—using semantic entropy. The method’s effectiveness across various LLMs and datasets demonstrates its practical value in improving LLM reliability. Future research could explore extensions to different types of input variations and the application of semantic entropy to other language generation tasks. The study’s success underscores the importance of considering semantic meaning when evaluating and improving LLMs.
Limitations
While the method effectively detects confabulations, it does not address all types of LLM errors, such as those stemming from biased or flawed training data or deliberate misinformation. The effectiveness of semantic clustering relies on the chosen method for determining entailment, and the accuracy of this step can affect the overall performance. The evaluation relies on comparing generated text to reference answers, which may not fully capture the nuances of open-ended conversational settings.