Computer Science

Harvard Data Science Review

Confidence in the Reasoning of Large Language Models

Y. Pawitan and C. Holmes

The research was conducted by Yudi Pawitan and Chris Holmes. It assesses LLM confidence in two ways: qualitatively, by whether a model persists in its answer when prompted to reconsider, and quantitatively, by self-reported confidence scores. The models tested are GPT4o, GPT4-turbo, and Mistral, on causal judgment, formal fallacies, and probability puzzles. Findings show performance above chance but variable answer stability, a strong tendency to overstate confidence, and a lack of internally coherent confidence signals.

Abstract

There is a growing literature on reasoning by large language models (LLMs), but the discussion on the uncertainty in their responses is still lacking. Our aim is to assess the extent of confidence that LLMs have in their answers and how it correlates with accuracy. Confidence is measured (i) qualitatively in terms of persistence in keeping their answer when prompted to reconsider, and (ii) quantitatively in terms of self-reported confidence score. We investigate the performance of three LLMs (GPT4o, GPT4-turbo, and Mistral) on two benchmark sets of questions on causal judgment and formal fallacies, and a set of probability and statistical puzzles and paradoxes. Although the LLMs show significantly better performance than random guessing, there is a wide variability in their tendency to change their initial answers. There is a positive correlation between qualitative confidence and accuracy, but the overall accuracy for the second answer is often worse than for the first answer. There is a strong tendency to overstate the self-reported confidence score. Confidence is only partially explained by the underlying token-level probability. The material effects of prompting on qualitative confidence and the strong tendency for overconfidence indicate that current LLMs do not have any internally coherent sense of confidence.
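To make the two confidence measures in the abstract concrete, here is a minimal Python sketch of how such quantities could be tallied from a set of question trials. It is an illustration only, not the authors' code or analysis: the `Trial` record, the `summarize` function, the 0-to-1 confidence scale, and the four data rows are all hypothetical and chosen for readability.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One question posed to a model (hypothetical record layout)."""
    first_correct: bool   # was the initial answer correct?
    second_correct: bool  # was the answer after the "reconsider" prompt correct?
    changed: bool         # did the model change its answer when prompted?
    confidence: float     # self-reported confidence, assumed rescaled to [0, 1]


def summarize(trials):
    """Return simple summary rates for a non-empty list of trials."""
    n = len(trials)
    acc_first = sum(t.first_correct for t in trials) / n
    acc_second = sum(t.second_correct for t in trials) / n
    change_rate = sum(t.changed for t in trials) / n
    mean_conf = sum(t.confidence for t in trials) / n
    return {
        "accuracy_first": acc_first,
        "accuracy_second": acc_second,
        # Qualitative confidence: a lower change rate means more persistence.
        "change_rate": change_rate,
        "mean_confidence": mean_conf,
        # Positive gap: stated confidence exceeds first-answer accuracy.
        "overconfidence_gap": mean_conf - acc_first,
    }


# Made-up data for illustration; these are not results from the paper.
trials = [
    Trial(first_correct=True, second_correct=True, changed=False, confidence=0.95),
    Trial(first_correct=True, second_correct=False, changed=True, confidence=0.90),
    Trial(first_correct=False, second_correct=False, changed=False, confidence=1.00),
    Trial(first_correct=False, second_correct=True, changed=True, confidence=0.85),
]

if __name__ == "__main__":
    for name, value in summarize(trials).items():
        print(f"{name}: {value:.3f}")
```

In this sketch, a positive `overconfidence_gap` corresponds to the overstated self-reported confidence that the abstract describes, and comparing `accuracy_first` with `accuracy_second` corresponds to its comparison of first and second answers.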

Publisher

Harvard Data Science Review

Published On

Jan 30, 2025

Authors

Yudi Pawitan, Chris Holmes

DOI

https://doi.org/10.1162/99608f92.b033a087

Explore these studies to deepen your understanding

Adjacent work that informs or extends this paper's methodology and findings.

Psychology

Understanding the Role of Large Language Models in Personalizing and Scaffolding Strategies to Combat Academic Procrastination

A. Bhattacharjee, Y. Zeng, et al.

Computer Science

Sentiment Analysis in the Era of Large Language Models: A Reality Check

W. Zhang, Y. Deng, et al.

Computer Science

Evaluating the capacity of large language models to interpret emotions in images

H. Alrasheed, A. Alghihab, et al.

Interdisciplinary Studies

Analyzing Memory Effects in Large Language Models through the Lens of Cognitive Psychology

Z. Cao, L. Schooler, et al.
