Computer Science
Confidence in the Reasoning of Large Language Models
Y. Pawitan and C. Holmes
This study by Yudi Pawitan and Chris Holmes assesses how confident LLMs are in their answers in two ways: qualitatively, by whether a model persists in its answer when prompted to reconsider, and quantitatively, by its self-reported confidence score. The evaluation covers GPT4o, GPT4-turbo, and Mistral on causal judgment, formal fallacies, and probability puzzles. The models perform above chance, but their answer stability varies, they show a strong tendency to overstate confidence, and their confidence signals are not internally coherent.

