Confidence in the Reasoning of Large Language Models

Y. Pawitan and C. Holmes

The research was conducted by Yudi Pawitan and Chris Holmes. It assesses LLM confidence—qualitatively by persistence when prompted to reconsider and quantitatively by self-reported scores—across GPT4o, GPT4-turbo, and Mistral on causal judgment, formal fallacies, and probability puzzles. Findings show performance above chance but variable answer stability, a strong tendency to overstate confidence, and a lack of internally coherent confidence signals.

Introduction
The emergence of large language models (LLMs) such as OpenAI’s GPT series has prompted debate about whether these next-token predictors exhibit genuine reasoning or are sophisticated pattern recognizers. A hallmark of humanlike reasoning is awareness of uncertainty and calibrated confidence in answers. The study’s aim is to assess the degree of confidence LLMs have in their answers and how that confidence correlates with actual performance. Because conventional statistical measures of uncertainty (e.g., bootstrap) are infeasible without access to raw data or distributional assumptions, the authors adopt empirical approaches. Qualitative confidence is measured by persistence—whether models keep or change an initial answer when prompted to reconsider. Quantitative confidence is measured via self-reported confidence scores (0–100). The relationship between these measures and token-level probabilities is also investigated. Three LLMs (GPT4o, GPT4-turbo, Mistral) are evaluated on two BIG-Bench Hard (BBH) tasks—causal judgment (187 items) and formal fallacies (250 items)—and on 46 probability/statistics puzzles. In brief, the models outperform random guessing but frequently change initial answers when asked to rethink; second answers often have lower accuracy. Self-reported confidence shows strong overconfidence and is only partially explained by token-level probabilities, suggesting current LLMs lack a coherent internal sense of uncertainty.
Literature Review
The paper situates its work within research on evaluating LLM reasoning beyond surface pattern matching. BIG-Bench and BBH provide challenging tasks across domains to probe emergent abilities; early models scored poorly, but newer models (e.g., GPT-4) match or exceed humans in many BBH tasks. However, other indicators of reasoning, such as planning (e.g., International Planning Competition domains), remain limited (e.g., ~12% success for GPT-4 across domains). Prompt sensitivity is a known issue: semantically similar prompts can yield different outputs, motivating techniques like chain-of-thought (CoT) prompting and decomposition-based prompting (e.g., Self-Discover) to guide stepwise reasoning. Despite improvements, whether such methods confer genuine reasoning or calibrated introspection remains open. This study focuses specifically on confidence calibration and uncertainty expression—areas underexplored relative to accuracy benchmarks—and examines how prompt phrasing affects confidence and persistence.
Methodology
Models: OpenAI GPT4o (version 2024-08-06), GPT4-turbo (2024-04-09), and Mistral Large 2 (2024-07-24). Temperature was set to 0 to reduce randomness; small variability (~1%) remained across runs.
Data sets: Two BBH tasks—causal judgment (187 yes/no items) and formal fallacies (250 valid/invalid items)—and 46 statistical puzzles/paradoxes from Pawitan and Lee (2024).
Prompts and procedure: For base performance, each item was answered directly (First answer) without explanations, followed by a rethink prompt to allow changing the answer. Qualitative confidence was operationalized as persistence (keeping the initial answer) versus changing it after the rethink prompt. Three rethink prompts were used to test phrasing effects: (i) Simple: “Please think again carefully”; (ii) Neutral: “We always ask our LLM to double-check their answers, so please think again carefully”; and (iii) Post-confidence: same as Neutral but issued after obtaining a confidence score. Quantitative confidence was elicited by asking: “On a score between 0 and 100 … what confidence score do you have in your answer?” As an additional quantitative proxy, fair betting odds for correctness were requested and analyzed (odds <1 inverted for comparability). A recent CoT method, Self-Discover, was also evaluated using its four-step procedure (select high-level reasoning modules; adapt modules; create a reasoning structure; solve), with sessions reset between prompts to mitigate context effects. Sessions were reset after each question for BBH; for the statistical puzzles, sessions were reset after each related section. (A minimal sketch of this answer/rethink/confidence loop follows after this list.)
Interfaces: R wrappers chatgpt (for GPT4o/GPT4-turbo) and tidychatmodels (for Mistral).
Statistical analysis: Comparisons of proportions used chi-squared tests with Yates’s correction (approximating a two-sided Fisher’s exact test for small tables).
Outcome measures: accuracy on first and second answers; conditional accuracy when keeping versus changing answers; rates of changing answers under the different rethink prompts; distributions of self-reported confidence scores and betting odds; and associations with token-level probabilities of the decision keywords (yes/no; valid/invalid).
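The study itself accessed the models through the R wrappers noted above; purely as an illustration, the sketch below uses the OpenAI Python SDK to show the shape of the first-answer / rethink / confidence-elicitation loop. The model name, system instruction, and the full confidence-question wording (abbreviated in the paper) are assumed placeholders, not the authors' actual prompts or code.

```python
# Minimal sketch (not the authors' R code) of the first-answer / rethink / confidence
# protocol, using the OpenAI Python SDK at temperature 0.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RETHINK_SIMPLE = "Please think again carefully"  # the paper's Simple rethink prompt

def ask(messages, model="gpt-4o"):
    """Send the running conversation and return (and record) the assistant's reply."""
    resp = client.chat.completions.create(model=model, messages=messages, temperature=0)
    reply = resp.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

def run_item(question, model="gpt-4o"):
    # System instruction is an assumed stand-in for "answer directly, without explanations".
    messages = [
        {"role": "system", "content": "Answer briefly, without explanation."},
        {"role": "user", "content": question},
    ]
    first = ask(messages, model)                      # First answer
    messages.append({"role": "user", "content": RETHINK_SIMPLE})
    second = ask(messages, model)                     # answer after the rethink prompt
    messages.append({"role": "user", "content":
        # stand-in wording; the paper abbreviates the exact confidence question
        "On a score between 0 and 100, what confidence score do you have in your answer?"})
    confidence = ask(messages, model)                 # self-reported confidence (0-100)
    return first, second, confidence
```

An equivalent workflow for Mistral would go through its own chat API or a wrapper such as tidychatmodels, with a fresh session per question (or per related puzzle section) as described above.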
Key Findings
Performance and qualitative confidence (BBH tasks): First-answer accuracy for causal judgment and formal fallacies ranged between 0.62 and 0.70 across models, all significantly above random (0.5). After rethinking (Simple prompt), accuracy dropped for GPT4-turbo and Mistral but not for GPT4o. Causal judgment (n=187), first vs rethink accuracy: GPT4o 0.67→0.66; GPT4-turbo 0.70→0.60; Mistral 0.68→0.45. Formal fallacies (n=250): GPT4o 0.62→0.66; GPT4-turbo 0.66→0.45; Mistral 0.68→0.33. Conditional accuracy showed a clear qualitative confidence effect: accuracy when keeping the first answer exceeded accuracy after changing it. For formal fallacies, accuracy if kept vs if changed: GPT4o 0.68 vs 0.61 (p=0.48), GPT4-turbo 0.93 vs 0.38 (p=3.7E-08), Mistral 0.80 vs 0.32 (p=0.078). Change rates varied widely: causal judgment—GPT4o 0.17, GPT4-turbo 0.61, Mistral 0.87; formal fallacies—GPT4o 0.18, GPT4-turbo 0.88, Mistral 0.98. Although wrong initial answers were more likely to be changed, GPT4-turbo and Mistral often changed correct answers, degrading second-answer accuracy.
Statistical puzzles (n=46): First-answer accuracy was 0.52 (GPT4o), 0.57 (GPT4-turbo), and 0.61 (Mistral), all more than 2σ above random (0.39). Rethink accuracy decreased for GPT4-turbo (0.46) and Mistral (0.54) but not for GPT4o (0.54). Accuracy if kept vs if changed: GPT4o 0.55 vs 0.50 (p=1.00), GPT4-turbo 0.75 vs 0.30 (p=9.1E-03), Mistral 0.67 vs 0.31 (p=0.047). Change rates: 0.13 (GPT4o), 0.65 (GPT4-turbo), 0.35 (Mistral). Comparisons across model versions indicated differing qualitative confidence (e.g., GPT-4 had a much higher change tendency than GPT4o in formal fallacies: 83% vs 18%).
Prompting method (Self-Discover): Modestly improved first-answer accuracy (e.g., GPT4-turbo on formal fallacies rose to 0.74) and substantially reduced change rates for GPT4-turbo and Mistral relative to direct prompting; GPT4o behaved similarly across prompting modes.
Self-reported confidence: Substantial overconfidence. In formal fallacies, GPT4o and GPT4-turbo reported 100 for all answers; Mistral reported 100 for 79% of answers and 95 otherwise. In causal judgment, high confidence (≥95) was frequent (GPT4-turbo 78%, Mistral 86%). High self-reported confidence often did not correspond to higher accuracy (false confidence), although for GPT4o first answers in causal judgment, accuracy was higher when confidence was ≥95 than when it was <95 (0.80 vs 0.60; p=8.3E-03).
Qualitative vs quantitative confidence: The correlation between persistence and self-reported score was weak to marginal overall; it was significant for GPT4-turbo under the Post-confidence prompt and marginally significant for all models with the Simple prompt (p from 0.0065 to 0.093).
Betting odds: After normalizing odds (values <1 inverted), the proportion of high odds (>2) was 0.17 (GPT4o), 0.78 (GPT4-turbo), and 0.95 (Mistral). Associations with accuracy and confidence scores were generally non-significant (apart from some associations for GPT4o). Odds were associated with persistence after the Simple rethink for GPT4o and GPT4-turbo, but not for Mistral.
Prompt phrasing effects: The rethink prompt strongly influenced change rates in causal judgment (n=187). Simple vs Neutral vs Post-confidence: GPT4o 0.17 vs 0.04 vs 0.02; GPT4-turbo 0.61 vs 0.50 vs 0.33; Mistral 0.87 vs 0.71 vs 0.01.
Token-level probabilities: Decision-token probabilities were very high (median >0.995, except GPT4o in formal fallacies at 0.93). Accuracy increased with token probability but remained far below it except at extremely high levels (>0.99999). Persistence also increased with token probability yet stayed materially lower than the probability would imply (models could still change answers at token probabilities around 0.9). The correlation between token probability and self-reported confidence was weak because of ceiling effects (many scores of 100 or ≥95).
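To make the kept-versus-changed comparisons above concrete, the sketch below runs a Yates-corrected chi-squared test on a hypothetical 2×2 table (kept/changed by correct/incorrect). The counts are illustrative reconstructions chosen to roughly match the reported GPT4-turbo formal-fallacies proportions (0.93 vs 0.38 with a change rate of 0.88 on n=250); they are not the study's item-level data.

```python
# Yates-corrected chi-squared comparison of conditional accuracies (kept vs changed).
# The counts below are hypothetical, NOT the study's data.
from scipy.stats import chi2_contingency

#                 correct  incorrect
kept_counts    = [28,       2]      # accuracy if kept    = 28/30  ≈ 0.93
changed_counts = [84,     136]      # accuracy if changed = 84/220 ≈ 0.38

table = [kept_counts, changed_counts]
chi2, p_value, dof, expected = chi2_contingency(table, correction=True)  # Yates's correction

print(f"accuracy if kept    = {kept_counts[0] / sum(kept_counts):.2f}")
print(f"accuracy if changed = {changed_counts[0] / sum(changed_counts):.2f}")
print(f"chi-squared = {chi2:.1f}, df = {dof}, p = {p_value:.2g}")
```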
Discussion
The findings show that while LLMs exceed random guessing on challenging reasoning tasks, their confidence is not coherently calibrated. Higher qualitative confidence (persistence) correlates with higher accuracy, but frequent, prompt-sensitive changes of initially correct answers by some models (notably GPT4-turbo and Mistral) degrade second-answer accuracy. Self-reported confidence is often inflated and poorly aligned with correctness, indicating a lack of introspective understanding of uncertainty, a failure distinct from hallucination. Token-level probabilities partially explain accuracy and persistence but fall short as a full account of confidence: models can exhibit high token probabilities yet still change answers or be incorrect, and self-reported confidence is only weakly linked to these probabilities. Practically, users' trust should be tempered: a persistent answer after a rethink can signal higher accuracy, but the prompting itself can induce changes and miscalibration. Multiple rethinks can modestly improve accuracy (e.g., taking the majority answer over three prompts), but the benefit varies by model and task. The study also highlights pitfalls such as the Clever Hans effect: re-prompting only the items with wrong initial answers can give a misleading impression of self-correction unless changes to initially correct answers are also accounted for. Effective use may therefore come from expert-in-the-loop workflows in which humans vet and adjudicate LLM outputs rather than defer to self-reported confidence. The broader implication is that current LLMs lack robust, humanlike introspection about uncertainty, limiting their reliability as autonomous experts in high-stakes reasoning tasks.
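The "majority over three prompts" heuristic mentioned above amounts to a simple vote across an item's successive answers (the first answer plus the answers after repeated rethink prompts). A minimal sketch, using a hypothetical helper not taken from the paper:

```python
# Majority vote over an item's successive answers (first answer plus answers after
# repeated rethink prompts). Ties are broken in favour of the earliest answer given.
from collections import Counter

def majority_answer(answers: list[str]) -> str:
    counts = Counter(a.strip().lower() for a in answers)
    top = max(counts.values())
    for a in answers:                         # earliest answer among the most frequent wins
        if counts[a.strip().lower()] == top:
            return a
    return answers[0]

print(majority_answer(["valid", "invalid", "valid"]))  # -> "valid"
```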
Conclusion
This work systematically evaluates LLM confidence through qualitative persistence and quantitative self-reports across BBH causal judgment, BBH formal fallacies, and statistical puzzles. Models outperform random guessing but exhibit wide variability in persistence, strong prompt sensitivity, and substantial overconfidence in self-reported scores. Confidence measures only partially align with token-level probabilities, and second answers often underperform first answers because correct initial responses are frequently overturned. These results suggest that current LLMs lack a coherent internal representation of uncertainty. Looking forward, improvements may come from larger models, enhanced training, more sophisticated reasoning prompts (e.g., CoT, Self-Discover), new architectures, or integrated generate-test frameworks. Benchmarking studies like this provide targets for progress and guidance for eliciting more meaningful and externally validated uncertainty measures.
Limitations
- Response brevity constraints: For practicality, prompts asked models to be brief, which may influence performance and confidence measures.
- Potential data leakage: Some tasks (e.g., well-known puzzles; BBH despite warnings) may have appeared in training corpora, making it difficult to distinguish de novo reasoning from memorization.
- Lack of ground-truth uncertainty access: Classical uncertainty estimation (e.g., bootstrap) is infeasible without raw data or distributional assumptions, limiting uncertainty assessment to empirical proxies.
- Residual randomness: Even at temperature 0, small run-to-run stochasticity (~1%) may slightly affect reported accuracies.
- Session effects: Sessions had to be reset between prompts; without resets, performance (e.g., with Self-Discover) can degrade, indicating sensitivity to conversational context.