LARGE LANGUAGE MODELS CANNOT SELF-CORRECT REASONING YET

J. Huang, X. Chen, et al.

This paper critically examines self-correction in Large Language Models, focusing on intrinsic self-correction, in which a model tries to fix its own answers without external feedback. The authors find that LLMs often struggle to self-correct during reasoning and can even degrade after attempting fixes, and they offer directions for future research and applications. This research was conducted by Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou.
Introduction

The paper investigates whether large language models can self-correct their own reasoning without external feedback. Motivated by growing interest in self-correction techniques, the authors pose the key question: if an LLM can self-correct, why does it not produce the correct answer initially? They introduce the concept of intrinsic self-correction, where the model attempts to refine its prior output using only its internal capabilities. This setting is important because high-quality external feedback is often unavailable in real-world applications. The study focuses on reasoning tasks, a domain where accuracy and reliability are critical, and examines the true potential and limitations of self-correction under realistic constraints. Contrary to optimistic prior reports, the authors hypothesize and find that LLMs generally fail to improve—and often degrade—their reasoning performance when self-correcting without oracle or external feedback.

Literature Review

Section 2 surveys self-correction methods and clarifies ambiguity in definitions, particularly distinguishing internal (intrinsic) versus external feedback. Prior works often rely on oracle labels or external signals—human judgments, other models, or tools—to guide corrections (e.g., Kim et al. 2023; Shinn et al. 2023; Gou et al. 2023; Chen et al. 2023b). For reasoning, some studies demonstrate gains by using correctness labels to stop correction loops, which the authors argue does not reflect intrinsic capability. Multi-agent debate (Du et al., 2023; Liang et al., 2023) is reviewed as another approach relying on multiple model critiques, but its gains may stem from aggregating multiple generations rather than true correction. Self-consistency (Wang et al., 2022) is highlighted as a strong baseline that selects answers via majority voting across multiple samples. The review also notes prompt design issues in prior evaluations (Madaan et al., 2023), where feedback prompts include constraints omitted from initial instructions, confounding the source of improvements.
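The self-consistency baseline mentioned above reduces to a majority vote over sampled answers. The following is a minimal sketch, assuming the final answers have already been extracted from each sampled response as strings; the function name `majority_vote` and the sample values are illustrative, not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common(1) returns a list with one (answer, count) pair.
    return Counter(answers).most_common(1)[0][0]


# Hypothetical final answers extracted from five sampled responses.
samples = ["18", "20", "18", "18", "24"]
print(majority_vote(samples))
```

Because the vote only needs independent samples, its inference cost is simply the number of responses drawn, which makes it a convenient cost-matched baseline.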

Methodology

Experiments evaluate self-correction with and without oracle labels on reasoning benchmarks where prior work reported gains. The benchmarks are GSM8K (1,319 math word problems), CommonSenseQA (the 1,221-question multiple-choice dev set), and HotpotQA (100 questions, closed-book, scored by exact match). The models are GPT-3.5-Turbo (gpt-3.5-turbo-0613) and GPT-4 (API accessed 2023/08/29), plus GPT-4-Turbo (gpt-4-1106-preview) and Llama-2-70b-chat for the intrinsic tests. GPT-3.5 is evaluated on the full sets; GPT-4, GPT-4-Turbo, and Llama-2 are evaluated on 200 random questions per dataset (100 for HotpotQA) to reduce cost. Decoding temperature is 1 for GPT-3.5 and GPT-4 and 0 for GPT-4-Turbo and Llama-2. Up to two rounds of self-correction are performed.

Prompting follows a three-step protocol: (1) initial answer generation (which also serves as the baseline), (2) review/feedback on the previous answer, and (3) re-answering using that feedback. Prompts largely follow Kim et al. (2023) and Shinn et al. (2023), with added format instructions for evaluation. For intrinsic self-correction, labels are removed and the model itself decides whether to change or retain its answer. Multiple feedback prompt variations are tested (e.g., "Assume that this answer could be either correct or incorrect. Review..."; "Review your previous answer and determine whether it's correct..."; "Verify whether your answer is correct, and provide an explanation.").

Two additional studies are included. First, the multi-agent debate of Du et al. (2023) is replicated on GSM8K with gpt-3.5-turbo-0301, 3 agents, and 2 rounds, and compared against self-consistency with matched response counts. Second, a prompt design case study on Constrained Generation (Madaan et al., 2023) adds an explicit constraint, "includes ALL concepts", to the initial instruction and evaluates concept coverage.
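The three-step protocol can be sketched as a short loop. This is an illustration under stated assumptions rather than the authors' code: `generate` is a hypothetical callable standing in for a chat model, `REANSWER_PROMPT` is an invented placeholder for the re-answer instruction, and `flip_flop_model` is a toy stub. Only the feedback prompt text is quoted from the variants listed above.

```python
from typing import Callable, List, Optional

# One of the feedback prompt variants quoted in the summary.
FEEDBACK_PROMPT = "Verify whether your answer is correct, and provide an explanation."
# Hypothetical wording; the summary does not give the re-answer prompt.
REANSWER_PROMPT = "Based on your review, give your final answer."


def self_correct(
    question: str,
    generate: Callable[[List[str]], str],
    rounds: int = 2,
    is_correct: Optional[Callable[[str], bool]] = None,
) -> str:
    """Run up to `rounds` of review and re-answer.

    If `is_correct` (an oracle) is given, stop as soon as the current answer
    is correct; with no oracle this is the intrinsic setting.
    """
    history = [question]
    answer = generate(history)  # step 1: initial answer (the baseline)
    history.append(answer)
    for _ in range(rounds):
        if is_correct is not None and is_correct(answer):
            break  # oracle-guided stop
        history.append(FEEDBACK_PROMPT)
        history.append(generate(history))  # step 2: review/feedback
        history.append(REANSWER_PROMPT)
        answer = generate(history)  # step 3: re-answer using the feedback
        history.append(answer)
    return answer


def flip_flop_model(messages: List[str]) -> str:
    """Toy stand-in for a model: answers "A", then switches to "B" on review."""
    last = messages[-1]
    if last == FEEDBACK_PROMPT:
        return "I may have made a mistake."
    if last == REANSWER_PROMPT:
        return "B"
    return "A"


print(self_correct("Q", flip_flop_model))
print(self_correct("Q", flip_flop_model, is_correct=lambda a: a == "A"))
```

With the toy stub, the intrinsic run abandons its initial answer while the oracle-guided run keeps it, which is the distinction between the two evaluation settings: the oracle protects correct answers from being revised at all.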

Key Findings
  1. With oracle labels (used to stop correction once the answer is correct), self-correction shows notable gains, but these reflect oracle-aided performance rather than intrinsic capability:
  • GPT-3.5: GSM8K 75.9 → 84.3; CommonSenseQA 75.8 → 89.7; HotpotQA 26.0 → 29.0.
  • GPT-4: GSM8K 95.5 → 97.5; CommonSenseQA 82.0 → 85.5; HotpotQA 49.0 → 59.0.
  2. Intrinsic self-correction (no labels) fails to improve accuracy and usually degrades it across models and benchmarks:
  • GPT-3.5: GSM8K 75.9 → 75.1 (round 1) → 74.7 (round 2); CommonSenseQA 75.8 → 38.1 → 41.8; HotpotQA 26.0 → 25.0 → 25.0.
  • GPT-4: GSM8K 95.5 → 91.5 → 89.0; CommonSenseQA 82.0 → 79.5 → 80.0; HotpotQA 49.0 → 49.0 → 43.0.
  • GPT-4-Turbo: GSM8K 91.5 → 88.0 → 90.0; CommonSenseQA 84.0 → 81.5 → 83.0.
  • Llama-2: GSM8K 62.0 → 43.5 → 36.5; CommonSenseQA 64.0 → 37.5 → 36.5.
  • Prompt variants do not reverse the trend: for GPT-4-Turbo and Llama-2, different feedback prompts still yield lower accuracies post-correction.
  3. Models change correct answers to incorrect ones more often than they fix incorrect ones, indicating poor internal judgment of answer correctness. Answer-change distributions after two rounds:
  • GSM8K (GPT-3.5): No Change 74.7%; Correct→Incorrect 8.8%; Incorrect→Correct 7.6%; Incorrect→Incorrect 8.9%.
  • CommonSenseQA (GPT-3.5): No Change 42.8%; Correct→Incorrect 39.8%; Incorrect→Correct 5.8%; Incorrect→Incorrect 11.6%.
  • GSM8K (Llama-2): No Change 40.0%; Correct→Incorrect 33.0%; Incorrect→Correct 5.5%; Incorrect→Incorrect 23.5%.
  4. Multi-agent debate does not outperform self-consistency when matched for inference cost (number of responses). On GSM8K, self-consistency (SC) beats debate at equal response counts: 3 responses 82.5 (SC); 6 responses 85.3 (SC) vs 83.2 (Debate); 9 responses 88.2 (SC) vs 83.0 (Debate).
  5. Prompt design strongly influences perceived gains. In Constrained Generation, strengthening the initial instruction to explicitly require inclusion of ALL concepts dramatically improves baseline concept coverage: Standard Prompting (ours) 81.8 vs Self-Correct 75.1, compared to prompts/results in Madaan et al. (2023): Standard 44.0/53.0 and Self-Correct 67.0/61.1. Improvements attributed to feedback can vanish or reverse when the initial prompt is made sufficiently informative.
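The answer-change breakdown in the findings above can be computed from per-question answers before and after self-correction. The sketch below uses invented toy data (ten questions, not the paper's results) and a hypothetical helper name, `transition_breakdown`. It also checks the bookkeeping identity that ties the breakdown to accuracy: final accuracy equals initial accuracy minus the Correct→Incorrect share plus the Incorrect→Correct share.

```python
from collections import Counter

CATEGORIES = (
    "no_change",
    "correct_to_incorrect",
    "incorrect_to_correct",
    "incorrect_to_incorrect",
)


def transition_breakdown(before, after, gold):
    """Percentage of questions in each answer-change category."""
    if not (len(before) == len(after) == len(gold)) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = Counter()
    for b, a, g in zip(before, after, gold):
        if a == b:
            counts["no_change"] += 1
        elif b == g:
            counts["correct_to_incorrect"] += 1  # a right answer was revised away
        elif a == g:
            counts["incorrect_to_correct"] += 1  # a wrong answer was fixed
        else:
            counts["incorrect_to_incorrect"] += 1  # wrong before and after
    return {k: 100.0 * counts[k] / len(gold) for k in CATEGORIES}


def accuracy(answers, gold):
    return 100.0 * sum(a == g for a, g in zip(answers, gold)) / len(gold)


# Toy data, invented for illustration only.
gold = ["A"] * 10
before = ["A"] * 7 + ["B"] * 3
after = ["A"] * 5 + ["B"] * 2 + ["A", "C", "B"]

shares = transition_breakdown(before, after, gold)
print(shares)
print(accuracy(before, gold), accuracy(after, gold))
```

The identity is a quick sanity check on any reported breakdown: if the Correct→Incorrect share exceeds the Incorrect→Correct share, accuracy after correction must be lower than before.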
Discussion

The study directly addresses whether LLMs can intrinsically self-correct reasoning. Findings show that without access to external signals or oracle labels, LLMs generally fail to identify and rectify their own reasoning errors and often degrade correct answers. This suggests that the observed gains in prior work depend on external feedback rather than innate self-assessment abilities. The multi-agent debate experiments indicate that improvements largely derive from aggregating multiple generations (self-consistency) rather than critique-driven correction. The prompt design case study demonstrates that some reported improvements are confounded by suboptimal initial prompts; simply clarifying constraints in the initial instruction can eliminate the need for iterative feedback and yield higher performance. These results underscore that current LLMs lack reliable internal mechanisms to judge correctness in reasoning. For practical applications, intrinsic self-correction is unlikely to boost reasoning quality without verifiers or external signals. Researchers and practitioners should compare self-correction methods against strong, cost-matched baselines (e.g., self-consistency), design prompts equitably across initial and feedback stages, and leverage external feedback sources when available.
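A cost-matched comparison starts by counting responses. The sketch below assumes each debate agent produces one initial response plus one response per debate round, an assumption that reproduces the 6- and 9-response budgets reported for 3 agents. The helper name `debate_response_count` is illustrative; the accuracies are the GSM8K figures quoted in the findings.

```python
def debate_response_count(agents: int, rounds: int) -> int:
    """Total responses: one initial answer per agent, plus one per agent per round.

    Assumed counting rule, inferred from the 6- and 9-response budgets
    reported for 3 agents.
    """
    if agents < 1 or rounds < 0:
        raise ValueError("agents must be >= 1 and rounds >= 0")
    return agents * (rounds + 1)


# GSM8K accuracies from the findings, keyed by number of responses.
reported = {
    6: {"self_consistency": 85.3, "debate": 83.2},
    9: {"self_consistency": 88.2, "debate": 83.0},
}

for rounds in (1, 2):
    budget = debate_response_count(agents=3, rounds=rounds)
    row = reported[budget]
    gap = round(row["self_consistency"] - row["debate"], 1)
    print(f"{budget} responses: self-consistency leads debate by {gap} points")
```

Sampling the same number of responses for self-consistency as the debate consumes keeps the comparison about the method rather than the inference budget.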

Conclusion

The paper concludes that contemporary LLMs cannot reliably self-correct reasoning without external feedback, and intrinsic self-correction often reduces accuracy. It clarifies evaluation pitfalls in the literature: reliance on oracle labels, unfair cost comparisons, and confounding prompt designs. The authors recommend future work to: (1) leverage valid external feedback (e.g., code execution, tools, trained verifiers) to enable effective correction; (2) evaluate self-correction against baselines with comparable inference costs and include cost analyses; (3) invest equal effort in initial and feedback prompt design to avoid misleading conclusions; and (4) develop alignment or training methods that increase the probability of decoding optimal solutions in single-shot generations, reducing reliance on multiple samples or correction loops.

Limitations

The evaluation centers on reasoning tasks and a set of specific benchmarks and models; self-correction may be effective in other domains (e.g., preference alignment, style, safety). The work does not exhaustively explore all possible feedback prompts or model variants, and HotpotQA uses a small evaluation sample in some prior setups. The authors note that LLMs can be biased by misleading feedback and that some community confusion arises from unclear reporting about label usage or external feedback in self-correction strategies. Results accessed via proprietary APIs may vary over time, though prompts and model versions/timestamps are documented to aid reproducibility.
