The Illusion of the Illusion of Thinking

Computer Science

C. Opus and A. Lawson

Introduction

The paper critiques Shojaee et al. (2025), who claim that Large Reasoning Models (LRMs) exhibit a collapse in accuracy beyond certain puzzle complexities. The authors argue that the observed failures are artifacts of experimental design, not inherent reasoning limitations. They set out to show that output token limits, mis-specified automated evaluation criteria, and the inclusion of unsolvable benchmark instances explain the reported collapses, challenging the interpretation that LRMs fundamentally fail at higher complexities.

Literature Review

The authors situate their critique within prior work on LLM limitations and puzzle theory. They reference Dziri et al. (2023) as an example of the argument that small per-token error rates compound into near-certain failure over long outputs (the p^T argument). They cite Efimova (2018) for formal results on River Crossing problems, which show that no solution exists for N > 5 with boat capacity b = 3. They also reference Shojaee et al. (2025) as the primary target of the critique, and a 2025 Twitter replication by @scaling01 documenting model outputs that explicitly acknowledge output-length constraints. Together, these works frame the debate over whether the observed failures reflect genuine reasoning limits or merely evaluation and practical constraints.
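
The p^T argument is easy to make concrete. The snippet below is a minimal sketch in Lua (the language used in the paper's alternative-format demonstration) that simply evaluates p^T for the per-token accuracies and output length discussed in the Key Findings; the function name and loop are illustrative, not code from the paper.

```lua
-- Compound probability that T consecutive tokens are all correct,
-- given an independent per-token accuracy p (the p^T argument).
local function all_correct(p, T)
  return p ^ T
end

-- Values discussed in the text: T = 10,000 output tokens.
for _, p in ipairs({0.9999, 0.999}) do
  print(string.format("p = %.4f, T = 10000 -> P(all correct) = %.3g",
                      p, all_correct(p, 10000)))
end
-- p = 0.9999 gives roughly 0.368 (< 37%); p = 0.999 gives roughly 4.5e-05 (< 0.5%).
```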

Methodology

The authors analyze Shojaee et al.'s experimental setup and conduct preliminary tests to isolate reasoning ability from output-format constraints. The steps are:

(1) Inspecting model outputs for explicit acknowledgments of output token limits during Tower of Hanoi tasks (e.g., models truncating with statements about length).
(2) Identifying that the evaluation scripts conflate an inability to enumerate exhaustive move lists with reasoning failure.
(3) Verifying River Crossing instance solvability using known results: for N ≥ 6 with boat capacity b = 3, instances are unsolvable.
(4) Quantifying the token growth required by the evaluation's exhaustive move enumeration for Tower of Hanoi, modeling token requirements as T(N) ≈ 5(2^N − 1)^2 + C and deriving the maximum solvable N under a given token budget via N_max ≈ floor(log2(sqrt(L_max/5))).
(5) Running preliminary alternative-format experiments that request algorithmic representations instead of full enumerations: prompting models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5) to output a Lua function that prints the Tower of Hanoi solution for N = 15, thereby testing algorithmic understanding while staying within token limits.

The authors report high accuracy and completion within ~5,000 tokens in these trials.
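To make the budget arithmetic in step (4) concrete, the sketch below evaluates the reported token-growth model in Lua; the constant C is ignored and the helper names are illustrative assumptions, not code from the paper.

```lua
-- Token-growth model reported in the critique: an exhaustive Tower of Hanoi
-- move list for N disks costs roughly T(N) = 5 * (2^N - 1)^2 + C tokens.
-- C is a small constant overhead; we ignore it here.
local function tokens_needed(N)
  local moves = 2 ^ N - 1
  return 5 * moves ^ 2
end

-- First N whose exhaustive enumeration no longer fits in a given budget.
local function collapse_point(budget)
  local N = 1
  while tokens_needed(N) <= budget do N = N + 1 end
  return N
end

print(collapse_point(64000))   -- 7  (~64k budget: Claude-3.7-Sonnet, DeepSeek-R1)
print(collapse_point(100000))  -- 8  (~100k budget: o3-mini)
```

Under this model, the budgets line up with the collapse points reported by Shojaee et al., which is the crux of the authors' argument.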

Key Findings

• Models recognize and state output-length constraints during Tower of Hanoi tasks; truncation reflects practical limits, not reasoning collapse.
• Automated evaluation misclassifies capability by failing to distinguish an inability to enumerate long outputs from an inability to solve.
• River Crossing benchmarks with N ≥ 6 and boat capacity b = 3 are mathematically unsolvable; scoring these as failures penalizes correct recognition of impossibility.
• Token growth for exhaustive Tower of Hanoi outputs is approximately T(N) ≈ 5(2^N − 1)^2 + C. With token budgets of ~64,000 (Claude-3.7-Sonnet, DeepSeek-R1) and ~100,000 (o3-mini), the implied maximum sizes are N_max ≈ 7–8 and ≈ 8, respectively, matching the reported collapse points.
• Rigid character-by-character grading implies P(all correct) = p^T; for T = 10,000, p = 0.9999 yields < 37% success and p = 0.999 yields < 0.5% success, illustrating how format choices can produce apparent failures despite strong underlying competence.
• Alternative representations restore performance: for Tower of Hanoi with 15 disks, models (Claude-3.7-Sonnet, Claude Opus 4, OpenAI o3, Google Gemini 2.5) produced correct Lua functions implementing the recursive algorithm with very high accuracy in under 5,000 tokens (a sketch of such a function follows this list).
• Solution length is a poor proxy for problem difficulty: Tower of Hanoi has trivial per-move decisions despite exponential length, while River Crossing and Blocks World require genuine search with higher branching factors and computational complexity.
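
The recursive structure the models were asked to express is compact. Below is a minimal Lua sketch of such a solution-printing function; it illustrates the requested output format and is not a reproduction of any model's actual answer.

```lua
-- Classic recursive Tower of Hanoi: print the full move sequence for n disks.
-- The function itself is a few lines, even though it prints 2^n - 1 moves.
local function hanoi(n, from, to, via)
  if n == 0 then return end
  hanoi(n - 1, from, via, to)   -- move the top n-1 disks out of the way
  print(string.format("move disk %d from %s to %s", n, from, to))
  hanoi(n - 1, via, to, from)   -- move them onto the relocated disk
end

hanoi(15, "A", "C", "B")  -- N = 15: 2^15 - 1 = 32,767 moves
```

This is the point of the alternative-format result: the program fits comfortably within the token budget even though executing it yields an exponentially long move list.
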
Discussion

The findings indicate that the reported accuracy collapse primarily reflects evaluation artifacts—output token limits, unsolvable instances, and strict grading—rather than a fundamental inability to reason. By separating algorithmic understanding from exhaustive enumeration, the same models demonstrate correct reasoning on instances previously labeled as complete failures. This reframes the core question from whether LRMs can reason to whether evaluations appropriately measure reasoning independent of output constraints. Furthermore, using solution length (compositional depth) as a complexity metric conflates mechanical execution with computational difficulty; tasks with short solutions may require complex search, while long solutions may involve trivial decision-making. Consequently, rigorous evaluation must account for solvability, practical constraints, and suitable complexity metrics to accurately characterize LRM reasoning capabilities.

Conclusion

The paper concludes that Shojaee et al.'s observed collapses are driven by context token limits, evaluation misclassification, and inclusion of unsolvable River Crossing instances. These issues yield misleading claims about fundamental reasoning limits. The authors recommend: (1) designing evaluations that distinguish reasoning ability from output-length constraints; (2) verifying puzzle solvability before scoring; (3) employing complexity metrics reflecting computational difficulty rather than solution length; and (4) considering multiple output representations to decouple algorithmic understanding from exhaustive execution. They argue that the key challenge is building evaluations that separate reasoning from typing, not demonstrating whether LRMs can reason.

Limitations

The experimental support is preliminary: due to budget constraints, the authors ran too few trials to achieve high statistical power, and full experimental validation remains future work. The analysis relies on re-examining evaluation protocols and on limited demonstrations (e.g., the N = 15 Tower of Hanoi Lua function) rather than large-scale benchmarking under revised conditions.
