

The Illusion of the Illusion of Thinking

C. Opus and A. Lawson

Shojaee et al. (2025) report that Large Reasoning Models suffer 'accuracy collapse' on complex planning puzzles. In this commentary, C. Opus and A. Lawson challenge that narrative, arguing that the apparent collapse reflects experimental design and evaluation artifacts rather than genuine failures of reasoning.

Introduction
Shojaee et al. (2025) claim to have uncovered fundamental limitations in Large Reasoning Models (LRMs) by evaluating their performance on planning puzzles. Their central finding, that model accuracy collapses to zero beyond certain complexity thresholds, has significant implications for AI reasoning research. This commentary challenges that conclusion, arguing that the observed "accuracy collapse" does not reflect inherent limits of LRM reasoning but instead stems from flaws in the original study's experimental design and evaluation methodology, in particular output token limits and an evaluation framework that cannot distinguish genuine reasoning failures from practical constraints imposed by the experimental setup.
Literature Review
The commentary's primary reference is Shojaee et al. (2025), the study under critique. It also cites Dziri et al. (2023) on the limits of transformers at compositional tasks and Efimova (2018) on the mathematical properties of River Crossing puzzles, and mentions a Twitter thread by @scaling01 as supporting evidence.
Methodology
The authors re-examined the experimental design and evaluation metrics used in Shojaee et al. (2025). They analyzed the output token limits of LRMs and showed that models often explicitly acknowledge these constraints in their responses. They highlighted the automated evaluation's failure to distinguish 'cannot solve' from 'chooses not to enumerate exhaustively'. They further demonstrated that some River Crossing instances used in Shojaee et al. (2025) are mathematically impossible under the stated constraints, so models were scored as failures on puzzles that have no solution. To test whether the reported collapse reflects output constraints rather than reasoning, the authors ran preliminary experiments with a different representation of the Tower of Hanoi problem, asking models to generate a Lua function that outputs the solution rather than a full move sequence, which yielded markedly better results. They also analyzed the relationship between problem size and token requirements, showing that the reported 'collapse' coincides with the models' token limits.
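For concreteness, the representational shift can be pictured as follows. The sketch below is a minimal illustration of the kind of Lua function such a prompt asks for, assuming the standard recursive Tower of Hanoi algorithm; it is not the authors' exact prompt or any model's verbatim output.

```lua
-- Recursive Tower of Hanoi: builds the full move list for n disks without
-- anyone having to write out the 2^n - 1 moves token by token.
local function hanoi(n, from, to, via, moves)
  moves = moves or {}
  if n == 1 then
    moves[#moves + 1] = { disk = 1, from = from, to = to }
  else
    hanoi(n - 1, from, via, to, moves)                       -- clear the n-1 smaller disks
    moves[#moves + 1] = { disk = n, from = from, to = to }   -- move the largest disk
    hanoi(n - 1, via, to, from, moves)                       -- stack the n-1 disks back on top
  end
  return moves
end

-- Example: 15 disks on pegs 1..3. The program, not the transcript,
-- carries the exponentially long solution.
local moves = hanoi(15, 1, 3, 2)
print(#moves)  -- 32767, i.e. 2^15 - 1
```

Under this representation the model's output length is essentially constant in the number of disks, so the token budget is no longer the binding constraint.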
Key Findings
The key findings are threefold: (1) LRMs often hit output token limits before reaching the true limits of their reasoning abilities, and models implicitly or explicitly acknowledge these constraints in their output. (2) The automated evaluation framework used in the original study failed to distinguish actual reasoning failures from incomplete solutions caused by token constraints, leading to misclassification of model capabilities and an overstatement of the puzzles' difficulty for these models. (3) The River Crossing benchmark included mathematically unsolvable instances, for which the models were unfairly penalized. The authors' alternative method, generating a Lua function for the Tower of Hanoi problem, showed very high accuracy across multiple models, even on instances well beyond the reported collapse threshold. This indicates that the observed limitations are primarily artifacts of experimental design and evaluation methodology, not of LRM reasoning capability. The authors also point out that using 'compositional depth' (minimum solution length) as a complexity metric conflates mechanical execution with problem-solving difficulty: problem complexity depends not only on solution length but also on factors such as branching factor and search-space size. Tower of Hanoi illustrates one extreme, with exponentially many moves but a trivial decision process, while River Crossing illustrates the other, with far fewer moves but a genuine search and constraint-satisfaction problem.
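To make the token-budget argument concrete, a rough back-of-the-envelope model for the move-enumeration format is given below; the per-move cost k and overhead c are illustrative assumptions, not figures taken from either paper.

$$
\text{moves}(N) = 2^{N} - 1, \qquad \text{tokens}(N) \approx k\,(2^{N} - 1) + c
$$

With k on the order of 10 tokens per printed move, an output budget of roughly 64,000 tokens is exhausted around N = 12 or 13, beyond which a complete move list is physically impossible to emit regardless of whether the model has internally worked out the recursive solution.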
Discussion
The findings challenge the conclusion of Shojaee et al. (2025) that LRMs exhibit fundamental reasoning limitations. The authors argue that the 'accuracy collapse' is largely an artifact of experimental design, focusing on token limitations, evaluation methodology, and puzzle solvability. The success of the alternative representation in solving complex Tower of Hanoi problems strongly suggests that LRMs possess the reasoning abilities to tackle these tasks, but their performance is hindered by the limitations of the evaluation framework. The study highlights the need for more careful consideration of experimental design when evaluating AI reasoning capabilities and the dangers of relying solely on programmatic evaluation without considering the model's awareness and decision-making processes.
Conclusion
This commentary demonstrates that the reported limitations of Large Reasoning Models (LRMs) on planning puzzles, as presented in Shojaee et al. (2025), are largely due to issues in experimental design and evaluation methodology. The authors recommend designing evaluations that differentiate between reasoning capability and output constraints, verifying puzzle solvability, using complexity metrics that better reflect computational difficulty, and considering various solution representations. The critical takeaway is the need to carefully distinguish between true reasoning limitations and the practical constraints of the evaluation process itself.
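One way to act on the 'verify puzzle solvability' recommendation is a brute-force reachability check over the puzzle's state space before any model output is scored. The sketch below does this for a simplified missionaries-and-cannibals style river crossing rather than the exact actor/agent variant used in the original study; the pair counts and boat capacities in the examples are illustrative.

```lua
-- Brute-force solvability check for an n-pair river crossing with boat
-- capacity b, in the simplified missionaries-and-cannibals style.
-- State = (m, c, boat): missionaries and cannibals on the left bank,
-- boat = 0 (left) or 1 (right). Only bank safety is checked: a bank is
-- safe when it has no missionaries or at least as many missionaries as
-- cannibals.
local function solvable(n, b)
  local function safe(m, c)
    return (m == 0 or m >= c) and (n - m == 0 or n - m >= n - c)
  end
  local function key(m, c, boat) return m .. ":" .. c .. ":" .. boat end

  local seen = { [key(n, n, 0)] = true }
  local queue, head = { { n, n, 0 } }, 1
  local goal = key(0, 0, 1)

  while head <= #queue do
    local m, c, boat = queue[head][1], queue[head][2], queue[head][3]
    head = head + 1
    if key(m, c, boat) == goal then return true end
    local dir = (boat == 0) and -1 or 1  -- people leave the boat's bank
    for dm = 0, b do
      for dc = 0, b - dm do
        if dm + dc >= 1 then
          local nm, nc = m + dir * dm, c + dir * dc
          if nm >= 0 and nm <= n and nc >= 0 and nc <= n and safe(nm, nc) then
            local k = key(nm, nc, 1 - boat)
            if not seen[k] then
              seen[k] = true
              queue[#queue + 1] = { nm, nc, 1 - boat }
            end
          end
        end
      end
    end
  end
  return false
end

print(solvable(3, 2))  -- true: the classic 3-pair, capacity-2 puzzle is solvable
print(solvable(6, 3))  -- false: no solution exists under this model
```

Running such a check before evaluation would flag impossible instances instead of counting them as model failures.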
Limitations
The authors acknowledge that their preliminary experiments using the alternative Lua function representation for the Tower of Hanoi problem lacked the statistical power of the original study due to budget constraints. Further, the focus is on critiquing a single study; generalizability to other LRM evaluation methodologies remains an area for future work.