DISSOCIATING LANGUAGE AND THOUGHT IN LARGE LANGUAGE MODELS

Linguistics and Languages

K. Mahowald, I. A. Blank, et al.

This groundbreaking research by Kyle Mahowald, Idan A. Blank, Joshua B. Tenenbaum, Anna A. Ivanova, Nancy Kanwisher, and Evelina Fedorenko dissociates two capacities in Large Language Models: formal linguistic competence (mastery of linguistic rules and patterns) and functional linguistic competence (using language to reason and act in the world). While these models excel at the former, their command of the latter remains inconsistent. The authors suggest that future LLMs should mirror the human brain's modular separation of language processing from other aspects of cognition.

Introduction
The paper addresses how to evaluate the linguistic and cognitive abilities of large language models by disentangling language from thought. Motivated by the historical conflation of linguistic fluency with intelligence (e.g., the Turing Test), the authors identify two fallacies: that being good at language implies being good at thinking, and conversely, that being bad at thought implies being a bad language model. To mitigate these, they propose a distinction between formal linguistic competence (mastery of linguistic rules and patterns) and functional linguistic competence (using language to act and reason in the world). Grounding this distinction in human neuroscience—which shows dissociable neural mechanisms for language versus non-linguistic cognition—the paper aims to evaluate contemporary LLMs along these two axes. The central hypothesis is that next-word prediction can yield near-human formal competence in English, but functional competence remains uneven, often requiring specialized fine-tuning (e.g., RLHF) or coupling with external modules. The study’s purpose is to provide a cognitive-science framework for assessing LLMs, clarify claims about their abilities, and guide future model development and evaluation.
Literature Review
The authors situate their work within debates about LLM understanding, noting that state-of-the-art models generate human-like text, perform well on comprehension benchmarks, and achieve superhuman next-word prediction, prompting claims of broad intelligence. Conversely, critiques highlight inconsistent world knowledge, limited commonsense reasoning, and lack of deep understanding. From cognitive neuroscience, they review evidence for a dedicated human language network that supports formal linguistic operations but not non-linguistic cognition (e.g., logic, math, ToM), supported by fMRI and studies of aphasia. They also review the evolution from n-grams and RNNs to transformers, the role of pretraining on massive corpora, and prior claims that statistical models could not learn hierarchical or abstract syntax—claims now challenged by LLM performance on benchmarks (e.g., BLiMP, SyntaxGym) and emergent structure studies. They discuss prior work on linguistic abstractions, construction learning, and correspondences between LLM representations and neural responses in the language network. For functional competence, they review literature on LLMs’ limits in formal reasoning, world knowledge reliability, situation modeling, and social reasoning/ToM, as well as augmentation methods (retrieval, tools, symbolic trackers) and the impact of RLHF. They highlight evaluation pitfalls (fine-tuning contamination, closed models, prompt sensitivity, heuristic shortcuts) and the need for better benchmarks that cleanly separate competencies.
Methodology
This is a conceptual and integrative review that develops a formal/functional competence framework grounded in human cognitive neuroscience and uses it to evaluate LLMs. The approach includes:
(1) Neuroscience grounding: summarizing fMRI and neuropsychological evidence for a selective human language network distinct from multiple-demand and ToM networks.
(2) Formal competence evaluation: compiling results from linguistic benchmarks (e.g., BLiMP, SyntaxGym), emergent structure analyses (hierarchical dependencies, agreement, filler-gap), probing studies of abstractions (POS, roles), construction learning, and brain-model alignment studies that map LLM activations to language network responses (a minimal-pair scoring sketch follows this list).
(3) Functional competence evaluation across four domains: formal reasoning (logic, math, planning), world knowledge (factual/commonsense), situation modeling (discourse and entity-state tracking), and social reasoning (pragmatics, theory of mind). For each, the authors summarize human neural substrates and synthesize LLM performance evidence, noting when success depends on fine-tuning or augmentation (e.g., CoT prompting, tool use, retrieval, symbolic belief tracking).
(4) Design implications: proposing modular architectures (architectural or emergent modularity) that separate formal and functional mechanisms, in line with the brain's division of labor.
(5) Evaluation guidance: outlining pitfalls (data contamination, closed models, prompt sensitivity) and advocating for benchmarks that disentangle competencies and avoid heuristic exploits.
No new empirical experiments are conducted; rather, the paper systematically aggregates and analyzes existing findings to support its framework and claims.
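To make point (2) concrete, here is a minimal sketch of how minimal-pair benchmarks in the BLiMP/SyntaxGym style score a model: the model passes an item if it assigns higher probability to the grammatical sentence than to a minimally different ungrammatical counterpart. The sketch assumes the Hugging Face transformers library, uses GPT-2 and an invented agreement pair purely for illustration, and is not the authors' evaluation code.

```python
# Minimal-pair scoring sketch (illustrative, not the paper's code): a model "passes"
# an item if it assigns higher log-probability to the grammatical sentence than to
# its ungrammatical twin.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")    # any causal LM would do
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of log-probabilities of each token given its left context."""
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..n-1
    targets = input_ids[:, 1:]                               # the tokens actually observed
    return log_probs.gather(2, targets.unsqueeze(-1)).sum().item()

# Invented long-distance agreement pair: the head noun "keys" requires plural "are".
grammatical = "The keys that the man near the cabinets holds are rusty."
ungrammatical = "The keys that the man near the cabinets holds is rusty."

print(sentence_logprob(grammatical) > sentence_logprob(ungrammatical))  # ideally True
```

Items like this probe whether the model tracks the hierarchically relevant head noun ("keys") rather than the linearly closer attractor nouns ("man", "cabinets"), which is the kind of evidence the paper cites for emergent hierarchical structure.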
Key Findings
- LLMs exhibit substantial formal linguistic competence in English: they generate grammatical, coherent text; learn hierarchical structure (e.g., long-distance agreement, filler-gap dependencies); show evidence of linguistic abstractions (POS, roles) and constructions; and their internal representations partially predict brain responses in the human language network.
- Performance on linguistic benchmarks approaches human levels in controlled settings; e.g., a BabyLM-submitted model achieved 86% on BLiMP (human baseline ~89%) with developmentally plausible training data sizes.
- Formal competence improves markedly with scale and pretraining, often without task-specific fine-tuning.
- Functional linguistic competence is inconsistent in non-augmented LLMs:
  • Formal reasoning: models struggle with out-of-distribution generalization, multi-step logic/math, and larger-digit arithmetic; chain-of-thought helps but is not robust; external tools (math engines, planners, probabilistic programs) materially improve performance (see the sketch after this list).
  • World knowledge: models hallucinate, are prompt-sensitive and inconsistent, and commonsense knowledge is underrepresented in text; maintaining and updating facts in parameters is difficult; retrieval-augmented and knowledge-base approaches are more reliable.
  • Situation modeling: increasing context windows helps but is inefficient; hierarchical summarization methods offer promise; models make non-human discourse errors (e.g., referencing nonexistent entities) despite some entity-tracking abilities.
  • Social reasoning/pragmatics: targeted RLHF enhances some pragmatic inferences (e.g., metaphor), but performance is uneven (e.g., sarcasm, jokes); ToM claims often fail under simple controls; augmenting with belief trackers improves robustness.
- The dissociation between formal and functional competence mirrors human brain organization: language-selective regions support formal processing, while multiple-demand and ToM networks support reasoning and social cognition.
- Implication: next-word prediction suffices for strong formal competence but not for robust functional competence; modular architectures or emergent modularity may be necessary.
- Evaluation: functional benchmarks are prone to heuristic exploitation; better benchmarks should disentangle formal/functional skills and ensure generalization beyond surface cues.
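The tool-augmentation point above can be illustrated with a short sketch: instead of asking the model to compute large-digit arithmetic itself (where errors are common), the model only translates the question into a bare expression, and a deterministic evaluator does the arithmetic. `call_llm` is a hypothetical placeholder for any text-generation API, not a component described in the paper.

```python
# Tool-augmentation sketch (illustrative): the LLM handles the linguistic step, a
# deterministic calculator handles the arithmetic it is unreliable at.
import ast
import operator

SAFE_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
            ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using Python's eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in SAFE_OPS:
            return SAFE_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval"))

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for an LLM call; assume it returns e.g. "734289 + 918274".
    raise NotImplementedError

def answer_with_calculator(question: str) -> float:
    expression = call_llm(
        "Rewrite the following question as a bare arithmetic expression, "
        f"with no words and no answer:\n{question}"
    )
    return safe_eval(expression)   # the tool, not the LLM, does the arithmetic
```

This division of labor reflects the finding summarized above: next-word prediction supports the linguistic reformulation well, while a specialized component supplies the reliable computation.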
Discussion
The findings support the central thesis that formal and functional linguistic competences are distinct and map onto different neural systems in humans. Contemporary LLMs, optimized for next-token prediction, reach near-human formal competence in English—capturing hierarchy, abstraction, and constructions—and their representations align with the human language network. However, non-linguistic cognitive capacities essential for language use in the world (reasoning, robust world knowledge, discourse situation tracking, social inference) remain uneven in non-augmented LLMs. This dissociation explains why fluent language output does not necessarily indicate robust reasoning or understanding, addressing common fallacies in public and academic discourse. The significance is twofold: for science, LLMs are valuable models of formal language processing and a testbed for long-standing questions (e.g., poverty of the stimulus, inductive biases); for engineering, progress toward human-like language use likely requires integrating LLMs with distinct reasoning and knowledge systems via architectural or emergent modularity, tool use, and retrieval. The paper also underscores the need for rigorous, contamination-resistant benchmarks that isolate competencies and evaluate generalization and robustness across prompts, formats, and domains.
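As a concrete illustration of the retrieval-based coupling mentioned above, the sketch below looks facts up in an external store and injects them into the prompt, so the language model supplies the linguistic formulation while a separate component supplies the knowledge. `KNOWLEDGE_BASE`, `retrieve`, and `call_llm` are illustrative placeholders, not systems described in the paper.

```python
# Retrieval-augmentation sketch (illustrative): knowledge lives outside the model's
# parameters and is passed in through the prompt.
KNOWLEDGE_BASE = {
    "capital of australia": "Canberra is the capital city of Australia.",
    "boiling point of water": "Water boils at 100 degrees Celsius at sea level.",
}

def retrieve(query: str) -> str:
    """Toy keyword retrieval; real systems use sparse or dense vector search."""
    query = query.lower()
    hits = [fact for key, fact in KNOWLEDGE_BASE.items()
            if any(word in query for word in key.split())]
    return "\n".join(hits) if hits else "(no relevant facts found)"

def call_llm(prompt: str) -> str:
    # Hypothetical placeholder for any text-generation API.
    raise NotImplementedError

def answer(question: str) -> str:
    context = retrieve(question)
    prompt = ("Answer using only the facts below.\n"
              f"Facts:\n{context}\n\nQuestion: {question}\nAnswer:")
    return call_llm(prompt)
```

The design choice matches the paper's modularity argument: keeping world knowledge in an external, updatable store sidesteps the difficulty of maintaining and revising facts stored in model weights.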
Conclusion
The paper contributes a neuroscience-grounded framework separating formal from functional linguistic competence and uses it to reconcile seemingly conflicting assessments of LLMs. It concludes that:
(1) Next-word prediction yields strong formal competence (hierarchy, abstraction, construction sensitivity, neural alignment), making LLMs compelling scientific models of language processing.
(2) Functional competence remains patchy without augmentation, particularly in formal reasoning, reliable and updatable world knowledge, situation modeling, and social reasoning.
(3) Future systems that use language in human-like ways will likely be modular, combining language processing with specialized components for reasoning, memory, planning, and social cognition, either through explicit architectural modularity or emergent modularity induced by training.
(4) Better benchmarks are needed to separately assess formal and functional skills.
Potential future directions include training on more human-like and smaller data regimes with appropriate inductive biases, multilingual evaluation (especially low-resource and typologically diverse languages), multimodal grounding, improved knowledge-updating mechanisms, and rigorous evaluations of modular designs.
Limitations
- Scope: This is an integrative review rather than new empirical experimentation; conclusions rely on published results that may vary in quality, reproducibility, and contamination controls.
- Language focus: Most evidence concerns English and high-resource languages; generalization to typologically diverse or signed languages is uncertain.
- Model transparency: Closed-source LLMs hinder precise attribution of gains to scale vs. fine-tuning and complicate contamination checks.
- Benchmark artifacts: Many functional benchmarks can be solved via heuristics; reported successes may overestimate generalizable ability without adversarial or counterfactual controls.
- Data mismatch: LLM training data far exceed child input and differ qualitatively from interactive human learning; implications for human-like acquisition remain partially unresolved.
- World knowledge updating: Discussions highlight challenges but do not provide a definitive, general solution; external knowledge-base coverage remains limited.
- Brain alignment: While representations correlate with language network activity, correspondences are partial and do not imply one-to-one mechanistic equivalence.