
Linguistics and Languages

Evidence of a predictive coding hierarchy in the human brain listening to speech

C. Caucheteux, A. Gramfort, J.-R. King

This study by Charlotte Caucheteux, Alexandre Gramfort and Jean-Rémi King examines how the human brain employs a hierarchical predictive coding system for language processing. The findings show that augmenting language models with multi-timescale predictions improves their alignment with brain activity, revealing a hierarchy of predictions across the cortex.

Introduction
The study investigates whether the human brain implements hierarchical predictive coding during language comprehension, making predictions across multiple timescales and representational levels, in contrast to most deep language models, which are trained to predict the next word from nearby context. Prior work has shown that activations from deep language models map linearly onto brain responses and that predictive ability contributes to this mapping. However, current models still struggle with long-range coherence and deeper syntactic and semantic structure, and can generate degenerate text when optimized solely for next-word prediction. Predictive coding theory posits that the cortex forms predictions across a hierarchical organization and multiple timescales. The authors therefore test whether augmenting language-model representations with long-range, multi-level predictions improves alignment with brain activity, and whether such predictions are organized hierarchically across cortical regions.
Literature Review
The paper builds on evidence that deep language models’ activations map onto human brain responses to natural language and that prediction is central to this mapping. Prior neuroimaging and electrophysiological studies correlated brain activity with surprisal (word/phoneme) derived from models trained on next-word or next-phoneme prediction, indicating predictive processes but reducing predictions to a single scalar and focusing on immediate next-token estimates. Anatomical and functional work supports a cortical hierarchy encoding increasingly abstract linguistic information across temporal receptive windows. The authors highlight gaps in current models’ handling of long-range dependencies, syntactic constructs, and semantic understanding, motivating a test of hierarchical, multi-timescale predictive coding in the brain using richer, high-dimensional predictive representations.
Methodology
Participants and data: fMRI data from the Narratives dataset were used, comprising 304 individuals (after exclusions) listening to 27 English spoken stories (7–56 min each; ~4.6 h of unique stimulus in total; ~26 min per participant; TR = 1.5 s). Preprocessing used fMRIPrep; cortical voxels were projected to the surface and morphed to fsaverage; no spatial smoothing or temporal filtering was applied. Alignment to the transcripts followed the dataset-provided timing.

Language models and features: The primary model was GPT-2 (a 12-layer causal Transformer with 768-dimensional layers), pretrained and obtained via Hugging Face. Activations were extracted per word, with tokenization and up to 1,024 tokens of context. Unless otherwise specified, analyses used layer-8 activations; additional analyses spanned other layers and models.

Brain score (encoding model): For each individual and voxel, a ridge regression predicted fMRI signals from model activations. To match BOLD dynamics, a finite impulse response (FIR) model with six delays (0–9 s at TR = 1.5 s) was used, and the activations of words within the same TR were summed. A preprocessing pipeline standardized the features and applied PCA (20 components, for computational reasons) before ridge regression (hyperparameter chosen via nested leave-one-out CV among 10 log-spaced values between 10^1 and 10^8). A fivefold outer CV on contiguous chunks was used; the Pearson correlation between predicted and actual fMRI on held-out data defined the voxel-wise brain score. The noise ceiling was estimated by predicting an individual's responses from the average fMRI of other listeners to the same stories.

Forecast windows and scores: To test long-range predictions, for each current word the authors concatenated the present activations X with a forecast window X^(d), built by concatenating the activations of seven successive future words, the last at distance d. Distances spanned negative (past-only) to positive (future) windows; analyses focused on future distances up to +30 words. PCA was trained separately on X and X^(d) before concatenation. The forecast score F(d) was the gain in brain score from adding the forecast window: F(d) = R(X ⊕ X^(d)) − R(X). The forecast distance d′ was defined per voxel and individual as argmax_d F(d), and the forecast depth k′ as the GPT-2 layer maximizing the forecast score at a fixed distance d = 8.

Syntactic/semantic decomposition: Following Caucheteux et al., for each word and context the authors generated n = 10 alternative futures that preserve the true future's syntax (part-of-speech tags and dependency structure) while randomizing its semantics, extracted GPT-2 activations for each, and averaged them to obtain the syntactic component X_syn. The semantic component was defined as the residual X_sem = X − X_syn. Syntactic and semantic forecast windows were then built by concatenating the respective components across seven future words, and the corresponding forecast scores F_syn and F_sem were computed analogously; a sketch of this pipeline follows.
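To make the encoding pipeline concrete, here is a minimal Python (scikit-learn) sketch of the brain score and forecast score computations. It assumes word-level GPT-2 activations X already summed within TRs and aligned to the fMRI matrix Y (one row per TR), and it simplifies two details from the text: PCA is applied to the concatenated features rather than separately to X and X^(d), and the shift helpers wrap around at the edges. The helper names (make_fir, forecast_window, brain_score, forecast_score, semantic_residual) are ours, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_fir(X, n_delays=6):
    """Stack lagged copies of X, one per TR delay (0-9 s at TR = 1.5 s)."""
    return np.concatenate([np.roll(X, d, axis=0) for d in range(n_delays)], axis=1)

def forecast_window(acts, d, width=7):
    """X^(d): activations of `width` successive future rows, the last at distance d."""
    offsets = range(d - width + 1, d + 1)
    return np.concatenate([np.roll(acts, -o, axis=0) for o in offsets], axis=1)

def brain_score(X, Y, n_folds=5):
    """Voxel-wise Pearson r between ridge predictions and held-out fMRI."""
    model = make_pipeline(
        StandardScaler(),
        PCA(n_components=20),                   # 20 components, as in the text
        RidgeCV(alphas=np.logspace(1, 8, 10)),  # nested LOO CV over 10 alphas
    )
    scores = []
    for train, test in KFold(n_splits=n_folds, shuffle=False).split(X):  # contiguous folds
        pred = model.fit(X[train], Y[train]).predict(X[test])
        scores.append([np.corrcoef(pred[:, v], Y[test, v])[0, 1]
                       for v in range(Y.shape[1])])
    return np.mean(scores, axis=0)              # one score per voxel

def forecast_score(X, Y, d):
    """F(d) = R(X ⊕ X^(d)) − R(X): gain from adding the forecast window."""
    base = brain_score(make_fir(X), Y)
    augmented = brain_score(make_fir(np.hstack([X, forecast_window(X, d)])), Y)
    return augmented - base

def semantic_residual(X, X_alternatives):
    """Syntactic/semantic split: X_syn is the mean activation over n
    syntax-matched alternative futures (shape (n, rows, D)); X_sem = X − X_syn."""
    X_syn = np.asarray(X_alternatives).mean(axis=0)
    return X_syn, X - X_syn
```

The per-voxel forecast distance d′ then follows by evaluating forecast_score over a grid of distances and taking the argmax, as defined above.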
Fine-tuning GPT-2 with a long-range, high-level objective: GPT-2 was fine-tuned on English Wikipedia with a mixed loss, L = α·L_high-level + (1 − α)·L_language-modelling, with balancing to keep the two contributions fixed over training. The language-modelling loss was the next-word cross-entropy. The high-level objective used contrastive predictive coding (CPC) to predict the fixed pretrained GPT-2 activations at layer k = 8 for the word at distance d = 8 from the current word, using cosine similarity, a temperature τ = 0.1 and 2,000 negatives drawn from a queue (see the sketch after the statistics paragraph below). Fine-tuning used Hugging Face Trainer defaults (Adam, learning rate 5e−5), a context size of 256 and a batch size of 4 per GPU on 2 GPUs, training the top layers (8–12) while freezing the lower layers. Models with α ∈ {0, 0.5, 1} (and intermediate values) were trained; ~15 checkpoints per run were evaluated by computing brain scores from concatenated layers [0, 4, 8, 12] on the Narratives dataset and averaging across steps.

Statistics and regions: Whole-brain voxel-wise analyses were conducted per individual; metrics were averaged across individuals and/or voxels as appropriate. Significance was assessed with two-sided Wilcoxon tests across individuals, with FDR correction across voxels; P < 0.01 unless otherwise noted. Regions of interest used a subdivision of the Destrieux atlas (142 regions per hemisphere); reported ROIs include Heschl's gyrus/sulcus, the superior temporal gyrus and sulcus (anterior, middle and posterior STS), the IFG/IFS (pars opercularis/triangularis), and the angular and supramarginal gyri.
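The contrastive objective can be sketched as a standard InfoNCE loss. The PyTorch fragment below is an illustrative reconstruction under the hyperparameters given above (cosine similarity, τ = 0.1, ~2,000 queued negatives), not the authors' code; the names cpc_loss and mixed_loss are our own.

```python
import torch
import torch.nn.functional as F

def cpc_loss(pred, target_future, neg_queue, tau=0.1):
    """InfoNCE over cosine similarities.
    pred:          (B, D) fine-tuned model outputs for the current words
    target_future: (B, D) frozen pretrained layer-8 activations at distance d = 8
    neg_queue:     (K, D) queue of ~2,000 negative activations
    """
    pred = F.normalize(pred, dim=-1)
    pos = F.normalize(target_future, dim=-1)
    neg = F.normalize(neg_queue, dim=-1)
    pos_logit = (pred * pos).sum(dim=-1, keepdim=True) / tau  # (B, 1)
    neg_logits = pred @ neg.T / tau                           # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1)        # positive at index 0
    labels = torch.zeros(len(pred), dtype=torch.long, device=pred.device)
    return F.cross_entropy(logits, labels)

def mixed_loss(lm_loss, high_level_loss, alpha):
    """L = α·L_high-level + (1 − α)·L_language-modelling."""
    return alpha * high_level_loss + (1 - alpha) * lm_loss
```

During training, the contrastive term would be combined with the standard next-word cross-entropy via mixed_loss(lm_loss, cpc_loss(...), alpha), with α controlling the weight of the high-level objective.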
Key Findings
- Deep language models map onto brain activity: GPT-2 activations (layer 8) accurately predict fMRI responses across a bilateral language network. Peak brain scores in superior temporal regions (e.g., the superior temporal sulcus) reached R = 0.23, about 60% of the noise ceiling.
- Long-range forecasts improve brain mapping: Concatenating future-word representations (forecast windows of width 7) to the current activations increased brain scores, with a maximal effect at distance d = 8 words (~3.15 s at 2.54 words/s). The average improvement in brain scores was 23% ± 9% across individuals. Effects were broadly bilateral, with some asymmetries (e.g., larger in the left pars opercularis and supramarginal gyri; P < 0.001).
- Hierarchical temporal scope: The forecast distance d′ varied along the cortical hierarchy, with prefrontal and frontoparietal regions forecasting further into the future than temporal regions. For example, d′ in the IFG exceeded that in the aSTS by Δd′ = 0.9 ± 0.2 (P < 0.001).
- Hierarchical representational depth: The optimal forecast depth k′ also followed the cortical hierarchy: deeper forecasts (k′ > 6) best modelled associative cortices (middle temporal, parietal and frontal areas), while shallower forecasts (k′ < 6) best modelled low-level auditory/language areas (Heschl's gyri/sulci, aSTS). Differences between regions were highly significant (e.g., angular versus Heschl's gyri: k′ difference ~2.5 ± 0.3, P < 0.001).
- Syntactic versus semantic predictions: Semantic forecast effects were long-range, peaking at d = 8 and involving a distributed frontoparietal network. Syntactic forecasts were shorter-range, peaking at d = 5 and localized to superior temporal and left frontal areas; long-range syntactic forecasts were not detectable, with distant windows failing to improve, and sometimes harming, scores by adding dimensionality without signal.
- Model adaptation enhances frontoparietal mapping: Fine-tuning GPT-2 with a high-level, long-range objective (predicting layer-8 activations at d = 8) yielded additional gains in frontoparietal regions (e.g., >2% average gain in the IFG and the angular/supramarginal gyri; all P < 0.001), with negligible benefit in auditory/lower-level regions. Brain-score gains increased with the weight α of the high-level objective.
Discussion
The findings support predictive coding theory in language processing, demonstrating that the brain engages in hierarchical predictions across multiple timescales and representational levels. Augmenting model activations with long-range forecasts improved alignment with fMRI, particularly in frontoparietal cortices known for high-level semantics, planning, and executive functions. These regions exhibited longer forecast distances and benefited from models trained to predict high-level, distant representations, suggesting they actively anticipate upcoming semantic content rather than passively integrating past input. The hierarchical organization of optimal forecast depth aligns with known cortical processing hierarchies: low-level auditory/temporal regions are better explained by shallow, short-range predictions, while associative cortices are better explained by deep, contextualized forecasts. Decomposing predictions into syntactic and semantic components reveals long-range forecasting is primarily semantic, whereas syntactic forecasting is shorter-range and localized. These results extend prior work that related brain activity to scalar surprisal by leveraging high-dimensional predictive representations and highlight a mismatch with standard language models trained solely on adjacent next-word prediction. Predicting latent, contextual representations over longer horizons may address indeterminacy in future observations and better capture human-like language processing.
Conclusion
The study demonstrates that human language processing involves hierarchical predictive coding: cortical regions predict at different temporal ranges and representational depths, with frontoparietal areas forecasting long-range, high-level, semantic representations and temporal regions forecasting short-range, lower-level, syntactic information. Enhancing deep language models with long-range, multi-level forecasts improves brain–model alignment, and fine-tuning models to predict distant, high-level representations further increases similarity in frontoparietal regions. Future work should develop temporally precise methods to probe sublexical predictions, characterize the exact predictive representations across cortical areas, and scale and evaluate predictive coding-inspired architectures on NLP benchmarks, aiming to train models that predict across multiple timescales and representational levels.
Limitations
- The temporal resolution of fMRI (TR ≈ 1.5 s) limits the investigation of rapid, sublexical predictions and fine temporal dynamics.
- The precise nature of the representations and predictions in each cortical region remains to be characterized; the interpretability of neural and model representations is challenging.
- The implemented predictive coding architecture is simplified; broader generalization, scaling and evaluation on diverse NLP tasks are needed to establish practical utility and robustness.