
Shared functional specialization in transformer-based language models and the human brain
S. Kumar, T. R. Sumers, et al.
This study examines how transformer-based language models such as BERT relate to human brain activity during natural language comprehension. Sreejan Kumar and colleagues show that the contextual computations performed by the model's attention heads predict activity in specific cortical language regions, suggesting shared computational principles that bridge machine learning and neuroscience.
Introduction
The study investigates how the brain constructs meaning from continuous language and whether the internal circuit computations of Transformer-based language models provide a useful account of this process. Prior neuroimaging work often used controlled manipulations to isolate syntactic or semantic operations, limiting generalizability to naturalistic contexts. Meanwhile, NLP has shifted to Transformer models that compute context-sensitive representations via attention heads. Most neuroscience studies have focused on contextual embeddings from these models; this work instead targets the transformations produced by attention heads—the updates that integrate contextual information from other words. The authors hypothesize that these transformations: (a) predict brain activity during naturalistic story listening on par with embeddings and better than classical linguistic features; (b) map more layer-specifically to cortical language regions than embeddings; and (c) exhibit shared functional specialization with cortical areas, such that attention heads encoding specific linguistic dependencies also better predict activity in specific regions.
Literature Review
The paper situates itself amid two lines of prior research: (1) classical neuroimaging studies that dissociate syntax from semantics using controlled stimuli, which struggle to generalize to naturalistic comprehension and to synthesize into a holistic model; and (2) recent model-based approaches using Transformers' contextual embeddings to predict brain activity, showing strong performance but focusing on representations rather than the circuit computations that generate them. In NLP, work on BERTology has uncovered emergent functional specialization at the level of attention heads, including approximations of syntactic relations, layer-wise representational progressions, and interpretable attention patterns. These findings motivate analyzing transformations (the headwise outputs of self-attention) as candidates for linking brain activity to contextual computations that span words without relying on predefined syntactic labels.
Methodology
Participants: fMRI data from the open "Narratives" collection were used. Two datasets were analyzed: (i) Slumlord and Reach for the Stars (18 subjects, ages 18–27), and (ii) I Knew You Were Black (45 subjects, ages 18–53). All data were acquired with a TR of 1.5 s. Preprocessing used fMRIPrep (distortion correction, slice-timing correction, motion correction, spatial normalization), followed by confound regression (motion parameters, aCompCor components, high-pass filtering, polynomial trends) with no spatial smoothing. Data were downsampled to a 1000-parcel cortical atlas, and parcels were grouped into 10 ROIs (HG, PostTemp, AntTemp, AngG, IFG, IFGorb, MFG, vmPFC, dmPFC, PMC).
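As a concrete illustration of the parcellation step, a minimal nilearn sketch might look like the following. The specific atlas (Schaefer 2018, 1000 parcels), file paths, confound columns, and standardization choices are assumptions for illustration rather than the study's exact settings.

```python
# Minimal sketch: average preprocessed BOLD data into 1000 cortical parcels.
# Atlas choice, file names, and confound columns are illustrative assumptions.
import pandas as pd
from nilearn import datasets
from nilearn.maskers import NiftiLabelsMasker

atlas = datasets.fetch_atlas_schaefer_2018(n_rois=1000)

# A few fMRIPrep confound regressors (the study used a richer set).
conf = pd.read_csv("sub-01_task-story_desc-confounds_timeseries.tsv", sep="\t")
confounds = (conf[["trans_x", "trans_y", "trans_z", "rot_x", "rot_y", "rot_z"]]
             .fillna(0).to_numpy())

masker = NiftiLabelsMasker(
    labels_img=atlas.maps,
    standardize=True,     # z-score each parcel time series
    smoothing_fwhm=None,  # no spatial smoothing, as in the study
    t_r=1.5,              # TR of the Narratives data
)

# parcel_ts: (n_TRs, 1000) parcel-averaged time series for one subject/story
parcel_ts = masker.fit_transform(
    "sub-01_task-story_desc-preproc_bold.nii.gz",  # hypothetical path
    confounds=confounds,
)
```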
Features: Confounds included word rate, phoneme rate, a 32D phoneme indicator matrix, and a silence indicator. Baseline linguistic features included spaCy-derived parts of speech and 25 dependency relations (binary indicators per TR), plus three scalar parsing-effort measures from a CCG parser. Non-contextual semantic features used GloVe vectors averaged per TR. Transformer-based features were extracted from BERT-base-uncased (12 layers, 12 heads per layer): (1) layer-wise embeddings (768D), (2) headwise transformations (64D per head; 768D per layer when the 12 heads are concatenated), and (3) transformation magnitudes (the L2 norm of each head's transformation). A 21-TR window (the current TR plus the preceding 20 TRs, ~30 s) provided bidirectional context for BERT; outputs for tokens within the current TR were averaged to yield a single feature vector per TR. For backward attention distance analyses, fixed 128-token windows were used to compute per-head attention-weighted look-back distances.
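A sketch of the BERT feature extraction is shown below, using the HuggingFace transformers library. The headwise "transformations" are operationalized here as the per-head self-attention outputs captured with a forward hook, and the backward attention distance uses one simple weighting scheme; the window text, token indices, and these operationalizations are illustrative and may differ in detail from the authors' implementation.

```python
# Sketch: layer-wise embeddings, headwise transformations, and look-back
# distances from BERT for one TR-aligned context window (illustrative only).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

n_layers, n_heads, head_dim = 12, 12, 64

# Capture each layer's self-attention output (concatenation of 12 x 64D heads).
head_outputs = {}
def make_hook(layer_idx):
    def hook(module, inputs, outputs):
        head_outputs[layer_idx] = outputs[0].detach()  # (1, seq_len, 768)
    return hook

for l, layer in enumerate(model.encoder.layer):
    layer.attention.self.register_forward_hook(make_hook(l))

window_text = "placeholder for the ~30 s of transcript ending at the current TR"
enc = tokenizer(window_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    out = model(**enc, output_attentions=True)

# Layer-wise embeddings: hidden states (index 0 is the input embedding layer).
embeddings = [h.squeeze(0) for h in out.hidden_states[1:]]            # 12 x (seq, 768)

# Headwise transformations: split each layer's attention output into heads.
transformations = [
    head_outputs[l].squeeze(0).reshape(-1, n_heads, head_dim)         # (seq, 12, 64)
    for l in range(n_layers)
]

# Per-TR features: average over tokens spoken within the current TR
# (real token indices would come from the forced-aligned transcript).
tr_tokens = slice(-5, None)                                            # hypothetical
emb_per_tr = [e[tr_tokens].mean(0) for e in embeddings]                # 12 x (768,)
xfm_per_tr = [t[tr_tokens].mean(0).reshape(-1) for t in transformations]   # 12 x (768,)
magnitudes = [t[tr_tokens].mean(0).norm(dim=-1) for t in transformations]  # 12 x (12,)

# Backward attention distance (one simple operationalization): attention-weighted
# number of tokens each head looks back, averaged over query positions.
pos = torch.arange(enc["input_ids"].shape[1])
back = (pos[:, None] - pos[None, :]).clamp(min=0).float()              # (seq, seq)
lookback = [(a.squeeze(0) * back).sum(-1).mean(-1) for a in out.attentions]  # 12 x (12,)
```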
Encoding models: Parcelwise encoding used banded ridge regression with three-fold cross-validation. The predictor matrix included lagged copies of the features (lags of 1–4 TRs) to accommodate hemodynamic delays. Regularization was optimized via random search over band-wise, Dirichlet-sampled penalties. Performance was quantified as the Pearson correlation between predicted and observed test time series, expressed as a percentage of a noise ceiling estimated via intersubject correlation (ISC). Statistical significance was assessed with bootstrap tests (against zero) and permutation tests (between models), with FDR correction.
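To make the encoding-model logic concrete, here is a simplified sketch with placeholder data. Ordinary cross-validated ridge regression with a single penalty stands in for the banded ridge with band-wise, Dirichlet-sampled penalties used in the study (available, for example, in the himalaya package); only the lagging, cross-validation, and noise-ceiling normalization are meant to carry over.

```python
# Sketch: a simplified parcelwise encoding model (plain ridge instead of
# banded ridge; data are random placeholders).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from scipy.stats import pearsonr

def add_lags(X, lags=(1, 2, 3, 4)):
    """Concatenate lagged copies of the features to absorb hemodynamic delay."""
    lagged = [np.roll(X, lag, axis=0) for lag in lags]
    for arr, lag in zip(lagged, lags):
        arr[:lag] = 0  # zero out samples that wrapped around
    return np.hstack(lagged)

def encode_parcel(X, y, noise_ceiling_r, n_splits=3):
    """Cross-validated prediction accuracy as percent of the ISC noise ceiling."""
    X_lagged = add_lags(X)
    scores = []
    for train, test in KFold(n_splits=n_splits).split(X_lagged):
        model = RidgeCV(alphas=np.logspace(-1, 8, 10))
        model.fit(X_lagged[train], y[train])
        r, _ = pearsonr(model.predict(X_lagged[test]), y[test])
        scores.append(r)
    return 100 * np.mean(scores) / noise_ceiling_r

# Example with random stand-ins for real features and a parcel time series.
rng = np.random.default_rng(0)
X = rng.standard_normal((900, 768))   # e.g., one layer of BERT features per TR
y = rng.standard_normal(900)          # one parcel's BOLD time series
print(encode_parcel(X, y, noise_ceiling_r=0.3))
```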
Headwise analyses: To assess functional specialization, headwise encoding performance was computed by training an encoding model on all transformations and evaluating prediction using only one head at a time. For syntactic information, logistic regression decoders (balanced accuracy; nested CV for regularization) were trained per head to predict presence of each dependency per TR from the 64D headwise transformation vector. To summarize head contributions across parcels, encoding weight matrices were z-scored per parcel, L2 norms were computed per head, and PCA across language parcels produced low-dimensional axes (PCs) capturing variance in headwise weights; PCs were related to head properties (layer, look-back distance, dependency decoding). Control analyses shuffled features across heads within layers (forming pseudo-heads), used an untrained BERT, and replicated in GPT-2.
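A sketch of the headwise PCA summary described above (z-scoring encoding weights within each parcel, taking per-head L2 norms, then running PCA across parcels). The weight matrix here is random and the lag dimension is omitted for brevity; only the sequence of operations is meant to match.

```python
# Sketch: low-dimensional summary of headwise encoding weights (illustrative).
import numpy as np
from sklearn.decomposition import PCA

n_parcels, n_layers, n_heads, head_dim = 400, 12, 12, 64  # shapes are illustrative

# Hypothetical encoding weights: one weight vector per parcel over the
# concatenated headwise transformation features (lag dimension omitted).
rng = np.random.default_rng(0)
W = rng.standard_normal((n_parcels, n_layers * n_heads * head_dim))

# 1) z-score the weights within each parcel.
W_z = (W - W.mean(axis=1, keepdims=True)) / W.std(axis=1, keepdims=True)

# 2) L2 norm of each head's 64 weights -> one scalar per head per parcel.
head_norms = np.linalg.norm(
    W_z.reshape(n_parcels, n_layers * n_heads, head_dim), axis=-1
)                                                          # (n_parcels, 144)

# 3) PCA across parcels: each PC is an axis in head space whose parcelwise
#    scores can be projected back onto the cortex as a map.
pca = PCA(n_components=10)
parcel_scores = pca.fit_transform(head_norms)              # (n_parcels, 10)
head_loadings = pca.components_                            # (10, 144)
print(pca.explained_variance_ratio_[:2])
```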
Key Findings
- Transformer features (embeddings and transformations) significantly outperform classical linguistic features across most language ROIs (p < 0.005 in HG, PostTemp, AntTemp, AngG, IFG, IFGorb, vmPFC, dmPFC, PMC), and embeddings outperform non-contextual GloVe in several ROIs.
- Transformations perform on par with embeddings overall in predicting brain activity; transformation magnitudes (content-agnostic) outperform GloVe and classical linguistic features in lateral temporal areas but not in higher-level regions like angular gyrus.
- Layer-wise behavior differs: embeddings' performance increases gradually across layers, peaking in late-intermediate to final layers, whereas transformations show more layer-specific fluctuations and peak earlier. The mean preferred layer across language parcels was 7.2 for transformations versus 8.9 for embeddings (p < 0.001). Partial-correlation analyses show that transformations capture more unique variance at earlier layers, while embeddings capture more unique variance at later layers.
- Transformations exhibit greater layer specificity: mean absolute performance difference between neighboring layers is larger for transformations (14.3) than for embeddings (7.6) (p < 0.001).
- Headwise analysis reveals low-dimensional structure linking head properties to cortical organization. PCA of headwise encoding weights shows that the first two PCs explain 92% of the variance across parcels (the first nine explain 95%). PC1 and PC2 project to distinct cortical maps (PC1: bilateral posterior temporal and left lateral PFC positive, medial PFC negative; PC2: prefrontal and left anterior temporal positive, partially right-lateralized temporal negative).
- Head properties align with these axes: head layer correlates with the PCs (maximum r ≈ 0.45), and backward attention distance correlates strongly with PC2 (r ≈ 0.65). Heads with long look-back distances (upper quartile, > 30 tokens) align with prefrontal and anterior temporal parcels.
- Functionally specialized heads reported in prior NLP work cluster in this space in a region consistent with intermediate layers and shorter look-back distances. Shuffling features across heads within layers collapses the structure: the first two PCs explain only 17% of the variance, correlations with layer and look-back distance drop (maximum r from 0.45 to 0.25 and from 0.65 to 0.26), and the visible cortical gradients disappear.
- Headwise functional correspondence: across dependencies and ROIs, heads that better decode a given dependency also better predict activity in specific ROIs. Examples: posterior superior temporal cortex shows correspondence for ccomp, dobj, pobj; IFG shows correspondence primarily for ccomp; angular gyrus and MFG show high correspondence across dependencies; vmPFC and dmPFC show little correspondence (despite being predicted by full models), suggesting shared information there is more semantic or beyond classical syntax.
- Control analyses: shuffling across heads within layers and using an untrained BERT both abolish headwise correspondence; GPT-2 replicates trends with higher correspondence in IFG but less ROI specificity. Additional controls show poor model performance in a non-language ROI (early visual cortex).
Discussion
The results support the hypothesis that the Transformer's headwise transformations are a valuable basis for modeling human brain activity during naturalistic language comprehension. Transformations, the sole mechanism by which context flows across words in the model, perform comparably to embeddings and outperform classical linguistic features, indicating that the contextual information they integrate is highly relevant to brain responses. Unlike embeddings, transformations are layer-specific updates and map more sharply to a cortical hierarchy, capturing unique variance earlier and exhibiting stronger layer specificity. Headwise analyses reveal shared functional organization: gradients in cortical space correspond to head layer and look-back distance, suggesting that cortical regions vary systematically in the temporal scope of contextual integration (e.g., posterior temporal areas favor earlier-layer, short-range dependencies; anterior temporal and prefrontal cortices align with long-range dependencies and longer temporal receptive windows). The observed headwise correspondence between syntactic dependency information and ROI prediction strengths links specific contextual computations to particular regions, though not one-to-one. Together, these findings suggest partially shared computational principles between human language processing and Transformer circuits, and demonstrate that analyzing circuit computations (transformations) provides complementary insights beyond embeddings alone.
Conclusion
This work bridges Transformer circuit computations and cortical language processing by demonstrating that headwise transformations predict brain activity during naturalistic story listening on par with contextual embeddings and better than classical linguistic features. Transformations exhibit earlier, more layer-specific mappings to cortex and organize into low-dimensional gradients tied to head layer and contextual look-back distance. Headwise correspondence analyses link specific dependency-encoding heads to particular ROIs, revealing structured, shared functional specialization between model and brain. Future research directions include: designing bottlenecked Transformer architectures to encourage hierarchical embeddings and closer structural mapping to cortical hierarchies; integrating acoustic and prosodic features or models that operate directly on continuous speech; probing differences across Transformer training regimes and architectures using headwise analyses; and developing models with more neurobiologically inspired circuit connectivity to elucidate both specialization and integration across language network nodes.
Limitations
The study is correlational and does not provide a mechanistic account of cortical language processing. Transformer features are derived from text tokens and omit acoustic/prosodic dynamics of speech. Classical syntactic annotations used for correspondence are sparse proxies and may not capture the richer contextual relations learned by heads. Aggregating parcelwise results across subjects (and using ISC-based noise ceilings) may obscure individual variability; coarse parcellation may limit localization precision, particularly in regions like IFG. Naturalistic stimuli may not sufficiently tax certain syntactic processes, potentially limiting detection of head–ROI correspondences. Backward attention distance estimates may underestimate temporal integration that accumulates across layers. Finally, results depend on specific models (BERT, GPT-2) and preprocessing choices, though control analyses mitigate concerns about trivial architectural artifacts.