Psychology
Natural language instructions induce compositional generalization in networks of neurons
R. Riveland and A. Pouget
Humans can rapidly perform new tasks from verbal or written instructions and can also verbalize task solutions once learned, unlike nonlinguistic animals that require many trials with simple reinforcement. The computational principles enabling instruction-based generalization and language-to-action mapping remain unclear. Prior systems-level theories suggest flexible prefrontal connectivity reuses practiced sensorimotor representations in novel settings, and neural data show abstract task-set structure in representations when stimulus–response mappings must be flexibly recruited. Multitask recurrent neural networks (RNNs) can also share dynamical motifs across related tasks. However, how natural language can reconfigure sensorimotor networks to perform unseen tasks on the first attempt is not well understood, nor is the expected representational structure in brain areas integrating language to reorganize sensorimotor mappings on the fly. Leveraging recent language models, the study aims to build a neurally interpretable framework where sentence-level semantics guide compositional generalization over sensorimotor tasks and to derive testable predictions about brain representations supporting instruction-based generalization.
- Flexible hubs and interregional connectivity in prefrontal cortex support adaptive task control and reuse of representations (Cole et al., Miller & Cohen).
- Neural representations in hippocampus and prefrontal cortex can reflect abstract task geometry; human medial frontal cortex flexibly recruits memory-based choice representations.
- Multitasking RNNs reuse shared dynamical motifs across tasks with similar demands (Yang et al.; Driscoll et al.).
- Language models: autoregressive transformers (GPT) align with neural signatures of reading/listening and predictive processing; large language models follow instructions but outputs are hard to map to biological sensorimotor mappings. Multimodal interactive agents show action-level interpretability but fuse vision and language early, complicating mapping to distinct brain areas.
- Sentence-level embedding models (e.g., SBERT) are trained for semantic relations; CLIP learns joint language–vision embeddings. The extent to which sentence-level semantics versus next-word prediction matters for configuring sensorimotor mappings is a central open question addressed here.
- Tasks and sensorimotor model: Train gated recurrent unit (GRU) sensorimotor RNNs (256 hidden units, ReLU) on 50 interrelated psychophysical tasks spanning five families: Go, Decision-making (DM), Comparison, Duration, and Matching. Inputs: two 32-unit sensory modality rings with Gaussian tuning to angles (0 to 2π) plus a fixation unit; outputs: a 32-unit response ring plus fixation unit. Trials include preparatory, stimulus (optionally with stim1, delay, stim2), and response epochs. Task-identifying information is a 64-dimensional vector presented constantly during the trial and concatenated with sensory inputs.
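To make the input layout concrete, the sketch below builds one time step of network input from two Gaussian-tuned sensory rings, a fixation unit, and the 64-D task vector. The tuning width and use of circular distance are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def encode_direction(theta, strength=1.0, n_units=32, sigma=0.5):
    """Population activity of one sensory ring for a stimulus at angle theta.

    Each unit has a preferred direction evenly spaced on [0, 2*pi); activity
    falls off as a Gaussian of the circular distance to the stimulus.
    (sigma and the circular-distance form are assumptions for illustration.)
    """
    prefs = np.linspace(0.0, 2.0 * np.pi, n_units, endpoint=False)
    d = np.angle(np.exp(1j * (prefs - theta)))   # circular distance in [-pi, pi]
    return strength * np.exp(-0.5 * (d / sigma) ** 2)

def trial_input(theta_mod1, theta_mod2, task_vec, fixate=1.0):
    """One time step of input: two 32-unit modality rings, a fixation unit,
    and the 64-D task-identifying vector, concatenated."""
    return np.concatenate([
        encode_direction(theta_mod1),
        encode_direction(theta_mod2),
        [fixate],
        task_vec,                                # 64-D rule vector or instruction embedding
    ])

x = trial_input(np.pi / 4, 3 * np.pi / 2, np.zeros(64))
print(x.shape)  # (32 + 32 + 1 + 64,) = (129,)
```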
- Instruction embeddings and model variants:
- Instructed models use frozen pretrained transformers to embed natural language task instructions into 64-D vectors via average pooling of final hidden states followed by a learned linear projection (see the sketch after this list). Language models tested: the autoregressive GPT-2 and GPT-2 XL, BERT (masked language modeling plus next-sentence prediction), SBERT (sentence-level embeddings, including a large variant), and the CLIP text encoder. A bag-of-words (BoW) linear embedding serves as a control.
- Nonlinguistic controls: SIMPLENET (50 orthogonal 64-D task rule vectors) and STRUCTURENET (compositions of 10 orthogonal 64-D structure vectors encoding task dimensions like Pro vs Anti, Mod1 vs Mod2, etc.).
- Additional controls: nonlinear MLP on top of language embeddings (reduced generalization); blank-slate language models trained only from task loss (poor performance) confirming necessity of language pretraining.
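The embedding pipeline for instructed models reduces to a frozen transformer, mean pooling over final hidden states, and a trained 64-D linear readout. The sketch below follows that recipe; the checkpoint name is an illustrative stand-in for the SBERT-style encoders the study uses, and only the projection is trained.

```python
import torch
from transformers import AutoModel, AutoTokenizer

class InstructionEmbedder(torch.nn.Module):
    """Frozen pretrained transformer -> mean-pooled final hidden states ->
    learned 64-D linear projection (only the projection receives gradients)."""

    def __init__(self, model_name="sentence-transformers/all-mpnet-base-v2", out_dim=64):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.encoder = AutoModel.from_pretrained(model_name)
        for p in self.encoder.parameters():      # freeze the language model
            p.requires_grad = False
        self.proj = torch.nn.Linear(self.encoder.config.hidden_size, out_dim)

    def forward(self, instructions):
        toks = self.tokenizer(instructions, padding=True, return_tensors="pt")
        hidden = self.encoder(**toks).last_hidden_state   # (batch, seq, dim)
        mask = toks["attention_mask"].unsqueeze(-1)       # ignore padding tokens
        pooled = (hidden * mask).sum(1) / mask.sum(1)     # average pooling
        return self.proj(pooled)                          # (batch, 64)
```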
- Training: Supervised masked MSE loss on motor outputs; Adam optimizer, initial LR 0.001 with epoch-wise decay γ=0.95; train to 95% accuracy across tasks (85% for GPT where convergence was lower on some tasks). Five random initializations per model type. Validation tested on 5 novel instructions per task (20 instructions per task total; 15 train, 5 validation).
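A minimal sketch of this training setup, with `model`, `loader`, and the batch format as hypothetical stand-ins for the study's actual pipeline:

```python
import torch

# Hypothetical stand-ins: `model` maps (sensory, instructions) -> motor
# output; `loader` yields batches with a mask selecting the scored
# time steps/units.
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)

def masked_mse(pred, target, mask):
    """MSE over motor outputs, counted only where mask == 1."""
    return (mask * (pred - target) ** 2).sum() / mask.sum()

for epoch in range(100):            # in practice, train until ~95% accuracy
    for sensory, instructions, target, mask in loader:
        opt.zero_grad()
        loss = masked_mse(model(sensory, instructions), target, mask)
        loss.backward()
        opt.step()
    sched.step()                    # epoch-wise decay, gamma = 0.95
```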
- Generalization evaluations:
- Zero-shot to novel tasks: Train on 45 tasks, hold out 5 tasks, test performance on first exposure (and learning curves over 100 exposures). Statistical comparisons via two-sided unequal-variance t-tests.
- Swapped-instruction test: Present sensory input from one held-out task with instructions from a different held-out task to assess reliance on instruction semantics.
- Grouped holdouts: Hold out 4–6 tasks from the same family, a more stringent test of generalization.
- Fine-tuning: Allow gradients to update the last three transformer layers with small learning rates (language lr 3e-5; RNN lr 1e-4 or 5e-5 for sensitive GPT models) after initial training checkpoint; compare generalization.
- Compositional encoding: Combine rule vectors or instruction embeddings using linear algebraic compositions per prior work (e.g., AntiDMMod1 = (AntiDMMod2 − DMMod2) + DMMod1) and test performance (see the sketch after this list).
- Nonlinear readout variant: Add MLP between language model outputs and the 64-D embedding; assess overfitting/generalization.
- Conditional clause/deduction analysis: Compare performance between tasks instructed with conditional clauses (requiring simple deduction) versus simple imperative instructions; construct null distributions via random splits and compare with STRUCTURENET to isolate linguistic parsing difficulty.
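The compositional encoding test amounts to simple vector arithmetic over 64-D task vectors, as in this sketch (the task dictionary and random vectors are hypothetical placeholders):

```python
import numpy as np

def compose_held_out(task_vecs):
    """Infer AntiDMMod1's vector from related tasks: take the 'Anti'
    direction measured in modality 2 and apply it to DM in modality 1."""
    return (task_vecs["AntiDMMod2"] - task_vecs["DMMod2"]) + task_vecs["DMMod1"]

rng = np.random.default_rng(0)
task_vecs = {name: rng.normal(size=64)
             for name in ["AntiDMMod2", "DMMod2", "DMMod1"]}
inferred = compose_held_out(task_vecs)  # presented to the RNN in place of
                                        # AntiDMMod1's own task vector
```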
- Representation analyses:
- Principal components of sensorimotor hidden activity and instruction embeddings across task axes (Pro vs Anti; Mod1 vs Mod2), including held-out AntiDMMod1 examples.
- Cross-condition generalization performance (CCGP): Train a linear decoder on one pair of conditions (e.g., DMMod2 vs AntiDMMod2) and test on analogous pairs (DMMod1 vs AntiDMMod1); a sketch follows this list. Compute CCGP for sensorimotor activity and language layers; correlate with zero-shot performance; examine effects of instruction swapping.
- Single-unit tuning: Characterize neurons whose tuning changes predictably with instruction semantics across families (Go/Anti-Go; Matching; DM/AntiDM; Comparison), including held-out tasks.
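The CCGP analysis above reduces to fitting a linear decoder for an abstract variable on one condition pair and scoring it on an analogous, held-out pair. The sketch below uses scikit-learn's LinearSVC as the decoder, which is an assumed choice rather than necessarily the paper's.

```python
import numpy as np
from sklearn.svm import LinearSVC

def ccgp(acts, labels, train_tasks, test_tasks):
    """Cross-condition generalization: fit a linear decoder for an abstract
    variable (e.g., Pro vs Anti) on one pair of tasks, score on another.
    `acts` maps task name -> (trials, units) array; `labels` maps task
    name -> 0/1 value of the abstract variable."""
    X_tr = np.vstack([acts[t] for t in train_tasks])
    y_tr = np.concatenate([np.full(len(acts[t]), labels[t]) for t in train_tasks])
    X_te = np.vstack([acts[t] for t in test_tasks])
    y_te = np.concatenate([np.full(len(acts[t]), labels[t]) for t in test_tasks])
    clf = LinearSVC().fit(X_tr, y_tr)
    return clf.score(X_te, y_te)       # decoding accuracy on held-out pair

# e.g., train Pro-vs-Anti on modality 2, test on modality 1:
# score = ccgp(acts,
#              {"DMMod2": 0, "AntiDMMod2": 1, "DMMod1": 0, "AntiDMMod1": 1},
#              train_tasks=["DMMod2", "AntiDMMod2"],
#              test_tasks=["DMMod1", "AntiDMMod1"])
```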
- Language production and communication:
- Train a Production-RNN (GRU, 256 units) self-supervised to map sensorimotor activity back to token sequences (instructions). Initialize production hidden state from time-averaged and max sensorimotor activity via a linear layer.
- Motor-feedback inference: Withhold instructions, freeze network weights, and optimize embedding vectors via motor feedback until a performance criterion is met; decode instructions from the resulting sensorimotor states; generate novel phrasings (see the sketch after this list).
- Partner evaluation: Feed produced instructions to a partner SBERTNET(L) model trained either on all tasks or with tasks held out; measure task performance to assess communication quality.
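The motor-feedback inference step can be sketched as gradient descent on a single 64-D embedding while all network weights stay frozen; `model`, `sample_trials`, and the optimizer settings below are hypothetical stand-ins for the study's setup.

```python
import torch

# Hypothetical stand-ins: `model(sensory, task_vec)` is the trained,
# frozen sensorimotor network; `sample_trials()` returns
# (sensory, target, mask) batches for the uninstructed task.
for p in model.parameters():
    p.requires_grad = False                       # freeze all trained weights

embedding = torch.zeros(64, requires_grad=True)   # the only free parameters
opt = torch.optim.Adam([embedding], lr=0.05)      # learning rate is an assumption

for step in range(1000):
    sensory, target, mask = sample_trials()
    pred = model(sensory, embedding.expand(sensory.shape[0], -1))
    loss = (mask * (pred - target) ** 2).sum() / mask.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Stop once performance reaches criterion; the inferred embedding is
    # then fed to the Production-RNN to decode an instruction for the task.
```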
- Zero-shot generalization to unseen tasks: Instructed models using sentence-level semantics generalize strongly. SBERTNET(L) and SBERTNET achieve average first-exposure performance around 83% on novel tasks; validation instruction performance reached 97% and 94%, respectively. SIMPLENET baseline: 39%; GPTNET: 57% (significantly above SIMPLENET, t=8.32, P=8.24×10^-16); GPTNET-XL, despite 1.5B parameters, showed limited gains over smaller models.
- Grouped holdouts (harder): SBERTNET(L) and SBERTNET achieved 71–72% accuracy, not significantly different from STRUCTURENET at 72%.
- Fine-tuning last language layers improved generalization across models; SBERTNET(L) matched STRUCTURENET (86%; t=1.204, P=0.229).
- Swapped-instruction tests caused the largest performance drops in the best instructed models, confirming reliance on instruction semantics.
- Composing instruction embeddings algebraically enabled much higher performance than the analogous rule-vector compositions in SIMPLENET (which rose to 60%), indicating that language pretraining facilitates simple compositional schemes for configuring responses.
- Nonlinear mapping between language outputs and embeddings reduced generalization, suggesting linear readouts better preserve transferable structure.
- Determinants of success: Exposure to sentence-level semantic objectives during pretraining (SBERT) best supports generalization; large autoregressive next-word prediction (GPT) alone is insufficient; model size is not the key factor.
- Conditional clause/deduction tasks: All models performed worse on tasks with conditional clauses than on simple imperatives; instructed models performed worse than STRUCTURENET by a significant margin, indicating added difficulty stems from parsing syntactic complexity rather than deduction per se.
- Representational structure: SBERTNET(L) and STRUCTURENET showed clear factorization along abstract axes (Pro vs Anti, Mod1 vs Mod2) in both embeddings and sensorimotor activity, enabling high performance on held-out AntiDMMod1 (e.g., STRUCTURENET 92%, SBERTNET(L) 82%). GPTNET-XL failed to form a distinct Pro vs Anti axis (6% on AntiDMMod1); SIMPLENET lacked structure (22%).
- CCGP strongly correlated with zero-shot performance (Pearson r = 0.606, P = 1.57×10^-4). Swapping instructions reduced CCGP in instructed models, evidencing semantic dependence. In SBERT models, high CCGP emerged within intermediate transformer layers; in GPT/BERT/CLIP, CCGP jumped primarily at the learned linear embedding, indicating structure was imposed by the readout during task training.
- Single-unit tuning modulated by semantics: Neurons flipped direction selectivity or integration sign between Pro and Anti tasks, and matching versus non-matching tasks modulated tuning as required; failures to modulate (e.g., in GPTNET-XL) aligned with poor generalization.
- Language production and communication: The model generated many novel instructions (53% novel phrasings) that effectively guided partner models. Partner performance with decoded instructions averaged 93% (trained on all tasks) and 78% (with partner holdouts). Even for novel phrasings, partner accuracy was high (88% all tasks; 75% with holdouts).
The findings show that natural language can scaffold sensorimotor representations so that interrelated tasks share a common, compositional geometry aligned with instruction semantics. This shared geometry allows a model to compose practiced skills to solve unseen tasks on the first attempt. Sentence-level semantic pretraining (e.g., SBERT) furnishes language representations with abstract axes (e.g., Pro vs Anti, modality attention) that a simple linear mapping can align with sensorimotor dynamics, supporting robust zero-shot generalization. The strong correlation between CCGP and zero-shot performance indicates that linear decodability of abstract task axes across conditions is a key representational signature of generalization.

Parsing complexity matters: conditional clause instructions degrade performance beyond nonlinguistic structure alone, implying that observed engagement of language areas in deduction tasks may reflect syntactic processing demands rather than deduction-specific mechanisms. At the single-unit level, neurons modulate tuning according to instruction semantics, flipping selectivity or integration sign as tasks demand, including for held-out tasks; lack of such modulation corresponds to generalization failure.

Representational analyses predict that language areas should exhibit task-related abstract geometry mirroring that in sensorimotor regions when humans follow instructions, potentially in the language-selective left inferior frontal gyrus subregion adjacent to multiple-demand areas. Grounding high-level language layers in embodied processes during learning (e.g., motor planning) likely shapes semantic representations beneficially, mirroring improvements seen with fine-tuning. Language also serves as a shared, interpretable medium for communicating learned sensorimotor skills between independent networks, outperforming transfers using latent vectors.
This work introduces a neurally interpretable framework where natural language instructions induce compositional generalization in sensorimotor RNNs across 50 psychophysical tasks. Models leveraging sentence-level semantic embeddings (SBERT) achieve strong zero-shot performance on unseen tasks (~83%), underpinned by a shared representational geometry between instruction semantics and sensorimotor activity. Fine-tuning language layers further enhances generalization. We provide task- and unit-level evidence that instruction semantics modulate neural tuning, and demonstrate bidirectional language–sensorimotor mapping by producing effective, often novel, linguistic descriptions that transfer task knowledge to partner models. Future directions include testing the predicted shared abstract geometry across language and sensorimotor brain areas, especially in left inferior frontal gyrus; probing trial-by-trial semantic modulation of neuronal tuning in humans; improving handling of syntactically complex conditional instructions; and exploring how embodied signals shape high-level linguistic representations during learning. Comparative neuroimaging using gradients from autoregressive to sentence-structured language representations could map stages from prediction to compositional semantics to sensorimotor control.
- Task domain is limited to simulated psychophysical tasks with simplified sensory and motor representations; generalization to richer, real-world settings remains to be shown.
- Autoregressive large models (GPT-2 XL) underperformed despite their size, highlighting a dependence on specific pretraining objectives; results may vary with other architectures or larger sentence-embedding models.
- Conditional clause instructions reduced performance, indicating limited robustness to syntactic complexity; the models may not fully capture human-level deduction in language-heavy contexts.
- Fine-tuning improves performance but risks overfitting or catastrophic forgetting without careful learning-rate control; biological plausibility of such tuning is untested.
- Statistical analyses rely on five random initializations; broader hyperparameter sweeps or larger samples could refine estimates.
- Neural predictions (e.g., shared geometry across language and sensorimotor areas) require empirical validation in humans or animals.