Computer Science
Human-like systematic generalization through a meta-learning neural network
B. M. Lake and M. Baroni
Discover how Brenden M. Lake and Marco Baroni tackle the challenge of achieving human-like systematic generalization in neural networks with their meta-learning for compositionality (MLC) approach. Their research shows that networks optimized through MLC outperform both rigid symbolic models and flexible but unsystematic neural networks, combining systematicity with flexibility.
Introduction
The study investigates whether standard neural networks can achieve human-like systematic compositionality—the ability to generalize algebraically to novel combinations of known components. This question stems from the long-standing challenge posed by Fodor and Pylyshyn, who argued that neural networks lack such systematicity and thus are implausible cognitive models. Despite advances in neural architectures and training paradigms, modern networks continue to struggle on tests that a minimally algebraic mind should pass. The authors hypothesize that optimizing a neural network specifically for compositional skills via meta-learning can induce human-like systematic generalization. They introduce Meta-Learning for Compositionality (MLC), which guides learning through a stream of few-shot compositional tasks, and evaluate it side-by-side with human behaviour using instruction-learning in a pseudolanguage. The purpose is to demonstrate that a standard transformer architecture, when meta-trained to acquire compositional skills, can match or exceed human systematicity while reproducing human-like inductive biases, thereby addressing the systematicity debate.
Literature Review
The paper situates its contribution within a 35-year debate initiated by Fodor and Pylyshyn, who claimed connectionist networks lack systematic compositionality. Counterarguments suggest either that human compositional skills are less strictly rule-like than proposed, or that enhanced neural architectures can exhibit more systematicity. Empirical evaluations over the past decade show that modern neural networks—including recurrent and transformer-based sequence-to-sequence models—often fail systematic generalization tests such as SCAN and COGS. Concurrently, large language models show impressive few-shot learning but still exhibit open questions regarding systematic compositionality. Prior work has explored neural-symbolic hybrids, structured representations (e.g., tensor product representations, recursive distributed representations), architectural factorization, curriculum learning, and meta-learning as routes to improved generalization. Meta-learning has been proposed as a way to reverse engineer human inductive biases, with formal ties to hierarchical Bayesian models, but existing approaches have not comprehensively matched human patterns across both systematic generalization and bias-driven deviations. This study builds on these strands by using meta-learning with a standard transformer to induce both systematic rules and human-like biases.
Methodology
Behavioural experiments with humans: Two tasks were used.
- Few-shot instruction-learning task: Participants (n=25) received a curriculum of 14 study input–output pairs (instructions in a pseudolanguage and abstract output symbol sequences) and were asked to generate outputs for 10 query instructions. The study items were generated by an underlying interpretation grammar comprising compositional rewrite rules; success required inferring word meanings from just a few examples and composing them for novel queries, including queries requiring longer outputs than any seen during study.
- Open-ended instruction task: A separate group of participants (n=29) produced outputs for seven unknown instructions without any study examples, revealing their a priori inductive biases. Analyses assessed three biases (illustrated in the sketch after this list): one-to-one mapping (each word maps to exactly one output symbol), iconic concatenation (the output preserves the input word order), and mutual exclusivity (unique words are assigned unique meanings).
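The three bias measures can be made concrete as simple predicates over instruction–response pairs. The following is a minimal sketch of our own (not the authors' analysis code), assuming a toy representation in which pseudowords such as "dax" map to abstract symbols such as "RED":

```python
def one_to_one(responses):
    """Each word consistently maps to exactly one output symbol.

    Simplification: assumes a full word-by-word translation, so input
    and output lengths must match."""
    mapping = {}
    for instruction, output in responses.items():
        words = instruction.split()
        if len(words) != len(output):
            return False
        for word, symbol in zip(words, output):
            if mapping.setdefault(word, symbol) != symbol:
                return False
    return True

def iconic_concatenation(responses):
    """Output symbols preserve the left-to-right order of the input words."""
    mapping = {}
    for instruction, output in responses.items():
        for word, symbol in zip(instruction.split(), output):
            mapping.setdefault(word, symbol)
    return all(
        [mapping.get(w) for w in instr.split()] == list(out)
        for instr, out in responses.items()
    )

def mutual_exclusivity(responses):
    """Distinct instructions receive distinct outputs."""
    outputs = [tuple(out) for out in responses.values()]
    return len(outputs) == len(set(outputs))

# Hypothetical responses consistent with all three biases.
demo = {
    "dax": ["RED"],
    "wif": ["GREEN"],
    "dax wif": ["RED", "GREEN"],
}
print(one_to_one(demo), iconic_concatenation(demo), mutual_exclusivity(demo))
```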
Model and meta-learning approach (MLC):
- Architecture: A standard transformer encoder–decoder is used for memory-based meta-learning. Inputs to the encoder are concatenations of multiple study examples (input/output pairs, delimited by a special token) plus a query instruction. The decoder generates the output sequence for the query.
- Meta-training episodes: Each episode is a distinct seq2seq task defined by a randomly sampled latent grammar that maps inputs to outputs (a toy episode-construction sketch follows this list). Across episodes, the model must infer word meanings from the study examples and systematically compose them for the query. Model weights are optimized across many such episodes; at test time, the weights are frozen and no task-specific parameters are provided.
- Human-like bias modelling: To predict human responses (including characteristic errors), meta-training stochastically paired queries with either algebraic target outputs or heuristic outputs reflecting one-to-one translations or misapplied rules (iconic concatenation), approximately matching empirical rates.
- Evaluation metrics: For the few-shot task, exact-match accuracy and item-level difficulty correlations with human performance were computed. Model fit to human behaviour was quantified by the log-likelihood of all human responses. For the open-ended task, models were compared using fivefold cross-validation, scoring log-likelihood on held-out participants.
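To make the episode format concrete, here is a minimal, self-contained sketch of how study examples and a query might be serialized into a single source sequence for the transformer encoder, with the query's output as the decoder target. The pseudowords, symbols, delimiter tokens, and the toy "twice" function are illustrative assumptions, not the paper's actual grammars or vocabulary:

```python
import random

# A toy stand-in for the paper's interpretation grammars: each episode
# assigns pseudowords fresh meanings, plus one function word ('twice').
PRIMITIVES = ["dax", "wif", "lug", "zup"]
SYMBOLS = ["RED", "GREEN", "BLUE", "YELLOW"]
SEP, IO, QUERY = "|", "->", "Q:"  # hypothetical delimiter tokens

def sample_grammar():
    """Randomly reassign each primitive word to an output symbol."""
    assignment = dict(zip(PRIMITIVES, random.sample(SYMBOLS, len(SYMBOLS))))

    def interpret(instruction):
        out = []
        for word in instruction.split():
            if word == "twice":
                out.extend(out[-1:])  # repeat the previous symbol
            else:
                out.append(assignment[word])
        return out

    return interpret

def make_episode(num_study=3):
    """Serialize study pairs plus a query into (source, target) tokens."""
    interpret = sample_grammar()
    words = random.sample(PRIMITIVES, num_study)
    study = words[:-1] + [words[-1] + " twice"]  # include one composed item
    query = random.choice(words) + " twice"
    source = []
    for instr in study:  # study examples, delimited by special tokens
        source += instr.split() + [IO] + interpret(instr) + [SEP]
    source += [QUERY] + query.split()  # the query instruction comes last
    target = interpret(query)  # what the decoder is trained to emit
    return source, target

random.seed(0)
src, tgt = make_episode()
print(" ".join(src))
print(" ".join(tgt))
```

Because every episode draws a fresh latent assignment, memorizing a fixed lexicon is useless; the only reliable strategy the optimization can find is to infer word meanings from the study examples in context, which is exactly the compositional skill MLC aims to induce.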
Comparative models: Baselines included a basic seq2seq transformer trained directly on the task (no meta-learning), a copy-only MLC variant, an algebraic-only MLC variant (trained only on algebraic outputs), and probabilistic symbolic models assuming access to the gold grammar, with or without bias-based lapse processes.
Machine learning benchmarks: The study further evaluated MLC on SCAN and COGS benchmarks focusing on systematic lexical generalization. Meta-training used surface-level word-type permutations to induce few-shot inference without expanding the vocabulary. Results were reported as exact-match error rates averaged over five runs. Additional tests included within-distribution generalization (simple splits) and more challenging productivity splits (e.g., SCAN length; structural generalization types in COGS).
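The permutation idea can be sketched in a few lines. The following is a reconstruction under stated assumptions (a tiny SCAN-like pair list and primitive set), not the released training code:

```python
import random

# Surface-level word-type permutation: primitive words are consistently
# shuffled throughout the commands while the action sequences stay fixed,
# so each episode pairs words with new meanings without adding vocabulary.
PRIMITIVES = ["jump", "run", "walk", "look"]

# A few SCAN-style training pairs (command -> action sequence).
PAIRS = [
    ("jump", "JUMP"),
    ("run twice", "RUN RUN"),
    ("walk left", "LTURN WALK"),
    ("look", "LOOK"),
]

def permuted_episode(pairs):
    """Return the pairs with primitive words consistently permuted,
    yielding a new latent word->meaning assignment for this episode."""
    perm = dict(zip(PRIMITIVES, random.sample(PRIMITIVES, len(PRIMITIVES))))
    return [
        (" ".join(perm.get(w, w) for w in cmd.split()), acts)
        for cmd, acts in pairs
    ]

random.seed(2)
for cmd, acts in permuted_episode(PAIRS):
    print(f"{cmd} -> {acts}")
```

Because the action vocabulary and the grammar stay fixed, the model's vocabulary never grows; only the latent word–meaning pairing changes from episode to episode, which is what forces few-shot inference from study examples.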
Key Findings
Human behaviour:
- Few-shot task: Participants produced outputs exactly matching the algebraic standard on 80.7% of queries, and were correct on 72.5% of queries requiring longer output sequences than any seen during study. Error analyses revealed strong inductive biases: 24.4% of all errors were one-to-one translations, and 23.3% of errors on queries involving function 3 were iconic concatenations.
- Open-ended task: 58.6% of participants (17/29) produced the modal response pattern consistent with all three biases; overall, 62.1% followed one-to-one mapping, 79.3% followed iconic concatenation, and 93.1% adhered to mutual exclusivity by producing a unique output for each instruction.
Model results (MLC vs alternatives):
- On the same few-shot task, MLC was perfectly systematic when selecting its single best output for each query (100% exact match). When sampling from its output distribution, it produced systematic outputs on 82.4% of queries (close to the human 80.7%) and handled longer output sequences at 77.8% (close to the human 72.5%). Item-level difficulty correlated strongly with human performance (Pearson r = 0.788, P = 0.031, n = 10); a sketch of these metrics follows this list.
- Error patterns were human-like: 56.3% of model errors were one-to-one, and 13.8% of errors involving function 3 were iconic concatenations.
- Log-likelihood of human behaviour (few-shot): MLC outperformed the rigid symbolic and non-meta-learning baselines, and an MLC joint model achieved the best fit (−349.2). In open-ended cross-validation, MLC again outperformed the alternatives, with the joint model achieving the best fit (−635.7).
- Open-ended generation: The MLC transformer matched the modal human response in 65.0% of samples and reproduced biases (66.0% one-to-one; 85.0% iconic concatenation; 99.0% unique responses per instruction).
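As a small illustration of the two headline metrics, the sketch below computes exact-match accuracy and an item-level Pearson correlation. The per-item accuracy vectors are invented placeholders for ten query items, not the study's data:

```python
from scipy.stats import pearsonr  # SciPy's standard correlation routine

def exact_match_rate(predictions, targets):
    """Fraction of predicted sequences identical to their targets."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

# Invented per-item accuracies for 10 query items (placeholders only).
human_acc = [0.9, 0.8, 0.95, 0.7, 0.85, 0.6, 0.9, 0.75, 0.8, 0.65]
model_acc = [0.95, 0.85, 0.9, 0.65, 0.9, 0.55, 0.95, 0.7, 0.85, 0.6]

r, p = pearsonr(human_acc, model_acc)
print(f"item-level correlation: r={r:.3f}, P={p:.3f}")
```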
Benchmark performance:
- SCAN systematic lexical generalization: MLC achieved very low error rates—Add jump 0.22%, Around right 0.04%, Opposite right 0.06%—versus basic seq2seq errors of 99.27%, 51.13%, and 100.00%, respectively. Within-distribution (simple) errors were near zero for both (MLC 0.02% on SCAN simple).
- COGS lexical generalization: MLC error 0.87% vs basic seq2seq 6.08%. However, MLC failed on productivity tasks (e.g., SCAN length split and several COGS structural splits at 100% error).
Discussion
The findings address the central research question by showing that a standard transformer, when meta-trained to acquire compositional learning skills, can achieve and even exceed human-like systematic generalization while reproducing human inductive biases. MLC bridges the gap between rigid symbolic models (highly systematic but inflexible) and conventional neural networks (flexible but often unsystematic), fitting human responses better across both the few-shot and open-ended tasks. The approach demonstrates that systematicity and human-like biases need not be built into the architecture; they can be induced through appropriately designed meta-training. In modelling terms, MLC resembles hierarchical Bayesian approaches to reverse-engineering inductive biases, but leverages neural networks for greater expressive power. Beyond cognitive modelling, MLC improves compositional generalization on the SCAN and COGS lexical splits with near-perfect accuracy, highlighting its practical value for machine learning. At the same time, the results delineate the boundary conditions of meta-learning: success occurs when new episodes are in-distribution relative to meta-training, while out-of-distribution forms of productivity (e.g., much longer sequences or novel complex structures) remain challenging. Overall, MLC provides a unified account that captures both the algebraic generalization and the bias-driven deviations observed in human behaviour.
Conclusion
This work introduces Meta-Learning for Compositionality (MLC), showing that a standard transformer can be optimized to display human-like systematic generalization and reproduce human inductive biases in instruction learning. MLC outperforms symbolic and basic seq2seq alternatives in predicting human responses and achieves state-of-the-art-level accuracy on systematic lexical generalization in SCAN and COGS. The study advances the systematicity debate by demonstrating that neural networks, when optimized for compositional skills, can meet classic challenges while retaining flexibility.
Future directions include: designing meta-training procedures that promote productivity (handling much longer outputs and novel complex structures), extending to naturalistic language tasks and other modalities, integrating mechanisms for emitting genuinely new symbols (e.g., via pointer mechanisms), and scaling via specialized meta-training interleaved with standard large language model training to further improve systematicity.
Limitations
- Generalization beyond the meta-training distribution remains limited: MLC fails on productivity splits such as SCAN length and several COGS structural generalizations (100% error), indicating difficulty with much longer outputs and novel complex sentence structures.
- Sensitivity to meta-training design: The approach succeeds when test episodes are in-distribution relative to training episodes; out-of-distribution episodes are not handled reliably.
- Symbol generation: The current architecture lacks a native mechanism for emitting new symbols not present in the study examples; adding one would require pointer-like mechanisms.
- Scope and ecological validity: The approach has not been tested on the full complexity of natural language or across modalities; developmental plausibility of the specific meta-training regimen is limited.
- Bias coverage: MLC may not capture subtler inductive biases that it was not explicitly optimized for, as indicated by additional experiments in supplementary material.