Psychology
A foundation model to predict and capture human cognition
M. Binz, E. Akata, et al.
The human mind exhibits broad generality across tasks, learning from few examples, reasoning causally, and engaging in exploration. In contrast, most contemporary computational models in machine learning and cognitive science are narrowly specialized (for example, AlphaGo in game play, prospect theory in risky choice). To move toward unified theories of cognition, the authors propose building a data-driven, domain-general computational model that predicts and simulates human behaviour across diverse paradigms. This paper introduces Centaur, a foundation model of human cognition created by fine-tuning a large language model on an unprecedented corpus of trial-by-trial human behavioural data (Psych-101). The study evaluates whether such a model can consistently predict held-out participant behaviour, generalize to unseen experiments, and align internally with human neural activity, thereby advancing the goal of an integrated computational account of cognition.
The paper situates Centaur within longstanding efforts toward unified theories of cognition (Anderson; Newell) and recent calls to integrate prediction with explanation in computational social science and psychology. It contrasts domain-specific models (e.g., prospect theory; generalized context model; various reinforcement learning models) with the versatility of human cognition and the limitations of specialized AI systems (e.g., AlphaGo). It builds on prior work leveraging large language models for cognitive modeling and benchmarking (e.g., CogBench, metabench), and on foundational methods for parameter-efficient fine-tuning (QLoRA). The authors reference empirical paradigms spanning multi-armed bandits, decision-making, memory, categorization, instrumental learning, and social/economic games, as well as neuroscientific approaches linking model representations to brain activity in tasks like the two-step decision task and sentence reading.
Dataset construction (Psych-101): The authors manually transcribed trial-by-trial data from 160 psychological experiments into natural-language prompts, each containing the complete session history of a single participant, adhering to the original instructions with necessary simplifications and a maximum prompt length of ~32,768 tokens. Inclusion criteria were public availability of trial-level data, feasibility of text transcription without major information loss, and broad domain coverage. Psych-101 comprises 60,092 participants, 10,681,650 choices, and 253,597,411 text tokens across domains such as multi-armed bandits, decision-making, memory, supervised learning, and Markov decision processes.

Model and fine-tuning: Centaur is built on Llama 3.1 70B using quantized low-rank adaptation (QLoRA). The base model is frozen (4-bit quantization) and augmented with rank r = 8 low-rank adapters on all non-embedding linear layers (self-attention and feedforward). Adapter parameters amount to ~0.15% of the base model's parameters and are trained in half precision. Training ran for one epoch over Psych-101 with a cross-entropy loss, masking tokens that are not human responses so that optimization focuses on behaviour prediction. Effective batch size = 32, learning rate = 5e-5, weight decay = 0.01, 8-bit AdamW optimizer with linear warmup over the first 100 steps. Training took ~5 days on a single A100 80GB GPU and used the unsloth library. A smaller variant (Minitaur, based on Llama 3.1 8B) was trained with the same recipe.

Evaluation metric: Goodness-of-fit was assessed using negative log-likelihoods (NLLs) averaged over responses, with token-level values summed per response for multi-token responses. One-sided t-tests were used for hypothesis-driven comparisons (Centaur expected to outperform baselines). Additional analyses included noise-ceiling estimation (for context-independent settings) and contamination checks (LogProber), which indicated no evidence of pretraining memorization.

Baselines: Fourteen domain-specific cognitive/statistical models served as baselines, fitted jointly on training participants and evaluated on held-out participants. Out-of-distribution (OOD) evaluations used parameters fitted on the most similar in-distribution experiment (e.g., the spaceship two-step task for the magic-carpet two-step task; the horizon task for Maggie's farm; none for logical reasoning).

Open-loop simulations: Centaur was simulated in the horizon task (explore–exploit), the two-step task (model-free vs model-based learning), and social prediction games to assess whether its generated behaviour mirrors human distributions and, like human behaviour, fails to track the strategies of artificial agents.

Neural alignment analyses: Internal representations (residual stream, pre-choice and post-feedback) were extracted across layers, PCA-reduced (95% variance retained), and entered into regularized linear regressions to predict fMRI activity. For the two-step task dataset (94 participants, 300 choices; magic-carpet or abstract cover story), beta maps were averaged within ROIs (Schaefer 2018 atlas, 100 cortical plus subcortical ROIs; accumbens via the Harvard–Oxford atlas); preprocessing used fMRIPrep 24.0 with GLMs including subtrial regressors and motion/noise terms. For sentence reading (five participants, 1,000 sentences), publicly available code from the original study was reused with GPT-2 XL replaced by Centaur/Llama, and a meta-analysis assessed layer-wise correlation differences.
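The fine-tuning recipe described above can be sketched in a few lines. The paper states that the implementation used the unsloth library; the sketch below instead uses the Hugging Face transformers, peft, and bitsandbytes stack with the reported hyperparameters, and the repository id, LoRA alpha, per-device batch split, and response-masking helper are assumptions for illustration.

```python
# Minimal QLoRA fine-tuning sketch with the hyperparameters reported in the paper.
# The actual implementation used the unsloth library; module names, LoRA alpha,
# and the response-masking helper below are illustrative assumptions.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-3.1-70B"  # repository id assumed

# Frozen 4-bit base model with trainable low-rank adapters.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Rank-8 adapters on all non-embedding linear layers (self-attention and feedforward);
# lora_alpha is not reported in the summary and is a placeholder here.
lora = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.0, bias="none", task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
model = get_peft_model(model, lora)  # roughly 0.15% of parameters become trainable

def to_masked_example(text, response_spans):
    """Tokenize a Psych-101 prompt and set labels to -100 outside human responses,
    so the cross-entropy loss only scores behaviour prediction. `response_spans`
    (character offsets of the human responses) is an assumed preprocessing output."""
    enc = tokenizer(text, truncation=True, max_length=32768, return_offsets_mapping=True)
    labels = [-100] * len(enc["input_ids"])
    for i, (start, end) in enumerate(enc["offset_mapping"]):
        if any(s <= start and end <= e for s, e in response_spans):
            labels[i] = enc["input_ids"][i]
    return {"input_ids": enc["input_ids"], "labels": labels}

args = TrainingArguments(
    output_dir="centaur-sketch",
    num_train_epochs=1,
    per_device_train_batch_size=1,     # split of the effective batch size is assumed
    gradient_accumulation_steps=32,    # effective batch size = 32
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=100,
    lr_scheduler_type="linear",
    optim="adamw_bnb_8bit",
    bf16=True,
)
# Trainer(model=model, args=args, train_dataset=masked_psych101).train()
```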
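The neural-alignment analysis (layer-wise residual-stream activations, PCA to 95% retained variance, regularized regression onto ROI-averaged beta estimates) can likewise be sketched; the ridge penalty grid, cross-validation scheme, and variable names below are assumptions rather than the authors' exact pipeline.

```python
# Sketch of the neural-alignment analysis: PCA-reduced residual-stream features
# predicting ROI-level fMRI betas via ridge regression. The CV scheme and ridge
# penalty grid are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

def alignment_score(layer_states, roi_betas, n_splits=5):
    """layer_states: (n_trials, d_model) residual-stream activations at one layer.
       roi_betas:    (n_trials, n_rois) trial-wise beta estimates per ROI."""
    X = PCA(n_components=0.95).fit_transform(layer_states)  # keep 95% of the variance
    scores = []
    for train, test in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        reg = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X[train], roi_betas[train])
        pred = reg.predict(X[test])
        # Correlate predicted and observed activity per ROI, then average.
        r = [np.corrcoef(pred[:, j], roi_betas[test, j])[0, 1]
             for j in range(roi_betas.shape[1])]
        scores.append(np.mean(r))
    return float(np.mean(scores))

# Comparing Centaur against the base Llama model repeats this per layer, e.g.
# alignment_score(centaur_states[layer], betas) vs alignment_score(llama_states[layer], betas).
```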
Response-time modeling: Approximately 4,000,000 response times (RTs) were analyzed with linear mixed-effects models predicting log RTs from log response entropies derived from Centaur, Llama, or the cognitive models.

Model-guided scientific discovery: Using Psych-101 and Centaur as a predictive reference, the authors conducted a case study in multi-attribute decision-making. They prompted DeepSeek-R1 to generate behavioural explanations, formalized the result as a novel heuristic combination model, and then applied scientific regret minimization (SRM) to identify responses that Centaur predicts well but the candidate model misfits, yielding an interpretable weighted combination model that matched Centaur's predictive fit. Model comparisons used AIC and protected exceedance probability.
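For the response-time analysis described above, a minimal sketch of the mixed-effects regression is given below (using Python and statsmodels); the entropy construction from Centaur's choice probabilities and the random-intercept-per-participant structure are assumptions about details not spelled out here.

```python
# Sketch of the response-time analysis: per-trial response entropy from model
# choice probabilities, then a linear mixed-effects model for log RTs.
# The entropy construction and random-effects structure are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def response_entropy(option_probs):
    """Shannon entropy (nats) of a model's distribution over candidate responses."""
    p = np.asarray(option_probs, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log(p + 1e-12)))

def fit_rt_model(df: pd.DataFrame):
    """df is assumed to contain one row per trial with columns
    'participant', 'rt' (ms), and 'option_probs' (model probabilities per option)."""
    df = df.assign(
        log_rt=np.log(df["rt"]),
        log_entropy=np.log(df["option_probs"].apply(response_entropy) + 1e-12),
    )
    # Random intercept per participant; the paper's exact specification may differ.
    return smf.mixedlm("log_rt ~ log_entropy", data=df, groups=df["participant"]).fit()
```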
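The scientific-regret-minimization step amounts to ranking responses by how much worse an interpretable candidate model fits them than Centaur does, then revising the candidate model on those cases; the schematic below, including the top-k selection, is one reasonable way to implement that selection, not the authors' exact procedure.

```python
# Schematic of scientific regret minimization: find "predictable-but-misfit" responses,
# i.e. trials Centaur predicts well but the interpretable model does not, and use them
# to revise the interpretable model. The top-k selection is an illustrative assumption.
import numpy as np

def predictable_but_misfit(nll_interpretable, nll_centaur, top_k=100):
    """Both arguments: per-response negative log-likelihoods over the same trials."""
    regret = np.asarray(nll_interpretable) - np.asarray(nll_centaur)
    return np.argsort(-regret)[:top_k]  # largest regret first

# The revised model (here, a weighted combination of heuristics) is then refit and
# compared via AIC and protected exceedance probability, as in the paper's case study.
```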
- Psych-101 scale: 160 experiments; 60,092 participants; 10,681,650 choices; 253,597,411 tokens.
- Held-out participants: Fine-tuning improved goodness-of-fit across experiments, with an average improvement in negative log-likelihood of 0.14 (Centaur NLL = 0.44 vs Llama NLL = 0.58; one-sided t-test: t(1,985,732) = −144.22, P ≤ 0.0001; Cohen’s d = 0.20).
- Against domain-specific cognitive models: Centaur outperformed in all but one experiment; average difference = 0.13 (cognitive models NLL = 0.56; t(1,985,732) = −127.58, P ≤ 0.0001; Cohen’s d = 0.18).
- Open-loop behaviour: Horizon task performance comparable to humans (Centaur mean = 54.12, SD = 2.89; humans mean = 52.78, SD = 2.90; equivalence test, two one-sided t-tests with ±3-point margin: P = 0.02). Centaur exhibited human-like uncertainty-directed exploration. Two-step task produced bimodal distributions reflecting model-free/model-based mixtures. Social prediction: Centaur predicted human strategies at 64% accuracy but artificial agent strategies at 35% (t(230) = 20.32, P ≤ 0.0001), mirroring human outcomes.
- OOD generalization:
  - Modified cover story (two-step task, magic-carpet framing): Centaur NLL = 0.51; Llama = 0.63; hybrid cognitive model = 0.61 (t(9,701) = −24.7 vs Llama; t(9,701) = −20.7 vs the cognitive model; P ≤ 0.0001).
  - Modified problem structure (Maggie’s farm, a three-armed bandit): Centaur NLL = 0.42; Llama = 0.62; cognitive model = 0.98 (t(510,153) = −204.2 vs Llama; t(510,153) = −559.8 vs the cognitive model; P ≤ 0.0001).
  - Entirely new domain (logical reasoning, LSAT-like items): Centaur NLL = 1.65; Llama = 1.92 (t(198,406) = −50.39, P ≤ 0.0001; Cohen’s d = 0.23).
  - Additional OOD paradigms (moral decision-making, economic games, naturalistic category/reward learning, behavioural propensities, deep sequential decision tasks): Centaur consistently outperformed smaller and non-fine-tuned models; e.g., moral decision-making t(181,388) = −103.54, P ≤ 0.0001; economic games t(7,798) = −11.69, P ≤ 0.0001.
- Response times: Centaur-derived entropies explained more variance in log RTs (conditional R² = 0.87) than Llama (R² = 0.75; log BF = 53,773.5) and cognitive models (R² = 0.77; log BF = 14,995.5).
- Benchmarks: Centaur maintained/improved performance on ML benchmarks (metabench), including significant improvement on TruthfulQA (z = 2.312, P = 0.021). On CogBench, Centaur’s performance improved across tasks and became more human-like across ten behavioural metrics (e.g., model-basedness: z = 9.608, P ≤ 0.0001; temporal discounting: z = 2.594, P = 0.005).
- Neural alignment: In the two-step task fMRI dataset, Centaur’s internal representations predicted human neural activity better than Llama’s across layers (all pairwise one-sided t-tests, P ≤ 0.001). In sentence reading, inverse-weighted meta-analysis of layer-wise correlation differences showed a significant overall benefit (β = 0.007, 95% CI [0.0002, 0.013], P = 0.045), with peak predictability around layer 20.
- Model-guided scientific discovery: A heuristic combination model derived from DeepSeek-R1 improved over the original strategies (AIC = 181.7) but remained inferior to Centaur (AIC = 72.5). Scientific regret minimization then yielded an interpretable weighted combination model that matched Centaur (AIC = 71.7), with protected exceedance probability P = 0.83, suggesting that humans combine heuristics rather than relying on a weighted additive strategy.
Fine-tuning a state-of-the-art language model on large-scale, trial-level behavioural data yields a foundation model—Centaur—that reliably predicts human behaviour across diverse paradigms and populations. Centaur surpasses domain-specific cognitive models and the unfine-tuned base model on held-out participants, and crucially generalizes to experiments with altered narratives, task structures, and even to domains absent from training. Its ability to generate human-like open-loop behaviour and to align internal representations with human neural activity underscores the promise of data-driven, domain-general models for cognitive science. The case study demonstrates how Centaur can guide interpretable theory formation (via regret minimization), providing a blueprint for model-guided scientific discovery. Beyond behaviour prediction, Centaur’s response entropy tracks human response times, highlighting multi-measure predictive power. The work suggests that leveraging foundation models trained on human behaviour can bridge predictive performance with theoretical insights, advancing the field toward integrated cognitive theories and enabling applications such as in silico experimental design, power estimation, and hypothesis generation.
Centaur, derived by fine-tuning a large language model on Psych-101, consistently predicts and simulates human behaviour across many domains, passes extensive out-of-distribution checks, and exhibits increased alignment with human neural activity. It provides a strong, domain-general baseline for cognitive modeling and a practical tool for model-guided scientific discovery, enabling the development of interpretable theories that match predictive performance. The results support data-driven discovery as a promising path toward unified theories of cognition. Future work should probe Centaur’s internal representations to derive hypotheses about human knowledge and processing, expand Psych-101 to cover broader domains and individual differences, and explore training models from scratch on behavioural datasets to investigate cognitive architectures and the interplay of domain-general and domain-specific modules.
Current Psych-101 coverage is biased toward learning and decision-making, with limited inclusion of domains such as psycholinguistics, social psychology, and economic games (expansions are planned). Individual-difference variables (e.g., age, personality, SES) are not comprehensively encoded in the prompts, limiting personalized modelling. The dataset remains skewed toward WEIRD populations despite the inclusion of some cross-cultural studies and meta-studies. The natural-language transcription format introduces selection bias against experiments that cannot be faithfully expressed in text; moving to multimodal formats is a stated goal. Noise-ceiling estimation is challenging for sequential tasks, and standard ceilings may underestimate models that exploit context. Out-of-distribution baselines are constrained by parameter transfer from the most similar in-distribution paradigms and may not capture all structural differences. Although Centaur maintained performance on machine-learning benchmarks, broader assessments across tasks and architectures (including training from scratch) are needed to fully understand trade-offs and mechanistic interpretability.