Psychology
A foundation model to predict and capture human cognition
M. Binz, E. Akata, et al.
The paper addresses the challenge of moving from domain-specific computational and cognitive models to a unified approach capable of predicting human behaviour across diverse domains. Existing systems (e.g., AlphaGo) and theories (e.g., prospect theory) excel in narrow settings but fail to generalize to learning, planning, exploration, and other facets of cognition. The authors propose that progress towards a unified theory requires a general computational model that predicts and simulates human behaviour across arbitrary experiments. They introduce Centaur, a foundation model of human cognition, created by fine-tuning a large language model on Psych-101, a large corpus of trial-by-trial human behavioural data transcribed into natural language across 160 canonical psychological experiments (including multi-armed bandits, decision-making, memory, supervised learning, Markov decision processes, and others). The goal is to demonstrate robust prediction on held-out participants and experiments, generalization to new cover stories and task structures, transfer to entirely new domains, and improved alignment with neural activity—thereby supporting the feasibility of domain-general predictive cognitive models.
The work situates itself within efforts to build general models of human cognition and behaviour, referencing foundational calls for unified theories of cognition and contrasting with domain-specific paradigms. It cites: domain-specific success in AI (e.g., AlphaGo) and cognitive models (e.g., prospect theory); cognitive-science paradigms for exploration (e.g., horizon task), reinforcement learning (two-step task), and decision-making; and recent work using large-scale behavioural datasets and machine learning to discover decision-making theories. It also connects to literature on large language models (Llama 3.1), parameter-efficient fine-tuning (QLoRA), evaluation benchmarks (CogBench, metabench), human-alignment analyses, and neural prediction studies using model representations. The paper argues that previous models lack the breadth to generalize across diverse tasks, cover stories, and domains, motivating a foundation-model approach grounded in large-scale trial-level behavioural data.
Data: Psych-101 was constructed by manually transcribing 160 psychological experiments into natural-language prompts, each containing the full trial-by-trial session history of a single participant. Inclusion criteria: publicly available trial-level data; transcribable without substantial information loss; broad domain coverage. Psych-101 comprises >60,000 participants, >10,000,000 choices, and 253,597,411 tokens across domains such as multi-armed bandits, decision-making, memory, supervised learning, and Markov decision processes (MDPs).
Model: Centaur is built by fine-tuning Llama 3.1 70B with QLoRA (quantized low-rank adaptation). The base model is frozen in 4-bit quantization; low-rank adapters (rank r = 8) are added to all non-embedding linear layers of the attention and feed-forward networks, introducing ~0.15% additional trainable parameters (stored in half precision). The training objective is cross-entropy, with the loss masked for all tokens except human responses, focusing learning on behavioural prediction rather than instruction completion. Training ran for one epoch (to avoid overfitting) with an effective batch size of 32, learning rate 5e-5, weight decay 0.01, and an 8-bit AdamW optimizer with linear warm-up over the first 100 steps. Implemented with the Unsloth library; training took ~5 days on a single A100 80GB GPU. A smaller 8B variant (Minitaur) was trained with the same recipe for prototyping.
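The response-only loss masking described above can be sketched in a few lines. This is an illustrative stand-in (not the authors' code), using the -100 ignore-index convention from PyTorch-style cross-entropy; the `response_spans` argument and function name are assumptions for the example.

```python
# Illustrative sketch of response-only loss masking: every token outside a
# human response gets label -100, the conventional ignore index for
# PyTorch-style cross-entropy, so only response tokens contribute to the loss.

IGNORE_INDEX = -100

def mask_labels(token_ids, response_spans):
    """Copy token_ids into labels, keeping only tokens inside response spans.

    response_spans: list of (start, end) index pairs (end exclusive) marking
    the human-response tokens within the transcribed prompt.
    """
    labels = [IGNORE_INDEX] * len(token_ids)
    for start, end in response_spans:
        labels[start:end] = token_ids[start:end]
    return labels

# Example: a 6-token sequence where tokens 4-5 encode the participant's choice.
tokens = [101, 17, 42, 9, 300, 301]
labels = mask_labels(tokens, [(4, 6)])
```

With this masking, gradient signal comes only from predicting what the participant did, not from reproducing the instructions or task narration.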
Evaluation metric: Negative log-likelihood (NLL) averaged over responses; for multi-token responses, token log-likelihoods were summed per response before averaging. One-sided t-tests (reflecting the directional hypothesis that Centaur outperforms its competitors) assessed significance.
Baselines: 14 domain-specific cognitive/statistical models (e.g., generalized context model, prospect theory, various RL models) were fitted on training participants (joint parameters), and evaluated on held-out participants using NLL. For out-of-distribution (OOD) evaluations, baseline parameters were transferred from the most similar in-distribution experiment (e.g., two-step task with spaceship cover story for magic carpet; horizon task for Maggie’s farm). No baseline was used for logical reasoning (no close proxy in training data).
Generalization tests: (1) Held-out participants within Psych-101 experiments; (2) Modified cover story (two-step task with magic-carpet cover story); (3) Modified task structure (three-armed bandit “Maggie’s farm”); (4) Entirely new domain (logical reasoning/LSAT items); plus six additional OOD paradigms (moral decision-making, economic games, naturalistic category/reward learning, behavioural propensities, deep sequential decision task).
Open-loop simulations: Centaur responses were fed back into the model to test behavioural generation in horizon task, two-step task, and a social prediction game (predicting strategies of humans vs artificial agents with matched statistics).
Response times: ~4,000,000 RTs were analyzed; three linear mixed-effects models predicted log RT from log response entropy derived from Centaur, Llama, and cognitive models; conditional R² and Bayes factors compared fits.
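The trial-level predictor in this analysis is the entropy of a model's predictive distribution over the available responses; a minimal sketch of that quantity (the mixed-effects fitting itself is not reproduced here, and the function name is an assumption):

```python
import math

def response_entropy(choice_probs):
    """Shannon entropy (in nats) of a model's predictive distribution
    over the available responses on a single trial. Its logarithm serves
    as the trial-level predictor of log RT in the mixed-effects models."""
    return -sum(p * math.log(p) for p in choice_probs if p > 0)

# A uniform two-choice distribution gives the maximal entropy log(2);
# a confident prediction gives entropy near zero.
h_uniform = response_entropy([0.5, 0.5])
h_confident = response_entropy([0.99, 0.01])
```

The intuition being tested: trials where the model is uncertain (high entropy) should be trials where humans deliberate longer, so a model whose uncertainty better tracks human difficulty yields a better RT fit.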
Neural alignment: fMRI data from two prior datasets were used: (i) the two-step task with magic-carpet or abstract cover stories (neither in training); (ii) a sentence-reading task (1,000 sentences per participant). Regularized linear regression models predicted ROI-aggregated fMRI activity from the models' internal representations (residual stream), extracted at different layers and PCA-reduced to retain 95% of variance. ROIs were defined with the Schaefer 2018 atlas (100 ROIs) plus the Harvard-Oxford atlas for the accumbens; preprocessing used fMRIPrep 24.0, with GLMs per subtrial including standard nuisance regressors and the SPM HRF.
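Two pieces of this pipeline are easy to sketch in isolation: choosing how many principal components retain 95% of variance, and a regularized (ridge) fit. Both functions below are toy stand-ins under stated simplifications (one feature, no intercept), not the actual analysis code.

```python
def n_components_for_variance(eigenvalues, threshold=0.95):
    """Smallest number of principal components whose cumulative explained
    variance reaches the threshold (95% in the paper's pipeline)."""
    total = sum(eigenvalues)
    cumulative = 0.0
    for i, v in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += v
        if cumulative / total >= threshold:
            return i
    return len(eigenvalues)

def ridge_1d(x, y, lam=1.0):
    """Closed-form ridge fit for a single (already PCA-reduced) feature
    predicting ROI-aggregated activity: w = x.y / (x.x + lam)."""
    xy = sum(a * b for a, b in zip(x, y))
    xx = sum(a * a for a in x)
    return xy / (xx + lam)

n = n_components_for_variance([8.0, 1.0, 0.5, 0.5])
w = ridge_1d([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], lam=0.0)
# with lam=0 this reduces to the ordinary least-squares slope
```

In the actual analysis the regression is multivariate (many retained components per layer) and fitted per ROI, but the logic is the same: compress the residual-stream representation, then linearly map it to brain activity.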
Benchmarks: Performance on metabench tasks (ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, Winogrande) and CogBench (performance and behavioural metrics) compared Centaur vs base Llama.
Model-guided scientific discovery: Psych-101 and Centaur were used to refine cognitive models on a multi-attribute decision-making dataset. DeepSeek-R1 generated verbal strategy hypotheses, which were formalized and then iteratively improved via scientific regret minimization using Centaur as a reference to identify mispredicted responses; the final interpretable model combined heuristics with a weighted switch and was evaluated by AIC and group-level protected exceedance probability.
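The core loop of scientific regret minimization can be sketched as a trial-flagging step: find the trials where the interpretable model predicts poorly but the reference model (Centaur) predicts well, and inspect those to propose refinements. The `margin` parameter and function name below are assumptions for illustration.

```python
def regret_trials(interpretable_nll, centaur_nll, margin=0.5):
    """Flag trials with high 'scientific regret': the interpretable model's
    per-trial NLL exceeds the reference model's by more than `margin`.
    Returns trial indices sorted by the NLL gap, largest first; these are
    the cases a theorist inspects to refine the interpretable model."""
    gaps = [(i, m - c)
            for i, (m, c) in enumerate(zip(interpretable_nll, centaur_nll))
            if m - c > margin]
    return [i for i, _ in sorted(gaps, key=lambda t: -t[1])]

# Trials 1 and 3 are where the heuristic model fails relative to Centaur.
flagged = regret_trials([0.2, 2.0, 0.3, 1.5], [0.2, 0.4, 0.3, 0.3])
```

Iterating this loop (flag, inspect, revise the heuristic, refit) is what drove the candidate model's AIC down toward Centaur's in the case study.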
- Held-out participants (in-distribution): Fine-tuning improved goodness-of-fit across experiments. Average difference in log-likelihoods after fine-tuning was 0.14 (Centaur NLL 0.44 vs Llama NLL 0.58; one-sided t-test: t(1,985,732) = –144.22, P ≤ 0.0001; Cohen’s d = 0.20). Centaur outperformed domain-specific cognitive models in all but one experiment; average improvement 0.13 (cognitive models NLL 0.56; t(1,985,732) = –127.58, P ≤ 0.0001; d = 0.18).
- Open-loop simulations: Horizon task performance matched humans (Centaur mean = 54.12, s.d. = 2.89; Humans mean = 52.78, s.d. = 2.90; equivalence test TOST ±3 points, P = 0.02) and exhibited uncertainty-directed exploration similar to humans. Two-step task simulations reproduced human-like distributions of model-basedness (including bimodality). In social prediction, Centaur predicted human responses well (64% accuracy) but not artificial agents (35%; one-sided t(230) = 20.32, P ≤ 0.0001), mirroring human results.
- OOD generalization: • Modified cover story (two-step, magic-carpet): Centaur NLL 0.51 vs Llama 0.63 vs cognitive model 0.61; t(9,701) = –24.7 vs Llama, P ≤ 0.0001; t(9,701) = –20.7 vs cognitive model, P ≤ 0.0001. • Modified structure (Maggie’s farm, three-armed bandit): Centaur NLL 0.42 vs Llama 0.62 vs cognitive 0.98; t(510,153) = –204.2 vs Llama; t(510,153) = –559.8 vs cognitive; both P ≤ 0.0001. • Entirely new domain (logical reasoning/LSAT): Centaur NLL 1.65 vs Llama 1.92; one-sided t(198,406) = –50.39, P ≤ 0.0001; d = 0.23. • Six additional OOD paradigms: Centaur robustly improved NLLs vs Llama (e.g., moral decision-making t(181,388) = –103.54; economic games t(7,798) = –11.69; naturalistic category learning t(21,838) = –14.05; behavioural propensities t(156,230) = –11.06; naturalistic reward learning t(9,838) = –12.63; deep sequential decision task t(6,092) = –1.06, P = 0.144).
- Response times: RTs were better predicted by Centaur-derived response entropy (conditional R² = 0.87) than by Llama (0.75; log BF_Centaur,Llama = 53,773.5) or cognitive models (0.77; log BF_Centaur,cognitive = 14,995.5).
- Neural alignment: Centaur’s internal representations predicted human fMRI signals better than Llama across layers in the two-step task dataset (all pairwise one-sided t-tests, P ≤ 0.001). In sentence reading, a meta-analysis of correlation differences across layers showed a significant overall benefit of Centaur over Llama (β = 0.007, 95% CI [0.0002, 0.013], P = 0.045), with peak predictability around layer 20.
- Benchmarks: Centaur maintained base performance on metabench tasks (no degradation; TruthfulQA improved significantly: z = 2.312, P = 0.021). On CogBench, Centaur improved performance metrics across experiments (e.g., horizon task z = 22.176, P ≤ 0.0001) and moved closer to human behavioural parameter values (e.g., model-basedness z = 9.608, P ≤ 0.0001; temporal discounting z = 2.594, P = 0.005).
- Model-guided discovery: A DeepSeek-R1-inspired two-step heuristic model outperformed original candidate strategies but trailed Centaur (AIC 181.7 vs Centaur 72.5). Using scientific regret minimization guided by Centaur led to a weighted-heuristics model matching Centaur (AIC 71.7) and achieving protected exceedance probability P = 0.83 in group-level model selection.
The findings demonstrate that fine-tuning a large language model on a broad, trial-level behavioural corpus produces a general predictive model—Centaur—that captures human behaviour across diverse tasks and domains. Centaur surpasses domain-specific cognitive models and the non-fine-tuned base model on held-out participants and generalizes to altered cover stories, task structures, and entirely new domains. Open-loop simulations show human-like exploration and learning profiles, while failures in predicting artificial-agent behaviour indicate specificity to human patterns. Beyond choice prediction, Centaur better explains response times through entropy–RT relationships and exhibits enhanced alignment of internal representations with human neural activity in both task-based and language-processing fMRI datasets. The case study shows how Centaur can guide the development of interpretable cognitive models via scientific regret minimization, leveraging predictive gaps. Together, these results suggest that data-driven, domain-general models can complement and inform theory, serving as a foundation for automated cognitive science workflows, experimental prototyping, and hypothesis generation about human knowledge representation and information processing.
Centaur, trained on Psych-101 via parameter-efficient fine-tuning, functions as a foundation model of human cognition that robustly predicts and simulates human behaviour across a wide range of psychological paradigms, generalizes OOD, preserves base model capabilities on standard ML benchmarks, and shows improved neural alignment. The work demonstrates that domain-general, predictive models can outperform many domain-specific cognitive models and can guide scientific discovery toward interpretable theories. Future directions include probing Centaur’s internal representations (e.g., via sparse autoencoders, attention visualization), training alternative architectures directly on Psych-101 to study cognitive architectures, expanding Psych-101 to additional domains (psycholinguistics, social psychology, economic games), incorporating individual-differences metadata to model personalized behaviour, broadening cross-cultural coverage beyond WEIRD samples, and moving towards multimodal standardized data formats to reduce selection bias against non-text-expressible experiments.
- Dataset scope: Psych-101 currently emphasizes learning and decision-making; psycholinguistics, social psychology, and additional economic game paradigms are underrepresented and planned for future inclusion.
- Population bias: The dataset remains skewed towards WEIRD populations despite some cross-cultural/meta-studies, potentially limiting generalizability.
- Modality bias: Transcription into natural language favors paradigms expressible in text, introducing selection bias against tasks requiring richer modalities; a multimodal format is proposed as a long-term solution.
- Individual differences: Limited availability of participant-level metadata (e.g., age, traits, SES) restricts modelling of individual variability; plans include integrating such information in prompts to capture individual differences.
- Baseline transferability: For some OOD settings (e.g., logical reasoning), suitable domain-specific baseline models are unclear, complicating certain comparisons.
- Smaller model generalization: The 8B Minitaur variant performs well near training distribution but is less robust OOD, indicating scale and fine-tuning choices matter for generalization.