
Psychology

Linear reinforcement learning in planning, grid fields, and cognitive control

P. Piray and N. D. Daw

This research by Payam Piray and Nathaniel D. Daw presents a model of decision-making in the brain that incorporates a temporally abstracted map of future events, enabling flexible choices that are nonetheless shaped by characteristic cognitive biases. The model integrates flexible replanning with cognitive control, offering a new perspective on how the brain handles long-range contingencies.
Introduction

The paper addresses how the brain produces both flexible planning and characteristic inflexibilities in sequential decision-making. Classical model-based reinforcement learning requires computationally intensive, nonlinear, iterative optimization due to interdependence of optimal choices across states (as formalized by the Bellman equation). Model-free or cached approaches save computation but lead to inflexibilities (habits, Pavlovian biases) and cannot fully account for flexible replanning. The authors propose a new computational framework—linear reinforcement learning (linear RL)—that reformulates the decision problem to allow efficient, approximate global policy optimization by introducing a soft bias around a default policy. The central research question is whether a stable, reusable, temporally abstract representation of long-run state expectancies (the default representation, DR) can support flexible replanning while naturally accounting for graded biases and cognitive control phenomena.

Literature Review

Prior work distinguishes model-based planning (computationally expensive) from model-free cached values, with theories suggesting selective deployment and reuse of prior computation to explain habits and compulsion. The successor representation (SR) caches long-run state occupancy under a fixed policy and can support certain revaluation tasks but fails when optimal policies change (policy dependence), limiting transfer to new goals. Grid cells and related entorhinal representations have been hypothesized to provide multiscale, temporally abstract bases (e.g., SR eigenvectors/graph Laplacian), yet their utility for flexible planning is unclear given SR’s policy dependence. Advances in control engineering (linearly solvable MDPs; Todorov, Kappen) show that adding KL-type control costs relative to a default policy linearizes the Bellman equation, enabling tractable solutions. The paper builds on these to propose DR as a stable map useful across goals, tying together findings in replanning, cognitive control (effort costs, Stroop), Pavlovian-instrumental transfer, and entorhinal grid/border cells.

Methodology

The authors introduce linear RL for tasks with a set of terminal (goal) states, initially assuming deterministically controllable dynamics (a one-to-one mapping from actions to successor states). They modify the objective to maximize cumulative reward minus a control cost proportional to the KL divergence between the decision policy and a default policy. This yields an analytically solvable, linear form of the Bellman equation in exponentiated-value space, exp(v) = M P exp(r), where v are the optimal values (under reward minus control cost), r are the rewards at terminal states, P encodes one-step transitions from nonterminal to terminal states, and M is the default representation (DR), which captures long-run closeness among nonterminal states under the default policy. The optimal policy takes the form of a weighted softmax around the default policy (a computational sketch follows the simulation list below). The simulation program comprised:

  • Performance benchmark: A seven-level binary decision tree with random one-step costs, comparing linear RL to the exact (model-based) solution and to depth-limited pruning approximations (D = 1–6), where D = 1 is equivalent to the SR under a random-walk policy.
  • Planning and transfer: Compute a single DR for tasks and reuse it to plan to arbitrary goals and update values under reward revaluation (e.g., Tolman latent learning) without recomputing DR. Demonstrate shortest-path planning in grid mazes using DR.
  • Policy revaluation: Three-stage task where new terminal rewards imply changes in optimal intermediate choices; test whether linear RL with fixed default policy re-plans correctly, in contrast to on-policy SR.
  • Transition revaluation (detour/shortcut): Introduce barriers that alter the transition graph and update the DR efficiently via a low-rank correction using the Woodbury matrix-inversion identity (M_new = M_old + M_B), where the rank of M_B equals the number of changed states (see the Woodbury sketch after this list). Re-plan with the updated DR.
  • Neural mapping: Propose DR eigenvectors as a basis for grid-like fields; predict stability to policy changes but sensitivity to environmental barriers. Model border cells as columns of the DR update term (M_B) corresponding to walls.
  • Stochastic extension: For stochastic transitions, first compute desired optimal transition probabilities u(s′|s) via linear RL, then find an action policy π that best realizes u given the environment’s transition kernel S(s′|a,s) (e.g., via matrix inversion or constrained optimization). Apply to the two-step task (common=0.7, rare=0.3) and construct a counterexample where the approximation fails.
  • Habits and default learning: Learn a non-uniform default policy via a slow delta rule from experienced transitions; examine how overtraining biases the decision policy, modulated by control-cost scaling parameter λ; test in four-room navigation revaluation scenarios.
  • Cognitive control: Quantify control costs (greater for actions unlikely under default); simulate Stroop-like biases and improvement under prospective reward by balancing reward against control cost.
  • Pavlovian-instrumental transfer (PIT): Learn default state contingencies during Pavlovian training; at test, the default policy biases choice toward outcome-congruent actions even when rewards are devalued or absent.

Implementation details, parameters, and equations are provided in Methods, including the formal derivation of the linear solution, update formulas for the DR and planning vectors, and simulation settings (e.g., λ typically 1; λ = 10 in some tasks to avoid numerical overflow; perseveration parameter 0.75 in the two-step task).
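To make the core computation concrete, here is a minimal Python sketch of the linear RL solution described above: it builds the DR from a default transition matrix and one-step costs, solves the linear Bellman equation exp(v) = M P exp(r), and reweights the default policy by the resulting values. The function and variable names and the toy environment are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def linear_rl(T, r, terminal, lam=1.0):
    """Solve a linearly solvable MDP by soft optimization around a default policy.

    T        : (n, n) default-policy transition matrix over all states
    r        : (n,) rewards (one-step costs are negative rewards)
    terminal : (n,) boolean mask marking terminal (goal) states
    lam      : control-cost scaling parameter (lambda)
    """
    N, G = ~terminal, terminal
    T_NN = T[np.ix_(N, N)]   # default transitions among nonterminal states
    P = T[np.ix_(N, G)]      # one-step transitions from nonterminal to terminal states

    # Default representation (DR): long-run, cost-discounted closeness among
    # nonterminal states under the default policy.
    M = np.linalg.inv(np.diag(np.exp(-r[N] / lam)) - T_NN)

    # Linear Bellman equation in exponentiated-value space:
    #   exp(v_N / lam) = M @ P @ exp(r_T / lam)
    z = np.zeros(len(r))
    z[G] = np.exp(r[G] / lam)
    z[N] = M @ P @ z[G]
    v = lam * np.log(z)

    # Decision policy: a weighted softmax around the default policy,
    # i.e., the default transitions reweighted by exp(v / lam).
    pi = T * z[None, :]
    row_sums = pi.sum(axis=1, keepdims=True)
    pi = np.divide(pi, row_sums, out=np.zeros_like(pi), where=row_sums > 0)
    return v, pi, M

# Toy environment (hypothetical): three nonterminal states plus one goal,
# with a uniform random-walk default policy and unit step costs.
T = np.array([[0.0, 0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.5],
              [0.5, 0.0, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.0]])
r = np.array([-1.0, -1.0, -1.0, 10.0])
v, pi, M = linear_rl(T, r, terminal=np.array([False, False, False, True]))
```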
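A companion sketch, under the same assumptions, of the low-rank DR update used for transition revaluation (the Woodbury step referenced in the list above): only the rows of the nonterminal transition matrix belonging to states adjacent to the change differ, so the correction term M_B has rank equal to the number of affected states, and its columns for wall states are the quantities proposed to resemble border-cell fields. Function and variable names are again illustrative.

```python
import numpy as np

def update_dr(M_old, T_old_NN, T_new_NN):
    """Woodbury (low-rank) update of the DR after a change in the
    nonterminal transition graph, e.g., insertion of a barrier.

    M_old is the inverse of A = diag(exp(-r_N / lam)) - T_old_NN; when only
    a few rows of T_NN change, the new DR is M_old plus a low-rank correction.
    """
    changed = np.where(np.any(T_new_NN != T_old_NN, axis=1))[0]
    k, n = len(changed), M_old.shape[0]

    # Write the change as Delta_A = U @ V, where U selects the changed rows
    # and V holds the (negated) row differences of the transition matrix.
    U = np.zeros((n, k))
    U[changed, np.arange(k)] = 1.0
    V = -(T_new_NN - T_old_NN)[changed, :]

    # Woodbury identity:
    #   (A + U V)^{-1} = A^{-1} - A^{-1} U (I + V A^{-1} U)^{-1} V A^{-1}
    core = np.linalg.inv(np.eye(k) + V @ M_old @ U)
    M_B = -M_old @ U @ core @ V @ M_old   # low-rank correction (border-cell-like columns)
    return M_old + M_B, M_B               # M_new = M_old + M_B
```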
Key Findings
  • Efficiency and accuracy: Linear RL achieved near-optimal average costs on a challenging seven-level decision tree, outperforming depth-limited model-based approximations (D = 1–6) and closely approaching the exhaustive optimal solution. The D = 1 approximation (equivalent to the SR under a random walk) was substantially worse than linear RL.
  • Reuse for flexible planning: A single DR supports planning to multiple goals and recomputation of values under reward revaluation without updating the DR; in spatial mazes, linear RL found shortest paths between arbitrary start–goal pairs, matching exhaustive search.
  • Policy revaluation: Linear RL correctly adapted top-level choices in a multistage task when terminal rewards changed, overcoming the interdependence of optimal actions across states; the on-policy SR fails here unless its successor matrix is recomputed for the new policy.
  • Transition revaluation: With barriers altering transitions (Tolman’s detour), DR was efficiently updated via a low-rank correction using Woodbury identity, enabling correct replanning (e.g., choosing a detour path) without recomputing the entire map.
  • Neural predictions: DR eigenvectors form periodic, grid-like bases that are stable across policy changes but sensitive to changes in environmental geometry (barriers), in line with empirical findings (e.g., hairpin maze vs. virtual hairpin); see the eigenvector sketch after this list. Columns of the DR update term for walls resemble entorhinal border-cell fields, unifying grid and border cells as basis functions for a compositional cognitive map.
  • Stochastic domains: Extended linear RL reproduced model-based signatures in the two-step task (reward-by-transition interaction on stay/switch). A constructed counterexample revealed potential failures when action stochasticity critically alters predecessor-state optimal choices.
  • Habits and soft biases: Learning a non-uniform default policy via overtraining produced graded habitual biases in choices; overtraining helped when new goals were near old ones (same room), but harmed when goals moved to distant regions (different room), yielding suboptimal, habitual trajectories.
  • Cognitive control: Control costs (KL divergence from the default policy) naturally quantify effort; actions that are less likely under the default incur higher costs and more errors (Stroop-like effects), and increasing prospective reward improves performance by offsetting the control cost, consistent with empirical data (see the control-cost sketch after this list).
  • Pavlovian-instrumental transfer: Default policy learned during Pavlovian phase biased instrumental choices toward outcome-congruent actions during test, even under reward devaluation or absence, matching outcome-specific PIT phenomena.
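As a rough illustration of the grid-field prediction (the neural-predictions item above), the following sketch builds the DR for an open square arena under a uniform random-walk default policy and extracts its leading eigenvectors, which are periodic over space and can serve as grid-like basis functions. The arena size, step cost, and function name are assumptions made for the example.

```python
import numpy as np

def dr_eigenvectors(n=20, step_cost=0.1, lam=1.0, k=6):
    """Leading eigenvectors of the DR for an open n-by-n arena under a
    uniform random-walk default policy (all states nonterminal)."""
    size = n * n
    T = np.zeros((size, size))
    for i in range(n):
        for j in range(n):
            nbrs = [(i + di, j + dj) for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                    if 0 <= i + di < n and 0 <= j + dj < n]
            for ii, jj in nbrs:
                T[i * n + j, ii * n + jj] = 1.0 / len(nbrs)

    # DR with a uniform step cost: M = (exp(step_cost / lam) * I - T)^{-1}
    M = np.linalg.inv(np.exp(step_cost / lam) * np.eye(size) - T)

    # Eigenvectors ordered by (real part of) eigenvalue; the arena boundary
    # makes T slightly asymmetric, so keep the real parts for plotting.
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)
    return [evecs[:, order[i]].real.reshape(n, n) for i in range(k)]
```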
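The control-cost account (the cognitive-control item above) reduces, for a single choice state, to a softmax around the default policy with a KL-divergence cost for deviating from it. The sketch below uses hypothetical Stroop-like numbers: the default strongly favors the prepotent word-reading response, and larger prospective rewards pull the decision policy toward the effortful color-naming response.

```python
import numpy as np

def controlled_policy(r, pi_default, lam=1.0):
    """Decision policy at a single choice state: the default policy
    reweighted by exp(reward / lam) and renormalized (weighted softmax)."""
    w = pi_default * np.exp(r / lam)
    return w / w.sum()

def control_cost(pi, pi_default, lam=1.0):
    """Control cost: lam times the KL divergence of the decision policy
    from the default policy."""
    return lam * np.sum(pi * np.log(pi / pi_default))

# Hypothetical Stroop-like setting: responses = [word reading, color naming].
pi_default = np.array([0.9, 0.1])   # prepotent default favors word reading
for reward in (1.0, 2.0, 4.0):      # prospective reward for correct color naming
    r = np.array([0.0, reward])
    pi = controlled_policy(r, pi_default)
    print(f"reward={reward}: P(color naming)={pi[1]:.2f}, "
          f"control cost={control_cost(pi, pi_default):.2f}")
```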
Discussion

The findings demonstrate that linear RL offers a tractable, neurally plausible approximation to model-based planning by introducing control costs relative to a default policy, solving the interdependence of optimal actions across states with a linear computation. The DR serves as a stable, reusable cognitive map that supports one-shot replanning under changing goals and some changes in structure, unifying flexible planning with graded biases. This framework accounts for behavior across revaluation tasks, cognitive control phenomena, and Pavlovian biases, and links to entorhinal grid and border cells as basis functions supporting compositional map representations. The model predicts stability of grid-like codes to policy changes, sensitivity to barriers, and modulation by path costs (e.g., terrain), offering testable neurobiological hypotheses. In stochastic settings, linear RL often suffices but may fail when stochasticity at one state critically propagates to predecessor choices, predicting increased errors or reliance on slower iterative planning. Overall, the approach bridges flexible planning and control-demand costs within a single normative framework and suggests how different neural subsystems (map learning, value prediction, policy learning, control) may interact.

Conclusion

The paper introduces linear RL, a unified framework that enables efficient approximate planning by soft optimization around a default policy and a stable default representation (DR) of long-run state expectancies. This model: (i) supports flexible replanning across changing goals and certain transition changes via efficient low-rank updates; (ii) explains graded behavioral biases, habits, and cognitive control costs; (iii) offers a computational account of entorhinal grid and border cells as basis functions for compositional cognitive maps; and (iv) extends to common stochastic tasks. Future directions include: refining neural mechanisms for learning and representing the DR (e.g., eigen-decompositions, attractor/recurrent implementations, Woodbury-based composition), exploring parameter scaling (λ) in relation to neural precision and normalization, integrating information-theoretic and planning-as-inference perspectives, identifying principles for selecting/learning adaptive default policies, extending methods to broader stochastic environments, and testing neurobehavioral predictions (e.g., terrain cost effects on grid codes, conditions eliciting iterative planning).

Limitations
  • Deterministic controllability: Core formulation assumes a one-to-one mapping between actions and successor states; stochastic tasks require an additional approximation step and can yield failures when action stochasticity has upstream effects.
  • Terminal vs. nonterminal partition: Linear solvability requires distinguishing terminal goal states; multi-goal sequencing must be treated episodically or via subgoals.
  • DR dependence and updates: Although DR is stable across goals and weakly dependent on default policy, environmental transition changes (e.g., barriers) require updates; strong overtraining can eventually bias DR via default policy changes.
  • Approximation trade-offs: Control-cost scaling (λ) balances fidelity versus numerical/biological constraints; extreme values reduce to exact optimization or default dominance.
  • Representation choices: Eigen-decomposition is a descriptive tool; not all grid-like properties are captured, and partial bases lose information; compositional border-cell updates may need nonlinearities for translation.
  • Neural implementation and scope: While computations are linear and plausibly neurally implementable, precise neural circuits and mechanisms remain to be established, and different contexts may recruit distinct systems.
  • Specific failures: Constructed tasks reveal errors in stochastic settings (optimistic valuations of uncontrollable transitions), predicting conditions where humans may err or require slower planning.