Interdisciplinary Studies
Memory consolidation from a reinforcement learning perspective
J. W. Lee and M. W. Jung
The paper examines why and how some memories are selectively consolidated while others fade, revisiting classic and contemporary theories of systems consolidation. Standard systems consolidation theory posits that memories initially stored in the hippocampus are gradually reorganized across neocortical networks, reducing hippocampal dependence over time. However, observations of persistent hippocampal involvement in vivid episodic recall across the lifespan and the lack of a clear temporal gradient in retrograde amnesia challenge this view. Alternative accounts include multiple trace theory, which argues for enduring hippocampal dependence for detailed episodic memories via new traces formed at each retrieval, and trace transformation theory, which emphasizes lifelong, bidirectional hippocampal–neocortical interactions and coexisting memory forms. A central unresolved issue is selectivity: not all experiences are consolidated equally. Emotional arousal and stress hormones (e.g., epinephrine, cortisol), via the amygdala, modulate consolidation, favoring behaviorally significant events, but this mechanism lacks specificity over broad time windows. The authors propose a different perspective grounded in two lines of evidence: the hippocampus supports imagination/future simulation and CA1 encodes robust value signals. They outline a simulation-selection model of CA3–CA1 function and relate it to the Dyna reinforcement learning framework, suggesting that consolidation can be viewed as offline reinforcement learning that selectively reinforces high-value future strategies.
Evidence for hippocampal support of imagination: Human studies show that hippocampal damage impairs construction of detailed hypothetical episodes; the hippocampus, as part of the default mode network, is engaged during both autobiographical recall and future thinking. In rodents, hippocampal replay during rest/sleep reactivates past trajectories and can recombine them into novel sequences; preplay suggests anticipatory activation of sequences before actual experience, indicating forward models for prediction and planning.

Evidence for value representation in the hippocampus: Imaging in humans reveals value-related BOLD signals in the hippocampus alongside classical valuation regions. Electrophysiology in rats and monkeys and calcium imaging in mice show robust value coding in CA1, comparable to orbitofrontal cortex and striatum. CA1 integrates choice and reward signals, and its value signals temporally overlap with reward prediction error components. Along the transverse axis, CA1 exhibits stronger value signals than CA3 or subiculum; CA1 place fields remap with reward changes, and chemogenetic inactivation of CA1 impairs value learning, highlighting CA1's critical valuation role. Collectively, these studies across species support conserved hippocampal valuation processes and set the stage for a CA3–CA1 division of labor in simulation and selection.
This is a perspective/theoretical article. The authors synthesize empirical findings from human neuroimaging, rodent and primate electrophysiology/calcium imaging, and hippocampal circuit anatomy/physiology to motivate a computational framework. They propose the simulation-selection model: CA3, with extensive weak recurrent collaterals and dynamics during sharp-wave ripples, stochastically generates diverse experienced and novel activity sequences; CA1, lacking strong recurrence but carrying robust value signals, evaluates and selectively strengthens high-value sequences. They map this division of labor onto the Dyna reinforcement learning architecture, wherein simulated experiences (CA3) complement direct experience to update value functions/policies (CA1). The paper outlines testable predictions (e.g., stronger value dependence of CA1 vs. CA3 replay; effects of blocking CA3–CA1 plasticity on CA1 replay value dependence) but does not conduct new experiments.
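The Dyna mapping the authors invoke can be illustrated with a minimal tabular Dyna-Q loop. The sketch below is purely illustrative and not from the paper: the environment (a five-state chain), the hyperparameters, and all names are assumptions. The learned one-step model plays the CA3-like role of generating simulated transitions, while the Q-value updates play the CA1-like role of evaluating and reinforcing them.

```python
import random

# Toy Dyna-Q sketch (illustrative assumptions throughout; not the paper's code).
# The learned one-step "model" stands in for CA3-like simulation; the Q-value
# updates stand in for CA1-like evaluation of simulated experience.

N_STATES = 5                 # small chain; reward on entering the last state
ACTIONS = [0, 1]             # 0 = step left, 1 = step right
GAMMA, ALPHA = 0.9, 0.5
N_PLANNING = 20              # offline "replay" updates per real step

def env_step(s, a):
    """Deterministic chain environment; reward only at the rightmost state."""
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0)

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
model = {}                   # (s, a) -> (s', r), filled from real experience

random.seed(0)
s = 0
for _ in range(200):
    a = random.choice(ACTIONS)          # random behavior; Q-learning is off-policy
    s2, r = env_step(s, a)
    # Direct update from real experience
    Q[(s, a)] += ALPHA * (r + GAMMA * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
    model[(s, a)] = (s2, r)
    # Planning: replay simulated transitions drawn from the learned model
    for _ in range(N_PLANNING):
        (ps, pa), (ps2, pr) = random.choice(list(model.items()))
        Q[(ps, pa)] += ALPHA * (pr + GAMMA * max(Q[(ps2, b)] for b in ACTIONS) - Q[(ps, pa)])
    s = 0 if s2 == N_STATES - 1 else s2  # restart after reaching the goal

# Greedy policy after learning; it should prefer moving right toward reward.
greedy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)]
```

The planning loop is the point of contact with consolidation: most value updates come from simulated, not direct, experience, so limited exposure can still yield an optimized policy.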
- The hippocampus contributes to imagination and future simulation: replay and preplay phenomena indicate construction and recombination of sequences beyond direct experience.
- CA1 robustly encodes value across species and surpasses CA3/subiculum in value signal strength; CA1 integrates choice and reward timing consistent with valuation and learning roles. Inactivation of CA1 (but not CA3) impairs incremental value learning without disrupting value-dependent action selection.
- During sharp-wave ripples (SWRs), CA3 initiates replays including novel combinations; CA1 preferentially reactivates and strengthens sequences associated with reward or high value (e.g., trajectories to reward locations), with reward increasing replay rate and fidelity and enhancing consolidation of associated content.
- Circuit-level quantitative features support the model: CA3 pyramidal neurons receive up to ~12,000 recurrent collateral synapses (≈75% of excitatory inputs); EC layer II perforant path inputs to CA3 ≤3,750 synapses; mossy fiber inputs ~46 synapses. CA1 receives Schaffer collateral inputs ~12,000 synapses and EC layer III perforant path inputs ≤2,000 synapses, but lacks strong recurrent collaterals.
- Proposed simulation-selection model: CA3 serves as a generator of diverse experienced and unexperienced sequences; CA1 evaluates and selectively reinforces high-value sequences, prioritizing representations that support optimal future decisions.
- Conceptual alignment with Dyna reinforcement learning: consolidation construed as offline RL that leverages simulations from limited experiences to accelerate value learning and strategy optimization.
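The simulation-selection division of labor summarized above can be caricatured in a few lines: a stochastic generator proposes candidate state sequences (the CA3-like step), and only the highest-valued candidates are retained for "consolidation" (the CA1-like step). Every name, state, and value below is an illustrative assumption, not an implementation from the paper.

```python
import random

# Caricature of the simulation-selection model (all values/names are assumed).
random.seed(1)
STATES = ["start", "left", "right", "goal"]
value = {"start": 0.0, "left": 0.1, "right": 0.4, "goal": 1.0}  # CA1-like state values

def generate_sequence(length=3):
    """CA3-like stochastic generator: random recombinations of visited states."""
    return ["start"] + random.choices(STATES[1:], k=length)

def sequence_value(seq, gamma=0.9):
    """Discounted sum of state values along a candidate sequence."""
    return sum((gamma ** t) * value[s] for t, s in enumerate(seq))

candidates = [generate_sequence() for _ in range(20)]
# CA1-like selection: only the top-valued simulations are "consolidated".
consolidated = sorted(candidates, key=sequence_value, reverse=True)[:3]
```

The selection step is what addresses the selectivity problem: diverse sequences are cheap to generate, but only those predicted to be valuable are strengthened.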
Framing consolidation as offline reinforcement learning addresses the selectivity problem: rather than uniformly strengthening all recent memories or relying solely on broad emotional arousal, the hippocampus selectively reinforces high-value trajectories and strategies. CA3-generated simulations expand the space of candidate futures beyond direct experience, while CA1 valuation filters and strengthens those likely to maximize rewards. This mechanism explains preferential consolidation of reward-related content, aligns with observed value-biased reactivation during SWRs and sleep, and integrates hippocampal roles in episodic construction, planning, and decision-making. By mapping hippocampal computations onto Dyna-like architectures, the perspective bridges neural phenomena (replay/preplay, CA3 recurrence, CA1 valuation), behavior (improved decision-making), and computation (model-based simulation complementing value updates). The framework generates testable predictions about replay value dependence, the impact of manipulating CA3–CA1 plasticity, and differential contributions of CA3 vs. CA1 to diversity vs. selection, providing avenues to evaluate consolidation as a strategy-derivation process.
Integrating recent empirical and theoretical advances, the authors propose that hippocampal processes underlying memory consolidation can be understood as a form of offline reinforcement learning. In this view, CA3 simulates diverse potential experiences, and CA1 selectively reinforces high-value sequences, transforming past experiences into actionable future strategies. This perspective complements existing roles attributed to consolidation, such as schema formation and semantic abstraction, and emphasizes the hippocampus’s predictive, decision-oriented functions. Conceptualizing consolidation through reinforcement learning offers a unifying framework that links neural dynamics, behavior, and computation, and suggests concrete, testable predictions for future research.
As a theoretical perspective, the article presents no new empirical data. Key mechanisms remain incompletely characterized, including the precise value dependence of CA1 activity during SWRs, the differentiation of positive versus negative value processing during replay, and the specificity with which emotional arousal influences which experiences are consolidated. The serial organization of the CA3–CA1 circuit complicates causal interpretation of manipulations targeting each subregion. Predictions such as stronger value dependence in CA1 than CA3 replay and the effects of blocking CA3–CA1 plasticity require targeted experimental testing. Generalizability beyond spatial tasks and across species, contexts, and timescales also needs empirical validation.