Computer Science
MASTERING MEMORY TASKS WITH WORLD MODELS
M. R. Samsami, A. Zholus, et al.
Model-based RL agents struggle with long-term dependencies. Recall to Imagine (R2I) addresses this by integrating a new family of state space models into world models to improve long-term memory and long-horizon credit assignment. R2I sets a new state of the art on memory and credit-assignment benchmarks such as BSuite and POPGym, achieves superhuman results on Memory Maze, matches DreamerV3 on Atari and DMC, and converges faster than DreamerV3. This research was conducted by Mohammad Reza Samsami, Artem Zholus, Janarthanan Rajendran, and Sarath Chandar.
Introduction
The paper addresses the problem that model-based reinforcement learning agents struggle with long-term temporal dependencies, which impairs both memory (recalling distant observations) and long-horizon credit assignment (associating actions with delayed rewards). The authors propose integrating structured state space models (SSMs), specifically variants inspired by S4, into world models to improve temporal coherence. The research question is whether SSM-backed world models can enable agents to learn long-range dependencies efficiently, thereby improving performance on memory- and credit-assignment-heavy tasks while maintaining generality in standard RL benchmarks. The importance lies in overcoming limitations of RNNs (vanishing gradients) and Transformers (quadratic complexity and instability on long sequences) for scaling to long contexts required in RL. The proposed method, Recall to Imagine (R2I), aims to deliver long-term memory and efficient credit assignment with computational efficiency, enabling superior performance in memory-demanding domains and competitive results in general RL tasks.
Literature Review
The paper situates its work within model-based RL methods leveraging world models such as Dreamer (and DreamerV2/V3), highlighting their use of RSSM (GRU-based) sequence models for latent imagination. It reviews the limitations of RNNs (vanishing/exploding gradients) and the growing interest in Transformer-based world models (e.g., TransDreamer), which face quadratic complexity and can become unstable on very long sequences. It then focuses on state space models (SSMs), particularly S4 and its variants, which have shown strong performance in long-range sequence modeling in supervised/self-supervised tasks by parameterizing continuous-time linear dynamical systems with structured matrices and efficient discretization, offering parallel training and fast recurrent inference. Recent works explore diagonal parameterizations (S4D), MIMO SSMs and parallel scan computation (S5), as well as hybrid SSM-attention architectures. The authors also reference recent analyses of memory versus credit assignment in RL and note that improving world model likelihood alone may not guarantee policy performance in online RL, motivating careful integration of SSMs within an MBRL system.
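As background for the S4-style models referenced above, here is a standard state space formulation in reference notation (a sketch of the general family, not equations copied from the paper); A, B, C are learned matrices and Δ is the discretization step size:

```latex
% Standard S4-style state space model (reference notation; not taken verbatim
% from the paper). A continuous-time linear system
%   x'(t) = A x(t) + B u(t),   y(t) = C x(t)
% is discretized (R2I uses the bilinear method) into a linear recurrence:
\begin{aligned}
  x_k &= \bar{A}\, x_{k-1} + \bar{B}\, u_k, \qquad y_k = C\, x_k,\\
  \bar{A} &= \Bigl(I - \tfrac{\Delta}{2} A\Bigr)^{-1}\Bigl(I + \tfrac{\Delta}{2} A\Bigr), \qquad
  \bar{B} = \Bigl(I - \tfrac{\Delta}{2} A\Bigr)^{-1} \Delta B .
\end{aligned}
% Because the recurrence is linear, it can be trained in parallel over a whole
% sequence (via convolution or an associative scan) and run step by step at inference.
```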
Methodology
R2I integrates SSMs into DreamerV3's world model, forming a Structured State-Space Model (S3M) backbone designed to capture long-range dependencies with efficient computation. Key design components:
1) Non-recurrent representation model: the posterior qφ(z_t|h_t, o_t) is replaced with qφ(z_t|o_t), enabling independent posterior sampling per time step and parallel computation of the deterministic states h_{1:T} via an SSM-based sequence model fθ((a_{1:T}, z_{1:T}), x0).
2) Architecture: the world model comprises the representation model qφ(z_t|o_t), a dynamics/prior model pθ(z_t|h_t), and a sequence model fθ built from multiple SSM layers (Equation 2 in the paper). Each SSM layer output passes through a GeLU activation, a fully connected GLU transformation, and LayerNorm, following the S5-style design. Prediction heads p(o_t|z_t,h_t), p(r_t|z_t,h_t), and p(c_t|z_t,h_t) reconstruct observations, rewards, and continuation flags. CNN encoders/decoders are used in image domains and MLPs in tabular domains.
3) Training objective: optimize L(θ) = Σ_t [Lpred + Lrep + Ldyn] under posterior samples z_{1:T} ~ qφ, with KL clipping at 1, KL balancing, and scaling coefficients (βpred, βrep, βdyn), mirroring DreamerV3's ELBO-derived objective with modifications. Deterministic states h_{1:T} and hidden states x_{1:T} are obtained in parallel: h_{1:T}, x_{1:T} = fθ((a_{1:T}, z_{1:T}), x0).
4) SSM computation: parallel scan (rather than convolution) computes u_{1:T}, x0 → y_{1:T}, x_{1:T}, which exposes the hidden states x_t to the policy, gives constant-time imagination steps, scales efficiently with sequence length across devices, and allows resettable hidden states at episode boundaries. The binary operator of the parallel scan is modified to incorporate done flags while preserving associativity, so hidden states can be reset within sampled sequences that span multiple episodes (see the sketch after this list).
5) Actor-critic: unlike the GRU-based RSSM, where h_t serves both the recurrence and the prediction heads, R2I passes h_t only to the prediction heads and uses the SSM hidden state x_t for recurrent updates. Empirically, policies conditioned on (z_t, x_t) are more stable and performant in memory-intensive environments than those conditioned on (z_t, h_t) or (z_t, h_t, x_t), likely because features drift as the world model evolves. Actor-critic training follows DreamerV3, including imagination from every context time t for H steps, two-hot value regression, a fixed entropy bonus, and percentile-based return normalization. Policy inputs are chosen per domain: the output-state policy in non-memory tasks, and the hidden-state or full-state policy in memory tasks (with the hidden-state policy preferred to mitigate instability).
6) SSM design choices: diagonal parameterization of the state matrices (for simplicity and speed), MIMO SSMs (to reduce latent sizes and remove mixing layers), bilinear discretization, post-normalization, and a GLU transformation after each SSM block.
Hyperparameters largely follow DreamerV3, with fixed world model settings across domains; training batch sequence lengths are set to encompass full episodes so that temporal dependencies can be fully exploited.
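As a reference for point 4, here is a minimal JAX sketch (illustrative only, not the authors' implementation) of running a diagonal SSM recurrence with jax.lax.associative_scan while resetting hidden states at episode boundaries. The paper modifies the scan's binary operator to carry done flags; this simplified version assumes a zero initial state and achieves the reset by zeroing the decay factor at the first step of each episode. The names ssm_scan, binary_op, and reset_seq are hypothetical.

```python
# Illustrative sketch (not the authors' code): a diagonal linear SSM recurrence
#   x_t = A_bar * x_{t-1} + B_bar u_t
# evaluated over a whole sequence with a parallel (associative) scan, with the
# hidden state reset at episode boundaries by zeroing the decay factor there.
import jax
import jax.numpy as jnp


def binary_op(elem_i, elem_j):
    # Compose two affine maps x -> a * x + b (earlier elem_i, later elem_j).
    # This operator is associative, which is what allows a parallel scan.
    a_i, b_i = elem_i
    a_j, b_j = elem_j
    return a_j * a_i, a_j * b_i + b_j


def ssm_scan(A_bar, B_bar, u_seq, reset_seq):
    """Compute hidden states x_{1:T} for a diagonal SSM in parallel.

    A_bar:     (N,)   diagonal of the discretized state matrix
    B_bar:     (N, D) discretized input matrix
    u_seq:     (T, D) inputs (e.g., concatenated actions and stochastic latents)
    reset_seq: (T,)   1.0 at the first step of each episode, 0.0 elsewhere
    Returns x_seq of shape (T, N), assuming the state resets to zero.
    """
    T = u_seq.shape[0]
    # Decay factors per step; zeroed at reset steps so nothing leaks across
    # episode boundaries inside a sampled training sequence.
    a = jnp.tile(A_bar[None, :], (T, 1)) * (1.0 - reset_seq)[:, None]
    # Driving terms B_bar u_t, computed for all steps at once.
    b = u_seq @ B_bar.T
    _, x_seq = jax.lax.associative_scan(binary_op, (a, b))
    return x_seq


# Tiny usage example: a new episode starts at t = 3, so x_3 depends only on u_3.
key_b, key_u = jax.random.split(jax.random.PRNGKey(0))
A_bar = jnp.full((8,), 0.9)
B_bar = 0.1 * jax.random.normal(key_b, (8, 4))
u_seq = jax.random.normal(key_u, (6, 4))
resets = jnp.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
x_seq = ssm_scan(A_bar, B_bar, u_seq, resets)  # shape (6, 8)
```

Because the combine operator is associative, the whole sequence can be processed in logarithmic parallel depth at training time, while the same recurrence can be stepped one state at a time during imagination and acting.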
Key Findings
- R2I achieves state-of-the-art performance on memory and credit assignment tasks:
• BSuite (Memory Length, Discounting Chain): R2I maintains high success rates at longer memory lengths and reward delays where DreamerV3 degrades; it learns dependencies beyond roughly 30 steps and holds performance up to roughly 100-step memory/credit delays.
• POPGym (RepeatPrevious, Autoencode, Concentration): R2I sets new SOTA in Autoencode (Easy, Medium) and RepeatPrevious (Medium, Hard) and improves over baselines in Concentration (Medium). It consistently outperforms DreamerV3 and 13 model-free baselines (PPO variants with GRU/LSTM/MLP/PosMLP). Memory demands reach up to 104–208 parallel items (depending on environment/difficulty), and R2I pushes memory limits substantially.
- Memory Maze (3D, long-horizon memory, up to 4K steps per episode): After 400M environment steps, R2I outperforms Dreamer and IMPALA at all sizes and surpasses human-level performance in 9x9, 11x11, and 13x13 mazes. Reported detailed metrics (Appendix T):
• 9x9: Score 33.55, Human-normalized 127%, Oracle-normalized 96%, at 17M steps.
• 11x11: Score 51.96, Human-normalized 117%, Oracle-normalized 89%, at 66M steps.
• 13x13: Score 58.14, Human-normalized 104.7%, Oracle-normalized 78%, at 206M steps.
• 15x15: Score 40.86, Human-normalized 60%, Oracle-normalized 46%.
- Generality retained: In Atari 100K and DMC (proprioception and vision), R2I matches DreamerV3’s performance profiles, indicating no sacrifice in standard RL tasks.
- Computational efficiency: R2I is faster than DreamerV3, with up to 9.7× speedup in vector environments and ~3.2× in image environments (train step time vs. sequence length). This yields faster wall-time convergence; R2I achieved superhuman Memory Maze performance in fewer training days than Dreamer.
- Policy input ablations: Conditioning policies on SSM hidden state x_t improves stability and performance in memory tasks versus using h_t; full-state inputs can introduce instability due to non-stationarity.
- Design ablations: Non-recurrent representation model q(z|o) maintains or improves performance; full episode inclusion in training batches markedly improves memory task results; longer batch sequence lengths lead to better Memory Maze performance.
Discussion
The findings demonstrate that integrating SSMs into the world model backbone enables agents to capture and leverage long-range dependencies crucial for memory and delayed credit assignment. By enabling parallel training and recurrent inference via parallel scan, R2I provides both the representational capacity and computational pathways to maintain long temporal context. The improved results in BSuite and POPGym confirm enhanced memory retention and credit assignment, while superhuman performance in Memory Maze shows that the method scales to complex, long-episode 3D navigation tasks where building and recalling a map over thousands of steps is essential. Crucially, R2I preserves performance on non-memory-dominant tasks (Atari, DMC), supporting the claim of generality. The policy input analyses suggest that making the policy depend on stable, Markovian-like hidden states (x_t) is key to robust optimization in online MBRL, addressing instability from non-stationary features. Overall, R2I addresses the research question by showing that SSM-backed world models can substantially improve long-term memory and credit assignment without compromising general performance, and with significant computational speedups.
Conclusion
R2I is a general, fast, and memory-enhanced model-based RL agent that integrates SSMs into DreamerV3’s world model. It achieves state-of-the-art results in memory-intensive and credit-assignment tasks, including superhuman performance in Memory Maze, while maintaining competitive performance on standard RL benchmarks. The method demonstrates up to 9× faster training step times and stable policy training when conditioned on SSM hidden states. Contributions include a non-recurrent representation model enabling parallel computation, a parallel-scan SSM operator with episode resets, and principled actor-critic integration that benefits from exposed hidden states. Future work includes exploring hybrid attention-SSM architectures to combine local attention with long-range memory, extending batch sequence lengths and imagination horizons further, and investigating strategies to mitigate feature non-stationarity for policies in online training.
Limitations
R2I currently relies on fixed world model hyperparameters and selected policy inputs per domain; policies conditioned on h_t can be unstable due to non-stationarity, and even full-state inputs may introduce instability. The batch sequence lengths and imagination horizons, while sufficient for the reported tasks, are not extremely long compared to the maximum possible contexts, suggesting room to push memory further. The approach does not incorporate attention mechanisms, which could complement SSMs with local inductive biases. Computational design choices (e.g., diagonal parameterization, bilinear discretization) trade some expressivity for speed; while effective empirically, alternative parameterizations might yield further gains at higher cost. Lastly, broader cross-domain generalization beyond the tested benchmarks remains to be explored.