Simplifying Latent Dynamics with Softly State-Invariant World Models

Computer Science

T. Saanum, P. Dayan, et al.

To solve control problems, an agent must predict how its actions change the world. The Parsimonious Latent Space Model (PLSM) regularizes latent dynamics by minimizing the mutual information between latent states and action-induced changes, producing softly state-invariant world models that improve prediction, generalization, and downstream performance. This research was conducted by Tankred Saanum, Peter Dayan, and Eric Schulz.

Introduction
The paper addresses how to learn world models that capture the systematic, predictable effects of actions on environment states in reinforcement learning. Standard world models encode high-dimensional observations into latent states and predict the next latent, often by modeling the change (delta) an action induces; yet, because of how the latent space is constructed, the same action can produce different latent changes even in similar states. The authors propose the Parsimonious Latent Space Model (PLSM), which regularizes latent dynamics so that actions have more coherent, predictable effects: it minimizes the dependence of the predicted dynamics on the current latent state, yielding softly state-invariant dynamics. They integrate PLSM with contrastive latent state prediction and with both model-based and model-free control to improve prediction accuracy, generalization, robustness, and downstream performance.
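To make the standard setup concrete, the following is a minimal, illustrative sketch (not the authors' code) of a delta-style latent world model: an encoder maps observations to latents, and a dynamics network predicts the latent change caused by an action. Module names, architectures, and sizes are assumptions.

```python
# Illustrative sketch of a standard delta-based latent world model
# (hypothetical names and sizes, not the paper's implementation).
import torch
import torch.nn as nn

class DeltaWorldModel(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32, action_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent_dim))

    def forward(self, obs, action):
        z_t = self.encoder(obs)                               # z_t = e(s_t)
        delta = self.dynamics(torch.cat([z_t, action], -1))   # Δ = d(z_t, a_t)
        return z_t + delta                                    # ẑ_{t+1} = z_t + Δ
```

Because Δ here is predicted directly from z_t, nothing prevents the same action from mapping to very different deltas in nearby states; PLSM's regularization, described in the Methodology section, targets exactly this.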
Literature Review
Related work includes latent world models that compress state representations to ease dynamics modeling, such as RSSM and Dreamer, which impose bottlenecks on latent states via KL regularization; RePo, which compresses further by focusing on reward-relevant aspects; and SPR, which learns representations by predicting target-encoder representations for model-free and model-based RL. Mutual information minimization has been widely used for representation learning, generalization, robustness, and exploration. The closest prior approach is the past-future information bottleneck, which minimizes the information shared between past and future states; PLSM differs by treating the dynamics operator itself as a random variable and directly constraining its conditional dependence on the current latent state, aiming for softly state-invariant action effects. Additional connections include group-theoretic views of invariant transformations and discrete/categorical latent world models.
Methodology
PLSM encourages parsimonious dynamics by minimizing the mutual information between the current latent state z_t and the predicted latent dynamics Δ given the action a_t. The base world-modeling setup encodes observations s_t into latents z_t via an encoder e_θ, predicts the latent change Δ via a dynamics model d_ϕ, and forms the next latent as z_{t+1} = z_t + Δ. To reduce the state dependence of the dynamics, PLSM introduces a query network f_θ that maps (z_t, a_t) to a low-information code h_t and predicts Δ from (h_t, a_t): h_t = f_θ(z_t, a_t), Δ̂_t = d_ϕ(h_t, a_t). The mutual information I(z_t; Δ̂_t | a_t) is approximately minimized by penalizing the L2 norm of h_t with coefficient β, a deterministic variant of an information bottleneck (equivalent in prior work to minimizing a KL divergence to a zero-mean, constant-variance Gaussian). The training objective combines the next-latent prediction error with the bottleneck term: L = || e_θ(s_{t+1}) − (z_t + d_ϕ(h_t, a_t)) ||^2_2 + β ||h_t||^2_2.

To avoid representational collapse, PLSM is integrated with collapse-mitigating frameworks: (1) Self-Predictive Representations (SPR) for model-based planning via TD-MPC, replacing direct next-latent prediction with residual dynamics and adding the h_t bottleneck to the SPR loss, which uses a stop-gradient target encoder; and (2) Contrastive World Models (CWM), adding the β ||h_t||^2_2 term to a margin-based contrastive loss that keeps latent states distinguishable and enforces correct long-horizon predictions. PLSM is also applied to RePo (a Dreamer-based model) by feeding a compressed h_t into the recurrent GRU dynamics, yielding softly state-invariant recurrent updates.

The approach is evaluated on continuous-control tasks (DMC), visually distracted control (Distracting Control Suite), latent prediction datasets (cubes, shapes, dSprites, Pong, Space Invaders, MOVi-E), and model-free RL in Atari, with ablations (e.g., no h_t, top-k bottleneck, weight-decay controls) and analysis of the learned representations (decoding generative factors from z_t vs. h_t).
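As a rough illustration of the objective above, here is a minimal sketch assuming simple MLPs for the encoder, query network, and dynamics model; the class and method names are hypothetical, and a detached target latent stands in for the stop-gradient target encoder used in the SPR integration described above.

```python
# Minimal sketch of the PLSM objective (not the authors' implementation).
# Assumptions: MLP encoder/query/dynamics; detached next-latent target as a
# crude stand-in for the collapse-mitigating SPR/contrastive machinery.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PLSM(nn.Module):
    def __init__(self, obs_dim=64, latent_dim=32, action_dim=4, query_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))            # e_θ
        self.query = nn.Sequential(nn.Linear(latent_dim + action_dim, 128), nn.ReLU(),
                                   nn.Linear(128, query_dim))               # f_θ: (z_t, a_t) -> h_t
        self.dynamics = nn.Sequential(nn.Linear(query_dim + action_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent_dim))           # d_ϕ: (h_t, a_t) -> Δ

    def loss(self, s_t, a_t, s_next, beta=0.1):
        z_t = self.encoder(s_t)
        z_next = self.encoder(s_next).detach()          # stand-in for a target encoder
        h_t = self.query(torch.cat([z_t, a_t], -1))     # low-information query code
        delta = self.dynamics(torch.cat([h_t, a_t], -1))
        pred_err = F.mse_loss(z_t + delta, z_next)      # next-latent prediction error
        bottleneck = h_t.pow(2).sum(-1).mean()          # β‖h_t‖² penalty, which approximately
        return pred_err + beta * bottleneck             # minimizes I(z_t; Δ̂_t | a_t)
```

On its own this objective can collapse; in the paper it is paired with SPR-style targets or a contrastive loss, as described above.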
Key Findings
• Model-based RL (TD-MPC with PLSM): Clear performance gains in four of five DMC tasks (acrobot-swingup, finger-turn hard, quadruped-walk, humanoid-walk), with β=0.1 generally and β=0.001 for humanoid-walk; quadruped-run showed no improvement. Overall, parsimonious dynamics improved planning in complex, high-dimensional control.
• Distracting visual control (RePo + PLSM): Training on the Distracting Control Suite with random video backgrounds, softly state-invariant recurrent dynamics via h_t regularization improved performance under visual distractions (one million steps, five seeds).
• Contrastive latent prediction (CWM + PLSM): PLSM improved long-horizon prediction accuracy in five of six datasets, with particularly strong gains in cubes and shapes, approaching near-perfect accuracy at horizon 10. In Pong, enforcing fully parsimonious dynamics was less beneficial due to non-controllable elements; a hybrid model that constrains only half the latent space performed best in Atari (see the sketch after this list).
• Generalization and robustness: PLSM generalized better to scenes with fewer objects than seen in training (cubes and shapes) and was more robust to Gaussian noise in dSprites, maintaining ~60% 10-step accuracy at σ=0.2 versus <50% for unconstrained dynamics.
• Representation analysis: z_t latents carry decodable information about generative factors (positions, scales, orientations) with high accuracy; h_t encodes less about these factors overall but, when conditioned on an action affecting one object, carries more information about that specific object's attributes, consistent with its role as a minimal action-relevant query.
• Slot-based variant: PLSM with slots outperformed C-SWM at horizon t=10 in cubes 9 (accuracy: 99.05% ± 0.18 vs. 97.45% ± 1.3).
• Model-free RL (Atari, SPR vs. SPR+PLSM): Average performance improved, with notable gains in some games (e.g., UpNDown: 29572.0 vs. 10859.2; Pong: 1.1 vs. −5.4), while some games saw declines where non-controllable dynamics are prominent, highlighting the need to tune β and consider hybrid constraints.
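The hybrid idea referenced above can be sketched roughly as follows: only part of the latent space receives the h_t-mediated, softly state-invariant dynamics, while the remaining dimensions keep fully state-dependent dynamics. This is an illustrative sketch, not the paper's implementation; the even split and module names are assumptions.

```python
# Hedged sketch of a hybrid dynamics model: half the latent dimensions get
# parsimonious (h_t-mediated) updates, the other half stay state-dependent.
import torch
import torch.nn as nn

class HybridDynamics(nn.Module):
    def __init__(self, latent_dim=32, action_dim=4, query_dim=16):
        super().__init__()
        half = latent_dim // 2
        self.query = nn.Linear(latent_dim + action_dim, query_dim)
        self.constrained = nn.Linear(query_dim + action_dim, half)                  # softly state-invariant half
        self.unconstrained = nn.Linear(latent_dim + action_dim, latent_dim - half)  # free, state-dependent half

    def forward(self, z_t, a_t):
        h_t = self.query(torch.cat([z_t, a_t], -1))
        delta_c = self.constrained(torch.cat([h_t, a_t], -1))
        delta_u = self.unconstrained(torch.cat([z_t, a_t], -1))
        # Only ‖h_t‖² would be penalized, so non-controllable dynamics can
        # live in the unconstrained half of the latent space.
        return torch.cat([delta_c, delta_u], -1), h_t
```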
Discussion
The findings support the central hypothesis that regularizing latent dynamics to minimize their dependence on the current state—conditional on action—produces more systematic and predictable action effects in learned world models. PLSM’s softly state-invariant dynamics led to improved planning efficiency, long-horizon latent prediction accuracy, generalization to shifts in object count, and robustness to observation noise. In environments where significant dynamics lie outside the agent’s control, fully parsimonious dynamics can omit relevant features; the hybrid approach that constrains only part of the latent space helps reconcile this. Overall, treating action effects as consistent transformations across latent states encourages world models that generalize better and are more data-efficient for downstream control and prediction.
Conclusion
PLSM introduces an information bottleneck on latent dynamics via a query representation, making action-induced changes in latent states more predictable and softly state-invariant. Integrated with contrastive modeling and SPR-based planning/control, PLSM consistently improves latent prediction accuracy, generalization, robustness, and RL performance in complex pixel-based environments. Future directions include applying the bottleneck in recurrent, history-conditioned settings to enforce softly history-invariant action effects; combining PLSM with discrete/categorical latent dynamics (e.g., transition matrices); and leveraging parsimonious action modeling in controllable video generation systems such as Genie and UniSim to improve accuracy and controllability.
Limitations
The approach assumes actions have predictable, largely deterministic effects; components with dynamics not well explained by actions may be underrepresented in the learned model, potentially harming performance in tasks where such factors are important. Performance varied across Atari games, with some decreases observed. The regularization strength β requires tuning per environment to achieve the right balance between compression and predictive accuracy.