Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics

C. Gumbsch, N. Sajid, et al.

THICK (Temporal Hierarchies from Invariant Context Kernels) learns hierarchical world models with discrete latent dynamics: a low level that sparsely updates invariant contexts and a high level that predicts context changes, producing interpretable temporal abstractions and improved model-based RL and planning. Research conducted by Christian Gumbsch, Noor Sajid, Georg Martius, and Martin V. Butz.

Introduction
The paper addresses the challenge of enabling long-horizon planning in model-based reinforcement learning by learning hierarchical world models that can reason across multiple time scales. Current state-of-the-art world models are typically flat and limited to fixed prediction horizons, constraining planning over long durations. Inspired by hierarchical representations in biological cognition, the authors hypothesize that learning higher-level temporal abstractions from discrete changes in latent dynamics can yield adaptive timescales independent of fixed intervals or specific tasks. The proposed approach, THICK, aims to discover when prediction-relevant latent generative factors change and to train a higher-level model to anticipate those events, thereby improving planning and value learning in sparse and long-horizon tasks.
Literature Review
The related work covers: (1) sparsity in RNNs and hierarchical activation, from early proposals like Schmidhuber’s Neural History Compressor to modern sparse gating, attention, and phased/time-dependent updates; (2) temporal abstractions via regularized latent dynamics and fixed time scales, noting prior use of sparse latent changes for segmentation or clockwork models, and contrasting THICK’s adaptive segmentation via learned sparse context gates; (3) segmentation via predictability or boundary detection, including methods like ASI, TAP, predictability-based bottlenecks, error-based segmentation, and explicit boundary detectors; (4) hierarchical RL approaches where high-level policies set subgoals or operate on fixed schedules; THICK differs by learning a hierarchical world model and using high-level predictions to assist a single policy (for MBRL) or high-level planning (for MPC). The paper also situates its contribution alongside works on faster credit assignment (e.g., reward redistribution) and scaling world model capacity, arguing hierarchy provides benefits not achieved by scaling alone.
Methodology
The authors introduce a two-level hierarchical world model and describe its integration into model-based RL and planning.

- Low-level model (C-RSSM): Extends Dreamer's RSSM by splitting the latent state into three components: a stochastic state z_t, a deterministic hidden state h_t updated at every step, and a low-dimensional context c_t designed to change sparsely. A GateL0RD cell governs context updates via an L0-regularized update gate, encouraging infrequent changes in c_t (a minimal sketch of such a gated update appears after this list). The model provides two priors for z_t: a precise prior conditioned on h_t and c_t, and a coarse prior conditioned on c_t, z_{t-1}, and a_{t-1}, enabling predictions without h_t. Precise and coarse output heads reconstruct the observations (image, reward, discount), which incentivizes encoding prediction-relevant information into c_t. The loss combines prediction terms, KL terms (for both priors against the posterior), and a sparsity term, with hyperparameters largely following DreamerV2.
- High-level model (THICK): Uses the discrete context dynamics of the C-RSSM to define an adaptive timescale by detecting the timepoints τ(t) at which the context changes or the episode terminates. The high level predicts the variables associated with the event immediately before a context change: the stochastic state z_{τ(t)-1}, the action a_{τ(t)-1}, the elapsed time Δτ(t), and the accumulated discounted reward r^γ_{t:τ(t)}. Inputs to the high level are the current c_t and z_t plus a categorical high-level action code A_t that disambiguates different potential outcomes. During training, a posterior action encoder uses privileged information (the next context and state) to form outcome-conditioned action codes; a prior action encoder learns to model likely outcomes without privileged information for evaluation. The losses include prediction terms for the targets and a KL term, with KL balancing, that aligns the prior and posterior action distributions.
- Temporally abstract rollouts: The high-level predicted action and state are fed into the low-level coarse pathway to update c_{τ(t)} and produce a coarse prior for z_{τ(t)}; iterating this process yields multi-step abstract rollouts.
- Downstream integrations:
  - THICK Dreamer (MBRL): Combines short-horizon value estimates from imagined low-level rollouts with long-horizon value estimates derived from the temporally abstract predictions, mixing them with a balance parameter ψ to improve reward propagation in sparse-reward tasks (see the value-mixing sketch after this list). Two critics are trained to regress the combined value.
  - THICK PlaNet (MPC): Performs high-level planning with Monte Carlo Tree Search over sequences of high-level actions A_{1:K} to set subgoals z^{goal}, replanning whenever the context changes. A low-level CEM search then finds action sequences that both maximize predicted task rewards and approach the subgoal, using the cosine similarity of categorical logits as a proximity measure with trade-off parameter κ.
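Below is a minimal sketch of a sparsely gated context update in the spirit of the C-RSSM's GateL0RD cell. All names (SparseContextCell, the layer sizes, the hard-sigmoid gate) are illustrative assumptions rather than the authors' implementation, and the L0 gate penalty is approximated here by an L1-style surrogate on the gate activations.

```python
# Sketch only: an element-wise gated context update with a sparsity penalty.
# The real GateL0RD cell uses a stochastic, L0-regularized gate; this surrogate
# simply penalizes open gates so that the context c_t changes infrequently.
import torch
import torch.nn as nn

class SparseContextCell(nn.Module):
    def __init__(self, input_dim: int, context_dim: int, hidden: int = 64):
        super().__init__()
        self.candidate = nn.Sequential(            # proposes a new context vector
            nn.Linear(input_dim + context_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, context_dim), nn.Tanh())
        self.gate = nn.Sequential(                 # decides where to overwrite it
            nn.Linear(input_dim + context_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, context_dim))

    def forward(self, x, c_prev):
        inp = torch.cat([x, c_prev], dim=-1)
        c_new = self.candidate(inp)
        # Hard-sigmoid gate in [0, 1]; gates that saturate at 0 keep the old context.
        g = torch.clamp(0.5 * self.gate(inp) + 0.5, 0.0, 1.0)
        c = g * c_new + (1.0 - g) * c_prev         # sparse, element-wise update
        sparsity_loss = g.abs().sum(-1).mean()     # L1 surrogate for the L0 penalty
        return c, g, sparsity_loss

# Usage: the sparsity loss would be weighted by beta_sparse in the total objective.
cell = SparseContextCell(input_dim=32, context_dim=8)
x, c = torch.randn(4, 32), torch.zeros(4, 8)
c, g, l_sparse = cell(x, c)
loss_term = 0.01 * l_sparse                        # 0.01 stands in for beta_sparse
```

The second sketch shows one plausible way to mix short- and long-horizon value targets with the balance parameter ψ in THICK Dreamer. The weighting convention (ψ on the short-horizon term) and all argument names are assumptions made for illustration, chosen to be consistent with the ablation finding that over-reliance on the long-horizon estimate can hurt learning; the paper's actual λ-return computation over imagined rollouts is abstracted into pre-computed inputs.

```python
# Sketch only: blend a short-horizon value estimate with a long-horizon estimate
# built from the high-level (temporally abstract) prediction.
import torch

def mixed_value_target(v_short: torch.Tensor,     # value from the imagined low-level rollout
                       r_abstract: torch.Tensor,  # predicted discounted reward up to the context change
                       v_abstract: torch.Tensor,  # value of the predicted state at the context change
                       delta_t: torch.Tensor,     # predicted number of steps until that change
                       gamma: float = 0.99,
                       psi: float = 0.9) -> torch.Tensor:
    # Long-horizon estimate: reward accumulated until the predicted context change
    # plus the discounted value of the state reached there.
    v_long = r_abstract + (gamma ** delta_t) * v_abstract
    # Assumed convention: psi weights the short-horizon term, so psi ~ 0.8-0.9
    # keeps the standard Dreamer target dominant while the long-horizon term
    # still propagates sparse rewards backwards faster.
    return psi * v_short + (1.0 - psi) * v_long
```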
Key Findings
- Interpretable, categorical temporal abstractions: High-level actions encode distinct agent-environment interactions (e.g., grasping vs. pushing a ball, attacking monsters, stepping on pads), and context changes align with meaningful events such as item pickups, object interactions, door operations, or boundary transitions. Quantitatively, context changes coincide with task-relevant events at high rates (e.g., 88.6%–97.1% around key events in MiniHack, depending on the task).
- MBRL improvements (THICK Dreamer vs. DreamerV2 and Director in MiniHack and VisualPinPad): THICK Dreamer matches or exceeds Dreamer's returns and success rates, especially as task horizons increase. In MiniHack KeyCorridor, the gains grow with corridor length until exploration limitations prevent reward discovery. Director performs well on the easier KeyRoom task but struggles on the other MiniHack tasks; analyses attribute this to insufficient diversity of the initial data and difficulty encoding unobservable goal aspects. In VisualPinPad (with replay buffers pre-filled via exploration), THICK Dreamer is more sample-efficient than Dreamer and Director on longer sequences. Steps needed to reach 95% of the maximum reward (THICK Dreamer vs. Dreamer):
  - PinPadFour: 200k vs. 280k (80k fewer) with 250k exploration steps; 120k vs. 180k (60k fewer) with 500k; 140k vs. 200k (60k fewer) with 1M.
  - PinPadFive: 260k vs. 360k (100k fewer) with 500k exploration steps; 340k vs. 590k (250k fewer) with 1M.
- Zero-shot MPC improvements (THICK PlaNet vs. PlaNet in MultiWorld): Comparable performance on the short-horizon, dense-reward task (Pusher-Dense) and clear gains on sparse-reward and long-horizon tasks (Pusher-Sparse, Door, PickUp). High-level subgoals are human-interpretable via reconstruction (e.g., moving to the door handle or grasping the ball).
- Ablations and sensitivity:
  - Using the C-RSSM in Dreamer yields similar or slightly improved performance over the RSSM; removing h_t and the precise pathway impedes learning.
  - THICK Dreamer benefits from combining short- and long-horizon value estimates; relying too heavily on the long horizon can hurt learning, with ψ ≈ 0.8–0.9 effective.
  - The sparsity regularization β^sparse must be tuned: too weak causes frequent context updates and identity-like high-level predictions; too strong prevents learning meaningful transitions early on. THICK Dreamer is robust across a range of values; THICK PlaNet is more sensitive but tolerates overly strong sparsity better than overly weak.
  - Scaling model size alone (larger RSSM hidden states) did not match the performance gains from adding hierarchy.
Discussion
The study demonstrates that learning a hierarchy from sparsely changing discrete latent contexts provides adaptive temporal abstractions that align with meaningful environmental events. By predicting context-altering states and associated actions at an abstract timescale, THICK enhances long-horizon value estimation and planning. In MBRL, combining temporally abstract predictions with standard imagined rollouts speeds up credit assignment in sparse-reward settings, improving sample efficiency and success rates as horizons grow. In MPC, planning at the high level over abstract actions reduces the exponential difficulty of sampling long action sequences, yielding better zero-shot performance on tasks requiring multi-step manipulation and coordination, while offering interpretable subgoals. These findings support the hypothesis that context-conditioned, event-predictive structures are a suitable basis for temporal flexibility and compositional reasoning in agents, and they validate the practical benefits of hierarchical world models over flat counterparts without relying on fixed timescales or task-specific decompositions.
Conclusion
The paper introduces C-RSSM and THICK, enabling hierarchical world models by learning sparsely updated latent contexts and training a high-level predictor for context transitions. This yields interpretable temporal abstractions and seamless integration of abstract and precise predictions. Empirically, THICK improves long-horizon reinforcement learning and planning across diverse visual domains, outperforming flat baselines and offering explainable subgoals. Future work includes extending THICK to deeper hierarchies, integrating high-level planning with RL (e.g., MuZero-style combinations), enriching high-level predictions to directly operate from context alone, and enhancing exploration to consolidate context codes and transitions. The approach aims to advance agents capable of multi-timescale planning and reasoning.
Limitations
THICK depends on tuning the sparsity regularization parameter β^{sparse}, which controls the segmentation and adaptive timescale; optimal values may vary by task, though similar tasks can share a value. The downstream applications inherit constraints of their base methods (e.g., if Dreamer cannot discover a solution, THICK cannot decompose it). THICK PlaNet adds computational overhead due to high-level MCTS; THICK Dreamer’s training is more time-consuming due to added networks and logging. Over- or under-regularization of context updates can degrade performance if not adjusted to the environment.