Learning World Models for Unconstrained Goal Navigation



Y. Duan, W. Mao, et al.

MUN is a goal-directed exploration algorithm that enables world models to predict transitions between arbitrary subgoal states drawn from replay buffers, boosting exploration efficiency and policy generalization under sparse rewards. Research conducted by Yuanlin Duan, Wensen Mao, and He Zhu.
Introduction

The paper addresses exploration and generalization challenges in goal-conditioned reinforcement learning (GCRL) with sparse rewards. While model-based RL (MBRL) with learned world models can provide rich learning signals and imagined rollouts to improve exploration efficiency, conventional world models trained from replay buffers often generalize only along forward transitions of recorded trajectories and struggle with backward transitions or transitions across distinct trajectories. This limits the model’s fidelity and leads to hallucinated rollouts that harm policy learning. The authors propose MUN, a goal-directed exploration algorithm that explicitly trains world models to characterize transitions between arbitrary subgoal states drawn from the replay buffer, including reversing along trajectories and navigating between states from separate trajectories. By constructing a bidirectional replay buffer and discovering key subgoals that mark milestones in complex tasks, MUN aims to improve world model reliability and enable policies that generalize to novel goal configurations.

Literature Review

The paper situates its contributions within model-based reinforcement learning (MBRL) and goal-conditioned RL (GCRL). MBRL has demonstrated success in control and games using learned dynamics for planning or generating synthetic data, with approaches like Dreamer and model-ensemble methods mitigating long-horizon planning issues and model bias. Integrations of MBRL and GCRL exploit world models to accelerate exploration and training but face severe sparse-reward and exploration issues that induce world model bias. Prior work on exploration includes count-based and intrinsic motivation strategies, Go-Explore variants (MEGA, PEG), maximum-entropy goal distributions, and planning-based goal selection. For world model reliability, methods use demonstrations, improved replay buffer management, or altered sampling objectives, but often overlook the directionality of transitions in the buffer. Subgoal discovery and latent landmark methods have been proposed, yet they can be complex or environment-specific. MUN complements these lines of work by focusing on bidirectional and cross-trajectory transitions in replay buffers, together with a simple, effective subgoal discovery method, Distinct Action Discovery (DAD), which applies farthest point sampling over actions.

Methodology

Problem setup: The authors consider goal-conditioned RL under an MDP (S, A, P, G, η, R, p0) with sparse rewards, assuming S = G and η the identity mapping. A goal-conditioned policy π(a | s, g) maximizes discounted goal-reaching rewards.

World model framework: Built on Dreamer’s recurrent state-space model (RSSM), the world model M predicts the next latent state given the current latent state and action, and is trained end-to-end with dynamics, reward, and reconstruction losses. The actor and critic are trained on imagined rollouts; the goal-reaching reward r^G is defined by a self-supervised temporal distance estimator D_t that predicts the normalized number of action steps between imagined states, rewarding the policy for reducing the predicted distance from the current state to the goal.

Core idea (MUN): Train the world model to capture transitions between arbitrary subgoal states in the replay buffer, including backward along trajectories and across different trajectories, by constructing a bidirectional replay buffer via subgoal-to-subgoal navigation. MUN maintains a set of pivot subgoals G_subgoals drawn from the buffer and periodically updates them. During each training episode, the agent is commanded to sequentially reach Ns subgoals (typically Ns = 2) within a step limit Ts for each, thereby collecting trajectories that record bidirectional transitions between subgoals (see the data-collection sketch below). The world model is trained on data from both environment-provided goals and subgoal navigation, and the policy is improved via imagined rollouts on the updated model.

Distinct Action Discovery (DAD): To identify key subgoal states that serve as milestones in complex tasks, MUN selects time steps with maximal action differences within batches of trajectories using Farthest Point Sampling (FPS) over actions; the states at those time steps form the subgoals. This simple, practical method targets distinctive action phases (e.g., grasp, lift, release in block stacking) and avoids complex environment-dependent heuristics (a sketch follows this section).

Algorithmic details: Algorithm 1 outlines MUN’s training loop with buffers for environment goal-conditioned data and DAD-induced subgoal navigation data; Algorithm 2 details DAD subgoal generation by selecting N_subgoals actions via FPS and mapping their states to goals.

Comparison to Go-Explore: Unlike Go-Explore’s two-phase (Go and Explore) structure with a separate exploration policy and intrinsic rewards (often requiring ensembles), MUN effectively replaces the Explore phase with another Go phase by navigating between two interesting goals. This reduces computational overhead and trains the world model on unconstrained transitions between goals, improving generalization.

Implementation: MUN is implemented on the Dreamer/GC-Dreamer backbone; the RSSM architecture and temporal distance training are detailed in Appendix A. Hyperparameters include the number of candidate subgoals, Ns, the episode length L, and the per-subgoal step limit Ts, with typical settings Ns = 2 and environment-specific values otherwise. Baselines (GC-Dreamer, PEG-G, MEGA-G) are implemented within the same framework.
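To make the subgoal-to-subgoal data collection concrete, the following is a minimal Python sketch of one navigation episode, assuming a gym-style environment, a goal-conditioned policy with an act(obs, goal) method, and a hypothetical goal_distance success check; these interfaces are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def goal_distance(obs, goal):
    # Hypothetical goal-distance metric; the paper relies on task-specific success checks.
    return np.linalg.norm(np.asarray(obs) - np.asarray(goal))

def collect_navigation_episode(env, policy, subgoals, n_s=2, t_s=100, threshold=0.05):
    """Command the goal-conditioned policy to visit n_s pivot subgoals in sequence,
    spending at most t_s steps on each, and record the resulting transitions.
    Over training, such episodes traverse subgoal pairs in both directions,
    which is what populates the bidirectional replay buffer."""
    chosen_ids = np.random.choice(len(subgoals), size=n_s, replace=False)
    obs = env.reset()
    trajectory = []
    for goal in (subgoals[i] for i in chosen_ids):
        for _ in range(t_s):
            action = policy.act(obs, goal)                  # goal-conditioned action
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, next_obs, goal))
            obs = next_obs
            if goal_distance(obs, goal) < threshold or done:
                break
    return trajectory  # appended to the subgoal-navigation replay buffer
```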
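The DAD step itself reduces to farthest point sampling in action space. Below is one way to sketch it; the trajectory dictionary keys and the use of Euclidean distance over raw action vectors are assumptions made for illustration.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: repeatedly pick the point farthest (Euclidean) from those
    already selected, yielding k mutually distant points."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(points)))]
    dists = np.linalg.norm(points - points[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(points - points[nxt], axis=1))
    return selected

def distinct_action_discovery(trajectories, n_subgoals):
    """Illustrative DAD: gather all recorded actions, run FPS over them to find
    the most mutually distinct actions, and return the states at those time
    steps as candidate pivot subgoals (with S = G, states serve directly as goals)."""
    actions = np.concatenate([t["actions"] for t in trajectories])  # (T, action_dim)
    states = np.concatenate([t["states"] for t in trajectories])    # (T, state_dim)
    picked = farthest_point_sampling(actions, n_subgoals)
    return states[picked]  # candidate subgoals G_subgoals
```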

Key Findings

Across six tasks (Ant Maze, Walker, 3-Block Stacking, Block Rotation, Pen Rotation, Fetch Slide), MUN consistently outperforms GC-Dreamer and Go-Explore-based baselines (PEG-G, MEGA-G) in both learning speed and final success rate. Notable results include:

• 3-Block Stacking: MUN achieves over 95% success within 2.5M steps; baselines reach around 60%. In navigation between arbitrary subgoals (225 combinations, 2,250 evaluation trajectories), MUN achieves 95% success vs. MUN-noDAD 81%, GC-Dreamer 56%, MEGA-G 42%, and PEG-G 47%.

• Ant Maze: navigation success rates are MUN 96%, MUN-noDAD 89%, GC-Dreamer 75%, MEGA-G 94%, PEG-G 93%.

• Walker: navigation success rates are MUN 89%, MUN-noDAD 73%, GC-Dreamer 67%, MEGA-G 81%, PEG-G 62%.

• World model quality: one-step prediction errors (mean squared error) are significantly lower for MUN across tasks, e.g., Ant Maze (MUN 1.6740 vs. GC-Dreamer 2.9666), Walker (MUN 0.8165 vs. GC-Dreamer 2.1824), and 3-Block Stacking (MUN 0.0070 vs. PEG-G 0.0476, MEGA-G 0.0853, GC-Dreamer 0.0392). Compounding (multi-step) errors are also lower: Ant Maze (MUN 18.83 vs. GC-Dreamer 40.36), Walker (MUN 13.03 vs. GC-Dreamer 39.72), 3-Block Stacking (MUN 0.45 vs. MEGA-G 0.95, GC-Dreamer 0.94).

• Fetch Slide: despite asymmetric, infeasible transitions outside the workspace, MUN with DAD discovers key subgoals and learns a generalizable model that assigns low probability to infeasible transitions, improving success rates over baselines.

• Ablations: Ns = 2 is sufficient; increasing Ns to 3 slows learning (MUN-Ns-3). Replacing DAD with fixed-interval subgoal selection (MUN-noDAD) or state-cluster centroids (MUN-KeyObs) reduces performance, especially in 3-Block Stacking, underscoring the value of action-space distinctiveness in subgoal discovery.

Discussion

The results demonstrate that training world models on bidirectional and cross-trajectory transitions between key subgoals reduces the generalization gap between imagined and real rollouts, yielding higher-fidelity models that in turn produce better goal-conditioned policies for sparse-reward, long-horizon tasks. By leveraging DAD to select subgoals tied to distinct actions, MUN captures critical task structure (e.g., grasping, lifting, releasing) that enables policies to generalize across new goal settings, including reversing tasks (e.g., unstacking). Compared to Go-Explore strategies that depend on separate exploration policies and sensitive goal selection, MUN’s unified subgoal-to-subgoal navigation is computationally simpler, produces richer dynamic coverage, and improves policy generalization. The improved world model prediction metrics corroborate the observed performance gains, supporting the hypothesis that enhancing replay buffer diversity and directionality of transitions is central to reliable world modeling and effective exploration.

Conclusion

MUN introduces a goal-directed exploration framework that trains world models to capture unconstrained transitions between arbitrary subgoal states in replay buffers—both backward along trajectories and across distinct trajectories. Coupled with DAD for simple, effective key subgoal discovery based on action distinctiveness, MUN markedly improves world model reliability and enables goal-conditioned policies that generalize to novel goals and settings. Experiments across six challenging environments show superior success rates and faster learning compared to strong baselines. Future work includes developing robust mechanisms to filter infeasible goals from current states, advancing subgoal discovery beyond DAD for complex action/goal spaces, and exploring MUN’s application to model-free RL to further simplify the architecture while retaining unconstrained goal navigation benefits.

Limitations

MUN’s performance depends on the quality of discovered subgoals; DAD, while simple and efficient, can select ineffective or task-irrelevant states in environments with highly complex action spaces or weak action–goal correlations, potentially degrading sample efficiency. The current mechanism lacks filtering for infeasible goals from the agent’s state, which can hinder efficiency in environments with asymmetric or constrained transitions (e.g., Fetch Slide). Broader deployment may require more robust, self-supervised subgoal discovery methods and goal feasibility checks to maintain performance across diverse tasks.
