Diffusion for World Modeling: Visual Details Matter in Atari
E. Alonso, A. Jelley, et al.
The paper addresses whether diffusion models, which excel at high-fidelity image generation, can serve as effective world models for training reinforcement learning (RL) agents in imagination. Existing world models often compress environment dynamics into sequences of discrete latent variables, which reduces compounding error but can discard visual details critical for learning an optimal policy, especially in partially observed and visually complex domains. The authors propose DIAMOND, a diffusion-based world model conditioned on past observations and actions, hypothesizing that improved visual fidelity and stable conditioning will enhance agent performance and credit assignment. The work is situated within the broader effort to address the sample inefficiency of RL, and it demonstrates that diffusion's strengths (multi-modal modeling, strong conditioning, and avoidance of mode collapse) are advantageous for world modeling.
The paper reviews prior work on world models and generative vision models. Early neural world models (Ha and Schmidhuber, 2018; SimPLe, Kaiser et al., 2019) introduced learning in imagination on Atari. Dreamer and its variants (Hafner et al., 2020, 2021, 2023) used recurrent state-space models with discrete latents to mitigate compounding error, while transformer-based world models (TWM, Robine et al., 2023; STORM, Zhang et al., 2023) improved sequence modeling. IRIS (Micheli et al., 2023) uses a discrete autoencoder to tokenize images and an autoregressive transformer to compose tokens over time. Parallel developments in generative vision shifted from discrete tokenization (VQGAN, Esser et al., 2021; DALL-E, Ramesh et al., 2021) to diffusion for images and video (Sohl-Dickstein et al., 2015; Ho et al., 2020; Rombach et al., 2022; Saharia et al., 2022a; Peebles and Xie, 2023; Bar-Tal et al., 2024; Brooks et al., 2024). Diffusion has been combined with RL as policies, planners, reward models, and for data augmentation (e.g., Wang et al., 2022; Ajay et al., 2022; Janner et al., 2022; Nuti et al., 2023; Lu et al., 2023; Ding et al., 2024; Jackson et al., 2024). The paper also situates DIAMOND among neural generative game engines (GameGAN, Kim et al., 2020; Genie, Bruce et al., 2024; GameNGen, Valevski et al., 2024), highlighting its interactive, image-space simulation capability.
The environment is modeled as a POMDP with image observations and discrete actions. The authors build a conditional, score-based diffusion world model that generates x_{t+1} conditioned on the history of past observations and actions (x_{≤t}, a_{≤t}). They adopt the EDM formulation (Karras et al., 2022) rather than DDPM, using a Gaussian perturbation kernel with zero drift f(x, τ) = 0 and diffusion coefficient g(τ) = √(2 σ̇(τ) σ(τ)). They employ EDM preconditioning: D_θ(x^{τ}, y') = c_skip x^{τ} + c_out F_θ(c_in x^{τ}, y'), where y' includes the transformed noise level and the conditioning (past observations and actions). The training objective for F_θ adaptively mixes signal and noise prediction, E[||F_θ(c_in x_{t+1}^{τ}, y_t') − (1/c_out)(x_{t+1} − c_skip x_{t+1}^{τ})||^2], and a log-normal noise sampling schedule focuses training on medium noise levels.

Architecture: a 2D U-Net predicts the next frame; conditioning uses frame stacking of the L past observations concatenated channel-wise with the next noisy frame, and actions are injected via adaptive group normalization in the residual blocks.

Sampling: a deterministic Euler solver for the probability-flow ODE with n denoising steps (n = 3 in practice), balancing visual quality against inference cost (number of function evaluations, NFE).

Reinforcement learning in imagination: the world model is augmented with a separate reward and termination predictor R_ψ (a CNN + LSTM, to handle partial observability). An actor-critic agent (a shared CNN-LSTM with policy and value heads) is trained with REINFORCE (baseline V_φ), and the value function is learned via λ-returns. The training loop alternates between collecting data in the real environment, training the diffusion and reward/termination models on the accumulated data, and training the RL agent inside the learned world model.

Hyperparameters and procedures include: 64×64×3 image observations, discrete actions (up to 18), frame skip 4, reward clipping to {−1, 0, 1}, Euler sampling with 3 steps, imagination horizon H = 15, discount γ = 0.985, entropy weight η = 0.001, λ = 0.95, AdamW with learning rate 1e−4, and burn-in of LSTM states before imagination. Minimal sketches of the core computations follow below.
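As a concrete illustration of the EDM preconditioning and training target described above, the following is a minimal PyTorch-style sketch, not the authors' code: the network F, the conditioning tensor cond, and sigma_data are assumptions, and the coefficients follow the Karras et al. (2022) defaults rather than DIAMOND's exact settings.

import torch

def edm_coefficients(sigma, sigma_data=0.5):
    # Preconditioning weights from Karras et al. (2022); sigma plays the role of the
    # noise level tau in the summary above, and sigma_data is an assumed constant.
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_in = 1.0 / (sigma ** 2 + sigma_data ** 2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def edm_training_loss(F, x_next, cond, sigma):
    # x_next: clean next frame x_{t+1}, shape [B, C, H, W]; cond: past frames and
    # actions y_t'; sigma: per-sample noise levels, shape [B], drawn log-normally.
    c_skip, c_out, c_in, c_noise = edm_coefficients(sigma)
    c_skip = c_skip.view(-1, 1, 1, 1)
    c_out = c_out.view(-1, 1, 1, 1)
    c_in = c_in.view(-1, 1, 1, 1)
    noise = torch.randn_like(x_next)
    x_noisy = x_next + sigma.view(-1, 1, 1, 1) * noise        # perturbed frame x_{t+1}^tau
    target = (x_next - c_skip * x_noisy) / c_out              # (1/c_out)(x_{t+1} - c_skip x_{t+1}^tau)
    pred = F(c_in * x_noisy, c_noise, cond)                   # network output F_theta
    return torch.mean((pred - target) ** 2)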
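The deterministic Euler sampler for the probability-flow ODE can be sketched in the same spirit; the denoiser signature D(x, sigma, cond) and the noise schedule values are assumptions for illustration.

import torch

@torch.no_grad()
def sample_next_frame(D, cond, sigmas, shape, device="cpu"):
    # Euler solver for the probability-flow ODE in the zero-drift case, where
    # dx/dsigma = (x - D(x, sigma, cond)) / sigma. `sigmas` is a decreasing noise
    # schedule ending at 0, e.g. 3 steps as reported for Atari.
    x = sigmas[0] * torch.randn(shape, device=device)          # start from pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        d = (x - D(x, sigma, cond)) / sigma                    # ODE drift at this noise level
        x = x + (sigma_next - sigma) * d                       # Euler step toward lower noise
    return x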
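For the value targets used in imagination, a standard bootstrapped λ-return formulation consistent with the summary is sketched below, with the quoted hyperparameters (γ = 0.985, λ = 0.95, H = 15); the exact formulation in the authors' implementation may differ.

def lambda_returns(rewards, values, dones, gamma=0.985, lam=0.95):
    # rewards, dones: lists of length H from an imagined rollout; values: list of
    # length H + 1 (includes the bootstrap value of the final imagined state).
    H = len(rewards)
    returns = [0.0] * H
    next_return = values[H]
    for t in reversed(range(H)):
        bootstrap = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * (1.0 - dones[t]) * bootstrap
        next_return = returns[t]
    return returns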
DIAMOND achieves a mean human-normalized score (HNS) of 1.46 and an IQM of 0.64 on Atari 100k over 26 games, outperforming other agents trained entirely within world models (mean HNS: SimPLe 0.332, TWM 0.956, IRIS 1.046, DreamerV3 1.097, STORM 1.266). It is superhuman on 11 games. Gains are most notable on games where small visual details matter (e.g., Asterix: 3698.5, Breakout: 132.5, RoadRunner: 20673.2). Agent training used 5 seeds per game, ~12 GB of VRAM, and ~2.9 days per run on a single RTX 4090 (1.03 GPU years in total). An analysis of diffusion frameworks shows that EDM yields far more stable long-horizon rollouts than DDPM when few denoising steps are used (n ≤ 10), avoiding compounding error; EDM produces stable trajectories even with n = 1, while DDPM drifts quickly. Although single-step sampling is possible, multimodal posteriors under partial observability benefit from iterative sampling; n = 3 provided a robust balance across games. A qualitative comparison with IRIS indicates that DIAMOND's imagined frames are visually consistent across time, correctly maintaining rewards, enemies, bricks, and scores, whereas IRIS exhibits frame-to-frame inconsistencies that can mislead RL credit assignment. Scaling to CS:GO: a 381M-parameter diffusion dynamics model plus a 51M-parameter upsampler, trained on 87 hours of gameplay for 12 days on an RTX 4090 and running at ~10 Hz on an RTX 3090. The model generates stable, interactive trajectories, with some drift in less-visited map areas and memory limitations. In additional visual-quality evaluations on CS:GO and motorway driving datasets, DIAMOND (frame-stack) outperforms DreamerV3 and IRIS variants on FID/FVD/LPIPS (e.g., CS:GO FID 9.6, FVD 34.8, LPIPS 0.107), consistent with diffusion's strengths for high-fidelity visual modeling.
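For reference, the human-normalized score aggregated in these results is the standard Atari metric; a minimal sketch follows, using commonly cited reference values for Breakout as an illustration (they are not taken from this paper).

def human_normalized_score(agent_score, random_score, human_score):
    # Standard Atari normalization: 0 corresponds to random play, 1 to the human
    # reference score; values above 1 are superhuman.
    return (agent_score - random_score) / (human_score - random_score)

# Illustrative only: commonly used Breakout reference values (random ~1.7, human ~30.5),
# paired with the Breakout score quoted above.
print(human_normalized_score(132.5, 1.7, 30.5))   # ~4.5, i.e. superhuman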
The findings support the hypothesis that preserving visual details via diffusion-based world models improves downstream RL performance, especially in environments where pixel-level semantics (e.g., rewards, enemies, brick states, score counters) affect action selection. EDM’s adaptive training objective stabilizes generation with few denoising steps, reducing inference costs and compounding error, which is critical for long imagined rollouts in world-model training. The improved temporal visual consistency observed relative to discrete-token world models (IRIS) likely enhances credit assignment and policy learning. Operating directly in image space enables drop-in environment replacement, yielding interpretable agent–world interactions and demonstrating broader applicability to complex 3D settings (CS:GO) and real-world-like driving. While DIAMOND focuses on world-model-only agents, the approach is complementary to search-based or more advanced RL strategies (e.g., MCTS in EfficientZero, scheduling in BBF), suggesting potential for hybrid systems.
DIAMOND introduces diffusion as a practical and effective world model for training RL agents entirely in imagination, achieving state-of-the-art mean HNS on Atari 100k among world-model agents. Key design choices—EDM preconditioning, low NFE with stable sampling, and image-space conditioning—lead to temporally consistent visuals that improve agent performance. The work demonstrates extensibility to complex 3D environments and interactive neural game engines. Future directions include application to continuous control domains, incorporating longer-term memory via autoregressive transformers over environment time, and integrating reward/termination predictions into the diffusion framework. The broader impact emphasizes improved sample efficiency and safety via faithful simulators, while acknowledging risks from imperfect models and advocating for continued development of interactive, accurate world models.
Primary limitations include: (1) Evaluation focused on discrete-action environments (Atari); extending to continuous control may yield further insights. (2) Conditioning via frame stacking is a minimal memory mechanism; integrating an autoregressive transformer over environment time could enable longer-term memory and better scalability. Initial cross-attention investigations were less effective than frame stacking. (3) Reward and termination prediction are modeled separately from the diffusion model; integrating them into a unified framework is non-trivial and left for future work. Additional limitations observed in CS:GO include drift in less-visited map regions, limited memory that occasionally produces incorrect scene elements, and causal confusion under atypical action sequences stemming from purely offline training.