Computer Science
Diffusion for World Modeling: Visual Details Matter in Atari
E. Alonso, A. Jelley, et al.
The paper addresses the limitations of existing world models that compress environment dynamics into discrete latent sequences, potentially losing critical visual details that affect reinforcement learning policies. It proposes using diffusion models—known for high-fidelity image generation, strong conditioning, and robustness to mode collapse—as world models to better capture visually important details. The authors introduce DIAMOND, a diffusion-based world model and RL agent trained entirely in imagination, aiming to improve stability over long horizons with few denoising steps and to enhance agent performance by preserving visual fidelity. The study evaluates DIAMOND on Atari 100k to test sample efficiency and generalization and demonstrates a standalone neural game engine trained on Counter-Strike: Global Offensive gameplay.
World models: Prior work includes SimPLe (Kaiser et al., 2019), which introduced the Atari 100k benchmark for sample efficiency, and Dreamer, DreamerV2, and DreamerV3 (Hafner et al., 2020; 2021; 2023), which use recurrent state-space models (RSSMs), with discrete latents in V2/V3, to reduce compounding error and achieve strong performance across domains. TWM (Robine et al., 2023) and STORM (Zhang et al., 2023) adapt the RSSM with transformer architectures. IRIS (Micheli et al., 2023) uses a discrete autoencoder with an autoregressive transformer over image tokens. Generative vision: Discrete-token image/text models such as VQGAN (Esser et al., 2021) and DALL-E (Ramesh et al., 2021) leverage transformers; diffusion models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Song et al., 2020) and latent diffusion (Rombach et al., 2022; Podell et al., 2023) dominate high-resolution generation; video diffusion (Ho et al., 2022; Singer et al., 2023; Bar-Tal et al., 2024) and transformer-based diffusion (Peebles and Xie, 2023; Brooks et al., 2024) scale well. Diffusion in RL: Diffusion has been used for policies (Wang et al., 2022; Ajay et al., 2022; Pearce et al., 2023), planners (Janner et al., 2022; Liang et al., 2023), reward models (Nuti et al., 2023), and offline data augmentation (Lu et al., 2023; Ding et al., 2024; Jackson et al., 2024). DIAMOND is the first to use a diffusion model as a world model for online training in imagination. Generative game engines: GameGAN (Kim et al., 2020), the Neural Game Engine (Bamford and Lucas, 2020), and recent works such as Genie (Bruce et al., 2024) and GameNGen (Valevski et al., 2024) demonstrate playable neural environments.
Preliminaries: The environment is a POMDP with image observations; world models generatively model p(s_{t+1}, r_t | s_t, a_t) and enable imagination-based training cycles. Diffusion foundations: Score-based diffusion trains a time-dependent score model S_θ(x, τ) to reverse a noising process described by an SDE, enabling generation by solving the reverse-time SDE. Denoising score matching yields an L2 reconstruction loss via a known Gaussian perturbation kernel. Diffusion for world modeling: The model is conditioned on past observations and actions to estimate p(x_{t+1} | x_{≤t}, a_{≤t}). Training samples trajectory segments from the replay buffer; the noised next observation is obtained via the perturbation kernel. Sampling the next observation amounts to iteratively solving the reverse-time SDE, with a trade-off between visual quality and the Number of Function Evaluations (NFE).

Practical diffusion paradigm: DIAMOND adopts the EDM framework (Karras et al., 2022) rather than DDPM. The perturbation kernel is Gaussian with a noise schedule σ(τ); the drift and diffusion coefficients are set to f(x, τ) = 0 and g(τ) = √(2 σ̇(τ) σ(τ)). Network preconditioning mixes signal and noise adaptively: D_θ(x_{t+1}^τ, y_t^τ) = c_skip x_{t+1}^τ + c_out F_θ(c_in x_{t+1}^τ, y_t^τ), where y_t^τ includes c_noise, the past observations, and actions. The training objective for F_θ is an L2 loss targeting (x_{t+1}^0 − c_skip x_{t+1}^τ)/c_out, which keeps the target well scaled even at extreme noise levels (a minimal sketch of this preconditioning and loss is given below). Noise levels are sampled from a log-normal distribution to focus training on medium noise levels.

Architecture: A standard 2D U-Net (Ronneberger et al., 2015) is used for F_θ; conditioning uses frame stacking, concatenating the L past observations channel-wise with the next noisy observation, while actions are injected via adaptive group normalization in the residual blocks. Sampling: Euler's method provides efficient deterministic sampling at low NFE; in practice DIAMOND uses n = 3 denoising steps per frame (see the sampling sketch below).

RL in imagination: A separate reward/termination model R_ψ (CNN + LSTM) predicts the scalar reward and episode termination under partial observability. The RL agent uses a shared CNN-LSTM actor-critic: the policy π_φ is trained with REINFORCE using a value baseline and entropy regularization, and the value function V_φ is trained with a Bellman error on λ-returns (sketched below). The training loop alternates between (1) collecting real environment data, (2) updating the diffusion world model on all collected data, and (3) training the RL agent entirely in imagination within the updated world model.

Hyperparameters: Observations are 64×64×3; discrete actions (up to 18); frameskip 4; rewards clipped to {−1, 0, 1}; Euler sampling with 3 steps; imagination horizon H = 15; discount γ = 0.985; entropy weight η = 0.001; λ = 0.95; AdamW optimizer (lr 1e-4, ε 1e-8) with weight decay 1e-2 for D_θ and R_ψ; conditioning/burn-in length L = 4 frames; batch size 32; 1000 epochs with 400 training steps per epoch and 100 environment steps per epoch; ε-greedy collection with ε = 0.01. Algorithm 1 summarizes the full procedure, including burn-in of the LSTM states and rollouts in imagination.
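To make the preconditioning concrete, here is a minimal PyTorch-style sketch of the EDM coefficients, the denoiser D_θ, and the L2 objective for F_θ. The network F_theta, the value of SIGMA_DATA, and the conditioning tensor cond (stacked past frames and actions) are assumptions for illustration, not DIAMOND's exact implementation.

```python
# Sketch of EDM-style preconditioning (Karras et al., 2022) for the denoiser D_theta.
import torch

SIGMA_DATA = 0.5  # assumed standard deviation of the (scaled) clean observations

def precondition(sigma):
    """EDM coefficients as functions of the noise level sigma (a tensor)."""
    c_skip = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
    c_out = sigma * SIGMA_DATA / (sigma**2 + SIGMA_DATA**2).sqrt()
    c_in = 1.0 / (sigma**2 + SIGMA_DATA**2).sqrt()
    c_noise = sigma.log() / 4.0
    return c_skip, c_out, c_in, c_noise

def denoise(F_theta, x_noisy, sigma, cond):
    """D_theta(x^tau) = c_skip * x^tau + c_out * F_theta(c_in * x^tau, c_noise, cond)."""
    c_skip, c_out, c_in, c_noise = precondition(sigma)
    return c_skip * x_noisy + c_out * F_theta(c_in * x_noisy, c_noise, cond)

def edm_loss(F_theta, x_clean, sigma, cond):
    """L2 loss on F_theta against the target (x^0 - c_skip * x^tau) / c_out."""
    c_skip, c_out, c_in, c_noise = precondition(sigma)
    x_noisy = x_clean + sigma * torch.randn_like(x_clean)  # Gaussian perturbation kernel
    target = (x_clean - c_skip * x_noisy) / c_out           # adaptively mixed training target
    pred = F_theta(c_in * x_noisy, c_noise, cond)
    return ((pred - target) ** 2).mean()
```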
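Generating the next frame then reduces to a few deterministic Euler steps of the reverse-time ODE. The sketch below reuses the denoise() helper from the previous snippet; the noise schedule sigmas is an assumed example, not the schedule used in the paper.

```python
# Sketch of deterministic Euler sampling with a small number of denoising steps.
import torch

@torch.no_grad()
def sample_next_frame(F_theta, cond, sigmas, shape):
    """Solve the reverse-time ODE with Euler steps: x <- x + (sigma_next - sigma) * d."""
    x = torch.randn(shape) * sigmas[0]                # start from pure noise at sigma_max
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = denoise(F_theta, x, sigma, cond)   # D_theta at the current noise level
        d = (x - denoised) / sigma                    # ODE derivative from the denoiser
        x = x + (sigma_next - sigma) * d              # Euler step towards the next noise level
    return x                                          # predicted next observation x_{t+1}

# Example schedule: 3 denoising steps from sigma_max down to 0 (values are placeholders).
sigmas = torch.tensor([80.0, 2.0, 0.1, 0.0])
```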
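For the actor-critic trained in imagination, the λ-return targets and the REINFORCE-with-baseline losses can be sketched as follows. Tensor names and shapes (per-step rewards, values with one extra bootstrap entry, termination flags, log-probabilities, entropies) are assumptions for illustration.

```python
# Sketch of lambda-return targets and actor-critic losses for imagination training.
import torch

def lambda_returns(rewards, values, dones, gamma=0.985, lam=0.95):
    """G_t = r_t + gamma * (1 - d_t) * ((1 - lam) * V_{t+1} + lam * G_{t+1}),
    bootstrapped from the last value estimate at the end of the imagined horizon."""
    H = rewards.shape[0]                      # imagination horizon (H = 15 in DIAMOND)
    returns = torch.zeros_like(rewards)
    next_return = values[-1]                  # bootstrap from V at step H
    for t in reversed(range(H)):
        blended = (1 - lam) * values[t + 1] + lam * next_return
        next_return = rewards[t] + gamma * (1 - dones[t]) * blended
        returns[t] = next_return
    return returns

def actor_critic_losses(log_probs, entropies, values, returns, eta=0.001):
    """REINFORCE with a value baseline, entropy regularization, and a value regression loss."""
    advantages = (returns - values[:-1]).detach()
    policy_loss = -(log_probs * advantages).mean() - eta * entropies.mean()
    value_loss = ((values[:-1] - returns.detach()) ** 2).mean()
    return policy_loss, value_loss
```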
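Finally, a high-level sketch of the alternating procedure summarized in Algorithm 1: each epoch collects a small amount of real experience, updates the world model and the reward/termination model on the replay buffer, and then trains the actor-critic entirely in imagination. All helper functions and their signatures below are assumptions standing in for DIAMOND's actual components.

```python
# High-level sketch of the alternating training loop (collect -> world model -> imagination RL).
def train_diamond(env, world_model, reward_model, agent, buffer,
                  epochs=1000, env_steps=100, train_steps=400, horizon=15):
    for epoch in range(epochs):
        # (1) Collect real environment experience with the current policy (eps-greedy, eps = 0.01).
        collect_experience(env, agent, buffer, num_steps=env_steps, epsilon=0.01)

        # (2) Update the diffusion world model D_theta and the reward/termination model R_psi.
        for _ in range(train_steps):
            batch = buffer.sample_segments()          # short trajectory segments from replay
            update_world_model(world_model, batch)    # EDM denoising loss
            update_reward_model(reward_model, batch)  # reward + termination losses

        # (3) Train the actor-critic purely in imagination inside the updated world model.
        for _ in range(train_steps):
            init = buffer.sample_segments()           # burn-in frames to seed the rollout
            rollout = imagine(world_model, reward_model, agent, init, horizon=horizon)
            update_actor_critic(agent, rollout)       # REINFORCE + lambda-return value loss
```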
DIAMOND achieves a mean human-normalized score (HNS) of 1.46 and an interquartile mean (IQM) of 0.64 on Atari 100k over 5 seeds, outperforming other agents trained entirely within world models (SimPLe, TWM, IRIS, DreamerV3, STORM), and is superhuman on 11 of 26 games. DIAMOND particularly excels on games where small visual details matter (e.g., Asterix, Breakout, Road Runner). Compared to IRIS, DIAMOND generates visually consistent trajectories without frame-to-frame inconsistencies (e.g., enemies vs. rewards, score displays), despite fewer NFE (3 vs. 16) and fewer parameters (~13M vs. ~30M). Choice of diffusion framework: EDM-based diffusion is substantially more stable than DDPM at low denoising steps (n ≤ 10), with reduced compounding error over long horizons (up to t = 1000). Single-step sampling can blur multimodal predictions; multi-step sampling (n = 3) drives samples to a specific mode and improves sharpness (e.g., Boxing). Compute: Each Atari run uses ~12 GB of VRAM and takes ~2.9 days on a single RTX 4090; total training across games and seeds amounts to ~1.03 GPU years. Performance profiles show DIAMOND's mean and IQM to be the highest among world-model baselines. CS:GO scaling: A 381M-parameter diffusion world model (including a 51M-parameter upsampler) trained on a static dataset of 87 hours of gameplay produces interactive, stable trajectories at ~10 Hz on an RTX 3090, with occasional drift in rarely visited map regions and memory-related state resets. In extended visual quality evaluations on 3D environments (CS:GO and driving), the DIAMOND frame-stack variant outperforms DreamerV3 and IRIS variants on FID, FVD, and LPIPS while maintaining reasonable sampling rates.
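For reference, the aggregate metrics quoted above can be computed as follows; this is a generic sketch of per-game human-normalized scores and the interquartile mean across runs, with placeholder numbers rather than values from the paper.

```python
# Sketch of the aggregate Atari 100k metrics: human-normalized score (HNS) and
# interquartile mean (IQM). Score values below are placeholders, not paper results.
import numpy as np

def human_normalized_score(agent, random, human):
    """HNS = (agent - random) / (human - random)."""
    return (agent - random) / (human - random)

def interquartile_mean(scores):
    """Mean of the middle 50% of runs (drops the bottom and top 25%)."""
    scores = np.sort(np.asarray(scores).ravel())
    n = len(scores)
    return scores[n // 4 : n - n // 4].mean()

hns = human_normalized_score(agent=5000.0, random=200.0, human=7000.0)  # placeholder numbers
```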
The findings support the hypothesis that preserving visual details in world models improves reinforcement learning performance, especially in domains where small visual features critically affect decisions. By conditioning diffusion models on past observations and actions and adopting EDM’s adaptive signal-noise mixing, DIAMOND achieves stability over long horizons with few denoising steps, enabling efficient imagination-based training. The improved visual consistency compared to discrete-token world models translates to better credit assignment and policy learning. Although single-step sampling can be efficient, multi-step denoising is necessary in multimodal or partially observable settings to avoid averaging artifacts. The CS:GO experiments demonstrate that diffusion world models can act as interactive neural simulators beyond Atari, revealing limitations around memory and rare states but showcasing scalability and potential for broader applications. Techniques orthogonal to world modeling (e.g., tree search or scheduled resets) could further boost performance, suggesting promising future directions combining DIAMOND with search-based or advanced RL components.
DIAMOND introduces diffusion-based world modeling for online imagination training in reinforcement learning, with design choices (EDM framework, preconditioning, low-step Euler sampling, frame-stacked conditioning) that yield stable, high-fidelity generations and strong agent performance. It sets a new state of the art among world-model-only agents on Atari 100k (mean HNS 1.46) and demonstrates interactive, real-time neural simulation on CS:GO. Future research directions include extending to continuous control domains, integrating longer-term memory via autoregressive transformers, and unifying reward/termination modeling within the diffusion framework. As world models advance, they offer pathways to more sample-efficient and safer real-world learning, while emphasizing the need to mitigate risks from model imperfections.
Primary limitations include: (1) Focus on discrete control environments; extension to continuous control remains to be explored. (2) Conditioning via frame stacking provides only short-term memory; integrating longer-term memory mechanisms (e.g., autoregressive transformers or cross-attention) could improve scalability and temporal coherence. (3) Reward and termination modeling is separated from the diffusion model; integrating these within a unified diffusion representation is non-trivial and left for future work. Additional practical limitations observed in CS:GO include drift in less frequently visited areas, limited memory causing state resets, and generalized but incorrect behaviors (e.g., successive jumps), which may be alleviated by scaling data and model capacity.