Operator World Models for Reinforcement Learning

P. Novelli, M. Pontil, et al.

Policy Mirror Descent (PMD) is powerful but hard to apply in reinforcement learning because action-value functions are not directly accessible. This work learns a world model via conditional mean embeddings and, using operator-theoretic matrix operations, derives closed-form action-value estimates. Combining these with PMD yields POWR, an RL algorithm with proven global convergence. This research was conducted by Pietro Novelli, Massimiliano Pontil, Marco Pratticò, and Carlo Ciliberto.
Introduction

The paper tackles the challenge of applying Policy Mirror Descent (PMD) to reinforcement learning when the action-value function qπ is unknown and cannot be computed exactly. In RL, exploration-exploitation trade-offs and unknown transitions and rewards make directly evaluating qπ infeasible. Prior inexact PMD results require uniform sup-norm accuracy of qπ approximations for all intermediate policies, often relying on unrealistic sampling assumptions (e.g., starting the MDP from arbitrary states or repeated re-learning per policy). The authors propose learning a model of the environment (world model) comprising the transition operator and reward function, then expressing any policy’s action-value function via operator identities. The research question is whether PMD can be efficiently deployed in RL with strong theoretical guarantees by estimating qπ through an operator world model learned from data, avoiding repeated sampling per policy and enabling convergence guarantees.

Literature Review

The work builds on advances in policy optimization and RL (Q-learning, policy gradients, actor-critic) and on the theoretical study of PMD establishing global convergence at sublinear and linear rates. Inexact PMD (Xiao, 2022) extends convergence guarantees to approximate updates but demands uniform sup-norm accuracy, typically requiring extensive sampling. World models (Ha & Schmidhuber; Hafner et al.) learn simulators in latent spaces but introduce both model and sampling errors. Conditional Mean Embeddings (CMEs) provide a sampling-free framework for estimating conditional expectations with learning bounds. Prior CME-based RL includes value iteration via RKHS embeddings and linear mixture MDP approaches (optimistic planning and value-targeted regression). The paper situates its method within linear MDPs, possibly with infinite latent dimension, connecting operator-theoretic formulations of RL with CME estimators to compute qπ in closed form.

Methodology

The authors develop an operator-based formulation of RL together with a CME-based strategy for learning the world model.

Operator formulation. Two operators are central: the transfer operator T, which maps bounded functions on states to bounded functions on state-action pairs via conditional expectation under the transition kernel, and the policy operator Pπ, which maps bounded functions on state-action pairs to bounded functions on states via conditional expectation under the policy. The action-value function then admits the operator representation qπ = (Id − γ T Pπ)^{-1} r, and the RL objective is J(π) = ⟨Pπ qπ, ν⟩.

World model learning via CMEs. Under a linear MDP assumption restricting T to a Hilbert-Schmidt operator between feature spaces (F, G), with feature maps ϕ for states and ψ for state-action pairs, the conditional mean embedding of the transition kernel is learned by ridge regression over samples (xi, ai, x′i): Tn = Sn* Kλ^{-1} Zn, where Sn and Zn are the evaluation operators on ψ(xi, ai) and ϕ(x′i), respectively, and Kλ is the regularized Gram matrix over ψ. The reward function is learned analogously via ridge regression: rn = Sn* b with b = Kλ^{-1} y.

Estimating qπ. For a policy π compatible with (G, F) (i.e., Pπ maps G to F and is bounded), the authors derive a kernelized estimator q̂π = (Id − γ Tn Pπ)^{-1} rn = Sn* (Id − γ B Mπ)^{-1} b, where B and b come from the CME/ridge estimators and Mπ = Zn Pπ Sn* is an n×n matrix computable from evaluations of π on the data, without explicit knowledge of the operators.

Algorithm (POWR). The method runs in two phases: learn the world model (Tn, rn) from replay data, then perform PMD updates with the KL divergence (equivalent to Natural Policy Gradient). The PMD update has a closed form: πt+1(x) = Softmax(log π0(x) + η Σ_{s=0}^{t} q̂πs(x, ·)). In practice, one computes Mπt+1 from the data, solves the linear system c = (Id − γ Kλ^{-1} Mπt+1)^{-1} b, and updates cumulative weights to form q̂ and the Softmax policy. Feature spaces are chosen as separable tensor products (F = H⊗H, G = R^{|A|}⊗H), with policies represented in H, to ensure (G, F)-compatibility.

Theory. Theorem 5 and Corollary 6 show that PMD iterates remain in the Sobolev space H = W^{2,s}(X), ensuring compatibility and well-definedness of the estimator. Theorem 7 proves convergence of inexact PMD under decreasing estimation errors, yielding J(π*) − J(πT) ≤ εT + O(1/T + (1/T) Σ_{t≤T} εt). Lemma 8 bounds the sup-norm error of q̂π − qπ in terms of the errors in rn and Tn via a simulation lemma. Theorem 9 combines fast learning rates for CME and ridge regression under a strong source condition to obtain convergence of POWR: J(π*) − J(πT) ≤ O(1/T + δ² n^{−α}) with high probability, where the exponent α depends on the smoothness β.
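
To make the estimator concrete, the following is a minimal NumPy sketch of the ridge-regression world model and of the n×n linear system behind q̂π, assuming a Gaussian kernel on state-action features; the names (rbf, fit_world_model, q_coefficients) are illustrative, not the authors' code.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """Gram matrix of a Gaussian kernel between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def fit_world_model(SA, y, lam=1e-3):
    """Ridge-regression pieces of the world model.
    SA: (n, d) state-action features psi(x_i, a_i); y: (n,) observed rewards.
    Returns the regularized Gram matrix K_lam and reward coefficients b = K_lam^{-1} y."""
    n = len(SA)
    K_lam = rbf(SA, SA) + n * lam * np.eye(n)
    b = np.linalg.solve(K_lam, y)
    return K_lam, b

def q_coefficients(K_lam, M_pi, b, gamma=0.99):
    """Coefficients c of the estimator q_hat_pi = S_n^* (Id - gamma * B * M_pi)^{-1} b.
    M_pi is the n x n matrix of pi-averaged kernel evaluations on the successor
    states; it is the only quantity recomputed between PMD iterates."""
    n = len(b)
    c = np.linalg.solve(np.eye(n) - gamma * np.linalg.solve(K_lam, M_pi), b)
    return c  # q_hat_pi(x, a) = sum_i c_i * k((x, a), (x_i, a_i))
```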

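The PMD step itself then reduces to accumulating action-value weights and applying a Softmax. A hedged sketch for a finite action set, again with illustrative names, is:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def pmd_step(log_pi0, cum_q, q_hat_t, eta=0.5):
    """One POWR policy update at a state x:
    pi_{t+1}(x) = Softmax(log pi_0(x) + eta * sum_{s<=t} q_hat_pi_s(x, .)).
    log_pi0: (|A|,) log-probabilities of the initial policy at x.
    cum_q:   (|A|,) running sum of past action-value estimates at x.
    q_hat_t: (|A|,) current estimate q_hat_pi_t(x, .)."""
    cum_q = cum_q + q_hat_t
    return softmax(log_pi0 + eta * cum_q), cum_q
```

Only the cumulative weights need storing between iterates; each new policy then supplies a fresh Mπt+1 for the linear system above.
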
Key Findings
  • Operator world-model formulation: qπ can be expressed as (Id − γ T Pπ)^{-1} r, enabling closed-form computation from learned operators and avoiding resampling per policy (a toy numerical check of this identity follows the list).
  • CME-based estimators: q̂π is computed via solving an n×n linear system using kernel matrices over training data, leveraging conditional mean embeddings for transitions and ridge regression for rewards.
  • Well-definedness: PMD iterates remain in Sobolev spaces W^{2,s}(X), ensuring policy-operator compatibility for applying the estimator to finite and infinite state spaces.
  • Convergence of inexact PMD: With estimation errors εt decreasing (e.g., εt = O(1/t)), PMD achieves J(π*) − J(πT) = O(log T / T). General bound: J(π*) − J(πT) ≤ εT + O(1/T + (1/T) Σ_{t≤T} εt).
  • Error decomposition: Sup-norm error in q̂π bounded by terms depending on ||rn − r|| and ||Tn − T|| (Hilbert-Schmidt norm), scaled by factors of γ and 1−γ.
  • Sample complexity: Under strong source condition, with n samples, high-probability bound J(π*) − J(πT) ≤ O(1/T + δ^2 n^{−α}), with α ∈ (β/(2+2β), β/(1+2β)).
  • Empirical results: On Gym environments (FrozenLake-v1, Taxi-v3, MountainCar-v0), POWR demonstrates superior sample efficiency compared to A2C, DQN, TRPO, and PPO across seven runs, often reaching success thresholds faster. Some instability is observed early in training, likely due to the exploration-exploitation interplay.
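
As a sanity check on the first finding, the toy script below writes T and Pπ as matrices for a small random finite MDP and verifies that (Id − γ T Pπ)^{-1} r agrees with iterative policy evaluation; this is an illustration under simplifying assumptions, not the paper's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# T as a matrix: |S||A| x |S| transition probabilities.
T = rng.random((S * A, S)); T /= T.sum(1, keepdims=True)
pi = rng.random((S, A)); pi /= pi.sum(1, keepdims=True)  # a fixed stochastic policy
r = rng.random(S * A)  # reward on state-action pairs

# P_pi as a matrix: (P_pi f)(x) = sum_a pi(a|x) f(x, a).
P_pi = np.zeros((S, S * A))
for x in range(S):
    P_pi[x, x * A:(x + 1) * A] = pi[x]

# Closed form vs. iterative policy evaluation q <- r + gamma * T (P_pi q).
q_closed = np.linalg.solve(np.eye(S * A) - gamma * T @ P_pi, r)
q_iter = np.zeros(S * A)
for _ in range(2000):
    q_iter = r + gamma * T @ (P_pi @ q_iter)
assert np.allclose(q_closed, q_iter)
```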
Discussion

The proposed operator world-model approach addresses the core hurdle of PMD in RL: computing action-value functions for changing policies without direct access to environment dynamics. By learning transition and reward operators via CMEs and expressing qπ through operator inversion, the method eliminates repeated resampling for each policy iterate and provides controllable error bounds. The theoretical results show that, under regularity conditions, inexact PMD with CME-based q̂π converges to globally optimal policies with polynomial rates and extends to infinite state spaces via Sobolev/RKHS constructions. Empirically, the method achieves strong sample efficiency against standard baselines, supporting the claim that operator world models combined with PMD can effectively balance planning with limited data. The approach is relevant for model-based RL where exact simulators are impractical and supports planning in high-dimensional or continuous domains.

Conclusion

The paper introduces POWR, combining Policy Mirror Descent with operator world models learned via Conditional Mean Embeddings. It provides an operator formulation of RL, closed-form qπ estimators from learned reward and transition operators, and convergence guarantees to the global optimum under regularity assumptions. Empirical evaluations on classic Gym environments show improved sample efficiency compared to strong baselines. Future directions include extending PMD to infinite action spaces, improving scalability with approximate CME estimators (e.g., Nyström, reduced-rank regressors), studying alternating exploration/exploitation schedules, and developing representation learning to ensure policy compatibility beyond Sobolev spaces.

Limitations
  • Requires finite action spaces for the presented PMD with KL-divergence update; extending to infinite action spaces needs approximations (e.g., Monte Carlo) and further convergence analysis.
  • Theoretical guarantees rely on strong regularity assumptions, including the strong source condition for uniform-norm learning rates and Sobolev/RKHS structure for policy compatibility.
  • Computational scalability may be limited by the n×n linear system solves; practical deployment in large-scale environments likely requires approximations such as Nyström subsampling, sketching, or reduced-rank operators (a Nyström sketch follows this list).
  • Empirical instability early in training suggests sensitivity to exploration-exploitation scheduling and hyperparameters; alternating phases and adaptive strategies warrant study.
  • World model accuracy is critical; mismatch between sampling distribution and target policies can impact estimator quality despite operator-based formulation.
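
On the scalability point, one of the approximations the authors mention can be sketched as a standard Nyström (subset-of-regressors) variant of the reward ridge regression, replacing the n×n solve with an m×m one; this formulation is assumed here for illustration rather than taken from the paper.

```python
import numpy as np

def nystrom_ridge(SA, y, m=100, lam=1e-3, sigma=1.0, seed=0):
    """Nystrom (subset-of-regressors) ridge regression for the reward model.
    Solves an m x m system instead of the full n x n one."""
    rng = np.random.default_rng(seed)
    n = len(SA)
    idx = rng.choice(n, size=min(m, n), replace=False)  # landmark indices
    d2 = ((SA[:, None, :] - SA[idx][None, :, :]) ** 2).sum(-1)
    K_nm = np.exp(-d2 / (2 * sigma**2))  # n x m cross-kernel
    K_mm = K_nm[idx]                     # m x m kernel between landmarks
    # Normal equations: (K_nm^T K_nm + n * lam * K_mm) beta = K_nm^T y.
    A = K_nm.T @ K_nm + n * lam * K_mm
    beta = np.linalg.solve(A + 1e-10 * np.eye(len(idx)), K_nm.T @ y)
    return idx, beta  # r_hat(x, a) ~ sum_j beta_j * k((x, a), landmark_j)
```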