Computer Science
Operator World Models for Reinforcement Learning
P. Novelli, M. Pontil, et al.
Policy Mirror Descent (PMD) is a powerful policy-optimization framework, but it is hard to apply directly in Reinforcement Learning because the action-value functions it requires are not directly accessible. This work learns a world model via conditional mean embeddings and, using tools from operator theory, derives closed-form action-value estimates as matrix operations on that model. Combining these estimates with PMD yields POWR, an RL algorithm with a proven rate of global convergence. This research was conducted by Pietro Novelli, Massimiliano Pontil, Marco Pratticò, and Carlo Ciliberto.
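The PMD step underlying this approach has a simple closed form once action-value estimates are available: the new policy reweights the old one by exponentiated action values. Below is a minimal NumPy sketch of that update for a tabular policy; the function name, step size, and toy numbers are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pmd_update(policy, q_values, eta):
    """One Policy Mirror Descent step (KL-regularized):
    pi_new(a|s) proportional to pi(a|s) * exp(eta * Q(s, a)).
    Illustrative sketch only; not the paper's implementation."""
    logits = np.log(policy) + eta * q_values
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    new_policy = np.exp(logits)
    return new_policy / new_policy.sum(axis=1, keepdims=True)

# Toy example: 2 states, 3 actions, uniform initial policy.
policy = np.full((2, 3), 1.0 / 3.0)
q = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
updated = pmd_update(policy, q, eta=1.0)
```

Each update shifts probability mass toward actions with higher estimated value while staying close (in KL divergence) to the previous policy; in POWR, the `q_values` come from the closed-form estimates computed on the learned world model.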