Generating Code World Models with Large Language Models Guided by Monte Carlo Tree Search

N. Dainese, M. Alakuijala, et al.

This work presents Code World Models—world models generated as Python code by LLMs for model-based RL—alongside GIF-MCTS, a new code-generation strategy, and the Code World Models Benchmark (CWMB). GIF-MCTS outperforms baselines and yields models that enable planning with much better sample efficiency and faster inference. Research conducted by Nicola Dainese, Minttu Alakuijala, Matteo Merler, and Pekka Marttinen.

Introduction
The paper addresses how to best leverage large language models for world modeling in reinforcement learning by generating precise, executable Python simulators called Code World Models (CWMs) instead of querying LLMs at inference time. The research question is whether LLM-guided code synthesis can produce accurate, interpretable, and fast world models suitable for model-based planning, and how to effectively guide LLMs to generate, debug, and refine long programs using environment feedback. The authors propose GIF-MCTS, a Monte Carlo Tree Search-driven code generation method that iteratively generates, improves, and fixes CWMs based on unit-test-like feedback from offline trajectories and environment descriptions. They introduce the CWMB benchmark of 18 diverse RL environments (discrete and continuous) with textual descriptions and curated transitions to evaluate code synthesis and planning performance. The motivation is that code is faster, cheaper, and more reliable for repeated planning than calling LLMs, while preserving interpretability; the goal is to achieve sample-efficient, high-speed model-based agents using synthesized CWMs.
Literature Review
The related work situates CWMs among programmatic world modeling and LLM-based code generation. Prior programmatic world models include AutumnSynth with a custom language and finite-state automata, and PDDL formulations requiring access to predicates and high-level planning, as well as probabilistic-code models. The most similar concurrent work, WorldCoder, uses LLMs to synthesize Python world models refined via Thompson Sampling; however, it focuses on online learning in grid-worlds and transfer within task variants, while this work studies broader discrete/continuous environments, offline data, and proposes a distinct MCTS-based synthesis pipeline. For code generation with LLMs, methods like Chain-of-Thought and tree-like prompting improve reasoning; MCTS-guided token-level approaches are impractical for long code. LATS uses MCTS with self-reflection but generates multiple programs in parallel and lacks code-specific fixing strategies. Pseudocode approaches like Parsel decompose problems but do not focus on producing directly executable code for RL world models. This work targets sequential refinement of long Python programs using unit-test feedback with specialized generate/improve/fix actions.
Methodology
Methodology
Framework: Environments are modeled as deterministic, fully observable MDPs with transition s' = f(s, a) and reward R(s, a, s'). A natural-language description of each environment specifies the observation and action spaces and the dynamics and reward rules. An offline dataset D of single-step transitions (s, a, r, s', d) serves as validation data, used like unit tests. The goal is to synthesize a Python Environment class with methods __init__, set_state, and step that predicts (s', r, d) for a given (s, a). Accuracy A is the fraction of correctly predicted next states, rewards, and done flags across D, with each of the three components weighted 1/3.
Benchmark (CWMB): 18 Gymnasium-derived RL environments spanning discrete and continuous control (classic control, PyGame physics, MuJoCo tasks). For each environment, D contains a small set of curated trajectories (typically 5 random and 5 suboptimal demonstrations), covering both low and relatively high returns; each transition is a validation sample. Each environment also comes with a textual description distilled from the Gymnasium documentation.
GIF-MCTS algorithm: The search tree's nodes are programs and its edges are actions. Each LLM call produces a complete program, which is split into a state (a partial program extended by appending L = 2 lines) and a rollout (the remaining completion), so the full program can be evaluated by running it. Each iteration has four phases: selection via a modified UCT that balances exploration across action types and penalizes repeated expansions of the same type; expansion by prompting the LLM; evaluation by running the unit tests to compute accuracy A; and backpropagation of values. For unvisited actions, a simple linear model predicts node value from local and global per-action-type averages. Three specialized actions are used:
- Generate new lines: stochastically extend the code to explore diverse completions of a partial program; this provides breadth.
- Improve predictions: given the full current program and a specific failing input-output example from D, the LLM explains the error (chain of thought) and then refines the entire program, enabling non-local edits.
- Fix bugs: when syntax or runtime errors occur, the LLM receives the code and the error trace and repairs the program (with chain of thought). Buggy nodes are temporarily assigned a high value (initially 0.99) so that fixes can be explored; if repeated fixes fail, this temporary value decays linearly to zero over f = 3 attempts, preventing wasted search. Temporary values are not backpropagated.
Action selection uses a modified UCT: UCT(node_i) = v_i + C * sqrt(ln N_i / (n_{a=a_i} + eps)), where n_{a=a_i} counts sibling nodes expanded with the same action type, discouraging excessive horizontal growth with identical actions. Hyperparameters include L = 2, exploration constant C = 0.1, discount gamma = 1.0, priors for the per-action-type value models, and f = 3 allowed fixes. LLM decoding uses standard generation settings (e.g., for Llama 3: max_new_tokens 1500, temperature 1.0, top_k 100, top_p 0.8).
Planners for evaluation: a vanilla MCTS planner for discrete action spaces and a Cross-Entropy Method (CEM) planner for continuous ones. The normalized return is R(CWM) = (R(pi_CWM) - R(pi_rand)) / (R(pi_true) - R(pi_rand)), comparing CWM-based planning against random and oracle (true-model) planning under the same planner. Additional experiments cover APPS (public unit tests guide rewards; pass@B setting) and a simplified RTFM grid-world that tests language grounding.
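The Environment interface and the accuracy metric described above can be sketched as follows. This is a minimal illustration, not the paper's code: the toy 1-D point-mass dynamics inside step are an invented placeholder standing in for whatever dynamics the LLM would synthesize, while the method names (__init__, set_state, step) and the 1/3-weighted accuracy over (s', r, done) follow the paper's setup.

```python
import numpy as np

class Environment:
    """Sketch of the CWM interface: the dynamics here are a toy
    1-D point mass, used only to make the interface concrete."""

    def __init__(self):
        self.state = np.zeros(2)  # [position, velocity]

    def set_state(self, state):
        self.state = np.asarray(state, dtype=float)

    def step(self, action):
        pos, vel = self.state
        vel = vel + 0.1 * (action - 1)        # discrete action in {0, 1, 2}
        pos = pos + 0.1 * vel
        self.state = np.array([pos, vel])
        reward = -abs(pos)                    # stay near the origin
        done = abs(pos) > 2.0
        return self.state.copy(), reward, done

def accuracy(env, dataset):
    """Fraction of correct predictions over D; next state, reward,
    and done flag each contribute 1/3 per transition."""
    score = 0.0
    for s, a, r, s_next, d in dataset:
        env.set_state(s)
        pred_s, pred_r, pred_d = env.step(a)
        score += (np.allclose(pred_s, s_next)
                  + np.isclose(pred_r, r)
                  + (pred_d == d)) / 3.0
    return score / len(dataset)
```

A candidate program scoring A = 1.0 on D is accepted as the world model; lower scores become node values guiding the search.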
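The modified UCT rule above can be written down directly. The formula follows the paper; the helper names and the tuple representation of children are my own, and the near-zero eps means a node whose action type has no expanded siblings receives a very large exploration bonus.

```python
import math

def modified_uct(value, parent_visits, same_action_siblings, c=0.1, eps=1e-6):
    """Modified UCT: the exploration term divides by the count of sibling
    nodes expanded with the SAME action type (not this node's own visits),
    discouraging repeated horizontal growth with identical actions."""
    return value + c * math.sqrt(math.log(parent_visits) / (same_action_siblings + eps))

def select_child(children, parent_visits, c=0.1):
    """children: list of (value, same_action_sibling_count) pairs.
    Returns the index of the child with the highest modified UCT score."""
    scores = [modified_uct(v, parent_visits, n, c) for v, n in children]
    return max(range(len(children)), key=scores.__getitem__)
```

With two children of similar value, the one whose action type has not yet been tried at this node dominates the selection, which is exactly the horizontal-diversity pressure the penalty is meant to create.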
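The normalized-return metric is a simple rescaling, shown here as a one-liner (the function name is mine; the formula is the paper's): 0 means planning with the CWM does no better than a random policy, and 1 matches planning with the true model.

```python
def normalized_return(r_cwm, r_rand, r_true):
    """R(CWM) = (R(pi_CWM) - R(pi_rand)) / (R(pi_true) - R(pi_rand)).
    Assumes r_true > r_rand, i.e. oracle planning beats random."""
    return (r_cwm - r_rand) / (r_true - r_rand)
```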
Key Findings
- APPS Competition split (strict accuracy, pass@20, Llama 3 70B): GIF-MCTS reaches 28.3±1.4%, outperforming Zero-shot CoT (23.2±1.3%), WorldCoder (25.1±1.4%), and prior baselines (Parsel 25.5% with a larger evaluation budget; CodeRL 17.9%).
- CWMB (18 environments):
  - Llama 3 70B, budget of 50 calls (3 seeds): Discrete: Accuracy 0.84±0.03, Normalized Return 0.76±0.03; Continuous: Accuracy 0.35±0.03, Return 0.22±0.01. WorldCoder trails (Discrete: 0.79±0.04, 0.60±0.04; Continuous: 0.32±0.03, 0.19±0.01).
  - GPT-4 Turbo, budget of 10 calls (1 seed): Discrete: Accuracy 0.91±0.08, Return 0.81±0.06; Continuous: Accuracy 0.40±0.03, Return 0.26±0.01. WorldCoder trails (Discrete: 0.87±0.09, 0.79±0.06; Continuous: 0.24±0.06, 0.20±0.01).
- RTFM (simplified) with MCTS planning:
  - Llama 3 70B, 50 calls: GIF-MCTS Accuracy 0.58±0.02 vs. WorldCoder 0.23±0.01; Return -0.11±0.12 for both.
  - GPT-4 Turbo, 10 calls: Accuracy 0.71±0.01 vs. 0.33±0.01; Return 0.31±0.19 vs. 0.22±0.18.
  - GPT-4 Turbo, 50 calls: Accuracy 1.00±0.00 and Return 1.00±0.00 vs. WorldCoder 0.64±0.02 and -0.06±0.12.
- Planning efficiency: synthesized CWMs make model-based planning 4-7 orders of magnitude faster than querying an LLM as the simulator (e.g., per-step prediction on CartPole: 2.2 s with the LLM vs. 0.00005 s with the CWM; Humanoid: 146.7 s vs. 0.0001 s), while achieving near-oracle performance when the CWM is accurate.
- Sample efficiency: small validation datasets (often 10 trajectories of up to 100 steps each) suffice to synthesize usable models; there is evidence that language descriptions help the models generalize beyond the observed transitions (Appendix F).
- Ablations: removing any of the three actions hurts performance, with removing Fix hurting most. Generate tends to dominate the optimal paths early in the search; Improve contributes less but still helps. Qualitative tree analysis shows a bias toward Generate on the best paths and a large role for Fix in the harder continuous domains.
- Comparison with offline RL (CQL): CWMs compare favorably on discrete tasks; CQL is stronger on complex continuous physics tasks but overfits with small datasets, and CWMs remain competitive on some continuous tasks (e.g., Pendulum, Reacher).
Discussion
The results demonstrate that guiding LLMs with GIF-MCTS can synthesize accurate, executable world models that support fast and effective planning. By framing code generation as a tree search over partial programs with specialized generate/improve/fix actions and unit-test-like feedback, the method explores a diverse solution space and repairs promising but buggy code paths, outperforming Thompson Sampling refinement (WorldCoder) across APPS, CWMB, and RTFM. The findings answer the core question by showing that code-based world models can deliver large computational and sample-efficiency gains while preserving interpretability, and that a tailored MCTS strategy is key to reliable long-form code synthesis. The observed accuracy–return gap highlights the importance of correctly modeling terminal and reward-relevant edge cases; sparse coverage of success states in offline data can degrade planning despite reasonable average accuracy. This underscores the need for improved data coverage or complementary CWMs specializing in rare events. Compared to offline RL, CWMs shine in low-data regimes and discrete tasks where environment logic aligns well with code, while continuous, complex physics present challenges without external simulators. Overall, the approach provides a scalable path to language-informed, interpretable model-based agents, with performance that tightens toward oracle planning as model accuracy improves.
Conclusion
The paper introduces Code World Models (CWMs) and GIF-MCTS, a code-synthesis strategy that iteratively generates, improves, and fixes Python simulators using MCTS guided by environment feedback. It also contributes the CWMB benchmark of 18 environments with language descriptions and curated trajectories. Empirically, GIF-MCTS achieves state-of-the-art results on the APPS Competition split, outperforms WorldCoder on CWMB and RTFM, and enables model-based planning with massive inference speedups and strong sample efficiency. The work establishes code as a practical, interpretable, and efficient medium for world modeling when combined with LLMs. Future directions include handling stochasticity and partial observability, improving coverage of terminal and reward-related edge cases, composing or retrieving multiple CWMs, using external tools (e.g., physics engines) within generated code, and enabling self-generated test cases for broader code synthesis without provided unit tests.
Limitations
Assumptions include deterministic, fully observable environments and availability of sufficiently informative natural-language environment descriptions, which may be unrealistic for image-based or partially observable settings. Evaluation relies on unit tests (validation transitions); tasks lacking test cases would require self-generated tests. CWMs may be rigid under changing dynamics, potentially requiring on-the-fly rewriting or modularization. Performance degrades on complex physics environments without access to simulators; incorporating external libraries or tools may be necessary. The method’s effectiveness depends on the quality and coverage of the offline dataset, particularly for terminal and reward-critical states.