World Models: Recurrent World Models Facilitate Policy Evolution

Source

Raw Markdown: paper_world-models-2018.md
PDF: paper_world-models-2018.pdf
Preprint: https://arxiv.org/abs/1803.10122
Official interactive article: https://worldmodels.github.io/
NeurIPS paper: https://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution
Official code: https://github.com/hardmaru/WorldModelsExperiments
Article source: https://github.com/worldmodels/worldmodels.github.io

Credibility

This is a 2018 NeurIPS paper by David Ha and Juergen Schmidhuber. It is more than a year old and is not current SOTA, but it remains a canonical historical anchor for modern visual model-based RL and for action-conditioned world-model terminology. Use it as landmark background for the VAE + MDN-RNN + controller decomposition, learned latent rollouts, and model-exploitation caveats, then compare current claims against newer JEPA, Dreamer-style, latent-diffusion, and foundation-model world-model sources.

Core Claim

A compact controller can solve visual control tasks when it acts through a learned latent world model: a VAE compresses image observations into latent codes, an MDN-RNN predicts future latent codes conditioned on action and recurrent state, and a small controller uses the latent and recurrent state to choose actions.
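To make "compact controller" concrete, here is a minimal sketch of a linear controller that maps the concatenated latent code and recurrent state to an action vector. The function names `linear_controller` and `parameter_count` are hypothetical, and the dimensions in the usage note (32 for z, 256 for h, 3 actions) are the CarRacing sizes as reported in the paper, not something stated on this page.

```python
from typing import List, Sequence


def linear_controller(
    z: Sequence[float],
    h: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> List[float]:
    """Compute a = W [z; h] + b, one output per action dimension."""
    features = list(z) + list(h)  # concatenate latent code and recurrent state
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def parameter_count(z_dim: int, h_dim: int, action_dim: int) -> int:
    """Number of trainable scalars in the linear controller (weights plus biases)."""
    return (z_dim + h_dim) * action_dim + action_dim
```

With the CarRacing sizes, `parameter_count(32, 256, 3)` gives 867, which is small enough that a gradient-free evolutionary search over the controller weights is practical.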

Key Contributions

Popularized the V/M/C decomposition: vision encoder, memory/dynamics model, and small controller.
Trained the dynamics model from random-policy rollouts without reward supervision, then optimized the controller with CMA-ES.
Demonstrated that the controller can use recurrent predictive state directly for reflex-like action selection.
Showed that an agent trained inside a learned latent “dream” environment can transfer back to the actual VizDoom environment.
Made world-model exploitation concrete: policies can discover adversarial behavior that works in the learned simulator but fails in the real environment.
Used MDN-RNN temperature as a practical uncertainty knob to reduce reward hacking against an imperfect model.
Sketched an iterative training loop where model prediction loss can drive curiosity and data collection in unfamiliar parts of the environment.
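The controller-optimization step in the list above can be illustrated with a gradient-free search over a flat parameter vector. The sketch below is a simplified evolution strategy with a fixed step size, deliberately not CMA-ES: it has no covariance or step-size adaptation. The names `evolve` and `toy_fitness` are hypothetical, and the toy fitness stands in for the cumulative rollout reward that a real run would compute.

```python
import random
from typing import Callable, List, Tuple


def evolve(
    fitness: Callable[[List[float]], float],
    dim: int,
    generations: int = 20,
    population: int = 16,
    elite: int = 4,
    sigma: float = 0.3,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Simplified evolution strategy (stand-in for CMA-ES, not equivalent to it)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    best_params, best_fitness = list(mean), fitness(mean)
    for _ in range(generations):
        # Sample candidates around the current mean with isotropic Gaussian noise.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        scored = sorted(
            ((fitness(c), c) for c in candidates),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # Move the mean to the average of the top-scoring candidates.
        top = [c for _, c in scored[:elite]]
        mean = [sum(c[i] for c in top) / elite for i in range(dim)]
        # Track the best candidate seen so far.
        if scored[0][0] > best_fitness:
            best_fitness, best_params = scored[0][0], list(scored[0][1])
    return best_params, best_fitness


def toy_fitness(params: List[float]) -> float:
    """Placeholder for cumulative reward: peaks at 0 when every parameter is 1."""
    return -sum((p - 1.0) ** 2 for p in params)
```

Calling `evolve(toy_fitness, dim=3)` returns the best parameter vector found and its fitness. In the setting described here, `fitness` would instead roll out the controller in the real or dreamed environment and return the episode reward.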

Method Notes

The central dynamics interface is:

P(z_{t+1} | a_t, z_t, h_t)

For Doom, the model also predicts a binary termination event d_{t+1}:

P(z_{t+1}, d_{t+1} | a_t, z_t, h_t)
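The page does not spell out the form of this distribution. In a mixture density network it is typically a Gaussian mixture whose parameters are emitted by the RNN at each step. A sketch under that assumption, where the number of components K and the symbols \pi_k, \mu_k, \sigma_k are introduced here for illustration (implementations differ on whether the mixture is taken over the whole latent vector or separately per latent dimension):

```latex
P(z_{t+1} \mid a_t, z_t, h_t)
  = \sum_{k=1}^{K} \pi_k \,
    \mathcal{N}\!\left(z_{t+1};\ \mu_k,\ \operatorname{diag}(\sigma_k^2)\right),
\qquad
\pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1,
```

where each of \pi_k, \mu_k, and \sigma_k is a function of (a_t, z_t, h_t).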

flowchart LR
  Obs["image observation"]
  VAE["V: VAE encoder"]
  Z["latent code z_t"]
  Action["action a_t"]
  RNN["M: MDN-RNN"]
  H["recurrent state h_t"]
  C["C: small controller"]
  Env["environment or learned dream"]

  Obs --> VAE --> Z
  Z --> C
  H --> C
  C --> Action --> Env --> Obs
  Z --> RNN
  Action --> RNN
  H --> RNN --> H

For this wiki, the important interface is not the exact 2018 architecture. It is the action-conditioned latent transition contract: observation history + action -> future latent state, plus a controller or planner that consumes that state before acting.
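The transition contract described above can be written down as a pair of callables and a rollout loop that works identically for a real environment wrapper or a learned dream. This is a minimal sketch: `LatentState`, `rollout`, `toy_transition`, and `toy_controller` are hypothetical names, and the toy dynamics are placeholders rather than anything from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Vector = List[float]


@dataclass
class LatentState:
    z: Vector  # latent code z_t
    h: Vector  # recurrent state h_t


# Contract: (state, action) -> (next state, episode finished?)
Transition = Callable[[LatentState, Vector], Tuple[LatentState, bool]]
Controller = Callable[[LatentState], Vector]


def rollout(
    state: LatentState,
    transition: Transition,
    controller: Controller,
    max_steps: int,
) -> List[LatentState]:
    """Let the controller act through the transition model; return visited states."""
    trajectory = [state]
    for _ in range(max_steps):
        action = controller(state)  # the controller sees z_t and h_t before acting
        state, done = transition(state, action)
        trajectory.append(state)
        if done:
            break
    return trajectory


def toy_transition(state: LatentState, action: Vector) -> Tuple[LatentState, bool]:
    """Placeholder dynamics: the action shifts z, and h smooths past z values."""
    next_z = [z + a for z, a in zip(state.z, action)]
    next_h = [0.5 * h + 0.5 * z for h, z in zip(state.h, state.z)]
    return LatentState(next_z, next_h), abs(next_z[0]) >= 3.0


def toy_controller(state: LatentState) -> Vector:
    """Placeholder policy: always push the first latent dimension up by one."""
    return [1.0]
```

Because `rollout` depends only on the two callables, swapping the toy transition for a learned model or a real environment wrapper leaves the controller code unchanged, which is the property this wiki cares about.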

Evidence And Results

In CarRacing-v0, the full VAE + MDN-RNN world model with a linear controller scores 906 ± 21 averaged over 100 random trials, compared with 632 ± 251 for the V-only linear controller and 788 ± 141 for the V-only controller with a hidden layer.
In VizDoom TakeCover, the paper trains the controller entirely in the learned latent environment and reports that it transfers back to the actual environment, surviving 1092 ± 556 time steps at the selected temperature (τ = 1.15).
The paper reports that low-temperature deterministic dreams can create policies that exploit model errors and fail in the actual environment, while moderate stochasticity improves transfer.
The experiments use visual game trajectories, explicit actions, and rollout rewards, so they are strong evidence for a narrow visual-control setting rather than a general digital-system world model.
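The temperature effect noted above can be sketched for a one-dimensional Gaussian mixture. This assumes one common convention: mixture logits are divided by the temperature and the sampled noise is scaled by its square root. The function name `sample_mixture` and its parameters are hypothetical and are not taken from the official code.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def sample_mixture(
    logits: Sequence[float],
    means: Sequence[float],
    log_stds: Sequence[float],
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[float, List[float]]:
    """Draw one sample from a 1-D Gaussian mixture at the given temperature.

    Returns the sample and the temperature-adjusted mixture probabilities.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    # Dividing logits by the temperature sharpens (tau < 1) or flattens (tau > 1)
    # the mixture weights.
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    k = rng.choices(range(len(probs)), weights=probs)[0]
    # Scaling the noise by sqrt(tau) multiplies the component variance by tau.
    noise = rng.gauss(0.0, 1.0) * math.sqrt(temperature)
    return means[k] + math.exp(log_stds[k]) * noise, probs
```

As the temperature approaches zero the sampler collapses to the mean of the most likely component, which gives the near-deterministic dream that a controller can exploit. Raising it flattens the mixture weights and widens each component, making the dreamed environment harder to game but also noisier to learn in.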

Limitations

The architecture is now historical: VAE + MDN-RNN + CMA-ES is not the current frontier for large-scale world models.
Training is staged rather than end-to-end, and the VAE may preserve visually salient but task-irrelevant detail while dropping task-relevant detail.
The controller can exploit the learned simulator, especially when the dynamics model is too deterministic or out of distribution.
The evidence is from OpenAI Gym game environments, not multivariate operational time series, irregular event streams, robotics manipulation at scale, or real digital-system interventions.
The work does not solve long-horizon hierarchical planning, large memory capacity, continual update, or cross-system transfer.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Control and counterfactuals	partially closes	Learns latent next-state dynamics conditioned on explicit actions and uses imagined rollouts to train or evaluate a controller.	Visual Gym environments only; no digital-system actions, operator interventions, confounding analysis, or real operational telemetry.
Multi-modal future distributions	partially closes	Uses an MDN-RNN to model a distribution over next latent states and exposes temperature as an uncertainty/exploitability control.	No calibrated decision-facing future distributions for numeric time-series systems.
Representation quality: semantic state vs dense detail	warning	Shows that compressed visual latents can support control, but also that VAE reconstruction can preserve irrelevant detail and miss task-relevant detail.	Needs task-conditioned or representation-space objectives that retain intervention-relevant state under scale.
Benchmarks: what level of modeling is tested	warning	The CarRacing and VizDoom tests include action-conditioned rollouts and transfer from learned simulator to real environment.	Benchmarks are small game environments and can be exploited by simulator-specific policies.

Links Into The Wiki

Open Questions

Which modern world-model architectures preserve the useful action-conditioned interface while avoiding 2018-era simulator exploitation?
How should uncertainty be represented so a controller can avoid brittle high-reward hallucinations without making the learned environment too noisy to plan in?
Can the V/M/C decomposition be transferred to observability or digital-system time series as encoder + action-conditioned latent dynamics + intervention-ranking controller?
What benchmark would test the same “learn inside the model, transfer outside it” claim for operator actions, rollbacks, autoscaling, or remediation choices?
Can prediction-loss-driven curiosity become a safe curriculum signal without over-sampling corrupt, adversarial, or irrelevant high-surprise states?
