World Models: Recurrent World Models Facilitate Policy Evolution

Source

Credibility

This is a NeurIPS 2018 paper by David Ha and Juergen Schmidhuber. It is no longer state of the art, but it remains a canonical historical anchor for modern visual model-based RL and for the vocabulary of action-conditioned world models. Use it as landmark background for the VAE + MDN-RNN + controller decomposition, learned latent rollouts, and model-exploitation caveats, then compare current claims against newer JEPA, Dreamer-style, latent-diffusion, and foundation-model world-model sources.

Core Claim

A compact controller can solve visual control tasks when it acts through a learned latent world model: a VAE compresses image observations into latent codes, an MDN-RNN predicts future latent codes conditioned on action and recurrent state, and a small controller uses the latent and recurrent state to choose actions.
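
To make "small controller" concrete, here is a minimal sketch of a single linear layer that maps the concatenated latent code and recurrent state to an action. The dimensions (32, 256, 3), the `tanh` squash, and all function names are illustrative assumptions, not values taken from this note.

```python
import math
import random

# Illustrative sizes only (assumptions, not stated in this note).
Z_DIM = 32    # latent code z_t
H_DIM = 256   # recurrent state h_t
A_DIM = 3     # action vector, e.g. steer / gas / brake


def num_params(z_dim, h_dim, a_dim):
    """Weights plus biases of one linear layer acting on [z, h]."""
    return (z_dim + h_dim) * a_dim + a_dim


def init_controller(z_dim, h_dim, a_dim, rng):
    """Small random weights and zero biases (hypothetical initialisation)."""
    n_in = z_dim + h_dim
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(a_dim)]
    biases = [0.0] * a_dim
    return weights, biases


def act(weights, biases, z, h):
    """a_t = tanh(W [z_t, h_t] + b); the tanh bound is an assumption."""
    x = list(z) + list(h)
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    W, b = init_controller(Z_DIM, H_DIM, A_DIM, rng)
    print(num_params(Z_DIM, H_DIM, A_DIM), act(W, b, [0.1] * Z_DIM, [0.0] * H_DIM))
```

With these assumed sizes the controller has (32 + 256) * 3 + 3 = 867 parameters, which is small enough for a black-box optimizer to search directly.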

Key Contributions

  • Popularized the V/M/C decomposition: vision encoder, memory/dynamics model, and small controller.
  • Trained the dynamics model from random-policy rollouts without reward supervision, then optimized the controller with CMA-ES.
  • Demonstrated that the controller can use recurrent predictive state directly for reflex-like action selection.
  • Showed an agent trained inside a learned latent “dream” environment can transfer back to the actual VizDoom environment.
  • Made world-model exploitation concrete: policies can discover adversarial behavior that works in the learned simulator but fails in the real environment.
  • Used MDN-RNN temperature as a practical uncertainty knob to reduce reward hacking against an imperfect model.
  • Sketched an iterative training loop where model prediction loss can drive curiosity and data collection in unfamiliar parts of the environment.
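
The temperature knob in the list above can be illustrated with a one-dimensional Gaussian mixture. This is a sketch under assumptions: it divides mixture logits by the temperature and scales each standard deviation by its square root, which is one common convention and may not match the paper's exact implementation. `sample_mdn` and its arguments are hypothetical names.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mdn(logits, means, sigmas, temperature, rng):
    """Draw one scalar from a Gaussian mixture with temperature scaling.

    Assumed convention: logits are divided by the temperature and each
    sigma is multiplied by sqrt(temperature). Low temperature gives
    near-deterministic samples; high temperature gives noisier ones.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = softmax([x / temperature for x in logits])
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return means[k] + sigmas[k] * math.sqrt(temperature) * rng.gauss(0.0, 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    logits, means, sigmas = [0.0, 1.0, 3.0], [-1.0, 0.0, 2.0], [0.5, 0.5, 0.5]
    for tau in (0.1, 1.0, 1.5):
        draws = [sample_mdn(logits, means, sigmas, tau, rng) for _ in range(5)]
        print(tau, [round(d, 2) for d in draws])
```

At very low temperature the samples collapse onto the most likely component's mean, which is the deterministic regime where a controller can most easily exploit model errors.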

Method Notes

The central dynamics interface is:

$$P(z_{t+1} \mid a_t, z_t, h_t)$$

For Doom, the model also predicts the termination event $d_{t+1}$:

$$P(z_{t+1}, d_{t+1} \mid a_t, z_t, h_t)$$

flowchart LR
  Obs["image observation"]
  VAE["V: VAE encoder"]
  Z["latent code z_t"]
  Action["action a_t"]
  RNN["M: MDN-RNN"]
  H["recurrent state h_t"]
  C["C: small controller"]
  Env["environment or learned dream"]

  Obs --> VAE --> Z
  Z --> C
  H --> C
  C --> Action --> Env --> Obs
  Z --> RNN
  Action --> RNN
  H --> RNN --> H

For this wiki, the important interface is not the exact 2018 architecture. It is the action-conditioned latent transition contract: observation history + action -> future latent state, plus a controller or planner that consumes that state before acting.
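
One way to read that contract is as a small typed interface plus a rollout loop that never touches the real environment. The sketch below is an assumption-laden illustration: `LatentDynamics`, `ToyDynamics`, and `dream_rollout` are hypothetical names, and the toy model is a hand-written stand-in, not an MDN-RNN.

```python
import random
from typing import Callable, List, Protocol, Sequence, Tuple

Vector = Sequence[float]


class LatentDynamics(Protocol):
    """Contract: (z_t, h_t, a_t) -> (z_{t+1}, h_{t+1}, done)."""

    def step(self, z: Vector, h: Vector, action: Vector) -> Tuple[List[float], List[float], bool]:
        ...


class ToyDynamics:
    """Stand-in model for illustration only; NOT a learned MDN-RNN."""

    def __init__(self, noise: float, seed: int = 0):
        self.noise = noise
        self.rng = random.Random(seed)

    def step(self, z, h, action):
        # The recurrent state slowly tracks the latent code.
        h_next = [0.9 * hi + 0.1 * zi for hi, zi in zip(h, z)]
        # The action shifts every latent dimension; noise models uncertainty.
        z_next = [zi + action[0] + self.noise * self.rng.gauss(0.0, 1.0) for zi in z]
        # The episode terminates when the first latent leaves a safe band.
        done = abs(z_next[0]) > 5.0
        return z_next, h_next, done


def dream_rollout(model: LatentDynamics, policy: Callable, z0: Vector, h0: Vector, max_steps: int):
    """Roll the policy forward inside the model only."""
    z, h = list(z0), list(h0)
    trajectory = []
    for _ in range(max_steps):
        action = policy(z, h)  # the controller consumes the state before acting
        z, h, done = model.step(z, h, action)
        trajectory.append((z, action, done))
        if done:
            break
    return trajectory


if __name__ == "__main__":
    stabilise = lambda z, h: [-0.5 * z[0]]
    steps = dream_rollout(ToyDynamics(noise=0.0), stabilise, [1.0, 0.0], [0.0, 0.0], 10)
    print(len(steps), steps[-1])
```

Any dynamics model that satisfies `step` can be swapped in, which is the point of treating the interface, rather than the 2018 architecture, as the reusable part.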

Evidence And Results

  • In CarRacing-v0, the full VAE + MDN-RNN world model with a linear controller reports 906 +/- 21 over 100 random trials, compared with 632 +/- 251 for the V-only linear controller and 788 +/- 141 for the V-only controller with a hidden layer.
  • In VizDoom TakeCover, the paper trains the controller in the learned latent environment and reports transfer back to the actual environment with 1092 +/- 556 time steps survived at the selected temperature (τ = 1.15).
  • The paper reports that low-temperature deterministic dreams can create policies that exploit model errors and fail in the actual environment, while moderate stochasticity improves transfer.
  • The experiments use visual game trajectories, explicit actions, and rollout rewards, so they are strong evidence for a narrow visual-control setting rather than a general digital-system world model.
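
The "mean +/- spread over trials" figures and the dream-versus-actual comparison above can be computed with a few lines. This is a minimal sketch: the function names are hypothetical, the use of population standard deviation is an assumption about how the spread is computed, and the scores in the usage example are made-up placeholders, not results from the paper.

```python
import statistics


def summarize(scores):
    """Return (mean, population standard deviation) of per-trial scores."""
    return statistics.mean(scores), statistics.pstdev(scores)


def transfer_report(dream_scores, actual_scores):
    """Compare scores earned inside the learned model with real-environment scores.

    A large positive gap (dream mean far above actual mean) is a hint that
    the policy is exploiting model errors rather than solving the task.
    """
    dream_mean, dream_std = summarize(dream_scores)
    actual_mean, actual_std = summarize(actual_scores)
    return {
        "dream": (dream_mean, dream_std),
        "actual": (actual_mean, actual_std),
        "gap": dream_mean - actual_mean,
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    report = transfer_report([10.0, 12.0, 14.0], [4.0, 6.0, 8.0])
    for name in ("dream", "actual"):
        mean, std = report[name]
        print(f"{name}: {mean:.1f} +/- {std:.1f}")
    print("gap:", report["gap"])
```

Tracking the gap per temperature setting, rather than the dream score alone, is one simple guard against selecting a policy that only works inside the learned simulator.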

Limitations

  • The architecture is now historical: VAE + MDN-RNN + CMA-ES is not the current frontier for large-scale world models.
  • Training is staged rather than end-to-end, and the VAE may preserve visually salient but task-irrelevant detail while dropping task-relevant detail.
  • The controller can exploit the learned simulator, especially when the dynamics model is too deterministic or out of distribution.
  • The evidence is from OpenAI Gym game environments, not multivariate operational time series, irregular event streams, robotics manipulation at scale, or real digital-system interventions.
  • The work does not solve long-horizon hierarchical planning, large memory capacity, continual update, or cross-system transfer.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and counterfactuals | partially closes | Learns latent next-state dynamics conditioned on explicit actions and uses imagined rollouts to train or evaluate a controller. | Visual Gym environments only; no digital-system actions, operator interventions, confounding analysis, or real operational telemetry. |
| Multi-modal future distributions | partially closes | Uses an MDN-RNN to model a distribution over next latent states and exposes temperature as an uncertainty/exploitability control. | No calibrated decision-facing future distributions for numeric time-series systems. |
| Representation quality: semantic state vs dense detail | warning | Shows that compressed visual latents can support control, but also that VAE reconstruction can preserve irrelevant detail and miss task-relevant detail. | Needs task-conditioned or representation-space objectives that retain intervention-relevant state under scale. |
| Benchmarks: what level of modeling is tested | warning | The CarRacing and VizDoom tests include action-conditioned rollouts and transfer from learned simulator to real environment. | Benchmarks are small game environments and can be exploited by simulator-specific policies. |

Open Questions

  • Which modern world-model architectures preserve the useful action-conditioned interface while avoiding 2018-era simulator exploitation?
  • How should uncertainty be represented so a controller can avoid brittle high-reward hallucinations without making the learned environment too noisy to plan in?
  • Can the V/M/C decomposition be transferred to observability or digital-system time series as encoder + action-conditioned latent dynamics + intervention-ranking controller?
  • What benchmark would test the same “learn inside the model, transfer outside it” claim for operator actions, rollbacks, autoscaling, or remediation choices?
  • Can prediction-loss-driven curiosity become a safe curriculum signal without over-sampling corrupt, adversarial, or irrelevant high-surprise states?