Genie: Generative Interactive Environments

Source

Raw Markdown: paper_genie-2024.md
PDF: paper_genie-2024.pdf
Preprint: https://arxiv.org/abs/2402.15391
ICML / PMLR: https://proceedings.mlr.press/v235/bruce24a.html
OpenReview: https://openreview.net/forum?id=bJbSbJskOS
Official DeepMind publication: https://deepmind.google/research/publications/genie-generative-interactive-environments/
Official project page: https://sites.google.com/view/genie-2024/

Credibility

This is a 2024 Google DeepMind paper that was accepted at ICML 2024 as an oral. As of 2026-05-25 it is more than a year old and should not be treated as current SOTA for interactive world generation, since later Genie-family systems and other world-model work exist. It remains important for this wiki because it is a credible, peer-reviewed anchor for learning an action-controllable visual world model from unlabeled image/video trajectories without ground-truth actions.

Core Claim

Genie learns a playable, action-conditioned visual world model from video-only data by inferring a small discrete latent-action space between frames, then using those latent actions to condition next-frame generation.

The paper-supported claim is narrower than “solved generalist agents”: Genie demonstrates controllable platformer-like image/video trajectory generation, a robotics proof of concept, and a latent-action imitation experiment. The authors do not release the large model, and the paper neither solves long-horizon interactive simulation nor evaluates numeric time series or operational interventions.

Key Contributions

Introduces generative interactive environments: generated environments that can be stepped through with user- or agent-provided actions rather than only producing passive video.
Learns a discrete latent-action codebook from unlabeled image/video trajectories, with no ground-truth action annotations in the main training data.
Combines a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model using ST-Transformer blocks.
Scales the main Platformers model to about 11B parameters, trained on a filtered corpus of 6.8M sixteen-second clips (about 30k hours) of 2D platformer videos.
Trains a smaller robotics model on RT-1-style robot videos while ignoring action labels, showing that consistent latent control inputs can emerge outside platformer games.
Tests whether inferred latent actions can support imitation from unseen videos in CoinRun, using a small action-labeled set to map latent actions back to real environment actions.
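
As a quick sanity check on the corpus figures in the list above, the clip count and clip length reproduce the quoted duration. This is only arithmetic on the rounded numbers given on this page, so the result is approximate; the variable and function names are illustrative.

```python
# Rounded figures quoted on this page; names are illustrative.
NUM_CLIPS = 6_800_000   # filtered platformer clips
CLIP_SECONDS = 16       # length of each clip in seconds


def corpus_hours(num_clips: int, clip_seconds: int) -> float:
    """Total video duration in hours for equal-length clips."""
    return num_clips * clip_seconds / 3600


hours = corpus_hours(NUM_CLIPS, CLIP_SECONDS)
print(f"{hours:,.0f} hours")  # roughly 30k hours
```

The product lands slightly above 30k hours, which is consistent with the "about 30k hours" figure.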

Method Notes

The separate Latent Action Models topic generalizes the LAM part of this recipe. For this wiki, Genie’s useful abstraction is the action-conditioned latent transition contract:

$$\mathrm{LAM}(x_{\leq t}, x_{t+1}) \to a_{t}$$

$$P(z_{t+1} \mid z_{\leq t}, a_{t})$$

where $x_{t}$ is an image observation, $z_{t}$ is a discrete video-token state, and $a_{t}$ is a learned latent action. The latent action is not a typed real-world control input. It is an inferred discrete code that becomes usable only after a human or downstream adapter maps it to environment controls.
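
A minimal sketch of what "an inferred discrete code" can mean in practice, assuming a vector-quantization style nearest-neighbour lookup into a small codebook. The lookup rule, the 2-D codebook, and the names `quantize_action` and `CODEBOOK` are assumptions of this sketch, not details taken from this page; the continuous vector stands in for whatever an encoder would produce from $(x_{\leq t}, x_{t+1})$.

```python
from typing import Sequence


def squared_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def quantize_action(embedding: Sequence[float],
                    codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry nearest to the embedding.

    The returned index plays the role of the latent action a_t.
    """
    return min(range(len(codebook)),
               key=lambda i: squared_distance(embedding, codebook[i]))


# Hypothetical 2-D codebook with four codes (illustrative values only).
CODEBOOK = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

# Stand-in for a continuous encoder output for (x_{<=t}, x_{t+1}).
continuous = (0.8, 0.3)
a_t = quantize_action(continuous, CODEBOOK)
print(a_t)  # index of the nearest code
```

The index itself carries no meaning such as "move right"; a human or adapter still has to associate each code with an environment control.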

flowchart LR
  Video["unlabeled image/video trajectory"]
  Tokenizer["spatiotemporal video tokenizer"]
  Z["video tokens z_t"]
  LAM["latent action model"]
  A["latent action code a_t"]
  Dynamics["autoregressive dynamics model"]
  Next["next generated observation"]
  User["user or agent action choice"]

  Video --> Tokenizer --> Z
  Video --> LAM --> A
  Z --> Dynamics
  A --> Dynamics --> Next
  User --> A
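
The interactive path through the flowchart above (a user-chosen latent action feeds the dynamics model, which emits the next observation) can be sketched as a rollout loop. Everything here is a toy stand-in: `toy_dynamics` is a deterministic placeholder rather than a learned model, the prompt is assumed to be already tokenized, and `NUM_ACTIONS` and `VOCAB_SIZE` are illustrative values. The 16-frame context limit mirrors the memory limit reported for Genie.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Tokens = Tuple[int, ...]

CONTEXT_FRAMES = 16  # reported memory limit; older frames are dropped
NUM_ACTIONS = 8      # illustrative size of the latent-action codebook
VOCAB_SIZE = 1024    # illustrative size of the video-token vocabulary


def toy_dynamics(context: Sequence[Tokens], action: int) -> Tokens:
    """Placeholder for P(z_{t+1} | z_{<=t}, a_t).

    Shifts every token of the most recent frame by (action + 1).
    A real dynamics model would attend over the whole context.
    """
    last = context[-1]
    return tuple((tok + action + 1) % VOCAB_SIZE for tok in last)


def rollout(prompt: Tokens, actions: Sequence[int]) -> List[Tokens]:
    """Step the toy world once per chosen latent action."""
    context: Deque[Tokens] = deque([prompt], maxlen=CONTEXT_FRAMES)
    generated: List[Tokens] = []
    for action in actions:
        if not 0 <= action < NUM_ACTIONS:
            raise ValueError(f"latent action {action} is outside the codebook")
        next_tokens = toy_dynamics(context, action)
        context.append(next_tokens)  # oldest frame falls out after 16
        generated.append(next_tokens)
    return generated


frames = rollout(prompt=(0, 5, 9), actions=[0, 2, 2])
print(frames[-1])
```

The bounded `deque` makes the memory limit explicit: once more than 16 frames exist, earlier ones can no longer influence generation.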

For time-series terminology, Genie is a model over image/video trajectories. The platformer and robotics signals are observations; learned latent actions are action-like control inputs for generation, but they are not logged human or robot actions.

Evidence And Results

The data pipeline starts from public platformer videos, filters clips for gameplay quality, and keeps 6.8M clips, about 30k hours, after curation.
Scaling experiments show lower training loss as dynamics-model size increases from 40M to 2.7B parameters and as batch size increases for a 2.3B model.
The final Genie model uses a 10.1B-parameter dynamics model; combined with tokenizer and action model, the paper reports about 10.7B parameters and refers to the system as an 11B model.
Qualitative platformer results show controllable generations from out-of-distribution prompt images, including generated images, sketches, and real photos.
The robotics proof-of-concept trains a 2.5B model on action-free robot videos and reports consistent latent actions with semantic-looking effects such as up, down, and left.
The latent-action model ablation favors pixel-input LAMs over token-input LAMs for controllability, even when the token-input version can look competitive on FVD.
The tokenizer ablation favors ST-ViViT over spatial-only ViT and C-ViViT for the reported video-quality and controllability metrics.
The CoinRun imitation experiment shows that latent actions from a frozen Genie LAM can support behavior cloning after mapping latents to real actions with a small labeled set.
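
The latent-to-real mapping step in the CoinRun experiment can be illustrated with a simple majority vote over a small labeled set. This page does not describe the paper's actual mapping procedure, so the majority-vote rule is an assumption of this sketch, and the function name, latent indices, and action names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def fit_latent_to_real(pairs: Iterable[Tuple[int, str]]) -> Dict[int, str]:
    """Map each latent action to the real action it most often co-occurs with."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for latent, real in pairs:
        votes[latent][real] += 1
    return {latent: counts.most_common(1)[0][0]
            for latent, counts in votes.items()}


# Hypothetical labeled set: (inferred latent action, logged real action).
labeled = [
    (0, "right"), (0, "right"), (0, "jump"),
    (3, "left"), (3, "left"),
    (5, "jump"),
]
mapping = fit_latent_to_real(labeled)

# A policy that predicts latent actions can then be executed in the real env.
latent_plan = [0, 3, 5]
real_plan = [mapping[a] for a in latent_plan]
print(real_plan)
```

Latent codes that never appear in the labeled set have no entry in `mapping`, which is one reason the size and coverage of the labeled set matter.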

Author Narrative Context

The official project page frames Genie as a foundation model for playable worlds and emphasizes that one image, sketch, or photo can become an interactive environment. It also frames generated worlds as a possible curriculum for future generalist agents.

The paper supports the first part in a bounded way: it demonstrates action-controllable image/video trajectory generation for platformer-like scenes and a smaller robotics proof of concept. The broader “future generalist agents” claim remains a research direction. The paper itself names the main blockers: hallucinated unrealistic futures, only 16 frames of memory, about 1 FPS interaction, and no release of the large trained model or training data.

Limitations

Genie is not current SOTA as of 2026-05-25; it is a 2024 anchor for a specific latent-action-from-video recipe.
The strongest evidence is qualitative platformer generation and author-reported metrics, not closed-loop planning success in rich environments.
The model has only 16 frames of memory in the reported setup, which limits long-horizon consistency.
The reported interaction rate is about 1 FPS, which is below practical real-time play or control.
The large model checkpoints, main training dataset, and examples from that dataset were not released with the paper or website.
Learned latent actions are not typed real actions, control inputs, or interventions. They need mapping before they can drive a real environment.
The robotics evidence is a proof of concept from action-free videos, not a full robot policy or contact-rich manipulation benchmark.
The source is about image/video trajectories, not numeric time series, graph time series, observability telemetry, or digital operations.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Control and counterfactuals | partially closes outside time series | Genie predicts future image observations conditioned on latent actions inferred from video-only trajectories. | No numeric observations, typed actions, intervention logs, causal controls, or operational telemetry. |
| Data diversity and scaling substrate | partially closes outside time series | Uses a large curated video corpus and reports scaling behavior with model size and batch size. | Evidence is platformer video; no high-dimensional numeric channel scaling or event-stream scaling law. |
| Representation quality: latent action and video tokens | adjacent | Separates video tokens from a learned latent-action codebook and shows pixel-input LAMs improve controllability. | No evidence that the latent state preserves decision-relevant numeric regimes, rare failures, or causal variables. |
| Robotics/action interface | adjacent | The robotics model shows latent actions can emerge from action-free robot videos. | Proof-of-concept only; no policy evaluation, explicit control-input schema, or cross-embodiment benchmark. |
| Benchmarks: what level of modeling is tested? | warning | FVD and PSNR-difference measure fidelity and action effect, while CoinRun tests a small latent-action imitation bridge. | Needs long-horizon, closed-loop, decision-facing, and simulator-exploitation evaluations. |
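
A PSNR-difference style controllability score can be sketched as follows, assuming it compares a generation conditioned on latent actions inferred from the ground-truth video against a generation conditioned on randomly sampled latent actions. That reading of the metric, the flat pixel lists, and the function names are assumptions of this sketch; the pixel values are made up for illustration.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], candidate: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if not reference or len(reference) != len(candidate):
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


def delta_psnr(truth: Sequence[float], gen_inferred: Sequence[float],
               gen_random: Sequence[float]) -> float:
    """PSNR with inferred actions minus PSNR with random actions.

    A larger value means the chosen latent actions change the output more.
    """
    return psnr(truth, gen_inferred) - psnr(truth, gen_random)


# Made-up pixel values for one tiny frame.
truth = [100, 120, 140, 160]
gen_inferred = [101, 119, 141, 159]  # close to the ground truth
gen_random = [110, 110, 150, 150]    # drifts away under random actions

score = delta_psnr(truth, gen_inferred, gen_random)
print(f"{score:.1f} dB")
```

A score near zero would suggest the latent actions barely affect generation, which is why this kind of metric complements a fidelity metric such as FVD.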

Links Into The Wiki

Open Questions

How far do latent actions learned from unlabeled image/video trajectories transfer once the target system has typed actions, delayed effects, failures, or safety constraints?
Can a Genie-like latent-action model become useful for observability or industrial time series, where the action channel should be explicit rather than inferred?
What evaluation would catch simulator exploitation in generated interactive environments before an agent learns brittle policies inside the model?
Which later Genie-family or game-world-model sources should supersede this page for current SOTA claims?
Can the official Tim Rocktäschel X announcement thread be verified from a public source without login-gated X access?
