Genie: Generative Interactive Environments
Source
- Raw Markdown: paper_genie-2024.md
- PDF: paper_genie-2024.pdf
- Preprint: https://arxiv.org/abs/2402.15391
- ICML / PMLR: https://proceedings.mlr.press/v235/bruce24a.html
- OpenReview: https://openreview.net/forum?id=bJbSbJskOS
- Official DeepMind publication: https://deepmind.google/research/publications/genie-generative-interactive-environments/
- Official project page: https://sites.google.com/view/genie-2024/
Credibility
This is a 2024 Google DeepMind paper that was accepted at ICML 2024 as an oral. As of 2026-05-25, it is older than one year and should not be treated as current SOTA for interactive world generation because later Genie-family systems and other world-model work exist. It remains important for this wiki because it is a credible, peer-reviewed anchor for learning an action-controllable visual world model from unlabeled image/video trajectories without ground-truth actions.
Core Claim
Genie learns a playable, action-conditioned visual world model from video-only data by inferring a small discrete latent-action space between frames, then using those latent actions to condition next-frame generation.
The paper-supported claim is narrower than “solved generalist agents”: Genie demonstrates controllable platformer-like image/video trajectory generation, a robotics proof of concept, and a latent-action imitation experiment. It does not release the large model, does not solve long-horizon interactive simulation, and does not evaluate numeric time series or operational interventions.
Key Contributions
- Introduces generative interactive environments: generated environments that can be stepped through with user- or agent-provided actions rather than only producing passive video.
- Learns a discrete latent-action codebook from unlabeled image/video trajectories, with no ground-truth action annotations in the main training data.
- Combines a spatiotemporal video tokenizer, a latent action model, and an autoregressive dynamics model using ST-Transformer blocks.
- Scales the main Platformers model to about 11B parameters, trained on a filtered corpus of 6.8M sixteen-second clips (about 30k hours) of 2D platformer video.
- Trains a smaller robotics model on RT-1-style robot videos while ignoring action labels, showing that consistent latent control inputs can emerge outside platformer games.
- Tests whether inferred latent actions can support imitation from unseen videos in CoinRun, using a small action-labeled set to map latent actions back to real environment actions.
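The discrete latent-action codebook in the list above can be illustrated with a nearest-code lookup. This is a minimal sketch, not the paper's implementation: the function name `quantize_latent_action`, the 8-entry grid codebook, and the 2-D latent space are all hypothetical choices made for illustration, and the squared-L2 nearest-neighbor rule is an assumption about how a continuous encoder output becomes a discrete code.

```python
import numpy as np

def quantize_latent_action(continuous_action, codebook):
    """Return (index, code) of the codebook row nearest to the input.

    Nearest means smallest squared L2 distance. This is an assumed,
    generic vector-quantization rule, not code from the paper.
    """
    distances = np.sum((codebook - continuous_action) ** 2, axis=1)
    index = int(np.argmin(distances))
    return index, codebook[index]

# Hypothetical toy codebook: 8 codes laid out on a 4 x 2 grid.
codebook = np.array([[i, j] for i in range(4) for j in range(2)], dtype=float)

# A continuous encoder output that lies closest to the code [2, 1].
encoder_output = np.array([2.1, 0.8])
index, code = quantize_latent_action(encoder_output, codebook)
```

After quantization, only the integer `index` has to be passed on as the action, which is why a small codebook gives a small, enumerable set of controls.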
Method Notes
Genie’s useful abstraction for this wiki is the action-conditioned latent transition contract. The separate Latent Action Models topic generalizes the LAM part of this recipe:
$$z_t = \mathrm{Tokenizer}(x_{1:t}), \qquad a_t = \mathrm{LAM}(x_{1:t+1}), \qquad z_{t+1} \sim p(z_{t+1} \mid z_{1:t}, a_{1:t})$$
where $x_t$ is an image observation, $z_t$ is a discrete video-token state, and $a_t$ is a learned latent action. The latent action is not a typed real-world control input. It is an inferred discrete code that becomes usable only after a human or downstream adapter maps it to environment controls.
flowchart LR
    Video["unlabeled image/video trajectory"]
    Tokenizer["spatiotemporal video tokenizer"]
    Z["video tokens z_t"]
    LAM["latent action model"]
    A["latent action code a_t"]
    Dynamics["autoregressive dynamics model"]
    Next["next generated observation"]
    User["user or agent action choice"]
    Video --> Tokenizer --> Z
    Video --> LAM --> A
    Z --> Dynamics
    A --> Dynamics --> Next
    User --> A
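The transition contract and diagram above can be read as a small step interface: tokenize a prompt frame once, then repeatedly feed a chosen latent action plus the token history to a dynamics function. The sketch below is only an interface illustration under stated assumptions. `LatentWorldModel`, `toy_tokenize`, and `toy_dynamics` are hypothetical names, the toy "frame" is a single integer position, and the 8-action vocabulary is an illustrative default. Decoding tokens back to pixels is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

TokenState = Tuple[int, ...]  # stand-in for the discrete video tokens z_t
LatentAction = int            # stand-in for the latent action code a_t

@dataclass
class LatentWorldModel:
    """Hypothetical wrapper around a tokenizer and a dynamics function."""
    tokenize: Callable[[object], TokenState]
    dynamics: Callable[[List[TokenState], List[LatentAction]], TokenState]
    num_actions: int = 8  # assumed codebook size, for illustration only
    tokens: List[TokenState] = field(default_factory=list)
    actions: List[LatentAction] = field(default_factory=list)

    def reset(self, prompt_frame) -> TokenState:
        """Start a rollout from a single prompt frame."""
        self.tokens = [self.tokenize(prompt_frame)]
        self.actions = []
        return self.tokens[-1]

    def step(self, action: LatentAction) -> TokenState:
        """Append the chosen latent action and predict the next token state."""
        if not 0 <= action < self.num_actions:
            raise ValueError(f"latent action {action} is outside the codebook")
        self.actions.append(action)
        next_tokens = self.dynamics(self.tokens, self.actions)
        self.tokens.append(next_tokens)
        return next_tokens

def toy_tokenize(frame) -> TokenState:
    # Toy stand-in: the frame is just an integer position.
    return (int(frame),)

def toy_dynamics(tokens, actions) -> TokenState:
    # Toy stand-in: code 0 moves left, code 1 moves right, other codes do nothing.
    position = tokens[-1][0]
    delta = {0: -1, 1: 1}.get(actions[-1], 0)
    return (position + delta,)

model = LatentWorldModel(tokenize=toy_tokenize, dynamics=toy_dynamics)
model.reset(prompt_frame=5)
trajectory = [model.step(a) for a in [1, 1, 0, 7]]
```

The point of the design is that the caller sees only integer codes. Which code means "left" or "right" is something a human or adapter has to discover, which matches the note above that latent actions are not typed controls.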
For time-series terminology, Genie is a model over image/video trajectories. The platformer and robotics signals are observations; learned latent actions are action-like control inputs for generation, but they are not logged human or robot actions.
Evidence And Results
- The data pipeline starts from public platformer videos, filters clips for gameplay quality, and keeps 6.8M clips, about 30k hours, after curation.
- Scaling experiments show improved training loss as dynamics-model size increases from 40M to 2.7B parameters and as batch size increases for a 2.3B model.
- The final Genie model uses a 10.1B-parameter dynamics model; combined with tokenizer and action model, the paper reports about 10.7B parameters and refers to the system as an 11B model.
- Qualitative platformer results show controllable generations from out-of-distribution prompt images, including generated images, sketches, and real photos.
- The robotics proof-of-concept trains a 2.5B model on action-free robot videos and reports consistent latent actions with semantic-looking effects such as up, down, and left.
- The latent-action model ablation favors pixel-input LAMs over token-input LAMs for controllability, even when the token-input version can look competitive on FVD.
- The tokenizer ablation favors ST-ViViT over spatial-only ViT and C-ViViT for the reported video-quality and controllability metrics.
- The CoinRun imitation experiment shows that latent actions from a frozen Genie LAM can support behavior cloning after mapping latents to real actions with a small labeled set.
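The latent-to-real mapping step in the CoinRun item above can be sketched as a lookup table fitted from a small labeled set. This page does not describe the paper's exact mapping procedure, so the majority-vote rule here is an assumption, and `fit_latent_to_real`, the action names, and the example pairs are hypothetical.

```python
from collections import Counter, defaultdict

def fit_latent_to_real(labeled_pairs):
    """Map each latent action to the real action it co-occurs with most often.

    labeled_pairs is an iterable of (latent_action, real_action) tuples taken
    from a small action-labeled dataset. Majority vote is an assumed rule.
    """
    counts = defaultdict(Counter)
    for latent, real in labeled_pairs:
        counts[latent][real] += 1
    return {latent: counter.most_common(1)[0][0] for latent, counter in counts.items()}

# Hypothetical labeled set: latent codes paired with real environment actions.
labeled_pairs = [
    (0, "left"), (0, "left"), (0, "jump"),
    (1, "right"), (1, "right"),
    (2, "jump"),
]
mapping = fit_latent_to_real(labeled_pairs)

# A policy trained in latent-action space emits latent codes, which the
# lookup table then translates into real environment actions.
policy_latents = [1, 1, 2, 0]
real_actions = [mapping[a] for a in policy_latents]
```

A table like this only works if each latent code has a consistent meaning across frames, which is the property the imitation experiment is probing.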
Author Narrative Context
The official project page frames Genie as a foundation model for playable worlds and emphasizes that one image, sketch, or photo can become an interactive environment. It also frames generated worlds as a possible curriculum for future generalist agents.
The paper supports the first part in a bounded way: it demonstrates action-controllable image/video trajectory generation for platformer-like scenes and a smaller robotics proof of concept. The broader “future generalist agents” claim remains a research direction. The paper itself names the main blockers: hallucinated unrealistic futures, only 16 frames of memory, about 1 FPS interaction, and no release of the large trained model or training data.
Limitations
- Genie is not current SOTA as of 2026-05-25; it is a 2024 anchor for a specific latent-action-from-video recipe.
- The strongest evidence is qualitative platformer generation and author-reported metrics, not closed-loop planning success in rich environments.
- The model has only 16 frames of memory in the reported setup, which limits long-horizon consistency.
- The reported interaction rate is about 1 FPS, which is below practical real-time play or control.
- The large model checkpoints, main training dataset, and examples from that dataset were not released with the paper or website.
- Learned latent actions are not typed real actions, control inputs, or interventions. They need mapping before they can drive a real environment.
- The robotics evidence is a proof of concept from action-free videos, not a full robot policy or contact-rich manipulation benchmark.
- The source is about image/video trajectories, not numeric time series, graph time series, observability telemetry, or digital operations.
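The 16-frame memory limit listed above can be illustrated with a sliding context window. This sketch assumes the context behaves like a simple first-in, first-out buffer, which is a simplification and not a description of the model's internals. `rollout_context` and `CONTEXT_FRAMES` are hypothetical names.

```python
from collections import deque

CONTEXT_FRAMES = 16  # memory length reported on this page

def rollout_context(num_steps, context_frames=CONTEXT_FRAMES):
    """Return the frame indices still visible to the model after num_steps."""
    context = deque(maxlen=context_frames)  # oldest frames are dropped automatically
    for frame_index in range(num_steps):
        context.append(frame_index)
    return list(context)

# After 40 generated frames, only the most recent 16 remain in context.
visible = rollout_context(num_steps=40)
```

Anything that left the window, such as the layout of an area the player walked away from, can no longer constrain generation, so the model may regenerate it differently when the player returns.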
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Control and counterfactuals | partially closes outside time series | Genie predicts future image observations conditioned on latent actions inferred from video-only trajectories. | No numeric observations, typed actions, intervention logs, causal controls, or operational telemetry. |
| Data diversity and scaling substrate | partially closes outside time series | Uses a large curated video corpus and reports scaling behavior with model size and batch size. | Evidence is platformer video; no high-dimensional numeric channel scaling or event-stream scaling law. |
| Representation quality: latent action and video tokens | adjacent | Separates video tokens from a learned latent-action codebook and shows pixel-input LAMs improve controllability. | No evidence that the latent state preserves decision-relevant numeric regimes, rare failures, or causal variables. |
| Robotics/action interface | adjacent | The robotics model shows latent actions can emerge from action-free robot videos. | Proof-of-concept only; no policy evaluation, explicit control-input schema, or cross-embodiment benchmark. |
| Benchmarks: what level of modeling is tested? | warning | FVD and PSNR-difference measure fidelity and action effect, while CoinRun tests a small latent-action imitation bridge. | Needs long-horizon, closed-loop, decision-facing, and simulator-exploitation evaluations. |
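The PSNR-difference metric in the last table row can be sketched as follows. The sketch assumes the metric is the PSNR of a frame generated with latent actions inferred from the ground-truth video, minus the PSNR of a frame generated with randomly sampled latent actions, both measured against the true frame. The function names and the toy frames are hypothetical.

```python
import numpy as np

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for arrays scaled to [0, max_value]."""
    diff = reference.astype(np.float64) - estimate.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def delta_psnr(true_frame, frame_inferred_actions, frame_random_actions, max_value=1.0):
    """Assumed controllability score: how much the choice of action matters.

    A larger value means frames generated with inferred actions match the
    true frame much better than frames generated with random actions.
    """
    return (psnr(true_frame, frame_inferred_actions, max_value)
            - psnr(true_frame, frame_random_actions, max_value))

# Toy frames: one generation is close to the truth, the other is far from it.
true_frame = np.full((4, 4), 0.5)
close_frame = true_frame + 0.01  # stands in for the inferred-action generation
far_frame = true_frame + 0.1     # stands in for the random-action generation
score = delta_psnr(true_frame, close_frame, far_frame)  # roughly 20 dB
```

A score near zero would mean the latent action has little effect on the generated frame, so this metric measures action effect rather than visual fidelity. That is why the table pairs it with FVD.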
Links Into The Wiki
- Genie
- World Models
- Foundation Time-Series Model Research Agenda
- Latent Action Models
- Robotics Time-Series Modeling
- Digital World Models
- Latent-Space Predictive Learning
- Vision Foundation Models
Open Questions
- How far do latent actions learned from unlabeled image/video trajectories transfer once the target system has typed actions, delayed effects, failures, or safety constraints?
- Can a Genie-like latent-action model become useful for observability or industrial time series, where the action channel should be explicit rather than inferred?
- What evaluation would catch simulator exploitation in generated interactive environments before an agent learns brittle policies inside the model?
- Which later Genie-family or game-world-model sources should supersede this page for current SOTA claims?
- Can the official Tim Rocktäschel X announcement thread be verified from a public source without login-gated X access?