World Models

Summary

World models are learned predictive representations of the relevant state and dynamics of an environment or system. They let an agent or analyst evaluate plausible futures, constraints, and action consequences before acting. In the LeCun/AMI framing, the important move is to learn abstract representations from observation, predict in representation space, maintain usable state or memory, and support reasoning and planning rather than merely predicting the next token or reconstructing every pixel.

For time series, read this page through the lens of the Foundation Time-Series Model Research Agenda. A forecasting model becomes world-model-adjacent only when it helps maintain system state, understand context, reason about plausible futures, or evaluate action consequences. What’s Wrong With The Current Time-Series Deep Learning? remains the landmark position source for why that standard is broader than observation forecasting.

What The Wiki Currently Believes

  • World Models is the historical visual-control anchor for action-conditioned latent dynamics: VAE image compression, MDN-RNN latent prediction under actions, and a small controller optimized through rollout reward.
  • Genie is the video-only scaling anchor for generative interactive environments: it introduces a latent action model that learns action-like codes from unlabeled image/video trajectories, then uses those codes to make a visual world model controllable frame by frame.
  • Agentic World Modeling is the current survey/taxonomy anchor for separating L1 predictors, L2 simulators, and L3 evolvers across physical, digital, social, and scientific law regimes.
  • World Model for Robot Learning Survey is the robotics-specific survey/taxonomy anchor: it separates world models for policies, learned simulators/evaluators, and robotic video generation, and makes action-conditioned consistency and control utility stricter than visual realism.
  • On Training in Imagination is the data-economics theory anchor for learned world-model training: it separates dynamics-transition error from reward-model error, derives a budget split under power-law assumptions, and warns that zero-mean reward noise is not the same as systematic reward bias.
  • CWM is the code-domain anchor for computational-environment world modeling: it trains an LLM on Python execution traces and agentic Docker trajectories where actions and observations are explicit.
  • APTAMI treats configurable predictive world models as essential for autonomous agents.
  • LeWorldModel gives a compact end-to-end JEPA world model from pixels for control.
  • stable-worldmodel is the current infrastructure anchor for reproducible world-model evaluation: it standardizes trajectory data handling, MPC solvers, baselines, and factors of variation, while also showing that current visual/control world models remain brittle under distribution shift.
  • When Does LeJEPA Learn a World Model? gives the state-identifiability boundary for the LeJEPA line: a learned state can be a faithful world-model coordinate system under Gaussian/OU-style assumptions, but the action-conditioned transition still has to be learned separately.
  • Reconstruction or Semantics? shows that latent-space choice matters for robotic diffusion world models and that semantic latents can be more policy-relevant than reconstruction latents.
  • RAEv2 adds an action-conditioned navigation boundary case: a multi-layer representation autoencoder improves autoregressive future-frame rollouts on RECON, but the evaluation is still visual-video prediction rather than closed-loop planning.
  • VL-JEPA is world-model-adjacent rather than a complete world model: it predicts vision-language target embeddings and supports selective text readout, but does not model future states under candidate actions.
  • Beyond Language Modeling reports that unified multimodal pretraining can naturally induce world-modeling capabilities.
  • ChronoGraph is a graph-temporal telemetry near-miss: it has service topology, multivariate metrics, and incident labels, but not controllable action or intervention logs.
  • Toto 2.0 TSALM Workshop Presentation is a roadmap source for multimodal observability world models, but the speaker explicitly uses the term loosely and the current Toto 2.0 system remains a passive forecasting model.
  • π0.7 adds a robotics boundary case: a lightweight world model generates near-future multi-view subgoal images from current observations, subtask text, and metadata, then a separate VLA action expert executes control-input chunks. Treat this as a future-observation/subgoal bridge for policy conditioning, not as full candidate-action rollout unless candidate action sequences are modeled directly.
  • Gemini Robotics 1.5 is another robotics boundary case: it uses embodied reasoning for planning, progress checking, and subtask handoff, but the source does not describe a learned future-state simulator under candidate actions.
  • EBT is a world-model-adjacent mechanism rather than a demonstrated action-conditioned world model: its future-work section sketches jointly scoring current context, future states, and future actions, then optimizing actions through energy minimization, but the reported experiments are text, video, and image-denoising tasks without action channels.
  • RATE is an action-trajectory boundary case: it models return-conditioned offline RL trajectories with recurrent memory and explicit actions, but it is a policy/decision model rather than a learned action-conditioned simulator of next-state dynamics.
  • Dragon Hatchling is a state-maintenance architecture boundary case: it updates a large recurrent fast state and probes sparse synapse-like concepts, but the reported evidence has no action, control-input, or intervention channel.

Observability Boundary

Observability data belongs mostly in Observability Time Series, not in the core world-model evidence set. It invites world-model language because it contains metrics, traces, logs, topology, code changes, events, alerts, and incident timelines. In this wiki’s terminology, metrics are observations, traces and logs may be event streams, topology is context, incidents may be events or exogenous shocks, and deployments or remediations become actions or interventions only when they are logged as controllable decisions with downstream consequences.

That means an observability forecasting model can be an excellent passive dynamics model without yet being an action-conditioned world model. The missing step is to join forecasted telemetry with operator actions such as deployments, rollbacks, autoscaling, traffic shaping, feature flags, remediation playbooks, or incident-response choices.

That join is not only extra control metadata. It can simplify the learning target. Without action logs, a passive model must fit one history -> future mapping that mixes waits, rollbacks, restarts, deploys, traffic shifts, remediation steps, and external incidents. With explicit actions, the model can learn history + action -> future conditional dynamics and compare candidate interventions. The complexity moves into the data contract: targets, timing, parameters, action status, and outcomes must be recorded well enough that operator behavior is not invisible background noise.
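The difference between the passive history -> future mapping and the conditional history + action -> future mapping can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `ActionRecord` fields are a hypothetical, reduced data contract, the `TRUE_EFFECT` values are invented toy numbers rather than measurements, and both "models" are simple in-sample mean estimates rather than learned forecasters.

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionRecord:
    """Hypothetical minimal data contract for one logged operator action."""
    target: str      # service the action touched
    action: str      # typed action name, e.g. "rollback"
    issued_at: int   # timestep at which the action was issued
    status: str      # e.g. "succeeded" or "failed"


# Assumed toy effect of each action on next-step latency (ms); not real data.
TRUE_EFFECT = {"wait": 0.0, "rollback": -40.0, "deploy": 25.0}


def simulate(n, seed=0):
    """Generate (history, action record, future) triples from the toy system."""
    rng = random.Random(seed)
    rows = []
    for t in range(n):
        latency = rng.uniform(100.0, 200.0)
        name = rng.choice(sorted(TRUE_EFFECT))
        record = ActionRecord("checkout", name, t, "succeeded")
        future = latency + TRUE_EFFECT[name] + rng.gauss(0.0, 2.0)
        rows.append((latency, record, future))
    return rows


def mse(errors):
    return sum(e * e for e in errors) / len(errors)


def fit_and_score(rows):
    """Compare one shared delta (passive) with one delta per action (conditional)."""
    deltas = [future - history for history, _, future in rows]
    passive_delta = sum(deltas) / len(deltas)
    passive_mse = mse([d - passive_delta for d in deltas])

    by_action = defaultdict(list)
    for history, record, future in rows:
        by_action[record.action].append(future - history)
    cond_delta = {a: sum(v) / len(v) for a, v in by_action.items()}
    cond_mse = mse([(future - history) - cond_delta[record.action]
                    for history, record, future in rows])
    return passive_mse, cond_mse, cond_delta
```

Calling `fit_and_score(simulate(600))` returns the two in-sample errors and the per-action deltas. In this toy setting the passive estimate has to absorb the spread between action effects as unexplained variance, while the conditional estimate can also be queried per candidate action, which is the property needed for comparing interventions.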

For cross-system transfer, an SRE world model also needs an explicit system embodiment descriptor. In robotics, cross-embodiment transfer depends on the contract between shared policy state and robot-specific controllers. In production operations, the analogous contract is service graph + telemetry schema + intervention capabilities -> typed actions/control inputs -> system-specific executor. Without that descriptor, a model can overfit to one monitoring stack’s channel order or one service graph’s topology rather than learning transferable operational dynamics.
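One way to picture the contract service graph + telemetry schema + intervention capabilities -> typed actions/control inputs -> system-specific executor is the sketch below. All names here (`TypedAction`, `EmbodimentDescriptor`, `stack_a_executor`, the `stack-a` command prefix, and the example services) are hypothetical and do not correspond to any real tool or to a schema defined by the sources on this page.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TypedAction:
    """System-independent action: what to do, to which service, with which parameters."""
    kind: str
    target: str
    params: dict = field(default_factory=dict)


@dataclass(frozen=True)
class EmbodimentDescriptor:
    """Hypothetical descriptor of one production system's 'embodiment'."""
    service_graph: dict      # service -> list of downstream services
    telemetry_schema: dict   # service -> tuple of metric names
    capabilities: dict       # service -> tuple of supported action kinds

    def check(self, action):
        """Reject actions this system cannot express before they reach an executor."""
        if action.target not in self.service_graph:
            raise ValueError(f"unknown service: {action.target}")
        if action.kind not in self.capabilities.get(action.target, ()):
            raise ValueError(f"{action.target} does not support {action.kind}")
        return action


def stack_a_executor(action):
    """Hypothetical system-specific executor: renders a typed action as a command string."""
    args = " ".join(f"{k}={v}" for k, v in sorted(action.params.items()))
    return f"stack-a {action.kind} {action.target} {args}".strip()


# Illustrative descriptor for an invented two-service system.
DESCRIPTOR = EmbodimentDescriptor(
    service_graph={"checkout": ["payments"], "payments": []},
    telemetry_schema={"checkout": ("latency_ms", "error_rate"),
                      "payments": ("latency_ms",)},
    capabilities={"checkout": ("scale", "rollback"), "payments": ("rollback",)},
)
```

A shared model would emit `TypedAction` values only, and each deployment stack would supply its own descriptor and executor. Keying the telemetry schema by service and metric name, rather than by channel position, is one possible guard against the channel-order overfitting described above.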

The Toto TSALM roadmap names time series plus logs as the first multimodal step and learned simulation for SRE agents as a target. In this wiki, that is a useful direction-of-travel signal, not proof that the released Toto 2.0 checkpoints can evaluate interventions or plan action sequences.

Digital World Boundary

Digital World Models are the software-defined branch of world modeling. In the Agentic World Modeling taxonomy, the governing laws are API contracts, UI state machines, file-system logic, type constraints, permissions, error branches, and other executable or mechanically checkable transition rules. This makes web, GUI, code, game, and desktop environments attractive world-model testbeds because rollouts can often be replayed or verified.

The boundary matters for this wiki’s observability agenda. A web or GUI simulator can be a true digital world model without being a telemetry-native action-conditioned world model. Production operations add numeric time series, graph time series, event streams, hidden concurrent users, delayed effects, failed actions, and human-approval semantics. The useful transfer is the action/state/constraint contract, not a claim that screenshots or DOM prediction solve SRE control.

Hierarchy And Compute Budget

For world models, hierarchy is not only an efficiency trick. The useful hierarchy should preserve state variables and event boundaries that matter for future observations under actions, control inputs, or interventions. In time-series terms, the desired stack is closer to samples -> local motifs -> events -> regimes -> latent state than to “a cheaper long-context Transformer”.

H-Net and ConceptMoE are useful architecture analogs because they move from fine units to compressed chunks or concepts before expensive processing. A world model must still protect rare events, change points, and intervention effects that may look cheap to compress but remain decision-relevant.

Dragon Hatchling adds an adjacent memory/state hypothesis: a world model might benefit from a large sparse fast state whose updates are interpretable and local. For this page, that is only a design hypothesis until the architecture is paired with observations, actions or control inputs, and transition objectives that can evaluate candidate futures.

The local design note Hierarchical Modeling with a Fixed FLOPs Budget frames this as fixed-FLOPs adaptive hierarchy: spend compute where it improves latent-state maintenance and action consequence prediction, not merely where it lowers reconstruction or forecasting loss.

Evidence

The corpus moves from historical latent rollout to conceptual architecture and then to model selection: build predictive latent dynamics, but choose the latent space according to downstream planning relevance rather than visual fidelity alone.

World Models (2018) gives the early VAE + MDN-RNN + controller pattern and the core simulator-exploitation warning: a controller can overfit to hallucinated dynamics unless model uncertainty and transfer are tested.

CWM transfers the action-conditioned frame into code: Python source lines, shell commands, and edits become actions, while local variables, command output, tests, and files become observations. It is strong evidence that digital-world action-observation traces can be useful for LLM training, but it remains code-centric rather than telemetry-native.

VL-JEPA sharpens the interface distinction: a semantic embedding stream can be useful for perception, monitoring, and selective language readout without yet being an action-conditioned simulator.

EBT adds a candidate-future scoring mechanism: a world model could use low energy as a compatibility score for proposed futures or actions, but that remains a transfer hypothesis until tested with action-conditioned rollouts.

RATE adds the complementary policy-side warning: explicit action trajectories and long-horizon memory are not enough by themselves; a world model also needs a transition/dynamics interface for comparing candidate actions.

The observability boundary adds a second constraint: passive metric forecasts are not enough for action consequence reasoning unless the action or control input channel is present.
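The transition/dynamics interface for comparing candidate actions can be sketched in a few lines. This is a toy under stated assumptions, not a method from any of the sources above: `transition` is a hand-written stand-in for a learned dynamics model, the action names, state variables, and coefficients are invented, and `cost` is an arbitrary illustrative objective.

```python
from itertools import product

# Hypothetical typed action vocabulary.
ACTIONS = ("wait", "scale_up", "rollback")


def transition(state, action):
    """Stand-in for a learned dynamics model: (state, action) -> next state."""
    load, error_rate = state
    if action == "scale_up":
        load *= 0.7          # assumed: scaling relieves load
    elif action == "rollback":
        error_rate *= 0.5    # assumed: rollback halves the error rate
    else:
        load *= 1.1          # assumed: waiting lets load drift upward
    return (load, error_rate)


def cost(state):
    """Illustrative objective: lower load and lower error rate are better."""
    load, error_rate = state
    return load + 100.0 * error_rate


def rollout(state, plan):
    """Accumulate predicted cost along one candidate action sequence."""
    total = 0.0
    for action in plan:
        state = transition(state, action)
        total += cost(state)
    return total


def rank_plans(state, horizon=2):
    """Enumerate every action sequence of the given length, cheapest first."""
    plans = product(ACTIONS, repeat=horizon)
    return sorted(plans, key=lambda plan: rollout(state, plan))
```

For example, `rank_plans((50.0, 0.4))` orders all nine two-step plans by predicted cumulative cost. A policy model that only emits actions from logged trajectories has no counterpart to `transition`, so it cannot produce this kind of ranking. The simulator-exploitation warning still applies: the ranking is only as trustworthy as the transition model, and exhaustive enumeration is feasible here only because the action set and horizon are tiny.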

Genie adds a complementary action-discovery result: if ground-truth actions are absent, a latent action model can recover a small action-like interface that makes generated image/video trajectories controllable. That is valuable for unlabeled video scale, but it is not a substitute for typed action logs when the system can expose actions, control inputs, interventions, status, timing, and outcomes.

RAEv2 extends the latent-space selection evidence into navigation-video rollouts: retaining local spatial information through multi-layer encoder aggregation can reduce flicker and improve FVD under action-conditioned autoregressive rollout. The caveat is that this is still rollout fidelity, not proof that the model can rank or optimize candidate action sequences.

LeJEPA Identifiability makes the world-model label more precise. It supports the claim that a representation can recover useful latent state coordinates, but only under explicit assumptions about the data-generating process, embedding distribution, and alignment optimum. It should therefore be used as evidence about the state representation side of world models, not as evidence that action-conditioned dynamics, exploration coverage, or candidate-action rollout are solved.

stable-worldmodel adds the evaluation-infrastructure side of the same boundary. It does not propose a new world-model objective, but it makes action-conditioned evaluation more auditable by sharing trajectory storage, baseline implementations, solvers, and controllable factors of variation. Its Push-T analyses strengthen the hygiene warning that low prediction error and in-distribution success do not prove robust planning under distribution shift.

The robot-learning survey broadens the same warning into a policy-centric map. It treats explicit video rollout, latent prediction, and symbolic/planner-facing state as alternative world-model interfaces, but requires each interface to preserve action consequences and downstream policy utility. This is the bridge between robotics-specific world models and the wiki’s broader time-series agenda: prediction quality is useful only insofar as it preserves the variables needed to compare candidate actions, control inputs, or interventions.

On Training in Imagination adds the reward side of that boundary. A learned simulator used for policy optimization is at least two learned objects: a dynamics model and a reward model. Their errors, annotation costs, scaling exponents, and noise/bias profiles should be reported separately before turning imagined-rollout success into a data-collection rule.
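The distinction between zero-mean reward noise and systematic reward bias can be made concrete with a small simulation. It is a hedged illustration only: the two policies, their true returns, the noise level, and the bias value are invented for the example and are not taken from On Training in Imagination.

```python
import random
from statistics import fmean

# Assumed true expected returns of two hypothetical policies.
TRUE_RETURN = {"A": 1.0, "B": 1.2}


def estimated_return(policy, n, noise_sd, bias, rng):
    """Average n reward-model evaluations with Gaussian noise and a per-policy bias."""
    samples = [TRUE_RETURN[policy] + bias.get(policy, 0.0) + rng.gauss(0.0, noise_sd)
               for _ in range(n)]
    return fmean(samples)


def preferred_policy(n, noise_sd, bias, seed=0):
    """Return the policy that the (noisy, possibly biased) reward model ranks highest."""
    rng = random.Random(seed)
    estimates = {p: estimated_return(p, n, noise_sd, bias, rng) for p in TRUE_RETURN}
    return max(estimates, key=estimates.get)
```

With many samples, `preferred_policy(20000, 1.0, {})` recovers the truly better policy because zero-mean noise averages out, whereas `preferred_policy(20000, 1.0, {"A": 0.5})` prefers the worse policy however large the sample is. Collecting more reward labels addresses the first failure but not the second, which is one reason to report noise and bias separately.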

Relation To Foundation TSFM Agenda

This page is the general world-model counterpart to the Foundation Time-Series Model Research Agenda. It provides the problem framing for why state, plausible futures, hierarchy, and action consequences matter. It does not by itself close the TSFM-specific slots for high-dimensional numeric observations, irregular event streams, context schemas, dense numeric generation, or digital-world intervention logs.

Open Questions

  • How should long-horizon planning be layered on top of compact latent predictors?
  • Are semantic latents sufficient for control tasks that require precise geometry?
  • What observability benchmark would join metrics, traces, logs, topology, alerts, and operator actions strongly enough to test intervention-aware world models?
  • How should passive forecasting scores be combined with counterfactual or action-conditioned evaluation when both matter operationally?
  • How should semantic embedding streams be connected to candidate-action rollout without turning every internal state into natural language?
  • Which data-collection policies make learned world-model state identifiable without collapsing onto policy-biased, non-Gaussian trajectory marginals?
  • How should memory-augmented policy models like RATE be paired with learned dynamics so they can evaluate candidate interventions rather than only choose actions from logged trajectories?
  • Can fixed-FLOPs hierarchy preserve rare events and intervention effects better than uniform token processing at the same serving cost?
  • Can explicit energy minimization over candidate futures or interventions avoid mode averaging while remaining cheap enough for operational control loops?
  • What system embodiment descriptor is sufficient for transferring operational world models across different service graphs, telemetry schemas, and deployment stacks?
  • Can layer-aggregation choices in pretrained encoders change world-model planning quality, not only visual rollout fidelity?
  • Can CWM-style execution-trace training transfer from deterministic code environments to noisy telemetry systems with delayed, partially observed intervention effects?
  • Can a Dragon Hatchling (BDH)-style sparse fast state become useful for action-conditioned rollout once typed actions and control inputs are first-class channels?
  • Which digital-world abstractions transfer best to operations: DOM-like state, executable code, typed action logs, replayable tests, or explicit error branches?
  • Which modern uncertainty and evaluation protocols prevent controllers from exploiting a learned simulator instead of learning policies that transfer?
  • When should a world model infer latent actions from observation-only trajectories, and when should it require typed actions, control inputs, intervention status, timing, and outcomes?
  • How should dynamics-transition data and reward-annotation data be budgeted when reward labels are expensive, noisy, delayed, or biased?