World Models

Summary

World models are learned predictive representations of the relevant state and dynamics of an environment or system. They let an agent or analyst evaluate plausible futures, constraints, and action consequences before acting. In the LeCun/AMI framing, the important move is to learn abstract representations from observation, predict in representation space, maintain usable state or memory, and support reasoning and planning rather than merely predicting the next token or reconstructing every pixel.

For time series, read this page through the Foundation Time-Series Model Research Agenda. A forecasting model becomes world-model-adjacent only when it helps maintain system state, understand context, reason about plausible futures, or evaluate action consequences. What’s Wrong With The Current Time-Series Deep Learning? remains the landmark position source for why that standard is broader than observation forecasting.

What The Wiki Currently Believes

World Models is the historical visual-control anchor for action-conditioned latent dynamics: VAE image compression, MDN-RNN latent prediction under actions, and a small controller optimized through rollout reward.
Genie is the video-only scaling anchor for generative interactive environments: it introduces a latent action model that learns action-like codes from unlabeled image/video trajectories, then uses those codes to make a visual world model controllable frame by frame.
Motion Attribution for Video Generation is a data-curation analogue, not a demonstrated world model. It traces target motion dynamics to influential fine-tuning clips and shows targeted temporal-quality gains, but it has no action, control-input, intervention, state-transition, planning, or counterfactual interface. Its relevance is to choosing observation trajectories that may teach useful dynamics before an action-conditioned world model is trained.
VLA-JEPA is the robotics VLA pretraining branch of the same latent-action/world-model theme: it uses target-side V-JEPA2 future-state embeddings and learner-side latent-action tokens, then attaches a flow-matching action head for continuous robot control inputs.
OTF-LAM is the agent-ambiguity branch: it factorizes observation transitions into reusable local observed effects before aggregating them into action-like latents, warning that monolithic latent actions can absorb distractor, camera, and background dynamics.
Agentic World Modeling is the current survey/taxonomy anchor for separating L1 predictors, L2 simulators, and L3 evolvers across physical, digital, social, and scientific law regimes.
Agentic Automata Learning is the current controlled benchmark warning for LLM-agent world-model inference: even with exact membership/equivalence-query tools and deterministic DFA targets, frontier agents lose robustness, query efficiency, and evidence consistency as hidden-state complexity grows.
Awesome Agentic Time Series is the time-series-specific survey/list source for closed-loop temporal agents, including perception, reasoning, planning/action, memory, temporal world models, and reliability.
World Model for Robot Learning Survey is the robotics-specific survey/taxonomy anchor: it separates world models for policies, learned simulators/evaluators, and robotic video generation, and makes action-conditioned consistency and control utility stricter than visual realism.
On Training in Imagination is the data-economics theory anchor for learned world-model training: it separates dynamics-transition error from reward-model error, derives a budget split under power-law assumptions, and warns that zero-mean reward noise is not the same as systematic reward bias.
CWM is the code-domain anchor for computational-environment world modeling: it trains an LLM on Python execution traces and agentic Docker trajectories where actions and observations are explicit.
APTAMI treats configurable predictive world models as essential for autonomous agents.
VJEPA gives the probabilistic JEPA branch: a time-indexed predictor defines a conditional distribution over future latent states, supports belief propagation without pixel reconstruction, and can be conditioned on explicit control inputs. Its evaluated diagonal-Gaussian head is unimodal, so the paper formalizes distributional latent prediction without yet demonstrating separated multi-modal futures.
Superhuman Adaptable Intelligence is a north-star terminology source rather than world-model evidence: it argues that fast adaptation should rely on self-supervised learning, latent prediction, world models, and modular specialization, but it does not add a concrete action-conditioned model.
LeWorldModel gives a compact end-to-end JEPA world model from pixels for control.
Temporal Straightening adds the planner-geometry branch: it regularizes consecutive latent velocities so Euclidean latent distance better tracks feasible progress and gradient-based candidate-action optimization becomes easier in the tested visual goal-reaching tasks.
Sensorimotor World Models gives an inverse-dynamics-regularized branch of the same pixel/action line: logged actions become the anti-collapse signal that shapes the latent state toward controllable degrees of freedom.
AdaJEPA adds the adaptive-deployment branch: a JEPA latent world model is updated from observed transitions inside the MPC loop before replanning, improving robustness under visual, geometric, dynamics, and layout shifts.
SkyJEPA gives a low-dimensional physical-control counterpart: state/action histories and candidate motor-control sequences are rolled forward in latent space, decoded through a structured prober, and used by MPPI for outdoor quadrotor control.
Looped World Models adds the recurrent-depth architecture branch: a parameter-shared Transformer dynamics core refines action-conditioned latent state with spectral retention, adaptive early exit, and deferred decoding. Treat it as important world-model architecture evidence, with the caveat that the current technical report has no verified public code/model artifact and discloses supporting results only selectively.
NextLat adds a Transformer belief-state objective: next-hidden-state supervision can make autoregressive hidden states more compact and future-predictive, but its transition input is the next token rather than a typed action, control input, or intervention.
stable-worldmodel is the current infrastructure anchor for reproducible world-model evaluation: it standardizes trajectory data handling, MPC solvers, baselines, and factors of variation, while also showing that current visual/control world models remain brittle under distribution shift.
When Does LeJEPA Learn a World Model? gives the state-identifiability boundary for the LeJEPA line: a learned state can be a faithful world-model coordinate system under Gaussian/OU-style assumptions, but the action-conditioned transition still has to be learned separately.
Reconstruction or Semantics? shows that latent-space choice matters for robotic diffusion world models and that semantic latents can be more policy-relevant than reconstruction latents.
RAEv2 adds an action-conditioned navigation boundary case: a multi-layer representation autoencoder improves autoregressive future-frame rollouts on RECON, but the evaluation is still visual-video prediction rather than closed-loop planning.
VL-JEPA is world-model-adjacent rather than a complete world model: it predicts vision-language target embeddings and supports selective text readout, but does not model future states under candidate actions.
VLWM is the high-level language-state branch: video context is mapped to goals plus interleaved textual actions and world-state changes, with direct System-1 decoding or System-2 critic-ranked candidate rollouts. It is more inspectable than opaque latent rollout, but free-form text is not a calibrated physical state and the exact model/data artifacts remain unreleased.
Action100M is the data-engine continuation of that branch. Its hierarchical HowTo100M annotations scale action, actor, and caption supervision, but observed action labels are events rather than typed control inputs, and the released schema does not contain explicit before/after state-change targets.
Beyond Language Modeling reports that unified multimodal pretraining can naturally induce world-modeling capabilities.
ChronoGraph is a graph-temporal telemetry near-miss: it has service topology, multivariate metrics, and incident labels, but not controllable action or intervention logs.
Toto 2.0 TSALM Workshop Presentation is a roadmap source for multimodal observability world models, but the speaker explicitly uses the term loosely and the current Toto 2.0 system remains a passive forecasting model.
π0.7 adds a robotics boundary case: a lightweight world model generates near-future multi-view subgoal images from current observations, subtask text, and metadata, then a separate VLA action expert executes control-input chunks. Treat this as a future-observation/subgoal bridge for policy conditioning, not as full candidate-action rollout unless candidate action sequences are modeled directly.
Gemini Robotics 1.5 is another robotics boundary case: it uses embodied reasoning for planning, progress checking, and subtask handoff, but the source does not describe a learned future-state simulator under candidate actions.
EBT is a world-model-adjacent mechanism rather than a demonstrated action-conditioned world model: its future-work section sketches jointly scoring current context, future states, and future actions, then optimizing actions through energy minimization, but the reported experiments are text, video, and image-denoising tasks without action channels.
RATE is an action-trajectory boundary case: it models return-conditioned offline RL trajectories with recurrent memory and explicit actions, but it is a policy/decision model rather than a learned action-conditioned simulator of next-state dynamics.
CityLearn is a non-vision energy-control environment: multivariate building observations plus weather/pricing context and continuous storage/device control inputs can generate observation/action/reward/next-observation trajectories, but schema and version pinning are required.
Topological Neural Operators is scientific-ML operator-learning evidence for topology-aware passive dynamics: cell-complex support and DEC operators preserve geometric structure for PDE fields, but the paper does not provide action-conditioned rollouts, rewards, or planning evidence.
Grid2Op is the power-grid counterpart: graph-structured numeric observations, topology/control inputs, exogenous contingencies, physical simulators, and safety/cost outcomes make it a strong non-vision action-conditioned world-model testbed. The historical evidence spans pre-benchmark counterfactual action-label extraction, expert-system action discovery, topology-controller challenges, policy/search systems such as Afterstate Actor-Critic and AlphaZero-style topology optimization, distributed active-power correction, adversarial robustness, LEAP-Net perturbation, LEAP system-identification, and Graph Neural Solver surrogate work. The fresh branch adds RL2Grid and MARL2Grid-TR as benchmark anchors, soft-label imitation learning as simulator-distilled action ranking, Gibbs-prior topology control as a one-step action-conditioned overload-risk surrogate, targeted exploration as physics-guided action proposal for line-switching under cascading-failure risk, runtime safety shielding as the physical-simulator runtime-filter branch, and interpretable policy distillation as an auditable action-head companion rather than dynamics modeling. LLM-guided safe RL is a training-time transition-refinement branch, not a learned simulator. Varying Grid Topology is the historical metadata-only precedent for learned GCN line-loading prediction plus MCTS planning, and the Graph RL survey is the secondary-source support for that claim until the full text is available. The Grid2Op line still lacks a TSFM-native learned transition benchmark that records candidate-action rollouts, simulator-query budgets, residual-risk estimates, and control-quality gaps.
A World Model Based Reinforcement Learning Architecture for Autonomous Power System Control is a power-systems world-model precedent, but not Grid2Op evidence: it studies IEEE 14-bus FACTS setpoint control with a learned model and safety shield. Treat it as context for the phrase “world model” in power systems, not as current L2RPN SOTA.
Dragon Hatchling is a state-maintenance architecture boundary case: it updates a large recurrent fast state and probes sparse synapse-like concepts, but the reported evidence has no action, control-input, or intervention channel.
DiGA and MarS add a financial-market branch: generated order-flow or market trajectories can support scenario analysis, injected-order what-if tests, and trading-agent training. They remain domain-specific learned simulators whose causal market-impact validity, data access, multi-asset transfer, and simulator assumptions need separate validation.

Probabilistic Belief Boundary

A world model used for planning under partial observability should usually maintain a belief over state and future trajectories, not only one latent state and one rollout. This creates three distinct evidence levels:

Passive observation distributions model several possible future measurements but do not expose latent state or candidate control inputs.
Latent-state distributions model uncertainty over future internal state but may still omit actions, interventions, and decision utility.
Action-conditioned latent beliefs propagate state distributions under candidate control inputs and support risk-sensitive planning.

VJEPA formalizes the second level and writes down the third. Its control-sufficiency claim remains conditional on the learned latent being predictively sufficient for cost-relevant consequences under candidate control inputs. The ICML poster’s small single-seed DMC result is not enough to establish robust stochastic control, and the diagonal-Gaussian head cannot represent separated future modes. For this page, VJEPA is therefore a strong interface/theory source and a limited empirical control source.

The evaluation target is not merely sample diversity. A model should preserve distinct valid regimes, assign calibrated mass, avoid invalid between-mode trajectories, expose tail risk, and show that candidate actions move the predictive distribution in the right direction. Otherwise a probabilistic head can be distributional in notation while remaining operationally equivalent to a smoothed point predictor.
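
The gap between distributional notation and operational behavior can be illustrated with a toy check. The sketch below is not VJEPA's evaluation protocol. It assumes a hypothetical two-regime ground truth, fits a moment-matched single Gaussian as a stand-in for a unimodal head, and measures how much predicted mass lands between the regimes, where no valid future lies. The regime centers, the interval bounds, and the `between_mode_mass` helper are all invented for illustration.

```python
import random
import statistics


def between_mode_mass(samples, low=-1.0, high=1.0):
    """Fraction of samples that land in the (assumed invalid) gap between regimes."""
    return sum(low < s < high for s in samples) / len(samples)


rng = random.Random(0)

# Hypothetical ground truth: two well-separated regimes centered at -2 and +2.
truth = [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 0.3) for _ in range(5000)]

# Stand-in for a unimodal predictive head: one Gaussian matched to mean and spread.
mu = statistics.fmean(truth)
sigma = statistics.pstdev(truth)
unimodal = [rng.gauss(mu, sigma) for _ in range(5000)]

truth_mass = between_mode_mass(truth)
unimodal_mass = between_mode_mass(unimodal)
print(f"between-mode mass: truth={truth_mass:.3f}, unimodal fit={unimodal_mass:.3f}")
```

Under these assumptions the ground truth places almost no mass in the gap, while the moment-matched Gaussian places a large share there. A sample-diversity score alone would not flag that failure, which is why the check counts mass in the invalid region instead.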

Observability Boundary

Observability data belongs mostly in Observability Time Series, not in the core world-model evidence set. It tempts world-model language because it contains metrics, traces, logs, topology, code changes, events, alerts, and incident timelines. In this wiki’s terminology, metrics are observations, traces and logs may be event streams, topology is context, incidents may be events or exogenous shocks, and deployments or remediations become actions or interventions only when they are logged as controllable decisions with downstream consequences.

That means an observability forecasting model can be an excellent passive dynamics model without yet being an action-conditioned world model. The missing step is to join forecasted telemetry with operator actions such as deployments, rollbacks, autoscaling, traffic shaping, feature flags, remediation playbooks, or incident-response choices.

That join is not only extra control metadata. It can simplify the learning target. Without action logs, a passive model must fit one history -> future mapping that mixes waits, rollbacks, restarts, deploys, traffic shifts, remediation steps, and external incidents. With explicit actions, the model can learn history + action -> future conditional dynamics and compare candidate interventions. The complexity moves into the data contract: targets, timing, parameters, action status, and outcomes must be recorded well enough that operator behavior is not invisible background noise.
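
A minimal sketch of such a data contract, assuming telemetry arrives as a regularly sampled list and each operator action is logged against a sample index. The `ActionRecord` fields and the `build_transitions` helper are hypothetical names chosen for illustration, not an existing schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ActionRecord:
    """One logged operator decision (all field names are hypothetical)."""
    step: int                      # index of the telemetry sample where the action lands
    kind: str                      # e.g. "rollback", "deploy", "autoscale"
    target: str                    # service or resource acted on
    parameters: dict = field(default_factory=dict)
    status: str = "succeeded"      # failed or rejected actions should be logged too


def build_transitions(metric, actions, history_len, horizon):
    """Return (history, action_or_None, future) triples for every valid step.

    A passive model would train on (history, future) only; the middle slot is
    what turns the target into history + action -> future.
    """
    by_step = {a.step: a for a in actions}
    triples = []
    for t in range(history_len, len(metric) - horizon + 1):
        history = metric[t - history_len:t]
        future = metric[t:t + horizon]
        triples.append((history, by_step.get(t), future))
    return triples


# Toy latency series: a regression appears at step 3 and a rollback lands at step 5.
latency_ms = [10, 10, 10, 50, 50, 20, 20, 20]
log = [ActionRecord(step=5, kind="rollback", target="checkout-service")]
triples = build_transitions(latency_ms, log, history_len=2, horizon=2)
acted = [tr for tr in triples if tr[1] is not None]
```

In this toy the recovery from 50 ms to 20 ms is attributable to the logged rollback. Without the middle slot, the same drop would have to be explained from the history alone.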

For cross-system transfer, an SRE world model also needs an explicit system embodiment descriptor. In robotics, cross-embodiment transfer depends on the contract between shared policy state and robot-specific controllers. In production operations, the analogous contract is service graph + telemetry schema + intervention capabilities -> typed actions/control inputs -> system-specific executor. Without that descriptor, a model can overfit to one monitoring stack’s channel order or one service graph’s topology rather than learning transferable operational dynamics.
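
The contract above can be sketched as a small data structure. This is one possible shape under stated assumptions, not a proposed standard: `SystemDescriptor`, its three fields, and `to_typed_action` are hypothetical names, and a real descriptor would likely also need versioning, units, and permission semantics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemDescriptor:
    """Hypothetical embodiment descriptor for one production system."""
    service_graph: dict      # service name -> list of downstream services
    telemetry_schema: dict   # service name -> tuple of metric names
    capabilities: dict       # service name -> set of supported action kinds


def to_typed_action(descriptor, kind, target, **parameters):
    """Validate a requested intervention against the descriptor before execution."""
    if target not in descriptor.service_graph:
        raise KeyError(f"unknown service: {target}")
    if kind not in descriptor.capabilities.get(target, set()):
        raise ValueError(f"{target} does not support {kind}")
    return {"kind": kind, "target": target, "parameters": parameters}


descriptor = SystemDescriptor(
    service_graph={"api": ["db"], "db": []},
    telemetry_schema={"api": ("latency_ms", "error_rate"), "db": ("qps",)},
    capabilities={"api": {"rollback", "autoscale"}, "db": {"failover"}},
)
action = to_typed_action(descriptor, "autoscale", "api", replicas=4)
```

Keeping the typed action separate from the system-specific executor is the design choice that mirrors the robotics split between shared policy state and robot-specific controllers.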

The Toto TSALM roadmap names time series plus logs as the first multimodal step and learned simulation for SRE agents as a target. In this wiki, that is a useful direction-of-travel signal, not proof that the released Toto 2.0 checkpoints can evaluate interventions or plan action sequences.

Digital World Boundary

Digital World Models are the software-defined branch of world modeling. In the Agentic World Modeling taxonomy, the governing laws are API contracts, UI state machines, file-system logic, type constraints, permissions, error branches, and other executable or mechanically checkable transition rules. This makes web, GUI, code, game, and desktop environments attractive world-model testbeds because rollouts can often be replayed or verified.

The boundary matters for this wiki’s observability agenda. A web or GUI simulator can be a true digital world model without being a telemetry-native action-conditioned world model. Production operations add numeric time series, graph time series, event streams, hidden concurrent users, delayed effects, failed actions, and human-approval semantics. The useful transfer is the action/state/constraint contract, not a claim that screenshots or DOM prediction solve SRE control.

Hierarchy And Compute Budget

For world models, hierarchy is not only an efficiency trick. The useful hierarchy should preserve state variables and event boundaries that matter for future observations under actions, control inputs, or interventions. In time-series terms, the desired stack is closer to samples -> local motifs -> events -> regimes -> latent state than to “a cheaper long-context Transformer”.

H-Net and ConceptMoE are useful architecture analogs because they move from fine units to compressed chunks or concepts before expensive processing. A world model must still protect rare events, change points, and intervention effects that may look cheap to compress but remain decision-relevant.
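
A toy illustration of that risk, assuming a scalar series with one rare spike. It is not the chunking mechanism of H-Net or ConceptMoE; `adaptive_chunks` is a hypothetical greedy rule used only to contrast content-aware boundaries with fixed-width pooling.

```python
def adaptive_chunks(series, tol):
    """Greedy chunking: extend a chunk while the next sample stays within tol of its mean."""
    chunks = []
    current = [series[0]]
    for x in series[1:]:
        mean = sum(current) / len(current)
        if abs(x - mean) <= tol:
            current.append(x)
        else:
            chunks.append(current)   # boundary: the sample starts a new chunk
            current = [x]
    chunks.append(current)
    return [(sum(c) / len(c), len(c)) for c in chunks]   # (chunk mean, chunk length)


def uniform_pool(series, width):
    """Fixed-width mean pooling, the 'cheap to compress' baseline."""
    windows = [series[i:i + width] for i in range(0, len(series), width)]
    return [sum(w) / len(w) for w in windows]


series = [1.0, 1.0, 1.0, 1.0, 9.0, 1.0, 1.0, 1.0]
adaptive = adaptive_chunks(series, tol=0.5)   # [(1.0, 4), (9.0, 1), (1.0, 3)]
pooled = uniform_pool(series, width=4)        # [1.0, 3.0]
```

The adaptive rule keeps the spike as its own unit at full magnitude, while fixed-width pooling dilutes it from 9.0 to 3.0. Both produce short sequences, so sequence length alone does not show which one preserved the decision-relevant event.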

Dragon Hatchling adds an adjacent memory/state hypothesis: a world model might benefit from a large sparse fast state whose updates are interpretable and local. For this page, that is only a design hypothesis until the architecture is paired with observations, actions or control inputs, and transition objectives that can evaluate candidate futures.

The local design note Hierarchical Modeling with a Fixed FLOPs Budget frames this as fixed-FLOPs adaptive hierarchy: spend compute where it improves latent-state maintenance and action consequence prediction, not merely where it lowers reconstruction or forecasting loss.

Looped World Models makes this compute-budget question explicit inside the world-model transition itself. Its looped dynamics core can allocate more recurrent depth to complex transitions and fewer iterations to easy ones, while deferred decoding avoids per-step observation heads during latent action rollouts. The open question is whether that budget improves planning utility after rollout latency, decoder cost, hidden-state drift, and simulator exploitation are counted.
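
The idea of spending more loops on harder transitions can be sketched with a scalar toy. This is a generic convergence-based halting rule, not the report's architecture: the `refine` update, the tolerance, and the use of a known `target` as a stand-in for where a learned shared block would settle are all assumptions made for illustration.

```python
def refine(state, target, rate=0.5):
    """One pass of a shared update; the same parameters are reused on every loop."""
    return state + rate * (target - state)


def looped_transition(state, target, tol=1e-2, max_loops=16):
    """Apply the shared update until the state stops changing or the budget runs out."""
    for loops in range(1, max_loops + 1):
        new_state = refine(state, target)
        converged = abs(new_state - state) < tol
        state = new_state
        if converged:
            break
    return state, loops


_, easy_loops = looped_transition(0.0, 0.01)   # small change: exits early
_, hard_loops = looped_transition(0.0, 1.0)    # large change: uses more loops
print(f"loops used: easy={easy_loops}, hard={hard_loops}")
```

The loop count is the compute actually spent per transition. A planning-utility evaluation would need to report that count alongside rollout accuracy, because a halting rule that exits too early trades latency for hidden-state drift.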

Evidence

The corpus moves from historical latent rollout to conceptual architecture and then to model selection: build predictive latent dynamics, but choose the latent space according to downstream planning relevance rather than visual fidelity alone. World Models (2018) gives the early VAE + MDN-RNN + controller pattern and the core simulator-exploitation warning: a controller can overfit to hallucinated dynamics unless model uncertainty and transfer are tested.

CWM transfers the action-conditioned frame into code: Python source lines, shell commands, and edits become actions, while local variables, command output, tests, and files become observations. It is strong evidence that digital-world action-observation traces can be useful for LLM training, but it remains code-centric rather than telemetry-native.

VL-JEPA sharpens the interface distinction: a semantic embedding stream can be useful for perception, monitoring, and selective language readout without yet being an action-conditioned simulator. EBT adds a candidate-future scoring mechanism: a world model could use low energy as a compatibility score for proposed futures or actions, but that remains a transfer hypothesis until tested with action-conditioned rollouts. RATE adds the complementary policy-side warning: explicit action trajectories and long-horizon memory are not enough by themselves; a world model also needs a transition/dynamics interface for comparing candidate actions.

The observability boundary adds a second constraint: passive metric forecasts are not enough for action consequence reasoning unless the action or control input channel is present.

Genie adds a complementary action-discovery result: if ground-truth actions are absent, a latent action model can recover a small action-like interface that makes generated image/video trajectories controllable. VLA-JEPA adds the VLA-policy version: use future video only as a latent target, avoid future-frame leakage into the learner, and test whether the resulting latent-action tokens improve continuous robot control. OTF-LAM adds the caution that observation-only transitions may be mixtures of multiple transition sources, so factorizing observed effects can be necessary before calling anything an action. These are valuable for unlabeled video scale, but they are not substitutes for typed action logs when the system can expose actions, control inputs, interventions, status, timing, and outcomes.

RAEv2 extends the latent-space selection evidence into navigation-video rollouts: retaining local spatial information through multi-layer encoder aggregation can reduce flicker and improve FVD under action-conditioned autoregressive rollout. The caveat is that this is still rollout fidelity, not proof that the model can rank or optimize candidate action sequences.

NextLat adds the sequence-modeling counterpart: even when next-token legality is perfect, hidden states can fail to encode a coherent internal map, so world-model evidence should inspect latent structure, rollout consistency, and planning utility. Its belief-state theorem is valuable state-side evidence, but the paper does not yet model external candidate actions or intervention consequences.

SkyJEPA contributes the real-time physical-control version of this latent-dynamics line. It is not a broad visual foundation world model, but it tightens the action-conditioned interface: the model receives state and motor-force histories, MPPI proposes candidate future control sequences, and a structured prober converts latent rollouts back into metric state trajectories that a controller can cost. Its strongest transfer value is the separation between latent rollout, structured state readout, real-time planner latency, and closed-loop outdoor transfer.
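
The planner side of that interface can be sketched with a simplified MPPI loop. Everything model-specific here is an assumption: a toy 1-D double integrator stands in for the latent rollout plus structured readout, the cost is a plain squared position error, and the control-cost term of the full MPPI formulation is omitted. It is not SkyJEPA's controller.

```python
import math
import random


def rollout(state, controls, dt=0.1):
    """Toy 1-D double integrator standing in for latent rollout plus state readout."""
    pos, vel = state
    trajectory = []
    for u in controls:
        vel += u * dt
        pos += vel * dt
        trajectory.append((pos, vel))
    return trajectory


def cost(trajectory, goal):
    """Sum of squared position errors along the decoded trajectory."""
    return sum((pos - goal) ** 2 for pos, _ in trajectory)


def mppi_update(state, nominal, goal, rng, n_samples=256, sigma=1.0, temperature=1.0):
    """One simplified MPPI iteration: perturb, roll out, cost, reweight perturbations."""
    noises, costs = [], []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, sigma) for _ in nominal]
        candidate = [u + e for u, e in zip(nominal, eps)]
        noises.append(eps)
        costs.append(cost(rollout(state, candidate), goal))
    best = min(costs)  # subtract the minimum cost for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    # New nominal control = old nominal + cost-weighted average perturbation.
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises)) / total
        for t, u in enumerate(nominal)
    ]


rng = random.Random(0)
state, goal, horizon = (0.0, 0.0), 1.0, 20
plan = [0.0] * horizon
zero_cost = cost(rollout(state, plan), goal)
for _ in range(10):
    plan = mppi_update(state, plan, goal, rng)
plan_cost = cost(rollout(state, plan), goal)
print(f"cost with zero control={zero_cost:.2f}, after MPPI={plan_cost:.2f}")
```

The planner only ever touches `rollout` and `cost`, so swapping the toy dynamics for a learned latent model and a readout changes nothing in the loop. That separation is also where planner latency can be measured independently of model accuracy.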

Looped World Models extends the latent-dynamics line in a different direction: instead of choosing a better static latent encoder, it changes the transition compute by looping a shared Transformer block and delaying decoding during multi-step action rollout. This is closer to a simulator-systems claim than a pure representation claim, so it should be validated by both prediction/rollout accuracy and planner-transfer robustness.

LeJEPA Identifiability makes the world-model label more precise. It supports the claim that a representation can recover useful latent state coordinates, but only under explicit assumptions about the data-generating process, embedding distribution, and alignment optimum. Sensorimotor World Models adds the complementary action-grounded view: an inverse dynamics objective can make latent transitions action-informative without imposing a Gaussian embedding distribution, but it may intentionally ignore variables outside the action repertoire. These should therefore be used as evidence about state-representation pressures, not as evidence that exploration coverage, uncertainty, or arbitrary candidate-action rollout are solved.

stable-worldmodel adds the evaluation-infrastructure side of the same boundary. It does not propose a new world-model objective, but it makes action-conditioned evaluation more auditable by sharing trajectory storage, baseline implementations, solvers, and controllable factors of variation. Its Push-T analyses strengthen the hygiene warning that low prediction error and in-distribution success do not prove robust planning under distribution shift.

AdaJEPA adds the online adaptation variant of this same boundary. It does not replace offline data coverage or solve safety by itself, but it makes deployment-time model revision a first-class protocol axis: the wiki should track which parameters are adapted, what buffer is used, how much latency is added, whether the model resets between episodes, and whether prediction-error reduction actually improves action ranking.

The robot-learning survey broadens the same warning into a policy-centric map. It treats explicit video rollout, latent prediction, and symbolic/planner-facing state as alternative world-model interfaces, but requires each interface to preserve action consequences and downstream policy utility. This is the bridge between robotics-specific world models and the wiki’s broader time-series agenda: prediction quality is useful only insofar as it preserves the variables needed to compare candidate actions, control inputs, or interventions.

On Training in Imagination adds the reward side of that boundary. A learned simulator used for policy optimization is at least two learned objects: a dynamics model and a reward model. Their errors, annotation costs, scaling exponents, and noise/bias profiles should be reported separately before turning imagined-rollout success into a data-collection rule.
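
The difference between noise and bias profiles can be shown with a small simulation. It is a sketch under invented numbers, not the paper's analysis: two hypothetical actions with known true rewards, Gaussian label noise, and a constant bias applied to one action.

```python
import random
import statistics

rng = random.Random(0)
true_reward = {"A": 1.0, "B": 0.8}   # hypothetical ground truth: A is better


def label(action, noise_sd=0.0, bias=None):
    """Return one reward label with zero-mean noise and an optional per-action bias."""
    offset = (bias or {}).get(action, 0.0)
    return true_reward[action] + offset + rng.gauss(0.0, noise_sd)


def mean_label(action, n, **kwargs):
    """Average n labels for one action."""
    return statistics.fmean([label(action, **kwargs) for _ in range(n)])


n = 2000
noisy = {a: mean_label(a, n, noise_sd=0.5) for a in true_reward}
biased = {a: mean_label(a, n, noise_sd=0.5, bias={"B": 0.5}) for a in true_reward}

noisy_ranking_correct = noisy["A"] > noisy["B"]
biased_ranking_correct = biased["A"] > biased["B"]
```

With zero-mean noise, averaging more labels recovers the correct ranking. With a systematic bias toward one action, more labels only make the wrong ranking more confident, which is why the two error profiles call for different data-collection responses.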

Awesome Agentic Time Series adds the time-series-agent version of this map. Its temporal world-model layer is useful because it puts simulation, interventions, and counterfactuals inside the agent stack, but it also sharpens the boundary: a planning/action agent can route tools or optimize a pipeline without possessing a learned transition model, and a memory agent can retrieve past cases without maintaining an action-conditioned latent state.

Agentic Automata Learning adds a controlled negative result for that same boundary. Its hidden DFA task gives the agent exact feedback and a fully checkable symbolic world, yet LLM agents still accumulate non-informative queries and fail to convert interaction history into stable hypotheses as state complexity grows. For this page, that is benchmark-hygiene evidence: do not equate tool use, long context, or interaction with a reliable world model unless the agent’s query policy, evidence reuse, and inferred transition structure are measured.
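
One way to make query policy measurable is to log what each query adds. The sketch below assumes a tiny hidden DFA and treats an exactly repeated membership query as non-informative. That proxy, the `MembershipOracle` class, and the example automaton are illustrative assumptions, not the benchmark's actual metrics or tooling.

```python
class MembershipOracle:
    """Exact membership oracle for a hidden DFA, with a count of repeated queries."""

    def __init__(self, transitions, start, accepting):
        self.transitions = transitions   # (state, symbol) -> next state
        self.start = start
        self.accepting = accepting
        self.seen = {}                   # word -> cached answer
        self.repeated = 0

    def member(self, word):
        if word in self.seen:
            self.repeated += 1           # answer already known: no new evidence
            return self.seen[word]
        state = self.start
        for symbol in word:
            state = self.transitions[(state, symbol)]
        self.seen[word] = state in self.accepting
        return self.seen[word]


# Hypothetical target: strings over {a, b} containing an even number of 'a'.
oracle = MembershipOracle(
    transitions={("even", "a"): "odd", ("even", "b"): "even",
                 ("odd", "a"): "even", ("odd", "b"): "odd"},
    start="even",
    accepting={"even"},
)
answers = [oracle.member(w) for w in ["", "a", "ab", "aa", "a"]]
informative_fraction = len(oracle.seen) / len(answers)
```

Here four of the five queries add evidence. Tracking that fraction as hidden-state count grows is a cheap first measurement before asking whether the agent's inferred transition structure is actually correct.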

Relation To Foundation TSFM Agenda

This page is the general world-model counterpart to the Foundation Time-Series Model Research Agenda. It provides the problem framing for why state, plausible futures, hierarchy, and action consequences matter. It does not by itself close the TSFM-specific slots for high-dimensional numeric observations, irregular event streams, context schemas, dense numeric generation, or digital-world intervention logs.

Open Questions

How should long-horizon planning be layered on top of compact latent predictors?
When should a world model regularize local latent trajectory curvature, and how should that geometry change for irregular time gaps, asymmetric dynamics, discontinuous events, or irreversible interventions?
Can looped latent depth and deferred decoding reduce world-model rollout cost without increasing simulator exploitation or hidden-state drift?
Are semantic latents sufficient for control tasks that require precise geometry?
What observability benchmark would join metrics, traces, logs, topology, alerts, and operator actions strongly enough to test intervention-aware world models?
Can L2RPN/Grid2Op be converted from a challenge-agent benchmark into a TSFM-native benchmark that tests state + candidate action + contingency scenario -> future grid trajectory, separates preventive and corrective robustness, and reports simulator-query budget plus survival versus continuous-optimization tradeoffs?
How should passive forecasting scores be combined with counterfactual or action-conditioned evaluation when both matter operationally?
Can inverse-dynamics regularization identify controllable state in digital or numeric systems where actions are sparse, delayed, confounded, or only weakly visible?
How should semantic embedding streams be connected to candidate-action rollout without turning every internal state into natural language?
Can language-state world models pair readable high-level consequences with dense geometric or numeric latent state, calibrated uncertainty, and executable control inputs?
Which data-collection policies make learned world-model state identifiable without collapsing onto policy-biased, non-Gaussian trajectory marginals?
Can next-hidden-state supervision become an action-conditioned belief-state objective when the transition input is an explicit action, control input, intervention, or event rather than the next token?
How should memory-augmented policy models like RATE be paired with learned dynamics so they can evaluate candidate interventions rather than only choose actions from logged trajectories?
Can fixed-FLOPs hierarchy preserve rare events and intervention effects better than uniform token processing at the same serving cost?
Can explicit energy minimization over candidate futures or interventions avoid mode averaging while remaining cheap enough for operational control loops?
Which predictive family can represent calibrated separated modes in action-conditioned latent rollouts without sacrificing real-time planning cost?
What system embodiment descriptor is sufficient for transferring operational world models across different service graphs, telemetry schemas, and deployment stacks?
Can layer-aggregation choices in pretrained encoders change world-model planning quality, not only visual rollout fidelity?
Can CWM-style execution-trace training transfer from deterministic code environments to noisy telemetry systems with delayed, partially observed intervention effects?
Can a BDH-like sparse fast state become useful for action-conditioned rollout once typed actions and control inputs are first-class channels?
Which digital-world abstractions transfer best to operations: DOM-like state, executable code, typed action logs, replayable tests, or explicit error branches?
Which modern uncertainty and evaluation protocols prevent controllers from exploiting a learned simulator instead of learning policies that transfer?
Can closed-loop test-time adaptation improve action-conditioned world models without unsafe model drift, simulator exploitation, or overfitting to the latest transition?
When should a world model infer latent actions from observation-only trajectories, and when should it require typed actions, control inputs, intervention status, timing, and outcomes?
How should dynamics-transition data and reward-annotation data be budgeted when reward labels are expensive, noisy, delayed, or biased?
