Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Source

Raw Markdown: paper_agentic-world-modeling-2026.md
PDF: paper_agentic-world-modeling-2026.pdf
Preprint: https://arxiv.org/abs/2604.22748
Official code and bibliography: https://github.com/matrix-agent/awesome-agentic-world-modeling
Gonzo ML discussion: https://t.me/gonzo_ML/5294
ArxivIQ review: https://arxiviq.substack.com/p/agentic-world-modeling-foundations

Credibility

This is an April 2026 arXiv survey by authors from HKUST, NUS, Oxford, NTU, CUHK, HKU, University of Washington, SUTD, SMU, and related groups. It is less than one year old as of 2026-05-24. It is credible as a broad field map because it synthesizes more than 400 works, summarizes more than 100 representative systems, and ships an official taxonomy-aligned bibliography. It has not yet been peer-reviewed, so treat it as a survey and taxonomy source, not as primary empirical evidence for every listed system.

Core Claim

World models for agents should be organized by capability level and governing-law regime rather than by data modality alone. The paper proposes a "levels × laws" taxonomy: L1 predictors learn local one-step transitions, L2 simulators perform coherent multi-step action-conditioned rollout under domain constraints, and L3 evolvers revise their own models when evidence contradicts predictions. These levels are crossed with physical, digital, social, and scientific law regimes.

Key Contributions

Defines three capability levels: L1 Predictor, L2 Simulator, and L3 Evolver.
Defines four governing-law regimes: physical, digital, social, and scientific worlds.
Uses the taxonomy to organize more than 400 works and more than 100 representative systems.
Separates decision-usable simulation from visually plausible generation.
Proposes decision-centric evaluation principles and a Minimal Reproducible Evaluation Package (MREP).
Treats digital world models as simulators of software-defined environments: code, APIs, web pages, GUI state, permissions, file systems, and game state.

Method Notes

The central interface for L2 world models is:

\hat{p}(\tau \mid z_0, a_{1:H}, c), \quad \tau = (z_1, \dots, z_H)

where z_0 is the current state, a_{1:H} is a candidate action sequence, and c is the constraint set imposed by the governing-law regime. For digital worlds, those constraints are program semantics: DOM and UI state machines, API contracts, type constraints, file-system logic, permission checks, error codes, and network protocols.
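
One way to read this interface is as a product of per-step predictions, which makes the link between L1 one-step transitions and L2 rollouts explicit. The first equality below is just the chain rule. The second line is an assumption of this note, not a factorization stated in the paper: it holds only if the state z_t is Markov and each step depends on the current action alone.

```latex
\hat{p}(\tau \mid z_0, a_{1:H}, c)
  = \prod_{t=1}^{H} \hat{p}(z_t \mid z_{0:t-1}, a_{1:H}, c)
  % chain rule over the trajectory, exact

  \approx \prod_{t=1}^{H} \hat{p}(z_t \mid z_{t-1}, a_t, c)
  % assumption: Markov state and causal, per-step actions
```

Under that assumption, an L2 rollout is an L1 predictor applied H times, so one-step errors can compound across the horizon. That is why long-horizon coherence is treated as a separate requirement rather than a consequence of good one-step accuracy.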

flowchart LR
  State["digital state: DOM / files / GUI / program state"]
  Action["action: click / API call / command / edit"]
  Laws["digital laws: API contracts, permissions, state machines"]
  Model["digital world model"]
  Next["predicted next state / rollout"]
  Check["execution or verifier"]

  State --> Model
  Action --> Model
  Laws --> Model
  Model --> Next --> Check
  Laws --> Check

For this wiki, the useful transfer is the digital-world data contract rather than the label itself:

structured digital state + typed action + executable constraints
  -> predicted next state / rollout / outcome
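
As a minimal sketch of that contract, the Python below uses a toy file system as the digital world. All names here (`DigitalState`, `Action`, `Outcome`, `predict`) are hypothetical and chosen for illustration; they do not come from the paper or from any listed system. The hand-written `predict` function stands in for a learned model so that the shape of the inputs and outputs is visible.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class DigitalState:
    """Structured digital state: file contents plus a permission set."""
    files: Dict[str, str]        # path -> content
    writable: FrozenSet[str]     # paths the agent is allowed to write


@dataclass(frozen=True)
class Action:
    """Typed action: the kind decides which fields are meaningful."""
    kind: str                    # "read" or "write"
    path: str
    content: str = ""


@dataclass(frozen=True)
class Outcome:
    """Predicted next state plus the branch that was taken."""
    state: DigitalState
    error: Optional[str] = None  # POSIX-style error name, or None on success
    output: str = ""


def predict(state: DigitalState, action: Action) -> Outcome:
    """Hand-written stand-in for a learned one-step digital world model."""
    if action.kind == "read":
        if action.path not in state.files:
            return Outcome(state, error="ENOENT")   # missing file branch
        return Outcome(state, output=state.files[action.path])
    if action.kind == "write":
        if action.path not in state.writable:
            return Outcome(state, error="EACCES")   # permission check branch
        new_files = {**state.files, action.path: action.content}
        return Outcome(DigitalState(new_files, state.writable))
    return Outcome(state, error="EINVAL")           # unknown action kind


if __name__ == "__main__":
    s0 = DigitalState({"a.txt": "hi"}, frozenset({"a.txt"}))
    print(predict(s0, Action("write", "b.txt", "x")).error)
```

The point of the sketch is that error branches are part of the predicted outcome, not exceptions outside the model: a denied write is a valid transition that leaves the state unchanged and reports why.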

That is close to the Code World Model (CWM) framing for code and computational environments, but broader because it includes web, GUI, game, desktop, and software-tool environments.

Evidence And Results

The paper’s abstract reports a synthesis of more than 400 works and more than 100 representative systems across model-based RL, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery.
The L2 boundary conditions are long-horizon coherence, intervention sensitivity, and constraint consistency.
In the digital-world section, the paper frames software transitions as mostly deterministic but highly branching through error codes, permission checks, popups, timeouts, API failures, type constraints, and edge cases.
The representative digital L2 systems include WebDreamer, GameCraft, MobileDreamer, Word2World, Code2World, gWorld, WebWorld, and RWML, but the paper’s table is a compact survey map rather than a matched benchmark.
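
The three L2 boundary conditions can be turned into rough, checkable properties. The sketch below is one possible operationalization, assumed for this wiki rather than taken from the paper: it compares a model's rollout against a reference executor and uses exact state equality, which real evaluations would likely replace with a task-specific distance. The function names and the toy counter environment are hypothetical.

```python
def rollout(step, z0, actions):
    """Apply a one-step function along an action sequence; return visited states."""
    states, z = [], z0
    for a in actions:
        z = step(z, a)
        states.append(z)
    return states


def coherence_horizon(model_step, true_step, z0, actions):
    """Long-horizon coherence: leading steps where model and executor agree."""
    predicted = rollout(model_step, z0, actions)
    actual = rollout(true_step, z0, actions)
    horizon = 0
    for p, t in zip(predicted, actual):
        if p != t:
            break
        horizon += 1
    return horizon


def is_intervention_sensitive(model_step, z0, actions_a, actions_b):
    """Intervention sensitivity: different action sequences change the rollout."""
    return rollout(model_step, z0, actions_a) != rollout(model_step, z0, actions_b)


def is_constraint_consistent(model_step, z0, actions, invariant):
    """Constraint consistency: every predicted state satisfies the invariant."""
    return all(invariant(z) for z in rollout(model_step, z0, actions))


# Toy environment: a counter capped at 3. The model ignores the cap.
def true_step(z, a):
    return min(z + a, 3)


def model_step(z, a):
    return z + a


if __name__ == "__main__":
    actions = [1, 1, 1, 1]
    print(coherence_horizon(model_step, true_step, 0, actions))
    print(is_intervention_sensitive(model_step, 0, [1, 1], [1, 0]))
    print(is_constraint_consistent(model_step, 0, actions, lambda z: z <= 3))
```

In the toy case the model agrees with the executor for three steps, responds to a changed action, and then violates the cap on the fourth step, so it passes intervention sensitivity while failing constraint consistency.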

Digital World Models

The paper’s digital branch should not be read as “video models for screens.” It is about simulators for environments whose state transitions are governed by formal software rules. A good digital world model predicts what happens after an action in a web page, app, codebase, game, shell, API, or GUI, while respecting executable constraints.

Representative systems from the paper’s digital L2 table:

System	What it models in this survey	Local wiki reading
WebDreamer	Web state simulation for model-based web-agent planning.	Early web-agent world-model anchor; LLM-based state prediction needs transition-focused training.
GameCraft	Interactive game video generation.	Useful for action-conditioned game rollouts, but still partly a visual/video simulator.
MobileDreamer	GUI sketch prediction for mobile agents.	Compresses GUI future prediction into task-relevant sketches rather than raw screenshots.
Word2World	Text-based world modeling with LLMs.	Tests whether text-only state/action traces can support implicit world models.
Code2World	GUI world model via renderable code generation.	Treats code as the predicted state: execute the code to render the next GUI.
gWorld	Generative visual code mobile world model.	Similar renderable-code direction for mobile GUI state prediction.
WebWorld	Large-scale open-web simulator.	Trains a web simulator for long web-agent rollouts.
RWML	Reinforcement world model learning for LLM-based agents.	Couples world-model learning with RL-style agent improvement; needs separate ingest before treating details as settled.
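
The renderable-code idea in the Code2World and gWorld rows can be illustrated with a small sketch: the model emits code for the next screen, and the next state is whatever that code renders to. Everything below is an assumption made for illustration and is not the API of either system. `predict_next_gui_code` is a hard-coded stand-in for a learned model, and "rendering" is reduced to parsing HTML with the standard library and listing the visible button labels.

```python
from html.parser import HTMLParser


class ButtonCollector(HTMLParser):
    """Collects the text of every <button> element in an HTML string."""

    def __init__(self):
        super().__init__()
        self._in_button = False
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self.buttons.append("")

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False

    def handle_data(self, data):
        if self._in_button:
            self.buttons[-1] += data.strip()


def predict_next_gui_code(current_html, action):
    """Hypothetical stand-in for a learned model that emits next-screen code."""
    if action == "click:Login":
        return "<div><button>Logout</button><button>Settings</button></div>"
    return current_html  # unknown actions leave the screen unchanged


def visible_buttons(html):
    """Toy 'renderer': parse the code and return the button labels it shows."""
    parser = ButtonCollector()
    parser.feed(html)
    parser.close()
    return parser.buttons


if __name__ == "__main__":
    start = "<div><button>Login</button></div>"
    print(visible_buttons(predict_next_gui_code(start, "click:Login")))
```

Because the prediction is code, it can be checked mechanically (does it parse, which elements does it expose) before an agent plans against it, which is harder to do with a predicted screenshot.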

Limitations

This is a survey and taxonomy, not a single controlled empirical benchmark.
Some listed systems are very recent arXiv sources and are not independently ingested in this KB yet.
The digital-world category covers code, web, GUI, games, and software-tool environments, but it does not automatically close the Foundation TSFM agenda’s numeric telemetry, graph time-series, event-stream, intervention-log, or observability-control requirements.
Digital systems may be deterministic in principle, but real production systems are often partially observed, asynchronous, multi-tenant, delayed, and affected by hidden external actors.
The taxonomy can make weak systems look comparable if readers ignore the L1/L2/L3 boundary conditions and the difference between pretty rollout generation and decision-usable simulation.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Causal structure, counterfactuals, and control	adjacent	The L2 simulator definition explicitly requires action-conditioned multi-step rollout and intervention sensitivity.	Survey-level taxonomy; no telemetry-native benchmark with logged operator actions and outcomes.
Context interface	adjacent	Digital-world laws make environment contracts explicit: APIs, UI state machines, permissions, file systems, type constraints, and error branches.	Need service graph, telemetry schema, intervention catalog, deployment context, and human-approval semantics for operations.
Benchmarks: what level of modeling is tested	warning	Separates L1 one-step prediction from L2 decision-usable rollout and L3 evidence-driven revision.	Need matched benchmark protocols for digital operations, not only web/GUI/code agent tasks.
Streaming state, long context, and constant updates	insufficient evidence	The digital branch names persistent software state and long GUI/web workflows.	No always-on streaming latent-state update contract for numeric operational time series.

Links Into The Wiki

Digital World Models
World Models
Foundation Time-Series Model Research Agenda
Observability Time Series
Awesome Agentic Time Series: the time-series-specific survey and list counterpart, covering closed-loop temporal agents, benchmark fragmentation, memory, temporal world models, and reliability.
Code World Model
LLM Agents Need Action-Conditioned World Models
Terminology

Open Questions

Which digital-world tasks actually require L2 simulation rather than strong L1 next-state prediction plus replanning?
What is the right equivalent of DOM or file-system state for observability: service graph state, metric state, trace state, event stream, deployment state, or action log?
How should digital world models handle hidden concurrent users, background jobs, external APIs, eventual consistency, and delayed effects?
Can renderable-code world models transfer from GUI prediction to operational simulations where the state is mostly numeric and graph-structured?
What MREP-style package would prove that a digital-operations world model improves action ranking, safe abstention, and incident recovery rather than only forecast accuracy?
