Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Source
- Raw Markdown: paper_agentic-world-modeling-2026.md
- PDF: paper_agentic-world-modeling-2026.pdf
- Preprint: https://arxiv.org/abs/2604.22748
- Official code and bibliography: https://github.com/matrix-agent/awesome-agentic-world-modeling
- Gonzo ML discussion: https://t.me/gonzo_ML/5294
- ArxivIQ review: https://arxiviq.substack.com/p/agentic-world-modeling-foundations
Credibility
This is an April 2026 arXiv survey by authors from HKUST, NUS, Oxford, NTU, CUHK, HKU, University of Washington, SUTD, SMU, and related groups. It is less than one year old as of 2026-05-24 and is credible as a broad field map because it synthesizes more than 400 works, summarizes more than 100 representative systems, and ships an official taxonomy-aligned bibliography. It is not peer-reviewed yet and should be treated as a survey and taxonomy source, not as primary empirical evidence for every listed system.
Core Claim
World models for agents should be organized by capability level and governing-law regime rather than by data modality alone. The paper proposes a levels × laws taxonomy: L1 predictors learn local one-step transitions, L2 simulators perform coherent multi-step action-conditioned rollout under domain constraints, and L3 evolvers revise their own models when evidence contradicts predictions. These levels are crossed with physical, digital, social, and scientific law regimes.
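As a reading aid, the grid can be written down as a small data structure. This is a minimal sketch: the class names `Level`, `LawRegime`, and `TaxonomyCell` are hypothetical labels chosen for this note, not identifiers from the paper or its repository.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """Capability levels named in the survey."""
    L1_PREDICTOR = "local one-step transition prediction"
    L2_SIMULATOR = "coherent multi-step action-conditioned rollout under constraints"
    L3_EVOLVER = "revises its own model when evidence contradicts predictions"


class LawRegime(Enum):
    """Governing-law regimes named in the survey."""
    PHYSICAL = "physical"
    DIGITAL = "digital"
    SOCIAL = "social"
    SCIENTIFIC = "scientific"


@dataclass(frozen=True)
class TaxonomyCell:
    """One cell of the grid (hypothetical record type for this note)."""
    level: Level
    regime: LawRegime


def all_cells() -> list[TaxonomyCell]:
    """Enumerate the full grid: 3 levels crossed with 4 regimes."""
    return [TaxonomyCell(lv, rg) for lv in Level for rg in LawRegime]


# Placement taken from the digital L2 table further down this note.
webdreamer = TaxonomyCell(Level.L2_SIMULATOR, LawRegime.DIGITAL)
```

Keeping level and regime as separate fields mirrors the claim that the two axes are independent: a system is placed by what it can do and by which laws constrain it, not by its input modality.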
Key Contributions
- Defines three capability levels: L1 Predictor, L2 Simulator, and L3 Evolver.
- Defines four governing-law regimes: physical, digital, social, and scientific worlds.
- Uses the taxonomy to organize more than 400 works and more than 100 representative systems.
- Separates decision-usable simulation from visually plausible generation.
- Proposes decision-centric evaluation principles and a Minimal Reproducible Evaluation Package (MREP).
- Treats digital world models as simulators of software-defined environments: code, APIs, web pages, GUI state, permissions, file systems, and game state.
Method Notes
The central interface for L2 world models takes the current state z_0, a candidate action sequence a_{1:H}, and the constraint set c imposed by the governing-law regime, and returns a multi-step rollout. For digital worlds, those constraints are program semantics: DOM and UI state machines, API contracts, type constraints, file-system logic, permission checks, error codes, and network protocols.
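One way to write an interface with these symbols is sketched below. The rollout symbol $\hat{z}_{1:H}$, the function names $F_\theta$ and $f_\theta$, and the step-by-step factorization are assumptions of this note, not notation quoted from the paper.

```latex
\hat{z}_{1:H} = F_\theta\left(z_0,\; a_{1:H};\; c\right),
\qquad
\hat{z}_t = f_\theta\left(\hat{z}_{t-1},\; a_t;\; c\right),
\quad t = 1, \dots, H,
\quad \hat{z}_0 = z_0 .
```

The first form treats the rollout as a single call; the second is the autoregressive reading, in which each predicted state feeds the next step and errors can compound over the horizon H.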
flowchart LR
    State["digital state: DOM / files / GUI / program state"]
    Action["action: click / API call / command / edit"]
    Laws["digital laws: API contracts, permissions, state machines"]
    Model["digital world model"]
    Next["predicted next state / rollout"]
    Check["execution or verifier"]
    State --> Model
    Action --> Model
    Laws --> Model
    Model --> Next --> Check
    Laws --> Check
For this wiki, the useful transfer is the digital-world data contract rather than the label itself:
structured digital state + typed action + executable constraints
-> predicted next state / rollout / outcome

That is close to CWM for code and computational environments, but broader because it includes web, GUI, game, desktop, and software-tool environments.
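The contract above can be made concrete with a toy example. The sketch below is a hand-written, rule-based stand-in, not a learned model and not code from the paper; the names `State`, `Action`, `Outcome`, `step`, `rollout`, and the two constraint functions are hypothetical and exist only to show the shape of the interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class State:
    """Structured digital state: a toy file system plus the acting user."""
    files: frozenset
    user: str


@dataclass(frozen=True)
class Action:
    """Typed action: an operation name ("create" or "delete") and a target path."""
    op: str
    path: str


@dataclass(frozen=True)
class Outcome:
    """Predicted next state plus an error code (None on success)."""
    next_state: State
    error: Optional[str]


# An executable constraint returns an error code if the action is illegal, else None.
Constraint = Callable[[State, Action], Optional[str]]


def admin_only_delete(state: State, action: Action) -> Optional[str]:
    if action.op == "delete" and state.user != "admin":
        return "EPERM"
    return None


def delete_requires_existing(state: State, action: Action) -> Optional[str]:
    if action.op == "delete" and action.path not in state.files:
        return "ENOENT"
    return None


def step(state: State, action: Action, constraints: Sequence[Constraint]) -> Outcome:
    """Check every constraint, then apply the action if none objects."""
    for check in constraints:
        error = check(state, action)
        if error is not None:
            return Outcome(state, error)  # rejected actions leave the state unchanged
    if action.op == "create":
        return Outcome(State(state.files | {action.path}, state.user), None)
    if action.op == "delete":
        return Outcome(State(state.files - {action.path}, state.user), None)
    return Outcome(state, "EINVAL")  # unknown operation


def rollout(state: State, actions: Sequence[Action],
            constraints: Sequence[Constraint]) -> list:
    """Multi-step rollout: feed each predicted state into the next step."""
    outcomes = []
    for action in actions:
        outcome = step(state, action, constraints)
        outcomes.append(outcome)
        state = outcome.next_state
    return outcomes
```

For example, a `guest` user who creates `b.txt` and then tries to delete `a.txt` gets a successful first step and an `EPERM` on the second, with the file set unchanged by the rejected delete. A learned model would replace the body of `step`; the point of the sketch is that constraints stay executable and separate from the predictor.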
Evidence And Results
- The paper’s abstract reports a synthesis of more than 400 works and more than 100 representative systems across model-based RL, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery.
- The L2 boundary conditions are long-horizon coherence, intervention sensitivity, and constraint consistency.
- In the digital-world section, the paper frames software transitions as mostly deterministic but highly branching through error codes, permission checks, popups, timeouts, API failures, type constraints, and edge cases.
- The representative digital L2 systems include WebDreamer, GameCraft, MobileDreamer, Word2World, Code2World, gWorld, WebWorld, and RWML, but the paper’s table is a compact survey map rather than a matched benchmark.
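The three L2 boundary conditions listed above can be turned into simple probes. The sketch below is an illustration under strong assumptions: states are integers (the page index of a four-page wizard), the reference environment is known exactly, and all function names are hypothetical. It is not the paper's evaluation protocol.

```python
from typing import Callable, Optional, Sequence

Step = Callable[[int, str], int]  # (state, action) -> next state


def true_step(page: int, action: str) -> int:
    """Reference environment: a wizard with pages 0..3."""
    if action == "next":
        return min(page + 1, 3)
    if action == "back":
        return max(page - 1, 0)
    return page


def drifting_model(page: int, action: str) -> int:
    """Deliberately flawed toy model: it ignores the upper page bound."""
    if action == "next":
        return page + 1
    if action == "back":
        return max(page - 1, 0)
    return page


def rollout(step: Step, z0: int, actions: Sequence[str]) -> list:
    """Apply the step function along the action sequence, collecting states."""
    states, z = [], z0
    for a in actions:
        z = step(z, a)
        states.append(z)
    return states


def first_divergence(model: Step, env: Step, z0: int,
                     actions: Sequence[str]) -> Optional[int]:
    """Long-horizon coherence probe: 1-based step of the first disagreement, or None."""
    pairs = zip(rollout(model, z0, actions), rollout(env, z0, actions))
    for t, (predicted, actual) in enumerate(pairs, start=1):
        if predicted != actual:
            return t
    return None


def is_intervention_sensitive(model: Step, z0: int, actions_a: Sequence[str],
                              actions_b: Sequence[str]) -> bool:
    """Crude probe: do two different action sequences yield different rollouts?"""
    return rollout(model, z0, actions_a) != rollout(model, z0, actions_b)


def constraint_violations(model: Step, z0: int, actions: Sequence[str],
                          is_valid: Callable[[int], bool]) -> list:
    """Constraint-consistency probe: 1-based steps whose predicted state is invalid."""
    return [t for t, z in enumerate(rollout(model, z0, actions), start=1)
            if not is_valid(z)]
```

With five consecutive `next` actions from page 0, the flawed model matches the environment for three steps and then drifts past the last page, so a one-step accuracy check would pass while the rollout fails both the coherence and the constraint probes. That gap is the practical difference between an L1 predictor and an L2 simulator.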
Digital World Models
The paper’s digital branch should not be read as “video models for screens.” It is about simulators for environments whose state transitions are governed by formal software rules. A good digital world model predicts what happens after an action in a web page, app, codebase, game, shell, API, or GUI, while respecting executable constraints.
Representative systems from the paper’s digital L2 table:
| System | What it models in this survey | Local wiki reading |
|---|---|---|
| WebDreamer | Web state simulation for model-based web-agent planning. | Early web-agent world-model anchor; LLM-based state prediction needs transition-focused training. |
| GameCraft | Interactive game video generation. | Useful for action-conditioned game rollouts, but still partly a visual/video simulator. |
| MobileDreamer | GUI sketch prediction for mobile agents. | Compresses GUI future prediction into task-relevant sketches rather than raw screenshots. |
| Word2World | Text-based world modeling with LLMs. | Tests whether text-only state/action traces can support implicit world models. |
| Code2World | GUI world model via renderable code generation. | Treats code as the predicted state: execute the code to render the next GUI. |
| gWorld | Generative visual code mobile world model. | Similar renderable-code direction for mobile GUI state prediction. |
| WebWorld | Large-scale open-web simulator. | Trains a web simulator for long web-agent rollouts. |
| RWML | Reinforcement world model learning for LLM-based agents. | Couples world-model learning with RL-style agent improvement; needs separate ingest before treating details as settled. |
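The renderable-code idea in the Code2World and gWorld rows can be illustrated with a toy. In the sketch below, a stub "model" emits HTML for the next screen and a parser from the Python standard library stands in for rendering. `predict_next_gui` is a hypothetical hard-coded stub, and none of this reproduces either system's actual pipeline.

```python
from html.parser import HTMLParser


class ButtonCollector(HTMLParser):
    """Collects the text labels of <button> elements, a stand-in for rendering."""

    def __init__(self) -> None:
        super().__init__()
        self._in_button = False
        self.labels: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "button":
            self._in_button = True
            self.labels.append("")

    def handle_endtag(self, tag):
        if tag == "button":
            self._in_button = False

    def handle_data(self, data):
        # Only text that appears inside a <button> contributes to a label.
        if self._in_button:
            self.labels[-1] += data.strip()


def render_buttons(html: str) -> list:
    """'Render' predicted code into a structure an agent can act on."""
    parser = ButtonCollector()
    parser.feed(html)
    parser.close()
    return parser.labels


def predict_next_gui(current_html: str, action: str) -> str:
    """Hypothetical stub for a learned model that emits code for the next screen."""
    if action == "click:Sign in":
        return ("<form><input name='user'>"
                "<button>Submit</button><button>Cancel</button></form>")
    return current_html  # unknown actions leave the screen unchanged
```

Calling `render_buttons(predict_next_gui("<button>Sign in</button>", "click:Sign in"))` yields the two buttons of the predicted form. Because the predicted state is code, it can be checked structurally (does it parse, which elements exist) instead of being compared pixel by pixel.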
Limitations
- This is a survey and taxonomy, not a single controlled empirical benchmark.
- Some listed systems are very recent arXiv sources and are not independently ingested in this KB yet.
- The digital-world category covers code, web, GUI, games, and software-tool environments, but it does not automatically close the Foundation TSFM agenda’s numeric telemetry, graph time-series, event-stream, intervention-log, or observability-control requirements.
- Digital systems may be deterministic in principle, but real production systems are often partially observed, asynchronous, multi-tenant, delayed, and affected by hidden external actors.
- The taxonomy can make weak systems look comparable if readers ignore the L1/L2/L3 boundary conditions and the difference between visually plausible rollout generation and decision-usable simulation.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Causal structure, counterfactuals, and control | adjacent | The L2 simulator definition explicitly requires action-conditioned multi-step rollout and intervention sensitivity. | Survey-level taxonomy; no telemetry-native benchmark with logged operator actions and outcomes. |
| Context interface | adjacent | Digital-world laws make environment contracts explicit: APIs, UI state machines, permissions, file systems, type constraints, and error branches. | Need service graph, telemetry schema, intervention catalog, deployment context, and human-approval semantics for operations. |
| Benchmarks: what level of modeling is tested | warning | Separates L1 one-step prediction from L2 decision-usable rollout and L3 evidence-driven revision. | Need matched benchmark protocols for digital operations, not only web/GUI/code agent tasks. |
| Streaming state, long context, and constant updates | insufficient evidence | The digital branch names persistent software state and long GUI/web workflows. | No always-on streaming latent-state update contract for numeric operational time series. |
Links Into The Wiki
- Digital World Models
- World Models
- Foundation Time-Series Model Research Agenda
- Observability Time Series
- Code World Model
- LLM Agents Need Action-Conditioned World Models
- Terminology
Open Questions
- Which digital-world tasks actually require L2 simulation rather than strong L1 next-state prediction plus replanning?
- What is the right equivalent of DOM or file-system state for observability: service graph state, metric state, trace state, event stream, deployment state, or action log?
- How should digital world models handle hidden concurrent users, background jobs, external APIs, eventual consistency, and delayed effects?
- Can renderable-code world models transfer from GUI prediction to operational simulations where the state is mostly numeric and graph-structured?
- What MREP-style package would prove that a digital-operations world model improves action ranking, safe abstention, and incident recovery rather than only forecast accuracy?
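The last question turns on the difference between forecast accuracy and decision quality, which a toy calculation makes visible. The numbers, action names, and function names below are invented for illustration and are not results from the paper or from any system; the sketch also does not claim to describe what MREP contains.

```python
def mean_abs_error(pred: dict, truth: dict) -> float:
    """Forecast-style metric: average absolute error over all candidate actions."""
    return sum(abs(pred[a] - truth[a]) for a in truth) / len(truth)


def best_action(costs: dict) -> str:
    """Decision-style readout: the action with the lowest predicted cost."""
    return min(costs, key=costs.get)


def picks_true_best(pred: dict, truth: dict) -> bool:
    """Does acting on the model's ranking select the truly best action?"""
    return best_action(pred) == best_action(truth)


# Hypothetical minutes-to-recover for three candidate operator actions.
truth = {"restart": 10.0, "rollback": 9.0, "scale_up": 30.0}

# Small forecast error, but the two best actions are swapped.
model_a = {"restart": 9.0, "rollback": 10.0, "scale_up": 30.0}

# Larger forecast error, but the ordering of actions is preserved.
model_b = {"restart": 16.0, "rollback": 13.0, "scale_up": 36.0}
```

Here `model_a` has the lower mean absolute error yet recommends `restart`, while `model_b` is less accurate on every action yet recommends the truly best `rollback`. An evaluation package aimed at decisions would therefore need to report ranking or selection metrics alongside forecast error.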