Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning
Source
- Raw Markdown: paper_agentic-automata-learning-2026.md
- PDF: paper_agentic-automata-learning-2026.pdf
- Preprint: https://arxiv.org/abs/2606.16576
- Official project page: https://reefmenaged.github.io/Agentic_Automata_Learning/
- Official code: https://github.com/reefmenaged/Agentic_Automata_Learning
- Official data: https://drive.google.com/drive/folders/1ypJSzI5NuwaIC-ApgSrlkpZYDxhq5t_c?usp=sharing
- Live demo: https://agentic-automata-learning.onrender.com/
- Official X announcement: https://x.com/ReefMenaged/status/2066869903420698945
- Alex-provided X pointer: https://x.com/i/status/2068921421380874666
- Local social/artifact snapshots:
  - papers/agentic-automata-learning-2026/x_post_rohanpaul_ai_2068921421380874666.json
  - papers/agentic-automata-learning-2026/x_post_reefmenaged_2066869903420698945.json
  - papers/agentic-automata-learning-2026/github_repo_reefmenaged_agentic_automata_learning.json
  - papers/agentic-automata-learning-2026/social_and_project_notes.md
Credibility
This is a June 15, 2026 arXiv v1 preprint, marked “Preprint. Under review.” in the paper PDF. It is less than one year old as of 2026-06-23. The author affiliations are Hebrew University of Jerusalem, New York University, and Google Research, and the project ships a public code repository, project page, data link, and live demo. Treat it as a credible current preprint and controlled benchmark proposal, but not yet as peer-reviewed evidence.
Core Claim
The paper proposes agentic automata learning as a controlled way to test whether tool-calling LLM agents can infer a hidden world model through interaction. The hidden environment is a deterministic finite automaton (DFA). The agent interacts with an oracle through membership queries and equivalence queries, then must reconstruct the target language.
The main result is negative but useful: current LLM agents can sometimes perform non-trivial interactive discovery, especially when they have reasoning mechanisms, but they remain far less robust and query-efficient than classical active automata-learning algorithms such as L* and TTT.
Benchmark Setup
The interaction loop is deliberately simple and exact:
    flowchart LR
      Hidden["hidden DFA / latent environment"]
      Agent["tool-calling LLM agent"]
      MQ["membership query: is word w accepted?"]
      EQ["equivalence query: is hypothesis DFA correct?"]
      Obs["oracle answer / counterexample"]
      Hyp["final hypothesis DFA"]
      Hidden --> Obs
      Agent --> MQ --> Obs --> Agent
      Agent --> EQ --> Obs --> Agent
      Agent --> Hyp
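The loop above can be made concrete with a small oracle. The following is a minimal sketch, not the paper's implementation: the `DFA` and `Oracle` classes, their method names, and the "even number of a's" target are all illustrative assumptions. It shows a membership query as a plain acceptance check and an equivalence query as a breadth-first search over the product automaton that returns a shortest counterexample.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class DFA:
    """A complete DFA (hypothetical helper type, not from the paper's code)."""
    alphabet: tuple   # e.g. ("a", "b")
    start: int
    accepting: frozenset
    delta: dict       # maps (state, symbol) -> next state

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Answers membership and equivalence queries about a hidden target DFA."""

    def __init__(self, target: DFA):
        self._target = target
        self.membership_calls = 0
        self.equivalence_calls = 0

    def membership(self, word: str) -> bool:
        self.membership_calls += 1
        return self._target.accepts(word)

    def equivalence(self, hypothesis: DFA) -> Optional[str]:
        """Return None if the languages match, else a shortest counterexample."""
        self.equivalence_calls += 1
        start = (self._target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        while queue:
            (t, h), word = queue.popleft()
            # The two automata disagree on `word` if exactly one accepts it.
            if (t in self._target.accepting) != (h in hypothesis.accepting):
                return word
            for a in self._target.alphabet:
                nxt = (self._target.delta[(t, a)], hypothesis.delta[(h, a)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + a))
        return None


# Illustrative target: words over {a, b} with an even number of a's.
even_a = DFA(("a", "b"), 0, frozenset({0}),
             {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
# A deliberately wrong hypothesis that accepts every word.
accept_all = DFA(("a", "b"), 0, frozenset({0}), {(0, "a"): 0, (0, "b"): 0})

oracle = Oracle(even_a)
counterexample = oracle.equivalence(accept_all)
```

Because the search is breadth-first, the counterexample returned for `accept_all` is the shortest word on which it disagrees with the target. Whether the paper's oracle returns shortest counterexamples is not stated in this note.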
The task isolates three capabilities that matter for agentic world-model inference:
- query planning: choosing interactions that reveal new information about the hidden DFA;
- evidence integration: using accumulated observations instead of issuing redundant or contradicted queries;
- hypothesis construction: turning evidence into a stable candidate model of the environment.
This is not a numeric time-series benchmark. It is a symbolic, deterministic, oracle-based environment. Its value for this wiki is as a clean failure probe for interactive latent-structure discovery.
Evidence And Results
Key reported results from the paper:
- LLM-agent performance drops sharply as DFA size increases.
- For 8–9 state automata, no evaluated LLM model exceeds 25% success, while L* and TTT solve 100% of instances.
- On successful 8–9 state runs, Gemini 3.1 Pro, the strongest model in the paper, uses about 45.8% more tool calls than TTT.
- Reasoning models are much stronger than non-reasoning models. In the 4–5 state range, Gemini 3.1 Pro reaches 85% success and Gemini 3 Flash with thinking reaches 15%, while GPT-5.4 without reasoning, Gemini 3.1 Flash Lite, and Llama-3.3-70B reach 0% for automata with 4 or more states.
- The strongest model does not simply imitate L* or TTT: the paper reports no exact query-sequence matches, frequent overuse of equivalence queries, and non-monotonic hypothesis sequences.
- Non-informative queries increase over long interactions. After roughly 60 steps, even DeepSeek-V4-Pro issues non-informative queries about 20% of the time; classical algorithms maintain 0% by construction.
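To make the non-informative-query metric tangible, here is a minimal sketch under stated assumptions. The paper's exact definition is not reproduced in this note, so the rule used here is an assumption: a membership query is non-informative if its word already has a known label, and an equivalence query is non-informative if the submitted hypothesis contradicts a label that was already observed. The log format and the function name `non_informative_rate` are hypothetical.

```python
def non_informative_rate(log):
    """Fraction of queries whose outcome was already determined by past evidence.

    `log` is a list of (kind, payload, answer) tuples (assumed format):
      ("mq", word, accepted)            membership query and its boolean answer
      ("eq", hypothesis, cex_or_None)   hypothesis is a callable word -> bool
    """
    known = {}   # word -> observed label
    wasted = 0
    for kind, payload, answer in log:
        if kind == "mq":
            if payload in known:          # label was already observed
                wasted += 1
            known[payload] = answer
        elif kind == "eq":
            # Hypothesis is refuted by existing evidence, so the query is redundant.
            if any(payload(w) != label for w, label in known.items()):
                wasted += 1
            if answer is not None:
                # A counterexample reveals the true label of that word.
                known[answer] = not payload(answer)
        else:
            raise ValueError(f"unknown query kind: {kind!r}")
    return wasted / len(log) if log else 0.0


# Illustrative run against the language "even number of a's".
accept_everything = lambda word: True
example_log = [
    ("mq", "a", False),               # informative
    ("mq", "a", False),               # repeated word: non-informative
    ("eq", accept_everything, "a"),   # contradicts known label of "a": non-informative
    ("mq", "aa", True),               # informative
]
rate = non_informative_rate(example_log)
```

In this toy log two of the four queries are wasted. A classical learner such as L* avoids both patterns by construction, which is the contrast the bullet above draws.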
Comparison Scope
This paper should not be cited as evidence that neural or learned action-conditioned world models outperform LLM agents. It does not compare against Dreamer-style, JEPA-style, CWM/Genie-like, or telemetry-native learned dynamics models.
The direct baselines are classical symbolic active automata-learning algorithms, especially L* and TTT. Passive automata learners are used diagnostically to decide whether the LLM collected enough labeled evidence for a non-interactive learner to recover the hidden DFA.
The precise result is therefore narrower and sharper: on a clean hidden-DFA recovery task, LLM-only tool-calling agents are much less robust and query-efficient than explicit model-inference algorithms designed for the task. If L* and TTT are treated as symbolic world-model learners, then the paper supports explicit model-learning over prompt-loop inference. It does not prove that learned neural world models already beat LLM agents.
For this wiki, the fair architectural lesson is to compare at least four systems when designing digital-operations benchmarks:
| System | What the paper directly supports |
|---|---|
| LLM-only agent | Weak as a reliable world-model inference mechanism on this formal task. |
| Classical/symbolic model learner | Strong baseline when the environment has an exact symbolic structure. |
| Learned action-conditioned world model | Not evaluated here; remains the hypothesis for telemetry and digital operations. |
| Hybrid LLM + explicit model layer | Suggested by the failure mode, but not directly tested by this paper. |
Failure Analysis
The paper separates failures into two useful categories:
| Failure type | Operational meaning | Why it matters for this wiki |
|---|---|---|
| Planning failure | The agent did not collect enough evidence; passive automata learners cannot infer the DFA from the collected observations. | A tool-using LLM may interact a lot while still failing to ask the right state-revealing questions. |
| Reasoning failure | The collected evidence is sufficient for at least one passive learner, but the LLM fails to infer the correct DFA. | Long context and memory are not enough if the model cannot convert evidence into a stable latent state or transition model. |
This distinction is directly reusable for digital-world and operations benchmarks: separate failures caused by insufficient information gathering from failures caused by poor inference over already available state/action/observation history.
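The split can be expressed as a small classification routine. This is a sketch under assumptions, not the paper's evaluation code: `classify_run`, the `Outcome` enum, and the toy `unique_consistent` learner are hypothetical names, and the toy learner is only a stand-in for real passive automata learners, which this note does not enumerate.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    PLANNING_FAILURE = "planning_failure"    # evidence was insufficient
    REASONING_FAILURE = "reasoning_failure"  # evidence sufficed, inference failed


def classify_run(agent_succeeded, observations, passive_learners, is_correct):
    """Label a run using the planning-vs-reasoning split.

    observations: dict word -> bool collected by the agent.
    passive_learners: callables observations -> hypothesis or None.
    is_correct: callable hypothesis -> bool (equivalence with the hidden target).
    """
    if agent_succeeded:
        return Outcome.SUCCESS
    for learn in passive_learners:
        hypothesis = learn(observations)
        if hypothesis is not None and is_correct(hypothesis):
            # At least one passive learner recovers the target from this evidence.
            return Outcome.REASONING_FAILURE
    return Outcome.PLANNING_FAILURE


def unique_consistent(candidates):
    """Toy stand-in learner: succeeds only if exactly one candidate fits the data."""
    def learn(observations):
        fits = [c for c in candidates
                if all(c(w) == label for w, label in observations.items())]
        return fits[0] if len(fits) == 1 else None
    return learn


# Illustrative candidates; the hidden target is "even number of a's".
def even_a_language(word):
    return word.count("a") % 2 == 0

def all_words(word):
    return True

def no_words(word):
    return False

learner = unique_consistent([even_a_language, all_words, no_words])
is_target = lambda h: h is even_a_language

sparse = classify_run(False, {"": True}, [learner], is_target)
rich = classify_run(False, {"": True, "a": False}, [learner], is_target)
```

With only the empty word observed, two candidates remain consistent, so the failed run counts as a planning failure. Once the label of `"a"` is also known, the evidence pins down the target and the same failed run counts as a reasoning failure.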
Social And Artifact Context
Alex provided a Rohan Paul X post summarizing the paper as evidence that LLM agents struggle to turn accumulating interaction evidence into a reliable internal world model. The official author announcement by Reef Menaged links the project page, live demo, and paper. Both X snapshots were preserved under papers/agentic-automata-learning-2026/ for provenance.
The official project page and code repository make the benchmark more operational than an abstract proposal: the live demo lets a user choose a target automaton and watch the agentic automata learning loop, and the repository exposes the experiment runner and model/provider interfaces.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Benchmarks: what level of modeling is tested? | warning | The benchmark cleanly separates task success from query planning, evidence integration, hypothesis construction, and non-informative tool use. | Not a time-series benchmark; no numeric telemetry, graph time series, irregular event streams, or operational action logs. |
| Streaming state, long context, and constant updates | adjacent | Agents receive growing interaction histories, and failures show that raw long context does not guarantee stable model inference. | No always-on numeric state update, memory-compression interface, or bounded-latency serving contract. |
| Causal structure, counterfactuals, and control | adjacent | Membership/equivalence queries are controlled interactions with exact oracle feedback and counterexamples. | Query actions are not real interventions whose consequences change a dynamical system; the hidden DFA is deterministic and exactly checkable. |
| Digital-world robot north star | warning | The source is a clean negative result for the idea that a general LLM agent can reliably infer hidden environment rules from interaction alone. | Need digital operations tasks with telemetry, logs, traces, service graph, typed interventions, delayed effects, and safety outcomes. |
Local Interpretation
For Alex’s research agenda, the paper strengthens the case for hybrid agents rather than LLM-only agents. If frontier tool-calling LLMs struggle to reconstruct a small hidden DFA from exact feedback, then an SRE or digital-operations agent should not rely on the LLM prompt loop alone to maintain a reliable dynamics model of production.
The useful architecture lesson is:
    LLM agent:
        propose questions, inspect evidence, explain hypotheses
    separate learned or symbolic world-model layer:
        maintain state, track constraints, score candidate futures, detect contradictions
    benchmark harness:
        measure query quality, evidence reuse, non-informative actions, and final model accuracy

This does not mean DFAs are the target domain. It means the benchmark provides a clean diagnostic for a failure mode we should expect to be worse in real systems: the agent collects observations, but its internal state does not compound into a reliable model.
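One way to picture the separate world-model layer is as an explicit evidence store that sits between the LLM and the oracle. The sketch below is an illustration only and was not evaluated in the paper: `EvidenceStore` and `gate_equivalence_query` are hypothetical names, and hypotheses are modeled as plain callables from word to bool.

```python
class EvidenceStore:
    """Explicit record of observed labels, kept outside the LLM context."""

    def __init__(self):
        self._labels = {}   # word -> bool

    def record(self, word, accepted):
        previous = self._labels.get(word)
        if previous is not None and previous != accepted:
            # A deterministic oracle can never give two answers for one word.
            raise ValueError(f"conflicting labels for {word!r}")
        self._labels[word] = accepted

    def is_known(self, word):
        return word in self._labels

    def contradictions(self, hypothesis):
        """Words whose observed label disagrees with the hypothesis."""
        return sorted(w for w, label in self._labels.items()
                      if hypothesis(w) != label)


def gate_equivalence_query(store, hypothesis):
    """Allow an equivalence query only if the hypothesis fits all stored evidence."""
    violated = store.contradictions(hypothesis)
    return (not violated, violated)


# Illustrative use with evidence about the language "even number of a's".
store = EvidenceStore()
store.record("", True)
store.record("a", False)

blocked = gate_equivalence_query(store, lambda w: True)
allowed = gate_equivalence_query(store, lambda w: w.count("a") % 2 == 0)
```

The design choice here is that the LLM still proposes hypotheses, while a deterministic component rejects any proposal that existing observations already refute and can report which words refute it.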
Links Into The Wiki
- Agentic Automata Learning
- World Models
- Digital World Models
- Foundation Time-Series Model Research Agenda
- LLM Agents Need Action-Conditioned World Models
- Time-Series Benchmark Hygiene
- Terminology
Open Questions
- Can the same planning-vs-reasoning failure split be used in an observability or Grid2Op-style benchmark with numeric trajectories and typed actions?
- What memory or state interface would reduce non-informative queries without merely hard-coding L* or TTT?
- Should digital-world agents expose a separate contradiction checker that rejects hypotheses already contradicted by observations?
- Can a hybrid LLM plus symbolic or learned world-model layer match classical automata learners on this benchmark while remaining useful beyond DFAs?
- How quickly do non-informative actions grow when the hidden environment is stochastic, delayed, partially observed, or action-conditioned in the stronger control sense?