Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

Source

Raw Markdown: paper_agentic-automata-learning-2026.md
PDF: paper_agentic-automata-learning-2026.pdf
Preprint: https://arxiv.org/abs/2606.16576
Official project page: https://reefmenaged.github.io/Agentic_Automata_Learning/
Official code: https://github.com/reefmenaged/Agentic_Automata_Learning
Official data: https://drive.google.com/drive/folders/1ypJSzI5NuwaIC-ApgSrlkpZYDxhq5t_c?usp=sharing
Live demo: https://agentic-automata-learning.onrender.com/
Official X announcement: https://x.com/ReefMenaged/status/2066869903420698945
Alex-provided X pointer: https://x.com/i/status/2068921421380874666
Local social/artifact snapshots: papers/agentic-automata-learning-2026/x_post_rohanpaul_ai_2068921421380874666.json, papers/agentic-automata-learning-2026/x_post_reefmenaged_2066869903420698945.json, papers/agentic-automata-learning-2026/github_repo_reefmenaged_agentic_automata_learning.json, papers/agentic-automata-learning-2026/social_and_project_notes.md

Credibility

This is a June 15, 2026 arXiv v1 preprint, marked “Preprint. Under review.” in the paper PDF. It is less than one year old as of 2026-06-23. The author affiliations are Hebrew University of Jerusalem, New York University, and Google Research, and the project ships a public code repository, project page, data link, and live demo. Treat it as a credible current preprint and controlled benchmark proposal, but not yet as peer-reviewed evidence.

Core Claim

The paper proposes agentic automata learning as a controlled way to test whether tool-calling LLM agents can infer a hidden world model through interaction. The hidden environment is a deterministic finite automaton (DFA). The agent interacts with an oracle through membership queries and equivalence queries, then must reconstruct the target language.

The main result is negative but useful: current LLM agents can sometimes perform non-trivial interactive discovery, especially when they have reasoning mechanisms, but they remain far less robust and query-efficient than classical active automata-learning algorithms such as L* and TTT.

Benchmark Setup

The interaction loop is deliberately simple and exact:

flowchart LR
  Hidden["hidden DFA / latent environment"]
  Agent["tool-calling LLM agent"]
  MQ["membership query: is word w accepted?"]
  EQ["equivalence query: is hypothesis DFA correct?"]
  Obs["oracle answer / counterexample"]
  Hyp["final hypothesis DFA"]

  Hidden --> Obs
  Agent --> MQ --> Obs --> Agent
  Agent --> EQ --> Obs --> Agent
  Agent --> Hyp
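The loop in the flowchart can be sketched in a few lines of Python. This is a minimal illustration, not the paper's harness: the `DFA` and `Oracle` classes, their method names, and the example target (words with an even number of `a`) are all assumptions made here. The equivalence check assumes both automata are complete over the same alphabet and searches the product automaton breadth-first, so any counterexample it returns is a shortest one.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DFA:
    """Complete DFA: every (state, symbol) pair must appear in `delta`."""
    alphabet: tuple
    start: int
    accepting: frozenset
    delta: dict  # (state, symbol) -> next state

    def accepts(self, word: str) -> bool:
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


class Oracle:
    """Hypothetical oracle hiding a target DFA behind two tool calls."""

    def __init__(self, target: DFA):
        self._target = target
        self.log = []  # (kind, query, answer) triples, in call order

    def membership_query(self, word: str) -> bool:
        answer = self._target.accepts(word)
        self.log.append(("MQ", word, answer))
        return answer

    def equivalence_query(self, hypothesis: DFA) -> Optional[str]:
        """Return None if equivalent, else a shortest counterexample."""
        start = (self._target.start, hypothesis.start)
        seen = {start}
        queue = deque([(start, "")])
        answer = None
        while queue:
            (t, h), word = queue.popleft()
            # The two automata disagree on `word`: report it.
            if (t in self._target.accepting) != (h in hypothesis.accepting):
                answer = word
                break
            for symbol in self._target.alphabet:
                nxt = (self._target.delta[(t, symbol)],
                       hypothesis.delta[(h, symbol)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, word + symbol))
        self.log.append(("EQ", hypothesis, answer))
        return answer


# Illustrative target: words over {a, b} with an even number of 'a'.
target = DFA(("a", "b"), 0, frozenset({0}),
             {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})
# A deliberately wrong one-state hypothesis that accepts every word.
accept_all = DFA(("a", "b"), 0, frozenset({0}),
                 {(0, "a"): 0, (0, "b"): 0})
oracle = Oracle(target)
```

Every call is appended to `oracle.log`, so the same record can later be used to measure query quality and evidence reuse.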

The task isolates three capabilities that matter for agentic world-model inference:

query planning: choosing interactions that reveal new information about the hidden DFA;
evidence integration: using accumulated observations instead of issuing redundant or contradicted queries;
hypothesis construction: turning evidence into a stable candidate model of the environment.

This is not a numeric time-series benchmark. It is a symbolic, deterministic, oracle-based environment. Its value for this wiki is as a clean failure probe for interactive latent-structure discovery.

Evidence And Results

Key reported results from the paper:

LLM-agent performance drops sharply as DFA size increases.
For 8–9 state automata, no evaluated LLM exceeds 25% success, while L* and TTT solve 100% of instances.
On successful 8–9 state runs, Gemini 3.1 Pro, the strongest model in the paper, uses about 45.8% more tool calls than TTT.
Reasoning models are much stronger than non-reasoning models. In the 4–5 state range, Gemini 3.1 Pro reaches 85% success and Gemini 3 Flash with thinking reaches 15%, while GPT-5.4 without reasoning, Gemini 3.1 Flash Lite, and Llama-3.3-70B reach 0% for automata with 4 or more states.
The strongest model does not simply imitate L* or TTT: the paper reports no exact query-sequence matches, frequent overuse of equivalence queries, and non-monotonic hypothesis sequences.
Non-informative queries increase over long interactions. After roughly 60 steps, even DeepSeek-V4-Pro issues non-informative queries about 20% of the time; classical algorithms maintain 0% by construction.
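To make the non-informative-query metric concrete, here is a small sketch under a simplifying assumption: a membership query counts as non-informative only when the same word was already asked earlier in the run. The paper's own definition may be broader (for example, also covering answers implied by counterexamples), and the function names below are hypothetical.

```python
from typing import Iterable, List


def flag_non_informative(words: Iterable[str]) -> List[bool]:
    """Flag each membership query whose word was already asked earlier."""
    seen = set()
    flags = []
    for word in words:
        flags.append(word in seen)
        seen.add(word)
    return flags


def rate_after(flags: List[bool], step: int) -> float:
    """Fraction of flagged queries at 0-indexed positions >= step."""
    tail = flags[step:]
    return sum(tail) / len(tail) if tail else 0.0


# Illustrative query log: "a" and "b" are each asked twice.
queries = ["", "a", "b", "a", "ab", "b"]
flags = flag_non_informative(queries)
late_rate = rate_after(flags, 3)  # rate over the last three queries
```

Computing the rate over a late window, rather than over the whole run, mirrors the observation that redundancy grows with interaction length.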

Comparison Scope

This paper should not be cited as evidence that neural or learned action-conditioned world models outperform LLM agents. It does not compare against Dreamer-style, JEPA-style, CWM/Genie-like, or telemetry-native learned dynamics models.

The direct baselines are classical symbolic active automata-learning algorithms, especially L* and TTT. Passive automata learners are used diagnostically to decide whether the LLM collected enough labeled evidence for a non-interactive learner to recover the hidden DFA.

The precise result is therefore narrower and sharper: on a clean hidden-DFA recovery task, LLM-only tool-calling agents are much less robust and query-efficient than explicit model-inference algorithms designed for the task. If L* and TTT are treated as symbolic world-model learners, then the paper supports explicit model-learning over prompt-loop inference. It does not prove that learned neural world models already beat LLM agents.

For this wiki, the fair architectural lesson is to compare at least four systems when designing digital-operations benchmarks:

| System | What the paper directly supports |
| --- | --- |
| LLM-only agent | Weak as a reliable world-model inference mechanism on this formal task. |
| Classical/symbolic model learner | Strong baseline when the environment has an exact symbolic structure. |
| Learned action-conditioned world model | Not evaluated here; remains the hypothesis for telemetry and digital operations. |
| Hybrid LLM + explicit model layer | Suggested by the failure mode, but not directly tested by this paper. |

Failure Analysis

The paper separates failures into two useful categories:

| Failure type | Operational meaning | Why it matters for this wiki |
| --- | --- | --- |
| Planning failure | The agent did not collect enough evidence; passive automata learners cannot infer the DFA from the collected observations. | A tool-using LLM may interact a lot while still failing to ask the right state-revealing questions. |
| Reasoning failure | The collected evidence is sufficient for at least one passive learner, but the LLM fails to infer the correct DFA. | Long context and memory are not enough if the model cannot convert evidence into a stable latent state or transition model. |

This distinction is directly reusable for digital-world and operations benchmarks: separate failures caused by insufficient information gathering from failures caused by poor inference over already available state/action/observation history.
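The planning-versus-reasoning split reduces to a simple decision rule once a passive learner is available. The sketch below is an assumption-laden illustration: `classify_run`, the label strings, and the toy `pool_learner` are hypothetical names, and the toy learner only selects from a three-candidate pool rather than performing real passive automata learning.

```python
from typing import Callable, Dict, Iterable, Optional

Evidence = Dict[str, bool]  # word -> accepted by the hidden DFA?


def classify_run(agent_succeeded: bool,
                 evidence: Evidence,
                 passive_learners: Iterable[Callable[[Evidence], Optional[str]]],
                 is_correct: Callable[[str], bool]) -> str:
    """Label a run as success, reasoning failure, or planning failure."""
    if agent_succeeded:
        return "success"
    for learn in passive_learners:
        hypothesis = learn(evidence)
        # Evidence sufficed for a passive learner, so the agent's
        # inference step is what failed.
        if hypothesis is not None and is_correct(hypothesis):
            return "reasoning_failure"
    # No passive learner could recover the target from this evidence.
    return "planning_failure"


# Toy stand-in for a passive learner: return the unique candidate
# consistent with the evidence, or None when it is ambiguous.
POOL = {
    "even_a": lambda w: w.count("a") % 2 == 0,
    "accept_all": lambda w: True,
    "ends_b": lambda w: w.endswith("b"),
}


def pool_learner(evidence: Evidence) -> Optional[str]:
    consistent = [name for name, h in POOL.items()
                  if all(h(w) == label for w, label in evidence.items())]
    return consistent[0] if len(consistent) == 1 else None


rich = {"": True, "a": False, "b": True, "ab": False}
poor = {"": True, "b": True}
is_target = lambda name: name == "even_a"
```

With `rich`, only `even_a` fits the labels, so a failed agent run is a reasoning failure; with `poor`, two candidates remain and the run is a planning failure.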

Social And Artifact Context

Alex provided a Rohan Paul X post summarizing the paper as evidence that LLM agents struggle to turn accumulating interaction evidence into a reliable internal world model. The official author announcement by Reef Menaged links the project page, live demo, and paper. Both X snapshots were preserved under papers/agentic-automata-learning-2026/ for provenance.

The official project page and code repository make the benchmark more operational than an abstract proposal: the live demo lets a user choose a target automaton and watch the agentic automata learning loop, and the repository exposes the experiment runner and model/provider interfaces.

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Benchmarks: what level of modeling is tested? | warning | The benchmark cleanly separates task success from query planning, evidence integration, hypothesis construction, and non-informative tool use. | Not a time-series benchmark; no numeric telemetry, graph time series, irregular event streams, or operational action logs. |
| Streaming state, long context, and constant updates | adjacent | Agents receive growing interaction histories, and failures show that raw long context does not guarantee stable model inference. | No always-on numeric state update, memory-compression interface, or bounded-latency serving contract. |
| Causal structure, counterfactuals, and control | adjacent | Membership/equivalence queries are controlled interactions with exact oracle feedback and counterexamples. | Query actions are not real interventions whose consequences change a dynamical system; the hidden DFA is deterministic and exactly checkable. |
| Digital-world robot north star | warning | The source is a clean negative result for the idea that a general LLM agent can reliably infer hidden environment rules from interaction alone. | Need digital operations tasks with telemetry, logs, traces, service graph, typed interventions, delayed effects, and safety outcomes. |

Local Interpretation

For Alex’s research agenda, the paper strengthens the case for hybrid agents rather than LLM-only agents. If frontier tool-calling LLMs struggle to reconstruct a small hidden DFA from exact feedback, then an SRE or digital-operations agent should not rely on the LLM prompt loop alone to maintain a reliable dynamics model of production.

The useful architecture lesson is:

LLM agent:
  propose questions, inspect evidence, explain hypotheses
 
separate learned or symbolic world-model layer:
  maintain state, track constraints, score candidate futures, detect contradictions
 
benchmark harness:
  measure query quality, evidence reuse, non-informative actions, and final model accuracy

This does not mean DFAs are the target domain. It means the benchmark provides a clean diagnostic for a failure mode we should expect to be worse in real systems: the agent collects observations, but its internal state does not compound into a reliable model.
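One piece of the separate model layer, contradiction detection, is easy to sketch. The `EvidenceStore` class and its method names below are hypothetical, and hypotheses are modelled as plain word-to-bool functions for brevity; the paper does not propose or test this component.

```python
from typing import Callable, Dict, List, Tuple

Hypothesis = Callable[[str], bool]  # word -> predicted acceptance


class EvidenceStore:
    """Keeps oracle answers outside the prompt and checks hypotheses."""

    def __init__(self) -> None:
        self._labels: Dict[str, bool] = {}

    def record(self, word: str, accepted: bool) -> None:
        previous = self._labels.get(word)
        # A deterministic oracle never gives two answers for one word.
        if previous is not None and previous != accepted:
            raise ValueError(f"conflicting labels for {word!r}")
        self._labels[word] = accepted

    def contradictions(self, hypothesis: Hypothesis) -> List[Tuple[str, bool]]:
        """Observed (word, label) pairs that the hypothesis gets wrong."""
        return [(word, label)
                for word, label in sorted(self._labels.items())
                if hypothesis(word) != label]

    def admits(self, hypothesis: Hypothesis) -> bool:
        return not self.contradictions(hypothesis)


store = EvidenceStore()
store.record("", True)
store.record("a", False)
store.record("aa", True)

accept_all = lambda w: True
even_a = lambda w: w.count("a") % 2 == 0
```

A harness could call `store.admits(...)` before allowing an equivalence query, so that hypotheses already refuted by stored observations never consume an oracle call.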

Links Into The Wiki

Open Questions

Can the same planning-vs-reasoning failure split be used in an observability or Grid2Op-style benchmark with numeric trajectories and typed actions?
What memory or state interface would reduce non-informative queries without merely hard-coding L* or TTT?
Should digital-world agents expose a separate contradiction checker that rejects hypotheses already contradicted by observations?
Can a hybrid LLM plus symbolic or learned world-model layer match classical automata learners on this benchmark while remaining useful beyond DFAs?
How quickly do non-informative actions grow when the hidden environment is stochastic, delayed, partially observed, or action-conditioned in the stronger control sense?
