Agentic Automata Learning
Summary
Agentic Automata Learning is a benchmark framework for testing whether a tool-calling LLM agent can infer a hidden deterministic finite automaton (DFA) through interaction. The hidden DFA acts as a small, exact, symbolic world model. The agent must recover it by issuing membership queries and equivalence queries to an oracle.
The entity is useful in this wiki as a controlled world-model-inference benchmark, not as a direct time-series model. It asks whether an agent can choose informative probes, use accumulated evidence, and construct a stable hypothesis about hidden state and transition structure.
Official Artifacts
- Project page: https://reefmenaged.github.io/Agentic_Automata_Learning/
- Code: https://github.com/reefmenaged/Agentic_Automata_Learning
- Data: https://drive.google.com/drive/folders/1ypJSzI5NuwaIC-ApgSrlkpZYDxhq5t_c?usp=sharing
- Live demo: https://agentic-automata-learning.onrender.com/
- Paper: https://arxiv.org/abs/2606.16576
Task Contract
hidden DFA + oracle
-> membership query(word) -> accepted/rejected
-> equivalence query(candidate DFA or regex) -> accepted/counterexample
-> agent updates hypothesis
-> final DFA/language reconstructionThe benchmark controls task difficulty by the number of DFA states and compares LLM agents against classical active automata-learning algorithms such as L* and TTT.
What It Measures
| Capability | Benchmark probe | Failure signal |
|---|---|---|
| Query planning | Does the agent ask informative membership/equivalence queries? | It spends budget on redundant, already-contradicted, or low-information calls. |
| Evidence integration | Does the agent use the full interaction history? | It repeats known evidence or proposes hypotheses inconsistent with prior observations. |
| Hypothesis construction | Does the agent infer the hidden DFA/language? | Passive learners can solve from the collected evidence, but the LLM cannot. |
| Interaction efficiency | How many tool calls are needed? | The agent solves only with much higher query cost than classical learners. |
| Robustness to complexity | Does success survive more DFA states? | Success collapses as the hidden automaton grows. |
Current Source Verdict
Can LLM Agents Infer World Models? reports that reasoning-capable LLM agents can sometimes solve small instances, but all evaluated LLM agents fall far behind classical algorithms in robustness and efficiency. For 8—9 state automata, no LLM exceeds 25% success, while L* and TTT solve all instances. Non-informative queries increase with long interactions, showing that growing context does not automatically become usable state.
Relation To Digital World Models
This benchmark is a small formal cousin of Digital World Models. It shares the useful property that environment rules are mechanically checkable, but it is much simpler than code, web, GUI, or operations worlds:
DFA benchmark:
exact symbolic hidden state + exact oracle feedback
software/digital operations:
partial observations + typed actions + delayed effects + hidden external actorsThe transfer is therefore methodological: build benchmarks that separate information-gathering failures from inference failures, and count non-informative actions explicitly.
Relation To Foundation TSFM Agenda
For the Foundation Time-Series Model Research Agenda, Agentic Automata Learning is a warning and diagnostic benchmark rather than direct progress on numeric time-series modeling. It suggests that an LLM-only agent loop is unlikely to be a sufficient latent-state or action-consequence model for operational systems. A TSFM-native analogue would replace the DFA oracle with multivariate time series, event streams, service graph context, typed interventions, delayed outcomes, and safety rewards.
Open Questions
- Can this benchmark be extended with stochastic, delayed, partial, or noisy feedback without losing the clean failure taxonomy?
- Can a hybrid LLM plus explicit symbolic learner or learned state module close the gap to L*/TTT?
- Which part of the failure is memory representation, query selection, hypothesis search, or tool-use protocol?
- What is the closest numeric time-series analogue: system-identification queries, simulator rollouts, intervention logs, or active experiment design?