Agentic Automata Learning

Summary

Agentic Automata Learning is a benchmark framework for testing whether a tool-calling LLM agent can infer a hidden deterministic finite automaton (DFA) through interaction. The hidden DFA acts as a small, exact, symbolic world model. The agent must recover it by issuing membership queries and equivalence queries to an oracle.

The entity is useful in this wiki as a controlled world-model-inference benchmark, not as a direct time-series model. It asks whether an agent can choose informative probes, use accumulated evidence, and construct a stable hypothesis about hidden state and transition structure.

Official Artifacts

Task Contract

hidden DFA + oracle
  -> membership query(word) -> accepted/rejected
  -> equivalence query(candidate DFA or regex) -> accepted/counterexample
  -> agent updates hypothesis
  -> final DFA/language reconstruction

The benchmark controls task difficulty by the number of DFA states and compares LLM agents against classical active automata-learning algorithms such as L* and TTT.

What It Measures

CapabilityBenchmark probeFailure signal
Query planningDoes the agent ask informative membership/equivalence queries?It spends budget on redundant, already-contradicted, or low-information calls.
Evidence integrationDoes the agent use the full interaction history?It repeats known evidence or proposes hypotheses inconsistent with prior observations.
Hypothesis constructionDoes the agent infer the hidden DFA/language?Passive learners can solve from the collected evidence, but the LLM cannot.
Interaction efficiencyHow many tool calls are needed?The agent solves only with much higher query cost than classical learners.
Robustness to complexityDoes success survive more DFA states?Success collapses as the hidden automaton grows.

Current Source Verdict

Can LLM Agents Infer World Models? reports that reasoning-capable LLM agents can sometimes solve small instances, but all evaluated LLM agents fall far behind classical algorithms in robustness and efficiency. For 8—9 state automata, no LLM exceeds 25% success, while L* and TTT solve all instances. Non-informative queries increase with long interactions, showing that growing context does not automatically become usable state.

Relation To Digital World Models

This benchmark is a small formal cousin of Digital World Models. It shares the useful property that environment rules are mechanically checkable, but it is much simpler than code, web, GUI, or operations worlds:

DFA benchmark:
  exact symbolic hidden state + exact oracle feedback
 
software/digital operations:
  partial observations + typed actions + delayed effects + hidden external actors

The transfer is therefore methodological: build benchmarks that separate information-gathering failures from inference failures, and count non-informative actions explicitly.

Relation To Foundation TSFM Agenda

For the Foundation Time-Series Model Research Agenda, Agentic Automata Learning is a warning and diagnostic benchmark rather than direct progress on numeric time-series modeling. It suggests that an LLM-only agent loop is unlikely to be a sufficient latent-state or action-consequence model for operational systems. A TSFM-native analogue would replace the DFA oracle with multivariate time series, event streams, service graph context, typed interventions, delayed outcomes, and safety rewards.

Open Questions

  • Can this benchmark be extended with stochastic, delayed, partial, or noisy feedback without losing the clean failure taxonomy?
  • Can a hybrid LLM plus explicit symbolic learner or learned state module close the gap to L*/TTT?
  • Which part of the failure is memory representation, query selection, hypothesis search, or tool-use protocol?
  • What is the closest numeric time-series analogue: system-identification queries, simulator rollouts, intervention logs, or active experiment design?

Source Pages