---
abstract: |
  We propose *agentic automata learning* to evaluate the extent to which tool-calling LLM agents can uncover hidden environments through interaction. In our setup, an agent should uncover a hidden deterministic finite automaton (DFA) by interacting with an oracle through (1) membership queries (\`\`Does this string belong to the target language?") and (2) equivalence queries (\`\`Is this the target DFA?"). This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines (classic automata-learning algorithms). Evaluating state-of-the-art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than non-reasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, our results show that current LLM agents can sometimes perform non-trivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.[^1]
author:
- |
  Reef Menaged^1^ `\quad`{=latex} Gili Lior^1^ `\quad`{=latex} Shauli Ravfogel^2^ `\quad`{=latex} Roee Aharoni^3^ `\quad`{=latex} Gabriel Stanovsky^1^\
  ^1^The Hebrew University of Jerusalem `\quad`{=latex} ^2^New York University `\quad`{=latex} ^3^Google Research\
  `{reef.menaged, gili.lior, gabriel.stanovsky}@mail.huji.ac.il`\
  `shauli.ravfogel@gmail.com` `\quad`{=latex} `roeeaharoni@google.com`
bibliography:
- custom.bib
title: |
  Can LLM Agents Infer World Models?\
  Evidence from Agentic Automata Learning
---

# Introduction

LLMs are increasingly used as interactive agents across environments of varying complexity, from settings like games with fixed rules to dynamic environments such as navigation in open worlds [@park2026orak; @zheng2025lifelongagentbench]. Despite their strong performance, it remains unclear whether LLMs can infer a correct world model of the environment they interact with, or whether they rely on local patterns and shallow heuristics [@vafa2024worldmodel]. Addressing this question requires controlled tasks in which the hidden structure of the environment is known, its complexity can be controlled systematically, and the agent's discovery process can be observed throughout the interaction.

![**Agentic automata learning.** An agent is tasked to identify a hidden DFA through successive interactions with the environment. *Membership Queries* ask whether a word is accepted by the hidden DFA (left), while *Equivalence Queries* submit a hypothesized DFA and ask whether it matches the hidden target. The environment either confirms equivalence or returns a counterexample on which the two DFAs differ (right). `\label{fig:framework_diagram}`{=latex}](figures/diagram4.png){width="\\columnwidth"}

*Active automata learning*, exemplified in Figure `\ref{fig:framework_diagram}`{=latex}, offers a well-studied formal framework for learning hidden structure via interaction [@rivest1993inference; @isberner2015foundations]. In this classic formulation, a *learner* attempts to identify a hidden deterministic finite automaton (DFA) through a sequence of queries to an oracle representing the environment. *Membership Queries* ask whether a word belongs to the target language, while *Equivalence Queries* submit a hypothesis DFA, which the oracle either accepts or refutes with a counterexample, i.e., a word on which the hypothesis DFA and the hidden DFA disagree.

While the task provides a simple and controlled framework for environment modeling, it is also useful for modeling real-world tasks. For instance, @rivest1993inference framed it as an autonomous robot learning an unfamiliar deterministic environment through actions and observations, and more recently @vafa2024worldmodel examined whether LLMs can learn the world model of a deterministic environment from observations by casting a map of Manhattan as a DFA, finding that they struggle to do so.

In this work, we recast active automata learning as *agentic automata learning*. To do so, we initialize an oracle with a target DFA and equip an LLM agent with tools to perform membership and equivalence queries against the oracle. This formulation turns automata learning into a controlled task for testing whether LLM agents can infer latent structure through interaction. It provides formal measures for task complexity (e.g., the number of states in the DFA) as well as for discovery efficiency (e.g., the number of queries), and enables direct comparison with strong baselines, including the $L^*$ and TTT algorithms [@angluin1987learning; @angluin1988queries; @isberner2014ttt].

Our results reveal that the task is challenging even for state-of-the-art LLMs and that substantial performance gaps exist across models. We find that the task becomes increasingly challenging as the environment complexity grows, as reflected in a sharp decline in success rates as the number of states in the DFA increases: for automata with 9 states, no model exceeds 25% success, while classic algorithms solve 100% of task instances. Discovery efficiency is also challenging: even when considering only successful runs, Gemini 3.1 Pro, the best-performing model, requires approximately 45.8% more queries than the classic TTT algorithm. In addition, a clear performance gap emerges across models, particularly between reasoning and non-reasoning models: for example, on automata with 4--5 states, Gemini 3.1 Pro reaches 85% success, while GPT-5.4 without reasoning, Gemini 3.1 Flash Lite, and Llama-3.3-70B achieve 0% success.

Beyond task performance, our analyses show that all models have substantial room for improvement in planning, reasoning, and proper use of accumulated information throughout the interaction. Models often fail to plan informative queries, reason inconsistently over accumulated evidence, or fail to reliably use prior observations when constructing hypotheses. These findings suggest that while advanced LLMs can perform non-trivial interactive discovery, they remain far from classic learners in both robustness and efficiency.

To summarize, our main contributions are as follows: (1) We introduce *agentic automata learning*, a controlled framework for evaluating whether tool-calling LLM agents can infer latent structure through interaction; (2) show that advanced reasoning models can perform non-trivial interactive discovery, but degrade sharply as automata complexity increases; and (3) analyze interaction trajectories and identify recurring limitations in planning, reasoning, and consistent use of accumulated information.

# Background: Automata Learning

Automata learning tests the extent to which learning algorithms are capable of discovering the structure of a hidden DFA from observations of which words it accepts or rejects. Automata learning is commonly divided into two paradigms: *active automata learning*, where the learner can interact with an oracle through queries, and *passive automate learning*, where the learner infers a DFA from a fixed set of labeled examples [@delahiguera2010grammatical].

## Active Automata Learning

In this setting, a learner interacts with an oracle through adaptively chosen queries in order to identify a hidden DFA [@isberner2015foundations]. The oracle has access to the hidden DFA and answers queries about the language it recognizes. Traditionally, there are two types of queries [@angluin1988queries; @angluin1990negative]. A Membership Query allows the learner to ask whether a given word is accepted by the hidden DFA, while an Equivalence Query allows the learner to propose a hypothesis DFA and test whether it is equivalent to the hidden DFA, which the oracle can either accept or refute with a counterexample, namely a word on which the hypothesis DFA and the hidden DFA disagree.

Prior work has established formal complexity and evaluation measures, as well as efficient deterministic active automata learning algorithms [@angluin1987learning; @rivest1993inference]. Task complexity is typically characterized by the number of states in the minimal DFA accepting the language of the hidden DFA, while learning efficiency is measured by the number of queries required to identify the hidden DFA [@isberner2015foundations].

Classic learners such as L\* and TTT solve this task through an iterative hypothesis-refinement process: they consider an hypothesis which matches all given observations using EQ, and when a counterexample is returned, use a polynomial number of MQs to construct a new hypothesis that again accounts for all given information while adding at least one previously unidentified state in the hidden DFA. These algorithms were proven to converge on the hidden DFA in polynominal time [@angluin1987learning; @isberner2014ttt]. This exemplifies several desired agentic real-world traits, including maintaining a memory of observed behavior, reasoning over it, choosing actions based on previous interactions, and adapting to new information.

## Passive Automata Learning

In passive learning, the learner does not query an oracle. Instead, it receives a fixed labeled sample of accepted and rejected words and must infer a DFA consistent with that sample [@murphy1996passively]. Classical passive algorithms such as RPNI, EDSM, and Blue-Fringe learn automata from finite samples in this way [@oncina1992rpni; @lang1998edsm].

We use passive learning algorithms to assess the information gathered by LLM agents during interaction: given the labeled words collected by an agent, we test whether a passive learner can already infer the hidden DFA.

# Agentic Automata Learning

In this section we formally define the task of *agentic automata learning*, which recasts the classic interactive automata learning for agentic tool-calling LLMs (§`\ref{sec:task-def}`{=latex}). We then describe how task instances can be generated automatically at arbitrary scale by sampling DFAs while controlling for their complexity, and discuss how the task requires important agentic capabilities, such as adaptive planning, memory organization, and efficient tool use.

#### Task definition. {#sec:task-def}

Let $\Sigma$ be a finite alphabet, $\mathcal{D}_\Sigma$ be the set of all DFAs over $\Sigma$, and $A^\star \in \mathcal{D}_\Sigma$ be a minimal DFA inducing the target language $L^\star \subseteq \Sigma^*$. An agent interacts with an oracle $\mathcal{O}_{A^\star}$ through a sequence of tool calls, each termed a query.

At each time step $t = 1,\ldots,T$, the agent chooses one of two tool calls: *(1) Membership Query*, $\mathrm{MQ}:\Sigma^* \mapsto \{0,1\}$ for some word $w \in \Sigma^*$, to which the oracle returns $\mathbbm{1}_{L^\star}\{w\}$. *(2) Equivalence Query*, $\mathrm{EQ}:\mathcal{D}_\Sigma  \mapsto \{\top\} \cup \Sigma^*$ for some hypothesis DFA $A$, to which the oracle returns either $\top$ if $L(A) = L^\star$, or a counterexample $w \in L(A) \triangle L^\star$ if they differ.

The agent's prompt at time $t$ contains the entire interaction history: $$h_{t-1} = ((a_1,o_1),\ldots,(a_{t-1},o_{t-1})),$$ where $a_i$ is the query issued at step $i$ and $o_i$ is the oracle response. See complete prompt examples in Appendix `\ref{sec:interaction_example}`{=latex}.

The goal of the agent is to identify the target language by issuing an equivalence query $\mathrm{EQ}(\hat{A})$ such that: $$L(\hat{A}) = L^\star.$$ An interaction is successful if the oracle returns $\top$ for an equivalence query within a query budget $T$, while its cost is the number of queries issued up to and including the accepted equivalence query.

The complexity of an interaction is defined by the number of states in $A^\star$. This is motivated by the classic active automata learning problem, where the number of states is a central factor affecting the efficiency of algorithms such as L\* and TTT.

```{=latex}
\definecolor{figtaborange}{HTML}{FF7F0E}
```
```{=latex}
\definecolor{figtabblue}{HTML}{1F77B4}
```
```{=latex}
\definecolor{figtabgreen}{HTML}{2CA02C}
```
```{=latex}
\definecolor{figtabred}{HTML}{D62728}
```
```{=latex}
\definecolor{figtabpink}{HTML}{E377C2}
```
```{=latex}
\definecolor{figgold}{HTML}{FFD700}
```
```{=latex}
\begin{figure*}[t]

\resizebox{0.98\textwidth}{!}{%


\makebox[\textwidth][c]{%
%


\begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}lll@{}}
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtaborange] (0,0) rectangle (0.28,0.13);}~\small Gemini 3.1 Pro Preview &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figgold] (0,0) rectangle (0.28,0.13);}~\small Deepseek-v4-Pro &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabblue] (0,0) rectangle (0.28,0.13);}~\small Gemini-3-Flash-Preview (thinking) \\

\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabred] (0,0) rectangle (0.28,0.13);}~\small GPT-5.4 (without thinking) &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabgreen] (0,0) rectangle (0.28,0.13);}~\small Gemini-3.1-Flash-Lite-Preview &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabpink] (0,0) rectangle (0.28,0.13);}~\small Llama-3.3-70B-Instruct-Turbo
\end{tabular*}
%
}


\begin{subfigure}[t]{0.32\textwidth}

\begin{tikzpicture}
\begin{axis}[
    width=\linewidth,
    height=5.8cm,
    ymin=0,
    ymax=60,
    ylabel={Total interaction cost (\$)},
    xlabel={Minimal DFA States},
    xtick={2,3,4,5,6,7,8,9},
    xmin=1.5,
    xmax=9.5,
    grid=major,
    grid style={gray!18},
    tick label style={font=\scriptsize},
    label style={font=\small},
]

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtaborange}] coordinates {
(2,0.0987) (2,1.0420) (2,0.4004) (2,0.2205) (2,0.2308) (2,0.0387) (2,0.5567) (2,0.2944) (2,0.3714) (2,0.4468) (3,0.2302) (3,0.2948) (3,0.3624) (3,0.7381) (3,1.4689) (3,0.2325) (3,0.6364) (3,0.6961) (3,0.3552) (3,0.7056) (4,1.2674) (4,3.5409) (4,0.7394) (4,1.3252) (4,1.0312) (4,0.5077) (4,1.9867) (4,1.0354) (4,1.0756) (4,1.7918) (5,3.3838) (5,4.7508) (5,11.2485) (5,19.4667) (5,11.8814) (5,0.9815) (5,4.6628) (5,2.0529) (5,1.8912) (5,14.1842) (6,3.2626) (6,20.0026) (6,3.9177) (6,7.1961) (6,2.7804) (6,7.9497) (6,8.3443) (6,2.3071) (6,13.5807) (6,2.3473) (7,25.4110) (7,7.8091) (7,27.3775) (7,5.6403) (7,20.6961) (7,8.4720) (7,20.6803) (7,1.2559) (7,6.8171) (7,18.3662) (8,26.3707) (8,33.4520) (8,28.3468) (8,33.8778) (8,41.0308) (8,14.4198) (8,4.7134) (8,26.9342) (8,30.2036) (8,7.2219) (9,50.1629) (9,32.9887) (9,38.8142) (9,30.7722) (9,30.3047) (9,30.9871) (9,24.2295) (9,34.6801) (9,19.6310) (9,36.4843)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figgold}] coordinates {
(2,0.0041) (2,0.0234) (2,0.0178) (2,0.0045) (2,0.0045) (2,0.0068) (2,0.0114) (2,0.0092) (2,0.0069) (2,0.0192) (3,0.0287) (3,0.0637) (3,0.0540) (3,0.0512) (3,0.1378) (3,0.0616) (3,0.0431) (3,0.2069) (3,0.0594) (3,0.0941) (4,0.3325) (4,0.2698) (4,0.4178) (4,0.2871) (4,0.3946) (4,0.2279) (4,0.6513) (4,0.1897) (4,0.4463) (4,0.2154) (5,0.7780) (5,0.5145) (5,0.9388) (5,0.8154) (5,0.7132) (5,0.4125) (5,1.0993) (5,0.2808) (5,0.5563) (5,0.8819) (6,1.0739) (6,1.1145) (6,0.9337) (6,0.9332) (6,0.4013) (6,0.9905) (6,1.6113) (6,1.8808) (6,1.0206) (6,1.1231) (7,2.1495) (7,3.6192) (7,0.9821) (7,1.8452) (7,1.7714) (7,1.7862) (7,2.1522) (7,0.5984) (7,1.9061) (7,0.4491) (8,3.1049) (8,3.0351) (8,1.8916) (8,2.5426) (8,2.3496) (8,2.1159) (8,1.6200) (8,1.7080) (8,1.9983) (8,2.2955) (9,4.0064) (9,2.7038) (9,2.3960) (9,2.3960) (9,2.6976) (9,3.4675) (9,4.2959) (9,3.4671) (9,1.8061) (9,3.0578)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabblue}] coordinates {
(2,0.0659) (2,0.3902) (2,0.0948) (2,0.0572) (2,0.0663) (2,0.0244) (2,0.0665) (2,0.0419) (2,0.0422) (2,0.1996) (3,0.1087) (3,0.3259) (3,0.0803) (3,0.1732) (3,0.2266) (3,0.1273) (3,0.7279) (3,0.3593) (3,0.3061) (3,0.2539) (4,0.3905) (4,0.1029) (4,0.6327) (4,0.0504) (4,0.3962) (4,0.1628) (4,0.5832) (4,0.8663) (4,1.2513) (4,0.6577) (5,1.4927) (5,0.3965) (5,2.2150) (5,3.6668) (5,1.2573) (5,0.2381) (5,0.3714) (5,1.0318) (5,0.7641) (5,2.9790) (6,0.7124) (6,1.9138) (6,0.5950) (6,6.1279) (6,0.3008) (6,1.7237) (6,1.4984) (6,0.7835) (6,0.5134) (6,1.4825) (7,0.6964) (7,2.0269) (7,2.6801) (7,0.4792) (7,2.9244) (7,2.6006) (7,0.5908) (7,1.5374) (7,0.5172) (7,2.5168) (8,0.2959) (8,1.2036) (8,2.4043) (8,1.9062) (8,0.5162) (8,0.9015) (8,5.4225) (8,1.9366) (8,9.9049) (8,3.9159) (9,1.0592) (9,0.3742) (9,11.2893) (9,2.9807) (9,7.1616) (9,1.0083) (9,3.9369) (9,5.0563) (9,0.3742) (9,4.6248)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabred}] coordinates {
(2,0.0293) (2,0.0390) (2,0.0360) (2,0.0362) (2,0.0351) (2,0.0354) (2,0.0269) (2,0.0327) (2,0.0327) (2,0.0334) (3,0.0892) (3,0.1127) (3,0.0931) (3,0.1025) (3,0.0865) (3,0.0590) (3,0.1002) (3,0.0801) (3,0.0436) (3,0.1106) (4,0.2361) (4,0.1321) (4,0.1738) (4,0.1842) (4,0.1103) (4,0.1284) (4,0.1835) (4,0.1151) (4,0.1350) (4,0.1531) (5,0.3289) (5,0.2498) (5,0.2209) (5,0.3330) (5,0.2424) (5,0.2780) (5,0.3226) (5,0.2651) (5,0.2025) (5,0.2488) (6,0.2580) (6,0.4062) (6,0.4790) (6,0.4559) (6,0.3942) (6,0.3833) (6,0.4556) (6,0.3608) (6,0.3535) (6,0.3945) (7,0.5072) (7,1.0531) (7,0.5028) (7,0.5472) (7,0.4568) (7,0.7321) (7,0.5292) (7,0.3513) (7,0.6493) (7,0.4859) (8,0.3585) (8,0.5427) (8,0.4404) (8,0.6181) (8,0.5946) (8,0.4628) (8,0.5037) (8,0.6565) (8,0.5849) (8,0.5011) (9,1.0796) (9,0.6250) (9,0.8201) (9,0.4451) (9,0.8907) (9,0.7884) (9,0.7353) (9,0.7457) (9,0.6596) (9,0.9314)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabgreen}] coordinates {
(2,0.0071) (2,0.0074) (2,0.0071) (2,0.0044) (2,0.0041) (2,0.0038) (2,0.0053) (2,0.0041) (2,0.0038) (2,0.0061) (3,0.0236) (3,0.0219) (3,0.0268) (3,0.0238) (3,0.0223) (3,0.0235) (3,0.0227) (3,0.0208) (3,0.0213) (3,0.0212) (4,0.0435) (4,0.0280) (4,0.0350) (4,0.0386) (4,0.0279) (4,0.0234) (4,0.0386) (4,0.0274) (4,0.0354) (4,0.0359) (5,0.0599) (5,0.0581) (5,0.0544) (5,0.0623) (5,0.0526) (5,0.0509) (5,0.0633) (5,0.0521) (5,0.0495) (5,0.0583) (6,0.0546) (6,0.0803) (6,0.0509) (6,0.0860) (6,0.0736) (6,0.0744) (6,0.0798) (6,0.0579) (6,0.0580) (6,0.0809) (7,0.0723) (7,0.1700) (7,0.1001) (7,0.1000) (7,0.0858) (7,0.1198) (7,0.0862) (7,0.0517) (7,0.1034) (7,0.0905) (8,0.0855) (8,0.1202) (8,0.1162) (8,0.1014) (8,0.1362) (8,0.1063) (8,0.1014) (8,0.1480) (8,0.0943) (8,0.1198) (9,0.2033) (9,0.1399) (9,0.1652) (9,0.1135) (9,0.1451) (9,0.1475) (9,0.1556) (9,0.1600) (9,0.1076) (9,0.1527)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabpink}] coordinates {
(2,0.0154) (2,0.0145) (2,0.0147) (2,0.0053) (2,0.0147) (2,0.0093) (2,0.0152) (2,0.0093) (2,0.0077) (2,0.0154) (3,0.0820) (3,0.0864) (3,0.0944) (3,0.0826) (3,0.0845) (3,0.0852) (3,0.0915) (3,0.0587) (3,0.0721) (3,0.0693) (4,0.2280) (4,0.1348) (4,0.1841) (4,0.1922) (4,0.1218) (4,0.1417) (4,0.1935) (4,0.1101) (4,0.1722) (4,0.1532) (5,0.4098) (5,0.3283) (5,0.3280) (5,0.4416) (5,0.3036) (5,0.2678) (5,0.3801) (5,0.2470) (5,0.2400) (5,0.3218) (6,0.3064) (6,0.6232) (6,0.5473) (6,0.5538) (6,0.5778) (6,0.4176) (6,0.6378) (6,0.3912) (6,0.4951) (6,0.5822) (7,0.7190) (7,1.6141) (7,0.7719) (7,0.7603) (7,0.6397) (7,1.0789) (7,0.6096) (7,0.5473) (7,1.3671) (7,0.7103) (8,0.6200) (8,1.2035) (8,1.0836) (8,0.8447) (8,1.2388) (8,0.8432) (8,0.7478) (8,1.3000) (8,1.0365) (8,1.2061) (9,1.8905) (9,1.3876) (9,0.5211) (9,1.4436) (9,1.2574) (9,1.5235) (9,1.3883) (9,1.5009) (9,1.0312) (9,1.5066)
};


\end{axis}
\end{tikzpicture}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.32\textwidth}

\begin{tikzpicture}
\begin{axis}[
    width=\linewidth,
    height=5.8cm,
    ymin=0,
    ymax=22,
    ylabel={Total runtime (hours)},
    xlabel={Minimal DFA States},
    xtick={2,3,4,5,6,7,8,9},
    xmin=1.5,
    xmax=9.5,
    grid=major,
    grid style={gray!18},
    tick label style={font=\scriptsize},
    label style={font=\small},
]

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtaborange}] coordinates {
(2,0.0222) (2,0.2087) (2,0.0847) (2,0.0495) (2,0.0527) (2,0.0085) (2,0.1098) (2,0.0642) (2,0.0808) (2,0.0924) (3,0.0474) (3,0.0625) (3,0.0746) (3,0.2768) (3,0.2540) (3,0.0506) (3,0.1238) (3,0.1864) (3,0.0765) (3,0.1272) (4,0.2504) (4,0.8093) (4,0.1618) (4,0.2731) (4,0.2130) (4,0.1133) (4,0.4118) (4,0.2083) (4,0.2302) (4,0.3489) (5,0.7099) (5,0.8320) (5,2.0815) (5,3.5563) (5,2.2148) (5,0.2311) (5,0.9506) (5,0.4282) (5,0.4057) (5,2.8492) (6,0.6834) (6,3.4401) (6,0.7703) (6,1.3903) (6,0.6933) (6,1.6456) (6,1.5641) (6,0.4301) (6,2.3204) (6,0.5921) (7,4.3964) (7,1.6105) (7,4.8125) (7,1.3679) (7,3.8091) (7,1.5661) (7,3.1711) (7,0.4003) (7,1.2810) (7,3.1943) (8,4.6905) (8,5.7116) (8,4.9230) (8,5.7180) (8,7.0491) (8,2.5780) (8,1.0226) (8,4.3657) (8,5.1083) (8,1.6331) (9,8.8445) (9,6.2177) (9,7.2136) (9,5.6612) (9,5.4207) (9,5.7946) (9,4.6858) (9,6.6151) (9,4.1072) (9,6.8811)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figgold}] coordinates {
(2,0.0209) (2,0.1166) (2,0.0807) (2,0.0241) (2,0.0246) (2,0.0359) (2,0.0564) (2,0.0470) (2,0.0338) (2,0.0867) (3,0.1350) (3,0.3134) (3,0.2525) (3,0.2460) (3,0.6621) (3,0.3003) (3,0.2096) (3,0.9446) (3,0.2778) (3,0.4509) (4,1.6054) (4,1.2900) (4,1.9366) (4,1.4342) (4,2.0711) (4,1.1051) (4,3.1555) (4,0.9184) (4,2.0840) (4,1.0365) (5,3.4195) (5,2.3297) (5,4.2193) (5,3.6255) (5,3.2027) (5,1.8829) (5,5.0544) (5,1.2204) (5,2.5369) (5,4.0495) (6,4.8549) (6,5.1200) (6,4.0600) (6,4.2029) (6,1.7665) (6,4.4295) (6,7.2895) (6,8.3009) (6,4.5878) (6,4.9758) (7,9.7732) (7,16.8240) (7,4.3593) (7,8.5647) (7,8.2018) (7,8.2281) (7,9.8453) (7,2.6042) (7,8.5834) (7,1.9508) (8,14.4520) (8,13.9231) (8,8.6262) (8,11.5108) (8,10.6942) (8,9.6825) (8,7.4692) (8,7.7801) (8,9.2805) (8,10.5476) (9,18.7821) (9,12.2884) (9,11.3616) (9,14.1189) (9,12.3273) (9,15.6759) (9,19.5521) (9,15.6802) (9,7.8780) (9,13.7797)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabblue}] coordinates {
(2,0.0241) (2,0.1288) (2,0.0294) (2,0.0159) (2,0.0203) (2,0.0101) (2,0.0214) (2,0.0136) (2,0.0108) (2,0.0676) (3,0.0372) (3,0.1101) (3,0.0247) (3,0.0638) (3,0.0786) (3,0.0472) (3,0.2401) (3,0.1265) (3,0.1044) (3,0.0923) (4,0.1462) (4,0.0301) (4,0.2323) (4,0.0254) (4,0.1295) (4,0.0592) (4,0.2154) (4,0.3035) (4,0.3775) (4,0.2290) (5,0.4653) (5,0.1518) (5,0.8214) (5,1.2055) (5,0.4533) (5,0.0993) (5,0.1076) (5,0.3537) (5,0.1579) (5,1.0375) (6,0.2743) (6,0.6296) (6,0.2400) (6,2.5847) (6,0.1115) (6,0.6235) (6,0.5696) (6,0.2752) (6,0.1779) (6,0.4431) (7,0.2635) (7,0.5491) (7,0.9313) (7,0.1678) (7,1.0139) (7,0.9303) (7,0.2107) (7,0.6335) (7,0.2077) (7,0.8375) (8,0.0984) (8,0.4205) (8,0.8757) (8,0.5343) (8,0.1629) (8,0.2848) (8,1.8816) (8,0.6271) (8,3.8820) (8,1.3295) (9,0.3001) (9,0.0986) (9,3.7213) (9,0.9728) (9,2.3719) (9,0.2950) (9,1.3651) (9,1.6267) (9,1.1332) (9,1.5743)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabred}] coordinates {
(2,0.0050) (2,0.0049) (2,0.0040) (2,0.0033) (2,0.0044) (2,0.0037) (2,0.0036) (2,0.0038) (2,0.0050) (2,0.0049) (3,0.0130) (3,0.0188) (3,0.0167) (3,0.0113) (3,0.0130) (3,0.0122) (3,0.0178) (3,0.0292) (3,0.0086) (3,0.0144) (4,0.0318) (4,0.0205) (4,0.0240) (4,0.0222) (4,0.0162) (4,0.0184) (4,0.0264) (4,0.0200) (4,0.0206) (4,0.0227) (5,0.0376) (5,0.0343) (5,0.1810) (5,0.0401) (5,0.0262) (5,0.0338) (5,0.0424) (5,0.0298) (5,0.0254) (5,0.0310) (6,0.0304) (6,0.0487) (6,0.0445) (6,0.0956) (6,0.0366) (6,0.0899) (6,0.0531) (6,0.0391) (6,0.0382) (6,0.2056) (7,0.0479) (7,0.2823) (7,0.1026) (7,0.1116) (7,0.0975) (7,0.2280) (7,0.0534) (7,0.0358) (7,0.2427) (7,0.2096) (8,0.0535) (8,0.0682) (8,0.0591) (8,0.0817) (8,0.0780) (8,0.0560) (8,0.0602) (8,0.0893) (8,0.0732) (8,0.0686) (9,0.1690) (9,0.0701) (9,0.0872) (9,0.0643) (9,0.0926) (9,0.0845) (9,0.1348) (9,0.0786) (9,0.0764) (9,0.0975)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabgreen}] coordinates {
(2,0.0042) (2,0.0044) (2,0.0041) (2,0.0024) (2,0.0021) (2,0.0017) (2,0.0029) (2,0.0022) (2,0.0015) (2,0.0044) (3,0.0118) (3,0.0130) (3,0.0130) (3,0.0117) (3,0.0109) (3,0.0115) (3,0.0111) (3,0.0105) (3,0.0111) (3,0.0106) (4,0.0196) (4,0.0139) (4,0.0165) (4,0.0173) (4,0.0145) (4,0.0136) (4,0.0182) (4,0.0130) (4,0.0170) (4,0.0164) (5,0.0256) (5,0.0241) (5,0.0232) (5,0.0267) (5,0.0226) (5,0.0215) (5,0.0271) (5,0.0229) (5,0.0211) (5,0.0240) (6,0.0225) (6,0.0324) (6,0.0267) (6,0.0351) (6,0.0314) (6,0.0301) (6,0.0328) (6,0.0252) (6,0.0271) (6,0.0320) (7,0.0316) (7,0.0603) (7,0.0377) (7,0.0388) (7,0.0344) (7,0.0470) (7,0.0355) (7,0.0245) (7,0.0467) (7,0.0382) (8,0.0352) (8,0.0466) (8,0.0447) (8,0.0417) (8,0.0485) (8,0.0445) (8,0.0397) (8,0.0517) (8,0.0422) (8,0.0463) (9,0.0763) (9,0.0576) (9,0.0641) (9,0.0574) (9,0.0587) (9,0.0699) (9,0.0703) (9,0.0643) (9,0.0551) (9,0.0605)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabpink}] coordinates {
(2,0.0147) (2,0.0213) (2,0.0139) (2,0.0056) (2,0.0177) (2,0.0118) (2,0.0157) (2,0.0117) (2,0.0100) (2,0.0192) (3,0.0577) (3,0.0624) (3,0.1758) (3,0.0625) (3,0.1071) (3,0.1129) (3,0.0703) (3,0.0474) (3,0.0566) (3,0.0506) (4,0.0723) (4,0.0637) (4,0.0624) (4,0.0775) (4,0.0507) (4,0.0663) (4,0.0684) (4,0.0468) (4,0.0713) (4,0.0671) (5,0.2201) (5,0.1408) (5,0.2389) (5,0.1716) (5,0.1861) (5,0.1427) (5,0.2017) (5,0.1270) (5,0.1291) (5,0.1461) (6,0.4131) (6,0.7728) (6,0.7090) (6,0.4854) (6,0.4947) (6,0.4486) (6,0.8125) (6,0.6973) (6,0.5193) (6,0.7025) (7,0.8310) (7,1.2349) (7,0.7831) (7,0.7353) (7,0.4944) (7,1.0483) (7,0.6979) (7,0.7473) (7,0.9658) (7,0.8207) (8,0.5358) (8,0.8387) (8,0.8841) (8,0.4396) (8,0.8245) (8,0.6733) (8,0.7491) (8,0.5511) (8,0.8142) (8,0.9435) (9,0.6988) (9,0.4106) (9,0.6876) (9,0.6621) (9,0.5055) (9,0.5471) (9,0.3836) (9,0.4649) (9,0.4441) (9,0.5419)
};


\end{axis}
\end{tikzpicture}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.32\textwidth}

\begin{tikzpicture}
\begin{axis}[
    width=\linewidth,
    height=5.8cm,
    ymin=0,
    ymax=35,
    ylabel={Total tokens (millions)},
    xlabel={Minimal DFA States},
    xtick={2,3,4,5,6,7,8,9},
    xmin=1.5,
    xmax=9.5,
    grid=major,
    grid style={gray!18},
    tick label style={font=\scriptsize},
    label style={font=\small},
]

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtaborange}] coordinates {
(2,0.0220) (2,0.1271) (2,0.0530) (2,0.0359) (2,0.0387) (2,0.0082) (2,0.0634) (2,0.0447) (2,0.0459) (2,0.0546) (3,0.0517) (3,0.0554) (3,0.0708) (3,0.1148) (3,0.1700) (3,0.0543) (3,0.0938) (3,0.0988) (3,0.0781) (3,0.1019) (4,0.1701) (4,0.4535) (4,0.1293) (4,0.1802) (4,0.1557) (4,0.1009) (4,0.2550) (4,0.1516) (4,0.1723) (4,0.2458) (5,0.4502) (5,0.5369) (5,1.5522) (5,3.0487) (5,1.4662) (5,0.1922) (5,0.6229) (5,0.3001) (5,0.2442) (5,1.7805) (6,0.4530) (6,2.9037) (6,0.4971) (6,1.0470) (6,0.4056) (6,1.0611) (6,1.3284) (6,0.3332) (6,1.8050) (6,0.3477) (7,3.5260) (7,1.0159) (7,4.3480) (7,0.7901) (7,3.0036) (7,1.0867) (7,3.1314) (7,0.2302) (7,0.9017) (7,3.0057) (8,4.4843) (8,6.1896) (8,4.6895) (8,5.8513) (8,6.6979) (8,1.9913) (8,0.6731) (8,3.8489) (8,5.1763) (8,1.0416) (9,8.3674) (9,5.7277) (9,5.7225) (9,4.9313) (9,5.5307) (9,5.7340) (9,3.1429) (9,5.9946) (9,3.2302) (9,5.7198)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figgold}] coordinates {
(2,0.0093) (2,0.0284) (2,0.0215) (2,0.0124) (2,0.0111) (2,0.0154) (2,0.0180) (2,0.0185) (2,0.0152) (2,0.0225) (3,0.0390) (3,0.0735) (3,0.0585) (3,0.0628) (3,0.1202) (3,0.0843) (3,0.0615) (3,0.1910) (3,0.0611) (3,0.1017) (4,0.3436) (4,0.2486) (4,0.3498) (4,0.2823) (4,0.3328) (4,0.2320) (4,0.5383) (4,0.1799) (4,0.3803) (4,0.2169) (5,0.6899) (5,0.4655) (5,0.7858) (5,0.7525) (5,0.6040) (5,0.3678) (5,0.9285) (5,0.2602) (5,0.4957) (5,0.7486) (6,0.9400) (6,0.9681) (6,0.8807) (6,0.8569) (6,0.3771) (6,0.8780) (6,1.4953) (6,2.8473) (6,0.9635) (6,0.9795) (7,2.0181) (7,3.3242) (7,0.7916) (7,1.6148) (7,1.5297) (7,1.8868) (7,1.9163) (7,0.5011) (7,1.7630) (7,0.4192) (8,2.8436) (8,2.9640) (8,1.6475) (8,2.3235) (8,2.0647) (8,2.1751) (8,1.4392) (8,1.5980) (8,1.7836) (8,2.1250) (9,4.1068) (9,2.4771) (9,2.3930) (9,4.0607) (9,2.5728) (9,3.2646) (9,3.7146) (9,3.3395) (9,1.5978) (9,2.9873)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabblue}] coordinates {
(2,0.0491) (2,0.1715) (2,0.0607) (2,0.0566) (2,0.0581) (2,0.0321) (2,0.0505) (2,0.0410) (2,0.0395) (2,0.0898) (3,0.1084) (3,0.3610) (3,0.2607) (3,0.3435) (3,0.2265) (3,0.2963) (3,0.7080) (3,1.0541) (3,0.3431) (3,0.5870) (4,0.4859) (4,0.3710) (4,0.7236) (4,0.1935) (4,0.4187) (4,0.2780) (4,0.9524) (4,1.2172) (4,1.6387) (4,0.9501) (5,1.3503) (5,0.5162) (5,1.8771) (5,4.3401) (5,0.9136) (5,1.4705) (5,1.2994) (5,1.7560) (5,3.4397) (5,1.5680) (6,2.4935) (6,3.1611) (6,1.0106) (6,12.1791) (6,0.9407) (6,2.9614) (6,1.6751) (6,0.9472) (6,1.4049) (6,4.1070) (7,1.5455) (7,13.4071) (7,12.0044) (7,2.2358) (7,3.9834) (7,3.9864) (7,1.2276) (7,2.7393) (7,1.4000) (7,6.7356) (8,1.2106) (8,3.9161) (8,3.7908) (8,6.9825) (8,2.5818) (8,2.7657) (8,12.0983) (8,4.6485) (8,24.3689) (8,9.2012) (9,5.9351) (9,2.8791) (9,30.4576) (9,10.2466) (9,13.3143) (9,5.2306) (9,5.3530) (9,16.4770) (9,6.5584) (9,8.6886)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabred}] coordinates {
(2,0.0116) (2,0.0151) (2,0.0151) (2,0.0133) (2,0.0151) (2,0.0151) (2,0.0116) (2,0.0151) (2,0.0151) (2,0.0151) (3,0.0694) (3,0.0700) (3,0.0785) (3,0.0697) (3,0.0578) (3,0.0336) (3,0.0729) (3,0.0531) (3,0.0210) (3,0.0657) (4,0.2088) (4,0.1083) (4,0.1418) (4,0.1416) (4,0.0920) (4,0.1130) (4,0.1656) (4,0.0885) (4,0.1370) (4,0.1246) (5,0.3116) (5,0.2589) (5,0.2373) (5,0.3567) (5,0.2286) (5,0.2373) (5,0.3277) (5,0.2451) (5,0.1839) (5,0.2453) (6,0.2479) (6,0.4397) (6,0.4001) (6,0.5177) (6,0.3891) (6,0.3606) (6,0.4866) (6,0.3278) (6,0.3558) (6,0.4728) (7,0.5402) (7,1.7142) (7,0.6064) (7,0.6274) (7,0.5113) (7,0.9269) (7,0.5531) (7,0.3434) (7,0.9636) (7,0.5769) (8,0.4655) (8,0.9363) (8,0.7075) (8,0.8547) (8,0.9613) (8,0.6620) (8,0.7357) (8,1.0619) (8,0.9507) (8,0.8435) (9,2.1730) (9,1.0489) (9,1.2224) (9,1.0101) (9,1.3961) (9,1.2864) (9,1.2723) (9,1.2740) (9,1.0225) (9,1.6289)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabgreen}] coordinates {
(2,0.0340) (2,0.0349) (2,0.0344) (2,0.0179) (2,0.0152) (2,0.0121) (2,0.0243) (2,0.0152) (2,0.0121) (2,0.0328) (3,0.1385) (3,0.1302) (3,0.1569) (3,0.1376) (3,0.1240) (3,0.1297) (3,0.1343) (3,0.1102) (3,0.1181) (3,0.1170) (4,0.3401) (4,0.1866) (4,0.2482) (4,0.2741) (4,0.1801) (4,0.1780) (4,0.2879) (4,0.1735) (4,0.2602) (4,0.2489) (5,0.5397) (5,0.5159) (5,0.4719) (5,0.6146) (5,0.4453) (5,0.4174) (5,0.6013) (5,0.4430) (5,0.3957) (5,0.5025) (6,0.4725) (6,0.8318) (6,0.5364) (6,0.9569) (6,0.7729) (6,0.7258) (6,0.8524) (6,0.5300) (6,0.6139) (6,0.8455) (7,0.7956) (7,2.6654) (7,1.1702) (7,1.1684) (7,0.9742) (7,1.5487) (7,0.9642) (7,0.5377) (7,1.4380) (7,1.0269) (8,0.9347) (8,1.6055) (8,1.4917) (8,1.2585) (8,1.7830) (8,1.2945) (8,1.1865) (8,2.0469) (8,1.2987) (8,1.5716) (9,3.3807) (9,1.9086) (9,2.2564) (9,1.6939) (9,2.0210) (9,2.2020) (9,2.2600) (9,2.3613) (9,1.4624) (9,2.2590)
};

\addplot+[only marks, mark=*, mark size=1.45pt, opacity=0.72, mark options={draw=black, fill=figtabpink}] coordinates {
(2,0.0175) (2,0.0165) (2,0.0167) (2,0.0060) (2,0.0168) (2,0.0106) (2,0.0173) (2,0.0106) (2,0.0088) (2,0.0175) (3,0.0931) (3,0.0982) (3,0.1072) (3,0.0939) (3,0.0960) (3,0.0969) (3,0.1040) (3,0.0667) (3,0.0820) (3,0.0787) (4,0.2590) (4,0.1531) (4,0.2092) (4,0.2184) (4,0.1384) (4,0.1610) (4,0.2199) (4,0.1251) (4,0.1957) (4,0.1741) (5,0.4657) (5,0.3731) (5,0.3727) (5,0.5018) (5,0.3450) (5,0.3044) (5,0.4319) (5,0.2806) (5,0.2727) (5,0.3657) (6,0.3482) (6,0.7082) (6,0.6219) (6,0.6293) (6,0.6566) (6,0.4746) (6,0.7248) (6,0.4446) (6,0.5626) (6,0.6616) (7,0.8170) (7,1.8343) (7,0.8772) (7,0.8639) (7,0.7269) (7,1.2260) (7,0.6928) (7,0.6220) (7,1.5535) (7,0.8072) (8,0.7045) (8,1.3676) (8,1.2313) (8,0.9599) (8,1.4077) (8,0.9582) (8,0.8498) (8,1.4773) (8,1.1779) (8,1.3706) (9,2.1483) (9,1.5768) (9,0.5921) (9,1.6404) (9,1.4288) (9,1.7313) (9,1.5777) (9,1.7055) (9,1.1719) (9,1.7120)
};


\end{axis}
\end{tikzpicture}
\end{subfigure}

%
}
\caption{\textbf{Computational cost, runtime, and token usage of model runs as the number of minimal DFA states increases.} Each point is a run; the x-axis groups runs by the number of states, and point color indicates the model.}
\label{fig:cost_runtime_tokens_by_states}
\end{figure*}
```
#### Automatic Generation of Instances.

A key advantage of agentic automata learning is that task instances can be generated procedurally and at arbitrary scale, while controlling their complexity through properties of the target DFA, such as the number of states. In particular, we use the Boltzmann sampling method, which samples uniformly from the space of accessible DFAs over a binary alphabet,[^2] and then applies rejection sampling to filter out non-minimal DFAs [@BassinoNicaud2007]. While in our experiments we keep the alphabet fixed, this sampling procedure itself does not inherently restrict the alphabet size. This procedural construction allows us to create arbitrarily many environments over a chosen alphabet, rather than relying on a fixed pool of predefined tasks or examples requiring expensive human annotation.

#### The task measures core agentic capabilities.

Agentic automata learning is designed to evaluate several capabilities of a language model operating as an interactive agent, including adaptive planning, reasoning from observations, memory organization and tool use, all of which are required to discover the latent structure of the environment defined by the hidden DFA. Adaptive planning is required as the model receives feedback from the environment after each query, and must choose a subsequent query that reduces uncertainty about the target DFA. Reasoning from observations is required as the accumulated interaction history forms a passive dataset of accepted and rejected words, from which the model can infer properties of the hidden structure. Memory organization is required as the model must maintain consistency with previous observations, avoiding hypotheses that contradict known evidence and queries whose answers are already implied by the interaction history. Finally, tool use is required as the interaction is mediated through membership and equivalence queries, which the model must invoke correctly with valid inputs.

# Experimental Setup

In this section, we describe the different design choices and evaluation criteria used to assess LLMs on the agentic automata learning task.

#### Models and metrics.

We evaluate 6 advanced models: DeepSeek-V4-Pro [@deepseek_v4_pro], a 1.6T-parameter model, Gemini 3.1 Pro Preview, Gemini-3-Flash-Preview (thinking), Gemini-3.1-Flash-Lite-Preview [@google_gemini_models], GPT-5.4 without thinking [@openai_gpt54], Llama-3.3-70B-Instruct-Turbo [@meta_llama33_70b]. Unless otherwise specified, models are used in their vanilla configuration. The set includes models that differ in their use of reasoning tokens, size, architecture, and availability as closed-source or open-source systems. We did not include a GPT reasoning variant in the main evaluation due to the substantially higher computational cost of running a full evaluation, especially given that other reasoning models already provided sufficient coverage for comparison. In addition, a preliminary sample of one task instance per difficulty level did not indicate that this variant was likely to exhibit significantly different performance from the reasoning models evaluated in practice.

We evaluate two aspects of model performance. First *Success Rate* measures the rate at which models successfully identify the hidden DFA: $$\frac{1}{N}\sum_{i=1}^{N} Succ(M, x_i),$$

where $Succ(M, x_i)$ is a binary indicator of whether model $M$ successfully infers the hidden DFA $x_i$, and $N$ is the number of sampled automata.

Second, we measure *interaction efficiency* by measuring the difference between the number of tool calls made by a model $M$ when it successfully finds the hidden DFA and the number of tool calls made by the classic algorithm over successful runs:

$$\frac{1}{|S_{\mathrm{succ}}|}
\sum_{i\in S_{\mathrm{succ}}}
\left(
TC_M(x_i)-TC_{\mathrm{TTT}}(x_i)
\right),$$

where $S_{\mathrm{succ}}$ denotes successful runs, and $TC_A(x_i)$ denotes the number of tool calls used by algorithm $A$ to identify the hidden DFA $x_i$.

#### Dataset construction.

We sample an evaluation dataset consisting of 80 task instances evenly distributed into 4 complexity levels, defined by the number of states in the minimal DFA: 2--3, 4--5, 6--7, and 8--9. This provides a set of target DFAs for evaluating model performance across DFAs of the same complexity level. Although additional instances can be generated automatically, evaluation remains expensive in runtime and monetary cost. Each instance requires a long interactive run, which may involve tens to hundreds of queries and consume millions of tokens. As shown in Figure `\ref{fig:cost_runtime_tokens_by_states}`{=latex}, runtime and cost increase sharply with DFA size, reaching up to about 20 hours, 30 million tokens, and \$50 per run. Overall, the total monetary cost of the full experimental suite was roughly \$1,200. This amount aggregates costs across the services used in our evaluation, including Google Cloud credits for Gemini, the DeepSeek platform, Together AI for Llama, and the OpenAI platform for GPT.

#### Query budget.

The number of queries allowed in each instance is defined to be at most twice the number of queries required by the better of $L^*$ and TTT for the corresponding DFA, both of which have provable upper bounds on the number of queries. In particular, L\* uses $O(|\Sigma| n^2 m)$ queries [@angluin1987learning], while TTT uses $O(|\Sigma| n^2 + n \log m)$ queries [@isberner2014ttt], where $n$ is the number of states in the minimal hidden DFA, $|\Sigma|$ is the alphabet size, and $m$ is the maximum counterexample length. Defining the budget as twice the number of queries guarantees that a correct solution exists within the allocated query budget.

#### Oracle counterexample selection strategy.

For each equivalence query, if the proposed DFA is not equivalent to the hidden DFA, the oracle must return a counterexample (a word on which the two DFAs disagree), yet the classic task formulation does not define *which* counterexample it should return. We return a counterexample designed to balance two considerations. On one hand, the length of the minimal counterexample directly affects the runtime of the classic learning algorithms and, consequently, the query budget allocated to the model. On the other hand, a minimal counterexample provides additional implicit information, since it implies that no shorter word distinguishes between the hypothesis DFA and the hidden DFA. Intuitively, if the oracle returns a minimal counterexample of length $n$, then the two DFAs agree on all words of length $< n$, allowing the agent to infer their membership without explicitly querying them. We therefore sample a short counterexample at random. A full description of the sampling procedure is provided in Appendix `\ref{sec:counterexample}`{=latex}.

#### Task instruction and context construction.

All models are provided with the same instruction describing the task and the interaction rules as defined in the agentic automata learning task (see Appendix `\ref{sec:prompt}`{=latex} for the full instruction). At each interaction step, we construct the agent's context from the observable information accumulated so far. Specifically, the context includes the previous queries issued by the agent, the oracle responses, counterexamples returned by equivalence queries, and the agent's observable hypotheses or final answers. We do not preserve chain-of-thought reasoning across steps: only externally visible outputs and environment feedback are carried forward. This design keeps the interaction history comparable across models and avoids rapidly accumulating reasoning tokens in the context. After running all task instances, we verified that performance was not affected by context truncation. Specifically, for every task instance, the maximum constructed context throughout the interaction remained below the context-window capacity of the corresponding model. Appendix `\ref{sec:context_window}`{=latex} reports the total constructed context size (including accumulated input and generated output tokens) at each interaction step for every task instance.

```{=latex}
\definecolor{figtaborange}{HTML}{FF7F0E}
```
```{=latex}
\definecolor{figtabblue}{HTML}{1F77B4}
```
```{=latex}
\definecolor{figtabgreen}{HTML}{2CA02C}
```
```{=latex}
\definecolor{figtabred}{HTML}{D62728}
```
```{=latex}
\definecolor{figtabpink}{HTML}{E377C2}
```
```{=latex}
\definecolor{figgold}{HTML}{FFD700}
```
```{=latex}
\newcommand{\ColoredErrorBar}[4]{%
    \draw[#1, line width=0.9pt]
        (axis cs:#2,#3) -- (axis cs:#2,#4);
    \draw[#1, line width=0.9pt]
        ([xshift=-3.0pt]axis cs:#2,#3) -- ([xshift=3.0pt]axis cs:#2,#3);
    \draw[#1, line width=0.9pt]
        ([xshift=-3.0pt]axis cs:#2,#4) -- ([xshift=3.0pt]axis cs:#2,#4);
}
```
```{=latex}
\begin{figure*}[t]


\makebox[\textwidth][c]{%
%


\begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}lll@{}}
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtaborange] (0,0) rectangle (0.28,0.13);}~\small Gemini 3.1 Pro Preview &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figgold] (0,0) rectangle (0.28,0.13);}~\small Deepseek-v4-Pro &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabblue] (0,0) rectangle (0.28,0.13);}~\small Gemini-3-Flash-Preview (thinking) \\

\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabred] (0,0) rectangle (0.28,0.13);}~\small GPT-5.4 (without thinking) &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabgreen] (0,0) rectangle (0.28,0.13);}~\small Gemini-3.1-Flash-Lite-Preview &
\tikz[baseline=-0.5ex]{\draw[draw=black, fill=figtabpink] (0,0) rectangle (0.28,0.13);}~\small Llama-3.3-70B-Instruct-Turbo
\end{tabular*}
%
}


\begin{subfigure}[t]{0.43\textwidth}

\begin{tikzpicture}
\begin{axis}[
    ybar,
    width=\linewidth,
    height=6.1cm,
    ymin=0,
    ymax=110,
    ylabel={Success rate (\%) $\rightarrow$},
    xlabel={Number of States in the Minimal DFA},
    xtick={1,2,3,4},
    xticklabels={2--3,4--5,6--7,8--9},
    bar width=4pt,
    enlarge x limits=0.22,
    grid=major,
    grid style={gray!18},
    tick label style={font=\scriptsize},
    label style={font=\small},
]

\addplot+[fill=figtaborange, draw=black, bar shift=-10pt] coordinates {(1,100)};
\addplot+[fill=figgold, draw=black, bar shift=-6pt] coordinates {(1,95)};
\addplot+[fill=figtabblue, draw=black, bar shift=-2pt] coordinates {(1,55)};
\addplot+[fill=figtabred, draw=black, bar shift=2pt] coordinates {(1,55)};
\addplot+[fill=figtabgreen, draw=black, bar shift=6pt] coordinates {(1,30)};
\addplot+[fill=figtabpink, draw=black, bar shift=10pt] coordinates {(1,25)};

\addplot+[fill=figtaborange, draw=black, bar shift=-4pt] coordinates {(2,85)};
\addplot+[fill=figgold, draw=black, bar shift=0pt] coordinates {(2,70)};
\addplot+[fill=figtabblue, draw=black, bar shift=4pt] coordinates {(2,15)};

\addplot+[fill=figtaborange, draw=black, bar shift=-2pt] coordinates {(3,65)};
\addplot+[fill=figgold, draw=black, bar shift=2pt] coordinates {(3,45)};

\addplot+[fill=figtaborange, draw=black, bar shift=-2pt] coordinates {(4,25)};
\addplot+[fill=figgold, draw=black, bar shift=2pt] coordinates {(4,10)};

\end{axis}
\end{tikzpicture}
\end{subfigure}
\hfill
\begin{subfigure}[t]{0.53\textwidth}

\raisebox{0.13cm}{%
\begin{tikzpicture}
\begin{axis}[
    width=\linewidth,
    height=6.1cm,
    ymin=-12,
    ymax=62,
    ylabel={$\leftarrow$ $\Delta$ Tool Calls from TTT},
    xlabel={Number of States in the Minimal DFA},
    xtick={1,2,3,4},
    xticklabels={2--3,4--5,6--7,8--9},
    grid=major,
    grid style={gray!18},
    tick label style={font=\scriptsize},
    label style={font=\small},
]

% Shown model lines: mean Δ tool calls from TTT.
% Vertical bars show ±1 standard deviation.
\addplot+[
    color=figtaborange,
    line width=1.8pt,
    mark=*,
    mark size=1.6pt,
    mark options={solid, fill=figtaborange, draw=figtaborange}
] coordinates {
    (1,-1.30)
    (2,3.76)
    (3,4.31)
    (4,25.40)
};
\ColoredErrorBar{figtaborange}{1}{-4.56}{1.96}
\ColoredErrorBar{figtaborange}{2}{-3.34}{10.86}
\ColoredErrorBar{figtaborange}{3}{-5.89}{14.51}
\ColoredErrorBar{figtaborange}{4}{5.82}{44.98}

\addplot+[
    color=figgold,
    line width=1.8pt,
    mark=*,
    mark size=1.6pt,
    mark options={solid, fill=figgold, draw=figgold}
] coordinates {
    (1,3.32)
    (2,16.64)
    (3,20.78)
    (4,39.00)
};
\ColoredErrorBar{figgold}{1}{-0.76}{7.40}
\ColoredErrorBar{figgold}{2}{10.14}{23.14}
\ColoredErrorBar{figgold}{3}{3.91}{37.65}
\ColoredErrorBar{figgold}{4}{34.76}{43.24}

% Classical baselines: TTT is the zero reference; L* is shown as delta from TTT.
\addplot+[black, dashed, line width=1.5pt, mark=none] coordinates {
    (1,0) (2,0) (3,0) (4,0)
};

\addplot+[gray, dashed, line width=1.5pt, mark=none] coordinates {
    (1,-1.55) (2,2.93) (3,7.22) (4,18.51)
};

\node[
    anchor=north,
    draw=black!35,
    fill=white,
    inner xsep=6.5pt,
    inner ysep=2.6pt,
    minimum width=3.25cm,
    font=\scriptsize
] at (rel axis cs:0.50,0.97) {%%
\makebox[3.05cm][c]{%%
\raisebox{0.15ex}{\tikz{\draw[black, dashed, line width=1.15pt] (0,0) -- (0.42,0);}}TTT%%
%%
\raisebox{0.15ex}{\tikz{\draw[gray, dashed, line width=1.15pt] (0,0) -- (0.42,0);}}L*%%
}%%
};

\end{axis}
\end{tikzpicture}
}
\end{subfigure}

\caption{\textbf{Model performance as the number of states in the minimal DFA increases.}
The left subplot shows success rate; L* and TTT are omitted there because both achieve 100\% success. The right subplot compares successful-run query cost relative to TTT: values above zero require more calls and values below zero require fewer. Some models are not shown in the right subplot because their success rates are low, so measuring tool calls is not meaningful, their average tool calls for successful runs are: Gemini-3-Flash: 0.55 (2--3), 16.67 (4--5); GPT-5.4: 2.64 (2--3); Gemini-3.1-Flash-Lite: -2.50 (2--3); Llama-3.3: 0.00 (2--3).}

\label{fig:success_and_tools}
\end{figure*}
```
# Results

Figure `\ref{fig:success_and_tools}`{=latex} presents our main evaluation results, reporting *Success Rate* which measures whether models identify the hidden DFA, and *$\Delta$ Tool Calls from TTT* which measures interaction efficiency. We complement this analysis by differentiating between planning and reasoning errors in Figure `\ref{fig:task_completion_outcomes_all_models}`{=latex} and examining non-informative query rates in Figure `\ref{fig:noninformative_by_step}`{=latex}. Together, these results characterize task difficulty, reveal differences across models, and quantify the gap between LLM agents and classical active automata learning algorithms. We first highlight the main findings, then turn to a detailed error and cost analysis.

#### *Agentic automata learning* is challenging even for state-of-the-art models and exposes substantial differences in their ability to learn through interaction.

Figure `\ref{fig:success_and_tools}`{=latex} shows that the performance of all models declines as the number of states in the minimal DFA increases. For automata with 8--9 states, no model exceeds 25% success rate, whereas the classic algorithms solve all task instances with 100% success. The gap relative to classic algorithms is also reflected in interaction efficiency: in the same 8--9 state range, when considering only successful runs, Gemini 3.1 Pro, the best-performing model, uses about 45.8% more tool calls than TTT.

Alongside the overall difficulty of the task, the framework also exposes substantial differences between models, particularly for automata with 4 states or more. A clear gap emerges between models that employ reasoning mechanisms and those that do not: in the 4--5 state range, Gemini 3.1 Pro reaches 85% success, while Gemini 3 Flash, which applies reasoning only in part of the interaction steps, reaches 15% and shows intermediate performance. In contrast, GPT-5.4, Gemini 3.1 Flash Lite, and Llama-3.3-70B, which do not employ reasoning mechanisms, reach 0% success for automata with 4 states or more.

```{=latex}
\definecolor{outcomesuccess}{HTML}{6FBF73}
```
```{=latex}
\definecolor{outcomereasoning}{HTML}{F2B84B}
```
```{=latex}
\definecolor{outcomeplanning}{HTML}{E57373}
```
```{=latex}
\pgfplotsset{
    outcomeaxis/.style={
        xbar stacked,
        width=\linewidth,
        height=4cm,
        xmin=0,
        xmax=100,
        xtick={0,10,20,30,40,50,60,70,80,90,100},
        xlabel={Percentage of task instances},
        symbolic y coords={
            Gemini 3.1 Pro,
            Deepseek-v4-Pro,
            Gemini-3-Flash (thinking),
            GPT-5.4 (without thinking),
            Gemini-3.1-Flash-Lite,
            Llama-3.3-70B-Instruct-Turbo
        },
        ytick=data,
        y dir=reverse,
        bar width=8pt,
        grid=major,
        grid style={gray!16},
        tick label style={font=\small},
        label style={font=\small},
        axis line style={black!60},
        tick style={black!60},
        yticklabel style={
            font=\scriptsize,
            text width=3.2cm,
            align=right
        },
        enlarge y limits=0.2,
    }
}
```
```{=latex}
\begin{figure*}[t]


\makebox[\textwidth][c]{%
%


\begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}lll@{}}
\tikz[baseline=-0.5ex]{\draw[draw=white, fill=outcomesuccess] (0,0) rectangle (0.32,0.14);}~\small Success &
\tikz[baseline=-0.5ex]{\draw[draw=white, fill=outcomereasoning] (0,0) rectangle (0.32,0.14);}~\small Reasoning Failure &
\tikz[baseline=-0.5ex]{\draw[draw=white, fill=outcomeplanning] (0,0) rectangle (0.32,0.14);}~\small Planning Failure
\end{tabular*}
%
}


\makebox[\textwidth][c]{%
%


\begin{tikzpicture}
\begin{axis}[outcomeaxis]

\addplot+[draw=white, line width=0.35pt, fill=outcomesuccess]
coordinates {
    (68.75,{Gemini 3.1 Pro})
    (55.00,{Deepseek-v4-Pro})
    (17.50,{Gemini-3-Flash (thinking)})
    (13.75,{GPT-5.4 (without thinking)})
    (7.50,{Gemini-3.1-Flash-Lite})
    (6.25,{Llama-3.3-70B-Instruct-Turbo})
};

\addplot+[draw=white, line width=0.35pt, fill=outcomereasoning]
coordinates {
    (21.25,{Gemini 3.1 Pro})
    (17.50,{Deepseek-v4-Pro})
    (30.00,{Gemini-3-Flash (thinking)})
    (33.75,{GPT-5.4 (without thinking)})
    (20.00,{Gemini-3.1-Flash-Lite})
    (15.00,{Llama-3.3-70B-Instruct-Turbo})
};

\addplot+[draw=white, line width=0.35pt, fill=outcomeplanning]
coordinates {
    (10.00,{Gemini 3.1 Pro})
    (27.50,{Deepseek-v4-Pro})
    (52.50,{Gemini-3-Flash (thinking)})
    (52.50,{GPT-5.4 (without thinking)})
    (72.50,{Gemini-3.1-Flash-Lite})
    (78.75,{Llama-3.3-70B-Instruct-Turbo})
};

\end{axis}
\end{tikzpicture}
%
}

\caption{\textbf{Error analysis}. Each row corresponds to a model, and each stacked bar shows the percentage of task instances classified as successful runs, reasoning failures, or planning failures.}

\label{fig:task_completion_outcomes_all_models}
\end{figure*}
```
#### LLMs do not seem to follow classical algorithms.

A natural concern in evaluating LLMs on interactive automata learning is that they may have been pretrained on classical formulations of the problem, as well as with standard algorithms such as $L*$ and TTT. If so, strong performance might reflect memorization or imitation of known algorithmic solutions rather than the model's ability to solve the learning task interactively.

```{=latex}
\begin{table*}[t]

\renewcommand{\arraystretch}{0.60}
\begin{tabular}{c cc cc cc cc}
\toprule
& \textbf{LLM-based learners}
& \textbf{Classic algorithms} \\
\cmidrule(lr){2-5}
\cmidrule(lr){6-9}
\# step
& Gemini 3.1 Pro
& Llama-3.3-70B
& $L^\star$
& TTT \\
\cmidrule(lr){2-3}
\cmidrule(lr){4-5}
\cmidrule(lr){6-7}
\cmidrule(lr){8-9}
& Tool Call & Resp.
& Tool Call & Resp.
& Tool Call & Resp.
& Tool Call & Resp. \\
\midrule
1 & EQ($\emptyset$) & $bba$ & MQ($\epsilon$) & 0 & MQ($b$) & 1 & MQ($\epsilon$) & 0 \\
2 & EQ($\Sigma^*$) & $aaa$ & MQ($a$) & 0 & MQ($a$) & 0 & EQ($\emptyset$) & $bba$ \\
3 & EQ($b \in w$) & $\top$ & MQ($b$) & 1 & MQ($\epsilon$) & 0 & MQ($bba$) & 1 \\
4 & --- & MQ($bb$) & 1 & MQ($bb$) & 1 & MQ($ba$) & 1 \\
5 & --- & MQ($ab$) & 1 & MQ($ba$) & 1 & MQ($a$) & 0 \\
6 & --- & MQ($ba$) & 1 & EQ($b \in w$) & $\top$ & MQ($b$) & 1 \\
7 & --- & EQ($\neg\epsilon$) & $aa$ & --- & MQ($aa$) & 0 \\
8 & --- & MQ($aa$) & 0 & --- & MQ($baa$) & 1 \\
9 & --- & EQ($b \in w$) & $\top$ & --- & EQ($b \in w$) & $\top$ \\
\bottomrule
\end{tabular}%
% }
\caption{
\textbf{Example interactions between LLM and oracle} (\emph{Resp.} column), for the hidden DFA \emph{accepting all words containing at least one $b$}, --- indicates that the learner has already solved the problem.
Gemini 3.1 Pro solves the task using only equivalence queries, whereas Llama-3.3-70B-Instruct-Turbo, $L^\star$, and TTT use both membership and equivalence queries.
In EQ notation, $\emptyset$ denotes the empty language, $\Sigma^*$ denotes all words, $\neg\epsilon$ denotes all non-empty words, and $b \in w$ denotes the language of words containing $b$.
}
\label{tab:example-trajectories}
\end{table*}
```
We alleviate this concern in three ways. First, since classic algorithms are deterministic, exact imitation can be checked directly by comparing the sequence of queries up to recovery. We find no task instance in which models produces the same query sequence as a classical algorithm. Moreover, in some cases the model recovers the target DFA using fewer queries than the classic algorithms, further suggesting that it is not replaying a known algorithm. Table `\ref{tab:example-trajectories}`{=latex} shows a comparison of interactions between LLMs and classic algorithms.

Second, we find that the best performing model (Gemini 3.1 Pro) overuses equivalence queries (EQs) in comparison to classic algorithms, in which the number of equivalence queries is bounded by the number of states in the hidden DFA. Gemini 3.1 Pro exceeds this bound in 92.5% of the task instances, indicating behavior that differs from the query structure of the classical algorithms.

Third, Gemini 3.1 Pro's behavior violates a *monotonicity* property satisfied by the classic algorithms. For a task instance, monotonicity is preserved if for every hypothesis tested by an EQ, the number of states in that hypothesis is larger than the number of states in all previously tested hypotheses. We found that Gemini 3.1 Pro produced a non-monotonic hypothesis sequence in all interactions. This provides another indication that the model is not merely implementing a noisy or unstable variant of these algorithms.

## Error Analysis

We conduct several analyses in order to better understand the performance of the models and shed light on which aspects of the task are more challenging to outline avenues for future research.

#### Differentiating between planning and reasoning errors.

We classify model failures into two mutually-exclusive categories: planning failures and reasoning failures. A *planning failure* occurs when the model fails and none of the classic passive learning algorithms, RPNI, EDSM, or Blue-Fringe, can infer the hidden DFA from the observations accumulated by the model. In this case, we assume that the model did not collect sufficient information during interaction. In contrast, a *reasoning failure* occurs when the model fails, but at least one passive learning algorithm can infer the hidden DFA from the accumulated observations. Thus, the necessary information was available, but the model failed to infer the correct DFA from it. Figure `\ref{fig:task_completion_outcomes_all_models}`{=latex} shows that all evaluated models suffer from both reasoning and planning failures, although stronger models have proportionally less planning failures compared to the weaker models, where planning failures are dominant.

#### All tested LLMs tend to make non-informative queries as the interaction progresses.

```{=latex}
\definecolor{figtaborange}{HTML}{FF7F0E}
```
```{=latex}
\definecolor{figtabblue}{HTML}{1F77B4}
```
```{=latex}
\definecolor{figtabgreen}{HTML}{2CA02C}
```
```{=latex}
\definecolor{figtabred}{HTML}{D62728}
```
```{=latex}
\definecolor{figtabpink}{HTML}{E377C2}
```
```{=latex}
\definecolor{figgold}{HTML}{FFD700}
```
```{=latex}
\pgfplotstableread[row sep=\\, col sep=space]{
step deepseek gemini_flash gemini_flash_lite llama gpt_no_thinking gemini_pro\\
1 0.08 0.04 0.26 0.21 0.00 0.00\\
2 0.15 0.12 0.92 0.77 0.00 0.00\\
3 0.22 0.24 2.62 2.30 0.00 0.00\\
4 0.25 0.40 6.19 5.61 0.00 0.00\\
5 0.22 0.58 12.27 11.38 0.02 0.00\\
6 0.17 0.77 20.78 19.49 0.06 0.00\\
7 0.13 1.06 30.78 28.94 0.16 0.00\\
8 0.16 1.57 41.06 38.52 0.40 0.00\\
9 0.27 2.44 50.71 47.59 0.89 0.00\\
10 0.44 3.77 59.33 55.98 1.74 0.00\\
11 0.59 5.68 66.68 63.48 3.14 0.00\\
12 0.71 8.28 72.63 69.74 5.24 0.02\\
13 0.86 11.60 77.17 74.57 8.15 0.07\\
14 1.13 15.49 80.58 78.16 11.72 0.19\\
15 1.55 19.55 83.22 80.82 15.38 0.41\\
16 2.04 23.34 85.38 82.87 18.22 0.68\\
17 2.44 26.60 87.18 84.62 19.72 0.91\\
18 2.62 29.40 88.53 86.28 20.19 1.00\\
19 2.60 32.03 89.37 87.92 20.45 0.93\\
20 2.44 34.77 89.78 89.48 21.06 0.83\\
21 2.22 37.60 90.03 90.88 21.94 0.81\\
22 2.10 40.15 90.40 92.10 22.68 0.87\\
23 2.18 42.02 91.05 93.15 23.03 0.95\\
24 2.43 43.29 91.98 94.05 23.18 1.02\\
25 2.74 44.58 93.02 94.79 23.63 1.23\\
26 3.05 46.49 93.86 95.40 24.72 1.82\\
27 3.39 49.16 94.27 95.89 26.13 2.98\\
28 3.78 52.28 94.21 96.25 27.23 4.64\\
29 4.21 55.36 93.86 96.46 27.76 6.40\\
30 4.65 58.00 93.37 96.53 28.19 7.78\\
31 5.11 59.90 92.84 96.59 29.16 8.57\\
32 5.62 60.93 92.42 96.79 30.71 9.09\\
33 6.22 61.23 92.34 97.18 32.23 9.90\\
34 6.95 61.15 92.77 97.73 33.12 11.30\\
35 7.84 61.09 93.51 98.33 33.41 13.00\\
36 8.81 61.36 94.23 98.85 33.62 14.42\\
37 9.59 62.12 94.66 99.17 34.14 15.38\\
38 9.93 63.41 94.79 99.24 34.95 16.14\\
39 10.00 65.08 94.71 99.11 35.69 17.10\\
40 10.34 66.75 94.59 98.92 36.07 18.35\\
41 11.31 68.01 94.63 98.79 36.26 19.70\\
42 12.59 68.65 94.94 98.77 36.62 20.90\\
43 13.40 68.86 95.49 98.84 37.30 21.86\\
44 13.23 69.07 96.17 98.97 37.93 22.79\\
45 12.41 69.56 96.76 99.11 37.99 24.04\\
46 11.79 70.13 97.02 99.20 37.43 25.98\\
47 11.93 70.42 96.74 99.22 36.75 28.80\\
48 12.62 70.27 95.94 99.21 36.48 32.27\\
49 13.18 69.85 94.98 99.20 36.69 35.72\\
50 13.21 69.49 94.32 99.22 37.12 38.53\\
51 12.98 69.38 94.14 99.28 37.62 40.43\\
52 13.10 69.54 94.33 99.35 38.31 41.51\\
53 13.99 69.83 94.70 99.43 39.16 41.98\\
54 15.62 70.04 95.18 99.52 39.94 42.17\\
55 17.47 70.04 95.73 99.59 40.60 42.52\\
56 18.93 69.75 96.28 99.58 41.35 43.37\\
57 19.67 69.20 96.78 99.50 42.22 44.63\\
58 19.86 68.58 97.20 99.40 42.82 45.80\\
59 19.88 68.16 97.53 99.31 42.78 46.42\\
60 20.04 68.02 97.74 99.25 42.19 46.32\\
61 20.43 67.98 97.76 99.21 41.52 45.86\\
62 21.06 67.89 97.59 99.16 41.00 45.68\\
63 21.79 67.86 97.38 99.10 40.62 46.37\\
64 22.40 68.11 97.33 99.03 40.50 47.88\\
65 22.61 68.57 97.51 98.99 40.99 49.53\\
66 22.27 68.91 97.83 98.98 42.17 50.51\\
67 21.54 68.86 98.14 98.99 43.43 50.47\\
68 20.88 68.36 98.41 98.97 43.99 49.67\\
69 20.63 67.45 98.66 98.92 43.76 48.60\\
70 20.74 66.30 98.85 98.88 43.47 47.58\\
71 20.95 65.31 98.88 98.90 43.94 46.88\\
72 21.26 64.99 98.72 98.96 45.35 46.90\\
73 21.90 65.51 98.48 99.06 47.25 48.10\\
74 23.02 66.45 98.38 99.21 48.94 50.43\\
75 24.61 67.10 98.55 99.40 49.91 53.17\\
76 26.57 66.95 98.91 99.62 50.14 55.63\\
77 28.71 65.95 99.25 99.80 50.07 57.57\\
78 30.55 64.41 99.36 99.92 50.15 58.85\\
79 31.40 62.85 99.14 99.97 50.47 59.00\\
80 30.91 61.87 98.64 99.99 50.95 57.67\\
81 29.44 61.70 98.03 100.00 51.57 55.33\\
82 27.99 62.07 97.50 100.00 52.17 53.14\\
83 27.43 62.38 97.17 100.00 52.48 52.10\\
84 27.84 62.35 97.00 100.00 52.52 52.40\\
85 28.70 62.19 96.90 100.00 52.73 53.65\\
86 29.49 62.23 96.82 100.00 53.41 55.24\\
87 30.09 62.51 96.79 100.00 54.30 56.72\\
88 30.45 62.77 96.87 100.00 54.77 57.89\\
89 30.45 62.76 97.15 100.00 54.43 58.83\\
90 29.97 62.34 97.64 100.00 53.62 59.70\\
91 28.98 61.29 98.17 100.00 53.19 60.48\\
92 27.57 59.56 98.40 100.00 53.81 60.93\\
93 25.93 57.58 98.07 100.00 55.48 60.77\\
94 24.35 56.07 97.23 100.00 57.69 59.75\\
95 23.06 55.44 96.31 100.00 59.99 57.78\\
96 22.16 55.51 95.69 100.00 62.24 55.28\\
97 21.81 55.89 95.28 100.00 64.42 53.39\\
98 22.35 56.30 94.42 100.00 66.53 53.30\\
99 23.93 56.57 92.63 100.00 68.52 55.25\\
100 26.24 56.58 90.10 100.00 70.32 58.32\\
}
```
```{=latex}
\nonInformativeStepTable
```
<figure id="fig:noninformative_by_step">

<figcaption><strong>Non-informative query rate by interaction step.</strong> For each model and tool-call number, the value reports the percentage of queries at that step that are non-informative across all task instances. Curves are smoothed using a 1D Gaussian filter (<span class="math inline"><em>σ</em> = 2</span>) for visualization. We report values up to step 100, as only a small number of task instances continue beyond this point.</figcaption>
</figure>

We measure how consistently LLMs use the information accumulated during interaction through the notion of *non-informative queries*: tool calls whose answers can be implied by previous oracle responses and therefore cannot provide new evidence.

For membership queries, a query is non-informative if the same word has already been queried. For equivalence queries, a query is non-informative if it duplicates a previous hypothesis, or if the proposed DFA is already contradicted by known evidence; that is, if there exists a previously queried word that the hypothesis classifies differently from the oracle. For each model, we compute the percentage of non-informative queries issued at each interaction step across all task instances.

Figure `\ref{fig:noninformative_by_step}`{=latex} shows that non-informative queries become increasingly common as interactions grow longer. After roughly 60 steps, even the most consistent model, DeepSeek-V4-Pro, issues non-informative queries about 20% of the time. This pattern appears across all evaluated LLMs, whereas classic active-learning algorithms maintain 0% non-informative queries by construction. These results suggest that as evidence accumulates in context, LLM agents increasingly struggle to organize and use it consistently.

```{=latex}
\definecolor{aaibudgetorange}{HTML}{FF7F0E}
```
```{=latex}
\definecolor{aaibudgetblue}{HTML}{1F77B4}
```
```{=latex}
\definecolor{aaibudgetgreen}{HTML}{2CA02C}
```
```{=latex}
\definecolor{aaibudgetred}{HTML}{D62728}
```
```{=latex}
\definecolor{aaibudgetpink}{HTML}{E377C2}
```
```{=latex}
\definecolor{aaibudgetgold}{HTML}{FFD700}
```
<figure id="fig:budget_factor_cost">

<figcaption><strong>Average cost as a function of the query-budget factor.</strong> The query budget factor determines the number of queries allocated to each task by multiplying the number of queries required by the more efficient classical learner between L* and TTT. Here, we examine the average cost per task instance for factors of 1, 1.25, 1.5, 1.75, and 2.</figcaption>
</figure>

## Cost Analysis

A central question in modern evaluations of LLMs is the computational and monetary cost. This issue is especially important for agents, where each task instance may involve a long sequence of interactions: each additional interaction extends the context window and adds runtime. Figure `\ref{fig:budget_factor_cost}`{=latex} illustrates the cost of our evaluation framework, which grows linearly with the number of tool calls, with Gemini 3.1 Pro driving most of the cost. We further analyze the need for such budget, and find that it is critical and allows models to solve DFAs which they could not find with a smaller budget, as shown in Figure `\ref{fig:query_budget_factor_success}`{=latex} in the Appendix.

# Related Work

Recent benchmarks for LLM-based agents often evaluate models through end tasks in complex environments, such as web navigation, software engineering, tool use, or embodied interaction [@park2026orak; @zheng2025lifelongagentbench]. While these benchmarks are important for measuring practical agent performance, their complexity makes it difficult to isolate specific capabilities such as adaptive information gathering, memory organization, and reasoning over accumulated observations.

Prior work on learning world models with language models primarily focuses on training models, rather than interactive learning at inference time as required from an agent [@zhang2023chess; @vafa2024worldmodel]. As a result, these approaches do not capture key agentic capabilities. In addition, these works do not analyze model performance as a function of changes in latent-structure complexity.

In recent years, connections have also been established between neural networks and active automata learning [@weiss2018extracting; @chen2024llms; @vazquez2025lstarlm]. In these works, the model is trained to represent a DFA, which is subsequently extracted from the model's weights using algorithms such as L\*.

# Conclusion

We introduce *agentic automata learning*, a controlled framework for evaluating whether language models can infer hidden structure through interaction. By modeling environments as hidden DFAs and allowing membership and equivalence queries, the framework measures both final success and the discovery process. Our results show that even strong models struggle as complexity increases, with failures in accuracy, planning, reasoning, tool use, and management of accumulated information. These findings motivate evaluations that go beyond outcomes to examine how agents search, update beliefs, and use feedback over time.

Future work can extend this framework beyond DFAs to environments with non-determinism or stochasticity. It could also relax the assumption of exact oracle feedback by introducing noisy, partial, delayed, or incorrect signals, or by replacing equivalence queries with weaker feedback. Finally, the framework could serve not only for inference-time evaluation, but also as a training environment, for example through reinforcement learning.

# Limitations

Our evaluation framework is costly in terms of runtime, token usage, and API spend, and these costs increase with task complexity. In total, the full evaluation suite cost approximately \$1,200 for 480 runs, or \$2.50 per datapoint on average, with Gemini 3.1 Pro accounting for most of the expense. This cost profile is not unique to our setting: recent work has noted that modern language-model benchmarks increasingly incur substantial computational costs, sometimes reaching thousands of GPU hours per evaluated model [@perlitz-etal-2024-efficient]. For example, the recent ProgramBench reports an average evaluation cost of approximately \$5 per task instance, twice the average cost per task instance incurred by our framework [@yang2026programbench].

# Acknowledgments

This research was supported in part by the Google Cloud Research Credits Program. We would like to thank Prof. Orna Kupferman for suggesting the automata-learning setting that inspired this work, and Prof. Dana Fisman and Gal Meirom for their valuable advice and insights.

```{=latex}
\appendix
```
```{=latex}
\clearpage
```
# Initial Prompt Template {#sec:prompt}

We provide the full prompt template used in our experiments. This prompt is given to the model at the beginning of each task instance and defines the interaction protocol under which the model operates.

```{=latex}
\medskip
```
`\noindent`{=latex}

**Prompt Summary.** The model is instructed to reconstruct an unknown deterministic finite automaton (DFA). A fixed target DFA exists, but its structure is not directly accessible. The model must learn about it through interaction with available interactive tools and its own internal reasoning.

```{=latex}
\medskip
```
**Game Rules.** The model is given a budget of tool calls. At each step, it must issue exactly one tool call. On the final step, it must invoke the evaluation tool.

```{=latex}
\medskip
```
**Scoring Policy.** The model is encouraged to minimize the number of tool calls while correctly reconstructing the automaton. Tool usage should be efficient and aimed at obtaining informative results. If the model believes it has identified the correct DFA, it should invoke the evaluation tool immediately, even if calls remain.

```{=latex}
\medskip
```
**Strict Output Policy.** The model must output exactly one `TOOL_ACTION` block at each step. The output must be a JSON object of the form:

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> { \"tool_name\": \"\<tool_name\>\", \"input\": { \... } }

The model must not produce natural language explanations or any additional text.

```{=latex}
\medskip
```
**Tools (Available Interactive Tools).**

`is_word_in_language`: performs a membership query and returns whether a given word is accepted by the target DFA.

`evaluate_dfa_candidate`: evaluates a proposed DFA. If incorrect, a **counterexample** is returned, highlighting the discrepancy and guiding refinement.

```{=latex}
\medskip
```
**Hints.** The model is given access to the DFA alphabet, as in classic automata learning algorithms.

# Example Interaction Format {#sec:interaction_example}

We provide an example interaction illustrating the protocol under which the model operates during a task instance: `\medskip`{=latex}

```{=latex}
\medskip
```
**Initial Prompt:** The interaction begins with the task prompt described in Appendix `\ref{sec:prompt}`{=latex}

```{=latex}
\medskip
```
**Assistant:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_ACTION\>\
> { \"tool_name\": \"is_word_in_language\",\
>   \"input\": { \"word\": \"\" } }\
> \</TOOL_ACTION\>

**User:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_RESULT\>\
> {\"tool_outputs\": \[{\
>   \"tool_name\": \"is_word_in_language\",\
>   \"call_count\": 1,\
>   \"error\": null,\
>   \"output\": {\"word\": \"$\epsilon$\", \"accepted\": true} }\]}\
> \</TOOL_RESULT\>

```{=latex}
\medskip
```
**Assistant:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_ACTION\>\
> { \"tool_name\": \"is_word_in_language\",\
>   \"input\": { \"word\": \"a\" } }\
> \</TOOL_ACTION\>

**User:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_RESULT\>\
> {\"tool_outputs\": \[{\
>   \"tool_name\": \"is_word_in_language\",\
>   \"call_count\": 2,\
>   \"error\": null,\
>   \"output\": {\"word\": \"a\", \"accepted\": true} }\]}\
> \</TOOL_RESULT\>

```{=latex}
\medskip
```
`\noindent`{=latex}**Skipped tool calls:** 3--10.

```{=latex}
\medskip
```
```{=latex}
\medskip
```
**Assistant:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_ACTION\>\
> { \"tool_name\": \"evaluate_dfa_candidate\",\
>   \"input\": {\
>     \"candidate_dfa\": {\
>       \"states\": \[\"q0\",\"q1\"\],\
>       \"alphabet\": \[\"a\",\"b\"\],\
>       \"start_state\": \"q0\",\
>       \"accept_states\": \[\"q0\"\],\
>       \"transitions\": \[\
>         \[\"q0\",\"a\",\"q0\"\],\
>         \[\"q0\",\"b\",\"q1\"\],\
>         \[\"q1\",\"a\",\"q0\"\],\
>         \[\"q1\",\"b\",\"q1\"\]\]}}}\
> \</TOOL_ACTION\>

**User:**

> ```{=latex}
> \small
> ```
> ```{=latex}
> \ttfamily
> ```
> \<TOOL_RESULT\>\
> {\"tool_outputs\": \[{\
>   \"tool_name\": \"evaluate_dfa_candidate\",\
>   \"call_count\": 11,\
>   \"error\": null,\
>   \"output\": {\"score\": 1.0,\"optimal\": true,\"witness_word\": null}}\]}\
> \</TOOL_RESULT\>

```{=latex}
\medskip
```
**Termination:** The interaction ends once the evaluation tool indicates that the proposed DFA is equivalent to the hidden DFA, as reflected by \"score\": 1.0 and \"optimal\": true.

```{=latex}
\clearpage
```
`\label{appendix:symmetric_difference_similarity}`{=latex}

# Best Hypothesis Similarity Metric

<figure id="fig:best_hypothesis_similarity_appendix">

<figcaption>Best Hypothesis Similarity as the number of states in the minimal DFA increases. The shaded regions represent one standard deviation, and the dashed gray line represents a random hypothesis baseline.</figcaption>
</figure>

To complement the binary success metric, we define a continuous similarity measure that captures how close the best hypothesis proposed during interaction is to the hidden DFA.

For each task instance, we compute the similarity between every intermediate hypothesis proposed by the model and the hidden DFA, and report the highest similarity value achieved throughout the interaction, referred to as *Best Hypothesis Similarity*. This extends the binary success metric by providing a finer-grained measure of progress toward the hidden DFA.

Formally, for a hypothesis DFA $h$ and a hidden DFA $x$, we compute a symmetric-difference-based similarity approximation over all words up to length $k=200$:

$$\mathrm{Sim}_{\triangle,200}(h,x)
=
1-
\frac{|L(h)_{\leq 200}\triangle L(x)_{\leq 200}|}
{|\Sigma_{\leq 200}|}$$

Here, $L(h)_{\leq 200}$ and $L(x)_{\leq 200}$ denote the sets of words of length at most 200 accepted by the hypothesis and hidden DFAs, respectively, $\triangle$ denotes the symmetric difference between the two languages, and $\Sigma_{\leq 200}$ denotes the set of all words over the alphabet whose length is at most 200.

This formulation measures the fraction of words on which the two DFAs agree, producing a similarity score in $[0,1]$, where higher values indicate greater behavioral agreement between the hypothesis and hidden languages. We use symmetric-difference similarity rather than Jaccard similarity because it is better aligned with the structure of our sampled DFA distribution. Since our DFA generation process produces languages in which, in expectation, a word has approximately equal probability of being accepted or rejected, the resulting language space is relatively balanced. In this setting, normalizing by the full input space $|\Sigma_{\leq 200}|$ provides a stable and directly interpretable measure of behavioral agreement across the entire observable space. In contrast, Jaccard similarity normalizes by the union of accepted words, making it more sensitive to variation in language density and less suitable for our objective of measuring agreement over the full input space. Although the number of words of length at most 200 is exponential, approximately $2^{200}$ for a binary alphabet, this metric can be computed efficiently using dynamic programming over the product DFA of the two DFAs. This enables counting the number of words in the symmetric difference without explicitly enumerating all possible strings.

Beyond aggregate evaluation, this metric enables tracking intermediate hypothesis similarity throughout interaction, as shown in Figure `\ref{fig:eq_similarity_example}`{=latex}. This reveals whether a model gradually approaches the target DFA or instead oscillates between hypotheses without consistent and meaningful improvement over time.

<figure id="fig:eq_similarity_example">

<figcaption> Example interaction trajectory for Gemini 3.1 Pro on a single task instance. The graph tracks the symmetric-difference similarity between the hidden DFA language and the hypotheses proposed throughout interaction. </figcaption>
</figure>

# How Counterexamples Are Sampled {#sec:counterexample}

In each equivalence query between two DFAs, we construct the symmetric difference automaton, which accepts exactly the words on which the two automata disagree. A counterexample is a word accepted by this automaton.

We first compute the minimal length $k_{\min}$ of a word accepted by the symmetric difference automaton. Instead of selecting a single word of this minimal length, we define a set of candidate words consisting of short words whose lengths lie in the range $k_{\min}$ to $k_{\min} + 3$.

The selection from this candidate set is performed in a pseudo-random yet deterministic manner. Specifically, for each given pair of automata (the target DFA and the hypothesis), we compute a fixed seed using a stable hash function over their representations. This seed is then used to select a single word from the candidate set.

This approach ensures stability by maintaining identical conditions across runs, while at the same time avoiding providing the model with additional information that could simplify the problem.

# Effect of Query Budget on Success Rate {#sec:query_budget_factor}

```{=latex}
\definecolor{figtaborange}{HTML}{FF7F0E}
```
```{=latex}
\definecolor{figtabblue}{HTML}{1F77B4}
```
```{=latex}
\definecolor{figtabgreen}{HTML}{2CA02C}
```
```{=latex}
\definecolor{figtabred}{HTML}{D62728}
```
```{=latex}
\definecolor{figtabpink}{HTML}{E377C2}
```
```{=latex}
\definecolor{figgold}{HTML}{FFD700}
```
<figure id="fig:query_budget_factor_success">

<figcaption> <strong>Success rate as a function of the query budget factor.</strong> For each task instance, the allowed query budget is computed as the factor multiplied by the smaller number of queries required by <span class="math inline"><em>L</em><sup>⋆</sup></span> and TTT. A model is counted as successful only if it reaches the correct DFA within this factor multiplied budget. </figcaption>
</figure>

As shown in Figure `\ref{fig:query_budget_factor_success}`{=latex}, increasing the query budget generally improves the success rate, but the effect is highly model-dependent. While some models, such as Gemini 3.1 Pro, show only modest additional gains as the budget increases, others benefit substantially; for example, Deepseek-v4-Pro improves from 0% to 55% success.

```{=latex}
\clearpage
```
# Context Window Analysis {#sec:context_window}

![image](figures/D1.png){width="\\textwidth" height="15.2cm"}

```{=latex}
\captionof{figure}{
Context window size throughout interaction. Each line represents a task instance, while colors correspond to different models. The figure shows how context size evolves across tool-call steps during task execution.
}
```
`\label{fig:context_window_analysis}`{=latex}

[^1]: Project page: <https://reefmenaged.github.io/Agentic_Automata_Learning>.

[^2]: A DFA is accessible if every state can be reached from the initial state by some input word.