---
abstract: |
  Hybrid language models that mix attention and recurrent layers have shown promise: theoretically, recurrent layers ameliorate the limitations of pure transformers on state tracking, and empirically, hybrids can outperform pure transformers in loss and downstream evaluations [@waleffe2024empirical; @merrill2026olmohybrid]. Yet it remains unclear which data or capabilities drive these gains, and to what degree they reflect the theoretical advantages motivating hybrid models. We address this question using the open weights from Olmo 3 [@olmo2025olmo3] and Olmo Hybrid [@merrill2026olmohybrid]: we compare the loss of a matched transformer and hybrid at the same target tokens under the same prefixes, stratifying the results by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid has lower loss on most tag families, but the gains are not uniform: they are largest for open-class content words and smaller for many closed-class function words. Across prose, code, and markup, the hybrid's loss advantage is larger on opening delimiters than on the corresponding closing delimiters, and nearly vanishes on repeated $n$-grams. Synthetic probes show the same split: the hybrid is favored on pronoun-memory and entity-tracking tasks, whereas the transformer is favored on bracket-matching tasks that require choosing closing delimiters. These patterns suggest that the recurrent layers in hybrids improve predictions that leverage the semantic state of a document, whereas attention helps on tokens predictable by $n$-gram copying or syntactic bracket matching. We conclude with proof-of-concept filtered evaluations showing how token-level decompositions can sharpen pretraining diagnostics for hybrid architectures.
author:
- Yanhong Li
- William Merrill
bibliography:
- references.bib
title: Comparing Transformers and Hybrid Models at the Token Level
---

# Abstract

Hybrid language models that mix attention and recurrent layers have shown promise: theoretically, recurrent layers ameliorate the limitations of pure transformers on state tracking, and empirically, hybrids can outperform pure transformers in loss and downstream evaluations [@waleffe2024empirical; @merrill2026olmohybrid]. Yet it remains unclear which data or capabilities drive these gains, and to what degree they reflect the theoretical advantages motivating hybrid models. We address this question using the open weights from Olmo 3 [@olmo2025olmo3] and Olmo Hybrid [@merrill2026olmohybrid]: we compare the loss of a matched transformer and hybrid at the same target tokens under the same prefixes, stratifying the results by natural token tags, copy features, delimiter structure, and controlled synthetic probes. The hybrid has lower loss on most tag families, but the gains are not uniform: they are largest for open-class content words and smaller for many closed-class function words. Across prose, code, and markup, the hybrid's loss advantage is larger on opening delimiters than on the corresponding closing delimiters, and nearly vanishes on repeated $n$-grams. Synthetic probes show the same split: the hybrid is favored on pronoun-memory and entity-tracking tasks, whereas the transformer is favored on bracket-matching tasks that require choosing closing delimiters. These patterns suggest that the recurrent layers in hybrids improve predictions that leverage the semantic state of a document, whereas attention helps on tokens predictable by $n$-gram copying or syntactic bracket matching. We conclude with proof-of-concept filtered evaluations showing how token-level decompositions can sharpen pretraining diagnostics for hybrid architectures.

# Introduction

Hybrid language models that mix attention with recurrent sequence layers have recently challenged the attention-only transformer as the default architecture for large-scale language modeling. Theoretically, hybrid models are motivated by the complementary expressivity of attention and recurrent layers: attention allows retrieving information from past tokens, which is useful for copying and syntactic tasks like bracket matching, while recurrent layers offer advantages for constructing and updating latent state in ordered state-tracking computations [@merrill2024illusion; @grazzi2025unlocking]. Theoretically, hybrid models inherit both of these strengths and, empirically, they can outperform transformers with the same pretraining budget in loss and downstream benchmarks [@waleffe2024empirical; @merrill2026olmohybrid]. While these empirical results are impressive, it remains unclear what finegrained predictions are improved by adding hybrid layers to a model, or to what degree these relate to the theoretical advantages of hybrid models. In this work, we therefore aim to understand the finegrained prediction differences between transformers and hybrid models *at the token level*, asking the following two questions:

1.  ***What kinds of tokens and prediction contexts do hybrid models predict better than transformers?*** More concretely, which individual token occurrences account for the lower average loss of a hybrid model relative to a matched transformer?

2.  ***Do those token occurrences match the theoretical advantages of the component architectures?*** More specifically, do hybrid-favored tokens look like the expressivity advantages associated with recurrence, such as ordered state tracking, while transformer-favored tokens look like the advantages associated with attention, such as recall from the visible prefix?

We study these questions by comparing two released 7B-scale models from the same recipe family, Olmo 3 7B and Olmo Hybrid 7B [@olmo2025olmo3; @merrill2026olmohybrid]. For a target position $i$ with prefix $x_{<i}$ and observed token $x_i$, we compute paired token losses $$\ell_i^{\mathrm{Tr}}=-\log p_{\mathrm{Tr}}(x_i\mid x_{<i}),\qquad
\ell_i^{\mathrm{Hyb}}=-\log p_{\mathrm{Hyb}}(x_i\mid x_{<i}),$$ and the paired gap $\Delta_i=\ell_i^{\mathrm{Tr}}-\ell_i^{\mathrm{Hyb}}$. Thus $\Delta_i>0$ means that the hybrid assigns higher probability, equivalently lower NLL, to the observed next token at the same prefix. This paired quantity lets us move from the average question \`\`which model has lower loss?" to the token-level question \`\`which prediction events produce the gain?"

Our first analysis is observational. We compute $\Delta_i$ over prose, code, and markup, then align target tokens to surface tags: POS tags for prose and source-level categories such as identifiers, strings, comments, text nodes, attributes, commands, brackets, and tags for structured text. We report raw tag-stratified means, which describe the absolute hybrid advantage on the actual corpus positions, and regression robustness checks, which ask whether the same patterns remain after controlling for difficulty, frequency, position, subword status, and local reuse. The regression analysis is complementary to the raw summaries: it asks whether the tag-level patterns persist after controlling for difficulty, frequency, position, subword status, and local reuse.

The natural-token results are non-uniform. The hybrid has lower loss on most token families, but the advantage is largest on open-class content-bearing categories. In prose, content words have a larger raw gap than function words, and the aggregate content--function contrast remains after controls. In structured domains, hybrid-favored categories include identifiers, strings, comments, text nodes, attribute values, and commands. By contrast, the hybrid advantage is smaller on closing delimiters and rigid formatting tokens, and it approaches zero on long repeated $n$-grams. Thus, while the hybrid wins overall---especially on open-class words---the transformer remains competitive, and sometimes favored, in contexts where attention is the important primitive: retrieving material already present in the visible prefix or satisfying an already-open structural obligation.

Our second analysis uses controlled synthetic probes to separate delayed information from the type of computation required at the target. In a pronoun-memory probe, the model sees people with roles and later must choose the pronoun for the person filling a queried role. In an entity-tracking probe, it sees entities bound to attributes and later must choose which entity has a queried attribute. In a structural-closure probe, it sees an opened bracket or tag region and later must predict the required closing token. The first two probes require readout of a maintained role or entity--attribute binding and favor the hybrid. The closure probe also involves delayed information, but the answer is already determined by an opener visible in the prefix, and it favors the transformer. Thus the relevant distinction is not simply short versus long dependency, but whether the target is primarily a state-conditioned choice, a visible-prefix copy, or a closure decision.

These results point to two different follow-on uses of the token-level analysis. The first is for designing controlled state-tracking benchmarks. Existing state-tracking tasks often ask a model to update a small, fixed set of symbols according to a sequence of instructions and then report the final symbol or value. Such closed-world tasks are useful because they isolate ordered update and readout, but once architectures solve the fixed-slot setting, they become less informative for distinguishing future recurrent layers. Our results suggest a more demanding, open-world extension: discourse state tracking. In these tasks, the input can introduce new people, objects, variables, or document regions over time, change their attributes or relations, and later require a context-dependent prediction, such as which entity, value, or content word should come next. This framing keeps the connection to state tracking while moving from fixed symbolic slots to the growing, relational state needed in natural text, code, and markup. Controlled discourse-state-tracking tasks would therefore provide a more demanding testbed for developing linear-RNN and hybrid sequence layers.

The second use is for comparing architectures during pretraining experiments. In 1B-scale development runs comparing a Transformer, a Hybrid, and a Pure RNN, aggregate validation loss compresses distinct regimes into a single number. Filtered token losses could be a useful way to compare and contrast architectures in this setting: losses on hybrid-favored non-copy tokens separate architectures more sharply, while a copy-only filter exposes the complementary regime in which attention-based models outperform the Pure RNN on visible-prefix reuse. Because these filters are computed from the same per-token NLL as standard validation, they add little overhead while providing a more capability-resolved view of training progress. Reporting filtered token losses alongside aggregate validation loss can therefore show not only whether a design improves overall perplexity, but which predictive capabilities it improves or sacrifices.

# Expressivity Background and Empirical Hypotheses {#sec:architecture-state}

Expressivity theory gives a useful way to separate the computations that can be hidden inside next-token prediction. We organize the background by architectural primitive: attention, recurrence, and their combination in hybrid models.

**Transformer expressivity: copy/recall and structural matching.** Attention gives transformers a direct mechanism for selecting positions in the visible prefix. Under standard formalizations, this lets transformers express recall-style computations such as $n$-gram retrieval, and it also supports many bracket-matching or structural-matching problems by making the relevant earlier position or opener accessible [@weiss2021thinkingliketransformers; @yao2023selfattentionnetworksprocessbounded]. Thus attention should be especially useful when the next token can be recovered by reusing already-visible material or by matching an explicit structural opener:

1.  `\label{ex:copying}`{=latex} *John works for the National Hamburger Association of America. The National Hamburger Association of [America]{.underline} $\ldots$*

2.  `\label{ex:parens}`{=latex} `( [ ] [ { ( ) } ] `[`)`]{.underline}.

The first example asks for copy/recall from the visible prefix, while the second asks for structural matching against a visible opener. In contrast, copy/recall is a known limitation of pure recurrent models with bounded state: copying or recalling arbitrary prefix information can require storing more information than a fixed-size recurrent state can retain [@arora2024zoology; @jelassi2024repeat; @merrill2026olmohybrid].

**RNN expressivity: ordered state tracking.** The complementary limitation of fixed-depth transformers is ordered state composition. Under standard fixed-depth and log-precision assumptions, transformer next-token predictors are contained in low-depth threshold-circuit classes such as $\mathsf{TC}^0$; consequently, they cannot express general $\mathsf{NC}^1$-complete ordered state-composition problems unless $\mathsf{TC}^0=\mathsf{NC}^1$ [@merrill-sabharwal-2023-parallelism; @chiang2025transformers]. In contrast, modern linear RNNs with sufficiently expressive transition matrices, including DeltaNet/GDN variants with negative eigenvalues, can represent such state-tracking computations [@merrill2024illusion; @grazzi2025unlocking; @merrill2026olmohybrid]. These results motivate the complementary hypothesis that predictions requiring ordered state updates should favor models with recurrent layers over pure transformers. A simple program-state example is:

1.  `\label{ex:program-state}`{=latex} `a,b,c = 1,2,3; a,c = c,a; assert a == `[3]{.underline}.

**Hybrid expressivity: composing recall with state.** Hybrid models combine attention layers, which support recall from the visible prefix, with recurrent layers, which support ordered state updates. This combination should allow them to handle contexts like `\Cref{ex:copying,ex:parens,ex:program-state}`{=latex}, where a pure RNN or a pure transformer is missing one of the relevant primitives. Moreover, hybrids are not merely the union of two independent capabilities: @merrill2026olmohybrid show that a GDN--attention hybrid can solve *state-based recall*, where a model must track updates to a pointer and then use the resulting pointer to retrieve a value from the prefix, while neither a pure transformer nor a pure GDN model can express the full problem under the standard assumptions. A code-like example is:

1.  `\label{ex:state-based-recall}`{=latex} `bits = [0,1,0,0,`$\ldots$`]; a,b,c = 3,1,2; a,c = c,a; assert bits[a] == `[0]{.underline}.

This expressivity picture gives the empirical hypotheses for the rest of the paper. Next-token prediction is not a single homogeneous computation: we treat each prefix as inducing a latent discourse/program state, and we view each target token as a query against, or an update to, some aspect of that state. Some targets are recoverable from the visible prefix; others require ordered state construction and state-conditioned readout; still others compose both. We therefore expect attention-heavy transformers to be competitive on visible-prefix reuse and structural closure, recurrent layers to help on state-conditioned predictions, and hybrids to be strongest when natural language, code, or markup requires both. The remainder of the paper tests where these regimes appear in ordinary next-token prediction across natural language, code, and markup.

# Empirical Methodology {#sec:empirical_method}

We ask which *individual next-token predictions* account for the loss gap between a hybrid model and a matched transformer. The basic unit of analysis is one target position. For a packed token sequence $x_{1:L}$ and target position $i$, both models are evaluated on the same prefix $x_{<i}$ and the same observed target token $x_i$. Let $$\ell_i^{\mathrm{Tr}}
    = -\log p_{\mathrm{Tr}}(x_i \mid x_{<i}),
    \qquad
\ell_i^{\mathrm{Hyb}}
    = -\log p_{\mathrm{Hyb}}(x_i \mid x_{<i}),$$ where $\mathrm{Tr}$ denotes the transformer and $\mathrm{Hyb}$ denotes the hybrid model. We define the paired token-level loss gap[^1] $$\Delta_i
    = \ell_i^{\mathrm{Tr}} - \ell_i^{\mathrm{Hyb}}
    = \log p_{\mathrm{Hyb}}(x_i \mid x_{<i})
      - \log p_{\mathrm{Tr}}(x_i \mid x_{<i}).$$ Thus $\Delta_i>0$ means that, at position $i$, the hybrid assigns higher probability to the observed next token than the transformer. All natural-token analyses in this section aggregate these per-position quantities.

**Models, domains and evaluation.** We compare two released 7B-scale models from the same recipe family: **Olmo 3 7B** and **Olmo Hybrid 7B** [@olmo2025olmo3; @merrill2026olmohybrid]. This pair is closely matched in tokenizer, data mixture, and training recipe, making the per-token gap primarily reflect the architectural difference (the sequence mixer). We evaluate on prose and structured text spanning natural language, code, and markup (prose: PG-19, News, Wikipedia, essay, textbooks and scientific papers; structured: Python, HTML, LaTeX). Text is packed into length-$L{=}8192$ sequences and we compute next-token NLL at every position. Unless otherwise stated, all 7B token-level and synthetic analyses use the final released checkpoint pair. Additional preprocessing and checkpoint details are in Appendix `\ref{app:method_details}`{=latex}.

**Token tagging and alignment.** After computing $\Delta_i$ for every LM target position, we assign surface tags using the same two-step procedure across domains. First, we tag spans in the original source text. Second, we align those source-level tags to LM target tokens by character-span overlap between the decoded LM token and the tagged source span. Prose and structured domains differ in the source-level tagger and tag inventory, and share the alignment procedure. For prose, the source spans are words tagged with the Brown POS tagset. For Python, HTML, and LaTeX, the source spans come from lightweight tokenizers or parsers, with categories such as identifiers, strings, comments, delimiters, tags, attributes, commands, and text nodes. When an LM token overlaps multiple source-level tags, we use multi-tag attribution, so the same $\Delta_i$ contributes to each overlapping tag's summary. Full tag taxonomies and alignment rules are in Appendix `\ref{app:tagging_details}`{=latex} and Appendices `\ref{sec:python_analysis}`{=latex}--`\ref{sec:html_analysis}`{=latex}. We also record each LM token's word-position type: whole word, prefix, middle subword, or suffix.

**Analysis I: Tag-stratified raw summaries.** Raw here means unadjusted: we average the paired loss gap over the observed corpus positions assigned to a tag, without reweighting those positions or controlling for other variables. This answers the descriptive question: among the actual corpus positions with a given tag, how much lower is the hybrid's NLL than the transformer's? For a tag or position set $\tau$, let $\mathcal{I}_{\tau}$ be the set of target positions assigned that tag. We report $$\widehat{\Delta}(\tau)
    =
    \frac{1}{|\mathcal{I}_{\tau}|}
    \sum_{i\in \mathcal{I}_{\tau}} \Delta_i
    \;=\;
    \mathbb{E}[\Delta_i \mid i\in \mathcal{I}_{\tau}].$$ For example, $\tau$ can be [Noun]{.smallcaps}, [Open Bracket]{.smallcaps}, or [HTML open tag]{.smallcaps}. If an LM token overlaps multiple source-level tags, that position belongs to each corresponding $\mathcal{I}_{\tau}$, so its token-level gap contributes to each relevant raw summary.

**Analysis II: Regression robustness check.** Raw tag means provide the main descriptive summary: they show the hybrid--transformer gap on the actual positions belonging to each tag. However, a raw tag mean can be confounded by other properties of those positions. Tags are correlated with difficulty, frequency, position, subword status, and local reuse, and these covariates can affect $\Delta_i$ independently of tag membership. For example, content and function categories differ substantially in token frequency. We therefore fit linear regressions that control for these correlates:

$$\small
\label{eq:prose_reg}

\begin{array}{@{}l@{}l@{}}
\begin{aligned}
\Delta_i \;\sim\;& \textsc{domain}_i + \textsc{tag}_i + \textsc{wpos}_i \\ &+ \textsc{relpos}_i + \bar\ell_i + \bar\ell_i^{2} \\ &+ \sum_{k\in\{1,2,3,4\}} \textsc{copy}_{k,i} \\
&+ \log \textsc{prevdist}_i + \log \textsc{freq}(x_i).
\end{aligned}
&
{\scriptsize
\setlength{\tabcolsep}{1pt}
\renewcommand{\arraystretch}{0.8}
\begin{tabular}{@{}l p{28em}@{}}
$\Delta_i$        & paired NLL gap $\ell^{\mathrm{Tr}}_i-\ell^{\mathrm{Hyb}}_i$ ($>0$ = hybrid better) \\
\textsc{domain}   & prose source fixed effects \\
\textsc{tag}      & either coarse POS family or aggregate word class, depending on the model \\
\textsc{wpos}     & subword position in word (whole/prefix/middle/suffix) \\
\textsc{relpos}   & relative position in packed sequence \\
$\bar\ell_i$      & mean NLL; difficulty proxy ($\bar\ell_i^{2}$ allows curvature) \\
\textsc{copy}$_k$  & prefix-reuse features for $k=1,2,3,4$; $\textsc{copy}_{1,i}$ is same-token reuse (the target token type appeared earlier in the prefix), and $\textsc{copy}_{k,i}$ for $k\ge2$ indicates that the target completes a repeated $k$-gram \\
\textsc{prevdist} & distance to previous occurrence of the same token type (log-scaled) \\
\textsc{freq}     & empirical target-token type frequency in the pooled regression sample \\
\end{tabular}}

\end{array}$$

We fit two instantiations of Equation `\ref{eq:prose_reg}`{=latex}. The first is the **coarse-tag model**, where [tag]{.smallcaps} is the full coarse Brown-POS family; this model gives the detailed adjusted effects for nouns, verbs, auxiliaries, brackets, punctuation, and other coarse categories. The second is the **aggregate word-class model**, where [tag]{.smallcaps} is replaced by the three-way label [Content]{.smallcaps}/[Function]{.smallcaps}/[Other]{.smallcaps};[^2] this model gives the aggregate rows in Figure `\ref{fig:tag_raw_vs_effect}`{=latex}. We fit the aggregate model separately. Both models include the same controls.

The raw summaries and regression effects answer complementary questions: raw gaps show whether the hybrid is better or worse on a token family in absolute terms, while regression effects show whether that family is more or less hybrid-favored than expected after controlling for confounds.

<figure id="fig:synthetic_examples">

<figcaption><strong>Synthetic probe examples.</strong> We vary the distance <span class="math inline"><em>d</em></span> between an antecedent/opener and a scored target token.</figcaption>
</figure>

**Analysis III: Controlled synthetic probes.** The natural-token analyses are observational: they reveal where the gap appears in real text, but a surface tag does not directly specify the decision the model must make at the scored position. We therefore add three controlled synthetic probe families (Figure `\ref{fig:synthetic_examples}`{=latex}):

-   [Pronoun memory]{.smallcaps}. The prompt introduces two people with different genders and roles, inserts filler, then refers to one role and scores the matching pronoun, e.g., *he* versus *she*.

-   [Entity tracking]{.smallcaps}. The prompt binds two entities to different attributes, inserts filler, then asks which entity has a queried attribute.

-   [Structural closure]{.smallcaps}. The prompt opens a bracketed or tagged region, inserts filler, then scores the required closing token.

For each family, we keep the surface template and filler distribution fixed while varying the distance $d\in\{32,64,128,256,512,1024\}$ between the relevant earlier cue and the scored target token.

For the first two probes, we score them contrastively using accuracy $\mathbf{1}[m>0]$ and margin $$m = \log p(y^+ \mid c) - \log p(y^- \mid c),$$ where $y^+$ is the correct token, $y^-$ is a matched distractor, and $c$ is the shared prefix. [Structural closure]{.smallcaps} instead tests visible structural closure, so we report NLL on the closing token, $$\ell_{\mathrm{close}}=-\log p(y_{\mathrm{close}}\mid c).$$

# Empirical Results {#sec:empirical_results}

We organize the results into two parts. Section `\ref{sec:natural_token_results}`{=latex} analyzes natural tokens from prose, code, and markup; these analyses show where the hybrid--transformer gap appears in real data and distinguish absolute raw gaps from controlled regression effects. Section `\ref{sec:synthetic_results}`{=latex} uses controlled synthetic probes to test whether the same split appears when we directly manipulate the target type and antecedent distance.

## Natural-token analysis {#sec:natural_token_results}

![**Raw and adjusted tag effects in prose.** Left panels show raw paired loss gaps, where positive values mean the hybrid has lower NLL on that token family. Right panels show regression-adjusted effects relative to the global mean from the regressions. Top panels use the full coarse POS taxonomy; bottom panels use the separately fitted three-way aggregate model (content/function/other).](figures_main/fig_m3_freq_tag_raw_vs_effect.png){#fig:tag_raw_vs_effect width="\\linewidth"}

![**Tag vocabulary size is associated with the prose tag effects.** Each point is a coarse prose tag from Figure `\ref{fig:tag_raw_vs_effect}`{=latex}, marked by aggregate word class. The x-axis is $\log|\mathcal{V}_\tau|$, where $\mathcal{V}_\tau$ is the set of distinct target-token types observed with tag $\tau$. Left: raw paired loss gap $\bar\Delta(\tau)$. Right: regression-adjusted tag effect relative to the global mean.](figures_main/fig_m3_freq_coarse_vocab_scatter.png){#fig:tag_vocab_scatter width="\\linewidth"}

**Hybrid advantage across token tags is largest for open-class content words.** `\Cref{fig:tag_raw_vs_effect}`{=latex} shows that the hybrid is better on most prose tag families, but not uniformly so. In the aggregate raw panel, content words have the largest hybrid advantage ($0.0384$ nats) versus function words ($0.0238$ nats), a difference of $0.0146$ nats (about $61\%$ larger). This ordering persists after controls (difficulty, token/subword position, sequence position, local reuse, previous-token distance, and token frequency): content stays above the global mean, while function falls below it. The regression gives a stricter test of the raw trend: after matching positions by difficulty, token frequency, subword status, sequence position, and local reuse, do the same tag-level observations still point in the same direction? For the main pattern, they do. The aggregate content--function contrast survives the controls, and many open-class categories remain more hybrid-favored than closed-class or highly constrained categories. The effects for individual function tags are more heterogeneous, but several function-like categories (e.g., auxiliaries, wh-words, [TO]{.smallcaps}, pronouns) still appear among the least hybrid-favored.

The aggregate contrast is also closely related to the open-/closed-class distinction: function words tend to come from small, closed inventories, whereas content words tend to be open class. We therefore ask whether the tag-level effects vary continuously with tag vocabulary size. `\Cref{fig:tag_vocab_scatter}`{=latex} plots $\log|\mathcal{V}_\tau|$, the number of distinct target token types realized under tag $\tau$, against both the raw and adjusted tag effects. Open-class tags (large vocabularies) tend to occupy the more hybrid-favored region in both the raw and adjusted views. After controls the slope is modest, which is expected because frequency, difficulty, and reuse explain part of the same variation; nevertheless, its positive direction agrees with the aggregate content--function result. The weaker adjusted slope should not be read as eliminating the closed-class pattern: several closed-class/function-like tags still have strongly negative adjusted effects, even though the overall vocabulary-size relationship is less steep after controls. A notable exception is [Existential]{.smallcaps}, which remains unusually hybrid-favored despite a small choice set (also under the controlled view). A plausible explanation is that existential *there* often signals a discourse-introducing construction (\`\`there is/are $\ldots$"), so the decision to use it can depend on latent state updates even though the surface choice set is small.

![**Brackets and repeated $n$-grams.** **A:** Open and close bracket raw gaps across domains; line segments show the open--close difference. **B:** Raw paired loss gaps for repeated-token events from repeated 1-grams through 16-grams. Positive raw gaps mean lower hybrid NLL. **C:** Implied repetition effects from an extension of Equation `\ref{eq:prose_reg}`{=latex} with [copy]{.smallcaps}$_1$--[copy]{.smallcaps}$_{16}$. Negative adjusted effects mean that repetition reduces the hybrid advantage relative to comparable non-copy positions.](figures_main/fig_brackets_and_repeat_ngrams_16.png){#fig:bracket_copy_gap width="\\linewidth"}

**Opening vs. closing brackets: the hybrid advantage is reduced on structural closure.** If the bracket pattern is about the predictive role of the target rather than the bracket character itself, then openers and closers should behave differently: openers begin a new region or scope, whereas closers satisfy an obligation already established by the visible prefix. `\Cref{fig:bracket_copy_gap}`{=latex}A tests this prediction across domains. Across all seven domains (prose, Python, HTML, LaTeX), *opening* brackets are consistently more hybrid-favored than the corresponding *closing* brackets; the gap is largest in prose and remains visible in structured text. This matters because opening and closing delimiters have similar surface form but different predictive roles. Openers often initiate a new region/scope (a state update), whereas closers typically satisfy an already-established structural obligation (closure). The cross-domain consistency therefore supports an interpretation in terms of what computation is queried at the target, not the bracket character per se.

**Hybrid advantage nearly disappears on repeated $n$-grams.** `\Cref{fig:bracket_copy_gap}`{=latex}B--C isolates repeated spans, an observable proxy for prefix-based copying in a weak sense rather than direct evidence of a copying mechanism. We call a target a repeated $n$-gram event when the contiguous $n$-token sequence ending at the target has appeared earlier in the prefix. For example, in \`\``a b c d a b` [`c`]{.underline}", the underlined target completes a repeated 3-gram, \`\``a b c`". `\Cref{fig:bracket_copy_gap}`{=latex}B reports raw paired loss gaps on such targets for $n=1,\ldots,16$. A repeated 1-gram only means that the target token type appeared earlier in the prefix, so it captures ordinary token reuse rather than a clean repeated-span continuation. As $n$ grows, the event becomes a cleaner repetition proxy: if a longer span appears twice, the target is increasingly determined by continuing material already visible in the prefix. Consistent with this interpretation, the raw hybrid advantage shrinks rapidly with $n$ and approaches zero for long repeated spans. In these settings, the repeated continuation in the visible prefix provides a strong prediction, so the hybrid no longer has a measurable absolute advantage.

The regression view in `\Cref{fig:bracket_copy_gap}`{=latex}C makes the same point under controls. We extend Equation `\ref{eq:prose_reg}`{=latex} by replacing [copy]{.smallcaps}$_1$--[copy]{.smallcaps}$_4$ with [copy]{.smallcaps}$_1$--[copy]{.smallcaps}$_{16}$. Because repeated-$n$ events nest (a repeated $n$-gram also activates all shorter copy indicators), we plot the *implied* repetition effect at length $n$ by summing active copy coefficients through $n$. These implied effects are consistently transformer-shifting (negative), with uncertainty growing for large $n$ because long repeated spans are rare and the nested indicators are highly correlated. Non-repetition control coefficients are reported in Appendix `\ref{app:control_coefficients}`{=latex}.

## Controlled synthetic analysis {#sec:synthetic_results}

![**Controlled probes at the final matched checkpoint.** Pronoun memory and entity tracking are scored by contrastive accuracy and log-probability margin; structural closure is scored by NLL on the closing token. Higher is better for accuracy and margin; lower is better for NLL.](figures_main/fig_distance_row.png){#fig:distance_row width="\\linewidth"}

`\Cref{fig:distance_row}`{=latex} provides the controlled counterpart to the natural-token split by fixing templates and varying the dependency distance and target type. Pronoun memory favors the hybrid (larger contrastive margins and typically higher accuracy). Entity tracking shows a stronger separation: the hybrid stays above chance across distances, while the transformer dips below chance at intermediate distances and yields negative margins there. Structural closure reverses sign: when the target is a closer whose opener is already present, the transformer attains lower NLL at every distance. Thus delay alone is not the relevant variable: all three probes require information from earlier context, but only the state-readout probes favor the hybrid. This is consistent with the theoretical expectation that recurrent layers help most when prediction requires maintaining and querying latent state, whereas attention remains competitive or better when the target is a closure token determined by a visible opener.

## Summary of empirical findings

Across observational tagging and controlled probes, we find a consistent, non-uniform token-level structure in the hybrid--transformer gap: (i) the hybrid advantage is broad but especially pronounced on open-class, content-bearing choices; the effect is weaker and more heterogeneous for closed-class/function categories; (ii) in contrast, the transformer matches or exceeds the hybrid on structural closure: closing delimiters are less hybrid-favored than openers across domains, and the controlled closure probe favors the transformer; and (iii) finally, the hybrid advantage collapses on explicit visible-prefix reuse (repeated $n$-grams), and repetition features shift the gap toward the transformer even after controls.

# Interpretation: Hybrid Gains and Discourse-State Tracking {#sec:discussion}

We now give a preliminary theoretical account of the empirical split in §`\ref{sec:empirical_results}`{=latex}.

#### Setup.

Fix a target position $t$ and prefix $x_{<t}$. Let $\tau$ denote the target token's coarse local class, such as a POS tag, delimiter class, or markup/code token type. For any such class, define $$\mathcal{V}_{\tau}
    =
    \{x\in\mathcal{V}:x\text{ can realize class }\tau\}.$$ Since our empirical strata are assigned in the analysis, we treat $\tau$ as an oracle local-slot variable. Thus, when analyzing a fixed class $\tau$, we use class-conditional next-token distributions; equivalently, the domain of $X_t$ is restricted to $\mathcal{V}_{\tau}$. To keep the notation local to this fixed target and class, write $$p^\star_{\tau}(\cdot)
    =
    p^\star\!\left(
        X_t=\cdot
        \mid
        X_{<t}=x_{<t},\,
        X_t\in\mathcal{V}_{\tau}
    \right)$$ for the ideal full-prefix next-token distribution under the data-generating process, conditional on the target class. Let $\phi$ be a feature map of the prefix. A reduced predictor that can use only the information summarized by $\phi(x_{<t})$, together with the same class restriction, predicts $$p_{\phi,\tau}(\cdot)
    =
    p_\phi\!\left(
        X_t=\cdot
        \mid
        \phi(x_{<t}),\,
        X_t\in\mathcal{V}_{\tau}
    \right).$$ Going forward, $p_{\phi,\tau}$ denotes the best predictor under that feature restriction and class conditioning. For an actual architecture, $\phi$ can be viewed as the class of prefix features that the sequence mixer makes easy to express: attention can expose visible-prefix retrieval and structural-matching features, while recurrent layers can expose updated latent-state features. The reducible loss incurred by using $p_{\phi,\tau}$ to approximate $p^\star_{\tau}$ is $$\mathcal{E}_{\phi,\tau}(t)
    =
    H(p^\star_{\tau},p_{\phi,\tau})-H(p^\star_{\tau})
    =
    D_{\mathrm{KL}}\!\left(p^\star_{\tau}\,\|\,p_{\phi,\tau}\right),$$ where, for distributions supported on $\mathcal{V}_{\tau}$, $$H(p) = -\sum_{x\in\mathcal{V}_{\tau}} p(x)\log p(x),
    \qquad
    H(p,q) = -\sum_{x\in\mathcal{V}_{\tau}} p(x)\log q(x).$$ A feature map is predictive enough for a token family when this class-conditional KL divergence is small on that family.

**First: visible-prefix sufficiency explains why the gap closes on recall and closure.** The common structure in recall and closure is simple: within the relevant local class $\tau$, suppose a prefix feature $\phi$ lets the reduced predictor determine the class-conditional next-token distribution. Then, in the deterministic idealization, $$p_{\phi,\tau}=p^\star_{\tau}
    \qquad\text{and hence}\qquad
    D_{\mathrm{KL}}(p^\star_{\tau}\|p_{\phi,\tau})=0.$$ In less deterministic natural text, the corresponding claim is that this class-conditional KL is small. Thus, once such a feature is available, a richer state representation has little additional loss to remove.

In the repeated-$n$-gram case, $\phi$ is the visible continuation of the longest repeated suffix: if the current suffix has occurred before and its continuation is unambiguous, then the target is predictable by continuing material already present in the prefix. In the closure case, $\phi$ is the relevant structural matcher, such as the top of the active delimiter stack: if the target is a closing delimiter, the correct closer is predictable from that stack state. Since both the matched transformer and the hybrid can express these features in the regimes measured by repeated continuations and bracket/tag closure, we expect little systematic difference between them there. This explains why repeated $n$-grams and closing delimiters are transformer-friendly or near-neutral.

**Second: local class size bounds how much any richer feature map can help.** Let $U_{\tau}$ be the uniform distribution on $\mathcal{V}_{\tau}$, and let $p_{\mathrm{class},\tau}$ be the best predictor that sees only the class restriction $X_t\in\mathcal{V}_{\tau}$ and no prefix features. The following bound is the formal version of the POS-vocabulary-size effect in `\Cref{fig:tag_vocab_scatter}`{=latex}.

::: proposition
**Proposition 1**. *Fix a target class $\tau$ and restrict the domain of $X_t$ to $\mathcal{V}_{\tau}$. Then any richer feature map $\phi$, evaluated through its best class-conditional reduced predictor, satisfies $$D_{\mathrm{KL}}(p^\star_{\tau}\|p_{\phi,\tau})
        \le
        \log |\mathcal{V}_{\tau}| .$$*
:::

::: proof
*Proof.* Because $p_{\phi,\tau}$ is the best predictor under the feature restriction $\phi$ and the class restriction $X_t\in\mathcal{V}_{\tau}$, it can always ignore $\phi(x_{<t})$ and emulate the best class-only predictor $p_{\mathrm{class},\tau}$. Therefore, $$\begin{aligned}
        D_{\mathrm{KL}}(p^\star_{\tau}\|p_{\phi,\tau})
        &\le
        D_{\mathrm{KL}}(p^\star_{\tau}\|p_{\mathrm{class},\tau}) \\
        &\le
        D_{\mathrm{KL}}(p^\star_{\tau}\|U_{\tau}) \\
        &= \log |\mathcal{V}_{\tau}| - H(p^\star_{\tau})
        \le \log |\mathcal{V}_{\tau}|. \qedhere

\end{aligned}$$ ◻
:::

For small closed classes, knowing the local class leaves only a limited amount of possible extra information. Thus, the additional benefit of information beyond the local class is bounded even when that information is useful. For open classes, the bound is much looser: knowing that the next token is a noun, identifier, string, or text node can still leave substantial uncertainty about which token should fill the slot. This prediction is consistent with our empirical finding that many function-like categories have smaller and more heterogeneous hybrid gains, while larger open-class categories leave more room for architectural differences.

**Finally: when recall, closure, and local class are not enough, the missing feature is semantic state.** We state this final case as a hypothesis. For non-copy open-class choices, identifiers, attribute values, and controlled entity/role probes, the local class can specify the kind of slot being filled, while reuse and closure features still need not determine the filler. We hypothesize that the semantic state of the discourse provides useful predictive cues in these contexts.

More concretely, let $\delta_{[t]-1}$ be a *discourse state* representing the longer-lived discourse, program, or document state after the previous sentence/unit, where $[t]$ is the index of the sentence containing token $t$ or the analogous local unit in code/markup. In the context of code evaluation, the discourse state can be identified with the current variable assignments after a step of execution. In linguistics, different theories of dynamic semantics provide different ways of instantiating $\delta_j$ to encode entities, events, relations, salience, variable bindings, scopes, or document regions [@heim1982semantics; @heim1983file; @kamp1993discourse; @grosz1986attention; @grosz1995centering]. At a high level of abstraction, we imagine that, for a sequence of sentences $\mathrm{sent}_1,\ldots,\mathrm{sent}_j$, this discourse state is some structured object updated recurrently by each sentence: $$\delta_j = \mathsf{Update}(\delta_{j-1},\mathrm{sent}_j) .$$ Thus, each sentence acts as an update operator over the discourse state $\delta_j$. While models may not explicitly store the discourse state as a structured object, we hypothesize it is useful for them to represent some potentially lossy part of it, since $\delta_j$ carries information useful for predicting the next token that is not supplied by the local class, repeated-prefix continuation, or local structural obligation alone. For example, as shown in `\Cref{fig:discourse-a5-bridge}`{=latex}, in \`\`John took the book and gave it to Mary. Mary grabbed her

------------------------------------------------------------------------

," the local class suggests a noun phrase, and neither copy nor delimiter state resolves the filler. What makes *reading glasses* plausible is semantic state: Mary now has the book, so a reading-related continuation becomes more likely. This is the sense in which we use \`\`state-conditioned" or \`\`semantically conditioned": the local slot is partly specified, but the filler depends on accumulated context. Thus, we hypothesize that discourse state tracking can provide useful additional information for next-token prediction in open-class contexts, and thus that models that can better represent discourse state updates may achieve lower loss in these contexts.

Putting the three parts together, repeated continuations and closing delimiters are mostly explained by features that both architectures can access, while small closed classes leave limited room for any richer feature to help. The largest hybrid gains should therefore appear on open-class, non-copy, state-conditioned targets whose fillers depend on accumulated semantic, program, or document context. This is the split seen in `\Cref{fig:tag_raw_vs_effect,fig:bracket_copy_gap,fig:distance_row}`{=latex}.[^3]

## Discourse State Tracking Targets to Guide RNN Architectures {#sec:discourse-benchmarks}

Much recent work on recurrent sequence layers uses the $A_5$ word problem as a standard benchmark for state tracking [@merrill2024illusion; @grazzi2025unlocking; @peng2025rwkv; @yang2026path; @merrill2026olmohybrid]. In this task, the input specifies a sequence of permutations from the alternating group $A_5$; the model must compose these updates in order and then read out the resulting state, or a property of it. In the notation above, $A_5$ is a closed-world case where $\delta$ is a fixed-size finite state and the target is a deterministic readout from that state. The right side of `\Cref{fig:discourse-a5-bridge}`{=latex} gives a schematic example of this kind of closed-world permutation-update problem.

$A_5$ has been useful because it cleanly isolates ordered update and readout over a fixed finite state space. However, structured linear-RNN variants can already solve this benchmark essentially perfectly [@terzic2026structured], which makes it less informative for distinguishing future recurrent sequence layers.

Moreover, $A_5$ does not capture certain properties of state tracking in discourse or code. Unlike $A_5$, natural text and code can introduce new entities, events, variables, scopes, and situations over time, so the effective state can grow with input length. The state is also relational: useful readouts may depend on who did what to whom, which object is salient, which scope is active, or which variable binding currently holds [@kamp1993discourse]. Thus, a broad takeaway from our analysis is that more realistic state tracking could be a good synthetic-task target for future architectures. Such tasks would preserve the clean update/readout structure of $A_5$ while adding dynamic entity introduction and relational state, making them a more demanding testbed for future linear-RNN and hybrid sequence layers.

```{=latex}
\definecolor{pm}{HTML}{1AA3D4}
```
`\definecolor{pmsoft}{HTML}{F4FAFD}`{=latex} `\definecolor{pmedge}{HTML}{B5DCEB}`{=latex} `\definecolor{et}{HTML}{A67868}`{=latex} `\definecolor{etsoft}{HTML}{FBF6F2}`{=latex} `\definecolor{etedge}{HTML}{DCC1AC}`{=latex} `\definecolor{ink}{HTML}{2A2A2A}`{=latex} `\definecolor{muted}{HTML}{6F6F6F}`{=latex} `\definecolor{bridgegrey}{HTML}{8A8A8A}`{=latex}

```{=latex}
\tikzset{
  >=Stealth,
  panelP/.style={
    rounded corners=8pt,
    line width=.85pt,
    fill=pmsoft,
    draw=pmedge
  },
  panelE/.style={
    rounded corners=8pt,
    line width=.85pt,
    fill=etsoft,
    draw=etedge
  },
  ptitle/.style={
    font=\bfseries\small,
    text=pm,
    anchor=north west,
    inner xsep=0pt,
    inner ysep=0pt
  },
  etitle/.style={
    font=\bfseries\small,
    text=et,
    anchor=north west,
    inner xsep=0pt,
    inner ysep=0pt
  },
  sent/.style={
    font=\footnotesize,
    text=ink,
    align=left,
    anchor=north west,
    inner xsep=0pt,
    inner ysep=0pt
  },
  cloze/.style={
    font=\scriptsize\itshape,
    text=pm,
    anchor=north east,
    inner xsep=0pt,
    inner ysep=0pt
  },
  pnode/.style={
    draw=pm!75,
    fill=white,
    rounded corners=2.5pt,
    line width=.7pt,
    inner xsep=4pt,
    inner ysep=2.5pt,
    font=\footnotesize,
    text=ink,
    align=center,
    minimum height=5mm,
    minimum width=12mm
  },
  goalnode/.style={
    draw=pm,
    fill=pm!10,
    rounded corners=2.5pt,
    line width=.95pt,
    inner xsep=4pt,
    inner ysep=2.5pt,
    font=\footnotesize\itshape,
    text=pm!75!black,
    align=center,
    minimum height=5mm,
    minimum width=12mm
  },
  rarrow/.style={
    draw=muted,
    line width=.65pt,
    -{Stealth[length=1.6mm,width=1.3mm]}
  },
  relarrow/.style={
    draw=pm,
    line width=.9pt,
    dash pattern=on 1.9pt off 1.5pt,
    -{Stealth[length=1.8mm,width=1.5mm]}
  },
  elabel/.style={
    font=\scriptsize\itshape,
    text=muted,
    fill=pmsoft,
    fill opacity=.96,
    text opacity=1,
    inner xsep=1.5pt,
    inner ysep=.4pt
  },
  rellabel/.style={
    font=\scriptsize\itshape,
    text=pm,
    fill=pmsoft,
    fill opacity=.96,
    text opacity=1,
    inner xsep=1.5pt,
    inner ysep=.4pt
  },
  codebox/.style={
    draw=etedge,
    fill=white,
    rounded corners=5pt,
    line width=.7pt,
    inner xsep=7pt,
    inner ysep=6pt,
    font=\ttfamily\footnotesize,
    text=ink,
    align=left
  },
  bridge/.style={
    draw=bridgegrey,
    fill=white,
    rounded corners=3pt,
    line width=.75pt,
    inner xsep=3.5pt,
    inner ysep=2.5pt,
    font=\scriptsize\itshape,
    text=bridgegrey,
    align=center
  },
  bridgearr/.style={
    draw=bridgegrey,
    line width=.8pt,
    -{Stealth[length=1.7mm,width=1.4mm]}
  }
}
```
```{=latex}
\begin{figure*}[t]

\resizebox{.92\textwidth}{!}{%
\begin{tikzpicture}[font=\small]

\def\PW{8.4cm}
\def\AW{5.4cm}
\def\PH{3.7cm}
\def\Gap{1.2cm}

% Panels
\node[panelP, minimum width=\PW, minimum height=\PH, anchor=north west] (D) at (0,0) {};
\node[panelE, minimum width=\AW, minimum height=\PH, anchor=north west] (A)
  at ([xshift=\Gap]D.north east) {};

% =========================================================
% Left panel: discourse tracking
% =========================================================
\node[ptitle] (dt)
  at ([xshift=6mm, yshift=-2.5mm]D.north west)
  {Discourse tracking};

\node[sent] (s1)
  at ([yshift=-2mm]dt.south west)
  {John took the book and gave it to Mary.};

\node[sent] (s2)
  at ([yshift=-1mm]s1.south west)
  {Mary grabbed her \rule{4.5mm}{.35pt}};

\node[cloze]
  at ([xshift=-5mm, yshift=-0.5mm]s2.south -| D.east)
  {predict content phrase};

% Discourse-state graph
\node[pnode] (john)
  at ([xshift=13mm, yshift=15mm]D.south west)
  {John};

\node[pnode, right=9mm of john] (book)
  {book};

\node[pnode, right=9mm of book] (mary)
  {Mary};

\node[goalnode] (glasses)
  at ([xshift=65mm, yshift=5.5mm]D.south west)
  {reading\\glasses};

% Edges: surface event chain
\draw[rarrow, shorten <=1pt, shorten >=1pt]
  (john.east) --
  node[elabel, above=.35mm] {gives}
  (book.west);

\draw[rarrow, shorten <=1pt, shorten >=1pt]
  (book.east) --
  node[elabel, above=.35mm] {to}
  (mary.west);

% Possession/state relation: routed above the book--Mary edge
\draw[rarrow, shorten <=1pt, shorten >=1pt]
  (mary.north west)
  .. controls +(-7mm,6mm) and +(7mm,6mm) ..
  node[elabel, above, pos=.52] {has}
  (book.north east);

% Discourse inference: lower dashed path, separated from "has"
\draw[relarrow, shorten <=1pt, shorten >=1pt]
  (book.south)
  .. controls +(0mm,-10mm) and +(-23mm,-1mm) ..
  node[rellabel, below, pos=.48] {makes likely}
  (glasses.west);

% Action prediction: short diagonal arrow into the predicted phrase
\draw[rarrow, shorten <=1pt, shorten >=1pt]
  (mary.south east)
  to[out=-35,in=125]
  node[elabel, above right, pos=.55] {grabs}
  (glasses.north);

% =========================================================
% Right panel: A5-style state tracking
% =========================================================
\node[etitle] (atit)
  at ([xshift=6mm, yshift=-2.5mm]A.north west)
  {$A_5$-style state tracking};

\node[codebox, anchor=center]
  at ($(A.center)+(0,-3.5mm)$) (code)
  {\begin{tabular}{@{}l@{\,\,}l@{\,\,}l@{}}
   x = [a, b, c, d, e]\\[2pt]
   \textcolor{muted}{\# permutation (1\,2\,3)}\\
   tmp  & = & x[0]\\
   x[0] & = & x[2]\\
   x[2] & = & x[1]\\
   x[1] & = & tmp\\[2pt]
   \textcolor{et!75!black}{assert x[2] == b}
   \end{tabular}};

% =========================================================
% Bridge
% =========================================================
\node[bridge] (br)
  at ($(D.east)!0.5!(A.west)$)
  {special\\case};

\draw[bridgearr]
  (D.east) -- ([xshift=-1.2mm]br.west);

\draw[bridgearr]
  ([xshift=1.2mm]br.east) -- (A.west);

\end{tikzpicture}%
}
\caption{\textbf{Discourse tracking as richer state tracking.}
\textbf{Left:} predicting the masked content phrase requires tracking that the book has ended up with Mary and using that discourse state to make reading-related objects likely.
\textbf{Right:} closed-world $A_5$ permutation updates apply ordered, deterministic state changes over a fixed set of slots, followed by a deterministic readout such as an assertion about one slot.}

\label{fig:discourse-a5-bridge}
\end{figure*}
```
# Application: Higher-Signal Evaluations for Pretraining Hybrid Models {#sec:higher_signal_evals}

The decomposition in §`\ref{sec:empirical_results}`{=latex} suggests that aggregate next-token loss mixes several distinct computation regimes that respond differently to the choice of sequence mixer. Because most tokens are easy for any reasonable architecture, aggregate loss is a relatively *low-signal* target for architecture search: real but systematic gaps can be diluted by the easy majority. We propose tracking *filtered* losses, computed from the same next-token NLL, as a higher-signal complement to aggregate validation. The filters serve two roles. First, they can amplify small architecture gaps by concentrating on the tokens where a capability matters. Second, they can reveal different empirical behaviors across token families that a single aggregate number hides. As a proof of concept, we show that *filtered* validation losses, computed from the same per-token NLL as standard validation, can reveal capability differences between architectures that are nearly invisible in aggregate loss.

**Filters.** We evaluate three filters built from the token-level features of §`\ref{sec:empirical_method}`{=latex}: (i) [All tokens]{.smallcaps}, the standard aggregate; (ii) [Top-10]{.smallcaps} $\cap$ [No-Copy]{.smallcaps}, restricted to the ten most hybrid-favored open-class POS families (`\Cref{fig:tag_raw_vs_effect}`{=latex}) and excluding positions that complete a repeated $n$-gram for $n\le4$; (iii) [Copy-5 only]{.smallcaps}, positions that complete a repeated 5-gram. By construction, (ii) targets *state-conditioned readout* while removing visible-prefix retrieval, and (iii) isolates retrieval.

**Setup.** To check that filtered evals carry useful architectural signal during small-scale pretraining, we evaluate checkpoints from three 1B-parameter development training runs released by @merrill2026olmohybrid: a **Transformer**, a **Hybrid** (interleaved GDN/attention, $3{:}1$ ratio), and a **Pure RNN** (GDN, no attention). The three are trained on a matched data mixture and budget under WSD scheduling; we score every annealed checkpoint and report token-loss curves in `\Cref{fig:filtered_eval_1b}`{=latex}.

**State-oriented filters amplify small architecture gaps.** Under aggregate [All tokens]{.smallcaps} loss (`\Cref{fig:filtered_eval_1b}`{=latex}, left), the Transformer--Hybrid separation is small, with a maximum of roughly $0.06$ nats and smaller gaps later in training. Under [Top-10]{.smallcaps} $\cap$ [No-Copy]{.smallcaps} (middle), the maximum Transformer--Hybrid separation is roughly $0.12$ nats, about twice as large, and the ordering becomes [Hybrid]{.smallcaps} $<$ [Pure RNN]{.smallcaps} $<$ [Transformer]{.smallcaps}. This supports the intended use of the filter: removing copy positions and restricting to open-class targets leaves a regime that is more sensitive to how well an architecture supports latent-state construction and state-conditioned readout.

**Copy filters expose what aggregate loss averages away.** In the aggregate [All Tokens]{.smallcaps} curve, the Transformer and Pure RNN appear roughly matched, especially at later checkpoints. Taken alone, this comparison would suggest that the two architectures have similar validation behavior. The filtered curves show that this apparent match is an average over different regimes. The [Top-10]{.smallcaps} $\cap$ [No-Copy]{.smallcaps} filter separates the models on state-oriented non-copy targets, where the Pure RNN does better than the Transformer. The [Copy-5,only]{.smallcaps} filter reveals the compensating weakness: on positions that continue a repeated 5-gram, where prediction mainly requires retrieving material from the visible prefix, the Pure RNN is consistently about $0.10$--$0.20$ nats worse than the two attention-based models. Thus filtered loss makes visible a split that aggregate loss largely obscures: recurrence-only models can look competitive on average while still lagging on visible-prefix retrieval.

**A diagnostic for hybrid architecture design.** Aggregate loss reports whether an architecture is competitive overall; filtered evals diagnose *why*. A Pure RNN that closes the [Top-10]{.smallcaps} $\cap$ [No-Copy]{.smallcaps} gap but stays far behind attention-based models on [Copy-5 only]{.smallcaps} has improved its state representation but not the retrieval problem; an architecture that does the reverse is borrowing attention's strength without gaining on state-tracking. This decomposition is especially useful when iterating on *hybrid* architectures, where the design space spans the layer ratio, the choice of recurrent mixer, and the placement of attention: filtered evals make it possible to ask which of those knobs is buying which capability, rather than reading a single aggregate number that aggregates over both regimes. Because both filters are computed from the same per-token NLL as standard validation, they add negligible cost, and we suggest reporting them alongside aggregate loss during architecture sweeps.

![**Filtered token losses surface architecture differences during 1B pretraining.** Token-loss curves at WSD-annealed checkpoints for a Transformer, a Hybrid, and a Pure RNN. ](figures_main/fig_1b_loss_curves__ALL.png){#fig:filtered_eval_1b width="0.9\\linewidth"}

# Conclusion

We compared a matched pair of models---a transformer model (Olmo 3) and a hybrid model (Olmo Hybrid)---at the token level to identify which next-token predictions drive the average loss gap. Three patterns recur: (i) the hybrid's edge concentrates on open-class content predictions; (ii) opening delimiters are hybrid-favored while closing delimiters are transformer-favored; and (iii) the hybrid advantage vanishes on visible-prefix copying. Overall, these findings seem to align with the expressivity benefits of hybrid models: attention provides direct visible-prefix retrieval, while recurrent layers support ordered state update and state-conditioned readout. It therefore makes sense that the hybrid advantage concentrates on predictions whose answer depends on an evolving discourse, program, or document state, rather than on copying or bracket closure.

Two directions follow. First, $A_5$-style benchmarks capture only a closed-world special case of the state tracking natural language requires; formal-language tasks admitting dynamic entity introduction and relational structure, evaluated on linear-RNN variants (§`\ref{sec:discourse-benchmarks}`{=latex}), would test whether architectures that better realize discourse-style state tracking also improve on the token families driving the hybrid gap. Second, the same token-level profile can diagnose data-mixture and post-training choices: underperformance on a narrow family (e.g., numeric literals, closing delimiters) signals where targeted data might help, while the rest of the profile checks that gains do not erode state-dependent predictions. Our filtered-evaluation results provide an encouraging proof of concept for this diagnostic use: token-level decompositions can sharpen pretraining comparisons, expose capability tradeoffs that aggregate loss hides, and identify concrete token families for targeted intervention. Future work can build on this proof of concept by exploring the practical implications of token-level profiles for architecture search, data selection, and post-training.

```{=latex}
\begin{ack}
The authors thank Allyson Ettinger and Kyle Richardson for relevant feedback.
\end{ack}
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
# Empirical methodology details {#app:method_details}

## Domains, packing, and pairing protocol {#app:domains_details}

We evaluate both models on the same evaluation sequences and compute NLL at every position. Text is packed into contiguous sequences of length $T=8192$. Allcomparisons are *paired* at the level of a single next-token decision: *same checkpoint, same prefix/context, same target token*. For each position $i$ we compute $\Delta_i=\ell^{\mathrm{T}}_i-\ell^{\mathrm{H}}_i$.

For the main text we focus on prose (PG-19, CC-News, Wikipedia, ArXiv) and three structured domains (Python, HTML, LaTeX). The pooled prose regression uses a slightly expanded prose suite at the final matched checkpoint (including additional textbooks and scientific papers), but uses the same packing and pairing protocol.

## Token tagging and alignment {#app:tagging_details}

#### Prose POS tags and content/function mapping.

We POS-tag prose at the *word* level using the Brown tagset [@francis1979brown; @bird2009nltk], then map each Brown tag to a coarse family. For the aggregate prose analysis, we define:

-   **Content words:** [Noun]{.smallcaps}, [Verb]{.smallcaps}, [Adjective]{.smallcaps}, [Adverb]{.smallcaps}, [Interjection]{.smallcaps}, [Qualifier]{.smallcaps}.

-   **Function words:** [Existential]{.smallcaps}, [Pronoun]{.smallcaps}, [Det/Article]{.smallcaps}, [Preposition]{.smallcaps}, [Conjunction]{.smallcaps}, [Aux BE]{.smallcaps}, [Aux HAVE]{.smallcaps}, [Aux DO]{.smallcaps}, [Modal]{.smallcaps}, [TO]{.smallcaps}, [Wh-word]{.smallcaps}.

-   **Other:** punctuation, brackets, numerals, particles, and residual tags.

#### Coarse Brown-tag mapping.

Table `\ref{tab:coarse_mapping}`{=latex} gives the exact mapping used to produce the coarse families in the main-text prose figures.

```{=latex}
\small
```
```{=latex}
\begin{tabular}{lp{9.5cm}}
    \toprule
    \textbf{Coarse category} & \textbf{Brown tags} \\
    \midrule
    \emph{Content words} \\
    Noun        & \texttt{NN}, \texttt{NN\$}, \texttt{NNS}, \texttt{NNS\$}, \texttt{NP}, \texttt{NP\$}, \texttt{NPS}, \texttt{NPS\$}, \texttt{NR}, \texttt{NR\$}, \texttt{NRS} \\
    Verb        & \texttt{VB}, \texttt{VBD}, \texttt{VBG}, \texttt{VBN}, \texttt{VBZ} \\
    Adjective   & \texttt{JJ}, \texttt{JJ\$}, \texttt{JJR}, \texttt{JJS}, \texttt{JJT} \\
    Adverb      & \texttt{RB}, \texttt{RB\$}, \texttt{RBR}, \texttt{RBT}, \texttt{RN} \\
    Interjection & \texttt{UH} \\
    Qualifier   & \texttt{QL}, \texttt{QLP} \\
    \midrule
    \emph{Function words} \\
    Pronoun     & \texttt{PP\$}, \texttt{PP\$\$}, \texttt{PPL}, \texttt{PPLS}, \texttt{PPO}, \texttt{PPS}, \texttt{PPSS}, \texttt{PN}, \texttt{PN\$} \\
    Det/Article & \texttt{AT}, \texttt{DT}, \texttt{DT\$}, \texttt{DTI}, \texttt{DTS}, \texttt{DTX}, \texttt{ABL}, \texttt{ABN}, \texttt{ABX}, \texttt{AP}, \texttt{AP\$} \\
    Preposition & \texttt{IN} \\
    Conjunction & \texttt{CC}, \texttt{CS} \\
    Aux BE      & \texttt{BE}, \texttt{BED}, \texttt{BEDZ}, \texttt{BEG}, \texttt{BEM}, \texttt{BEN}, \texttt{BER}, \texttt{BEZ} \\
    Aux HAVE    & \texttt{HV}, \texttt{HVD}, \texttt{HVG}, \texttt{HVN}, \texttt{HVZ} \\
    Aux DO      & \texttt{DO}, \texttt{DOD}, \texttt{DOZ} \\
    Modal       & \texttt{MD} \\
    TO          & \texttt{TO} \\
    Existential & \texttt{EX} \\
    \midrule
    \emph{Other categories} \\
    Wh-word     & \texttt{WDT}, \texttt{WP\$}, \texttt{WPO}, \texttt{WPS}, \texttt{WQL}, \texttt{WRB} \\
    Numeral     & \texttt{CD}, \texttt{CD\$}, \texttt{OD} \\
    Particle    & \texttt{RP} \\
    Punctuation & \texttt{,}, \texttt{.}, \texttt{:}, \texttt{'}, \texttt{''}, \texttt{``}, \texttt{--}, \texttt{*} \\
    Open Bracket & \texttt{(} \\
    Close Bracket & \texttt{)} \\
    \bottomrule
  \end{tabular}
```
#### Projecting word tags onto subword tokens.

Because the LM uses subword tokenization, we assign each word-level tag to every LM token whose decoded character span overlaps the tagged word. Each LM token also receives a word-position indicator [Whole]{.smallcaps}/[Prefix]{.smallcaps}/ [Middle]{.smallcaps}/[Suffix]{.smallcaps} to separate category effects from word-onset effects.

#### Structured domains and multi-tag overlap.

For Python we tokenize files using the standard `tokenize` module. For HTML and LaTeX we use lightweight parser/regex pipelines that separate text-like content from structural tokens (tags/delimiters/commands/whitespace). We align source-level tags to LM subword tokens by character-span overlap.

A single LM token can overlap *multiple* source tokens (e.g., `):` spans both a close-paren and a colon). We therefore use **multi-tag attribution**: the LM-token loss contributes to every overlapping source tag rather than forcing a single primary label.

# Structured-domain tagsets and coarse breakdowns {#app:structured_domains}

This section supplies the structured-domain tag taxonomies.

## Python {#sec:python_analysis}

#### Tagging and multi-tag alignment.

We tokenize source files with Python's `tokenize` module and map the resulting source-level tags onto LM subword tokens by character-span overlap. Because one LM token can contain several Python tokens (e.g., `):`), the Python analysis uses multi-tag attribution.

```{=latex}
\footnotesize
```
```{=latex}
\begin{tabularx}{\linewidth}{>{\arraybackslash}p{1.45cm}>{\arraybackslash}p{3.8cm}X}
    \toprule
    Coarse & Fine-grained tags & Examples \\
    \midrule
    Identifier & \texttt{identifier}, \texttt{upper\_camel}, \texttt{private}, \texttt{dunder} & \texttt{x}, \texttt{MyClass}, \texttt{\_val}, \texttt{\_\_init\_\_} \\
    Keyword    & \texttt{keyword}, \texttt{soft\_keyword} & \texttt{def}, \texttt{return}, \texttt{match} \\
    Builtin    & \texttt{builtin} & \texttt{print}, \texttt{len}, \texttt{ValueError} \\
    Operator   & \texttt{arithmetic}, \texttt{assignment}, \texttt{comparison}, \texttt{bitwise} & \texttt{+}, \texttt{+=}, \texttt{==}, \texttt{\&} \\
    Delimiter  & \texttt{enclosure}, \texttt{delimiter}, \texttt{other} & \texttt{(}, \texttt{)}, \texttt{,}, \texttt{:}, \texttt{.} \\
    String     & \texttt{string}, \texttt{triple\_quoted}, \texttt{fstring}, \texttt{raw}, \texttt{bytes} & \texttt{'hello'}, \texttt{"""doc"""}, \texttt{f'\{x\}'} \\
    Number     & \texttt{integer}, \texttt{float}, \texttt{hex}, \texttt{octal}, \texttt{complex} & \texttt{42}, \texttt{3.14}, \texttt{0xFF} \\
    Comment    & \texttt{COMMENT} & \texttt{\# ...} \\
    Structure  & \texttt{NEWLINE}, \texttt{INDENT}, \texttt{WHITESPACE} & indentation and layout \\
    \bottomrule
  \end{tabularx}
```
```{=latex}
\FloatBarrier
```
## HTML {#sec:html_analysis}

We tag HTML with a parser-plus-regex pipeline that distinguishes opening/closing markup, attributes, text nodes, comments, punctuation glyphs, and whitespace.

```{=latex}
\footnotesize
```
```{=latex}
\begin{tabularx}{\linewidth}{>{\arraybackslash}p{1.35cm}>{\arraybackslash}p{3.7cm}X}
    \toprule
    Coarse & Fine-grained tags & Examples \\
    \midrule
    Tag         & \texttt{open}, \texttt{close}, \texttt{self\_close}, \texttt{void}, \texttt{doctype} & \texttt{<div>}, \texttt{</div>}, \texttt{<br/>}, \texttt{<img>} \\
    Attribute   & \texttt{name}, \texttt{equals}, \texttt{value\_quoted}, \texttt{value\_unquoted} & \texttt{class}, \texttt{=}, \texttt{"main"} \\
    Text        & \texttt{content}, \texttt{entity} & free-form page text \\
    Comment     & \texttt{comment} & \texttt{<!-- ... -->} \\
    Punctuation & \texttt{angle\_open}, \texttt{angle\_close}, \texttt{slash}, \texttt{quote} & \texttt{<}, \texttt{>}, \texttt{/}, \texttt{"} \\
    Whitespace  & \texttt{ws} & spaces and indentation \\
    \bottomrule
  \end{tabularx}
```
```{=latex}
\FloatBarrier
```
## LaTeX

We tag LaTeX with a lightweight regex pipeline that separates commands, environments, math, tables, grouping symbols, text, comments, and newlines.

```{=latex}
\footnotesize
```
::: {#tab:latex_tags}
  `\arraybackslash`{=latex}Coarse        `\arraybackslash`{=latex} Fine-grained tags                                          Examples
  -------------------------------------- ------------------------------------------------------------------------------------ ---------------------------
  `\arraybackslash`{=latex}Command       `\arraybackslash`{=latex} `control`, `section`, `ref`, `cite`, `formatting`, etc.    `\section`, `\textbf`
  `\arraybackslash`{=latex}Environment   `\arraybackslash`{=latex} `begin/end`, `begin/end_math`, `begin/end_table`           `\begin{table}`
  `\arraybackslash`{=latex}Math          `\arraybackslash`{=latex} `inline/display_open/close`, `env_name`                    `$`, `$$`
  `\arraybackslash`{=latex}Table         `\arraybackslash`{=latex} `ampersand`, `hline`, `rule`, `multicolumn`, `linebreak`   `&`, `\\`
  `\arraybackslash`{=latex}Group         `\arraybackslash`{=latex} `brace/bracket_open/close`                                 `{`, `}`
  `\arraybackslash`{=latex}Text          `\arraybackslash`{=latex} `content`, `space`                                         ordinary words and spaces
  `\arraybackslash`{=latex}Comment       `\arraybackslash`{=latex} `comment`                                                  `% ...`
  `\arraybackslash`{=latex}Special       `\arraybackslash`{=latex} `escaped`, `tilde`, `caret`, `underscore`                  escaped symbols
  `\arraybackslash`{=latex}Newline       `\arraybackslash`{=latex} `newline`, `blank_line`                                    line-level layout

  : LaTeX token taxonomy.
:::

```{=latex}
\FloatBarrier
```
# Controlled synthetic probes: templates and scoring {#app:controlled_details}

The main text reports results for three controlled probe families (pronoun memory, entity tracking, structural closure). This section records the minimal information needed to reproduce the probes.

#### Distances and sampling.

We evaluate antecedent distances $d \in \{32, 64, 128, 256, 512, 1024\}$. For each family and distance we generate a fixed number of examples with a shared filler distribution across models.

#### Scoring.

Pronoun memory and entity tracking are scored *contrastively* at the target position using (i) accuracy $\mathbf{1}[m>0]$ and (ii) a log-probability margin $m = \log p(y^+ \mid c) - \log p(y^- \mid c)$, where $y^+$ is the correct target and $y^-$ is a matched distractor under the same prefix $c$. Structural closure is scored by the raw NLL on the closing token, $\ell_{\mathrm{close}} = -\log
p(y_{\mathrm{close}}\mid c)$.

#### Prompt templates.

We use simple, highly regular templates so that the only manipulated factor is antecedent distance.

-   **Pronoun memory:** introduce two entities with different genders and roles; later, reference the role and force a pronoun choice (e.g., `\tok{he}`{=latex} vs. `\tok{she}`{=latex}).

-   **Entity tracking:** bind two same-gender entities to distinct attributes; later, query the attribute and force a name choice.

-   **Structural closure:** open a structure (e.g., an HTML tag or bracketed span), insert filler, and score the closing token.

```{=latex}
\FloatBarrier
```
# Additional regression control diagnostics {#app:control_coefficients}

The main text focuses on the dedicated repeated-$n$-gram analysis in Figure `\ref{fig:bracket_copy_gap}`{=latex}. Here we report the remaining control-feature diagnostics from the original coarse-tag regression. Figure `\ref{fig:appendix_control_coefficients}`{=latex} separates two quantities: raw paired gaps for concrete repeated-token subsets, and fitted coefficients from the frequency-adjusted regression in Equation `\ref{eq:prose_reg}`{=latex}. The raw gaps ask whether the hybrid has lower NLL on a subset in absolute terms; the coefficients ask how each feature shifts the paired hybrid--transformer gap after controlling for domain, tag, word position, sequence position, difficulty, previous-token distance, and target-token frequency.

The sign convention is the same as in the main text. Positive raw gaps mean lower hybrid NLL, while positive coefficients increase the adjusted hybrid advantage. The repeated-token coefficients are negative or near zero, consistent with the conclusion that visible-prefix reuse reduces the hybrid advantage once comparable tokens are matched by the regression. The remaining controls are included as adjustment variables rather than central findings.

![**Full control-feature coefficient diagnostic.** Left: raw paired loss gaps for repeated 1--4-gram events. Right: frequency-adjusted coefficients from the coarse-tag regression, including difficulty, position, previous-token distance, target-token frequency, and repeated-token controls. Raw gaps show absolute hybrid advantage on repeated-token subsets, whereas coefficients show adjusted shifts in the paired hybrid--transformer gap after the full regression specification.](figures_main/fig_m3_freq_controls_raw_vs_coef.png){#fig:appendix_control_coefficients width="0.86\\linewidth"}

```{=latex}
\FloatBarrier
```
# Limitations {#app:limitations}

Our analysis has several limitations. First, our coarse POS, bracket, and copy tags are noisy proxies for the underlying computation a token requires. A noun token can be a state-conditioned readout in one context and a near-copy in another; regression controls reduce but do not eliminate this confounding. Second, the tag-stratified and regression analyses are correlational. The synthetic probes provide controlled manipulation, but each isolates a narrow construction and does not directly measure the contribution of individual recurrent layers. Third, our interpretation in §`\ref{sec:discussion}`{=latex} is a description of where the gap concentrates, not a mechanistic claim. We do not localize the hybrid advantage to specific layers or heads, and dynamic-semantics formalisms are used as a high-level lens rather than as a probed representation. Fourth, all evaluations are in English prose, Python, HTML, and LaTeX; we do not test other natural languages or programming languages.

# Compute resources {#app:compute}

All token-level analyses use forward passes only (no training) on the released Olmo 3 7B and Olmo Hybrid 7B checkpoints. Scoring the full evaluation suite (packed length-8192 sequences across the four prose and three structured domains) requires approximately 100 GPU-hours on a single 8xH100 node. The 1B filtered-evaluation curves in §`\ref{sec:higher_signal_evals}`{=latex} re-score released WSD-annealed checkpoints from @merrill2026olmohybrid and require approximately 20 additional GPU-hours.

# Licenses for existing assets {#app:licenses}

We use the following released artifacts. **Models:** Olmo 3 7B and Olmo Hybrid 7B [@olmo2025olmo3; @merrill2026olmohybrid] are released under the Apache 2.0 license; the 1B development checkpoints are released under the same terms. **Datasets:** PG-19 (Apache 2.0); CC-News (CC-BY); Wikipedia (CC BY-SA 3.0/4.0 / GFDL); ArXiv articles (per individual paper licenses; we use only abstracts/text exposed via the public dump). **Tools:** NLTK (Apache 2.0) for the Brown POS tagset; the Python `tokenize` module (PSF license). We use these assets in accordance with their licenses for non-commercial research.

[^1]: Because $\Delta_i$ is a log-probability difference, a mean gap of $\bar{\Delta}$ nats corresponds to a geometric-mean probability ratio of $\exp(\bar{\Delta})$ in favor of the hybrid. For example, $0.04$ nats is about a $4.1\%$ probability ratio. It is not a log-odds ratio.

[^2]: [Content]{.smallcaps} includes open-class lexical categories such as nouns, main verbs, adjectives, and adverbs. [Function]{.smallcaps} includes closed-class grammatical categories such as determiners, prepositions, conjunctions, pronouns, auxiliaries, modals, wh-categories, particles, and infinitival [TO]{.smallcaps}. [Other]{.smallcaps} includes punctuation, brackets, symbols, numerals, and remaining tags.

[^3]: One caveat is that our tags and copy features are noisy proxies for these latent components, so individual tokens may mix several regimes. The argument should therefore be read as a token-family-level explanation of the observed directions.
