---
abstract: |
  `\noindent `{=latex}Traditional fixed-depth architectures scale quality by increasing training FLOPs, typically through more parameters (at the expense of a higher memory footprint) or more data. A potential alternative is *looped architectures*, which instead increase FLOPs by sending activations through a block of layers in a loop. While promising, existing recipes for training looped architectures can be unstable, suffering from residual explosion and loss spikes. We address these challenges by recasting looping as a nonlinear time-variant dynamical system over the residual stream. Via a linear approximation to this system, we find that instability occurs in existing looped architectures as a result of large spectral norms in their injection parameters. To address these instability issues, we propose ***Parcae***, a novel *stable*, looped architecture that constrains the spectral norm of the injection parameters via discretization of a negative diagonal parameterization. As a result, Parcae achieves up to 6.3% lower validation perplexity than prior large-scale looped models. Using our stable looped architecture, we investigate the scaling properties of looping as a medium to improve quality by increasing FLOPs at training and test time. For training, we derive predictable power laws to scale FLOPs while keeping parameter count fixed. Our initial scaling laws suggest that looping and data should be increased in tandem, given a fixed FLOP budget. At test time, we find that Parcae can use looping to scale compute, following a predictable, saturating exponential decay. When scaled up to 1.3B parameters, we find that Parcae improves Core and Core-Extended quality by 2.99 and 1.18 points when compared to strong Transformer baselines under a fixed parameter and data budget, achieving a relative quality of up to 87.5% of a Transformer twice the size.
author:
- Hayden Prairie
- Zachary Novack
- 'Taylor Berg-Kirkpatrick'
- Daniel Y. Fu
bibliography:
- main.bib
title: 'Parcae: Scaling Laws For Stable Looped Language Models'
---

```{=latex}
\newcommand{\theHalgorithm}{\arabic{algorithm}}
```
```{=latex}
\newcommand{\sysname}{\textsc{Parcae}\xspace}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand*{\ShowNotes}{}
```
```{=latex}
\newcommand{\yell}[1]{\textcolor{red}{#1}}
```
```{=latex}
\newcommand{\assume}[1]{#1}
```
```{=latex}
\newcommand{\norm}[1]{\left|\left|#1\right|\right|}
```
```{=latex}
\newcommand{\argmin}[2]{\textrm{argmin}_{#1}~#2}
```
```{=latex}
\newcommand{\argmax}[2]{\textrm{argmax}_{#1}~#2}
```
```{=latex}
\newcommand{\ip}[2]{\left\langle#1, #2\right\rangle}
```
```{=latex}
\newcommand{\tr}{\textrm{tr}}
```
```{=latex}
\newcommand{\E}[2]{\mathbb{E}_{#1}\left[#2\right]}
```
```{=latex}
\newcommand{\Ehat}[1]{\hat{\mathbb{E}}\left[#1\right]}
```
```{=latex}
\newcommand{\Var}[2]{\textrm{Var}_{#1}\left[#2\right]}
```
```{=latex}
\newcommand{\Cov}[2]{\textrm{\textbf{Cov}}_{#1}\left[#2\right]}
```
```{=latex}
\newcommand{\ind}[1]{\mathbbm{1}\left\{#1\right\}}
```
```{=latex}
\newcommand{\indpm}[1]{\mathbbm{1}^{\pm}\left\{#1\right\}}
```
```{=latex}
\newcommand{\sech}[0]{\textrm{sech}}
```
```{=latex}
\newcommand{\diagm}[1]{\textrm{diagm}\left(#1\right)}
```
```{=latex}
\newcommand{\supp}{\text{supp}}
```
```{=latex}
\newcommand\independent{\protect\mathpalette{\protect\independenT}{\perp}}
```
```{=latex}
\def\independenT#1#2{\mathrel{\rlap{$#1#2$}\mkern2mu{#1#2}}}
```
```{=latex}
\newcommand{\colornote}[3]{{\color{#1}\bf{#2 #3}\normalfont}}
```
```{=latex}
\newcommand{\colornote}[3]{}
```
```{=latex}
\newcommand {\todo}[1]{\colornote{cyan}{TODO}{#1}}
```
```{=latex}
\newcommand {\authorone}[1]{\colornote{darkgreen}{A1:}{#1}}
```
```{=latex}
\newcommand {\authortwo}[1]{\colornote{burntorange}{A2:}{#1}}
```
```{=latex}
\newcommand {\authorthree}[1]{\colornote{red}{A3:}{#1}}
```
```{=latex}
\newcommand{\num}[1]{{\color{red}\bf{#1}\normalfont}}
```
```{=latex}
\newcommand{\num}[1]{#1}
```
```{=latex}
\newcommand{\meanrecurrence}{$\mu_{\text{rec}}$\xspace}
```
```{=latex}
\newcommand{\meanbackward}{$\mu_{\text{bwd}}$\xspace}
```
```{=latex}
\newcommand{\prelude}{$\mathcal{P}$\xspace}
```
```{=latex}
\newcommand{\recurrent}{$\mathcal{R}$\xspace}
```
```{=latex}
\newcommand{\coda}{$\mathcal{C}$\xspace}
```
```{=latex}
\newcommand{\dt}{\Delta}
```
```{=latex}
\newcommand{\A}{\bm{A}}
```
```{=latex}
\newcommand{\B}{\bm{B}}
```
```{=latex}
\newcommand{\C}{\bm{C}}
```
```{=latex}
\newcommand{\D}{\bm{D}}
```
```{=latex}
\newcommand{\da}{\overline{\bm{A}}}
```
```{=latex}
\newcommand{\db}{\overline{\bm{B}}}
```
```{=latex}
\newcommand{\dA}{\overline{\bm{A}}}
```
```{=latex}
\newcommand{\dB}{\overline{\bm{B}}}
```
```{=latex}
\newcommand{\ddt}{\overline{\bm{\Delta}}}
```
```{=latex}
\newcommand{\AB}{(\A, \B)}
```
```{=latex}
\newcommand{\ABC}{(\A, \B, \C)}
```
```{=latex}
\newcommand{\dtAB}{(\dt, \A, \B)}
```
```{=latex}
\newcommand{\dtABC}{(\dt, \A, \B, \C)}
```
```{=latex}
\newcommand{\dAB}{(\dA, \dB)}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\newcommand{\notimplies}{%
  \mathrel{{\ooalign{\hidewidth$\not\phantom{=}$\hidewidth\cr$\implies$}}}}
```
```{=latex}
\setlength{\parindent}{0pt}
```
`\setlength{\parskip}{0.5em}`{=latex}

```{=latex}
\maketitle
```
`{hprairie,znovack,tberg,danfu}@ucsd.edu`

Introduction {#sec:intro}
============

Scaling laws have established that model performance improves predictably with increased FLOPs [@kaplan2020scalinglawsneurallanguage; @hoffmann2022trainingcomputeoptimallargelanguage], typically by increasing parameter count or training data. These scaling laws suggest that FLOP-optimal training increases parameters and training data in tandem following empirical power laws. As a result, the depth and width of state-of-the-art models have grown in an effort to scale with data, subsequently inflating the memory footprint to deploy these models [@dettmers2023case4bitprecisionkbit; @lin2024awqactivationawareweightquantization].

However, as inference deployments take on an increasingly large portion of compute [@touvron2023llamaopenefficientfoundation], and deployments begin to move to the edge [@moon2024lpulatencyoptimizedhighlyscalable; @narayan2025minionscostefficientcollaborationondevice], there is increasing interest in scaling model quality without increasing parameters. One mechanism to do this is layer-looped models, such as looped transformers [@dehghaniUniversalTransformers2019; @geiping_scaling_2025; @zhuScalingLatentReasoning2025], which iteratively loop activations through a block of layers. Initial results have been encouraging, with looped models matching the quality of larger fixed-depth architectures [@geiping_scaling_2025; @zhuScalingLatentReasoning2025]. Moreover, they show potential for latent reasoning [@avi_learn_algorithm; @yangLoopedTransformersAre2023] and per-token adaptive compute [@geiping_scaling_2025; @mcleish_retrofitted_recurrence].

```{=latex}
\begin{figure*}[t]
    
    
    \includegraphics[trim={1.36cm 0 0 0cm},clip, width=\linewidth]{figures/main/main_fig.pdf}
    
    \caption{\textbf{\sysname and the Scaling Laws of Looping.}
    (\emph{Left}) \sysname constrains the spectral norm of $\dA$ and normalizes the input injection, stabilizing the residual stream $h_t$ across loops. (\emph{Right}) We observe looping to be an orthogonal axis of scaling compute which follows a power law.}
    \label{fig:block-diagram}
    
\end{figure*}
```
Unfortunately, prior research [@geiping_scaling_2025; @mcleish_retrofitted_recurrence; @LoopFormerElasticDepthLooped2025] and our work observe these models' training to be unstable, exhibiting residual state explosion and loss spikes. Since these models loop the layers of complex non-linear architectures (e.g., transformer blocks [@vaswani2023attentionneed]), the source of instability in looped models can be difficult to understand analytically. As a result, training requires sensitive hyperparameter selection and residual normalization (e.g., Post-Norm) to correct this instability [@geiping_scaling_2025]. Furthermore, even in convergent training runs, we observe loss spikes as looped models train on stochastic amounts of depth to induce stronger test-time scaling [@anilPathIndependentEquilibrium]. In this paper, we study this instability and ask whether stabilizing these models can unlock looping as a predictable, orthogonal axis for scaling compute.

To analyze instability, we observe that prior looped architectures can be recast as a nonlinear time-variant dynamical system over the residual stream [@olsson2022incontextlearninginductionheads], taking the form: $$\label{eq:alpha-af}
    h_{t+1} = \dA h_t + \dB e + \overline{\mathcal{R}}(h_t, e),$$ where for an input $e$, the hidden state $h$ across the depth of an architecture is modulated by $\dA$, controlling the balance between prior and current residual states; $\dB$, conditioning the residual on the input $e$; and a non-linear operator $\overline{\mathcal{R}}$, which subsumes the original transformer modules (e.g., Attention, MLPs). By linearizing this framework (i.e., removing $\overline{\mathcal{R}}$), we observe that `\cref{eq:alpha-af}`{=latex} reduces to a linear time-invariant (LTI) system, for which classic control theory gives divergence conditions on the residual stream based on the spectral norm of $\dA$. We observe that prior looped architectures can learn unstable parameterizations of $\dA$, which we empirically find to induce residual stream explosion (see `\cref{fig:spectral-norm}`{=latex}).

To address these issues, we propose ***`\sysname`{=latex}***, a novel looped transformer that corrects the parameter instability conditions of `\cref{eq:alpha-af}`{=latex} and uses algorithmic fixes to reduce loss spikes during training. ***`\sysname`{=latex}*** explicitly uses discretization on a continuous representation $\A$ of `\cref{eq:alpha-af}`{=latex} and parametrizes $\A$ as a negative diagonal matrix, constraining the spectral norm to prevent residual explosion in looped layers. Additionally, `\sysname `{=latex}introduces a normalization on $e$, which empirically prevents loss spikes in late stages of training. Finally, `\sysname `{=latex} modifies the training algorithm (which aims to minimize the expected loss over variable depths) by enabling intra-batch per-sequence depth sampling to further reduce loss spikes.

We evaluate `\sysname `{=latex}on end-to-end quality, training FLOP scaling, and test-time scaling:

-   **End-to-End Quality.** We compare `\sysname `{=latex}against parameter- and data-matched RDMs [@geiping_scaling_2025] and Transformers. Against RDMs, `\sysname `{=latex}reduces validation perplexity by up to 6.3%. When scaled up to 1.3B parameters and 100B tokens, `\sysname `{=latex}outperforms parameter-matched Transformers by up to 2.99 and 1.18 points on Core and Core-Extended [@li2025datacomplmsearchgenerationtraining] benchmarks, respectively, matching Transformers up to twice the size.

-   **Training FLOP Scaling.** To evaluate FLOP training scaling, we study scaling laws for looping in a parameter-matched isoFLOP setting (i.e., whether to scale FLOPs with increased data or looping). We find that looping introduces an orthogonal scaling axis, similar to parameters and data. Specifically, FLOP-optimal training increases looping and data following empirical power laws (see `\cref{fig:block-diagram}`{=latex} \[*right*\]).

-   **Test-Time Scaling.** We study looping as a mechanism to scale test-time compute, observing that recurrence follows predictable exponential decay with an irreducible loss. We further combine both test-time and training power laws to create a single unifying scaling law for looping in `\sysname `{=latex}models.

Background {#sec:background}
==========

We first provide a brief background on looped models (`\cref{sec:rdm-basics}`{=latex}), LTI systems (`\cref{sec:lti-basics}`{=latex}), and modeling scaling laws (`\cref{sec:scaling-laws-basics}`{=latex}). Prior work has studied looped architectures along several design axes: loop placement (pre-, mid-, or post-looping) [@saunshiReasoningLatentThoughts2025b], halting mechanism (explicit routers [@baeMixtureofRecursionsLearningDynamic2025; @zhuScalingLatentReasoning2025] vs. implicit stochastic depth [@geiping_scaling_2025; @mcleish_retrofitted_recurrence]), topology (single block [@geiping_scaling_2025] or hierarchical [@wangHierarchicalReasoningModel2025b; @jolicoeur-martineauLessMoreRecursive2025]) and differentiation (explicit or implicit backpropagation [@bai2019deepequilibriummodels]). Our work focuses on implicit-halting middle-looped architectures using explicit differentiation; an extended review is in `\cref{sec:lit-review}`{=latex}.

Existing Middle-Looped Architectures {#sec:rdm-basics}
------------------------------------

In this paper, we focus on middle-looped architectures [@saunshiReasoningLatentThoughts2025b; @geiping_scaling_2025]. A middle-looped recurrent-depth architecture contains three units: an initial *prelude* unit `\prelude`{=latex}, a middle *recurrent* unit `\recurrent`{=latex}, and a final *coda* unit `\coda`{=latex}. Formally, given an input $s \in V^n$, where $V$ is the vocabulary and $n$ is the sequence length, the outputs $p \in \mathbb{R}^{n \times |V|}$ can be computed by the following update rule: $e = \mathcal{P}(s),~h_{t+1} = \mathcal{R}(h_t, e),~p = \mathcal{C}(h_T),$ where $h_0 \sim \mathcal{N}(0, \sigma^2 I_{d\times d})$ and $d$ is the embedding dimension. Intuitively, `\prelude `{=latex}embeds inputs into the latent space, conditioning `\recurrent `{=latex}as it recursively updates the hidden state $h_t \in \mathbb{R}^{n \times d}$ for $T$ iterations, which `\coda `{=latex}uses to generate $p$. Within `\recurrent`{=latex}, prior work injects $e$ using addition $h_{t+1} = \mathcal{R}(h_t + e)$ [@yangLoopedTransformersAre2023] or concatenation with projection $h_{t+1} = \mathcal{R}(W[h_t; e])$ [@geiping_scaling_2025], where $W \in \mathbb{R}^{d \times 2d}$.
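
The update rule above can be sketched in a few lines of plain Python. This is a toy illustration only: the three units are stand-in functions on flat lists of floats (the names `toy_prelude`, `toy_recurrent`, and `toy_coda` are hypothetical), not transformer blocks, and the recurrent unit uses additive injection followed by a fixed contraction.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]

def looped_forward(
    s: Sequence[int],
    prelude: Callable[[Sequence[int]], Vector],
    recurrent: Callable[[Vector, Vector], Vector],
    coda: Callable[[Vector], Vector],
    num_loops: int,
    sigma: float = 1.0,
    seed: int = 0,
) -> Vector:
    """Compute e = P(s), h_{t+1} = R(h_t, e) for T loops, then p = C(h_T)."""
    rng = random.Random(seed)
    e = prelude(s)
    # Random initial state h_0 ~ N(0, sigma^2), one entry per latent coordinate.
    h = [rng.gauss(0.0, sigma) for _ in e]
    for _ in range(num_loops):
        h = recurrent(h, e)
    return coda(h)

# Toy stand-ins (assumptions for illustration, not the paper's modules).
def toy_prelude(s: Sequence[int]) -> Vector:
    return [float(tok) for tok in s]

def toy_recurrent(h: Vector, e: Vector) -> Vector:
    # Additive injection R(h + e), with R(x) = 0.5 * x as a contraction.
    return [0.5 * (hi + ei) for hi, ei in zip(h, e)]

def toy_coda(h: Vector) -> Vector:
    return list(h)

out = looped_forward([1, 2, 3], toy_prelude, toy_recurrent, toy_coda, num_loops=50)
```

Because the toy recurrent unit is a contraction, the state forgets $h_0$ and settles at the fixed point $h = e$, so `out` is numerically `[1.0, 2.0, 3.0]` regardless of the seed.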

While looped models can be viewed as weight-sharing layers, modern variants allow for variable depth. During training, depth $T$ is sampled per micro-batch [@bansalEndtoendAlgorithmSynthesis2022] from $\Lambda$ (e.g., Poisson with mean `\meanrecurrence`{=latex}), exposing the model to variable depths for stronger test-time scaling [@anilPathIndependentEquilibrium]. The training objective thus minimizes the expectation over the dataset and $\Lambda$. Lastly, truncated backpropagation through depth, analogous to BPTT [@Hinton2013TrainingRN], limits the backward pass to a constant `\meanbackward `{=latex}[@geiping_scaling_2025].

#### Stability.

@geiping_scaling_2025 found looped models unstable at scale and adopted a block pattern, combining Pre- and Post-Norm to normalize the residual: $\bar{x}^{(\ell)} = \text{LN}(\text{MHA}(\text{LN}(x^{(\ell-1)})) + x^{(\ell-1)}), \quad x^{(\ell)} = \text{LN}(\text{FFN}(\text{LN}(\bar{x}^{(\ell)})) + \bar{x}^{(\ell)})$ where $\mathrm{LN}(\cdot)$ denotes layer normalization, $\mathrm{MHA}(\cdot)$ multi-head attention, and $\mathrm{FFN}(\cdot)$ feed-forward networks. We later show that residual normalization is unnecessary when stability is properly controlled.

Linear Time-Invariant Dynamical Systems {#sec:lti-basics}
---------------------------------------

To study the instability of looped models, we will use an LTI dynamical system as a tractable linear surrogate for complex non-linear looped models. In control theory, LTI systems are formalized through first-order differential equations $\dot{h}(t)= \A h(t) + \B e(t),~y(t) = \C h(t)$ that describe the evolution of a hidden state $h(t) \in \mathbb{R}^{d_h}$ given an input signal $e(t) \in \mathbb{R}^{d_e}$, where $\A \in \mathbb{R}^{d_h \times d_h}$ governs the dynamics of the system, $\B \in \mathbb{R}^{d_h \times d_e}$ controls how external inputs influence the state, and $\C \in \mathbb{R}^{d_e \times d_h}$ projects the hidden state to the output $y(t) \in \mathbb{R}^{d_e}$. The continuous system can be discretized to obtain $h_{t} = \dA h_{t-1} + \dB e_t,  y_t = \C h_t$ using a step size $\dt$; for instance, zero-order hold (ZOH) would yield $\dA = \exp(\Delta \A)$ and $\dB = (\Delta \A)^{-1}(\exp(\Delta \A) - I) \cdot \Delta \B$.

LTI systems fall into three regimes: *stable* (bounded and convergent), *marginally stable* (oscillatory), and *unstable* (explosive and divergent). A fundamental property of LTI systems is that their *stability* is determined by the eigenvalues of $\A$. Continuous LTI systems require the eigenvalues of $\A$ to have negative real parts; discrete LTI systems require $\rho(\dA) < 1$ [@1082819], where $\rho$ denotes the spectral radius (the largest eigenvalue magnitude), with unstable systems having $\rho(\dA)>1$.
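
A small numerical sketch makes the three regimes concrete. It iterates a single channel of the discrete system, $h_{t+1} = \bar{a} h_t + \bar{b} e$, with a constant input; the specific values of $\bar{a}$ are arbitrary choices for illustration. With a constant input, the $\bar{a} = 1$ case accumulates the input linearly rather than oscillating.

```python
from typing import List

def simulate_lti(a_bar: float, b_bar: float, e: float, steps: int, h0: float = 0.0) -> List[float]:
    """Iterate h_{t+1} = a_bar * h_t + b_bar * e and return |h_t| at each step."""
    h = h0
    trajectory = []
    for _ in range(steps):
        h = a_bar * h + b_bar * e
        trajectory.append(abs(h))
    return trajectory

stable = simulate_lti(a_bar=0.9, b_bar=1.0, e=1.0, steps=200)    # |a_bar| < 1
marginal = simulate_lti(a_bar=1.0, b_bar=1.0, e=1.0, steps=200)  # |a_bar| = 1
unstable = simulate_lti(a_bar=1.1, b_bar=1.0, e=1.0, steps=200)  # |a_bar| > 1
```

The stable run converges to the fixed point $\bar{b} e / (1 - \bar{a}) = 10$, the marginal run grows linearly with the number of steps, and the unstable run grows geometrically.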

Modeling Scaling Laws {#sec:scaling-laws-basics}
---------------------

We follow @hoffmann2022trainingcomputeoptimallargelanguage, which modeled scaling law behaviors via parabolic and parametric fits for varying model sizes and training tokens with a fixed FLOP budget. For parabolic fits, a quadratic is fit at each of several FLOP budgets to estimate the loss-optimal model size or number of training tokens. For parametric fits, a functional form of $\widehat{\mathcal{L}}(N,D) = E + X \cdot N^{-x} + Y \cdot D^{-y}$ is fit by minimizing, with L-BFGS [@lbfgs], the Huber loss [@huber] between the predicted and empirical log loss values for varying parameters $N$ and tokens $D$.
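
The parametric objective can be sketched as follows. The data here are synthetic (generated from arbitrary coefficients), and the Huber threshold `delta` is an illustrative assumption; the function names are hypothetical and only the objective is shown, not the optimizer.

```python
import math
from typing import Iterable, Tuple

Params = Tuple[float, float, float, float, float]  # (E, X, x, Y, y)

def predicted_loss(n_params: float, n_tokens: float, E: float, X: float, x: float, Y: float, y: float) -> float:
    """Parametric form L(N, D) = E + X * N^(-x) + Y * D^(-y)."""
    return E + X * n_params ** (-x) + Y * n_tokens ** (-y)

def huber(residual: float, delta: float) -> float:
    """Quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)

def fit_objective(params: Params, runs: Iterable[Tuple[float, float, float]], delta: float = 1e-3) -> float:
    """Sum of Huber losses between predicted and observed log loss."""
    E, X, x, Y, y = params
    total = 0.0
    for n_params, n_tokens, observed in runs:
        pred = predicted_loss(n_params, n_tokens, E, X, x, Y, y)
        total += huber(math.log(pred) - math.log(observed), delta)
    return total

# Synthetic runs generated from arbitrary "true" coefficients (illustrative only).
true_params = (1.7, 400.0, 0.34, 410.0, 0.28)
grid = [(n, d) for n in (1e8, 3e8, 1e9) for d in (2e9, 2e10)]
runs = [(n, d, predicted_loss(n, d, *true_params)) for n, d in grid]

at_truth = fit_objective(true_params, runs)
perturbed = fit_objective((1.8, 400.0, 0.34, 410.0, 0.28), runs)
```

The objective is zero at the generating coefficients and positive once they are perturbed. In practice, `fit_objective` would be handed to an L-BFGS routine (for example `scipy.optimize.minimize` with `method="L-BFGS-B"`), typically from several initializations.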

Understanding Instability in Looped Architectures {#sec:dynamical-systems}
=================================================

```{=latex}
\begin{figure*}[t]
    
    \includegraphics[width=\linewidth]{figures/main/instability_comparison.pdf}
    \caption{\textbf{Training Instability of Looped Architectures.} (\emph{left}) Pre-Norm looped models diverge, while residual norm. and \sysname converge. (\emph{right}) Instability stems from an exploding recurrent state norm $||h_T||_2$, the hidden embedding norm after $T$ recurrences.}
    \label{fig:instability}
    
\end{figure*}
```
In this section, we study the instability of looped architectures. Using an LTI view over the residual, we find that instability stems from an unconstrained residual state explosion (`\cref{fig:instability}`{=latex}; `\cref{tab:hyperparameters}`{=latex} \[*Baseline*\]; `\cref{sec:stability-ablations}`{=latex}). While residual normalization helps mitigate this issue, it requires sensitive hyperparameter tuning (`\cref{tab:hyperparameters}`{=latex} \[*Res. Norm*\]), similar to fixed-depth transformers [@xu2019understandingimprovinglayernormalization; @xiongLayerNormalizationTransformer2020]. Using this LTI framework, we derive stability conditions for the eigenvalues of $\dA$. We find that prior work does not satisfy these conditions for $\dA$, which we empirically verify creates major state explosion (`\cref{fig:spectral-norm}`{=latex}).

#### Dynamical System over Residual Stream.

Our key insight is to recast the forward pass as a dynamical system over the residual stream. Consider a transformer-based looped model as defined in `\cref{sec:rdm-basics}`{=latex} for language modeling, where `\prelude `{=latex}is an embedding layer that maps a sequence of tokens $s \in V^n$ into embedding space $e \in \R^{n \times d_h}$, `\coda `{=latex}is a projection head $g: \R^{d_h} \to \R^{|V|}$ that maps into probability space, and `\recurrent `{=latex}is parameterized with $L$ transformer blocks. While several methods of input injection could condition `\recurrent `{=latex}on $e$, building on prior work [@yang2024loopedtransformersbetterlearning; @geiping_scaling_2025; @mcleish_retrofitted_recurrence], we focus on linear methods of injection (e.g., $\mathcal{R}(h_t, e) = \mathcal{R}(W_1 h_t + W_2 e)$, where $W_1 \in \R^{d_h  \times d_h}$ and $W_2 \in \R^{d_h \times d_e}$).[^1]

Recall that `\recurrent `{=latex}denotes the full recurrent update $h_{t+1} = \mathcal{R}(h_t,e)$, encompassing all transformer operations, including residual connections. The recurrent update can be exactly formulated as a non-linear time-variant dynamical system of the form $h_{t} = \dA h_{t-1} + \dB e + \overline{\mathcal{R}}(h_{t-1}, e), ~ y_t = \C h_t,$ where $\C \in \R^{d_c \times d_h}$ decouples the `\coda `{=latex}and `\recurrent `{=latex}embedding dimensions (i.e., $p=\mathcal{C}(\C h_T)$). This derivation is shown in `\cref{sec:derivation-instabilty}`{=latex}. Though this formulation does not immediately elucidate instability, linearizing this system (i.e., dropping $\overline{\mathcal{R}}$) yields a discrete LTI system of the form: $$h_{t+1} = \dA h_t + \dB e
\label{eq:lti}$$

```{=latex}
\renewcommand{\arraystretch}{1.15}
```
```{=latex}
\small
```
::: {#tab:stability-equation-comparison}
  Method                                          $\dA$                                     $\dB$                     $\rho(\dA)$          LTI Stability
  --------------------------- --------------------------------------------- ------------------------------------- -------------------- ---------------------
  Addition                                         $I$                                       $I$                    $\rho(\dA) = 1$     *marginally-stable*
  Concatenation                           $\R^{d_h \times d_h}$                     $\R^{d_h \times d_e}$          $\rho(\dA) \in \R$       *unstable*
  `\sysname `{=latex}(ours)    $\text{ZOH}(\texttt{Diag}(-\exp(\R^{d_h})))$  $\text{Euler}(\R^{d_h \times d_e})$    $\rho(\dA) < 1$          *stable*

  : **Comparison of Prior Update Rule Stability based on LTI Representation.**
:::

```{=latex}
\small
```
::: {#fig:spectral-norm}
  **LR**        **Base**          **Res. Norm**        **Parcae**
  -------- ------------------- ------------------- -------------------
  2e-4      `\cmark `{=latex}   `\cmark `{=latex}   `\cmark `{=latex}
  4e-4      `\xmark `{=latex}   `\cmark `{=latex}   `\cmark `{=latex}
  6e-4      `\xmark `{=latex}   `\xmark `{=latex}   `\cmark `{=latex}
  8e-4      `\xmark `{=latex}   `\xmark `{=latex}   `\cmark `{=latex}
  1e-3      `\xmark `{=latex}   `\xmark `{=latex}   `\cmark `{=latex}

  : **Spectral Radius of Unconstrained $\protect\dA$.** For a Pre-Norm RDM, we plot $\rho(\dA)$ throughout training using different learning rates, observing that divergent runs learn $\rho(\dA) > 1$. The state explosion in `\cref{fig:instability}`{=latex} is thus directly linked to $\dA$.
:::

```{=latex}
\hfill
```
```{=latex}
\makeatletter
```
```{=latex}
\def\@captype{figure}
```
```{=latex}
\makeatother
```
![image](figures/main/spectral_radius_lr_sweep.png){width="\\linewidth"}

#### State Explosion from Unconstrained $\dA$ and $\dB$.

Analyzing the stability of `\cref{eq:lti}`{=latex} identifies $\rho(\dA)$ as a critical factor governing instability. As shown in `\cref{tab:stability-equation-comparison}`{=latex}, prior work [@geiping_scaling_2025; @yang2024loopedtransformersbetterlearning] chooses parameterizations of $\dA$ such that $\rho(\dA)=1$ or $\rho(\dA)$ is unconstrained. Critically, these are *marginally-stable* or *unstable parameterizations*.

`\cref{fig:spectral-norm}`{=latex} and `\cref{tab:hyperparameters}`{=latex} confirm this empirically: divergent runs learn a spectral radius of $\rho(\dA) \geq 1$, with convergent runs maintaining $\rho(\dA) < 1$, affirming that LTI stability constraints are necessary. Finally, at scale, we observe loss spikes late in training (e.g., after 170k steps), which we address by normalizing the input to $\dB$ (see `\cref{sec:prelude-norm}`{=latex} for ablation).

`\sysname`{=latex}: A Stable Looped Architecture {#sec:parcae}
================================================

Using our dynamical systems framework, we create ***`\sysname`{=latex}***, a looped architecture that explicitly satisfies the stability constraints (`\cref{sec:rfm-derivation}`{=latex}). Additionally, we propose a per-sequence depth sampling method to stabilize variance introduced by variable depth (`\cref{sec:rfm-training}`{=latex}).

Block Design and Stable Parameterization of `\sysname`{=latex} {#sec:rfm-derivation}
--------------------------------------------------------------

We parameterize $\A$ and $\B$ in continuous form, and discretize using a learned $\dt \in \R^{d_h}$ with ZOH and Euler schemes (i.e., $\dA = \exp(\dt \A)$ and $\dB = \dt \B$),[^2] following prior sequence modeling work [@gu2024mambalineartimesequencemodeling; @dao2024transformersssmsgeneralizedmodels]. To achieve our target stability conditions by constraining the eigenvalues of $\A$ to be negative, we parameterize $\A := \texttt{Diag}(-\exp(\texttt{log\_A}))$ as a negative diagonal matrix, where $\texttt{Diag}(-\exp(\cdot))$ of a vector enforces negativity and $\texttt{log\_A}\in \R^{d_h}$ is our learnable vector. While many formulations of $\A$ would work, ensuring negative eigenvalues in the diagonal case is simple and cheap. $\B$ is left unconstrained; however, we introduce a normalization layer to the input $e$ to further stabilize training (see `\cref{sec:prelude-norm}`{=latex} for ablation). With this, our update rule, given an input sequence $s$, becomes $$e = \text{LN}(\mathcal{P}(s)), \qquad h_{t+1} = \dA h_t + \dB e + \overline{\mathcal{R}}(h_t, e), \qquad p = \mathcal{C}(\C h_T),$$ where $h_0 \sim \mathcal{N}(0,~\sigma^2 I_{d_h \times d_h})$ and $T$ is the number of loops.
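
The parameterization can be illustrated with a minimal sketch in plain Python. It is not the reference implementation: the function names are hypothetical, and keeping the step size positive by passing a raw parameter through a softplus is an assumption made here for illustration. Given a positive step, every diagonal entry of $\overline{\bm{A}}$ lies in $(0, 1)$ for any real `log_A`.

```python
import math
from typing import List, Tuple

def softplus(x: float) -> float:
    # Assumption: the step size is kept positive via softplus of a raw parameter.
    return math.log1p(math.exp(x))

def parcae_discretize(
    log_a: List[float], dt_raw: List[float], B: List[List[float]]
) -> Tuple[List[float], List[List[float]]]:
    """Return the diagonal of A_bar (ZOH) and the matrix B_bar (Euler)."""
    dt = [softplus(v) for v in dt_raw]
    a = [-math.exp(v) for v in log_a]                     # A = Diag(-exp(log_A)) < 0
    a_bar = [math.exp(d * ai) for d, ai in zip(dt, a)]    # A_bar = exp(dt * A)
    b_bar = [[d * bij for bij in row] for d, row in zip(dt, B)]  # B_bar = dt * B
    return a_bar, b_bar

def spectral_radius_diag(diag: List[float]) -> float:
    """For a diagonal matrix, the spectral radius is the largest |entry|."""
    return max(abs(v) for v in diag)

log_a = [-3.0, -1.0, 0.0, 2.0, 3.0]
dt_raw = [-3.0, 0.5, 1.0, -2.0, 3.0]
B = [[1.0, -2.0], [0.5, 0.0], [3.0, 1.0], [-1.0, 4.0], [2.0, 2.0]]

a_bar, b_bar = parcae_discretize(log_a, dt_raw, B)
rho = spectral_radius_diag(a_bar)
```

Here `rho` is strictly below one even though `log_a` and `dt_raw` range over positive and negative values, which is the property the negative diagonal parameterization is designed to provide.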

We parameterize `\prelude`{=latex}, $\overline{\mathcal{R}}$, and `\coda `{=latex}using $L_{\mathcal{P}},L_{\mathcal{R}}$ and $L_{\mathcal{C}}$ transformer blocks, respectively. For exact block architecture, we match two different architectural setups: one for prior RDMs [@geiping_scaling_2025] and one for strong Transformer baselines [@nanochat]. `\sysname`{=latex}'s architecture matches RDMs, differing only in residual normalization and the dynamical systems parameters (e.g., $\A, \B, \C, \dt$). Against Transformers, we follow a simplified `nanochat` [@nanochat] setup, where we match the exact architecture, except that we loop the middle third of the layers and include our dynamical systems parameters and a prelude norm. Exact model definitions and a forward pass can be found in `\cref{sec:model-definitions}`{=latex} and `\cref{sec:algorithm}`{=latex}, respectively.

Stable Training Algorithms for `\sysname`{=latex} {#sec:rfm-training}
-------------------------------------------------

We further stabilize Parcae by adjusting the training objective. Specifically, looped models' training objective is $\theta^\star \;=\; \arg\min_{\theta}\; \mathbb{E}_{(x,y)\sim \mathcal{D},\,T\sim \Lambda}\!\left[\;\ell\!\big(f_{\theta}(x;T),\, y\big)\;\right]$, implying that more depths should be sampled per global batch to more faithfully model the expectation over $\Lambda$. Thus, we introduce a per-sequence depth sampling algorithm within a micro-batch, which we empirically observe to reduce loss spikes (ablation in `\cref{sec:loss-spikes}`{=latex}). Additionally, unlike prior work, we parameterize $\Lambda$ based on `\meanrecurrence `{=latex}alone, as we find that truncating based on `\meanbackward `{=latex}significantly hurts extrapolation to both lower and higher recurrences (ablation in `\cref{sec:sampling-truncated-recurrence}`{=latex}). Finally, we choose $\mu_{\text{bwd}} = \lceil \frac{\mu_{\text{rec}}}{2} \rceil$ throughout (see `\cref{sec:scaling-of-truncated-backpropigation}`{=latex} for ablation). A detailed training algorithm is in `\cref{sec:algorithm}`{=latex}.
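
The difference between per-micro-batch and per-sequence depth sampling can be sketched as below. This is an illustration under stated assumptions, not the training algorithm from the appendix: $\Lambda$ is taken to be a plain Poisson with mean $\mu_{\text{rec}}$, depths are clamped to at least one loop, and the helper names are hypothetical.

```python
import math
import random
from typing import List

def sample_poisson(mean: float, rng: random.Random) -> int:
    """Draw from Poisson(mean) using Knuth's multiplication method."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def sample_depths(batch_size: int, mu_rec: float, rng: random.Random, per_sequence: bool) -> List[int]:
    """Return one loop count per sequence in the micro-batch."""
    def draw() -> int:
        return max(1, sample_poisson(mu_rec, rng))  # assumption: at least one loop
    if per_sequence:
        return [draw() for _ in range(batch_size)]  # a fresh depth for each sequence
    return [draw()] * batch_size                    # one depth shared by the micro-batch

def backward_steps(mu_rec: int) -> int:
    """Truncated-backpropagation length, mu_bwd = ceil(mu_rec / 2)."""
    return math.ceil(mu_rec / 2)

rng = random.Random(0)
shared = sample_depths(32, mu_rec=8, rng=rng, per_sequence=False)
varied = sample_depths(32, mu_rec=8, rng=rng, per_sequence=True)
```

With a shared draw, all 32 sequences see the same depth, so a single micro-batch gives a one-sample estimate of the expectation over $\Lambda$; with per-sequence draws the same micro-batch covers many depths.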

Results {#sec:results}
=======

We evaluate `\sysname `{=latex}on end-to-end quality (`\cref{sec:e2e}`{=latex}), training FLOP scaling (`\cref{sec:train-scaling}`{=latex}), and test-time scaling (`\cref{sec:inf-scaling}`{=latex}). We find that `\sysname `{=latex}outperforms both parameter- and data-matched RDMs and Transformers, that optimal looping and data follow predictable power laws, and that test-time looping follows a saturating exponential decay.

```{=latex}
\begin{table*}[!t]
    
    \small
    \setlength{\tabcolsep}{4pt}
    \renewcommand{\arraystretch}{1.05}
    \begin{tabular*}{\textwidth}{@{\extracolsep{\fill}} llccc|ccccccc @{}}
        \toprule
         & \textbf{Model} & $\mathbf{T}$ & Val. & WikiText & Hellaswag & ARC-c & ARC-e & PIQA & BoolQ & SciQ & Avg. \\
         \midrule
         \rotatebox{90}{100M}
         & RDM& 16 & 14.23 & 63.27 & 27.16 & 17.66 & 42.38 & 59.14 & 51.35 & \textbf{72.50} & 45.03 \\
         & \sysname & 16 & \textbf{13.59} & \textbf{60.33} & \textbf{27.18} & \textbf{18.09} & \textbf{43.10} & \textbf{59.30} & \textbf{61.83} & 71.50 & \textbf{46.83} \\
        \midrule
         \rotatebox{90}{350M}
         & RDM& 8  & 10.76 & 41.31 & 28.55 & 20.90 & 47.26 & 61.75 & \textbf{61.53} & 76.70 & 49.45 \\
         & \sysname & 8  & \textbf{10.09} & \textbf{37.53} & \textbf{29.23} & \textbf{21.08} & \textbf{48.78} & \textbf{62.08} & 60.73 & \textbf{78.80} & \textbf{50.12} \\
         \bottomrule
    \end{tabular*}
    \caption{\textbf{Zero-Shot and Perplexity Results Trained on RDM Setup.} Comparison of \sysname and RDM \citep{geiping_scaling_2025} on
    a variety of open-source benchmarks, and perplexity on a held-out validation set and WikiText \citep{merity2016pointer}. \textbf{Best} results are \textbf{bolded}.}
    \label{tab:rdm-parcae}
\end{table*}
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4.5pt}
```
::: {#tab:ablation}
  --------------------- ------------------------- ----------- ----------- ------------------------- -------------------------- -------------------------- ------------------------- ------------------------- -------------------------
                        Val Loss ($\downarrow$)                           Core ($\uparrow$)                                                               Core Ext ($\uparrow$)
  Configuration         $T\!=\!1$                 $T\!=\!4$   $T\!=\!8$   $T\!=\!1$                 $T\!=\!4$                  $T\!=\!8$                  $T\!=\!1$                 $T\!=\!4$                 $T\!=\!8$
  RDM                   *Divergent Training*                              *Divergent Training*                                                            *Divergent Training*
  + Constrained $\dA$   8.99                      3.15        2.97        $-2.0_{\pm0.1}$           $11.0_{\pm0.1}$            $13.2_{\pm0.2}$            $0.5_{\pm0.1}$            $7.8_{\pm0.0}$            $9.1_{\pm0.5}$
  + Per-Seq. Sampling   3.38                      3.01        2.98        $\mathbf{7.6_{\pm0.2}}$   $13.4_{\pm0.2}$            $14.0_{\pm0.2}$            $\mathbf{5.9_{\pm0.4}}$   $9.3_{\pm0.2}$            $\mathbf{9.9_{\pm0.2}}$
  + Prelude Norm        **3.28**                  **2.97**    **2.95**    $7.5_{\pm0.3}$            $\mathbf{13.5_{\pm0.0}}$   $\mathbf{14.0_{\pm0.2}}$   $5.8_{\pm0.3}$            $\mathbf{9.4_{\pm0.1}}$   $9.7_{\pm0.3}$
  --------------------- ------------------------- ----------- ----------- ------------------------- -------------------------- -------------------------- ------------------------- ------------------------- -------------------------

  : **Stability Results Trained on Transformer Setup.** To illustrate stability, we retrofit a baseline 140M Transformer into an RDM and then sequentially add our stability improvements.
:::

`\sysname `{=latex}Improves End-to-End Quality {#sec:e2e}
----------------------------------------------

We compare `\sysname `{=latex}against parameter- and data-matched RDMs and Transformers, finding that `\sysname `{=latex}is more stable than prior looped models and that it outperforms both in quality.

#### Setup.

For RDMs, we follow @geiping_scaling_2025, using the Huginn dataset and tokenizer for training. For Transformers, we follow @nanochat and train on `FineWeb-Edu` [@penedo2024finewebdatasetsdecantingweb]. In both setups, we sweep hyperparameters for the baseline (RDM or Transformer) and reuse the selected values for `\sysname `{=latex}(i.e., we perform no hyperparameter sweeps for `\sysname `{=latex}models). Extended model definitions, hyperparameter selection, and evaluation setup can be found in `\cref{sec:model-definitions}`{=latex}, `\cref{sec:hyperparameters}`{=latex}, and `\cref{sec:evaluation-setup}`{=latex}, respectively.

**Comparison against RDMs**. `\cref{tab:rdm-parcae}`{=latex} shows that `\sysname `{=latex}reduces perplexity by up to 6.2% and 9.1% on a held-out validation set and WikiText [@merity2016pointer], respectively, against prior RDMs [@geiping_scaling_2025], while additionally performing up to 1.8 points better on the average of several downstream benchmarks. `\cref{tab:ablation}`{=latex} shows that each modification of `\sysname `{=latex}contributes: constraining $\dA$ enables convergence at high $T$ (e.g., $\mu_{\text{rec}}=T\!=\!8$), per-sequence sampling stabilizes lower test-time depths, and the prelude norm further improves quality across all $T$ (and late-stage stability; see `\cref{sec:prelude-norm}`{=latex}).

```{=latex}
\begin{table*}[!t]
    
    \small
    \setlength{\tabcolsep}{4pt}
    \renewcommand{\arraystretch}{1.05}
    \begin{tabular*}{\textwidth}{@{\extracolsep{\fill}} llc|cccc @{}}
        \toprule
         & \textbf{Model} & $\mathbf{T}$ & Val. PPL ($\downarrow$) & Lambada PPL ($\downarrow$) & Core ($\uparrow$) & Core-Extended ($\uparrow$) \\
        \midrule
        \rotatebox{90}{140M} & Transformer & -- & 21.48 & 127.39 & 13.00 ± 0.15 & 8.80 ± 0.21 \\
         & \sysname & 8  & \textbf{19.06} & \textbf{80.64} & \textbf{14.04 ± 0.20} & \textbf{9.67 ± 0.28} \\
        \midrule
        \rotatebox{90}{370M} & Transformer & -- & 15.79 & 40.77 & 17.46 ± 0.03 & 11.71 ± 0.22 \\
         & \sysname & 8  & \textbf{14.49} & \textbf{32.74} & \textbf{20.00 ± 0.06} & \textbf{12.75 ± 0.31} \\
        \midrule
        \rotatebox{90}{770M} & Transformer & -- & 13.08 & 22.37 & 22.42 ± 0.20 & 14.20 ± 0.63 \\
         & \sysname & 8  & \textbf{12.49} & \textbf{19.71} & \textbf{25.07 ± 0.33} & \textbf{15.19 ± 0.43} \\
        \midrule
        \rotatebox{90}{1.3B} & Transformer & -- & 11.95 & 17.26 & 25.45 ± 0.08 & 15.90 ± 0.23 \\
         & \sysname & 8  & \textbf{11.42} & \textbf{14.71} & \textbf{28.44 ± 0.28} & \textbf{17.08 ± 0.09} \\
        \bottomrule
    \end{tabular*}
    \caption{\textbf{Comparing \sysname to Fixed-Depth Transformers.} We pretrain Transformers and \sysname with a \texttt{nanochat} setup at several scales, evaluating on a held-out validation set, Lambada \citep{paperno2016lambada}, Core, and Core-Extended \citep{li2025datacomplmsearchgenerationtraining}. 
    \textbf{Best} results are \textbf{bolded.}}
    \label{tab:trans-parcae}
\end{table*}
```
**Comparison Against Transformers.** `\cref{tab:trans-parcae}`{=latex} shows that `\sysname `{=latex}reduces validation perplexity by 4.3--9.2% and improves Core and Core-Extended scores by up to 2.99 and 1.18 points, respectively. We find that our 770M `\sysname `{=latex}model achieves quality comparable to the 1.3B Transformer on Core [@li2025datacomplmsearchgenerationtraining] with roughly half the parameters. Measured as the fraction of the quality gap to the next larger Transformer that `\sysname `{=latex}closes (e.g., for 140M Core-Extended: $\frac{9.67-8.80}{11.71-8.80} \cdot 100 \approx 29.9\%$), `\sysname `{=latex}achieves ***23.3--87.5% and 29.9--58.2%*** better parameter efficiency on Core and Core-Extended, respectively.
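
The gap-fraction arithmetic above can be reproduced directly from the scores in `\cref{tab:trans-parcae}`{=latex}. The following is a minimal sketch; the helper name `gap_fraction` and the dictionary layout are our own illustrative choices, and only the numbers are taken from the table.

```python
# Scores copied from the Transformer-vs-Parcae comparison table.
SIZES = ["140M", "370M", "770M", "1.3B"]
CORE = {"transformer": [13.00, 17.46, 22.42, 25.45],
        "parcae": [14.04, 20.00, 25.07, 28.44]}
CORE_EXT = {"transformer": [8.80, 11.71, 14.20, 15.90],
            "parcae": [9.67, 12.75, 15.19, 17.08]}

def gap_fraction(scores):
    """Percent of the gap to the next-larger Transformer that Parcae closes."""
    t, p = scores["transformer"], scores["parcae"]
    # The largest scale has no next-larger Transformer, so it is skipped.
    return [100.0 * (p[i] - t[i]) / (t[i + 1] - t[i]) for i in range(len(t) - 1)]

core_fracs = gap_fraction(CORE)
ext_fracs = gap_fraction(CORE_EXT)
for size, c, e in zip(SIZES, core_fracs, ext_fracs):
    print(f"{size}: Core {c:.1f}%, Core-Extended {e:.1f}%")
```

The minimum and maximum of each list give the two ranges quoted in the text.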

![**Looping Scales Training Compute Optimally.** (*Left*) Parametric isoLoss contours over `\meanrecurrence `{=latex}and data. The efficient frontier (blue line) traces the lowest FLOP budget required to achieve each loss level, showing that optimal training requires increased looping. (*Right*) Parabolic isoFLOP fits for 140M and 370M models reveal a clear optimum `\meanrecurrence `{=latex}at each FLOP budget, indicating that looping is an orthogonal scaling axis to data.](figures/main/combined_contours_isoflop.png){#fig:train-scaling-laws width="\\linewidth"}

Looping as an Orthogonal Scaling Axis in Training {#sec:train-scaling}
-------------------------------------------------

In this section, we explore the FLOP efficiency of looping under fixed FLOP and parameter budgets. We find that looping introduces an orthogonal axis for scaling compute, where compute-optimal training increases `\meanrecurrence `{=latex}and data in tandem following empirical power laws.

#### Setup.

We train 140M and 370M `\sysname `{=latex}models under fixed FLOP and parameter budgets, varying training tokens and mean recurrence `\meanrecurrence `{=latex}using the `nanochat` setup. Additional training details and FLOP estimates can be found in `\cref{sec:scaling-laws-setup}`{=latex} and `\cref{sec:flop-estimate}`{=latex}, respectively.

**Modeling Scaling Laws of Looping**. At 140M and 370M scales, isoFLOP curves show that increasing `\meanrecurrence `{=latex}while proportionally reducing tokens yields lower validation loss than training at low recurrence (`\cref{fig:train-scaling-laws}`{=latex} \[*right*\]). Using a parabolic fit, we extract the optimal `\meanrecurrence `{=latex}and token budget at each FLOP level, finding that both follow predictable power laws (`\cref{fig:power-laws}`{=latex}) with consistent exponents ($\gamma_{\mu} \approx 0.40$, $\gamma_D \approx 0.78$). We also fit a parametric function $\widehat{\mathcal{L}}(\mu_{\text{rec}}, D) = E + X\cdot \mathbf{N}(\mu_{\text{rec}})^{-x} + Y \cdot D^{-y}$ over the effective parameterization $\mathbf{N}(\mu_{\text{rec}})$ (i.e., parameters of unrolling the looped model) and tokens $D$ (`\cref{fig:train-scaling-laws}`{=latex}, \[*left*\]; details in `\cref{sec:fit-par}`{=latex}), enabling predictable extrapolation of loss to unseen budgets. To verify, we predict the validation loss of held-out models in `\cref{sec:e2e}`{=latex}, achieving 1.3% and 0.8% error at 140M and 370M, respectively.
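
The two-stage procedure described above (a parabolic fit per isoFLOP budget, then a power law over the resulting optima) can be sketched as follows. This is an illustration on synthetic data, not our fitting code: the choice to fit the parabola in $\log \mu_{\text{rec}}$, the helper names, and the constants `TRUE_GAMMA` and `TRUE_COEF` are all assumptions made for the example.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def parabola_argmin(xs, ys):
    """Least-squares fit y = a*x^2 + b*x + c, returning the vertex -b / (2a)."""
    s1, s2, s3, s4 = (sum(x ** p for x in xs) for p in (1, 2, 3, 4))
    rhs = [sum(x * x * y for x, y in zip(xs, ys)),
           sum(x * y for x, y in zip(xs, ys)),
           sum(ys)]
    normal = [[s4, s3, s2], [s3, s2, s1], [s2, s1, len(xs)]]

    def solve_column(j):  # Cramer's rule for unknown j of the normal equations
        replaced = [row[:j] + [rhs[i]] + row[j + 1:] for i, row in enumerate(normal)]
        return det3(replaced) / det3(normal)

    a, b = solve_column(0), solve_column(1)
    return -b / (2.0 * a)

def fit_power_law(budgets, optima):
    """Fit optima = coef * budgets**gamma by linear regression in log-log space."""
    lx = [math.log(c) for c in budgets]
    ly = [math.log(v) for v in optima]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    gamma = (sum((x - mx) * (y - my) for x, y in zip(lx, ly))
             / sum((x - mx) ** 2 for x in lx))
    return math.exp(my - gamma * mx), gamma

# Synthetic isoFLOP curves (hypothetical, NOT measured data): at each budget the
# loss is an exact parabola in log(mu_rec) whose minimum follows C**0.40.
TRUE_GAMMA, TRUE_COEF = 0.40, 2.0
budgets = [1.0, 2.0, 4.0, 8.0, 16.0]  # in units of 1e18 FLOPs
mu_grid = [2, 4, 6, 8, 10, 12]
optimal_mu = []
for c in budgets:
    log_mu_star = math.log(TRUE_COEF * c ** TRUE_GAMMA)
    losses = [3.0 + 0.05 * (math.log(mu) - log_mu_star) ** 2 for mu in mu_grid]
    log_mus = [math.log(mu) for mu in mu_grid]
    optimal_mu.append(math.exp(parabola_argmin(log_mus, losses)))

coef, gamma = fit_power_law(budgets, optimal_mu)
print(f"recovered exponent gamma_mu = {gamma:.3f}")
```

Because the synthetic curves are exactly parabolic, the recovered exponent matches the one used to generate them; on real isoFLOP sweeps the fit is only approximate.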

![**Optimal `\meanrecurrence `{=latex}and Tokens Follow Predictable Power Laws.** We fit a parabola to each isoFLOP budget for both 140M and 370M `\sysname `{=latex}models, using its minimum to approximate the optimal `\meanrecurrence `{=latex}and token budget at each scale. We observe that optimal recurrence (*left plots*) and tokens (*right plots*) follow a predictable power law with similar coefficients at both scales.](figures/main/optimal_recurrence_data_combined.png){#fig:power-laws width="\\linewidth"}

```{=latex}
\makeatletter
```
```{=latex}
\def\@captype{figure}
```
```{=latex}
\makeatother
```
![image](figures/main/frontier_vs_rec1_fit.png){width="\\linewidth"}

```{=latex}
\hfill
```
::: {#tab:qual-fix-loop}
+:-----------------------------------------------------:+:------------------:+:----------------------:+:----------------------------------:+:---------------:+:---------------:+:---------------:+
|                                                       | **FLOPs**          |                        | **Optimal $\mu_{\mathrm{rec}}^*$** |                 | **Fixed-Depth** |                 |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | ($\times 10^{18}$) | $\mu_{\mathrm{rec}}^*$ | Core                               | Core Ext.       | Core            | Core Ext.       |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
| ```{=latex}                                           | $1$                | 2                      | $7.6$                              | $5.7$           | $\mathbf{7.9}$  | $\mathbf{6.1}$  |
| \rotatebox[origin=c]{90}{\textbf{140M}}               |                    |                        |                                    |                 |                 |                 |
| ```                                                   |                    |                        |                                    |                 |                 |                 |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $2$                | 2                      | $9.0$                              | $6.2$           | $\mathbf{10.5}$ | $\mathbf{6.4}$  |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $4$                | 4                      | $\mathbf{11.2}$                    | $\mathbf{8.4}$  | $10.7$          | $8.1$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $8$                | 6                      | $10.5$                             | $\mathbf{7.8}$  | $\mathbf{11.8}$ | $7.7$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $16$               | 8                      | $\mathbf{14.6}$                    | $\mathbf{9.8}$  | $13.0$          | $8.8$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $64$               | 10                     | $\mathbf{16.2}$                    | $\mathbf{11.0}$ | $15.0$          | $9.5$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
| ```{=latex}                                           | $32$               | 4                      | $15.2$                             | $10.1$          | $\mathbf{16.8}$ | $\mathbf{11.2}$ |
| \rotatebox[origin=c]{90}{\textbf{370M}}               |                    |                        |                                    |                 |                 |                 |
| ```                                                   |                    |                        |                                    |                 |                 |                 |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $64$               | 6                      | $\mathbf{18.1}$                    | $11.6$          | $\mathbf{18.1}$ | $\mathbf{12.1}$ |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+
|                                                       | $128$              | 6                      | $\mathbf{20.1}$                    | $\mathbf{13.0}$ | $18.1$          | $12.0$          |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+-----------------+-----------------+-----------------+

: **Core Scores Comparison of Looping Optimal Frontier over Purely Scaling Data.** We evaluate the downstream quality of fixed-depth (`\meanrecurrence=1`{=latex}) and looped `\sysname `{=latex}models trained with fixed parameters and FLOP budgets. At both scales, using the optimal `\meanrecurrence `{=latex}results in better Core and Core-Extended scores at extended FLOP budgets. Expanded results can be found in `\cref{sec:fixed-comparison-expanded}`{=latex}.
:::

**IsoFLOP Comparison of Looping with Fixed-Depth.** `\cref{fig:optimal-frontier}`{=latex} compares the compute-optimal looped frontier against fixed-depth `\sysname `{=latex}models (no looping) at each FLOP budget. The optimal frontier achieves a strictly lower loss, which translates to 1.2-2.0 points higher Core scores at the largest FLOP budgets (`\cref{tab:qual-fix-loop}`{=latex}).

Test-Time Scaling Laws of `\sysname`{=latex} {#sec:inf-scaling}
--------------------------------------------

We study looping as a mechanism for scaling test-time compute. We find that test-time loss follows a predictable saturating exponential decay in the number of recurrences, which can be unified with the training law of `\cref{sec:train-scaling}`{=latex}, connecting training and test-time scaling laws.

#### Setup.

We train 140M and 370M `\sysname `{=latex}models under a fixed data budget with $\mu_{\text{rec}} \in \{2, 4, 6, 8, 10, 12\}$ following our `nanochat` setup, evaluating up to $T = 24$. We additionally evaluate models from `\cref{sec:train-scaling}`{=latex} for the unified scaling laws. See `\cref{sec:scaling-laws-setup}`{=latex} for details.

**Saturation of Test-Time Compute.** While prior works observed test-time generalization in small synthetic tasks [@yangLoopedTransformersAre2023; @bansalEndtoendAlgorithmSynthesis2022], we find quality to be bounded in large-scale language modeling. Evaluating models from `\cref{sec:e2e}`{=latex} at $2\times$ `\meanrecurrence `{=latex}across all four scales (`\cref{fig:parcae-test-time-scaling}`{=latex}), we observe that gains plateau near `\meanrecurrence`{=latex}, suggesting training depth determines the test-time scaling ceiling.

![**Test-Time Scaling of `\sysname`{=latex}.** When evaluating `\sysname `{=latex}models from `\cref{tab:trans-parcae}`{=latex}, we observe test-time looping follows a predictable saturating trend, consistent across model sizes.](figures/main/test_time_scaling.png){#fig:parcae-test-time-scaling width="\\linewidth"}

![**Scaling Test-Time Compute Follows a Predictable Saturating Exponential Decay.** We plot the validation loss of models trained with different `\meanrecurrence `{=latex}as a function of test-time recurrence $T$, and find that the fitted exponential decay (solid curve for each `\meanrecurrence`{=latex}) tightly captures the test-time performance of looping. ](figures/main/unified_test_time_fixed_tokens_exp_default_tab10.png){#fig:test-time-scaling-laws width="\\linewidth"}

**Modeling Scaling Laws of Test-Time Looping.** We find that the test-time scaling curves are well-described by a saturating exponential decay of the form: $\mathcal{L}(T) = \mathcal{L}_\infty + Ze^{-z\cdot T}$. This form tightly captures the saturation dynamics for each model (`\cref{fig:test-time-scaling-laws}`{=latex}; see `\cref{sec:test-par}`{=latex} for details), achieving an average Huber loss of $2.5 \times 10^{-7}$ and $1.8 \times 10^{-7}$ for 140M and 370M, respectively.
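
To make the functional form concrete, the sketch below recovers $(\mathcal{L}_\infty, Z, z)$ in closed form from three losses measured at equally spaced $T$. This is a simplification for illustration and not the fitting procedure used for the reported Huber losses; the function names and the "true" parameter values are hypothetical.

```python
import math

def saturating_exp(T, L_inf, Z, z):
    """L(T) = L_inf + Z * exp(-z * T)."""
    return L_inf + Z * math.exp(-z * T)

def fit_from_three_points(T0, step, L0, L1, L2):
    """Recover (L_inf, Z, z) from losses at T0, T0 + step, and T0 + 2 * step.

    Successive differences of a saturating exponential shrink by the constant
    factor exp(-z * step), which pins down z; Z and L_inf then follow.
    Assumes noiseless, strictly decreasing losses.
    """
    ratio = (L1 - L2) / (L0 - L1)          # equals exp(-z * step)
    z = -math.log(ratio) / step
    Z = (L0 - L1) / (math.exp(-z * T0) * (1.0 - ratio))
    L_inf = L0 - Z * math.exp(-z * T0)
    return L_inf, Z, z

# Hypothetical curve parameters, used only to generate example losses.
true_params = (2.95, 1.5, 0.6)
losses = [saturating_exp(T, *true_params) for T in (2, 4, 6)]
L_inf, Z, z = fit_from_three_points(2, 2, *losses)
print(f"L_inf={L_inf:.3f}, Z={Z:.3f}, z={z:.3f}")
```

With noisy measurements, a robust least-squares fit over all evaluated $T$ is preferable to this three-point solution.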

**Unifying Training and Test-Time Scaling Laws.** From the learned fits in `\cref{fig:test-time-scaling-laws}`{=latex}, we observe that $\mathcal{L}_\infty$ matches the training law prediction at $T = \mu_\text{rec}$ (`\cref{sec:train-scaling}`{=latex}), and that the per-curve decay rate scales inversely with training depth as $z / \mu_\text{rec}$ (see `\cref{sec:test-par}`{=latex} for details). These observations motivate a unified scaling law that connects training and test-time compute: $$\widehat{\mathcal{L}}_\text{unified}(T \mid \mu_\text{rec}, D) = \underbrace{E + X \cdot \mathbf{N}(\mu_\text{rec})^{-x} + Y \cdot D^{-y}}_{\text{Training Law Floor } \widehat{\mathcal{L}}_\text{train}(\mu_\text{rec}, D)} + \underbrace{Z \cdot \exp\!\left(-z \cdot T \cdot \mu_\text{rec}^{-1}\right)}_{\text{Test-Time Decay}}
   \label{eq:unified}$$ where $\widehat{\mathcal{L}}_{\text{train}}(\mu_{\text{rec}}, {D})$ is the training law in `\cref{sec:train-scaling}`{=latex}, and $(Z, z)$ are two fitted parameters governing the test-time scaling. The training law sets the irreducible floor, while the decay rate $z / \mu_\text{rec}$ captures how quickly additional recurrences approach it. On held-out 140M and 370M `\sysname `{=latex}models (`\cref{sec:e2e}`{=latex}), the unified fit predicts test-time loss within 0.85-1.31% average error, dropping further to 0.1-0.17% average error when the empirical loss at $T = \mu_{\text{rec}}$ is used. This confirms that `\cref{eq:unified}`{=latex} captures saturation dynamics, with residual error attributable to the training law's $\sim1\%$ extrapolation gap (see `\cref{sec:test-par}`{=latex} for extended details).
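
As a rough illustration of how `\cref{eq:unified}`{=latex} is evaluated, the sketch below implements the two terms of the unified law. All coefficients in `COEF`, as well as the effective parameter count and token budget, are hypothetical placeholders and not our fitted values.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT the fitted values.
COEF = {"E": 2.0, "X": 50.0, "x": 0.25, "Y": 400.0, "y": 0.30, "Z": 0.8, "z": 3.0}

def training_law(n_eff, tokens, c=COEF):
    """Training-law floor: E + X * N(mu_rec)^(-x) + Y * D^(-y)."""
    return c["E"] + c["X"] * n_eff ** (-c["x"]) + c["Y"] * tokens ** (-c["y"])

def unified_law(T, mu_rec, n_eff, tokens, c=COEF):
    """Floor plus the test-time decay term Z * exp(-z * T / mu_rec)."""
    return training_law(n_eff, tokens, c) + c["Z"] * math.exp(-c["z"] * T / mu_rec)

# Hypothetical effective parameters N(mu_rec), training tokens D, and training depth.
n_eff, tokens, mu_rec = 4.0e8, 1.0e10, 8
floor = training_law(n_eff, tokens)
curve = [unified_law(T, mu_rec, n_eff, tokens) for T in range(1, 25)]
print(f"floor={floor:.4f}, loss at T=8: {curve[7]:.4f}, loss at T=24: {curve[-1]:.4f}")
```

Because the exponent depends on $T$ only through $T / \mu_\text{rec}$, a model trained at twice the depth needs twice as many test-time steps to reach the same distance from its floor.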

Discussion and Future Work {#sec:lim-fut}
==========================

In this section, we briefly discuss limitations and future directions.

#### Looped Architectures.

While several design choices around looped architectures have been guided by small-scale empirical results, a deep investigation of loop-unit placement [@jacobs2026blockrecurrentdynamicsvisiontransformers], composition (e.g., number of parameters in the recurrent unit and usage of different architectures), and extreme looping (e.g., increasing mean recurrence to deeper depths) at a larger scale is warranted. Within our dynamical systems framework, the use of different discretizations, full-rank parameterizations, and recurrent update rules warrants investigation to enable recurrence at larger depths.

#### Scaling.

While we find Parcae to induce predictable, optimal scaling laws for layer looping, our observations are limited to small architectures. It remains to be seen if Parcae compares favorably when scaling these observations to large FLOP budgets and parameterizations. We are also interested in the interplay of parameters, data, and recurrence as orthogonal axes, and how they should be efficiently scaled together. Finally, one limitation of looping is that, as `\meanrecurrence `{=latex}increases, the number of test-time steps required to achieve equivalent quality increases. An investigation of techniques that maintain quality with fewer inference time steps is an interesting future direction.

Conclusion {#sec:conclusion}
==========

In this work, we study the stability of looped models through a dynamical systems framework and propose **`\sysname`{=latex}**, a stable looped architecture that prevents residual explosion by constraining the spectral norm of the injection parameters. `\sysname `{=latex}outperforms data- and parameter-matched prior looped models and baseline Transformers, matching downstream quality of models up to twice its size. We further establish scaling laws for looping: FLOP-optimal training increases looping and data in tandem following predictable power laws, while test-time looping follows a saturating exponential decay law, yielding a unified scaling law connecting training and inference compute.

```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\appendix
```
```{=latex}
\newpage
```
Glossary
========

We include a brief glossary of both notations and common metrics used to define and analyze looped architectures.

Notation {#sec:notation}
--------

::: {#tab:placeholder}
  Notation                     Description                                                                
  ---------------------------- -------------------------------------------------------------------------- --
  $d$                          Embedding **dimension** of the model                                       
  $t$                          Discrete temporal **state** axis of `\recurrent `{=latex}on $\mathbb{N}$   
  $b$                          Global batch size used during pretraining                                  
  `\prelude `{=latex}          Initial **prelude** block of a recurrent architecture                      
  `\recurrent `{=latex}        Middle **recurrent** block of a recurrent architecture                     
  `\coda `{=latex}             Final **coda** block of a recurrent architecture                           
  $\A$                         The linear continuous state transition matrix                              
  $\B$                         The linear continuous state injection matrix                               
  $\C$                         The linear state output matrix                                             
  $\dt$                        Learnable discrete parameter for decay, discretizing our model             
  $s$                          Input sequence to a model                                                  
  $e$                          Output embedding of the prelude block `\prelude `{=latex}                  
  $h$                          Hidden embedding of the recurrent block `\recurrent `{=latex}              
  `\meanrecurrence `{=latex}   Mean recurrent forward propagation steps during pre-training               
  `\meanbackward `{=latex}     Mean recurrent backward propagation steps during pre-training              
  $n$                          Sampled number of recurrent steps with no gradient updates                 
  $k$                          Sampled number of recurrent steps with gradient updates                    
  $T$                          Sampled or fixed number of recurrent steps actually taken                  
  $\Lambda$                    Distribution that recurrences are sampled from during training             

  : Glossary of notation and terminology. (*Top*) Frequently used dimensions for tensors. (*Middle*) Definition of Parcae blocks. (*Bottom*) Tensors and distributions used to express recurrent-depth models.
:::

Common Metrics {#sec:metrics}
--------------

-   Recurrent Residual Metric: $||h_T - h_{T-1}||_2$, where $T \sim \Lambda$. This metric tells us how much we jump around at the final recurrence. Overly small jumps indicate that `\recurrent `{=latex}isn't learning anything meaningful, while overly large jumps indicate `\recurrent `{=latex}is suffering from state explosion or is unable to learn fixed-point dynamics.

-   Recurrent State Norm: $||h_T||$, where $T \sim \Lambda$. In general, we don't want an overly large recurrent state norm as it creates numerical instabilities and leads to overly large gradients.
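
Both metrics are straightforward to compute from a rollout of hidden states. The sketch below uses a toy contractive recurrence as a stand-in for `\recurrent`{=latex}; the helper names and the toy update are assumptions for illustration, and the state norm is taken to be the L2 norm.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def recurrent_metrics(states):
    """Return (residual, state_norm) for hidden states h_0, ..., h_T.

    residual   = ||h_T - h_{T-1}||_2  (how far the final step moved the state)
    state_norm = ||h_T||_2            (magnitude of the final state)
    """
    h_last, h_prev = states[-1], states[-2]
    residual = l2_norm([a - b for a, b in zip(h_last, h_prev)])
    return residual, l2_norm(h_last)

# Toy contractive recurrence h_{t+1} = 0.5 * h_t + e (a stand-in for the recurrent block).
e = [1.0, -2.0]
states = [[0.0, 0.0]]
for _ in range(8):  # T = 8 recurrent steps
    states.append([0.5 * h + x for h, x in zip(states[-1], e)])

residual, state_norm = recurrent_metrics(states)
print(f"residual={residual:.5f}, state_norm={state_norm:.5f}")
```

In this toy case the state converges toward the fixed point $2e$, so the residual shrinks with each step while the state norm stays bounded.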

```{=latex}
\newpage
```
Extended Literature Review {#sec:lit-review}
==========================

Looping over model depth has been well explored by prior work, with a large body of work studying looping within general language modeling [@dehghaniUniversalTransformers2019; @zhuScalingLatentReasoning2025; @geiping_scaling_2025; @mcleish_retrofitted_recurrence; @baeMixtureofRecursionsLearningDynamic2025] or small-scale algorithmic problems [@avi_learn_algorithm; @yangLoopedTransformersAre2023; @bansalEndtoendAlgorithmSynthesis2022; @wangHierarchicalReasoningModel2025b; @jolicoeur-martineauLessMoreRecursive2025]. Within looped architectures, training paradigms can be roughly split between architectures with explicit halting mechanisms [@dehghaniUniversalTransformers2019; @zhuScalingLatentReasoning2025; @baeMixtureofRecursionsLearningDynamic2025; @jolicoeur-martineauLessMoreRecursive2025; @wangHierarchicalReasoningModel2025b] and those with implicit halting mechanisms [@geiping_scaling_2025; @mcleish_retrofitted_recurrence; @LoopFormerElasticDepthLooped2025; @xuExpressivePowerLooped2025]. Looped architectures trained with an explicit halting mechanism use specialized architectures to predict when tokens should exit early, preventing additional computation updates on their recurrent stream [@wangHierarchicalReasoningModel2025b; @jolicoeur-martineauLessMoreRecursive2025; @baeMixtureofRecursionsLearningDynamic2025; @dehghaniUniversalTransformers2019; @elbayadDepthAdaptiveTransformer2020]. Specifically, @wangHierarchicalReasoningModel2025b [@jolicoeur-martineauLessMoreRecursive2025] formalize *adaptive-computation-time*, a method that utilizes Q-learning as a means to determine convergence. Similarly, works such as @baeMixtureofRecursionsLearningDynamic2025 define an architecture that uses light-weight routers to assign dynamic recursion depths, while @zhuScalingLatentReasoning2025 uses a prediction head to dynamically define a probability of exiting after recurrent passes.
A majority of these approaches draw on methods of layer skipping [@elhoushiLayerSkipEnablingEarly2024; @raposoMixtureofDepthsDynamicallyAllocating2024]; however, these methods differ from using a shared parameterization for a recurrent block.

Alternatively, looped architectures with an implicit halting mechanism, such as @geiping_scaling_2025 [@mcleish_retrofitted_recurrence; @avi_learn_algorithm; @bansalEndtoendAlgorithmSynthesis2022], train models with stochastically sampled recurrent steps during pretraining, and then use the KL-divergence between two successive steps to decide when to exit from the recurrent unit early. Finally, @LoopFormerElasticDepthLooped2025 ignores adaptive early exiting altogether, instead pretraining a recurrent unit on a static number of recurrences and enforcing a consistency loss on intermediate recurrences. Our work focuses solely on implicit recurrent depth models [@geiping_scaling_2025; @mcleish_retrofitted_recurrence], which are derived from prior initial work [@avi_learn_algorithm; @bansalEndtoendAlgorithmSynthesis2022].

Beyond training paradigms, there are several differing architectural design choices for looped models [@geiping_scaling_2025; @bansalEndtoendAlgorithmSynthesis2022; @saunshiReasoningLatentThoughts2025b]. In simple looped architectures that only place a single recurrent unit, the placement of the looped unit is non-trivial, with certain works looping over all layers [@dehghaniUniversalTransformers2019; @Csordas2024MoEUTMU; @Bae2024RelaxedRT]. Alternatively, @saunshiReasoningLatentThoughts2025b find middle-looping recurrent units are the most effective in comparison to other formulations, such as pre-looping and post-looping, which loop the beginning and end of the model. The effectiveness of middle-looping is consistent with the initial work in synthetic problems by @bansalEndtoendAlgorithmSynthesis2022 [@avi_learn_algorithm] and with the architecture choices of @geiping_scaling_2025 [@mcleish_retrofitted_recurrence] in large-scale language models before training. Within middle-looping architectures, the number of layers within each unit is mostly chosen ad hoc; however, when bootstrapping from a baseline model, @koishekenov2025encodethinkdecodescaling found that placement can be optimized by algorithmically selecting which layers of the model to loop.

While these prior formulations of looping focus on a single recurrent block, hierarchical [@wangHierarchicalReasoningModel2025b; @jolicoeur-martineauLessMoreRecursive2025], parallel [@wu2025parallellooptransformerefficient], and multi-step [@jacobs2026blockrecurrentdynamicsvisiontransformers] formulations of layer looping exist. Furthermore, while not all under the same architectural paradigm, layer looping has been explored in multiple domains (e.g., language [@geiping_scaling_2025; @mcleish_retrofitted_recurrence], images [@jacobs2026blockrecurrentdynamicsvisiontransformers], multi-modal systems [@alabdulmohsin2025recursiveinferencescalingwinning], synthetic algorithmic problems [@avi_learn_algorithm; @bansalEndtoendAlgorithmSynthesis2022; @yangLoopedTransformersAre2023]), with the choice of looping style and model architecture design changing based on the specific modality. Where layer looping is introduced, how it is affected by individual modalities, and efficient, FLOP-optimal implementations of layer looping remain open questions.

Finally, layer looping is often closely tied to deep equilibrium (DEQ) models [@bai2019deepequilibriummodels; @bai2022neural], due to the fixed-point behavior often learned in recurrence. DEQs find equilibrium points via root-finding to approximate an *infinite depth* network. Unlike looped architectures trained with truncated backpropagation, DEQ models use implicit differentiation through *infinite depth*, which keeps memory constant and independent of the effective depth that the root-finding algorithm needs to solve for the fixed point. While the use of implicit differentiation in DEQs enables more efficient training, we focus on work that performs explicit backpropagation rollouts [@geiping_scaling_2025; @mcleish_retrofitted_recurrence; @bansalEndtoendAlgorithmSynthesis2022; @yang2024loopedtransformersbetterlearning]. Within looped architectures, @geiping_scaling_2025 [@mcleish_retrofitted_recurrence] adopt path independence from equilibrium models [@anilPathIndependentEquilibrium] to justify their choice of $h_0$ initialization.

Derivation of Instability Conditions of Prior Methods {#sec:derivation-instabilty}
=====================================================

Recall from `\cref{sec:rdm-basics}`{=latex} that `\recurrent `{=latex}denotes the full recurrent update $h_{t+1} = \mathcal{R}(h_t,e)$, encompassing all transformer operations, including residual connections. A common interpretation views the residual stream as a communication channel where $h_T$ is the sum of the relative outputs of all previous layers and the original embedding [@olsson2022incontextlearninginductionheads]. Applying this to looped models, let $\overline{\mathcal{R}}$ denote the *relative contribution* of the nonlinear operations (i.e., $\overline{\mathcal{R}}(W_1 h_t + W_2 e) = \mathcal{R}(W_1 h_t + W_2 e) - (W_1 h_t + W_2 e)$). This gives the recurrent update rule $h_{t+1} = W_1 h_t + W_2 e + \overline{\mathcal{R}}(h_t, e)$, where we write $\overline{\mathcal{R}}(h_t, e) = \overline{\mathcal{R}}(W_1 h_t + W_2 e)$ for brevity. Although $\overline{\mathcal{R}}$ is highly non-linear, the recurrent update can be exactly formulated as a *non-linear time-variant dynamical system* of the form: $h_{t} = \dA h_{t-1} + \dB e + \overline{\mathcal{R}}(h_{t-1}, e), ~ y_t = \C h_t,$ where $\dA = W_1$, $\dB = W_2$, and $\C \in \mathbb{R}^{d_c \times d_h}$ decouples the `\coda `{=latex}and `\recurrent `{=latex}embedding dimensions (i.e., $p=\mathcal{C}(\C(h_T))$).

Using the *relative contribution* representation of looped models above, we can recast the prior forms of input injection discussed in `\cref{sec:rdm-basics}`{=latex} in a form similar to our framework. Specifically, for Pre-Norm looped models using addition as injection [@yangLoopedTransformersAre2023], the update rule can be written as $h_{t+1} = Ih_t + Ie + \overline{\mathcal{R}}(Ih_t + Ie)$. When linearized (i.e., dropping the nonlinear $\overline{\mathcal{R}}$ block), $\dA = I$, meaning that the model is a *marginally stable* system, as all eigenvalues are 1. Alternatively, the update rule for Pre-Norm looped models using concatenation as injection [@geiping_scaling_2025] can be rewritten as $h_{t+1} = W[h_t;e] + \overline{\mathcal{R}}(W[h_t;e]) = W_1 h_t + W_2 e + \overline{\mathcal{R}}(W_1 h_t + W_2 e)$. Here $\dA = W_1$ is unconstrained, so its spectral norm can grow beyond 1 and cause the state to explode if not carefully controlled during training.
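
The three regimes of the linearized system can be illustrated with a scalar stand-in, where a single number `a_bar` plays the role of an eigenvalue of $\dA$. This is a toy sketch under that assumption: it drops the nonlinear $\overline{\mathcal{R}}$ block entirely, and the function name and constants are ours.

```python
def linearized_rollout(a_bar, b_bar, e, steps, h0=0.0):
    """Iterate h_{t+1} = a_bar * h_t + b_bar * e, with the nonlinear block dropped."""
    h, trajectory = h0, []
    for _ in range(steps):
        h = a_bar * h + b_bar * e
        trajectory.append(h)
    return trajectory

# |a_bar| < 1: the state converges to the fixed point b_bar * e / (1 - a_bar).
stable = linearized_rollout(0.9, 1.0, 1.0, steps=200)
# a_bar = 1 (additive injection, A = I): marginally stable, the state grows linearly.
marginal = linearized_rollout(1.0, 1.0, 1.0, steps=200)
# |a_bar| > 1 (possible when W_1 is unconstrained): the state grows geometrically.
unstable = linearized_rollout(1.1, 1.0, 1.0, steps=200)

print(f"stable: {stable[-1]:.3f}, marginal: {marginal[-1]:.1f}, unstable: {unstable[-1]:.3e}")
```

In the full model the nonlinear block also contributes to the dynamics, so this only illustrates why the magnitude of $\dA$ matters, not how a real looped Transformer behaves.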

FLOP Estimate of Parcae {#sec:flop-estimate}
=======================

In standard, fixed-depth architectures, a common means to approximate the number of FLOPs used in training is $C = 6ND$ from @kaplan2020scalinglawsneurallanguage, where $N$ is the number of parameters and $D$ is the number of tokens used in training. However, looped architectures differ from traditional models in that they have a notion of *effective parameters* $\hat N$ (e.g., a single layer with $N$ parameters that is looped ten times has an effective parameterization of $\hat N = 10 N$). Furthermore, as Parcae uses truncated backpropagation through depth, the effective parameters can be decoupled into two types: $\hat N_1$, which are effective parameters that ***are not backpropagated*** through, and $\hat N_2$, which are effective parameters that ***are backpropagated*** through. Thus, following @kaplan2020scalinglawsneurallanguage, we can formulate the effective FLOPs of Parcae as $C = (2\hat N_1 + 6 \hat N_2)D$, which matches the setup of @mcleish_retrofitted_recurrence. Like @mcleish_retrofitted_recurrence, we exclude embedding parameters from $\hat N$; however, we do include unembedding parameters in $\hat N$, similar to @nanochat. Lastly, we include an estimate for attention FLOPs following @chowdhery2022palmscalinglanguagemodeling [@nanochat].
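
The estimate $C = (2\hat N_1 + 6 \hat N_2)D$ can be sketched in a few lines. The helper `effective_parameters` below is hypothetical: it assumes the prelude and coda are each applied once with gradients, and that the recurrent block is applied $\mu_{\text{rec}}$ times of which the last $\mu_{\text{bwd}}$ are backpropagated through. The parameter counts are made-up values, and the attention FLOP term is omitted.

```python
def effective_parameters(n_prelude, n_recurrent, n_coda, mu_rec, mu_bwd):
    """Hypothetical split of effective parameters into (N1_hat, N2_hat).

    N1_hat: loop applications that only run forward (no backpropagation).
    N2_hat: everything that is backpropagated through.
    """
    n_no_grad = (mu_rec - mu_bwd) * n_recurrent
    n_grad = n_prelude + mu_bwd * n_recurrent + n_coda
    return n_no_grad, n_grad


def parcae_training_flops(n_no_grad, n_grad, tokens):
    """C = (2 * N1_hat + 6 * N2_hat) * D, excluding the attention term."""
    return (2 * n_no_grad + 6 * n_grad) * tokens


# Made-up configuration: non-embedding parameter counts and token budget.
n1_hat, n2_hat = effective_parameters(
    n_prelude=25_000_000,
    n_recurrent=50_000_000,
    n_coda=25_000_000,  # would include the unembedding parameters
    mu_rec=8,
    mu_bwd=4,
)
flops = parcae_training_flops(n1_hat, n2_hat, tokens=10_000_000_000)
print(n1_hat, n2_hat, flops)
```

With $\hat N_1 = 0$ the formula reduces to the familiar $C = 6ND$ applied to the unrolled model.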

Parcae Forward Pass and Training Algorithms {#sec:algorithm}
===========================================

A full forward pass of Parcae, combining our dynamical systems blocks $\A, \B, \C, \dt$ with the looped-model blocks `\prelude`{=latex}, `\recurrent`{=latex}, and `\coda`{=latex}, can be found in `\cref{alg:parcae}`{=latex}.

```{=latex}
\begin{algorithm}[!ht]\caption{Parcae Forward Pass}
\label{alg:parcae}
\begin{algorithmic}[1]
\Require Input sequence $s \in V^n$ and recurrent steps $T$.
\State $e \gets \text{LN}(\mathcal{P}(s))$ 
\State $h_0 \sim \mathcal{N}(0, \sigma^2 I_{n \times d})$ 
\State $\overline{\A}, \overline{\B} \gets \textsc{Discretize}(\A, \B, \dt)$ 
\For{$t = 1$ to $T$}
    \State $h_t \gets \overline{\A} h_{t-1} + \overline{\B} e + \overline{\mathcal{R}}(h_{t-1}, e)$ 
\EndFor
\State \textbf{return} $\mathcal{C}(\C h_T)$
\end{algorithmic}
\end{algorithm}
```
Algorithm `\ref{alg:per-sequence-depth}`{=latex} shows how we sample per-sequence depths during Parcae training while maintaining compute efficiency. We implement per-sequence depth sampling by taking the maximum depth within a batch and performing no state updates at the *beginning* of the recurrent computation for sequences with a smaller sampled depth. This allows for batched processing of different depths while maintaining efficient gradient flow.

```{=latex}
\begin{algorithm}[!ht]\caption{Efficient Per-Sequence Stochastic Depth Training}
\label{alg:per-sequence-depth}
\begin{algorithmic}[1]
\Require Batch of sequences $\{s_i\}_{i=1}^{B}$, means $\mu_{\text{rec}}, \mu_{\text{bwd}}$, and sampling distribution $\Lambda$
\State $\bm{e}^{(i)} \gets \mathcal{P}(s_i)$ for all $i$ \hfill  \textcolor{magenta}{$\triangleright$ \texttt{embed sequences}}
\State Sample $T^{(i)} \sim \Lambda(\mu_{\text{rec}})$ for each $i \in [B]$
\State $T_{\max} \gets \max_i T^{(i)}$, \quad $\tau^{(i)} \gets T_{\max} - T^{(i)}$
\State $\bm{h}_0^{(i)} \sim \mathcal{N}(0, \sigma^2 \bm{I})$ for all $i$
\State $\overline{\bm{A}}, \overline{\bm{B}} \gets \textsc{Discretize}(\bm{A}, \bm{B}, \dt)$
\For{$t = 0, \ldots, T_{\max} - 1$}
    \State \textbf{for all} $i$ \textbf{where} $t < \tau^{(i)}$: \quad $\bm{h}_{t+1}^{(i)} \gets \bm{h}_t^{(i)}$ \hfill  \textcolor{magenta}{$\triangleright$ \texttt{no state update}}
    \State \textbf{for all} $i$ \textbf{where} $\tau^{(i)} \leq t < T_{\max} - \mu_{\text{bwd}}$: \hfill  \textcolor{magenta}{$\triangleright$ \texttt{without gradients}}
    \State \quad $\bm{h}_{t+1}^{(i)} \gets \overline{\bm{A}} \bm{h}_t^{(i)} + \overline{\bm{B}} \bm{e}^{(i)} + \overline{\mathcal{R}}(\bm{h}_t^{(i)}, \bm{e}^{(i)})$
    \State \textbf{for all} $i$ \textbf{where} $t \geq \max(\tau^{(i)}, T_{\max} - \mu_{\text{bwd}})$: \hfill  \textcolor{magenta}{$\triangleright$ \texttt{with gradients}}
    \State \quad $\bm{h}_{t+1}^{(i)} \gets \overline{\bm{A}} \bm{h}_t^{(i)} + \overline{\bm{B}} \bm{e}^{(i)} + \overline{\mathcal{R}}(\bm{h}_t^{(i)}, \bm{e}^{(i)})$
\EndFor
\State \textbf{return} $\{\mathcal{C}(\bm{C} \bm{h}_{T_{\max}}^{(i)})\}_{i=1}^{B}$
\end{algorithmic}
\end{algorithm}
```
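The batching logic of the per-sequence algorithm can be sketched without any model code by labelling what each sequence does at each loop step. The function name `depth_schedule` and the example depths are hypothetical; the sketch only mirrors the index arithmetic ($T_{\max}$, $\tau^{(i)}$, and the last $\mu_{\text{bwd}}$ steps carrying gradients), with the no-update case taking priority when a sampled depth is smaller than $\mu_{\text{bwd}}$.

```python
def depth_schedule(depths, mu_bwd):
    """Label each (step, sequence) pair as 'skip', 'no_grad', or 'grad'.

    depths: sampled per-sequence recurrences T^(i).
    Returns a list with one row per loop step t = 0, ..., T_max - 1.
    """
    t_max = max(depths)
    offsets = [t_max - d for d in depths]  # tau^(i): leading steps to skip
    schedule = []
    for t in range(t_max):
        row = []
        for tau in offsets:
            if t < tau:
                row.append("skip")      # no state update yet
            elif t < t_max - mu_bwd:
                row.append("no_grad")   # update without gradients
            else:
                row.append("grad")      # update with gradients
        schedule.append(row)
    return schedule


depths = [2, 5, 3]  # hypothetical sampled depths for a batch of three sequences
schedule = depth_schedule(depths, mu_bwd=2)

# Number of real state updates each sequence receives.
updates = [sum(row[i] != "skip" for row in schedule) for i in range(len(depths))]
for t, row in enumerate(schedule):
    print(t, row)
```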
Additional Stability Ablations {#sec:stability-ablations}
==============================

We include all training curves for our hyperparameter sweep experiments in `\cref{sec:hyperparameters}`{=latex}. We conduct a learning rate sweep over $\{ 2 \times 10^{-4}, 4 \times 10^{-4}, 6 \times 10^{-4}, 8 \times 10^{-4}, 1 \times 10^{-3}\}$, observing that Parcae trains more stably than both baseline Pre-Norm RDMs and residual normalized RDMs. The training curves and the accompanying recurrent state norm can be observed in `\cref{fig:all-instability-curves}`{=latex}.

![Training instability of recurrent depth models across different learning rates. We show both training losses and recurrent state norm to understand divergence and state explosion.](figures/appendix/instability/instability_lr_2e-4.png "fig:"){#fig:all-instability-curves width="\\linewidth"} ![Training instability of recurrent depth models across different learning rates. We show both training losses and recurrent state norm to understand divergence and state explosion.](figures/appendix/instability/instability_lr_4e-4.png "fig:"){#fig:all-instability-curves width="\\linewidth"} ![Training instability of recurrent depth models across different learning rates. We show both training losses and recurrent state norm to understand divergence and state explosion.](figures/appendix/instability/instability_lr_6e-4.png "fig:"){#fig:all-instability-curves width="\\linewidth"} ![Training instability of recurrent depth models across different learning rates. We show both training losses and recurrent state norm to understand divergence and state explosion.](figures/appendix/instability/instability_lr_8e-4.png "fig:"){#fig:all-instability-curves width="\\linewidth"} ![Training instability of recurrent depth models across different learning rates. We show both training losses and recurrent state norm to understand divergence and state explosion.](figures/appendix/instability/instability_lr_1e-3.png "fig:"){#fig:all-instability-curves width="\\linewidth"}

```{=latex}
\newpage
```
Per-sequence Sampling Reduces Loss Spikes {#sec:loss-spikes}
=========================================

When running our per-sequence sampling experiments, we observed that per-sequence sampling helps eliminate loss spikes during training. Specifically, in `\cref{fig:per-sample-training}`{=latex}, for our 350M parameter Parcae models, per-micro-batch sampling exhibits several loss spikes throughout training, while per-sequence sampling exhibits none. We can observe from `\cref{fig:residual-state-norm}`{=latex} that these spikes stem directly from overly large recurrent residual jumps at the final recurrence, implying that the model is not learning to converge to a steady-state fixed-point solution. Per-sequence depth sampling thus appears to provide a better estimate of our training objective, enabling convergent fixed-point behavior and preventing loss spikes during training. The direct benefit of this can be observed in `\cref{tab:training-res}`{=latex}, where per-sequence sampling significantly improves the validation perplexity of looped models, especially at low test-time recurrences. Finally, we note that per-sequence sampling adds a minimal amount of training overhead, increasing total wall clock time for pretraining by 1.8%, which we believe can be further optimized away with a cleaner implementation.

![Training curves showing per-sequence sampling effectively eliminates loss spikes in training over per-micro-batch sampling.](figures/appendix/stochastic_depth_350m_training_loss.png){#fig:per-sample-training width="\\linewidth"}

![Comparison of recurrent residual and state norm metrics (defined in `\cref{sec:notation}`{=latex}), which show that per-sequence sampling enables stronger fixed point behavior in training.](figures/appendix/stochastic_depth_350m_norm_residual.png){#fig:residual-state-norm width="\\linewidth"}

```{=latex}
\small
```
```{=latex}
\renewcommand{\arraystretch}{1.2}
```
::: {#tab:training-res}
                                                                     **Method**     **T=1**     **T=4**     **T=8**    **T=16**
  ---------------------------------------------------------------- -------------- ----------- ----------- ----------- -----------
  `\multirow{2}{*}`{=latex}\[0pt\]`\rotatebox{90}{100M}`{=latex}     Per-Batch      300.32       36.75       16.65       13.81
                                                                    Per-Sequence   **70.47**   **17.15**   **14.08**   **13.59**
  `\multirow{2}{*}`{=latex}\[0pt\]`\rotatebox{90}{350M}`{=latex}     Per-Batch      167.61       12.80       10.40       10.24
                                                                    Per-Sequence   **17.92**   **10.49**   **10.09**   **10.11**

  : **Per-Microbatch vs. Per-Sequence Comparison**. We compare perplexity of Parcae models trained with per-microbatch sampling [@geiping_scaling_2025] and per-sequence sampling, using different recurrences ($T$) on a held-out validation set. **Bolded** results indicate **best** at each scale.
:::

```{=latex}
\clearpage
```
Sampling of Truncated Recurrence {#sec:sampling-truncated-recurrence}
================================

```{=latex}
\begin{algorithm}[H]\caption{\textsc{\citet{geiping_scaling_2025}}}
\label{alg:poisson-fill}
\begin{algorithmic}[1]
\State \textbf{Input:} $\mu_{\text{rec}}$, $\mu_{\text{bwd}}, \Lambda$, $e$
\State $n \sim \Lambda(\mu_{\text{rec}} - \mu_{\text{bwd}})$
\State $k \gets \mu_{\text{bwd}}$
\State $T = n + k$
\State $h_0 \gets \mathcal{N}(0, \sigma^2 I)$
\For{$t = 1$ \textbf{to} $T$}
    \If{$t \leq n$}
        \State $h_t \gets \mathcal{R}(h_{t-1}, e)$ \textbf{w/o grad}
    \Else
        \State $h_t \gets \mathcal{R}(h_{t-1}, e)$ \textbf{w/ grad}
    \EndIf
\EndFor
\State \Return $h_T$
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\hfill
```
```{=latex}
\begin{algorithm}[H]\caption{\textsc{Correction (Ours)}}
\label{alg:poisson-trunc-full}
\begin{algorithmic}[1]
\State \textbf{Input:} $\mu_{\text{rec}}$, $\mu_{\text{bwd}}$, $\Lambda$, $e$
\State $T \sim \Lambda(\mu_{\text{rec}})$
\State $n \gets \max(T - \mu_{\text{bwd}}, 0)$
\State $k \gets \min(T, \mu_{\text{bwd}})$
\State $h_0 \gets \mathcal{N}(0, \sigma^2 I)$
\For{$t = 1$ \textbf{to} $T$}
    \If{$t \leq n$}
        \State $h_t \gets \mathcal{R}(h_{t-1}, e)$ \textbf{w/o grad}
    \Else
        \State $h_t \gets \mathcal{R}(h_{t-1}, e)$ \textbf{w/ grad}
    \EndIf
\EndFor
\State \Return $h_T$
\end{algorithmic}
\end{algorithm}
```
![A distributional mismatch can be observed from the recurrent sampling method of [@geiping_scaling_2025]. Specifically, if our desired pre-training distribution for `\meanrecurrence `{=latex}is a Poisson distribution, the distribution of the total recurrence $T$ under [@geiping_scaling_2025] is truncated based on `\meanbackward`{=latex}. In contrast, our sampling method decouples the effects of `\meanbackward `{=latex}on $\Lambda$, allowing the recurrent distribution to be faithfully sampled from.](figures/appendix/distribution-mismatch.png){#fig:method-distribution-mismatch width="\\linewidth"}

In our very initial experiments, we observed that a small change to the sampling algorithm of [@geiping_scaling_2025], which stems from [@avi_learn_algorithm], enhances the training of Parcae[^3]. Given an arbitrary distribution $\Lambda$ to sample from and two hyperparameters, `\meanrecurrence `{=latex}(the desired mean number of steps of the recurrent block in pre-training) and `\meanbackward `{=latex}(the desired mean number of back-propagation steps in pre-training), we observe that the sampling method of [@geiping_scaling_2025] has a distributional mismatch. Their method follows `\cref{alg:poisson-fill}`{=latex} exactly, with a Poisson log-normal distribution of the form $$\begin{aligned}
    \tau \sim \mathcal{N}(\log(\mu_{\text{rec}} - \mu_{\text{bwd}}) - \frac{1}{2}\sigma^2, \sigma) \qquad n \sim \mathcal{P}(e^\tau) + 1 \qquad k \gets \mu_{\text{bwd}}\end{aligned}$$ where $\sigma = \frac{1}{2}$ and $\mathcal{P}$ here denotes a Poisson distribution (not the prelude). To maintain a fixed compute and memory budget, [@geiping_scaling_2025] sets $k$ to `\meanbackward`{=latex}; however, this choice significantly impacts the underlying recurrent distribution, truncating and compressing the distribution of recurrences actually observed during pre-training. We propose a minor algorithmic fix to the sampling method, which can be observed in `\cref{alg:poisson-trunc-full}`{=latex}. While minor, the fix has a visible impact on the sampled distribution (see `\cref{fig:method-distribution-mismatch}`{=latex}), which improves generalization to other recurrences.
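
The difference between the two sampling schemes can be sketched with the standard library alone. The function names are hypothetical, $\Lambda$ is taken to be a plain Poisson distribution in the corrected scheme (as in the ablation setting), and the values $\mu_{\text{rec}}=8$, $\mu_{\text{bwd}}=4$ are chosen only so that $\log(\mu_{\text{rec}} - \mu_{\text{bwd}})$ is defined in the fill scheme.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's method for a Poisson draw; adequate for the small rates used here."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def sample_fill(mu_rec, mu_bwd, rng, sigma=0.5):
    """Fill scheme: sample the no-grad steps n, then always add k = mu_bwd grad steps."""
    tau = rng.gauss(math.log(mu_rec - mu_bwd) - 0.5 * sigma**2, sigma)
    n = sample_poisson(math.exp(tau), rng) + 1
    k = mu_bwd
    return n, k


def sample_corrected(mu_rec, mu_bwd, rng):
    """Corrected scheme: sample the total T first, then split it into n and k."""
    total = sample_poisson(mu_rec, rng)
    n = max(total - mu_bwd, 0)
    k = min(total, mu_bwd)
    return n, k


rng = random.Random(0)
mu_rec, mu_bwd = 8, 4  # hypothetical values
fill_totals = [sum(sample_fill(mu_rec, mu_bwd, rng)) for _ in range(2000)]
corrected_totals = [sum(sample_corrected(mu_rec, mu_bwd, rng)) for _ in range(2000)]

# The fill scheme can never produce a total recurrence below mu_bwd + 1.
print(min(fill_totals), min(corrected_totals))
```

In the fill scheme the total $T = n + k$ is bounded below by $\mu_{\text{bwd}} + 1$, whereas the corrected scheme lets $T$ range over the full support of $\Lambda$ and only caps the number of backpropagated steps.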

To verify our change, we pretrain several small Parcae models on 10 billion tokens to ablate on our design choice. Specifically, we set $\mu_{\text{rec}} = \mu_{\text{bwd}} = 8$ and use $\Lambda \sim \text{Poisson}$, and use fixed architecture, hyperparameters, and data stream. We train three models: a baseline Parcae model that performs full backpropagation through recurrences, a Parcae model following `\cref{alg:poisson-fill}`{=latex} by [@geiping_scaling_2025], and a Parcae model following `\cref{alg:poisson-trunc-full}`{=latex}. The results of this ablation can be found in `\cref{fig:sampling-mismatch-results}`{=latex}.

![Training and validation curves of three 100 million parameter Parcae models pretrained on 10 billion tokens, comparing different truncated back-propagation methods (baseline is a model with no back-propagation truncation). Each model has identical architecture and hyperparameters, with `\meanrecurrence `{=latex}and `\meanbackward `{=latex}both being set to eight, all using $\Lambda \sim \text{Poisson}$. It can be observed that even though each model has similar training loss and validation loss when using $T=8$, our implementation more faithfully follows the validation loss of full back-propagation. Specifically for $T=4$, our implementation significantly improves validation loss compared to the sampling method of [@geiping_scaling_2025].](figures/appendix/loss_curves_publication.png){#fig:sampling-mismatch-results width="\\textwidth"}

From `\cref{fig:sampling-mismatch-results}`{=latex}, observe that training trajectories and validation loss at $T=\mu_{\text{rec}}=8$ are almost identical for each run; however, our method significantly improves the validation loss for $T \in [4,16,64]$. Simply put, the constricting effect of the sampling method of [@geiping_scaling_2025], observed in `\cref{fig:method-distribution-mismatch}`{=latex}, reduces the effective range of recurrences seen in pretraining, hurting the validation loss when using more or fewer recurrences at test-time.

```{=latex}
\newpage
```
Selecting $\mu_\text{rec}$ and $\mu_{\text{bwd}}$ {#sec:scaling-of-truncated-backpropigation}
=================================================

![Validation curves of six different recurrent depth models, pretrained on 10 billion tokens, with a fixed architecture and hyperparameters. Each model is pretrained with a fixed `\meanbackward `{=latex}of 8 and varying `\meanrecurrence `{=latex}in $[4,8,14,20,26,32]$. The key observation is that scaling up `\meanrecurrence `{=latex}while keeping `\meanbackward `{=latex}fixed results in models that perform worse than if just pretrained on `\meanrecurrence `{=latex}of eight.](figures/appendix/mean_depth_loss_curves.png){#fig:mean-recurrence-scaling width="\\linewidth"}

::: {#tab:mean-forward-results}
                    $\mu_{\text{rec}}=4$   $\mu_{\text{rec}}=8$   $\mu_{\text{rec}}=14$   $\mu_{\text{rec}}=20$   $\mu_{\text{rec}}=26$   $\mu_{\text{rec}}=32$
  ---------------- ---------------------- ---------------------- ----------------------- ----------------------- ----------------------- -----------------------
      Val Loss             2.477                **2.453**                 2.456                   2.457                   2.458                   2.458
   Val Perplexity          11.906               **11.624**               11.665                  11.671                  11.692                  11.687

  : Validation loss and perplexity for looped models trained with different `\meanrecurrence `{=latex}and a fixed $\mu_{\text{bwd}}=4$. We use $T=\mu_{\text{rec}}$. Surprisingly, $\mu_{\text{rec}}=8$ performs the best.
:::

A natural question is what choice of `\meanrecurrence `{=latex}and `\meanbackward `{=latex}is appropriate for pretraining looped models. To answer this question, we conduct an experiment where we scale up `\meanrecurrence`{=latex}, while keeping `\meanbackward `{=latex}fixed. In our very initial experiments, we pretrained several small recurrent depth models [@geiping_scaling_2025] on 10 billion tokens, with a fixed $\mu_{\text{bwd}}=4$ and with $\mu_{\text{rec}} \in [4,8,14,20,26,32]$[^4]. The results for each of these models on a held-out set of validation data can be observed in `\cref{fig:mean-recurrence-scaling}`{=latex}. We additionally include `\cref{tab:mean-forward-results}`{=latex}, which gives the validation loss of each model with $\mu_{\text{rec}} \in [4,8,14,20,26,32]$, where the recurrence that we use for each model at test-time is $T=\mu_{\text{rec}}$.

The fascinating observation of `\cref{fig:mean-recurrence-scaling}`{=latex} is that, contrary to our initial beliefs, models trained with `\meanrecurrence `{=latex}beyond 8 perform worse at both lower and higher $T$ used at test-time, even though more FLOPs were spent during pretraining. While it is natural to expect that models trained with lower `\meanrecurrence `{=latex}perform better than models with larger `\meanrecurrence `{=latex}at low $T$, the fact that a `\meanrecurrence `{=latex}of eight performs the best at higher $T$ (i.e., $T=16$ and $T=64$) is surprising. To determine if this is an inherent limitation of the capacity of looped models or an artifact of `\meanbackward`{=latex}, we ran an additional experiment where we fixed $\mu_{\text{rec}}=20$ and instead varied $\mu_{\text{bwd}} \in [4,6,8,10,12]$, pretraining each model on 8.5 billion tokens. We keep hyperparameters fixed. The results for each of these models on a held-out set of validation data can be visualized in `\cref{fig:mean-backward-recurrence-scaling}`{=latex} and `\cref{tab:mean-backward-results}`{=latex}.

![Validation and training curves of looped models, pretrained on 8.5 billion tokens. Each model is trained with a fixed $\mu_{\text{rec}}=20$ and $\mu_{\text{bwd}} \in [4,6,8,10,12]$. Observe that scaling up `\meanbackward `{=latex}improves validation performance at higher and lower recurrences monotonically for $T=1,16,64$.](figures/appendix/mean_back_depth_loss_curves.png){#fig:mean-backward-recurrence-scaling width="\\linewidth"}

::: {#tab:mean-backward-results}
                    $\mu_{\text{bwd}}=4$   $\mu_{\text{bwd}}=6$   $\mu_{\text{bwd}}=8$   $\mu_{\text{bwd}}=10$   $\mu_{\text{bwd}}=12$
  ---------------- ---------------------- ---------------------- ---------------------- ----------------------- -----------------------
      Val Loss             2.500                  2.490                  2.480                   2.479                 **2.474**
   Val Perplexity          12.09                  12.06                  11.94                   11.93                 **11.86**

  : Validation loss and perplexity of looped models trained with variable `\meanbackward`{=latex}, but fixed `\meanrecurrence`{=latex}.
:::

While lower `\meanbackward `{=latex}(i.e., $\mu_{\text{bwd}}=4,6,8$) appears to perform better with lower validation recurrences than higher `\meanbackward`{=latex}, the validation loss using $T=16,64$ improves as `\meanbackward `{=latex}increases. This implies that the capabilities of looped models utilizing deeper recurrences are heavily coupled with `\meanbackward`{=latex}. However, it can be observed that increasing `\meanbackward `{=latex}from ten to twelve has minimal impact on validation performance, at the cost of higher pretraining FLOPs. Using this insight, for our main training runs, we choose to use $$\mu_{\text{bwd}} = \lceil \frac{\mu_{\text{rec}}}{2} \rceil
\label{equation:mean-backward}$$ We leave the exploration of FLOP optimal choices of `\meanrecurrence `{=latex}and `\meanbackward `{=latex}to future work.

```{=latex}
\clearpage
```
Ablation of Prelude Normalization {#sec:prelude-norm}
=================================

In our initial set of experiments, we found that Parcae was able to train stably on the 140M, 370M, and 770M model configurations. Unfortunately, at the 1.3B scale, training appeared stable for the first 150k optimizer steps but afterwards exhibited state explosion and loss spikes, as shown in `\cref{fig:prelude-instability}`{=latex}. To diagnose and fix these issues, we performed a deep exploration of the weight checkpoints before and during loss spikes, investigating both the dynamical systems parameters (e.g., $\A, \B, \C, \dt$) and the non-linear parameters of $\overline{\mathcal{R}}$.

![**Late Stage Instability of 1.3B Parcae models.** We observe loss spikes and state explosion at the final stages of our large-scale run.](figures/appendix/prelude_norm.png){#fig:prelude-instability width="\\linewidth"}

![**Spectral Norms of $\dA, \dB, \C$ throughout training 1.3B Parcae.** We find that the spectral norm of $\dA$ and $\dB$ remain stable throughout training, while the spectral norm of $\C$ grows.](figures/appendix/spectral_norms_ABC.png){#fig:abc_spectral width="\\linewidth"}

We begin by examining the spectral norms of $\dA$, $\dB$, and $\C$ to see if our dynamical systems block was creating instability; the results can be found in `\cref{fig:abc_spectral}`{=latex}. While the spectral norm remains relatively low for $\dA$ and $\dB$, the spectral norm of $\C$ grows significantly throughout training. While this could be concerning, we find that when passing real activations from a subset of the validation set to $\C$, the empirical expansion ratio $\frac{\|\C x\|}{\|x\|}$ (i.e., how much the norm of the residual $x$ grows after applying $\C$) remains relatively low, as seen in `\cref{fig:c_expansion}`{=latex}.
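
The gap between a large spectral norm and a small empirical expansion ratio can be illustrated with a toy example. The matrix and input below are hypothetical: a diagonal matrix is used so that its spectral norm is simply the largest absolute diagonal entry, and the "typical" input is chosen to lie mostly along the weakly amplified direction.

```python
import math


def matvec(matrix, x):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in matrix]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


def expansion_ratio(matrix, x):
    """Empirical ratio ||C x|| / ||x|| for one activation vector x."""
    return l2_norm(matvec(matrix, x)) / l2_norm(x)


# Hypothetical 2x2 diagonal stand-in for C; its spectral norm is max |diagonal| = 10.
C = [[10.0, 0.0], [0.0, 0.1]]
spectral_norm = max(abs(C[0][0]), abs(C[1][1]))

x_worst_case = [1.0, 0.0]  # aligned with the top singular direction
x_typical = [0.01, 1.0]    # mostly aligned with the weak direction

print(spectral_norm, expansion_ratio(C, x_worst_case), expansion_ratio(C, x_typical))
```

The spectral norm is only a worst-case bound over all inputs, so activations that rarely align with the top singular direction can see a much smaller expansion ratio.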

![**Comparison of $\C$ Amplification with Spectral Norm.** We observe that the actual expansion ratio of $\C$ is small and decreasing slowly throughout training.](figures/appendix/C_expansion.png){#fig:c_expansion width="\\linewidth"}

![**Empirical Average of Recurrent State Norm over $T$ iterations.** For each checkpoint we have for our failed 1.3B Parcae model run, we evaluate the recurrent norm through $T=24$ recurrences at test time, on a held out validation set of fineweb-edu [@penedo2024finewebdatasetsdecantingweb]. We find that after an initial explosion on the first recurrence, the state remains relatively stable.](figures/appendix/state_norm_per_iteration.png){#fig:state_norm_diagnose width="\\linewidth"}

These results indicate that the dynamical systems units are likely not causing the explosion, and thus we turn our exploration to the dynamics of the entire recurrent unit. Specifically, we track the recurrent state norm at test-time through $T=24$ recurrences, the results of which can be found in `\cref{fig:state_norm_diagnose}`{=latex}. We found that on the first recurrence, the recurrent state norm jumped drastically and then remained relatively stable over subsequent recurrences. To determine what caused the initial spike, we perform a fine-grained analysis of the first recurrence (i.e., $T=1$), tracking the recurrent state norm after injection and through each transformer block, the results of which can be found in `\cref{fig:injection-explosion}`{=latex}. The major takeaway from `\cref{fig:injection-explosion}`{=latex} is that the non-linear parts of Parcae do not appear to cause the explosion in state and that the initial explosion stems from the input injection of $e$, the output of the prelude block `\prelude`{=latex}. We confirm that this is the case by tracking the residual norm through each layer of the prelude, as visualized in `\cref{fig:prelude-explosion}`{=latex}.

![**Recurrent State Norm Progression After Each Transformer Block for $T=1$.** For each checkpoint we have for our failed 1.3B Parcae model run, we evaluate the recurrent norm after injection and each non-linear transformer block for only $T=1$. We find that the non-linear parts of Parcae have little effect on explosion, which instead mainly stems from the initial injection of prelude output $e$.](figures/appendix/core_block_norms.png){#fig:injection-explosion width="\\linewidth"}

![**State Norm Progression Throughout each Transformer Layer in the Prelude Block.** For each checkpoint we have for our failed 1.3B Parcae model run, we evaluate residual norm after each transformer block in the prelude `\prelude`{=latex}. We find that a single layer creates an explosion of the residual norm and leads to divergence.](figures/appendix/prelude_norms.png){#fig:prelude-explosion width="\\linewidth"}

Given this, we propose a simple fix of adding a normalization layer on the output of the prelude block `\prelude `{=latex}(i.e., for an input $x$, we set $e \gets \text{LN}(\mathcal{P}(x))$, where $\text{LN}(\cdot)$ is some form of normalization). We note that this does two things: (1) it normalizes the input to the recurrent unit, which we observe to further stabilize the recurrent dynamics of looping, and (2) it stabilizes the gradient flow to the `\prelude`{=latex}.[^5] This simple fix enables our stable training run for the 1.3B Parcae reported in `\cref{sec:results}`{=latex}.
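
A minimal sketch of why normalizing the prelude output bounds the injected signal is given below. It assumes an RMS normalization without a learned gain as the choice of $\text{LN}(\cdot)$, which is only one possible option; the vectors stand in for a single token's prelude output and are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """RMS normalization without a learned gain (one possible choice of LN)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


# Hypothetical prelude outputs: a well-behaved one and one whose norm has exploded.
prelude_out = [0.5, -1.0, 2.0, 0.25]
prelude_out_exploded = [1000.0 * v for v in prelude_out]

e = rms_norm(prelude_out)
e_exploded = rms_norm(prelude_out_exploded)

# Both injected signals end up with norm close to sqrt(d), here sqrt(4) = 2.
print(l2_norm(prelude_out_exploded), l2_norm(e), l2_norm(e_exploded))
```

Whatever scale the prelude produces, the vector injected into the recurrent unit has a norm close to $\sqrt{d}$ under this choice of normalization.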

Empirically, we find that using a prelude norm directly stabilizes the recurrent norm further, preventing the recurrent norm from growing too large (see `\cref{fig:prelude-norm-better}`{=latex}). Additionally, we find that using a prelude norm leads to better convergence in both our 140M and 370M Parcae models (see `\cref{fig:prelude-quality}`{=latex}), with only a negligible improvement for our 770M and 1.3B Parcae models.

![**Prelude Norm Stabilizes Recurrent Norm.** We find that prelude norm helps stabilize recurrent state norm in Parcae models following the setup in `\cref{sec:e2e}`{=latex} for Transformers.](figures/appendix/prelude_norm_ablation.png){#fig:prelude-norm-better width="\\linewidth"}

![**Prelude Norm Improves Quality.** We find that in our 140M and 370M Parcae models trained in the same setup as `\cref{sec:e2e}`{=latex} for Transformers, normalizing the prelude output leads to better convergence.](figures/appendix/prelude_norm_val_loss.png){#fig:prelude-quality width="\\linewidth"}

```{=latex}
\clearpage
```
Fitting a Parametric Function for Looping {#sec:fit-par}
=========================================

We follow the setup of @hoffmann2022trainingcomputeoptimallargelanguage for fitting a parametric loss function. Specifically, using the models trained with several IsoFLOP budgets in `\cref{sec:train-scaling}`{=latex}, we fit a parametric function of the form $$\widehat{\mathcal{L}}_{\text{train}}(\mu_{\text{rec}}, \mathcal{D}) = E + A \cdot \mathbf{N}(\mu_{\text{rec}})^{-a} + B \cdot \mathcal{D}^{-b}$$ where $\mathbf{N}(\mu_{\text{rec}})$ is the *effective parameter count* of the model if you were to unroll all loops into real parameters, $\mathcal{D}$ is the number of tokens that were used in training, and $E, A, B, a, b$ are learned parameters. We specifically minimize the Huber loss [@huber] between the log of the prediction of the parametric fit and the log of the validation loss of the models, using L-BFGS [@lbfgs]. We choose a parametric function of this form as it exactly follows [@hoffmann2022trainingcomputeoptimallargelanguage], but with the parameter count $\mathbf{N}$ now being a function of `\meanrecurrence`{=latex}. Finally, we run 500 random restarts of L-BFGS, each with up to 10,000 iterations, and select the initialization that achieves the lowest Huber loss. The results of fitting the parametric function can be visualized in `\cref{fig:parameteric-fit-app}`{=latex}, and the learned values can be observed in `\cref{tab:parametric-fit-app}`{=latex}.
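
The parametric form and its fitting objective can be sketched as follows. This shows only the function being minimized, not the L-BFGS loop or the random restarts. The Huber threshold `delta=1e-3`, the effective parameter count, and the token counts are assumptions for illustration; the coefficient values are those listed for the small (140M) model in the table of learned values.

```python
import math


def parametric_loss(n_eff, tokens, E, A, a, B, b):
    """L_hat = E + A * N(mu_rec)^(-a) + B * D^(-b)."""
    return E + A * n_eff ** (-a) + B * tokens ** (-b)


def huber(residual, delta=1e-3):
    """Huber loss: quadratic near zero, linear beyond delta."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)


def fit_objective(params, observations, delta=1e-3):
    """Sum of Huber losses between log-predicted and log-observed loss.

    observations: iterable of (n_eff, tokens, observed_loss) triples.
    """
    E, A, a, B, b = params
    total = 0.0
    for n_eff, tokens, observed in observations:
        predicted = parametric_loss(n_eff, tokens, E, A, a, B, b)
        total += huber(math.log(predicted) - math.log(observed), delta)
    return total


# Coefficients reported for the small (140M) fit.
small_fit = (2.662, 522733.307, 0.771, 25420.102, 0.525)

# Hypothetical effective parameter count and token budgets.
n_eff = 1.0e9
loss_10b = parametric_loss(n_eff, 1.0e10, *small_fit)
loss_20b = parametric_loss(n_eff, 2.0e10, *small_fit)
print(loss_10b, loss_20b)
```

An optimizer would minimize `fit_objective` over the five parameters; in this form the predicted loss always stays above the irreducible term $E$ and decreases as either the effective parameter count or the token count grows.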

![**Parametric Fit of Looping.** Visualization of our parametric function $\widehat{\mathcal{L}}_{\text{train}}(\mu_{\text{rec}}, D)$, which displays the IsoLoss contours for both 140M Parcae (*left*) and 370M Parcae (*right*) models. ](figures/appendix/parametric_fit_contours.png){#fig:parameteric-fit-app width="\\linewidth"}

::: {#tab:parametric-fit-app}
  **Model**        $\boldsymbol{E}$   $\boldsymbol{A}$   $\boldsymbol{a}$   $\boldsymbol{B}$   $\boldsymbol{b}$   **Huber** ($\times 10^{-4}$)  
  --------------- ------------------ ------------------ ------------------ ------------------ ------------------ ------------------------------ --
  Small (140M)          2.662            522733.307           0.771            25420.102            0.525                     0.44              
  Medium (370M)         2.439            832134.346           0.775             6386.865            0.448                     0.01              

  : **Optimal Scaling Coefficients for Parametric Fits.**
:::

Fitting Parametric Functions to Test-Time Looping {#sec:test-par}
=================================================

In this section, we provide a more detailed analysis of the test-time scaling laws discussed in `\cref{sec:inf-scaling}`{=latex}. Following the setup discussed in `\cref{sec:scaling-laws-setup}`{=latex}, we train several Parcae models on varying `\meanrecurrence`{=latex}, fixing data and parameter count, and evaluate each at test-time recurrences up to $T=24$.

#### Choice of Functional Form.

We aim to find a parametric function that captures the saturating relationship between test-time recurrence $T$ and validation loss. We consider four candidate functional forms, each with an irreducible loss floor $\mathcal{L}_\infty$ (except the pure power law):

1.  $\mathcal{L}(T) = \mathcal{L}_\infty + Z \cdot e^{-zT}$ `\hfill `{=latex}(exponential decay)

2.  $\mathcal{L}(T) = \mathcal{L}_\infty + Z \cdot (1+T)^{-z}$ `\hfill `{=latex}(shifted power law)

3.  $\mathcal{L}(T) = \mathcal{L}_\infty + Z \cdot T^{-z}$ `\hfill `{=latex}(power law)

4.  $\mathcal{L}(T) = Z \cdot T^{-z}$ `\hfill `{=latex}(power law, no floor)

Each form has 3 free parameters ($\mathcal{L}_\infty, Z, z$), except form 4, which has 2. We fit each form independently to every test-time curve using least-squares on log-loss, and report the average Huber loss ($\delta = 10^{-3}$) across all curves. To evaluate extrapolation, we additionally fit each form on $T \leq \mu_{\text{rec}}$ and evaluate on held-out $T > \mu_{\text{rec}}$.
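
The four candidate forms and the extrapolation protocol can be written down compactly. The sketch below omits the least-squares fit itself and only shows the functional forms, the split into fitting points ($T \leq \mu_{\text{rec}}$) and held-out points ($T > \mu_{\text{rec}}$), and an error metric. Applying the Huber loss to log-loss residuals is an assumption, and the synthetic curve and all parameter values are hypothetical.

```python
import math


def exponential_decay(T, L_inf, Z, z):
    return L_inf + Z * math.exp(-z * T)


def shifted_power_law(T, L_inf, Z, z):
    return L_inf + Z * (1 + T) ** (-z)


def power_law(T, L_inf, Z, z):
    return L_inf + Z * T ** (-z)


def power_law_no_floor(T, Z, z):
    return Z * T ** (-z)


def split_for_extrapolation(curve, mu_rec):
    """Split (T, loss) pairs into fitting points and held-out points."""
    fit_points = [(T, loss) for T, loss in curve if T <= mu_rec]
    held_out = [(T, loss) for T, loss in curve if T > mu_rec]
    return fit_points, held_out


def mean_huber_on_log_loss(form, params, points, delta=1e-3):
    """Average Huber loss between log-predicted and log-observed loss."""
    total = 0.0
    for T, loss in points:
        r = abs(math.log(form(T, *params)) - math.log(loss))
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(points)


# Synthetic test-time curve generated from the exponential form (hypothetical values).
true_params = (2.5, 1.0, 0.5)
curve = [(T, exponential_decay(T, *true_params)) for T in range(1, 25)]
fit_points, held_out = split_for_extrapolation(curve, mu_rec=8)

err_exp = mean_huber_on_log_loss(exponential_decay, true_params, held_out)
err_no_floor = mean_huber_on_log_loss(power_law_no_floor, (3.0, 0.05), held_out)
print(len(fit_points), len(held_out), err_exp, err_no_floor)
```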

::: {#tab:functional-form-ablation}
                                              $\mathcal{L}_\infty {+} Z e^{-zT}$   $\mathcal{L}_\infty {+} Z(1{+}T)^{-z}$   $\mathcal{L}_\infty {+} Z T^{-z}$   $Z T^{-z}$
  ------------------------------------------ ------------------------------------ ---------------------------------------- ----------------------------------- ------------
  *In-Distribution*                                                                                                                                            
  `\quad `{=latex}140M                                     **2.52**                                 5.42                                  11.11                   112.89
  `\quad `{=latex}370M                                     **1.88**                                 5.26                                  10.77                   104.95
  *Extrapolation ($T > \mu_{\text{rec}}$)*                                                                                                                     
  `\quad `{=latex}140M                                     **3.18**                                21.41                                  43.99                   397.90
  `\quad `{=latex}370M                                     **2.29**                                18.51                                  38.68                   369.83

  : Functional form comparison for test-time scaling. We report average Huber loss ($\times 10^{-7}$) across all per-curve fits, both in-distribution (all $T$) and in extrapolation (fit $T \leq \mu_{\text{rec}}$, evaluate $T > \mu_{\text{rec}}$). Lower is better.
:::

As shown in `\cref{tab:functional-form-ablation}`{=latex}, the exponential decay form achieves the lowest Huber loss both in-distribution ($2.3\times$ better than the shifted power law) and under extrapolation ($7.1\times$ better), consistently across both model sizes. Notably, omitting the irreducible floor $\mathcal{L}_\infty$ (form 4) increases error by over $40\times$, confirming that test-time scaling saturates to a finite loss determined by training, a saturation that is also visible in `\cref{fig:test-time-scaling-laws}`{=latex}.

While purely speculative, the exponential form connects naturally to Parcae's dynamical systems framework. In classical control theory, a stable discrete-time linear system (one whose spectral radius is below unity) converges exponentially in the state norm. The observed exponential decay in loss is thus consistent with the dynamical system formulation that Parcae uses.
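The control-theoretic fact invoked above can be checked on a toy example. The sketch below iterates a hypothetical diagonal system $h_{t+1} = A h_t + b$ with made-up entries; it is not Parcae's recurrence, only an illustration that the distance to the fixed point contracts by at least the spectral radius at every step.

```python
import math

# Toy stable system h_{t+1} = A h_t + b with diagonal A (entries are made up).
a_diag = [0.9, -0.5, 0.3]
b = [1.0, 2.0, -1.0]
rho = max(abs(a) for a in a_diag)  # spectral radius of a diagonal matrix

# Fixed point of each scalar recurrence: h* = b / (1 - a).
h_star = [bi / (1.0 - ai) for ai, bi in zip(a_diag, b)]

h = [0.0, 0.0, 0.0]
errors = []
for _ in range(20):
    # Euclidean distance from the current state to the fixed point.
    errors.append(math.sqrt(sum((hi - si) ** 2 for hi, si in zip(h, h_star))))
    h = [ai * hi + bi for ai, hi, bi in zip(a_diag, h, b)]

# For a diagonal A, each step shrinks the error norm by at least a factor rho.
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
```

Plotting `errors` on a log scale would give a curve bounded above by a straight line of slope $\log \rho$, the same qualitative shape as an exponential decay toward a floor.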

#### Recovery of the training law at $T = \mu_\text{rec}$.

We additionally observe that the fitted irreducible loss $\mathcal{L}_\infty$ closely matches the empirical loss at $T = \mu_\text{rec}$ (`\cref{tab:recovery}`{=latex}), motivating the use of the training scaling law $\hat{\mathcal{L}}_{\mathrm{train}}(\mu_\text{rec}, D)$ as the irreducible floor in a unified law.

::: {#tab:recovery}
  Model    Mean % Err   Max % Err
  ------- ------------ -----------
  140M       0.16%        0.59%
  370M       0.05%        0.22%

  : Mean and max absolute percent error between $\mathcal{L}_\infty$ and $\mathcal{L}(T{=}\mu_\text{rec})$ across all isoFLOP configurations.
:::
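For concreteness, the error statistics in the table above can be computed as follows. This is a minimal sketch: the function name `percent_errors` is hypothetical and the input values are placeholders, not numbers from our experiments.

```python
def percent_errors(fitted_floors, empirical_losses):
    """Absolute percent error of each fitted floor against its reference loss."""
    return [abs(f - e) / e * 100.0 for f, e in zip(fitted_floors, empirical_losses)]


# Placeholder values: one (fitted floor, empirical loss at T = mu_rec) pair per configuration.
floors = [3.010, 2.995, 3.200]
empirical = [3.000, 3.000, 3.200]

errs = percent_errors(floors, empirical)
mean_err = sum(errs) / len(errs)
max_err = max(errs)
```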

#### Conditioning on Training Recurrence.

To model test-time scaling across models trained at different `\meanrecurrence`{=latex}, the decay rate must depend on the training depth. We compare three forms for the unified test-time law, all using the training scaling law $\hat{\mathcal{L}}_{\mathrm{train}}(\mu_{\text{rec}}, D)$ from `\cref{sec:train-scaling}`{=latex} as the irreducible floor:

1.  $\hat{\mathcal{L}}_{\mathrm{train}} + Z \cdot \exp\!\bigl(-z \cdot \mu_\text{rec}^{-\gamma} \cdot T\bigr)$ `\hfill `{=latex}(learned $\gamma$, 3 params)

2.  $\hat{\mathcal{L}}_{\mathrm{train}} + Z \cdot \exp\!\bigl(-z / \mu_\text{rec}\cdot T\bigr)$ `\hfill `{=latex}($\gamma = 1$, 2 params)

3.  $\hat{\mathcal{L}}_{\mathrm{train}} + Z \cdot \exp\!\bigl(-z \cdot T\bigr)$ `\hfill `{=latex}(no conditioning, 2 params)

::: {#tab:mu-conditioning-ablation}
                                             $Z e^{-z \mu^{-\gamma} T}$   $Z e^{-z T / \mu}$ ($\gamma{=}1$)   $Z e^{-z T}$ (no $\mu$)
  ----------------------------------------- ---------------------------- ----------------------------------- -------------------------
  *Train (isoFLOP)*                                                                                          
  `\quad `{=latex}140M                                0.001116                        0.001177                       0.003253
  `\quad `{=latex}370M                                0.000229                        0.000283                       0.001438
  *Test (held-out, $\mu_\text{rec}{=}8$)*                                                                    
  `\quad `{=latex}140M                              **0.000207**                      0.000212                       0.000266
  `\quad `{=latex}370M                                0.000133                      **0.000131**                     0.000189

  : Ablation of `\meanrecurrence `{=latex}conditioning in the unified test-time law. We report total Huber loss on the isoFLOP training set and on held-out Table 5 models ($\mu_\text{rec} = 8$, fixed data budget). Lower is better.
:::

As shown in `\cref{tab:mu-conditioning-ablation}`{=latex}, removing $\mu_\text{rec}$ conditioning entirely increases training error by $3.5\times$ and held-out error by ${\sim}33\%$, confirming that the decay rate must depend on training depth, a dependence that is also visible in `\cref{fig:test-time-scaling-laws}`{=latex}. The learned $\gamma$ offers a modest improvement (${\sim}8\%$) over $\gamma = 1$ on the training set, with fitted values of $\gamma = 1.19$ (140M) and $\gamma = 1.17$ (370M) that are consistent across scales; on held-out models, the two forms are indistinguishable. We therefore adopt $\gamma = 1$ for simplicity, yielding the unified law: $$\hat{\mathcal{L}}_{\mathrm{unified}}(T \mid \mu_\text{rec}, D) = \underbrace{E + X \cdot N(\mu_\text{rec})^{-x} + Y \cdot D^{-y}}_{\text{Training Law Floor } \hat{\mathcal{L}}_{\mathrm{train}}(\mu_\text{rec}, D)} + \underbrace{Z \cdot \exp\!\left(-\frac{z \cdot T}{\mu_\text{rec}}\right)}_{\text{Test-Time Decay}}$$ where the test-time term depends only on the ratio $T / \mu_\text{rec}$, i.e., the fraction of training depth used at inference.
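The unified law is simple to evaluate once the training floor is known. The sketch below takes the floor $\hat{\mathcal{L}}_{\mathrm{train}}(\mu_\text{rec}, D)$ as an input rather than recomputing it, and the function name `unified_loss` and all constants passed to it are placeholders rather than fitted values. It illustrates that with $\gamma = 1$ the prediction depends only on $T / \mu_\text{rec}$, whereas a learned $\gamma \neq 1$ breaks that invariance.

```python
import math


def unified_loss(T, mu_rec, train_floor, Z, z, gamma=1.0):
    """Test-time loss: training-law floor plus exponential decay in T.

    train_floor stands in for L_train(mu_rec, D); Z, z, and gamma are fitted
    constants (the values used below are placeholders, not fitted values).
    """
    return train_floor + Z * math.exp(-z * mu_rec ** (-gamma) * T)


# With gamma = 1, only the ratio T / mu_rec matters (here 4/8 == 2/4).
a = unified_loss(T=4, mu_rec=8, train_floor=3.0, Z=0.5, z=2.0)
b = unified_loss(T=2, mu_rec=4, train_floor=3.0, Z=0.5, z=2.0)

# With a learned gamma != 1, the same ratio no longer gives the same loss.
c = unified_loss(T=4, mu_rec=8, train_floor=3.0, Z=0.5, z=2.0, gamma=1.19)
d = unified_loss(T=2, mu_rec=4, train_floor=3.0, Z=0.5, z=2.0, gamma=1.19)

# Far beyond the training depth, the prediction saturates at the floor.
saturated = unified_loss(T=1000, mu_rec=8, train_floor=3.0, Z=0.5, z=2.0)
```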

#### Testing the Unified Parametric Fit.

To evaluate generalization, we use the unified law fitted on isoFLOP data to predict the test-time scaling curves of held-out 140M and 370M Parcae models from `\cref{sec:e2e}`{=latex}. These models were trained on fixed data budgets, a setting entirely outside the isoFLOP distribution used for fitting. As shown in `\cref{fig:unified-pred}`{=latex}, the unified fit (orange) predicts validation loss within 0.85--1.31% average error. When the training law floor is replaced with the empirical loss at $T = \mu_\text{rec}$ (oracle, blue), error drops to 0.10--0.17%, indicating that the test-time decay is faithfully captured and that the residual error is attributable to the training law's ${\sim}1\%$ extrapolation gap.

![**Out-of-Distribution Prediction of Unified Parametric Fit.** We visualize the prediction of our unified parametric fit (orange) and an oracle fit using the empirical loss at $T = \mu_\text{rec}$ for $\widehat{\mathcal{L}}_\text{train}$ (blue) against empirical validation loss with increasing $T$ for models trained in `\cref{sec:e2e}`{=latex}.](figures/appendix/unified_joint_generalization_exp.png){#fig:unified-pred width="\\linewidth"}

```{=latex}
\newpage
```
Extended Evaluation Details and Setup {#sec:evaluation-setup}
=====================================

```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{0.92}
```
```{=latex}
\small
```
::: {#tab:eval-tasks}
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| **Category**                                       | **Task**                                                                              | **Type** | **Shots** | **Core**          |
+:===================================================+:======================================================================================+:========:+:=========:+:=================:+
| ```{=latex}                                        | HellaSwag [@zellers2019hellaswag] (0-shot)                                            | MC       | 0         | `\cmark `{=latex} |
| \rotatebox[origin=c]{90}{\textit{Understanding}}   |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | HellaSwag [@zellers2019hellaswag] (10-shot)                                           | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Lambada [@paperno2016lambada]                                                         | LM       | 0         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Winograd WSC [@wsc:2015]                                                              | S        | 0         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | WinoGrande [@sakaguchi2021winogrande]                                                 | S        | 0         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Language ID [@srivastava2023imitationgamequantifyingextrapolating]          | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Conlang Translation [@srivastava2023imitationgamequantifyingextrapolating]  | LM       | 0         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Conceptual Comb. [@srivastava2023imitationgamequantifyingextrapolating]     | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| ```{=latex}                                        | Jeopardy [@kaggle200000Jeopardy]                                                      | LM       | 10        | `\cmark `{=latex} |
| \rotatebox[origin=c]{90}{\textit{World Knowl.}}    |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench QA WikiData [@srivastava2023imitationgamequantifyingextrapolating]          | LM       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | ARC-Easy [@clark2018think]                                                            | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | ARC-Challenge [@clark2018think]                                                       | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | MMLU (0-shot) [@hendrycks2021measuringmassivemultitasklanguage]                       | MC       | 0         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | MMLU (5-shot) [@hendrycks2021measuringmassivemultitasklanguage]                       | MC       | 5         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Misconceptions [@srivastava2023imitationgamequantifyingextrapolating]       | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| ```{=latex}                                        | COPA [@gordon-etal-2012-semeval]                                                      | MC       | 0         | `\cmark `{=latex} |
| \rotatebox[origin=c]{90}{\textit{Commonsense}}     |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | CommonsenseQA [@talmor2019commonsenseqaquestionansweringchallenge]                    | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | PIQA [@bisk2020piqa]                                                                  | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | OpenBookQA [@mihaylov2018suitarmorconductelectricity]                                 | MC       | 0         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | SIQA [@sap2019socialiqacommonsensereasoningsocial]                                    | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Novel Concepts [@srivastava2023imitationgamequantifyingextrapolating]       | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Strange Stories [@srivastava2023imitationgamequantifyingextrapolating]      | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Strategy QA [@srivastava2023imitationgamequantifyingextrapolating]          | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| ```{=latex}                                        | BIG-Bench Dyck Languages [@srivastava2023imitationgamequantifyingextrapolating]       | LM       | 10        | `\cmark `{=latex} |
| \rotatebox[origin=c]{90}{\textit{Symbolic / Math}} |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | AGI Eval LSAT AR [@zhong2021arlsat]                                                   | MC       | 3         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench CS Algorithms [@srivastava2023imitationgamequantifyingextrapolating]        | LM       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Operators [@srivastava2023imitationgamequantifyingextrapolating]            | LM       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Repeat Copy Logic [@srivastava2023imitationgamequantifyingextrapolating]    | LM       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Elementary Math QA [@srivastava2023imitationgamequantifyingextrapolating]   | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Logical Deduction [@srivastava2023imitationgamequantifyingextrapolating]    | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Simple Arithmetic (no spaces) [@llmfoundry]                                           | LM       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Simple Arithmetic (w/ spaces) [@llmfoundry]                                           | LM       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | MathQA [@amini2019mathqa]                                                             | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | LogiQA [@liu2020logiqa]                                                               | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| ```{=latex}                                        | SQuAD [@rajpurkar2016squad100000questionsmachine]                                     | LM       | 10        | `\cmark `{=latex} |
| \rotatebox[origin=c]{90}{\textit{Reading Comp.}}   |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | CoQA [@reddy2019coqaconversationalquestionanswering]                                  | LM       | 0         | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BoolQ [@clark2019boolq]                                                               | MC       | 10        | `\cmark `{=latex} |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | PubMedQA (labeled) [@jin2019pubmedqa]                                                 | LM       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | AGI Eval LSAT RC [@zhong2023agieval]                                                  | MC       | 3         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | AGI Eval LSAT LR [@wang2022lsat]                                                      | MC       | 3         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | AGI Eval SAT English [@zhong2023agieval]                                              | MC       | 3         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BIG-Bench Understanding Fables [@srivastava2023imitationgamequantifyingextrapolating] | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
| ```{=latex}                                        | Winogender MC (Female) [@rudinger2018genderbiascoreferenceresolution]                 | MC       | 10        |                   |
| \rotatebox[origin=c]{90}{\textit{Safety}}          |                                                                                       |          |           |                   |
| ```                                                |                                                                                       |          |           |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Winogender MC (Male) [@rudinger2018genderbiascoreferenceresolution]                   | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | Enterprise PII Classification                                                         | MC       | 10        |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+
|                                                    | BBQ [@parrish2022bbqhandbuiltbiasbenchmark]                                           | MC       | 3         |                   |
+----------------------------------------------------+---------------------------------------------------------------------------------------+----------+-----------+-------------------+

: **Full list of downstream evaluation tasks.** Tasks marked with `\cmark{}`{=latex} are included in **Core** [@li2025datacomplmsearchgenerationtraining]; all tasks are included in **Core-Extended** [@li2025datacomplmsearchgenerationtraining]. Type indicates the scoring method: **MC** (multiple choice, lowest mean NLL), **S** (schema-based NLL), or **LM** (exact greedy match).
:::

`\cref{tab:eval-tasks}`{=latex} lists every benchmark used for evaluation. For the comparisons against Transformer baselines in `\cref{sec:results}`{=latex}, we run each benchmark with three different seeds, since the seed changes both the initial recurrent state and the in-context few-shot examples.
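As an illustration of the **MC** scoring rule, the sketch below selects the answer option with the lowest mean negative log-likelihood. We assume here that the mean is taken per token of each option's continuation; the function name `pick_multiple_choice` and the NLL values are hypothetical.

```python
def pick_multiple_choice(option_token_nlls):
    """Return the index of the option with the lowest mean per-token NLL."""
    means = [sum(nlls) / len(nlls) for nlls in option_token_nlls]
    return min(range(len(means)), key=means.__getitem__)


# Made-up per-token NLLs for two answer options of different lengths.
options = [
    [1.2],            # short option: total 1.2, mean 1.2
    [0.5, 0.5, 0.5],  # longer option: total 1.5, mean 0.5
]
choice = pick_multiple_choice(options)
```

In this example the second option is selected even though its total NLL is higher, which shows how averaging avoids penalizing longer answer options.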

Expanded Results For Fixed-Depth and Looping IsoFLOP Comparison {#sec:fixed-comparison-expanded}
===============================================================

To aid reproducibility, `\cref{tab:qual-fix-loop-expanded}`{=latex} expands `\cref{tab:qual-fix-loop}`{=latex} with error bars.

::: {#tab:qual-fix-loop-expanded}
+:-----------------------------------------------------:+:------------------:+:----------------------:+:----------------------------------:+:----------------------------------------:+:-----------------------:+:-----------------------:+
|                                                       | **FLOPs**          |                        | **Optimal $\mu_{\mathrm{rec}}^*$** |                                          | **Fixed-Depth**         | ($\mu_{\mathrm{rec}}=1$)|
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | ($\times 10^{18}$) | $\mu_{\mathrm{rec}}^*$ | Core                               | Core Ext.                                | Core                    | Core Ext.               |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
| ```{=latex}                                           | $1$                | 2                      | $7.6 \pm 0.3$                      | $5.7 \pm 0.5$                            | $\mathbf{7.9 \pm 0.2}$  | $\mathbf{6.1 \pm 0.1}$  |
| \rotatebox[origin=c]{90}{\textbf{140M}}               |                    |                        |                                    |                                          |                         |                         |
| ```                                                   |                    |                        |                                    |                                          |                         |                         |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $2$                | 2                      | $9.0 \pm 0.2$                      | $6.2 \pm 0.1$                            | $\mathbf{10.5 \pm 0.1}$ | $\mathbf{6.4 \pm 0.2}$  |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $4$                | 4                      | $\mathbf{11.2 \pm 0.0}$            | $\mathbf{8.4 \pm 0.2}$                   | $10.7 \pm 0.1$          | $8.1 \pm 0.3$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $8$                | 6                      | $10.5 \pm 0.1$                     | $\mathbf{7.8 \pm 0.2}$                   | $\mathbf{11.8 \pm 0.2}$ | $7.7 \pm 0.2$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $16$               | 8                      | $\mathbf{14.6 \pm 0.1}$            | $\mathbf{9.8 \pm 0.4}$                   | $13.0 \pm 0.2$          | $8.8 \pm 0.4$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $64$               | 10                     | $\mathbf{16.2 \pm 0.2}$            | $\mathbf{11.0 \pm 0.1}$                  | $15.0 \pm 0.2$          | $9.5 \pm 0.4$           |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
| ```{=latex}                                           | $32$               | 4                      | $15.2 \pm 0.1$                     | $10.1 \pm 0.2$                           | $\mathbf{16.8 \pm 0.1}$ | $\mathbf{11.2 \pm 0.4}$ |
| \rotatebox[origin=c]{90}{\textbf{370M}}               |                    |                        |                                    |                                          |                         |                         |
| ```                                                   |                    |                        |                                    |                                          |                         |                         |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $64$               | 6                      | $\mathbf{18.1 \pm 0.2}$            | $11.6 \pm 0.2$                           | $\mathbf{18.1 \pm 0.1}$ | $\mathbf{12.1 \pm 0.2}$ |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+
|                                                       | $128$              | 6                      | $\mathbf{20.1 \pm 0.1}$            | $\mathbf{13.0 \pm 0.1}$                  | $18.1 \pm 0.1$          | $12.0 \pm 0.1$          |
+-------------------------------------------------------+--------------------+------------------------+------------------------------------+------------------------------------------+-------------------------+-------------------------+

: **Expanded Core Scores Comparison of Looping Optimal Frontier over Purely Scaling Data.** Entries are reported with variance bars ($\pm$).
:::

Expanded Setup For Training and Test-Time Scaling Laws {#sec:scaling-laws-setup}
======================================================

For our scaling-law experiments, we train models under two setups: (1) an isoFLOP setup, where we vary `\meanrecurrence`{=latex} under fixed FLOP and parameter budgets, and (2) a fixed-data setup, where we vary `\meanrecurrence`{=latex} while keeping data and parameters constant. Additionally, for our unified scaling-law experiments, we reuse the models trained in setup (1) and evaluate them with varying numbers of test-time recurrences. All experiments use the exact same experimental setup for Transformers described in `\cref{sec:hyperparameters}`{=latex} and `\cref{sec:model-definitions}`{=latex} (i.e., the `nanochat` [@nanochat] setup). We discuss each experiment in detail below.

#### (1) Setup for IsoFLOP Experiments.

For each parameter count (140M and 370M), we fix the total training FLOP budget and vary $\mu_{\text{rec}} \in \{2, 4, 6, 8, 10, 12\}$, adjusting the number of training tokens to maintain the FLOP budget (i.e., increasing $\mu_{\text{rec}}$ reduces the token budget proportionally). For 140M models, we use FLOP budgets of $\{1, 2, 4, 8, 16, 64\} \times 10^{18}$; for 370M models, $\{32, 64, 128\} \times 10^{18}$. This yields 36 and 18 trained models for 140M and 370M, respectively. Each model is evaluated on a held-out validation set at $T = \mu_{\text{rec}}$. We use these validation losses to fit the parametric training scaling law $\widehat{\mathcal{L}}_{\text{train}}(\mu_{\text{rec}}, \mathcal{D})$ and to extract optimal $\mu_{\text{rec}}^*$ at each FLOP budget via parabolic fits. Additionally, we train fixed-depth ($\mu_{\text{rec}} = 1$) Parcae models at each FLOP budget to serve as baselines for the looping frontier comparison. Expanded details of the predicted frontiers calculation can be found in `\cref{sec:fixed-comparison-expanded}`{=latex}.
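As a rough illustration of the sweep described above, the following sketch enumerates the isoFLOP grid and extracts a loss-minimizing recurrence from a parabolic fit. The helper names (`isoflop_grid`, `fit_parabola`, `optimal_recurrence`), the choice to fit the parabola in $\log \mu_{\text{rec}}$, and the synthetic losses are assumptions made for illustration only; they are not the fitting code or the measurements behind the reported results.

```python
import math

MU_REC = [2, 4, 6, 8, 10, 12]
FLOP_BUDGETS = {
    "140M": [b * 1e18 for b in (1, 2, 4, 8, 16, 64)],
    "370M": [b * 1e18 for b in (32, 64, 128)],
}


def isoflop_grid(size):
    """All (FLOP budget, mu_rec) pairs trained for one model size."""
    return [(c, mu) for c in FLOP_BUDGETS[size] for mu in MU_REC]


def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    s = [sum(x ** k for x in xs) for k in range(5)]          # sums of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3x3 system for the unknowns (a, b, c).
    m = [
        [s[4], s[3], s[2], t[2]],
        [s[3], s[2], s[1], t[1]],
        [s[2], s[1], s[0], t[0]],
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [vr - f * vi for vr, vi in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def optimal_recurrence(mus, losses):
    """Vertex of a parabola fitted in log(mu_rec) (assumed fitting space)."""
    xs = [math.log(mu) for mu in mus]
    a, b, _ = fit_parabola(xs, losses)
    if a <= 0:
        raise ValueError("fitted parabola has no interior minimum")
    return math.exp(-b / (2 * a))


# Synthetic (not measured) losses with a minimum placed at mu_rec = 6.
synthetic = [3.0 + 0.05 * (math.log(mu) - math.log(6)) ** 2 for mu in MU_REC]
mu_star = optimal_recurrence(MU_REC, synthetic)
```

In this sketch, `isoflop_grid("140M")` and `isoflop_grid("370M")` contain 36 and 18 entries, matching the model counts stated above.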

#### (2) Setup for Test-Time Saturation and Power Laws.

To study how test-time recurrence scales quality, we train 140M and 370M Parcae models under a fixed data budget of 11.2B tokens with $\mu_{\text{rec}} \in \{2, 4, 6, 8, 10, 12\}$. Each model is then evaluated on a held-out validation set at test-time recurrences $T \in \{1, 2, 3, \ldots, 24\}$, yielding a saturation curve per $\mu_{\text{rec}}$. We fit an independent exponential decay law $\mathcal{L}(T) = \mathcal{L}_\infty + Z \cdot \exp(-z \cdot T)$ to each curve following the procedure in Section `\ref{sec:test-par}`{=latex}. We additionally evaluate the Parcae models from `\cref{sec:e2e}`{=latex} (140M--1.3B, trained at $\mu_{\text{rec}}=8$) at test-time recurrences $T \in \{1, \ldots, 16\}$ to verify that the saturation behavior is consistent across model sizes.
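To make the functional form $\mathcal{L}(T) = \mathcal{L}_\infty + Z \cdot \exp(-z \cdot T)$ concrete, the sketch below fits it to a single saturation curve. It is a simple stand-in, not the procedure referenced above: it assumes a grid search over the rate $z$ and, for each candidate $z$, solves the remaining linear least-squares problem for $\mathcal{L}_\infty$ and $Z$ in closed form. The function names, the rate grid, and the synthetic curve are all hypothetical.

```python
import math


def fit_given_rate(ts, losses, z):
    """For a fixed rate z, solve least squares for (L_inf, Z) in closed form."""
    es = [math.exp(-z * t) for t in ts]
    n = len(ts)
    se, see = sum(es), sum(e * e for e in es)
    sy, sye = sum(losses), sum(y * e for y, e in zip(losses, es))
    det = n * see - se * se
    l_inf = (sy * see - se * sye) / det
    big_z = (n * sye - se * sy) / det
    sse = sum((l_inf + big_z * e - y) ** 2 for e, y in zip(es, losses))
    return sse, l_inf, big_z


def fit_saturation(ts, losses, rates):
    """Return (L_inf, Z, z) for the grid rate with the smallest squared error."""
    candidates = (fit_given_rate(ts, losses, z) + (z,) for z in rates)
    _, l_inf, big_z, z = min(candidates, key=lambda r: r[0])
    return l_inf, big_z, z


# Synthetic (not measured) curve over T = 1..24 with known parameters.
ts = list(range(1, 25))
ys = [3.0 + 0.8 * math.exp(-0.35 * t) for t in ts]
rates = [0.01 * k for k in range(1, 201)]  # assumed search grid for z
l_inf, big_z, rate = fit_saturation(ts, ys, rates)
```

Separating the nonlinear parameter $z$ from the two linear ones keeps the example dependency-free; a gradient-based optimizer over all three parameters would be an equally reasonable choice.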

#### (3) Setup for Unified Scaling Law.

To fit the unified scaling law (`\cref{eq:unified}`{=latex}), we reuse the isoFLOP models from setup (1) and evaluate each at test-time recurrences $T \in \{1, 2, 4, 6, 8, 10, 12, 16, 20, 24\}$, yielding approximately 540 data points per model size. We fit all 8 parameters of `\cref{eq:unified}`{=latex} jointly on this data using Huber loss on the log loss with L-BFGS over 1,000 random restarts. To validate, we evaluate the unified fit on held-out 140M and 370M Parcae models from `\cref{sec:e2e}`{=latex}, which were trained on fixed data budgets outside the isoFLOP sweep, at test-time recurrences $T \in \{1, \ldots, 16\}$.
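The fitting objective named above, a Huber loss on log losses, can be sketched as follows. The threshold `delta`, the function names, and especially `toy_predict` are assumptions for illustration: `toy_predict` is a three-parameter placeholder and is not the 8-parameter unified law. In practice, an objective of this shape would be handed to an L-BFGS routine and minimized from many random initializations.

```python
import math


def huber(residual, delta=1.0):
    """Quadratic near zero, linear in the tails; delta is an assumed threshold."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def log_huber_objective(predict, params, data, delta=1.0):
    """Mean Huber loss between log-predicted and log-observed losses.

    `data` is a list of ((mu_rec, tokens, T), observed_loss) pairs.
    """
    total = 0.0
    for (mu_rec, tokens, t), observed in data:
        predicted = predict(params, mu_rec, tokens, t)
        total += huber(math.log(predicted) - math.log(observed), delta)
    return total / len(data)


def toy_predict(params, mu_rec, tokens, t):
    """Placeholder predictor (NOT the unified law): floor plus decay in T."""
    floor, scale, rate = params
    return floor + scale * math.exp(-rate * t)


# Synthetic data generated by the placeholder itself, so the objective is 0
# at the generating parameters and positive elsewhere.
true_params = (3.0, 0.8, 0.35)
data = [((8, 11.2e9, t), toy_predict(true_params, 8, 11.2e9, t))
        for t in (1, 2, 4, 6, 8, 10, 12, 16, 20, 24)]
at_truth = log_huber_objective(toy_predict, true_params, data)
off_truth = log_huber_objective(toy_predict, (3.5, 0.8, 0.35), data)
```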

Model Definitions {#sec:model-definitions}
=================

As we perform experiments in two setups, one following prior work in recurrent depth models [@geiping_scaling_2025] and one following a strong baseline transformer [@nanochat], we separate the model definitions into `\cref{sec:rdm-parcae-model-def}`{=latex} and `\cref{sec:trans-parcae-model-def}`{=latex}, respectively.

Model Definitions for RDM and Parcae Comparison {#sec:rdm-parcae-model-def}
-----------------------------------------------

In this section, we will discuss the model configuration used for models in `\cref{sec:e2e}`{=latex} for RDMs [@geiping_scaling_2025]. For all `\prelude`{=latex}, `\recurrent`{=latex}, and `\coda `{=latex}modules, we follow @geiping_scaling_2025, and use standard, causal self-attention and gated SwiGLU MLP [@shazeer2020gluvariantsimprovetransformer]. For attention, we use RoPE [@su2023roformerenhancedtransformerrotary] with $\theta=50000$ and for normalization we use RMSNorm [@zhang2019rootmeansquarelayer]. We use Pre-Norm transformer blocks for all modules within Parcae, and follow @takase2025spikemorestabilizingpretraining, initializing weights using $\mathcal{N}(0,\frac{2}{5d})$, where $d$ is the model dimension.
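As a small worked example of the initialization scale, the sketch below computes the standard deviation implied by $\mathcal{N}(0,\frac{2}{5d})$, assuming the second argument denotes the variance. The function name `init_std` is hypothetical.

```python
import math


def init_std(d_model: int) -> float:
    """Standard deviation for N(0, 2 / (5 * d)), reading 2 / (5 * d) as the variance."""
    return math.sqrt(2.0 / (5.0 * d_model))


# Model dimensions used by the 100M and 350M configurations.
stds = {d: init_std(d) for d in (1024, 2048)}
```

Under this reading, the standard deviation shrinks as $1/\sqrt{d}$, so doubling the model dimension reduces it by a factor of $\sqrt{2}$.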

::: {#tab:parcae-definitions}
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
|                                 | Parcae-100M                                                                                      | Parcae-350M | RDM-100M    | RDM-350M    |
+:===============================:+:================================================================================================:+:===========:+:===========:+:===========:+
| **Parameters**                  | 114,242,560                                                                                      | 378,558,464 | 114,242,560 | 382,765,056 |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Layers in `\prelude  `{=latex}  | 1                                                                                                | 1           | 1           | 1           |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Layers in `\coda `{=latex}      | 1                                                                                                | 1           | 1           | 1           |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Layers in `\recurrent `{=latex} | 1                                                                                                | 2           | 1           | 2           |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| $d_{\text{model}}$              | 1,024                                                                                            | 2,048       | 1,024       | 2,048       |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| $d_{\text{intermediate}}$       | 3,520                                                                                            | 7,040       | 3,520       | 7,040       |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Attention                       | Causal Self-Attention [@vaswani2023attentionneed]                                                |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| MLP                             | SwiGLU [@elfwing2017sigmoidweightedlinearunitsneural; @shazeer2020gluvariantsimprovetransformer] |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Pos. Embed.                     | RoPE [@su2023roformerenhancedtransformerrotary]                                                  |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Vocab Size                      | 65,536                                                                                           |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Norm                            | RMS-Norm [@zhang2019rootmeansquarelayer]                                                         |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Init                            | Scaled [@takase2025spikemorestabilizingpretraining]                                              |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Tied Embeddings                 | Yes                                                                                              |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| State Init.                     | `like-init` [@geiping_scaling_2025]                                                              |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| ```{=latex}                     | 16                                                                                               | 8           | 16          | 8           |
| \meanrecurrence                 |                                                                                                  |             |             |             |
| ```                             |                                                                                                  |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Backprop Depth                  | 8                                                                                                | 4           | 8           | 4           |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+
| Sampling                        | Poisson Distribution                                                                             |             |             |             |
+---------------------------------+--------------------------------------------------------------------------------------------------+-------------+-------------+-------------+

: Model definitions of both Parcae and baseline residual-norm RDMs [@geiping_scaling_2025].
:::

Model Definitions for Transformer and Parcae Comparison {#sec:trans-parcae-model-def}
-------------------------------------------------------

In this section, we will discuss the model definitions used for our experiments in `\cref{sec:e2e}`{=latex} for Transformers. Our architecture is derived from @nanochat, with slight adaptations to fit GPT-2-style [@radford2019language] parameter classes. Model definitions of both Parcae and baseline Transformers can be found in `\cref{tab:transformer-parcae-definitions}`{=latex}, while the difference in parameter count can be found in `\cref{tab:transform-parcae-parameter-count}`{=latex}.[^6]

```{=latex}
\small
```
::: {#tab:transformer-parcae-definitions}
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   |                                         | Small (140M)                                                                                               | Medium (370M) | Large (770M) | XLarge (1.3B) |
+:==================================================+:=======================================:+:==========================================================================================================:+:=============:+:============:+:=============:+
| ```{=latex}                                       | Layers (Transformer)                    | 6                                                                                                          | 12            | 18           | 24            |
| \rotatebox[origin=c]{90}{\textit{Architecture}}   |                                         |                                                                                                            |               |              |               |
| ```                                               |                                         |                                                                                                            |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Layers in `\prelude `{=latex}(Parcae)   | 2                                                                                                          | 4             | 6            | 8             |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Layers in `\recurrent `{=latex}(Parcae) | 2                                                                                                          | 4             | 6            | 8             |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Layers in `\coda `{=latex}(Parcae)      | 2                                                                                                          | 4             | 6            | 8             |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | $d_{\text{model}}$                      | 768                                                                                                        | 1,024         | 1,280        | 1,536         |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | $d_{\text{intermediate}}$               | 3,072                                                                                                      | 4,096         | 5,120        | 6,144         |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Attention Heads                         | 6                                                                                                          | 8             | 10           | 12            |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Head Dimension                          | 128                                                                                                        | 128           | 128          | 128           |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
| ```{=latex}                                       | Attention                               | Causal Self-Attention [@vaswani2023attentionneed] w/ QK-Norm [@henry2020querykeynormalizationtransformers] |               |              |               |
| \rotatebox[origin=c]{90}{\textit{Shared Details}} |                                         |                                                                                                            |               |              |               |
| ```                                               |                                         |                                                                                                            |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | MLP                                     | $\text{ReLU}^2$ [@zhang2024relu2winsdiscoveringefficient]                                                  |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Value Embeddings                        | Gated, alternating layers [@tian2023resformerscalingvitsmultiresolution]                                   |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Pos. Embed.                             | RoPE ($\theta{=}50{,}000$) [@su2023roformerenhancedtransformerrotary]                                      |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Vocab Size                              | 32,768                                                                                                     |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Norm                                    | RMS-Norm (Pre-Norm) [@zhang2019rootmeansquarelayer]                                                        |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Context Length                          | 2,048                                                                                                      |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Bias                                    | None                                                                                                       |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Init.                                   | Scaled-zero [@takase2025spikemorestabilizingpretraining; @nanochat]                                        |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Tied Embeddings                         | Yes                                                                                                        |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
| ```{=latex}                                       | Injection                               | Diagonal                                                                                                   |               |              |               |
| \rotatebox[origin=c]{90}{\textit{Parcae}}         |                                         |                                                                                                            |               |              |               |
| ```                                               |                                         |                                                                                                            |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | State Init.                             | `like-init` [@geiping_scaling_2025]                                                                        |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | ```{=latex}                             | 8                                                                                                          |               |              |               |
|                                                   | \meanrecurrence                         |                                                                                                            |               |              |               |
|                                                   | ```                                     |                                                                                                            |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Backprop Depth                          | 4                                                                                                          |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+
|                                                   | Sampling                                | Poisson (truncated, per-sequence)                                                                          |               |              |               |
+---------------------------------------------------+-----------------------------------------+------------------------------------------------------------------------------------------------------------+---------------+--------------+---------------+

: Model definitions of both Parcae and baseline Transformers.
:::
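The architecture rows of the table above can be written as a small configuration object, which makes a few internal relationships easy to check. This is a sketch: the class and field names are hypothetical, and `unrolled_depth` assumes that each recurrence applies the recurrent block exactly once between the prelude and the coda.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SizeConfig:
    name: str
    transformer_layers: int
    prelude_layers: int
    recurrent_layers: int
    coda_layers: int
    d_model: int
    d_intermediate: int
    n_heads: int
    head_dim: int = 128

    def unique_parcae_layers(self) -> int:
        """Number of distinct (non-repeated) layers in the Parcae model."""
        return self.prelude_layers + self.recurrent_layers + self.coda_layers

    def unrolled_depth(self, recurrences: int) -> int:
        """Layers applied per forward pass when looping `recurrences` times."""
        return (self.prelude_layers
                + recurrences * self.recurrent_layers
                + self.coda_layers)


CONFIGS = [
    SizeConfig("Small (140M)", 6, 2, 2, 2, 768, 3072, 6),
    SizeConfig("Medium (370M)", 12, 4, 4, 4, 1024, 4096, 8),
    SizeConfig("Large (770M)", 18, 6, 6, 6, 1280, 5120, 10),
    SizeConfig("XLarge (1.3B)", 24, 8, 8, 8, 1536, 6144, 12),
]


def is_consistent(cfg: SizeConfig) -> bool:
    """Heads tile d_model, the MLP is 4x wide, and layer counts match."""
    return (cfg.n_heads * cfg.head_dim == cfg.d_model
            and cfg.d_intermediate == 4 * cfg.d_model
            and cfg.unique_parcae_layers() == cfg.transformer_layers)
```

All four configurations pass `is_consistent`, reflecting that each Parcae model splits the Transformer's layer count evenly across its three modules.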

```{=latex}
\small
```
::: {#tab:transform-parcae-parameter-count}
                            Small (140M)   Medium (370M)   Large (770M)   XLarge (1.3B)
  ------------------------ -------------- --------------- -------------- ---------------
  Transformer Parameters    143,141,184     385,903,104    773,375,040    1,333,868,544
  Parcae Parameters         144,323,136     388,003,328    776,655,680    1,338,591,744
  Additional Parameters      1,181,952       2,100,224      3,280,640       4,723,200
  Additional (%)               0.83%           0.54%          0.42%           0.35%

  : Comparison of Parcae and Transformer parameter count.
:::
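The last two rows of the table above follow directly from the first two. The short sketch below recomputes them, assuming the percentage is taken relative to the Transformer parameter count; the names `PARAMS` and `overhead` are hypothetical.

```python
# (Transformer parameters, Parcae parameters) as listed in the table.
PARAMS = {
    "Small (140M)": (143_141_184, 144_323_136),
    "Medium (370M)": (385_903_104, 388_003_328),
    "Large (770M)": (773_375_040, 776_655_680),
    "XLarge (1.3B)": (1_333_868_544, 1_338_591_744),
}


def overhead(transformer: int, parcae: int):
    """Additional parameters and their share of the Transformer count (%)."""
    extra = parcae - transformer
    return extra, 100.0 * extra / transformer


OVERHEADS = {name: overhead(t, p) for name, (t, p) in PARAMS.items()}
```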

```{=latex}
\newpage
```
Hyperparameters and Training Details {#sec:hyperparameters}
====================================

Again, as we perform experiments in two setups, one following prior work in recurrent depth models [@geiping_scaling_2025] and one following a strong baseline transformer [@nanochat], we separate the hyperparameter configurations into `\cref{sec:rdm-parcae-hyp}`{=latex} and `\cref{sec:trans-parcae-hyp}`{=latex}, respectively.

Hyperparameters for Parcae and RDM Comparison {#sec:rdm-parcae-hyp}
---------------------------------------------

In this section, we will discuss the hyperparameter configuration used in `\cref{sec:e2e}`{=latex} for RDMs [@geiping_scaling_2025]. We train with a warm-up and cool-down (4096 steps, following @geiping_scaling_2025) and a constant learning rate ($\eta = 4 \times 10^{-3}$ for 100M models and $\eta = 2 \times 10^{-3}$ for 350M models) [@pmlr-v202-geiping23a; @Zhai_2022_CVPR]. As our optimizer, we use Adam with decoupled weight regularization ($\beta_1 = 0.9, \beta_2 = 0.95$) [@kingma2017adammethodstochasticoptimization; @loshchilov2019decoupledweightdecayregularization], using update clipping [@wortsman2023stable] and removing the $\epsilon$ constant [@everett2024scalingexponentsparameterizationsoptimizers]. Gradients above 1 are clipped.

We swept learning rates for RDMs [@geiping_scaling_2025] over the search space $\{2e{-4}, 4e{-4}, 6e{-4}, 8e{-4}, 1e{-3}\}$, using an approximately 10:1 token-to-parameter ratio, and selected the best learning rate at each scale ($4e{-4}$ for 100M and $2e{-4}$ for 350M). We perform no learning rate sweep for Parcae and instead reuse the best learning rate found for RDMs [@geiping_scaling_2025]. We do this so that our comparison between Parcae and prior methods is fair, as we observed significant divergence in training for RDMs depending on the learning rate (see `\cref{sec:stability-ablations}`{=latex}). We expect that Parcae models would likely perform better with more extensive hyperparameter tuning.

Hyperparameters for Parcae and Transformer Comparison {#sec:trans-parcae-hyp}
-----------------------------------------------------

In this section, we will discuss the hyperparameter configuration used in `\cref{sec:e2e}`{=latex} for Transformers. We use a simplified version of `nanochat` [@nanochat], with the main difference being a simplified learning rate selection. Specifically, in `nanochat` [@nanochat], different parameter groups have different learning rates (e.g., the MLP, value embeddings, and projection head each have their own learning rate), which we simplify into just two parameter groups, one for AdamW [@kingma2017adammethodstochasticoptimization; @loshchilov2019decoupledweightdecayregularization] and one for Muon [@jordan2024muon]. The assignment of parameters to these groups follows `nanochat` [@nanochat] and can be found in `\cref{tab:parameter-groups}`{=latex}.

```{=latex}
\small
```
::: {#tab:parameter-groups}
  **Optimizer**                                                       **Parameters**
  ------------------------------------------------------------------- ----------------------------------------------------
  AdamW [@kingma2017adammethodstochasticoptimization]                 Token embeddings (`wte`)
                                                                      LM head (`lm_head`)
                                                                      Normalization layers (`RMSNorm`)
                                                                      Value embedding gates (`ve_gate`)
                                                                      All 1D parameters
  AdamW [@kingma2017adammethodstochasticoptimization] (Parcae only)   Injection parameters ($\A$, $\dt$, $\B$)
                                                                      Readout projection ($\C$)
  Muon [@jordan2024muon]                                              Attention projections ($W_Q$, $W_K$, $W_V$, $W_O$)
                                                                      MLP weights ($W_{\text{fc}}$, $W_{\text{proj}}$)

  : Optimizer parameter group assignment for Parcae and baseline Transformers.
:::
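
The grouping in the table above can be sketched as a small routing function. This is only an illustration: the substring patterns (`"norm"`, `"inject"`, `"readout"`) and the function name `assign_optimizer` are hypothetical, and sending every remaining matrix to Muon is an assumption that holds only if attention and MLP weights are the only other 2D parameters.

```python
def assign_optimizer(name: str, ndim: int, is_parcae: bool = False) -> str:
    """Return "adamw" or "muon" for one parameter, following the table above.

    The name patterns are hypothetical and must be adapted to the real
    module names of the model being trained.
    """
    adamw_markers = ("wte", "lm_head", "norm", "ve_gate")
    # Hypothetical names for the injection parameters (A, dt, B)
    # and the readout projection (C) used by Parcae.
    parcae_markers = ("inject", "readout")

    if ndim < 2:
        return "adamw"  # all 1D parameters
    if any(marker in name for marker in adamw_markers):
        return "adamw"  # embeddings, LM head, norms, value-embedding gates
    if is_parcae and any(marker in name for marker in parcae_markers):
        return "adamw"  # Parcae-only parameters
    # Assumption: what remains are attention and MLP weight matrices.
    return "muon"
```

For example, `assign_optimizer("blocks.0.attn.w_q.weight", 2)` routes a hypothetical query projection to Muon, while any bias or gain vector goes to AdamW regardless of its name.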

Because we simplify the learning rate setup used in `nanochat` [@nanochat], we perform a thorough hyperparameter sweep for the baseline Transformers to create the strongest possible baseline. Specifically, for small and medium models, we sweep AdamW learning rates over $\{3e-4, 5e-4, 6e-4, 8e-4, 1e-3, 1.5e-3, 2e-3, 3e-3, 4e-3, 8e-3, 1e-2, 1.5e-2, 2e-2 \}$ and Muon learning rates over $\{3e-4, 5e-4, 1e-3, 2e-3, 4e-3, 8e-3, 1e-2, 1.5e-2, 2e-2\}$, using a 1:20 parameter-to-token ratio for the search; we find that $8e-3$ works best for both sizes and both optimizers. For large and xlarge Transformer models, we perform a constrained sweep of the AdamW [@kingma2017adammethodstochasticoptimization] learning rate over $\{2e-3, 3e-3, 4e-3, 6e-3, 8e-3\}$, keeping the Muon learning rate fixed at $8e-3$ and using a 1:7 parameter-to-token ratio; we find that a learning rate of $6e-3$ performs best. We perform *no learning rate sweeps for Parcae*, to keep the comparison as fair as possible. We expect that a better learning rate for Parcae likely exists, which could further improve performance.

Following `nanochat` [@nanochat], we use a fixed learning rate, with no warmup and 50% cooldown. For Muon [@jordan2024muon], we use five iterations of polar express orthogonalization [@amsel2025polarexpressoptimalmatrix], factored variance reduction [@si2025adamuonadaptivemuonoptimizer], and cautious weight decay [@chen2026cautiousweightdecay]. We train with BF16 mixed precision. For our data pipeline, we use a BOS-aligned dataloader with BestFit-Crop packing [@ding2024fewertruncationsimprovelanguage] and train on FineWeb-Edu [@penedo2024finewebdatasetsdecantingweb]. We clip gradients above 1. A table of hyperparameter details can be found in `\cref{tab:transformer-parcae-hyperparameters}`{=latex}.
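
To make the schedule concrete, the sketch below computes a learning rate multiplier with no warmup and a cooldown over the final 50% of training, together with a Muon weight decay that decays linearly from 0.2 to 0. The function names and the linear cooldown to zero are assumptions for illustration; the exact cooldown shape and final value are not specified here.

```python
def lr_multiplier(step: int, total_steps: int, cooldown_frac: float = 0.5) -> float:
    """Multiplier on the base learning rate: constant, then linear cooldown.

    Assumption: the cooldown is linear and ends at zero.
    """
    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return 1.0  # no warmup: full learning rate from the first step
    return (total_steps - step) / cooldown_steps


def muon_weight_decay(step: int, total_steps: int, initial: float = 0.2) -> float:
    """Muon weight decay, decayed linearly from `initial` to 0 over training."""
    return initial * (1.0 - step / total_steps)
```

The per-group learning rate at a given step would then be the base rate for that group (AdamW or Muon) times `lr_multiplier(step, total_steps)`.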

```{=latex}
\small
```
::: {#tab:transformer-parcae-hyperparameters}
                                           Small (140M)               Medium (370M)         Large (770M)        XLarge (1.3B)
  ------------------------------ --------------------------------- -------------------- -------------------- --------------------
  Training Tokens                              11.2B                      29.6B                61.6B                 104B
  Batch Size (sequences)                        256                        256                  256                  256
  Sequence Length                              2,048                      2,048                2,048                2,048
  Precision                                `bf16-mixed`                                                      
  AdamW LR                              $8 \times 10^{-3}$          $8 \times 10^{-3}$   $6 \times 10^{-3}$   $6 \times 10^{-3}$
  AdamW $(\beta_1, \beta_2)$               $(0.8, 0.95)$                                                     
  AdamW Weight Decay                           $0.0$                                                         
  AdamW $\epsilon$                          $10^{-10}$                                                       
  Muon LR                               $8 \times 10^{-3}$                                                   
  Muon Momentum                               $0.95$                                                         
  Muon Weight Decay                  $0.2$ (linear decay to 0)                                               
  Muon Orthogonalization Steps                   5                                                           
  LR Schedule                     Fixed (0% warmup, 50% cooldown)                                            
  Gradient Clipping                            $1.0$                                                         

  : Hyperparameters used for training Parcae and Transformer models in `\cref{sec:e2e}`{=latex}. A single value in a row applies to all model sizes.
:::
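
The token budgets in the table can be related to optimizer steps and to model size with a short calculation. The sketch below uses the nominal parameter counts from the column headers, which is an approximation, and rounds the step count down, which is an assumption.

```python
BATCH_SEQUENCES = 256
SEQUENCE_LENGTH = 2_048
TOKENS_PER_STEP = BATCH_SEQUENCES * SEQUENCE_LENGTH  # tokens consumed per optimizer step

# (nominal parameter count, training tokens) per model size, read from the table.
CONFIGS = {
    "small": (140_000_000, 11_200_000_000),
    "medium": (370_000_000, 29_600_000_000),
    "large": (770_000_000, 61_600_000_000),
    "xlarge": (1_300_000_000, 104_000_000_000),
}


def training_steps(tokens: int) -> int:
    """Number of full optimizer steps for a token budget (rounded down)."""
    return tokens // TOKENS_PER_STEP


def tokens_per_parameter(params: int, tokens: int) -> float:
    """Token-to-parameter ratio under the nominal parameter count."""
    return tokens / params


for size, (params, tokens) in CONFIGS.items():
    print(size, training_steps(tokens), tokens_per_parameter(params, tokens))
```

Under these nominal counts, every configuration works out to 80 training tokens per parameter.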

Lastly, following `nanochat` [@nanochat], we train our own tokenizer, which we use for all models. Details of the tokenizer training and setup can be found in `\cref{sec:tokenizer}`{=latex}.

Tokenizer Training {#sec:tokenizer}
==================

We train a custom BPE tokenizer with a vocabulary size of 32,768 using the HuggingFace `tokenizers` library. We follow a GPT-4 style configuration [@openai2024gpt4technicalreport]: byte-level BPE with byte fallback, no text normalization, and a GPT-4 style pre-tokenization split pattern. The tokenizer is trained on 2 billion characters from the FineWeb-Edu training set [@penedo2024finewebdatasetsdecantingweb], with individual documents capped at 10,000 characters. We define three special tokens: `<|bos|>`, `<|eos|>`, and `<|pad|>`. A brief comparison of our tokenizer with others can be found in `\cref{tab:tokenizer-compression}`{=latex}.
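
The character budget and per-document cap described above can be sketched as a generator over the training documents. The name `capped_documents` is hypothetical, and truncating the final document so that the budget is met exactly is an assumption; the actual pipeline may stop at a document boundary instead.

```python
from typing import Iterable, Iterator


def capped_documents(docs: Iterable[str],
                     max_chars: int = 2_000_000_000,
                     doc_cap: int = 10_000) -> Iterator[str]:
    """Yield documents truncated to `doc_cap` characters until `max_chars` is used.

    Assumption: the last document is cut short to hit the budget exactly.
    """
    remaining = max_chars
    for doc in docs:
        if remaining <= 0:
            break
        piece = doc[:min(doc_cap, remaining)]
        if piece:  # skip empty documents
            yield piece
            remaining -= len(piece)
```

An iterator like this could be handed to a tokenizer trainer that accepts a stream of strings, such as `Tokenizer.train_from_iterator` in the HuggingFace `tokenizers` library.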

```{=latex}
\small
```
::: {#tab:tokenizer-compression}
  **Tokenizer**       **Vocab Size**    **Bytes/Token** $\uparrow$ (Train)    **Bytes/Token** $\uparrow$ (Val)
  ------------------ ---------------- ------------------------------------ ----------------------------------
  GPT-2 (`gpt2`)          50,257                      4.67                                4.63
  GPT-4 (`cl100k`)       100,277                      4.81                                4.76
  Ours                    32,768                      4.72                                4.65

  : Compression ratio (bytes per token) on FineWeb-Edu for the tokenizer used in training, compared with the GPT-2 and GPT-4 tokenizers.
:::
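
For reference, a bytes-per-token figure like those in the table can be computed as total UTF-8 bytes divided by total tokens. The sketch below assumes a corpus-level ratio (rather than an average of per-document ratios) and uses a whitespace splitter, `toy_encode`, as a hypothetical stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Sequence


def bytes_per_token(texts: Iterable[str],
                    encode: Callable[[str], Sequence]) -> float:
    """Corpus-level compression ratio: UTF-8 bytes divided by token count."""
    total_bytes = 0
    total_tokens = 0
    for text in texts:
        total_bytes += len(text.encode("utf-8"))
        total_tokens += len(encode(text))
    if total_tokens == 0:
        raise ValueError("no tokens were produced")
    return total_bytes / total_tokens


def toy_encode(text: str) -> list:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


ratio = bytes_per_token(["hello world", "héllo"], toy_encode)
```

Here the two strings contain 17 UTF-8 bytes and 3 toy tokens, so `ratio` equals 17 / 3. Higher values mean each token covers more raw text.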

[^1]: Both addition [@yang2024loopedtransformersbetterlearning] and concatenation [@geiping_scaling_2025] fall under this framework.

[^2]: With abuse of notation, we let $\dt \A = \dt \odot \A$ (i.e., elementwise multiplication).

[^3]: We make the same change to RDMs in the main body, both because we observe that they perform better with it and to keep the comparison fair.

[^4]: Note that these experiments were run before the distribution mismatch fix discussed in `\cref{sec:sampling-truncated-recurrence}`{=latex}. As the mismatch becomes more drastic as `\meanrecurrence `{=latex}gets closer to `\meanbackward`{=latex}, we expect the model pretrained with $\mu_{\text{rec}}=4$ to perform sub-optimally.

[^5]: We do not directly prove or show this; however, it can be inferred from prior work on how normalization stabilizes the forward and backward passes of transformers [@xu2019understandingimprovinglayernormalization; @xiongLayerNormalizationTransformer2020].

[^6]: We note that Parcae does technically introduce additional parameters over baseline Transformers; however, they are negligible in comparison to total parameter counts.
