---
abstract: |
  Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The depth of effective layers reached by looping determines the quality of the solution these models find. Similar to deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address the signal propagation issue by using pre-norm layers and residual scaling. Building on these architectural modifications, we propose **FPRM**: a Transformer-based **F**ixed-**P**oint **R**easoning **M**odel that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to the difficulty of the task. FPRM proves effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI. The implementation can be found [here](https://github.com/nilskiKonjIzDunava/fprm).
author:
- |
  **Sajad Movahedi**[^1]  ^1^ `\quad`{=latex} **Vera Milovanović**`\footnotemark[1]`{=latex}  ^1,2^ `\quad`{=latex} **Shlomo Libo Feigin**`\footnotemark[1]`{=latex}  ^1,2^ `\quad`{=latex} **Alexander Theus**`\footnotemark[1]`{=latex}  ^1,\ 2^\
  **Thomas Hofmann^2^** `\quad`{=latex} **Valentina Boeva**[^2]  **^2,\ 3,\ 4^** `\quad`{=latex} **T. Konstantin Rusch**`\footnotemark[2]`{=latex}  **^1,5^** `\quad`{=latex} **Antonio Orvieto**`\footnotemark[2]`{=latex}  **^1^**\
  ^1^ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center\
  ^2^ETH Zurich `\quad`{=latex} ^3^Swiss Institute of Bioinformatics `\quad`{=latex} ^4^Université Paris Cité `\quad`{=latex} ^5^Liquid AI\
  `{sajad.movahedi, vera.milovanovic}@tue.ellis.eu`
bibliography:
- bibliography.bib
title: |
  Fixed-Point Reasoners:\
  Stable and Adaptive Deep Looped Transformers
---

# Abstract

Looped architectures provide an inductive bias toward learning step-by-step procedures for tasks that require compositional reasoning. The depth of effective layers reached by looping determines the quality of the solution these models find. Similar to deep architectures, looped architectures are prone to a signal propagation problem induced by depth as the halting decision is postponed. In this paper, we address the signal propagation issue by using pre-norm layers and residual scaling. Building on these architectural modifications, we propose **FPRM**: a Transformer-based **F**ixed-**P**oint **R**easoning **M**odel that uses fixed-point convergence as an end-to-end halting mechanism in a looped architecture. We show that fixed-point halting allows FPRM to adapt its compute to the difficulty of the task. FPRM proves effective on common reasoning benchmarks, namely Sudoku, Maze, state-tracking, and ARC-AGI. The implementation can be found [here](https://github.com/nilskiKonjIzDunava/fprm).

# Introduction

```{=latex}
\begin{wrapfigure}[18]{r}{0.42\textwidth}
    
    
    \includegraphics[width=\linewidth]{figures/arxiv/fig_1/fprm_trm_fig_1_red-2.pdf}
    
    \caption{\textbf{Signal propagation and adaptivity, FPRM vs. TRM:} Sudoku-Extreme performance as a function of compute across difficulty. Despite being non-hierarchical, FPRM scales better, while correctly detecting the accuracy plateaus by using fixed-points for halting.}
    \label{fig:fprm_trm_halting_points}
    
\end{wrapfigure}
```
Reasoning in neural networks has increasingly been framed as a problem of scaling test-time compute: a model should be able to spend more computation on inputs it finds harder [@openai2024o1; @DBLP:journals/corr/abs-2408-03314]. However, doing so requires two ingredients. **(1)** **flexibility:** the possibility of spending a variable amount of compute on the problem. Once the model is flexible, the next step is **(2)** **adaptivity:** a way to scale the compute spent on the problem; i.e., when to halt the computation.

The standard way to achieve both is through a Chain-of-Thought (CoT) mechanism [@DBLP:conf/nips/Wei0SBIXCLZ22]. With CoT, the model scales compute through verbalization, and makes halting decisions based on predicting a specialized halting token. However, this emerging behavior requires a special training regime and hand-crafted reasoning traces [@DBLP:journals/nature/GuoYZSWZXZMBZY025]. This makes the method complex and undermines the desirable property of end-to-end training.

An alternative approach to end-to-end training for reasoning models is emerging in the form of looped architectures [@dehghani2019universaltransformers; @DBLP:conf/nips/BansalSBEHGG22; @saunshi2025reasoninglatentthoughtspower; @hao2025traininglargelanguagemodels]. Whereas in CoT, the compute scales along the sequence dimension, in looped architectures it scales along the depth dimension ($i$): $$\label{eqn:looping}
    \bmz_{i+1}=f_\theta(\bmz_i ; \bmx),$$ where $f_\theta(\bmz;\bmx)$ is a neural network, $\bmx$ is the input, and $\bmz_i$ is the $i^{\text{th}}$ latent representation. Thus, compute can be increased at test-time by looping in depth, naturally introducing flexibility from the architecture. Looped models have been shown to have an inductive bias towards learning algorithms [@yang2024loopedtransformersbetterlearning; @fan2025loopedtransformerslengthgeneralization] and have achieved remarkable success on reasoning benchmarks. For example, Hierarchical Reasoning Model (HRM) [@DBLP:journals/corr/abs-2506-21734] and Tiny Reasoning Model (TRM) [@jolicoeurmartineau2025morerecursivereasoningtiny] incorporate a hierarchical structure into the looping process, proving effective in solving puzzle tasks such as sudoku, maze, and ARC-AGI [@chollet2025arcprize2024technical].

The halting decision (i.e., deciding when to stop iterating), however, is no longer trivial in looped models. Most approaches either fix or randomly sample the number of loops [@DBLP:journals/corr/abs-2602-11451; @DBLP:journals/corr/abs-2510-25741; @geiping2025scalingtesttimecomputelatent; @saunshi2025reasoninglatentthoughtspower; @DBLP:journals/corr/abs-2404-07839; @DBLP:conf/nips/BansalSBEHGG22], which eliminates adaptivity, or use external modules trained to make halting decisions [@DBLP:journals/corr/abs-2506-21734; @jolicoeurmartineau2025morerecursivereasoningtiny; @dehghani2019universaltransformers]. The latter is associated with separate Adaptive Computation Time (ACT) networks [@DBLP:journals/corr/Graves16], which introduce optimization challenges, as they require a continuous relaxation of a discrete objective. As a result, ACT can fail to provide adaptivity, which we demonstrate in the case of HRM and TRM (Figures `\ref{fig:state-tracking-per-k}`{=latex}, `\ref{fig:adaptivity_sudoku}`{=latex} ). We mitigate this issue by introducing a different halting mechanism: let the model loop until its hidden state converges to a fixed-point, and use the convergence itself as the halting signal. Unlike ACT, the fixed-point halting mechanism requires no external module and lets the model spend as much compute as a given input demands.

```{=latex}
\begin{figure*}[t]
    

    % -------- Left: combined figure with two subfigures --------
    
        
        \begin{subfigure}[t]{0.48\linewidth}
            
            \includegraphics[width=\linewidth]{figures/arxiv/ut_norms/ut_norms_c_vs_seqlen.pdf}
            \caption{Trainability vs. context length}
            \label{fig:ut_max_len_vs_iters}
        \end{subfigure}
        \hfill
        \begin{subfigure}[t]{0.48\linewidth}
            
            \includegraphics[width=\linewidth]{figures/arxiv/ut_norms/init_norm_realA5.pdf}
            \caption{Activation norm at init.}
            \label{fig:ut_activation_norm}
        \end{subfigure}

        \caption{
            \textbf{The blessing and the curse of depth in Looped Transformers.}
            Increasing the number of effective layers can unlock expressivity, but also creates a stability challenge: pre-norm models without residual scaling can diverge in activation norm, while post-norm models may struggle to utilize the signal.
        }
        \label{fig:ut_norm}
    
    \hfill
    % -------- Right: separate figure --------
    
        
        \includegraphics[trim=230pt 437pt 874pt 110pt,clip,width=\linewidth]{figures/main/fprm_arch_new.pdf}
        
        \caption{
            \textbf{FPRM architecture.}
            Our fixed-point Looped Transformer uses pre-norm and residual
            scaling for improved signal propagation.
        }
        \label{fig:lofat-architecture}
    

\end{figure*}
```
A further challenge arises as the number of loops in the model increases. This is desirable for harder tasks, but unrolling the recursions yields a deep `\efflayer`{=latex}. As a result, looped architectures suffer from the same **signal propagation** issue as deep non-looped networks. To this end, we adapt architectural techniques originally developed to optimize traditional transformers at deep `\efflayers`{=latex}, carefully modifying them for the Looped Transformer setting. Our key intuition is that *Looped Transformers are, in part, very deep transformers*.

Viewed this way, one design choice is surprising: looped models commonly use post-norm [@dehghani2019universaltransformers; @geiping2025scalingtesttimecomputelatent; @DBLP:journals/corr/abs-2506-21734; @jolicoeurmartineau2025morerecursivereasoningtiny]. Deep non-looped architectures, by contrast, prefer pre-norm, since post-norm causes unstable training [@DBLP:conf/icml/XiongYHZZXZLWL20; @DBLP:conf/nips/NociABOSL22]. Yet in looped models, post-norm serves a specific purpose. It keeps the magnitude of activations bounded as the model iterates, preventing the hidden states from diverging [@labovich2026stabilitygeneralizationloopedtransformers] (see `\Cref{fig:ut_norm}`{=latex}). This raises the question: *can we switch the post-norm to pre-norm, while ensuring the activations stay bounded in a different way?* If so, this could make the training of looped models stable at large depths. In this paper, we use residual scaling parameters to give an affirmative answer to this question. In `\Cref{fig:fprm_trm_halting_points}`{=latex}, we observe that fixed-point halting, together with better signal propagation, allows FPRM, a **non**-hierarchical reasoning model, to outperform TRM at a lower cost on the Sudoku-Extreme benchmark. We provide a description of how the figure was made in `\Cref{sec:fig1-descript}`{=latex}.

We summarize our contributions as follows:

1.  **Successfully training a looped pre-norm Transformer.** We modify a Transformer layer to be trainable over deeper `\efflayers `{=latex}by switching post-norm to pre-norm and adding residual scaling parameters.

2.  **Reaching a fixed-point as a halting mechanism.** To enable stable training, we propose a theoretically motivated modification specific to fixed-point models to limit the oscillation around the fixed-point.

3.  **Proposing the framework FPRM: Fixed-Point Reasoning Model.** We show that it outperforms the baselines on Sudoku-Extreme, Maze-Hard, ARC-AGI-1, and state-tracking benchmarks $A_5$ and $S_5$ among 7M parameter models. Notably, we achieve our results without the hierarchical structure of HRM and TRM. To the best of our knowledge, FPRM is the first Transformer-based reasoning model that exhibits adaptivity of compute to task difficulty (Figures `\ref{fig:fprm_trm_halting_points}`{=latex},`\ref{fig:state-tracking-per-k}`{=latex}).

# Background and Related Works {#sec:background}

#### Looped models.

Looped architectures address the flexibility requirement of reasoning models by decoupling `\efflayer `{=latex}from parameter count, providing variable computation per input without scaling the number of parameters. This makes looping a natural inductive bias for tasks that can be solved by repeatedly applying local or compositional subroutines. Early examples include Neural GPUs [@kaiser2016neuralgpuslearnalgorithms] and Universal Transformers [@dehghani2019universaltransformers]; more recent works show that Looped Transformers can improve length generalization, learn algorithmic structure, and approximate much deeper untied models on reasoning tasks [@yang2024loopedtransformersbetterlearning; @fan2025loopedtransformerslengthgeneralization; @saunshi2025reasoninglatentthoughtspower; @kapl2026growingloopingunifiedview; @geiping2025scalingtesttimecomputelatent; @giannou2023looped; @kohli2026loop]. Recent recursive reasoning models such as HRM [@DBLP:journals/corr/abs-2506-21734] and TRM [@jolicoeurmartineau2025morerecursivereasoningtiny] instantiate this principle in compact architectures. Using small networks, these models have been able to outperform prominent LLM-based reasoning models. At the core of these methods is the idea of the need for a hierarchical looping mechanism, wherein the compute is distributed between a fast-looping component and a slow-looping component. URM [@DBLP:journals/corr/abs-2512-14693] follows up on these works and shows that by using short-conv, performance can be further improved. However, as we describe in this paper, previous works leave two central design axes open to improvement: First, *how should a looped model decide how many iterations to run on each input at test-time?* FPRM addresses this through fixed-point halting. Second, *how to make the model utilize its deep `\efflayer`{=latex}?* FPRM does so by switching post-norm to pre-norm and adding residual scaling. Importantly, FPRM achieves better performance than TRM and HRM without using a hierarchical structure for the loops.

#### Adaptive computation.

Classical ACT methods answer the adaptivity question by learning an explicit halting rule, such as halting probabilities, per-token stopping decisions, or learned distributions over computation depth [@dehghani2019universaltransformers; @banino2021pondernetlearningponder]. More recent works extend this idea by picking an adaptive depth for each token  [@bae2025mixtureofrecursionslearningdynamicrecursive; @song2026adaponderlmgatedponderinglanguage]. Such mechanisms can, in principle, allocate more computation to harder instances, but they all add a separate learned decision process on top of the recurrent computation itself, and this process is hard to optimize because the halting decision is discrete rather than continuous. Moreover, analyses of HRM find that ACT does not always scale the inference compute with the actual difficulty of the input [@ren2026reasoningmodelsreasoningguessing; @ge2025hierarchicalreasoningmodelsperspectives], which can more generally also be the case with CoT [@palod2025performativethinkingbrittlecorrelation], a limitation we also observe and address by instead halting when the iteration reaches a fixed-point.

#### Deep equilibrium and fixed-point models.

An alternative is to use the convergence of the latent dynamics as the halting criterion. In this view, computation stops when consecutive iterates become sufficiently close $\|\bmz_{i+1}-\bmz_i\| \leq \epsilon$, which corresponds to convergence toward a fixed-point $\bmz^\star = f_\theta(\bmz^\star;\bmx)$. This perspective is also supported by recent mechanistic analyses of looped language models, which observe that recurrent trajectories often converge to fixed-points, suggesting that weight-tied looped models can naturally implement an implicit fixed-point computation [@blayney2026mechanistic]. This view is most extensively developed in Deep Equilibrium Models [@DBLP:conf/nips/BaiKK19], which replace a finite stack of layers by the equilibrium point of a weight-tied transformation. However, DEQ models usually find the fixed-point via a quasi-Newton approach such as Broyden's method [@Broyden1965ACO; @DBLP:conf/nips/BaiKK19] or Anderson acceleration [@Anderson1965IterativePF; @geng2023torchdeqlibrarydeepequilibrium] rather than fixed-point iterations, which makes them different from looped models. Moreover, DEQ models are difficult to optimize [@bai2021stabilizing; @DBLP:conf/nips/AnilPLTWBKG22; @jolicoeurmartineau2025morerecursivereasoningtiny], which we address by our proposed architectural changes. Concurrent to our work, Attractor Models [@feinashley2026solveloopattractormodels] frame the iterations of TRM as a root-finding problem similar to DEQ, which they solve with Anderson acceleration. Additionally, they propose to optionally include a larger separate network to "guess" the initial latent. Another concurrent work is Equilibrium Reasoners (EqR) [@huang2026equilibriumreasonerslearningattractors], which also adopts a fixed-point perspective and shows that it is possible to scale the compute not only depth-wise, but also by making multiple initial guesses at training and inference time, favoring attractors with a wide basin. Our contributions are largely orthogonal: we target the signal-propagation issues that limit trainable depth, replacing post-norm with pre-norm and residual scaling, and damping the iteration to suppress oscillation around the fixed-point. Moreover, in contrast to these works we do not use the hierarchical looping of TRM.

#### Score-based methods.

Diffusion Models [@ho2020denoising; @song2020score] and Energy-Based Models (EBMs) [@lecun2006tutorial] provide another way of iterative computation. An EBM learns a scalar energy function, the core idea being that the energy represents the negative log of the unnormalized density, and solutions correspond to low energy states. EBMs generate a prediction either by descending the energy landscape [@belanger2016structured; @belanger2017end] or by sampling with MCMC methods [@du2019implicit]. In comparison, diffusion models learn the score of the underlying distribution, which can be viewed as the gradient of a time-dependent energy function, thus the two are closely connected [@salimans2021should; @du2023reduce]. While diffusion models typically integrate an SDE over a fixed time interval and in this sense are not adaptive, energy-based reasoning models [@du2022learningiterativereasoningenergy; @du2024learningiterativereasoningenergy; @gladstone2025energy] have an adaptive halting mechanism. In these models, the halting criterion is convergence to a local minimum of the energy landscape. Compared to these methods, we model the generative process via fixed-point iterations of a learned operator, as opposed to descending through a learned energy function. The setting of learning the operator directly is important for the theoretical analysis of our architectural modifications in `\Cref{sec:method}`{=latex}.

#### Signal propagation in looped models.

As unrolled looped architectures can be viewed as deep networks, they are exposed to the same signal-propagation difficulties that arise in very deep Transformers. In deep sequence models, increasing depth can make optimization harder and can prevent later layers from being effectively used, a phenomenon often discussed as the curse of depth [@DBLP:conf/icml/DongCL21; @DBLP:conf/nips/NociABOSL22; @DBLP:journals/corr/abs-2502-05795]. Pre-norm is the standard remedy for these issues, but in a looped setting it removes a property the architecture relies on: **bounded activations** (`\Cref{fig:ut_norm}`{=latex}). Bounded activations, important for stable training, are typically enforced through post-norm [@geiping2025scalingtesttimecomputelatent; @DBLP:journals/corr/abs-2506-21734; @jolicoeurmartineau2025morerecursivereasoningtiny]. Indeed, in our own experiments a pre-norm looped model without further modification diverges in activation norm as the iteration count grows, and fails to train at the large loop counts where this growth is most severe (`\Cref{fig:ut_norm}`{=latex}). Therefore, our FPRM combines pre-norm with residual scaling to recover bounded and stable dynamics while still allowing gradients and representations to propagate through many iterations.

# Method {#sec:method}

Looped architectures rely on bounded activations for stability: as the model loops, an unbounded layer can cause the hidden states to grow without limit (`\Cref{fig:ut_norm}`{=latex}). Current methods mostly employ post-norm to satisfy the boundedness condition [@jolicoeurmartineau2025morerecursivereasoningtiny; @DBLP:journals/corr/abs-2506-21734; @geiping2025scalingtesttimecomputelatent]. However, in fixed-depth models post-norm introduces a signal propagation issue, often associated with unstable training and restricted `\efflayer`{=latex}, as reported by @DBLP:conf/icml/DongCL21 [@DBLP:conf/nips/NociABOSL22; @DBLP:journals/corr/abs-2502-05795]. In `\Cref{sec:pre_post_toy}`{=latex}, we provide evidence for trainability issues in a toy looped model with post-norm, despite stability induced by bounded activations.

Consequently, in this section we propose to **(a) switch to pre-norm** to recover trainability at higher depth, while preserving boundedness by **(b) scaling the residual stream and the sub-layer outputs** (the attention and feed-forward maps). Together, these modifications yield a layer that is both stable under looping and trainable at large depth. Additionally, we introduce **(c) a fixed-point halting mechanism** that decides the `\efflayer `{=latex}adaptively per input. Interestingly, we find that with these changes we can **(d) remove the hierarchy** common in recent Looped Transformers [@DBLP:journals/corr/abs-2506-21734; @jolicoeurmartineau2025morerecursivereasoningtiny], resulting in a much simpler model. We provide an overview of our proposed model in `\Cref{fig:lofat-architecture}`{=latex}.

## Improving signal propagation with pre-norm {#sec: sig_prop}

We start by introducing pre-norm and post-norm in a Transformer layer. A Transformer layer consists of two sub-layers (with $f_{\theta^\ell}^\ell(.)$ denoting the $\ell^{th}\in\{1, \dots, 2L\}$ sub-layer): multi-head attention and a feed-forward network. We denote the Looped Transformer model as defined in `\Cref{eqn:looping}`{=latex}, consisting of multiple Transformer layers, by $f_\theta(.)$. In the Transformer layer, the two sub-layers are interleaved with layer normalization ($\mathrm{Norm}$). Two canonical placements of the normalization define two variants of the layer. The original *post-norm* formulation ($\mathrm{Norm}_{\mathrm{post}}$) [@DBLP:conf/nips/VaswaniSPUJGKP17] applies normalization after the residual addition with the residual stream ($\bmz^{\ell-1}$), while the *pre-norm* variant ($\mathrm{Norm}_{\mathrm{pre}}$) [@DBLP:conf/icml/XiongYHZZXZLWL20] applies normalization to the input of each sub-layer without modifying the residual stream: $$\bmz^\ell = \mathrm{Norm}_{\mathrm{post}}\!\left(\bmz^{\ell-1} + f^\ell_{\theta^\ell}\!\left(\mathrm{Norm}_{\mathrm{pre}}\!\left(\bmz^{\ell-1}\right)\right)\right), \qquad \ell = 1, \ldots, 2L,$$ where $L$ is the number of layers (Transformer blocks). In fixed-depth models, both normalization placements have been linked to training issues at large depth: post-norm bounds activation magnitudes but induces a signal propagation problem [@DBLP:conf/icml/Kim0POKYSHSY25; @DBLP:conf/nips/NociABOSL22], while pre-norm improves signal propagation but causes exponential growth in residual magnitude [@DBLP:conf/icml/Kim0POKYSHSY25; @DBLP:conf/icml/XiongYHZZXZLWL20]. Therefore, while using pre-norm in a deep neural network is desirable from the signal propagation standpoint, it can introduce instability due to unbounded activations.

Motivated by these observations, we investigate both normalization placements in Looped Transformers [@saunshi2025reasoninglatentthoughtspower; @dehghani2019universaltransformers]. In `\Cref{fig:ut_max_len_vs_iters}`{=latex}, we test the positive correlation between the `\efflayer `{=latex}and the expressivity (maximum sequence length with $>\!90\%$ test accuracy) of a Looped Transformer for the state-tracking task $A_5$ [@merrill2025illusionstatestatespacemodels]. We observe that increasing the `\efflayer `{=latex}of a Looped Transformer with post-norm does not translate into improved expressivity, while a pre-norm variant diverges at larger depth. This divergence can be attributed to the exponential growth of the activations, apparent in `\Cref{fig:ut_activation_norm}`{=latex}. Therefore, to use pre-norm and improve signal propagation in deep looped models, we must first stabilize it, which we do in the following.

## Recovering boundedness via residual scaling {#sec:bounded}

While pre-norm mitigates trainability problems in looped architectures, it removes the necessary boundedness condition that motivated the use of post-norm in them. This effect can be observed in `\Cref{fig:ut_activation_norm}`{=latex}, where the activations of the pre-norm model grow with deeper `\efflayer`{=latex}, causing trainability issues observable in `\Cref{fig:ut_max_len_vs_iters}`{=latex}. Consequently, we propose to restore boundedness by introducing scaling parameters applied at two different scales: one over each sub-layer of the network, and one across the iterations.

#### Layer-wise residual scaling.

Within a single application of $f_\theta\left(\bmz; \bmx\right)$, the residual stream and sub-layer output $f_{\theta^\ell}^\ell(\bmz^{\ell-1})$ are weighted by tied scalars $(\alpha_1, \beta_1)$ shared across all $L$ layers:

```{=latex}
\begin{lightgraybox}
    \begin{equation}
        \bmz^\ell = \alpha_1\, \bmz^{\ell-1} + \beta_1\, f^\ell_{\theta^\ell}\!\left(\mathrm{Norm}_{\mathrm{pre}}\!\left(\bmz^{\ell-1}\right)\right), \qquad \ell = 1, \ldots, 2L.
        \label{eq:layer_scaling}
    \end{equation}
\end{lightgraybox}
```
#### Iteration-wise input mixing.

Between consecutive applications of $f_\theta\left(\bmz;\bmx\right)$, we re-inject the input $\bmx$ with tied scalars $(\alpha_2, \beta_2)$ shared across all iterations [@DBLP:conf/nips/BaiKK19]:

```{=latex}
\begin{lightgraybox}
    \begin{equation}
        \bmz_{i+1}^0 = \alpha_2\, \bmz_i^{2L} + \beta_2\, \bmx.
        \label{eq:iter_mixing}
    \end{equation}
\end{lightgraybox}
```
The two scaling schemes are not independent. With an appropriate coupling between them, the resulting recurrence is *bounded for any input*, resulting in a stable looping [@DBLP:conf/icml/OrvietoSGFGPD23]. In the following statement, we formalize this claim:

```{=latex}
\begin{lightgrayframe}    
    \begin{theorem}[Boundedness of FPRM iterates]
    \label{thm:bounded-fp-iterates}
    Consider the model defined by Equations~\ref{eq:layer_scaling} and~\ref{eq:iter_mixing}, and assume each layer map satisfies
    $\Vert f^\ell_{\theta^\ell}(\bmu)\Vert \leq c_f$ for all $\ell$ and input $\bmu$. Let $0\leq \alpha_1, \alpha_2 < 1$, and set
    \begin{equation*}
        \beta_2 = 1 - \alpha_2\alpha_1^{2L},
        \qquad
        \beta_1 = \frac{\beta_2\,(1-\alpha_1)}{1-\alpha_1^{2L}}.
    \end{equation*}
    Then the fixed-point iterates $\{\bmz_i^0\}_{i\geq 0}$ from~\Cref{eq:iter_mixing} are bounded, and if $\bmz_i^0 \to \bmz_\infty^0$, then
    \begin{equation*}
        \Vert \bmz_\infty^0 \Vert \leq \Vert \bmx \Vert + \alpha_2\, c_f.
    \end{equation*}
    \end{theorem}
    \noindent The proof is in~\Cref{app:alphas-proof}. 
\end{lightgrayframe}
```
Note that the boundedness condition of the sequence model in  `\Cref{thm:bounded-fp-iterates}`{=latex} is satisfied when using pre-norm, as shown by @DBLP:conf/icml/KimPM21. However, boundedness still does not guarantee that the looping to converge to a fixed-point, which we propose to utilize for adaptivity. In the following, we show that there exists a choice of $\alpha_2$ that satisfies this requirement. Consequently, as empirically shown by @DBLP:conf/nips/BansalSBEHGG22, the model may become locally contractive during training.

::: theorem
**Theorem 2** (Small $\alpha_2$ implies convergence). *`\label{thm:damped-fp-convergence}`{=latex} Let $\lambda_f$ be the Lipschitz constant of the $L$-layer model defined by `\Cref{eq:layer_scaling}`{=latex}, i.e. the map $\bmz^0 \mapsto \bmz^{2L}$. Then the looped step $f_\theta(\,\cdot\,;\bmx)$ of Equations `\ref{eq:layer_scaling}`{=latex}--`\ref{eq:iter_mixing}`{=latex} is Lipschitz in its first argument with constant $\alpha_2\lambda_f$. In particular, if $\alpha_2\lambda_f < 1$, then $f_\theta(\,\cdot\,;\bmx)$ is a contraction and the iteration $\bmz_{i+1} = f_\theta(\bmz_i;\bmx)$ converges to a unique fixed-point $\bmz^\star=f_\theta(\bmz^*;\bmx)$ at a linear rate: $$\left\Vert f_\theta(\bmz_i;\bmx) - \bmz_i \right\Vert
    \leq
    \left(\alpha_2\lambda_f\right)^i
    \left\Vert f_\theta(\bmz_0;\bmx) - \bmz_0 \right\Vert.$$*
:::

`\noindent `{=latex}The proof is in `\Cref{app:damped-fp-convergence-proof}`{=latex}.

```{=latex}
\begin{wrapfigure}[20]{r}{0.5\textwidth}


\begin{algorithm}[H]
\caption{Fixed-point optimizer \textsc{FPOpt}: one damped step with patience-based decay}
\label{alg:fp_optimizer}
\begin{algorithmic}[1]
\Require initial damping $\eta_0$; decay $\gamma \in (0,1)$; patience $P$
\State \textbf{Internal state:} $\eta \gets \eta_0$,\ \ $p \gets P$,\ \ $r^\star \gets \infty$
\Statex
\Procedure{Step}{$\bmz,\, \tilde{\bmz}$}
    \State $r \gets \|\bmz - \tilde{\bmz}\|_\infty / (\|\tilde{\bmz}\|_\infty + \epsilon)$ \Comment{residual}
    \State $\bmz \gets \eta\, \tilde{\bmz} + (1 - \eta)\, \bmz$ \Comment{damped update}
    \If{$r < r^\star$}
        \State $r^\star \gets r$,\ \ $p \gets P$ \Comment{progress: reset}
    \Else
        \State $p \gets p - 1$
        \If{$p \leq 0$ \textbf{and} $r > \tau$}
            \State $\eta \gets \gamma\, \eta$,\ \ $p \gets P$ \Comment{decay $\eta$}
        \EndIf
    \EndIf
    \State \Return $\bmz,\, r$
\EndProcedure
\end{algorithmic}
\end{algorithm}

\end{wrapfigure}
```
While a contractive map is needed for convergence, an excessively contractive one severely limits expressivity [@DBLP:conf/nips/BaiKK19; @DBLP:conf/nips/AnilPLTWBKG22]. We avoid this by making $\alpha_1$ and $\alpha_2$ learnable. In practice, we find that initializing the network to be more contractive by setting $\alpha_2$ to be small yields better performance (`\Cref{tab:alpha_sweep}`{=latex}). In `\Cref{fig:alpha_distribution}`{=latex}, we observe that after training, the distribution of $\alpha$ values widens, with the median very close to the initial point. Intriguingly, these observations are also in line with the common solutions to signal propagation and rank-collapse problems in deep neural networks [@DBLP:conf/nips/NociABOSL22; @DBLP:journals/corr/abs-2502-05795], suggesting a connection between rank-collapse and the signal propagation issue in looped architectures. Together, these modifications yield a pre-norm Looped Transformer that maintains performance over longer looping horizons before saturating, compared to the post-norm variant (see `\Cref{fig:ut_norm}`{=latex}).

## Oscillation around the fixed-point {#sec:damping-alg}

So far, we have been able to establish that, given a small enough $\alpha_2$, the model introduced in `\Cref{eq:iter_mixing}`{=latex} becomes locally contractive and converges to a fixed-point. However, in practice the contraction factor of `\Cref{thm:damped-fp-convergence}`{=latex} is not itself guaranteed. For some inputs, we observe that the model often descends into an oscillatory behavior, causing the iteration to stay in a small region of latent space without converging. This non-convergent behavior is not in tension with `\Cref{thm:damped-fp-convergence}`{=latex}, as the theorem gives a sufficient condition for convergence, not a complete characterization of the iteration's behavior. In fact, oscillation around the fixed-point can happen when the Jacobian satisfies certain conditions.

Linearizing the iteration near a fixed-point $\bmz^\star$ gives $\bmz_{i+1} - \bmz^\star \approx \matJ\,(\bmz_i - \bmz^\star)$, where $\matJ = \partial f_\theta / \partial \bmz \,|_{\bmz^\star}$. Oscillation around $\bmz^\star$ arises when $\matJ$ has an eigenvalue with $\Re(\lambda_i) < 1$ but $|\lambda_i| \geq 1$, in which case the iteration spirals around $\bmz^\star$ rather than contracting toward it. The half-plane condition $\Re(\lambda_i) < 1$ is exactly what licenses a runtime fix that does not require modifying $f_\theta$:

```{=latex}
\begin{lightgrayframe}
    \begin{theorem}[Damping stabilizes oscillatory fixed-point dynamics]
    \label{thm:damping-stability}
    Suppose $f_\theta(\,\cdot\,;\bmx)$ is continuously differentiable in a neighborhood of a fixed-point $\bmz^\star$, and that every eigenvalue $\lambda_i$ of the Jacobian $\matJ$ at $\bmz^\star$ satisfies $\Re(\lambda_i) < 1$. Define the damped iteration map
    \begin{equation*}
        g_{\eta,\theta}(\bmz; \bmx) \;:=\; \eta\, f_\theta(\bmz;\bmx) + (1-\eta)\, \bmz.
    \end{equation*}
    Then there exists $\eta_0 \in (0, 1)$ such that, for every $\eta \in (0, \eta_0)$, the iteration $\bmz_{i+1} = g_{\eta,\theta}(\bmz_i;\bmx)$ converges locally to $\bmz^\star$. Moreover, $g_{\eta,\theta}(\,\cdot\,;\bmx)$ and $f_\theta(\,\cdot\,;\bmx)$ have the same fixed-points.
    \end{theorem}
    \noindent The proof is in~\Cref{app:damping-proof}.
\end{lightgrayframe}
```
`\Cref{thm:damping-stability}`{=latex} shows that a suitable damping factor $\eta$ eliminates the oscillations while preserving the fixed-points of $f_\theta(\bmz;\bmx)$. We measure convergence to a fixed-point at iteration $i$ as $$r_i \;=\; \frac{\Vert \bmz_i - f_\theta(\bmz_i;\bmx) \Vert_\infty}{\Vert f_\theta(\bmz_i;\bmx) \Vert_\infty + \epsilon},$$ which serves as the halting signal: the iteration stops once $r_i$ falls below a tolerance $\tau$. To choose $\eta$ adaptively at inference time, we use a patience mechanism that decreases $\eta$ whenever this residual stops improving. We track the smallest residual observed so far, $r^\star = \min_{j \le i} r_j$. A geometric decay $\eta \gets \gamma\, \eta$ with $\gamma \in (0, 1)$ is applied to the step-size after $P$ consecutive iterations with no improvement in the residuals. The full procedure is given in `\Cref{alg:fp_optimizer}`{=latex}, which is based on the implementation provided by @movahedi2025fixedpointrnnsinterpolatingdiagonal.

## Optimization of fixed-point models {#sec:fp-optimizer}

One advantage of contractive fixed-point models is that they can be trained using truncated back-propagation through time (BPTT) [@DBLP:conf/nips/GengZBWL21]. Let $\matJ = \frac{\partial f_\theta}{\partial \bmz}(\bmz^\star;\bmx)$ and $\matP = \frac{\partial f_\theta}{\partial \theta}(\bmz^\star;\bmx)$ denote the Jacobians of $f_\theta$ at the fixed-point with respect to the state and the parameters, respectively. Following the implicit function theorem we can write the gradient w.r.t. the parameters of the model as [@DBLP:conf/nips/BaiKK19]:

$$\label{eqn:ift}
    \frac{d\bmz^\star}{d\theta} \;=\; (\matI - \matJ)^{-1}\,\matP.$$

The Neumann series $(\matI-\matJ)^{-1} = \sum_{j\geq 0}\matJ^{\,j}$ converges, assuming $f_\theta(\bmz;\bmx)$ is contractive. Therefore, truncating the series at depth $k$ yields the estimate:

$$\label{eqn:neumann_trunc}
    \frac{d\bmz^\star}{d\theta} \;\approx\; {\textstyle\sum_{j=0}^{k-1}} \matJ^{\,j}\, \matP,$$ which is closely related to Jacobian-free backpropagation, where the full implicit linear solve is replaced by cheaper approximate gradients [@fung2022jfb]. The truncation depth $k$ trades off computation against accuracy. The following proposition bounds the resulting error under contractivity.

::: proposition
**Proposition 1** (Exponential decay of truncated-BPTT error). *`\label{prop:trunc-bptt-error}`{=latex} Let $\bmz^\star = f_\theta(\bmz^\star; \bmx)$ and $\matJ = \frac{\partial f_\theta}{\partial \bmz}(\bmz^\star;\bmx) \in \mathbb{R}^{D \times D}$. If $\matJ$ is contractive in spectral norm, $\|\matJ\|_2 = \sigma < 1$, then for every $k \geq 0$, $$\left\Vert (\matI - \matJ)^{-1} - {\textstyle\sum_{j=0}^{k-1}} \matJ^{\,j} \right\Vert_F \;\leq\; \sqrt{D}\,\frac{\sigma^k}{1 - \sigma}.$$*
:::

`\noindent `{=latex}The proof is in `\Cref{app:trunc-bptt-proof}`{=latex}. An essentially equivalent result appears in the proof of Theorem 2 of @DBLP:conf/nips/GengZBWL21.

`\Cref{prop:trunc-bptt-error}`{=latex} allows for a fixed memory footprint during training, essentially decoupling the number of loops from the memory complexity of the model. In the same spirit, HRM [@DBLP:journals/corr/abs-2506-21734] and TRM [@jolicoeurmartineau2025morerecursivereasoningtiny] approximate the gradient with a small number of backward passes at the fixed-point, although TRM argues that the fixed-point condition is unnecessary in practice.

## The fixed-point reasoning model

```{=latex}
\begin{wrapfigure}{r}{0.5\textwidth}


\begin{algorithm}[H]
\caption{FPRM training loop with truncated BPTT and deep supervision}
\label{alg:fprm_training}
\begin{algorithmic}[1]
\Require Model $f_\theta$; prediction head $h_\phi$; fixed-point optimizer \textsc{FPOpt}; model optimizer \textsc{ModelOpt}; input $\bmx$; target $\bmy$; BPTT depth $K$; initial state $\bmz_0$
\State $\bmz \gets \bmz_0$
\While{\textsc{FPOpt}.\textsc{cont}()} \Comment{outer loop}
    \For{$k = 1, \ldots, K$} \Comment{BPTT window}
        \State $\tilde\bmz_k \gets f_\theta(\bmz;\, \bmx)$
        \State $\bmz \gets \textsc{FPOpt}.\textsc{step}(\bmz,\, \tilde\bmz_k)$
    \EndFor
    \State $\hat{\bmy} \gets h_\phi(\bmz)$ \Comment{deep supervision}
    \State $\mathcal{L} \gets \textsc{CrossEntropy}(\hat{\bmy},\, \bmy)$
    \State \textsc{ModelOpt}.\textsc{backward}($\mathcal{L}$)
    \State $\bmz \gets \mathrm{detach}(\bmz)$
\EndWhile
\end{algorithmic}
\end{algorithm}

\end{wrapfigure}
```
So far, we introduced modifications to improve signal propagation in Looped Transformers (`\Cref{sec: sig_prop}`{=latex}) without sacrificing stability (`\Cref{sec:bounded}`{=latex}). These modifications yield fixed-point iterations that can be made non-oscillatory (`\Cref{sec:damping-alg}`{=latex}) and trainable through truncated-BPTT (`\Cref{sec:fp-optimizer}`{=latex}), making adaptivity through fixed-points reliable. We assemble these components into FPRM, summarized in `\Cref{alg:fprm_training}`{=latex} and illustrated in `\Cref{fig:lofat-architecture}`{=latex}. The result is a Looped Transformer that iterates until its hidden state converges, spending compute proportional to each input's difficulty---the adaptivity our halting mechanism was designed to provide. Since we observed no improvements in our experiments by using the hierarchical structure of @DBLP:journals/corr/abs-2506-21734, we opt for a classic looped architecture [@dehghani2019universaltransformers] instead. An overview of the framework is available in `\Cref{alg:fprm_training}`{=latex}. In `\Cref{architecture_details}`{=latex} we provide further details about FPRM.

During training, the model performs looping in windows of $k$ iterations, with $k$ being a hyperparameter, which determines the truncated-BPTT value. During inference, we set $k=1$. At each forward pass, the fixed-point optimizer introduced in `\Cref{alg:fp_optimizer}`{=latex} is called to dampen the fixed-point iteration steps. Then, we get a prediction from the current state $\bmz$ of the model and perform a deep supervision step, following @DBLP:journals/corr/abs-2506-21734 [@jolicoeurmartineau2025morerecursivereasoningtiny]. To truncate the computation graph between deep supervision steps during training, we detach the state $\bmz$ from the graph and stop when the optimizer detects fixed-points. This happens when the residual falls below the tolerance, or the step-size becomes too small.

# Experiments {#sec:exps}

We evaluate FPRM against looped reasoning models on puzzle, adaptivity, and signal-propagation benchmarks. Our implementation builds on the public TRM codebase [@jolicoeurmartineau2025morerecursivereasoningtiny] and adopts its deep-supervision training procedure. Additional experimental details are provided in  `\Cref{experimental details}`{=latex}. We first describe the datasets, then evaluate puzzle-solving performance, adaptivity to task difficulty, and depth-induced signal-propagation effects.

## Dataset description

In the following, we provide a brief description for each dataset used in this paper. We refer the reader to the cited literature for more information.

#### Sudoku-Extreme.

The task consists of exceptionally challenging, partially filled $9\times9$ Sudoku puzzles with unique solutions, introduced by @DBLP:journals/corr/abs-2506-21734. Each sample is flattened into a sequence of length $81$. The train data consists of $1000$ unique samples, each augmented $1000$ times, giving a total of \~$1$M train samples. The test data contains $422{,}786$ samples. The evaluation metric is exact (sequence) accuracy.

#### Maze-Hard.

The task consists of difficult $30\times 30$ shortest-path maze puzzles with unique solutions, introduced by @DBLP:journals/corr/abs-2506-21734. The train data and the test data each consist of $1000$ unique samples. Following @DBLP:journals/corr/abs-2506-21734, we do not use augmentation for this problem. The evaluation metric is exact (sequence) accuracy.

#### ARC-1 and ARC-2.

The Abstraction and Reasoning Corpus-1 (ARC-1) and -2 (ARC-2), introduced by @chollet2025arcprize2024technical, aim to assess the ability of the model to solve novel problems from minimal examples. ARC-1 consists of 2-3 2D grid-based input-output demonstration pairs with variable size (up to $30{\times}30$), through which the model is supposed to learn an underlying transformation rule and apply it to a held-out sample. The train data and the test data each contain \~$400$ samples. ARC-2 has similar characteristics to ARC-1, but with much more challenging problems involving several complex transformations, which are less susceptible to brute-force solutions. The benchmark contains \~$1000$ training samples and \~$360$ test samples. The evaluation metric is exact (sequence) pass@2 (top-2 predictions) accuracy. Importantly, pretrained reasoning LLMs with CoT often struggle with these tasks. For example, DeepSeek-R1 ($671$B model) achieves $15.8\%$ on ARC-1 and $1.3\%$ on ARC-2, Claude 3.7 Sonnet 16K achieves $28.6\%$ on ARC-1 and $0.7\%$ on ARC-2.

#### State tracking.

The $A_5$ and $S_5$ state-tracking tasks, introduced by @merrill2025illusionstatestatespacemodels, are algorithmic benchmarks based on permutation composition. Each sample consists of an initial state and a sequence of $k$ update permutations; the model must apply the updates in order and predict the resulting final state. $A_5$ uses the alternating group on five elements, i.e., the subgroup of even permutations, while $S_5$ uses the full symmetric group on five elements. These tasks are useful proxies for stateful reasoning problems such as entity tracking, code execution, and game-state tracking, since solving them requires learning a composable update rule rather than memorizing computations at fixed lengths. We train on sequences containing up to 32 updates and evaluate out-of-distribution length generalization on sequences containing up to 128 updates. The evaluation metric is exact final-state accuracy.

```{=latex}
\begin{table*}[ht]
    
    \caption{
        Test accuracy on Sudoku-Extreme, Maze-Hard, ARC-AGI-1, and ARC-AGI-2. For each task, the \textbf{best overall} result is bold-face, and the \underline{best result with 7M parameters} is underlined. The results denoted by~$^\dagger$ are reproduced using public \href{https://huggingface.co/arcprize/trm_arc_prize_verification}{checkpoints}. The ARC results for URM, denoted by ~$^\ddagger$, are only reported for Pass@1.
    }
    
    \small
    \setlength{\tabcolsep}{6pt} % default is usually 6pt
    \renewcommand{\arraystretch}{1.05}
    \begin{tabular}{llccccc}
        \toprule
        \textbf{Model} & \textbf{\# params.}  & \textbf{Single Loop}& \textbf{Sudoku-Ext.} & \textbf{Maze-Hard} & \textbf{ARC-1} & \textbf{ARC-2} \\
        & & \textbf{(No Hier.)} & \textbf{Pass@1} & \textbf{Pass@1} & \textbf{Pass@2} & \textbf{Pass@2}
        \\
        \midrule
        \textcolor{textgray}{\textit{Our reproduction attempt}} \\
        \midrule
        \textcolor{textgray}{Attractor Model} & \textcolor{textgray}{27M}& \textcolor{textgray}{\xmark} & \textcolor{textgray}{71.4\%} & \textcolor{textgray}{--} & \textcolor{textgray}{--} & \textcolor{textgray}{--} \\
        \textcolor{textgray}{TRM} & \textcolor{textgray}{7M} & \textcolor{textgray}{\xmark} & \textcolor{textgray}{72.6\%} & \textcolor{textgray}{79.0\%} & \textcolor{textgray}{40.0\%$^\dagger$} & \textcolor{textgray}{6.2\%$^\dagger$}
        \\ \midrule
        \textit{Reported} \\
        \midrule
        HRM  & 27M & \xmark & 55.0\% & 74.5\% & 40.3\% & 5.0\% \\
        Attractor Model & 27M & \xmark & 91.4\% & \textbf{93.1\%} & -- & -- \\
        URM & 14M & \xmark & 77.6\% & -- & \textbf{$\geq$53.8\%}$^\ddagger$ & \textbf{$\geq$16.0\%}$^\ddagger$ \\
        TRM  & 7M & \xmark & 74.7\% & 85.3\% & 44.6\% & \underline{7.8}\% \\
        EqR & 7M & \xmark & 93.0\% & -- & -- & -- \\
        Attractor Model & 7M & \xmark & 54.3\% & 46.7\% & -- & -- \\
        \rowcolor{fprmbluelight} FPRM & 7M & \cmark & \underline{\textbf{94.2\%}} & \underline{87.0\%} & \underline{47.5\%} & 6.2\% \\
        \bottomrule
    \end{tabular}

    % 
    \label{tab:sudoku-maze}
\end{table*}
```
## Puzzle tasks {#sec:puzzle-exps}

We first evaluate FPRM on puzzle problems, namely the Sudoku-Extreme, Maze-Hard [@DBLP:journals/corr/abs-2506-21734], ARC-1, and ARC-2 [@chollet2025arcprize2024technical] benchmarks. These benchmarks were designed to test whether latent recurrent reasoning models can solve symbolic search problems from limited supervision. In the following, we discuss the experimental results. For more information about the baselines, we refer the reader to `\Cref{sec:background}`{=latex}.

`\Cref{tab:sudoku-maze}`{=latex} demonstrates the effectiveness of FPRM as the best performing model with 7M parameters on Sudoku-Extreme, Maze-Hard, and ARC-1, with performance on par with TRM on ARC-2. On Sudoku-Extreme, FPRM improves upon even larger models. It is worth noting that these results are achieved without the breadth-search fixed-point method proposed by @huang2026equilibriumreasonerslearningattractors, which is orthogonal to FPRM's modifications. On Maze-Hard, FPRM underperforms compared to the larger Attractor model with \~$\times\!4$ more parameters. However, we note that we were not able to reproduce the results reported in the paper [@feinashley2026solveloopattractormodels].

On ARC-2, FPRM performs similarly to the publicly available TRM [checkpoints](https://huggingface.co/arcprize/trm_arc_prize_verification), but underperforms compared to the TRM results reported in [@jolicoeurmartineau2025morerecursivereasoningtiny] and the URM baseline, which has roughly twice as many parameters. However, considering the recent trend, we emphasize that the ARC benchmarks appear to be much more sensitive to parameter count compared to the other puzzle tasks considered in this section [@DBLP:journals/corr/abs-2602-02156; @DBLP:journals/corr/abs-2511-14761]. Therefore, we note that the comparison may not be fair in this case.

Given that FPRM performs on par with or better than the hierarchical baselines that use post-norm on the tasks from `\Cref{tab:sudoku-maze}`{=latex}, we hypothesize that the hierarchy might be alleviating signal propagation issues in these models. However, if this hypothesis holds, hierarchy's benefit appears limited: in `\Cref{fig:sudoku_inference_scaling}`{=latex}, both models improve as `\efflayer `{=latex}grows, but the performance gap widens in favor of FPRM, and TRM's performance stays below FPRM at matched `\efflayers`{=latex}.

Moreover, in `\Cref{tab:arch_sweep}`{=latex}, we attempt to investigate the effect of introducing our architectural modifications to the Transformer layers used in TRM on its performance on Sudoku-Extreme. We find the proposed modifications to have a detrimental impact on the performance of the models, which we attribute to, among other things, hyperparameter optimization and a careful redesigning of the looping.

## Adaptivity {#sec:adaptity}

In this section, we demonstrate the adaptivity of FPRM, and compare it to TRM. We use three benchmarks: Sudoku-Extreme from `\Cref{sec:puzzle-exps}`{=latex} and state-tracking benchmarks $A_5$ and $S_5$ introduced in [@merrill2025illusionstatestatespacemodels]. Each benchmark provides an intuitive measure of difficulty. For Sudoku-Extreme, this is the number of empty cells, an established proxy for difficulty [@prates2018problemsolvingedgechaos]. For state-tracking, it is the sequence length.

```{=latex}
\begin{figure*}[ht]
  

  \includegraphics[width=0.92\textwidth]{figures/arxiv/state_tracking/legend_common.pdf}

  
  \setlength{\tabcolsep}{0pt}%
  \begin{subfigure}[t]{0.245\textwidth}
    
    \includegraphics[width=\linewidth]{figures/arxiv/state_tracking/acc_vs_k_a5.pdf}
    \caption{Accuracy, $A_5$}
  \end{subfigure}\hfill
  \begin{subfigure}[t]{0.245\textwidth}
    
    \includegraphics[width=\linewidth]{figures/arxiv/state_tracking/Lsteps_incl_vs_k_a5-1.pdf}
    \caption{Adaptivity, $A_5$}
  \end{subfigure}\hfill
  \begin{subfigure}[t]{0.245\textwidth}
    
    \includegraphics[width=\linewidth]{figures/arxiv/state_tracking/acc_vs_k_s5.pdf}
    \caption{Accuracy, $S_5$}
  \end{subfigure}\hfill
  \begin{subfigure}[t]{0.245\textwidth}
    
    \includegraphics[width=\linewidth]{figures/arxiv/state_tracking/Lsteps_incl_vs_k_s5-1.pdf}
    \caption{Adaptivity, $S_5$}
  \end{subfigure}

  \caption{\textbf{Length generalization and adaptive compute as a function of sequence length.}
  Shaded bands show 95\% confidence intervals over seeds. The vertical dotted line marks
  the training length 32. The matched compute budget is 320 effective layers.}
  \label{fig:state-tracking-per-k}
\end{figure*}
```
#### State tracking.

As shown in Figure `\ref{fig:state-tracking-per-k}`{=latex}, plain TRM fails to extrapolate: at length 128, TRM and TRM with ACT enabled both obtain $45.8\%\pm3.9\%$ on $A_5$ and $39.4\%\pm1.9\%$ on $S_5$. Adding a causal 1D convolution layer substantially improves length generalization, reaching $91.4\%\pm2.3\%$ on $A_5$ and $97.2\%\pm2.5\%$ on $S_5$. This modification is not part of the original TRM, but matches the task structure: group composition is a local left-to-right scan, $s_i=s_{i-1}\cdot g_i$, and causal convolution provides a shared translation-equivariant primitive for this operation.

However, 1D convolution does not make ACT reliable. On $A_5$, TRM+conv+ACT scales compute with sequence length but drops to $65.3\%\pm2.2\%$, well below TRM+conv without ACT. On $S_5$, it reaches $96.2\%\pm3.2\%$, but uses substantially more compute than FPRM; moreover, only a few seeds learn to adapt, while the others exhaust the full compute budget. In contrast, FPRM scales compute smoothly with sequence length while maintaining high accuracy, reaching $98.1\%\pm2.2\%$ on $A_5$ and $98.8\%\pm0.9\%$ on $S_5$ at length 128.

#### Sudoku-Extreme.

We compare the performance and efficiency of the halting mechanisms in FPRM and TRM on Sudoku puzzles of varying difficulty, measured by the number of empty cells (`\Cref{fig:adaptivity_sudoku}`{=latex}). The sample counts per difficulty level are shown in `\Cref{fig:fprm_empty_cells}`{=latex}. We observe that halting mechanisms in both TRM and FPRM adapt the number of loops, i.e., inference compute, to the sample difficulty. In contrast, the default behavior of TRM during inference (deactivated ACT) does not adapt the compute to the task difficulty. Even when we enable ACT with TRM, FPRM proves to be more efficient and accurate.

<figure id="fig:sudoku_inference_scaling">
<img src="figures/arxiv/sudoku_empty_cells/legend_common.png" />
<figure id="fig:accuracy_difficulty_sudoku">
<img src="figures/arxiv/sudoku_empty_cells/sudoku_empty_cells_acc_vs_diff.png" />
<figcaption>Accuracy vs. difficulty</figcaption>
</figure>
<figure id="fig:compute_difficulty_sudoku">
<img src="figures/arxiv/sudoku_empty_cells/sudoku_empty_cells_compute.png" />
<figcaption>Inference compute vs. difficulty</figcaption>
</figure>
<img src="figures/arxiv/sudoku_scaling_budgets/sudoku_scaling_trm_fprm.png" />
<figcaption> <strong>Test-time scaling lowers residuals while improving accuracy.</strong> Both models run until the matched compute budget is exhausted. FPRM achieves higher accuracy across the entire range. Marker size denotes residual.</figcaption>
</figure>

#### Compute--accuracy trade-off in fixed-point halting.

The previous experiments show that FPRM adapts its compute to task difficulty. We now examine how this adaptation can be controlled. In `\Cref{fig:fprm_patience_decay}`{=latex}, we demonstrate the effect of the step-size decay rate ($\gamma$) and maximum patience ($P$) from `\Cref{alg:fp_optimizer}`{=latex} on the halting time and performance at test-time. Max. inference-time budget was set to be the same (70k effective layers) for all experiments. We observe that larger decay rates improve the performance. Moreover, we observe that maximum patience has minimal impact on the performance, with its impact completely disappearing for larger decay rates. On the other hand, the increased performance comes at the cost of reaching much deeper `\efflayers `{=latex}before halting. The largest decay rate that achieves the best performance also exhausts the max. inference budget.

These trends follow from how $\gamma$ and $P$ control the step size $\eta$ from `\Cref{alg:fp_optimizer}`{=latex}. A decay rate closer to 1 reduces $\eta$ only slightly once the residual stops improving. The state therefore keeps evolving, halting is deferred, and the model reaches deeper `\efflayers`{=latex}. The accuracy gain follows directly from these additional iterations. Patience $P$ sets only when a decay occurs, not its magnitude, so its effect is minor; as $\gamma$ approaches $1$ each decay becomes negligible and the two patience curves coincide. The decay rate thus controls the compute-accuracy trade-off directly: a larger $\gamma$ spends more compute and yields higher accuracy, while $P$ offers only secondary control that vanishes as compute saturates.

## Depth-induced signal propagation issues {#sec:signal-prop}

Here, we aim to investigate the role of pre-norm with residual scaling in mitigating the depth-induced signal propagation issues in looped models. For a comprehensive investigation, we approach the problem from three angles: **(1)** *trainability*, where we show that pre-norm with residual scaling keeps activations bounded and enables stable training at large depth (Section `\ref{sec:boundedness}`{=latex}); **(2)** *depth utilization*, where we show that boundedness alone is not sufficient, and that pre-norm with residual scaling additionally improves depth utilization in FPRM [@DBLP:journals/corr/abs-2502-05795] (Section `\ref{sec:depth_util}`{=latex}); and **(3)** *residual scaling*, where we analyze the training dynamics of the scaling parameters (Section `\ref{sec:alphas_analysis}`{=latex}).

### Boundedness of activation norms and trainability {#sec:boundedness}

As noted in `\Cref{sec: sig_prop}`{=latex}, the normalization scheme governs a trade-off between activation stability and signal propagation, which sharpens with depth. Post-norm keeps activations bounded but suffers from signal propagation issues. Pre-norm enables better signal propagation but lets activations grow exponentially. To isolate this trade-off, we adopt the Looped Transformer framework [@saunshi2025reasoninglatentthoughtspower; @dehghani2019universaltransformers; @kaiser2016neuralgpuslearnalgorithms] with a fixed number of `\efflayers`{=latex}. This controls for dynamic halting, which can reduce the `\efflayers`{=latex}, as introduced in `\Cref{sec:method}`{=latex}.

In `\Cref{fig:ut_activation_norm}`{=latex} we show the norm of the final activations of the residual branch of Looped Transformer at initialization. From the figure, we conclude that the magnitude of the activations grows exponentially in the case of pre-norm Looped Transformer. In contrast, when using post-norm or pre-norm with residual scaling, the architecture benefits from bounded activations. In  `\Cref{fig:ut_max_len_vs_iters}`{=latex} we give evidence that boundedness is a prerequisite for reaching deeper `\efflayers`{=latex}. Pre-norm architecture with its unbounded activations diverges and can't reach deep `\efflayers`{=latex}, while bounded architectures ensure trainability. Therefore, a Looped Transformer with pre-norm and without residual scaling cannot achieve deep `\efflayers`{=latex}, highlighting one aspect of the importance of residual scaling in pre-norm.

### Pre-norm with residual scaling enables higher depth utilization {#sec:depth_util}

An important measure for evaluating signal propagation issues in deep neural networks is depth utilization [@DBLP:journals/corr/abs-2502-05795]. Depth utilization measures whether all layers contribute meaningfully to the model. It is commonly probed by removing layers from a trained model and measuring the effect on performance at test-time. In models with signal propagation problems, removing deeper layers (closer to the output) usually does not impact the performance significantly, highlighting that these models fail to utilize the compute due to signal propagation issues. However, in a looped model there is no fixed stack of layers to remove, since the same block is applied to the previous latent representations. Consequently, we cannot perform the same experiment here. We therefore probe the same property from the opposite direction: instead of *removing computation* and measuring degradation, we *add computation* and measure improvement. We do so in two complementary regimes. In the first, using the state-tracking task, we ask whether the model can be trained to use the depth that a harder task requires. In the second, using the Sudoku-Extreme task, we ask whether a trained model can convert depth beyond its training regime into further gains at test time.

```{=latex}
\begin{figure*}[t]
    

    % -------- Left: FPRM loop-utilization (shared legend + two subfigures) --------
    
         % <-- top-align this minipage with the one beside it
        

        % Shared legend (spans both subfigures below)
        \includegraphics[width=\linewidth]{figures/fprm_residual/legend_common-1.pdf}
        

        \begin{subfigure}[t]{0.48\linewidth}
            
            \includegraphics[width=\linewidth]{figures/fprm_residual/fprm_accuracy.pdf}
            \caption{Test accuracy}
            \label{fig:fprm_norm_accuracy}
        \end{subfigure}
        \hfill
        \begin{subfigure}[t]{0.48\linewidth}
            
            \includegraphics[width=\linewidth]{figures/fprm_residual/fprm_residue.pdf}
            \caption{Fixed-point residuals}
            \label{fig:fprm_norm_residual}
        \end{subfigure}

        \caption{
            \textbf{Loop-utilization of FPRM on Sudoku.} \textbf{(a)} test accuracy of FPRM with pre-norm with residual scales vs. post-norm. \textbf{(b)} median residual. The pre-norm model is better at loop utilization, while both have similar residuals. This indicates similar latent-space convergence, with more meaningful updates in the pre-norm variant, resulting in improved performance.
        }
        \label{fig:fprm_norm}
     \hfill
    % -------- Right: stacked accuracy / compute vs. decay rate (separate legend) --------
    
         % <-- top-align with the left minipage (legends line up)
        
        % Separate shared legend on top, then the stacked plot below
        \includegraphics[width=0.6\linewidth]{figures/fprm_residual/accuracy_compute_stacked_legend.pdf}
        
        % 
        \includegraphics[width=\linewidth]{figures/fprm_residual/accuracy_compute_stacked.pdf}

        
        \caption{
            \textbf{Decay rate and patience.} Test accuracy and \efflayer of FPRM with fixed-point halting as a function of decay rate $\gamma$, for maximum-patience $P\in\{5, 10\}$.
        }
        \label{fig:fprm_patience_decay}
    

\end{figure*}
```
#### State-tracking.

In `\Cref{sec:boundedness}`{=latex}, we discussed the necessity of using normalization layers that ensure the boundedness of the activations, which is required for the stable training. However, bounded activations are not a sufficient condition to ensure effective utilization of large depth, once it is reached. We demonstrate this in `\Cref{fig:ut_max_len_vs_iters}`{=latex} using a state-tracking task, where increasing difficulty, corresponding to longer sequences, requires training at greater depths [@movahedi2025fixedpointrnnsinterpolatingdiagonal]. We train a Looped Transformer model with its number of loops (`\efflayer`{=latex}) tied to the train sequence length. We choose the sequence length from the set $\{8, 16, 32, 64, 128, 256, 512\}$. We plot the maximum sequence length solved with $>90\%$ accuracy against `\efflayer`{=latex}. Because we also at test-time match `\efflayer `{=latex}to the training sequence length, a model with no signal-propagation bottleneck should solve exactly the length its depth permits, tracing the identity line $y=x$ (Figure 3 in  @movahedi2025fixedpointrnnsinterpolatingdiagonal). We observe that this behavior is only present in the model equipped with pre-norm and residual scaling. We interpret this observation as strong evidence for improved trainability in Looped Transformers with pre-norm and residual scaling.

```{=latex}
\begin{wrapfigure}[25]{r}{0.3\textwidth}
    
    
    \captionof{table}{Sensitivity of FPRM to residual scaling initialization on Sudoku-Extreme dataset. Each cell reports best test sequence accuracy (\%) for a given pair of initial values.}
    \resizebox{\linewidth}{!}{%
      \begin{tabular}{lccc}
        \toprule
        & $\alpha_2$ init \\
        \cmidrule(lr){2-4}
        $\alpha_1$ init & 0.25 & 0.50 & 0.75 \\
        \midrule
        0.25 & 83.44 & 78.10 & 83.24 \\
        0.50 & 84.49 & 89.05 & 86.29 \\
        0.75 & \textbf{94.23} & 91.41 & 85.70 \\
        \bottomrule
      \end{tabular}}
    
    \label{tab:alpha_sweep}

    
    \includegraphics[width=\linewidth]{figures/fprm_residual/alpha_distribution_3.pdf}
    \captionof{figure}{The distribution of the residual scales in FPRM after training on the Sudoku-Extreme dataset.}
    \label{fig:alpha_distribution}

\end{wrapfigure}
```
#### Sudoku-Extreme.

Compared to the previous experiment on the state-tracking task, here we run each model far beyond its trained depth, trying to detect the point where more compute no longer translates into improvements at test-time. We expect the performance of a model with fewer signal propagation issues to saturate later and with more `\efflayers`{=latex}, indicating that the model is capable of reaching deeper `\efflayer`{=latex}. On the other hand, the performance of a model bottlenecked by signal propagation problems is expected to saturate early, indicating that it cannot convert the extra compute into better predictions. For this experiment, we focus on two variants of the FPRM model, one with pre-norm and residual scaling, and the other with post-norm, both trained on the Sudoku-Extreme task.

In `\Cref{fig:fprm_norm_accuracy}`{=latex}, we demonstrate that scaling test-time compute improves performance for both types of normalization. Furthermore, considering `\Cref{fig:fprm_norm_residual}`{=latex}, we also observe that the majority of the improvement comes before at least half of the samples halt. However, there are clear differences between the `\efflayer `{=latex}of the two normalization methods in `\Cref{fig:fprm_norm_accuracy}`{=latex}, as the pre-norm model's performance saturates at almost twice as much compute, indicating improved signal propagation through depth.

The advantage of pre-norm with residual scaling also extends to the cross-model comparison in `\Cref{fig:sudoku_inference_scaling}`{=latex}, between FPRM (pre-norm with residual scaling) and TRM (post-norm). In this inference-time scaling experiment, all samples are run to a varied maximum looping compute budget. For TRM, compute is scaled through deep supervision steps, which we find optimal relative to scaling L- and H-steps (see `\Cref{sec:trm_scaling_LH_vs_deep_sup}`{=latex}). FPRM outperforms TRM across a range of effective-layer depths reached, with the gap widening at higher compute budgets, consistent with FPRM making better use of its depth.

### The training dynamics of residual scales {#sec:alphas_analysis}

While our original motivation for residual scaling was to prevent unbounded activations in pre-norm, we note a parallel relationship between our solution and common solutions to signal propagation problems [@DBLP:journals/corr/abs-2502-05795; @DBLP:conf/nips/NociABOSL22]. Specifically, it has been known that scaling down the output of the sub-layers when introducing them to the residual stream is beneficial to increasing `\efflayer`{=latex}, which is equivalent to increasing $\alpha_1$ in `\Cref{eq:layer_scaling}`{=latex}. Moreover, in `\Cref{thm:damped-fp-convergence}`{=latex}, we show that the looping will become more stable for smaller $\alpha_2$ in `\Cref{eq:iter_mixing}`{=latex}. In order to investigate the impact of these two parameters, we perform a coarse-grained ablation on the initial value of $\alpha_1,\alpha_2$ on the Sudoku-Extreme task.

`\Cref{tab:alpha_sweep}`{=latex} demonstrates that the best initialization places $\alpha_1$ at a high value and $\alpha_2$ at a low value. Interestingly, the $\alpha_1$ preference matches the common solutions to signal propagation problems: keeping the residual stream dominant. Moreover, the ablation also highlights the importance of having a more contractive mapping at initialization, as our convergence analysis (`\Cref{thm:damped-fp-convergence}`{=latex}) requires a sufficiently small $\alpha_2$ for the loop to reach a fixed-point. On the other hand, comparing the row with the smallest $\alpha_2$ choice ($\alpha_2=0.25$) with the column with the largest $\alpha_1$ ($\alpha_1=0.75$), we observe that increasing $\alpha_1$ has a slightly higher positive impact on the accuracy than a decreasing $\alpha_2$. We hypothesize that this is because it is easier for the model to recover from a bad choice of $\alpha_2$ than $\alpha_1$, as the gradients for $\alpha_1$ come from two different sources (the MHA and MLP sub-layers), and thus can be noisier.

In `\Cref{fig:alpha_distribution}`{=latex}, we provide the distribution statistics of the residual scales after training. Interestingly, we observe that while the median of the $\alpha_1,\alpha_2$ values over channels does not deviate significantly from the initial point, the spread of the distributions widens significantly. In the case of $\alpha_1$, the widening happens at a much larger scale, covering both very small and very large values. On the other hand, the $\alpha_2$ distribution remains more concentrated, with the majority of the values actually becoming smaller than the initial point. This can be interpreted as the model learning to become more contractive during training, which is in line with the observations in [@DBLP:conf/nips/BansalSBEHGG22]. Furthermore, this observation also supports our hypothesis that it might be easier for the model to learn the optimal $\alpha_2$ values than the $\alpha_1$ values.

# Discussion {#sec:discussion}

Our experiments support three broader observations about FPRM and looped reasoning models in general, which we discuss in turn.

#### Looped fixed-point models are adaptive.

FPRM adapts to the difficulty of the problem more effectively compared to TRM (Figures `\ref{fig:state-tracking-per-k}`{=latex}, `\ref{fig:adaptivity_sudoku}`{=latex}, and `\Cref{sec:adaptity}`{=latex}), using fewer `\efflayers `{=latex}(compute), while achieving better performance. This is a consequence of FPRM halting closer to the saturation point of accuracy (`\Cref{fig:fprm_scaling_convergence}`{=latex}). In contrast, TRM with its ACT halting mechanism either halts too early, resulting in lower performance, or too late, using excessive compute.

#### Enabling ACT at inference time.

The original proposed TRM does not use its trained ACT head at inference time, leading to the non-adaptive behavior. However, we find this to be an engineering challenge rather than a fundamental limitation. Therefore, in Figures `\ref{fig:fprm_trm_halting_points}`{=latex}, `\ref{fig:state-tracking-per-k}`{=latex}, `\ref{fig:adaptivity_sudoku}`{=latex}, we record the number of effective layers reached at the moment when the probability of halting exceeds $0.5$. Similarly, in the case of FPRM, we do so when the residual drops below the set threshold ($0.1$ in this case). However, note that the halted samples remain in the batch until the last sample in the batch halts. We leave the efficient implementation for future work.

#### The role of hierarchy in HRM and TRM.

While originally hierarchical reasoning was biologically motivated [@DBLP:journals/corr/abs-2506-21734], later explanations involved likening the lower level of the hierarchy to a scratch pad, the latent representation of which is used by the higher level for prediction [@jolicoeurmartineau2025morerecursivereasoningtiny]. However, the role of hierarchy as the driving force behind the success of hierarchical reasoning models has been brought into question recently, with similar architectures without the hierarchy performing as well as hierarchical models [@ge2025hierarchicalreasoningmodelsperspectives; @arcprize2025hiddendrivers]. In `\Cref{sec:signal-prop}`{=latex}, we observe that a Transformer model with post-norm, which is the building block of TRM and HRM, suffers from a signal propagation issue. On the other hand, in `\Cref{sec:puzzle-exps}`{=latex} we were able to show that by improving signal propagation, FPRM improves upon these models without requiring the hierarchy. In `\Cref{fig:trm_deep_superv_vs_LH}`{=latex}, we observe that reallocating the compute from the H- and L-steps to the additional deep supervision steps improves TRM's performance. Since more H- and L-steps increase the `\efflayer `{=latex}of TRM within each supervision step, the signal propagation issue induced by post-norm is amplified. The improvement is therefore consistent with TRM being limited by the same signal-propagation issue we identify in `\Cref{sec:signal-prop}`{=latex}. In light of these results, we hypothesize that there might be a simpler explanation for the success of hierarchical models: *the hierarchy improves signal propagation*. We identify the theoretical explanation of the role of hierarchy through the lens of optimization and signal propagation as an interesting direction for future work.

#### Scaling behavior of FPRM.

The results of `\Cref{sec:exps}`{=latex} combine into a coherent picture of how FPRM scales its computation. First, with better signal propagation FPRM is able to utilize compute more efficiently (`\Cref{fig:sudoku_inference_scaling}`{=latex}). Second, as more difficult problems require more compute [@merrill2025illusionstatestatespacemodels; @movahedi2025fixedpointrnnsinterpolatingdiagonal], better test-time scaling of FPRM is mostly visible in harder tasks (Figures `\ref{fig:state-tracking-per-k}`{=latex}, `\ref{fig:accuracy_difficulty_sudoku}`{=latex}). Finally, because in FPRM halting is governed by the fixed-point optimizer rather than a learned module, a natural controlling mechanism for the compute-performance trade-off appears in the form of the decay rate $\gamma$ and maximum patience $P$, allowing practitioners to select a desired point on the Pareto front. However, the optimality of the algorithms learned by looped models is not guaranteed, with great variation not only possible but likely. For example, for solving $A_5$, CoT would require a super-logarithmic number of iterations [@merrill2024expressivepowertransformerschain], while an optimal algorithm could solve it in logarithmic time. This suboptimal scaling has also been observed in recurrent-in-depth state-space models [@movahedi2025fixedpointrnnsinterpolatingdiagonal], a behavior that we also observe in `\Cref{fig:state-tracking-per-k}`{=latex}. Therefore, we propose it as an open challenge to find a latent reasoning architecture that achieves a solution with logarithmic complexity to state-tracking while remaining Turing-complete [@dehghani2019universaltransformers].

#### Limitations.

In a similar spirit to previous literature on end-to-end reasoning [@kaiser2016neuralgpuslearnalgorithms; @fan2025loopedtransformerslengthgeneralization; @DBLP:journals/corr/abs-2506-21734; @jolicoeurmartineau2025morerecursivereasoningtiny; @du2022learningiterativereasoningenergy; @du2024learningiterativereasoningenergy], we test our model only on algorithmic tasks and not on natural language. It is an open challenge to demonstrate that the compositional reasoning behavior that latent models exhibit on algorithmic tasks translates to other domains. In addition, even though the base architecture of FPRM could adopt any model (e.g. CNN, MLP, state-space models), we limit our experiments to Transformers.

# Conclusion

We present architectural modifications for looped fixed-point Transformers that enable the use of pre-norm, improving the model's ability to exploit deeper `\efflayer `{=latex}provided by looping. These modifications allow FPRM to outperform hierarchical baselines of similar size, such as HRM and TRM, on common symbolic reasoning benchmarks. We show that on state tracking and Sudoku-Extreme, FPRM is able to adapt its compute to the difficulty of the task. This capability stems from dynamically scaling depth through fixed-point iterations and improving signal propagation. We hope these architectural modifications and the accompanying insights will support further progress on latent reasoning models.

# Acknowledgments and Disclosure of Funding {#acknowledgments-and-disclosure-of-funding .unnumbered}

We thank Felix Sarnthein, Albert Catalan-Tatjer, Jonas Geiping, Philipp Nazari, Carl Richardson, and Nouha Dziri for the helpful discussions and comments. Alexander Theus, Vera Milovanović, and Shlomo Libo Feigin are supported by the Max Planck ETH Center for Learning Systems. Vera Milovanović and Antonio Orvieto are supported by the AI2050 program at Schmidt Sciences. Antonio Orvieto, T. Konstantin Rusch, and Sajad Movahedi acknowledge the financial support of the Hector Foundation.

```{=latex}
\onecolumn
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\newpage
```
```{=latex}
\onecolumn
```
```{=latex}
\appendix
```
# Proofs

## Fixed-point iterations bound {#app:alphas-proof}

::: proof
*Proof.* We simplify our notation by omitting the pre-norm layer in `\Cref{eq:layer_scaling}`{=latex}. We start by unrolling the computation across the $L$ layers within a single fixed-point iteration. By recursively applying the layer update, we obtain $$\begin{aligned}
\bmz_i^{2L}
=
&\ \alpha_1\cdot \bmz_i^{2L-1}
+
\beta_1\cdot f^{2L}_{\theta^{2L}}\left(\bmz_i^{2L-1}\right) \\
=
&\ \alpha_1^2\cdot \bmz_i^{2L-2}
+
\alpha_1\beta_1\cdot f^{2L-1}_{\theta^{2L-1}}\left(\bmz_i^{2L-2}\right)
+
\beta_1\cdot f^{2L}_{\theta^{2L}}\left(\bmz_i^{2L-1}\right) \\
=
&\ \alpha_1^3\cdot \bmz_i^{2L-3}
+
\alpha_1^2\beta_1\cdot f^{2L-2}_{\theta^{2L-2}}\left(\bmz_i^{2L-3}\right)
+
\alpha_1\beta_1\cdot f^{2L-1}_{\theta^{2L-1}}\left(\bmz_i^{2L-2}\right)
+
\beta_1\cdot f^{2L}_{\theta^{2L}}\left(\bmz_i^{2L-1}\right) \\
\vdots \\
=
&\ \alpha_1^{2L}\cdot \bmz_i^0
+
\beta_1\cdot
\left(
\sum_{j=0}^{2L-1}
\alpha_1^j\cdot
f^{2L-j}_{\theta^{2L-j}}
\left(\bmz_i^{2L-j-1}\right)
\right).
\end{aligned}$$ Now, substituting this expression into the fixed-point update gives $$\begin{aligned}
\bmz_{i+1}^0
=
&\ \alpha_2\cdot \bmz_i^{2L}
+
\beta_2\cdot \bmx \\
=
&\ \alpha_2\cdot
\left(
\alpha_1^{2L}\cdot \bmz_i^0
+
\beta_1\cdot
\sum_{j=0}^{2L-1}
\alpha_1^j\cdot
f^{2L-j}_{\theta^{2L-j}}
\left(\bmz_i^{2L-j-1}\right)
\right)
+
\beta_2\cdot \bmx \\
=
&\ \alpha_2\alpha_1^{2L}\cdot \bmz_i^0
+
\alpha_2\beta_1\cdot
\left(
\sum_{j=0}^{2L-1}
\alpha_1^j\cdot
f^{2L-j}_{\theta^{2L-j}}
\left(\bmz_i^{2L-j-1}\right)
\right)
+
\beta_2\cdot \bmx.
\end{aligned}$$ For compactness, define $\rho=\alpha_2\alpha_1^{2L}$ and $\bms_i=
\sum_{j=0}^{2L-1}
\alpha_1^j\cdot
f^{2L-j}_{\theta^{2L-j}}
\left(\bmz_i^{2L-j-1}\right).$ Then the fixed-point iteration can be written as $\bmz_{i+1}^0
=
\rho\cdot \bmz_i^0
+
\alpha_2\beta_1\cdot \bms_i
+
\beta_2\cdot \bmx.$ Unrolling this recursion over fixed-point iterations gives $$\begin{aligned}
\bmz_{i+1}^0
=
&\ \rho^{i+1}\cdot \bmz_0^0
+
\beta_2
\left(
\sum_{k=0}^{i}\rho^k
\right)
\cdot \bmx
+
\alpha_2\beta_1
\left(
\sum_{k=0}^{i}
\rho^k\cdot \bms_{i-k}
\right).
\end{aligned}$$ Since $0\leq \alpha_1,\alpha_2<1$, we have $0\leq \rho<1$. Therefore, the geometric series is convergent. Taking norms and using the boundedness of each layer map, we get $$\begin{aligned}
\left\Vert \bmz_{i+1}^0\right\Vert
\leq
&\ \rho^{i+1}\left\Vert \bmz_0^0\right\Vert
+
\beta_2
\left(
\sum_{k=0}^{i}\rho^k
\right)
\left\Vert \bmx\right\Vert
+
\alpha_2\beta_1
\left(
\sum_{k=0}^{i}\rho^k
\right)
\left(
\sum_{j=0}^{L-1}\alpha_1^j
\right)c_f.
\end{aligned}$$ Letting $i\to\infty$, the first term vanishes and the two geometric sums converge, which gives $$\begin{aligned}
\limsup_{i\to\infty}
\left\Vert \bmz_i^0\right\Vert
\leq
&\ \frac{\beta_2}{1-\rho}
\left\Vert \bmx\right\Vert
+
\frac{\alpha_2\beta_1}{1-\rho}
\left(
\frac{1-\alpha_1^{2L}}{1-\alpha_1}
\right)c_f.
\end{aligned}$$ Substituting back $\rho=\alpha_2\alpha_1^{2L}$, we obtain $$\begin{aligned}
\limsup_{i\to\infty}
\left\Vert \bmz_i^0\right\Vert
\leq
&\ \frac{\beta_2}{1-\alpha_2\alpha_1^{2L}}
\left\Vert \bmx\right\Vert
+
\frac{\alpha_2\beta_1\left(1-\alpha_1^{2L}\right)}
{\left(1-\alpha_2\alpha_1^{2L}\right)\left(1-\alpha_1\right)}
c_f.
\end{aligned}$$ We now set $\beta_2=1-\alpha_2\alpha_1^{2L}.$ This makes the coefficient of $\left\Vert \bmx\right\Vert$ equal to $1$. Furthermore, setting $\beta_1=
\frac{\beta_2\left(1-\alpha_1\right)}
{\left(1-\alpha_1^{2L}\right)}$ makes the coefficient of $c_f$ equal to $\alpha_2$. Therefore, $$\begin{aligned}
\limsup_{i\to\infty}
\left\Vert \bmz_i^0\right\Vert
\leq
\left\Vert \bmx\right\Vert
+
\alpha_2\cdot c_f.
\end{aligned}$$ In particular, if the fixed-point iteration converges to $\bmz_\infty^0$, then $$\begin{aligned}
\left\Vert \bmz_\infty^0\right\Vert
\leq
\left\Vert \bmx\right\Vert
+
\alpha_2\cdot c_f.
\end{aligned}$$ This completes the proof. ◻
:::

## Contractive mapping {#app:damped-fp-convergence-proof}

::: proof
*Proof.* We first show that $f_{\theta}(.;\bmx)$ is a contraction with respect to $\bmz$. For any $\bmz,\bmz'$, using the Lipschitzness of the $L$-layer model, we have $$\begin{aligned}
\left\Vert
f_{\theta}(\bmz;\bmx)
-
f_{\theta}(\bmz';\bmx)
\right\Vert
\leq
&\
\lambda_f
\left\Vert
\left(\alpha_2\cdot \bmz+\beta_2\cdot \bmx\right)
-
\left(\alpha_2\cdot \bmz'+\beta_2\cdot \bmx\right)
\right\Vert \\
=
&\
\alpha_2\lambda_f
\left\Vert
\bmz-\bmz'
\right\Vert.
\end{aligned}$$ Therefore, if $0\leq \alpha_2\lambda_f<1$, the map $f_{\theta}(.;\bmx)$ is strictly contractive. By the Banach fixed-point theorem, it has a unique fixed-point $\bmz^\star$, and the iteration $\bmz_{i+1}=f_{\theta}(\bmz_i;\bmx)$ converges to $\bmz^\star$.

We now prove the residual bound. Since $\bmz_{i+1}=f_{\theta}(\bmz_i;\bmx)$, the residual at iteration $i$ can be written as $$\begin{aligned}
\left\Vert
f_{\theta}(\bmz_i;\bmx)-\bmz_i
\right\Vert
=
\left\Vert
\bmz_{i+1}-\bmz_i
\right\Vert.
\end{aligned}$$ Using the contraction property of $f_{\theta}(.;\bmx)$, we get $$\begin{aligned}
\left\Vert
\bmz_{i+1}-\bmz_i
\right\Vert
=
&\
\left\Vert
f_{\theta}(\bmz_i;\bmx)
-
f_{\theta}(\bmz_{i-1};\bmx)
\right\Vert \\
\leq
&\
\alpha_2\lambda_f
\left\Vert
\bmz_i-\bmz_{i-1}
\right\Vert.
\end{aligned}$$ Applying this inequality recursively gives $$\begin{aligned}
\left\Vert
\bmz_{i+1}-\bmz_i
\right\Vert
\leq
\left(\alpha_2\lambda_f\right)^i
\left\Vert
\bmz_1-\bmz_0
\right\Vert.
\end{aligned}$$ Since $\bmz_1=f_{\theta}(\bmz_0;\bmx)$, we obtain $$\begin{aligned}
\left\Vert
f_{\theta}(\bmz_i;\bmx)-\bmz_i
\right\Vert
\leq
\left(\alpha_2\lambda_f\right)^i
\left\Vert
f_{\theta}(\bmz_0;\bmx)-\bmz_0
\right\Vert.
\end{aligned}$$ This completes the proof. ◻
:::

## Mitigating oscillation through damping {#app:damping-proof}

::: proof
*Proof.* We first show that the fixed-points of $g_{\eta,\theta}(\,\cdot\,;\bmx)$ and $f_\theta(\,\cdot\,;\bmx)$ coincide. Since $$\begin{aligned}
    g_{\eta,\theta}(\bmz;\bmx) - \bmz \;=\; \eta\bigl(f_\theta(\bmz;\bmx) - \bmz\bigr),
\end{aligned}$$ and $\eta > 0$, we have $g_{\eta,\theta}(\bmz;\bmx) = \bmz$ if and only if $f_\theta(\bmz;\bmx) = \bmz$.

We now study the local stability of the damped iteration around $\bmz^\star$. The Jacobian of $g_{\eta,\theta}(\,\cdot\,;\bmx)$ at $\bmz^\star$ is $$\begin{aligned}
    \frac{\partial g_{\eta,\theta}}{\partial \bmz}(\bmz^\star;\bmx) \;=\; (1-\eta)\matI + \eta\matJ,
\end{aligned}$$ so for every eigenvalue $\lambda_i$ of $\matJ$, the corresponding eigenvalue of the damped Jacobian is $$\begin{aligned}
    \mu_i(\eta) \;=\; 1 - \eta + \eta\lambda_i \;=\; 1 + \eta(\lambda_i - 1).
\end{aligned}$$ Local asymptotic stability is implied by $|\mu_i(\eta)| < 1$ for all $i$. Writing $\lambda_i = a_i + \mathrm{i}\,b_i$ with $a_i = \Re(\lambda_i)$, $$\begin{aligned}
    |\mu_i(\eta)|^2 \;=\; 1 + 2\eta(a_i - 1) + \eta^2 |\lambda_i - 1|^2.
\end{aligned}$$ Hence $|\mu_i(\eta)| < 1$ if and only if $\eta |\lambda_i - 1|^2 < 2(1 - a_i)$, i.e., $$\begin{aligned}
    0 \;<\; \eta \;<\; \frac{2(1 - \Re(\lambda_i))}{|\lambda_i - 1|^2}.
\end{aligned}$$ By assumption $\Re(\lambda_i) < 1$ for every $i$, so each upper bound is strictly positive. Setting $$\begin{aligned}
    \eta_0 \;=\; \min\!\left\{\,1,\; \min_i \frac{2(1 - \Re(\lambda_i))}{|\lambda_i - 1|^2}\,\right\} \;>\; 0,
\end{aligned}$$ we obtain $|\mu_i(\eta)| < 1$ for every $i$ and every $\eta \in (0, \eta_0)$. Therefore $\bmz^\star$ is locally asymptotically stable under the damped iteration $\bmz_{t+1} = g_{\eta,\theta}(\bmz_t;\bmx)$, and the iterates converge to $\bmz^\star$ from any sufficiently close initialization. ◻
:::

## Error of truncated-BPTT {#app:trunc-bptt-proof}

::: proof
*Proof.* Since $\|\mathbf{J}\|_2=\sigma<1$, the Neumann series is convergent, and we have $\left(\mathbf{I}-\mathbf{J}\right)^{-1}
=
\sum_{j=0}^{\infty}\mathbf{J}^j.$ Therefore, the error of the $k$-term truncated approximation is $$\begin{aligned}
\left\Vert
\left(\mathbf{I}-\mathbf{J}\right)^{-1}
-
\sum_{j=0}^{k-1}\mathbf{J}^j
\right\Vert_F
&=
\left\Vert
\sum_{j=k}^{\infty}\mathbf{J}^j
\right\Vert_F \\
&\leq
\sum_{j=k}^{\infty}
\left\Vert
\mathbf{J}^j
\right\Vert_F.
\end{aligned}$$ Using the relation $\|\mathbf{A}\|_F\leq \sqrt{D}\|\mathbf{A}\|_2$ for $\mathbf{A}\in\mathbb{R}^{D\times D}$, together with submultiplicativity of the spectral norm, we get $$\begin{aligned}
\left\Vert
\mathbf{J}^j
\right\Vert_F
&\leq
\sqrt{D}
\left\Vert
\mathbf{J}^j
\right\Vert_2 \\
&\leq
\sqrt{D}
\left\Vert
\mathbf{J}
\right\Vert_2^j \\
&=
\sqrt{D}\cdot \sigma^j.
\end{aligned}$$ Substituting this into the previous inequality gives $$\begin{aligned}
\left\Vert
\left(\mathbf{I}-\mathbf{J}\right)^{-1}
-
\sum_{j=0}^{k-1}\mathbf{J}^j
\right\Vert_F
&\leq
\sqrt{D}
\sum_{j=k}^{\infty}
\sigma^j \\
&=
\sqrt{D}\cdot
\frac{\sigma^k}{1-\sigma}.
\end{aligned}$$ Thus, the approximation error decays as $\mathcal{O}(\sigma^k)$.

To make the corresponding gradient statement explicit, let $\mathbf{P}
=
\frac{\partial f_\theta}{\partial \theta}(\mathbf{z}^\star;\mathbf{x})$ and $\boldsymbol{\delta}
=
\frac{\partial \mathcal{L}}{\partial \mathbf{z}^\star}.$ The exact implicit gradient is $\nabla_\theta \mathcal{L}
=
\mathbf{P}^\top
\left(\mathbf{I}-\mathbf{J}\right)^{-\top}
\boldsymbol{\delta},$ whereas the $k$-step truncated BPTT gradient is $\widehat{\nabla}_\theta^{(k)}\mathcal{L}
=
\mathbf{P}^\top
\left(
\sum_{j=0}^{k-1}
\left(\mathbf{J}^\top\right)^j
\right)
\boldsymbol{\delta}.$ Therefore, $$\begin{aligned}
\left\Vert
\nabla_\theta \mathcal{L}
-
\widehat{\nabla}_\theta^{(k)}\mathcal{L}
\right\Vert_2
&\leq
\left\Vert \mathbf{P}\right\Vert_2
\left\Vert
\left(\mathbf{I}-\mathbf{J}\right)^{-\top}
-
\sum_{j=0}^{k-1}
\left(\mathbf{J}^\top\right)^j
\right\Vert_2
\left\Vert \boldsymbol{\delta}\right\Vert_2 \\
&\leq
\left\Vert \mathbf{P}\right\Vert_2
\left(
\sum_{j=k}^{\infty}
\left\Vert \mathbf{J}\right\Vert_2^j
\right)
\left\Vert \boldsymbol{\delta}\right\Vert_2 \\
&=
\left\Vert \mathbf{P}\right\Vert_2
\frac{\sigma^k}{1-\sigma}
\left\Vert \boldsymbol{\delta}\right\Vert_2.
\end{aligned}$$ Hence, the truncated BPTT gradient error also decays exponentially with the number of backward passes $k$. This completes the proof. ◻
:::

# A Toy Failure Mode for Recurrent Post-norm {#sec:pre_post_toy}

In this section, we provide a toy example to showcase a failure mode of post-norm in a small setting. We take random $\bmx,\bmy\in\mathbb{R}^{n\times d}$ to be the input-output pair of sequences, with sequence length $n=100$ and hidden-size $d=2$, and set $\bmz^0_0=\bmx$. Let $f_\theta(\bmz;\bmx)$ be a neural network with a single sub-layer $f_{\theta^1}^1(\bmz)$ defined as: $$f_{\theta^1}^1(\bmz)= \bmz W(\bmw),$$ where we define the rank-one map: $$W(\bmw)=\bmw\mathbf{1}^{\top}\in \mathbb{R}^{2\times 2},
    \qquad
    \bmw=(w_1,w_2)^\top,$$ with $\theta^1=\bmw$.

The **post-norm recurrence** is defined as $$\bmz_{i+1}^0 = \mathrm{Norm}_{\mathrm{post}}\left(\bmz_{i}^0 + f_{\theta^1}\left(\bmz_{i}^0\right) \right),$$ while the **pre-norm recurrence** is defined as

$$\bmz_{i+1}^0 = \left(1 - \beta\right) \cdot \bmz_{i}^0 + \beta \cdot f_{\theta^1}\left(\mathrm{Norm}_{\mathrm{post}}\left(\bmz_{i}^0\right)\right),
    \qquad
    \beta=\frac12 .$$

`\Cref{fig:pre_post_toy}`{=latex} gives a minimal version of the normalization trade-off discussed in the main text based on this toy model. For both models we sweep a $200\times 200$ grid over $\bmw\in \mathcal{G}= [-5,5]^2$ and plot

$$\Delta\mathcal{L}(\bmw)
    =
    \mathcal{L}(\bmw)-\min_{\bmv\in\mathcal{G}}\mathcal{L}(\bmv),$$ where we define the loss as $$\mathcal{L}(\bmw)
    =
    \frac{1}{nd}
    \left\|
        \bmz_{20}^0(\bmw)-\bmy
    \right\|_F^2,$$ i.e., at the $20^{th}$ `\efflayer`{=latex}.

The figure illustrates that **boundedness does not imply trainability**. Post-norm keeps the recurrent state bounded by construction: after every step, each row is projected back onto the unit sphere. However, the same projection also removes radial information at every iteration. In this two-parameter slice, the resulting loss is organized into thin angular sectors with sharp ridges and narrow low-loss regions. Thus a random initialization of $\bmw$ is likely to start in a bounded but poorly conditioned part of the landscape, where the gradient does not point into a useful basin. This is the toy analogue of the optimization difficulty of recurrent post-norm layers.

The right panel is not bare pre-norm; it is pre-norm with residual scaling. This matters because naive pre-norm removes the projection that controls the recurrent state and can lead to activation growth (`\Cref{fig:ut_activation_norm}`{=latex}). With the scaled update, each row satisfies $$\|\bmz_{i + 1}^{0}\|_2
    \le
    (1-\beta)\|\bmz_{i}^{0}\|_2+\beta.$$ So the toy dynamics remain bounded while preserving a live residual stream. The broader low-loss region in `\Cref{fig:pre_post_toy}`{=latex} is therefore consistent with the architecture we use in `\Cref{thm:bounded-fp-iterates}`{=latex}: pre-normalization improves signal propagation, while residual scaling replaces the boundedness mechanism that post-normalization provided.

![Landscape visualization for the setup proposed in `\Cref{sec:pre_post_toy}`{=latex}.](figures/arxiv/appendix/pre-postLN-landscape.png){#fig:pre_post_toy width="0.9\\linewidth"}

# Further Details About the Architecture {#architecture_details}

#### Fixed-point solver.

Let $\bmz_i \in \mathbb{R}^{B \times T \times d}$ denote the $i^{th}$ latent representation (with $B$ denoting the batch index, $T$ the sequence index, and $d$ the hidden size), and $\bmz_{i+1}$ the next latent representation. We index the $b^{th}$ batch dimension as $\bmz_i[b]$. Convergence is measured per sample $\bmz_i[b]$ by the relative $L_\infty$ norm of the residual, $$\bmr_i[b] = \frac{\left\Vert \bmz_{i + 1}[b] - \bmz_i[b] \right\Vert_\infty}{\left\Vert \bmz_{i+1}[b] \right\Vert_\infty + \epsilon} \in \mathbb{R}.$$ A sample is declared converged when $r_i[b] < \tau$. In practice, we set $\tau$ to $0.1$. But we observe that for a reasonably small choice of $\tau$, the model is not sensitive to this value. Two safeguards bound the loop: **(1)** a hard cap on the number of iterations, and **(2)** early termination if the adaptive step size collapses below a minimum.

#### Deep supervision.

We adopt a similar deep supervision mechanism as HRM [@DBLP:journals/corr/abs-2506-21734] and TRM [@jolicoeurmartineau2025morerecursivereasoningtiny]. Let $T_{\text{sup}}$ denote the deep supervision interval. After every $T_{\text{sup}}$ iterations, the intermediate activations of the model are decoded through the output head, the loss is computed, and truncated-BPTT is performed through the $k$ latest iterations. Then, the computation graph is detached from the previous step. For each sequence, this process is continued until the fixed-point of the input is reached. The number of backward passes per forward pass is therefore $\lceil k / T_{\text{sup}}\rceil$. In our model, we set $T_{\text{sup}}=k$, while in TRM and HRM, $T_{\text{sup}}$ is usually set to a larger number. However, we observe that in practice, FPRM performs a smaller number of forward and backward passes during training compared to TRM, lowering the training cost.

#### Depth-wise convolutions.

Depth-wise convolutions have proven effective in improving the performance of looped models, at a small time and parameter complexity [@DBLP:journals/corr/abs-2602-02156; @DBLP:journals/corr/abs-2512-14693]. Therefore, in FPRM we apply depth-wise convolutions on the latent representations at the beginning of each loop, which we find to be most effective. An overview of FPRM is available in `\Cref{fig:lofat-architecture}`{=latex}. We consider both 1D and 2D convolutions, and we find the 2D variant to be more effective at 2-dimensional tasks such as Sudoku and ARC, while the 1D variant is essential in state-tracking. However, as observed in `\Cref{tab:arch_sweep}`{=latex}, depth-wise convolutions seem to have a detrimental impact on the performance of TRM.

# Description of `\Cref{fig:fprm_trm_halting_points}`{=latex} {#sec:fig1-descript}

In this figure, we categorize the puzzles into three groups: easy, medium, and hard. The grouping is based on difficulty, which is measured by the number of empty cells in the puzzle. The sample sizes for each difficulty level are balanced and set at around 1000 samples. For FPRM, we mark the halting decision for the entire group based on the residual of the group: if the mean residual is smaller than a pre-determined threshold (set to 0.1), then the model makes the halting decision. For TRM, the halting decision is marked when the ACT module signals halting for more than half of the samples. For the sake of exposition, we exclude the hardest puzzles from the groups, since they fail to halt at the current max. set budget of 10000 effective layers.

# Fixed-point Residuals and Halting

We provide the test accuracy and the fixed-point residuals achieved by FPRM as a function of `\efflayer `{=latex}in `\Cref{fig:fprm_scaling_convergence}`{=latex}. The residuals for more difficult problems decay at a much slower rate, indicating that they demand more compute. Furthermore, accuracy stops improving at roughly the same `\efflayer `{=latex}where the residual plateaus, supporting the use of fixed-points as a halting criterion.

<figure id="fig:fprm_scaling_convergence">
<img src="figures/arxiv/sudoku_empty_cells/bin_counts_train_test.png" />
<img src="figures/arxiv/appendix/fprm_residues.png" />
<figcaption><strong>FPRM allocates more compute to harder problems.</strong> Harder inputs need more iterations before halting and peak beyond the training compute limit (dashed line); color shows residual norm.</figcaption>
</figure>

# How to effectively spend loops in TRM? {#sec:trm_scaling_LH_vs_deep_sup}

The default proposed TRM uses $16$ deep-supervision steps (outer loops) and variable L- and H-steps (inner loops). The L-steps outnumber the H-steps, typically by about $2\times$. However, other configurations for the number of loops spent for deep supervision vs. inner loops are possible. We test the performance of other configurations with experiments shown in `\Cref{fig:trm_deep_superv_vs_LH}`{=latex}. We fix the L-to-H ratio at $2$ and vary deep-supervision steps (segments) against per-segment recurrence depth (inner loops, shown with numbers next to the black markers in `\Cref{fig:trm_deep_superv_vs_LH}`{=latex}). We measure test accuracy as a function of the number of deep-supervision steps, with fixed inference budget at approximately $1040$ steps. This budget also matches the max. number of effective layers reached by the baseline FPRM on this task. This isolates how a fixed inference budget is best allocated: toward more outer refinement steps or deeper inner recurrence. `\Cref{fig:trm_deep_superv_vs_LH}`{=latex} shows that the budget is best spent on outer, deep-supervision steps. This matches the finding that, in TRM's post-norm Transformer, the gains from added `\efflayer `{=latex}depth get smaller compared to FPRM (`\Cref{fig:sudoku_inference_scaling}`{=latex}). Fewer `\efflayers `{=latex}per segment is therefore the better strategy at a fixed compute budget. We adopt it for all experiments where we scale TRM compute (`\efflayers`{=latex}).

![**The optimal way to spend the fixed looping compute is to maximize deep supervision steps.** The numbers next to markers are inner recurrence depths per each deep supervision step. The total depth of `\efflayers `{=latex}is approximately the same across all configurations of TRM and FPRM on the Sudoku-Extreme task.](figures/arxiv/appendix/trm_deepsup_scaling.png){#fig:trm_deep_superv_vs_LH width="0.5\\linewidth"}

# Additional Experimental Details {#experimental details}

```{=latex}
\resizebox{\linewidth}{!}{%
    \begin{tabular}{lcccc}
    \toprule
                 & w/ conv & w/o conv \\
    \cmidrule(lr){2-3} \cmidrule(lr){4-5}
    Configuration & $\Delta$ (\%) & Acc.\ (\%) & $\Delta$ (\%) & Acc.\ (\%) \\
    \midrule
    Original TRM (post-norm, no residual scaling)                              & ---       & $63.98$ & ---               & $72.60$ \\
    \;$+$ residual scaling ($\alpha_2$ only)                                    & $-6.71$   & $57.27$ & $-4.44$           & $68.16$ \\
    \;$+$ residual scaling ($\alpha_1$, $\alpha_2$)                             & $-4.47$   & $59.51$ & $-12.24$          & $60.36$ \\
    \;$-$ post-norm $+$ pre-norm $+$ residual scaling ($\alpha_2$ only)          & $-49.00$  & $14.98$ & $-58.87$          & $13.73$ \\
    \;$-$ post-norm $+$ pre-norm $+$ residual scaling ($\alpha_1$, $\alpha_2$)   & $-24.52$  & $39.46$ & $\mathbf{-52.83}$ & $19.77$ \\
    \bottomrule
    \end{tabular}%
    }
```
#### Weight initialization.

It seems that initializing the weights using a truncated normal distribution (LeCun initialization) is common practice in looped architectures. In our experiments, it accelerates the convergence but there is very little material difference in sequence accuracy after convergence.

#### Grokking.

There is some evidence for grokking in looped architectures, but on the maze task we observe convergence on the training data. And training the models for a longer period (up to 7 days) did not yield better performance.

#### Hyperparameters, device specification

We provide the values for some of the most important hyperparameters in the paper, per each model and dataset.

```{=latex}
\small
```
::: {#tab:hp-sudoku}
                                                   TRM                          FPRM
  ------------------------------------- -------------------------- ------------------------------
  *Looping structure*                                              
  $H$-cycles                                        3                            --
  $L$-cycles                                        6                            --
  $H$-layers                                        0                            0
  $L$-layers                                        2                            2
  $n_{\text{back}}$                       $=L\text{-cycles} + 1$                 6
  *Halting*                                                        
  mechanism                                        ACT                      fixed-point
  `halt_max_steps`                                  16                           --
  `max_iter` (train)                                --                           12
  `max_iter` (eval)                                 --                         35 000
  stepsize-decay / patience (eval)                  --                       0.997 / 10
  $\texttt{fp\_thresh}$                             --                          0.1
  *Block / signal-prop modifications*                              
  norm type                                     post-norm                     pre-norm
  residual scaling                       `\xmark        `{=latex}        `\cmark  `{=latex}
  $\alpha_1,\alpha_2$ init                          --                   0.75,`\;`{=latex}
  conv branch                                       --              2D-conv ($3\times 3$ kernel)
  *Optimizer*                                                      
  learning rate                                 $10^{-4}$                    $10^{-3}$
  weight decay                                     1.0                       $10^{-3}$
  puzzle-emb LR                                 $10^{-4}$                    $10^{-3}$
  puzzle-emb WD                                    1.0                       $10^{-3}$

  : Hyperparameters for Sudoku-Extreme experiments (Table 1 of the paper). Shared across all models: 1$\times$A100-40GB, batch 768, 60 000 epochs, constant LR after 2 000-step warm-up, EMA enabled (rate 0.999), puzzle-embedding length 16. All models are trained with `AdamW` [@DBLP:conf/iclr/LoshchilovH19].
:::

```{=latex}
\small
```
::: {#tab:hp-maze}
                                                  TRM                         FPRM
  ------------------------------------- ------------------------ ------------------------------
  *Looping structure*                                            
  $H$-cycles                                       3                           --
  $L$-cycles                                       4                           --
  $H$-layers                                       0                           0
  $L$-layers                                       2                           2
  $n_{\text{back}}$                      $=L\text{-cycles} + 1$                6
  *Halting*                                                      
  mechanism                                       ACT                     fixed-point
  `halt_max_steps`                                 16                          --
  `max_iter` (train)                               --                          24
  `max_iter` (eval)                                --                        35 000
  stepsize-decay / patience (eval)                 --                      0.996 / 10
  `fp_thresh`                                      --                         0.1
  *Block / signal-prop modifications*                            
  norm type                                    post-norm                    pre-norm
  residual scaling                        `\xmark    `{=latex}        `\cmark   `{=latex}
  $\alpha_1,\alpha_2$ init                         --                  0.75,`\;`{=latex}
  conv branch                                      --             1D-conv ($1\times 4$ kernel)
  *Optimizer*                                                    
  learning rate                                $10^{-4}$                   $10^{-4}$
  weight decay                                    1.0                         1.0
  puzzle-emb LR                                $10^{-4}$                   $10^{-2}$
  puzzle-emb WD                                   1.0                         1.0

  : Hyperparameters for Maze-Hard experiments (Table 1 of the paper). Shared: trained on `maze-30x30-hard-1k` *without augmentation*, 4$\times$A100-80GB, constant LR after a 2 000-step warm-up, EMA enabled (rate 0.999), puzzle-embedding length 16. FPRM trains for 60 000 epochs (TRM 50 000). FPRM is trained using `Adam-Atan2` [@DBLP:conf/icml/EverettXWANLGSK24]; TRM is trained using `AdamW`.
:::

```{=latex}
\small
```
`\setlength{\tabcolsep}{4pt}`{=latex}

::: {#tab:hp-statetrack}
                                                      TRM                             FPRM               Looped Transformer (Fig. 2)
  ------------------------------------- -------------------------------- ------------------------------ -----------------------------
  *Looping structure*                                                                                   
  $H$-cycles                                           2                               --                            --
  $L$-cycles                                           4                               --                            --
  $H$-layers                                           0                               0                              0
  $L$-layers                                           4                               2                              2
  $n_{\text{back}}$                          $=L\text{-cycles} + 1$                    4                              4
  *Halting*                                                                                             
  mechanism                              fixed iters or ACT (inference)           fixed-point                    fixed iters
  `halt_max_steps`                                     16                              --                            --
  `max_iter`                                           --                             128                    $=k_{\text{train}}$
  iter. distribution                                   --                        deterministic                  deterministic
  *Block / signal-prop modifications*                                                                   
  norm type                                        post-norm                        pre-norm                   sweep$^\dagger$
  norm placement                                       --                             none                          none
  residual scaling                           `\xmark      `{=latex}         `\cmark        `{=latex}           sweep$^\dagger$
  $\alpha_1,\alpha_2$ init                             --                       0.5,`\;`{=latex}               sweep$^\dagger$
  conv branch                                          --                 1D-conv ($1\times 4$ kernel)               --
  *Optimizer*                                                                                           
  learning rate                                    $10^{-4}$                       $10^{-3}$                      $10^{-4}$
  weight decay                                     $10^{-2}$                       $10^{-2}$                      $10^{-2}$

  : Hyperparameters for state-tracking experiments on $A_5$ and $S_5$ (Figure `\ref{fig:state-tracking-per-k}`{=latex} of the paper) and for the Looped Transformer signal-propagation analysis (Figure `\ref{fig:ut_norm}`{=latex}). Shared: 1$\times$A100-80GB, global batch 1024, `Adam-Atan2`, no LR warm-up, EMA disabled, no puzzle embedding (`puzzle_emb_len`=0). Trained at $k_{\text{train}}{=}32$, evaluated for $k\in[2,128]$. TRM and FPRM train for 50 epochs; the Looped Transformer analysis (Fig. 2) for 30.
:::

\
`\footnotesize `{=latex}$^\dagger$ The Looped Transformer row sweeps the {post-norm, pre-norm, pre-norm + residual-scaling} variants from `\Cref{fig:ut_max_len_vs_iters}`{=latex}; the residual-scaling variant uses $\alpha_1{=}0.75,\alpha_2{=}0.5$ and $k_{\text{train}}\in\{8,16,32,64\}$.

```{=latex}
\small
```
::: {#tab:hp-arc}
                                                  FPRM
  ------------------------------------- ------------------------
  *Looping structure*                   
  $H$-cycles                                       --
  $L$-cycles                                       --
  $H$-layers                                       0
  $L$-layers                                       2
  $n_{\text{back}}$                                6
  *Halting*                             
  mechanism                                   fixed-point
  `halt_max_steps`                                 --
  `max_iter` (train)                               8
  `max_iter` (eval)                               1000
  stepsize / decay / patience (eval)         1.0 / 0.9 / 5
  $\texttt{fp\_thresh}$                           0.1
  *Block / signal-prop modifications*   
  norm type                                     pre-norm
  residual scaling                       `\cmark      `{=latex}
  $\alpha_1,\alpha_2$ init                 0.75,`\;`{=latex}
  conv branch                              1D-conv (kernel 4)
  *Optimizer*                           
  learning rate                                $10^{-3}$
  weight decay                                 $10^{-2}$
  puzzle-emb LR                                $10^{-2}$
  puzzle-emb WD                                   1.0

  : Hyperparameters for the FPRM ARC-AGI experiments. The two runs share an identical configuration and differ only in the training corpus (ARC1-Concept vs. ARC2-Concept, both with 1000 augmentations per sample). Shared across both: 4$\times$A100-80GB, batch 768, 100 000 epochs, constant LR with no warm-up, EMA enabled (rate 0.999), puzzle-embedding length 16, hidden size 512, 8 heads, MLP expansion 4, RoPE position encodings. Both models are trained with `Adam-Atan2` ($\beta_1{=}0.9,\beta_2{=}0.95$).
:::

[^1]: Equal contribution.

[^2]: Equal advising.