---
abstract: |
  Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models. Project Page: [[https://loopformer.github.io]{style="color: customred"}](https://loopformer.github.io/)
author:
- |
  Ahmadreza Jeddi$^{1,2,3}$ `\quad`{=latex} Marco Ciccone$^{1,2}$ `\quad`{=latex} Babak Taati$^{1,2,3}$\
  $^{1}$University of Toronto `\quad`{=latex} $^{2}$Vector Institute `\quad`{=latex} $^{3}$University Health Network\
  Corresponding Author: `ajeddi@cs.toronto.edu``\quad`{=latex}
bibliography:
- main.bib
title: 'LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\fix}{\marginpar{FIX}}
```
```{=latex}
\newcommand{\new}{\marginpar{NEW}}
```
```{=latex}
\maketitle
```
`\definecolor{customred}{HTML}{ED028C}`{=latex}

```{=latex}
\hypersetup{
    colorlinks=true,
    urlcolor=customred  % Apply the custom color to URLs
}
```
Introduction
============

Transformers with parameter sharing, often called *looped* or *recurrent* Transformers, have emerged as an efficient and capable alternative to deep non-shared stacks across vision and natural language [@dehghani2018universal; @lan2019albert; @jaegle2021perceiver; @dutta2021redesigning; @geiping2025scaling]. In particular, looped Transformers in the language modeling setting have shown promising performance on a variety of algorithmic and reasoning tasks [@geiping2025scaling; @saunshi2024inductive; @saunshi2025reasoning; @gatmiry2024can]. These models appear to possess an inductive bias toward reasoning, a property sometimes referred to as *latent reasoning*, in which they internalize reasoning skills akin to explicit chain-of-thought prompting in large language models. Moreover, studies indicate that such abilities scale gracefully with effective computational depth and number of loops, yielding improved results on reasoning benchmarks [@xu2024expressive; @saunshi2025reasoning; @geiping2025scaling]. However, existing approaches almost always train and evaluate with a fixed number of unrolls. This raises a fundamental question: do looped Transformers truly exploit their computational depth flexibly, and can they be trained to operate effectively under variable compute budgets?

Despite their promise, current looped models remain tied to a single trajectory length. Once trained, their representations collapse when evaluated at shorter or longer depths, since those settings are off-distribution [@bae2024relaxed; @fan2024looped]. In practice, this means looped models spend the same budget as non-looped iso-FLOP baselines, forfeiting one of their key motivations: flexible compute. We instead consider *budget-conditioned* language modeling: at inference, a user specifies a compute budget $M$, and the model should produce high-quality representations without retraining. While early exiting, routing, and layer dropping have made non-looped Transformers more dynamic [@schuster2022confident; @fan2019reducing; @elhoushi2024layerskip; @shazeer2017outrageously; @raposo2024mixture], analogous mechanisms for looped models remain largely unexplored. Naively transplanting these techniques into *looped* architectures is fragile [@bae2025mixture; @geiping2025scaling]: repeated passes of the shared block often converge to similar, *stagnant* states. We aim for *elastic depth*: a single looped model that performs well across user-chosen budgets without retraining or late-step degeneracy. The core challenge is to train looped models whose internal trajectories remain *stable across depths*, so that shorter routes do not degrade and longer routes continue to refine rather than collapse.

We present **LoopFormer**, a shortcut-modulated looped language model that supports elastic depth, maintaining strong performance across a range of inference budgets. Inspired by diffusion models [@frans2024one; @lu2024simplifying] and neural ODEs [@chen2018neural], we cast iterative representation refinement as a trajectory in representation space: token states evolve from an initial $h_0$ toward a target $h_1$ over a normalized unit-time horizon. Our key insight is to explicitly condition each loop step on the current time $t$ and a step size $\Delta t$ (a "jump"), allowing coarser trajectories to approximate fine-grained ones with fewer steps. Training employs a *shortcut-consistency* objective that aligns the final representations of shorter routes with those of the full route via a stop-gradient target, effectively performing self-distillation within the loop. At inference, this conditioning yields *elastic depth* without retraining: the user selects a budget $M \le L$, where $L$ is the maximum number of loops, together with a step schedule, and performance scales smoothly with compute rather than collapsing at shorter depths.

Rather than pursuing dynamic compute through routing or token halting [@dehghani2018universal; @bae2025mixture], we introduce the notion of *loop trajectories*, which allow the model to internalize reasoning and to refine its predictions as compute increases. Empirically, LoopFormer not only preserves the language modeling abilities reported in prior work, but also outperforms baseline looped models on language tasks and closes much of the gap with iso-FLOP non-looped models. This creates an opportunity to study loop trajectories as a new lens on reasoning in looped models. Moreover, unlike early-exit mechanisms that often collapse to degenerate states, LoopFormer maintains stable refinement: both perplexity and reasoning performance improve smoothly with the compute budget.

The key contributions of this paper are summarized as follows:

1.  We formulate a class of shortcut-modulated looped Transformers for language modeling that conditions each pass on internal time and step sizes.

2.  We introduce a shortcut-consistency training protocol over families of step schedules that enables compute-budgeted inference (elastic depth) without retraining.

3.  We analyze representation dynamics across loop iterations using multiple geometric and information-theoretic metrics (anisotropy, curvature, entropy, CKA), finding that adaptive early-exit looped models exhibit representational collapse, while our shortcut-modulated models maintain evolving, non-degenerate states.

4.  We demonstrate consistent performance-per-compute gains in both perplexity and zero-shot reasoning across diverse language benchmarks; ablations isolate the roles of time/step conditioning and offer practical guidance for choosing trajectories under fixed budgets.

Related Work {#sec:literature}
============

#### Recursive / Looped Transformers.

Parameter sharing provides an orthogonal route to efficiency and effective depth. The Universal Transformer demonstrated that repeatedly applying a single block can match the representational capacity of deep non-shared stacks, while also introducing adaptive computation time [@dehghani2018universal]. ALBERT further showed that extensive cross-layer weight-tying yields substantial parameter savings during pretraining, without compromising downstream performance [@lan2019albert]. Deep Equilibrium Models (DEQ) extend this paradigm by defining an implicitly infinitely deep, weight-tied transformation solved via fixed-point iteration and implicit differentiation [@bai2019deep], with related equilibrium and recurrent architectures also explored in vision and multimodal contexts [@jaegle2021perceiver; @yang2022recurring]. More recent work has studied looping as programmable computation [@giannou2023looped], iterative data-fitting or optimization solvers [@yang2023looped], mechanisms for algorithmic length generalization [@fan2024looped], and even the Turing-completeness of looped decoders for certain graph algorithms [@de2024simulation]. Most relevant to our work, *Time-Modulated Looped Transformers* (TMLT) [@xu2024expressive] analyze the expressive power of looped Transformers in language modeling, showing that conditioning on timestep (loop index) improves scaling and perplexity. Our approach builds on this insight but extends it by conditioning on both normalized time and step size, and by training across families of trajectories rather than a single fixed schedule.

#### Dynamic compute.

Dynamic compute allocation reduces inference cost by skipping or reallocating computation where it is not needed [@bengio2015conditional; @huang2016deep; @panda2016conditional]. Early exiting halts processing for "easy" inputs at intermediate layers, allowing deeper computation only for harder cases [@elbayad2019depth; @schuster2022confident; @elhoushi2024layerskip]. Other approaches include layer dropping and pruning strategies for BERT-style models [@fan2019reducing; @hou2020dynabert]. Mixture-of-Experts (MoE) increases capacity through sparse expert routing [@shazeer2017outrageously; @fedus2022switch], while *Mixture-of-Depths* (MoD) [@raposo2024mixture] reframes adaptivity as token-wise routing across layers, enabling fine-grained dynamic allocation of depth. Despite these advances, the adaptation of dynamic compute techniques to exploit the looping capability of recursive models has received little attention. A concurrent line of work, *Mixture of Recursions* [@bae2025mixture], extends routing to recursive stacks by varying the number of shared-block applications per token. In contrast, our approach does not route or halt computation at the token level; instead, we train looped models to be *budget-conditioned*, ensuring that the same parameters operate robustly across user-specified loop budgets.

#### Latent reasoning.

An increasing body of work investigates reasoning carried out within hidden states rather than through explicit chain-of-thought prompting [@goyal2023think; @cheng2024compressed; @pfau2024let; @kaissis2026stepresolveddataattributionlooped; @chen2026loop; @zhu2025scaling; @jolicoeur2025less; @wang2025hierarchical; @zhang2025recursive; @knupp2026depth; @liang2026latent]. Several approaches employ looped models to simulate multi-step reasoning or to approximate chain-of-thought dynamics directly [@hao2024training; @saunshi2025reasoning; @wang2025hierarchical]. In parallel, theoretical and empirical studies connect network depth and algorithmic generalization to reasoning ability [@merrill2023expressive; @chen2024can; @ye2024physics; @xu2025cot]. More recently, analyses suggest that looped Transformers possess an inductive bias for reasoning that strengthens with increasing effective computational depth [@saunshi2024inductive; @saunshi2025reasoning]. Collectively, this line of work frames such abilities as *latent reasoning*, wherein models perform iterative computations within their hidden-state space without explicit verbalization. Our work leverages this inductive bias of looped models but emphasizes *trajectory robustness*: ensuring that hidden-state computation remains informative even under shorter budgets, while continuing to refine effectively at greater depths.

#### Time / shortcut modulation and conditioning.

Conditioning networks on continuous *time* or related control variables has proven highly effective in generative diffusion modeling [@ho2020denoising; @rombach2022high; @lipman2022flow]. Diffusion Transformers (DiT) [@peebles2023scalable] incorporate timestep-conditioned adaptive normalization, while consistency-based training aligns solutions across discretizations [@song2023consistency]. Recent shortcut and one-step diffusion approaches further distill long trajectories into a few steps by enforcing consistency between coarse and fine solvers [@frans2024one; @lu2024simplifying]. These ideas are increasingly migrating into language modeling: for example, *Time-Modulated Looped Transformers* (TMLT) [@xu2024expressive] adapt DiT-style timestep conditioning to recursive language models, demonstrating improved scaling and perplexity. LoopFormer extends this trend by conditioning each loop not only on normalized time $t$ but also on the step size $\Delta t$, and by training across families of sampled trajectories with a shortcut-consistency objective. This design enables robust performance under arbitrary user-specified compute budgets, without relying on per-token halting or routing mechanisms.

Shortcut-Modulated Looped Models for Latent Reasoning {#sec:method}
=====================================================

Notation and Problem Statement
------------------------------

We denote by $X = (x_1, \dots, x_T)$ a sequence of $T$ tokens drawn from a vocabulary $\mathcal{V}$. The token embeddings are obtained via $E_{\text{tok}}(X) \in \mathbb{R}^{T \times d}$. Without loss of generality, positional information can either be added once at the input as learned positional embeddings or applied through alternatives such as RoPE; in this work, we adopt the former for simplicity. The initial hidden states are therefore $$h^{(0)} \;=\; E_{\text{tok}}(X)\;+\;E_{\text{pos}}[1{:}T] \;\in\; \mathbb{R}^{T\times d}.$$
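
As a minimal illustrative sketch of this initialization (not the authors' implementation), the snippet below forms $h^{(0)}$ by adding a positional row to each token-embedding row. The function name `init_hidden_states`, the table sizes, and the random initialization are assumptions made only for illustration.

```python
import random

def init_hidden_states(token_ids, tok_table, pos_table):
    """Return h0 with h0[i] = tok_table[token_ids[i]] + pos_table[i] (a T x d nested list)."""
    if len(token_ids) > len(pos_table):
        raise ValueError("sequence is longer than the positional table")
    return [
        [e + p for e, p in zip(tok_table[tok], pos_table[i])]
        for i, tok in enumerate(token_ids)
    ]

# Hypothetical toy sizes: vocabulary of 10 tokens, up to 8 positions, width d = 4.
rng = random.Random(0)
vocab_size, max_len, d = 10, 8, 4
tok_table = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(vocab_size)]
pos_table = [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(max_len)]

token_ids = [3, 1, 4, 1, 5]          # X = (x_1, ..., x_T) with T = 5
h0 = init_hidden_states(token_ids, tok_table, pos_table)
```

Here the token `1` appears at two positions and receives two different initial states, because the positional term differs between them.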

A range of looping mechanisms for looped Transformers has been studied in prior work [@takase2021lessons; @saunshi2025reasoning; @bae2025mixture; @bae2024relaxed], including cycle, middle-cycle, and relaxed-cycle variants. Since our focus is on trajectories and their representation dynamics, we adopt the simplest *cycle* design, where a stack of $k$ Transformer blocks, denoted by $\Phi_k(\cdot)$, is recursively applied on the hidden state. Following [@saunshi2025reasoning], we write $(k \otimes L)$ for a looped model with $k$ blocks repeated $L$ times (compute cost proportional to $kL$ block applications), and $(kL \otimes 1)$ for a non-looped Transformer of comparable depth.
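
The $(k \otimes L)$ notation can be made concrete with a small sketch, under the assumption that a "block" is replaced by a toy elementwise update rather than a real Transformer block. The helper names `make_block` and `looped_forward` are hypothetical. The sketch only illustrates that a looped model and a non-looped model of comparable depth perform the same number of block applications while holding different numbers of distinct blocks.

```python
def make_block(scale):
    """Hypothetical stand-in for one Transformer block: a toy residual update."""
    def block(h):
        return [x + scale * x for x in h]
    return block

def looped_forward(h, blocks, num_loops):
    """Apply the stack `blocks` (Phi_k) `num_loops` times.

    Returns the final state and the number of block applications performed.
    """
    applications = 0
    for _ in range(num_loops):
        for block in blocks:
            h = block(h)
            applications += 1
    return h, applications

k, L = 3, 4

# (k ⊗ L): k shared blocks, reused in each of the L loops.
shared = [make_block(0.1) for _ in range(k)]
h_looped, cost_looped = looped_forward([1.0, 2.0], shared, L)

# (kL ⊗ 1): kL distinct blocks, applied once.
unshared = [make_block(0.1) for _ in range(k * L)]
h_deep, cost_deep = looped_forward([1.0, 2.0], unshared, 1)
```

Both configurations perform $kL = 12$ block applications, but the looped one stores only $k = 3$ blocks. The outputs coincide here only because every toy block is identical.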

LoopFormer applies $\Phi_k(\cdot)$ for $M$ iterations ($1 \le M \le L$), conditioning each loop $i$ on the pair $(t_{i-1}, \Delta_i)$, where $0 = t_0 < t_1 < \cdots < t_M = 1$ are cumulative normalized timesteps, and $\Delta_i = t_i - t_{i-1}$ is the step size of the $i$-th iteration. We refer to $\boldsymbol{\Delta_M} = (\Delta_1, \dots, \Delta_M)$ as a *trajectory*, constrained by $\sum_{i=1}^M \Delta_i = 1$.

The *maximum trajectory* corresponds to $L$ uniform steps with $\Delta_i = \frac{1}{L}$ for all $i$. At inference, a user specifies a compute budget $M$ and provides a step schedule $\boldsymbol{\Delta_M}$ such that $\sum_{i=1}^M \Delta_i = 1$.
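
The notation above can be made concrete with a small sketch that turns a step schedule $\boldsymbol{\Delta_M}$ into the per-loop conditioning pairs $(t_{i-1}, \Delta_i)$. The helper names `uniform_schedule` and `conditioning_pairs` are illustrative assumptions rather than part of any released code, and exact fractions are used only to avoid floating-point drift in the sum-to-one check.

```python
from fractions import Fraction

def uniform_schedule(num_loops):
    """Equal steps of size 1/num_loops (the maximum trajectory when num_loops = L)."""
    return [Fraction(1, num_loops)] * num_loops

def conditioning_pairs(deltas):
    """Map step sizes (Delta_1..Delta_M) to per-loop pairs (t_{i-1}, Delta_i)."""
    if any(d <= 0 for d in deltas) or sum(deltas) != 1:
        raise ValueError("step sizes must be positive and sum to 1")
    pairs, t = [], Fraction(0)
    for d in deltas:
        pairs.append((t, d))   # loop i sees the time reached so far and its own step
        t += d                 # cumulative normalized time t_i
    return pairs

# Maximum trajectory for L = 8, and a coarse-to-fine schedule with budget M = 3.
pairs_full = conditioning_pairs(uniform_schedule(8))
pairs_short = conditioning_pairs([Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)])
```

Both trajectories start at $t_0 = 0$ and end at $t_M = 1$; only the number and size of the steps differ.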

![**(a)** LoopFormer conditions each loop on normalized time $t$ and step size $\Delta t$, modulating RMSNorm scales and gating the MHSA/FFN residuals. **(b)** During inference, shorter-budget trajectories ($M<L$) approximate the full $L$-step route; more budget yields progressively refined next-token distributions while preserving utility at low budget.](assets/architecture.png "fig:"){#fig:arch_and_traj width="\\linewidth"} `\subcaption{LoopFormer architecture}`{=latex} `\label{fig:architecture}`{=latex}

```{=latex}
\hfill
```
![**(b)** During inference, shorter-budget trajectories ($M<L$) approximate the full $L$-step route; more budget yields progressively refined next-token distributions while preserving utility at low budget.](assets/trajectories.png "fig:"){#fig:arch_and_traj width="\\linewidth"} `\subcaption{Budget-conditioned trajectories}`{=latex} `\label{fig:trajectories}`{=latex}

Architecture
------------

We introduce **LoopFormer**, a looped language modeling architecture designed to remain faithful to the standard Transformer while incorporating trajectory-based conditioning. `\autoref{fig:architecture}`{=latex} provides an overview. In this section we describe the forward pass and the core components of the model.

LoopFormer is a looped decoder-only Transformer in which a single shared stack is applied iteratively; the key novelty is *how each loop is conditioned*. At iteration $i$, the model conditions on the pair $(t_{i-1},\Delta_i)$, where $t_{i-1}\in[0,1]$ is the cumulative normalized time and $\Delta_i\in(0,1]$ is the step size. Both scalars are encoded with sine--cosine frequency embeddings and projected by small MLPs to obtain $e_t$ and $e_\Delta$, which are summed to form $e_i=e_t+e_\Delta$. This signal modulates the *LoopFormer Block*: an MLP maps $e_i$ to scaling $(\gamma_1,\gamma_2)$ for the two RMSNorm layers and to gating $(\alpha_1,\alpha_2)$ applied immediately before the residual connections of MHSA and FFN. As a result, each loop is explicitly aware of its location and granularity in the trajectory, enabling consistent behavior across coarse- and fine-grained schedules.

#### Relation to prior work.

For the design of our architecture, we follow the overall approach of looped models such as ALBERT and recent looped decoders [@lan2019albert; @saunshi2025reasoning], and focus our novelty on *trajectory conditioning*. Our modulation mechanism is inspired by Diffusion Transformers (DiT) [@peebles2023scalable] and shortcut/one-step diffusion [@frans2024one], which condition blocks on a timestep via adaptive normalization (adaLN), regressing scale/shift and gates from a time embedding. In the looped language model setting, TMLT [@xu2024expressive] adopts this DiT-style idea by using the loop index as time (with RMSNorm). LoopFormer extends this conditioning by using both normalized time $t$ and step size $\Delta t$, and by training over families of trajectories with a shortcut-consistency objective, yielding trajectory-consistent representations suitable for budget-conditioned inference.

Training & Inference with LoopFormer
------------------------------------

A key goal of LoopFormer is to train models that perform well even with fewer than $L$ loops, thereby enabling *elastic depth* at inference time. Unlike prior adaptive-compute approaches that rely on per-token halting or early exits, our framework uses a user-defined compute budget, similar in spirit to diffusion models. Empirically, we find that naive early exiting in looped architectures leads to *stagnant* representations in later iterations, under-utilizing depth. To address this, LoopFormer leverages shortcut conditioning together with a consistency loss that encourages trajectories of different lengths to converge toward the full trajectory of length $L$. The overall objective is: $$\mathcal{L}
  \;=\;
  \mathcal{L}_{\text{L}}
  \;+\;
  \lambda_1\,\mathcal{L}_{\text{S}}
  \;+\;
  \lambda_2\,\mathcal{L}_{\text{cons}},$$ where $\mathcal{L}_{\text{L}}$ and $\mathcal{L}_{\text{S}}$ denote next-token prediction losses for the longest and sampled shortcut trajectories, respectively, and $\mathcal{L}_{\text{cons}}$ is a stop-gradient consistency loss aligning per-token logits of shorter trajectories to those of the longest trajectory. We set $\lambda_{1}=\lambda_{2}=0.1$ in all experiments. Algorithm `\ref{alg:training}`{=latex} formalizes the training procedure.
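
As a rough numeric illustration of how the three terms combine, the sketch below evaluates the objective for given per-token logits. It rests on several assumptions not fixed by the text: the consistency term is taken as a squared error on logits, losses are averaged over tokens, and the names `cross_entropy` and `loopformer_loss` are hypothetical. Plain Python has no autograd, so the stop-gradient is indicated only in a comment.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), for one token."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def loopformer_loss(logits_long, logits_short, targets, lam1=0.1, lam2=0.1):
    """L_L + lam1 * L_S + lam2 * L_cons, each averaged over tokens (assumed)."""
    n = len(targets)
    loss_long = sum(cross_entropy(l, y) for l, y in zip(logits_long, targets)) / n
    loss_short = sum(cross_entropy(s, y) for s, y in zip(logits_short, targets)) / n
    # Consistency: the long-trajectory logits act as a fixed target. In an
    # autograd framework they would be detached (stop-gradient) at this point.
    loss_cons = sum(
        sum((a - b) ** 2 for a, b in zip(l, s))
        for l, s in zip(logits_long, logits_short)
    ) / n
    return loss_long + lam1 * loss_short + lam2 * loss_cons
```

When the short and long trajectories produce identical logits, the consistency term vanishes and the objective reduces to $(1+\lambda_1)$ times the next-token loss.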

#### Shortcut trajectory sampling.

Given a LoopFormer $(k \otimes L)$ and the maximum trajectory $\boldsymbol{\Delta_L}$, during training, we additionally sample a shortcut trajectory $\boldsymbol{\Delta_S}$ with budget $1 \le S < L$. For simplicity, at each batch we first sample a shortcut length $S \sim \mathcal{U}\{1, \ldots, L-1\}$, then uniformly draw the discrete step schedule $\boldsymbol{\Delta_S}$ over $[0,1]$ such that $\sum_{i=1}^S \Delta_i = 1$. This ensures exposure to both long and short trajectories during training.
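
One way to realize this sampling step is sketched below, under the assumption that step boundaries lie on the $1/L$ grid of the maximum trajectory. The text says only that the schedule is discrete and drawn uniformly, so the grid choice and the function name `sample_shortcut_schedule` are illustrative.

```python
import random
from fractions import Fraction

def sample_shortcut_schedule(max_loops, rng=random):
    """Sample S ~ U{1..L-1}, then split the L-step grid into S positive steps."""
    num_steps = rng.randint(1, max_loops - 1)      # randint is inclusive on both ends
    # Pick S-1 distinct interior grid points as step boundaries (uniform over splits).
    cuts = sorted(rng.sample(range(1, max_loops), num_steps - 1))
    edges = [0] + cuts + [max_loops]
    return [Fraction(b - a, max_loops) for a, b in zip(edges, edges[1:])]

rng = random.Random(0)                             # seeded only for reproducibility
deltas = sample_shortcut_schedule(8, rng)          # one shortcut schedule for L = 8
```

Each draw yields between $1$ and $L-1$ positive step sizes that sum to one, so the shortcut trajectory always ends at $t=1$ like the maximum trajectory.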

#### Inference with elastic depth.

At inference time, LoopFormer can be deployed flexibly at any budget $M \le L$. A user specifies $M$ and $\boldsymbol{\Delta_M}$, and the model produces outputs that scale smoothly with compute, without retraining. `\autoref{fig:trajectories}`{=latex} provides a conceptual illustration of budget-conditioned refinement: for any chosen budget, shorter routes are trained to approximate the $t{=}1$ endpoint, with quality improving as steps increase. Algorithm `\ref{alg:inference}`{=latex} outlines the inference procedure.

```{=latex}
\begin{algorithm}[H]\caption{LoopFormer Training}
\label{alg:training}
\begin{algorithmic}
\While{not converged}
  \State Sample batch $(X, Y)$
  \State Construct max trajectory $\boldsymbol{\Delta_L}$
  \State Sample shortcut length $S \sim \mathcal{U}\{1, \ldots, L-1\}$
  
  \State Sample trajectory $\boldsymbol{\Delta_S}$ of length $S$

  \State $h^{(L)} \gets \Phi_k(h^{(0)};\boldsymbol{\Delta_L})$ \;; $h^{(S)} \gets \Phi_k(h^{(0)};\boldsymbol{\Delta_S})$
  
  \State $\mathcal{L}_L \gets \mathrm{CE}(\mathrm{LMHead}(h^{(L)}), Y)$
  
  \State $\mathcal{L}_S \gets \mathrm{CE}(\mathrm{LMHead}(h^{(S)}), Y)$
  \State $\mathcal{L}_{\mathrm{cons}} \gets 
         \| \operatorname{stopgrad}(h^{(L)}) - h^{(S)} \|^2$
  \State Update $\theta$ using $\nabla_\theta\big(\mathcal{L}_L + \lambda_1 \mathcal{L}_S + \lambda_2 \mathcal{L}_{\mathrm{cons}}\big)$
\EndWhile
\end{algorithmic}
\end{algorithm}
```
```{=latex}
\hfill
```
```{=latex}
\begin{algorithm}[H]\caption{LoopFormer Inference}
\label{alg:inference}
\begin{algorithmic}
\State Given input $X$ 
\State Choose budget $M \le L$
\State Choose schedule $\boldsymbol{\Delta_M}$ with $\sum_{i=1}^{M}\Delta_i = 1$; set $t_0 \gets 0$, $t_i \gets t_{i-1} + \Delta_i$
\State Initialize $h^{(0)} \gets E_{\mathrm{tok}}(X) + E_{\mathrm{pos}}$
\For{$i=1$ to $M$}
  \State $h^{(i)} \gets \Phi_k\!\big(h^{(i-1)}; t_{i-1}, \Delta_i\big)$
\EndFor
\State \Return $\mathrm{LMHead}(h^{(M)}[:,-1])$
\end{algorithmic}
\end{algorithm}
```
Experiments {#sec:experiments}
===========

We evaluate latent reasoning, scalability, efficiency, and representation dynamics of our shortcut-modulated looped language models. Following [@tay2022ul2; @saunshi2025reasoning], we compare a 24-layer, $\sim$1B-parameter non-looped Transformer with FLOP-matched looped variants. All models use a GPT-style decoder [@radford2019language] with NanoGPT configurations. Training is performed on a deduplicated subset of The Pile [@gao2020pile] for 25B tokens in accordance with Chinchilla scaling [@hoffmann2022training]. See `\autoref{sec:supp_exp_details}`{=latex} for details. Unless otherwise specified, all reported results use uniform step sizes at inference: for a compute budget $M$, $\Delta_i = 1/M$ for all $i$.

Latent Reasoning and Perplexity
-------------------------------

We train looped models under different parameter and compute budgets. Using the notation $(k \otimes L)$ for a $k$-block $\Phi_k$ unrolled $L$ times, we consider $k \in \{1,2,3\}$ and train with maximum loops $L \in \{8,12,24\}$. Fixed-loop models are trained and evaluated with the same $L$, whereas depth-elastic models support inference at any $M \le L$. In this section we use $kL$ as a proxy for FLOPs when comparing baselines, ignoring embedding/unembedding costs; `\autoref{tab:main_table}`{=latex} summarizes the results for $k=3$.

```{=latex}
\renewcommand{\arraystretch}{1.8}
```
```{=latex}
\begin{table*}[t]\small
\caption{Perplexity and zero-shot reasoning for $(3\otimes 8)$ looped models under three inference budgets (24$\times$, 12$\times$, 6$\times$). At 24$\times$ we also report fixed-depth baselines; at 12$\times$/6$\times$ we compare against Base $(12\otimes 1)$ and $(6\otimes 1)$. While depth-elastic, LoopFormer narrows the perplexity gap to Base and is competitive on reasoning, outperforming other looped variants, especially at higher budgets.}

\label{tab:main_table}
\setlength{\tabcolsep}{4pt}
\fontsize{12pt}{14pt}\selectfont
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccccccccccc}
\toprule
&  & \multicolumn{3}{c}{\textbf{Perplexity} $\downarrow$} & \multicolumn{11}{c}{\textbf{Language Tasks (Accuracy)} $\uparrow$} \\
\cmidrule(lr){3-5} \cmidrule(lr){6-16}
 & \multirow{-2}{*}{\textbf{Params / FLOPs}} & \textbf{Pile} & \textbf{FineWeb-Edu} & \textbf{OpenWebText} & \textbf{COPA} & \textbf{HS} & \textbf{LB} & \textbf{OBQA} & \textbf{PIQA} & \textbf{RACE} & \textbf{SciQ} & \textbf{ARC} & \textbf{SIQA} & \textbf{WG} & \textbf{Avg Acc} \\
\midrule
\rowcolor[HTML]{F5F5F5}
\textbf{Budget: 24x} \\
\midrule

Base $(24 \otimes 1)$ &
  24x / 24x &
  \textbf{9.49} &
  \textbf{20.7} &
  \textbf{20.08} &
  61 &
  \textbf{35.04} &
  \textbf{41.96} &
  27.6 &
  \textbf{66} &
  29 &
  \textbf{70.1} &
  \textbf{33.43} &
  {\ul 38.18} &
  50.4 &
  45.27 \\
Base-Loop $(3 \otimes 8)$ &
  3x / 24x &
  10.91 &
  24.53 &
  24.53 &
  61 &
  30.46 &
  34.68 &
  27 &
  63.22 &
  28.8 &
  63.7 &
  31.71 &
  \textbf{38.43} &
  49.8 &
  42.88 \\
TMLT  $(3 \otimes 8)$ &
  3x / 24x &
  10.38 &
  {\ul 22.87} &
  21.99 &
  65 &
  {\ul 32.34} &
  {\ul 39.06} &
  {\ul 27.8} &
  63.11 &
  {\ul 29.67} &
  {\ul 69.8} &
  31.67 &
  36.95 &
  51.54 &
  44.69 \\ \hline
Naive-Loop-EE $(3 \otimes 8)$ &
  3x / 24x &
  11.6 &
  26.55 &
  25.64 &
  \textbf{66} &
  29.68 &
  31.52 &
  27 &
  62.24 &
  28.33 &
  65.5 &
  31.64 &
  36.64 &
  50.59 &
  42.91 \\
Base-Loop-EE-Cons $(3 \otimes 8)$ &
  3x / 24x &
  11.56 &
  25.33 &
  24.41 &
  \textbf{66} &
  30.54 &
  32.19 &
  26.8 &
  62.13 &
  28.23 &
  64.9 &
  31.45 &
  37.4 &
  \textbf{52.88} &
  43.25 \\
TMLT-EE $(3 \otimes 8)$ &
  3x / 24x &
  10.7 &
  24.07 &
  23.17 &
  65 &
  31.03 &
  35.14 &
  \textbf{28.4} &
  63.11 &
  28.8 &
  68.8 &
  31.74 &
  37.41 &
  50.83 &
  44.03 \\
\rowcolor[HTML]{DDEEFF} 
LoopFormer - Ours $(3 \otimes 8)$ &
  3x / 24x &
  {\ul 10.28} &
  {\ul 22.87} &
  {\ul 21.98} &
  \textbf{66} &
  32.3 &
  38.27 &
  26.8 &
  {\ul 63.33} &
  \textbf{30.81} &
  68 &
  {\ul 32.71} &
  37.97 &
  {\ul 51.94} &
  44.81 \\ \hline
\rowcolor[HTML]{F5F5F5} 


\midrule
\rowcolor[HTML]{F5F5F5}
\textbf{Budget: 12x} \\
\midrule

Base $(12 \otimes 1)$ &
  12x / 12x &
  \textbf{9.98} &
  \textbf{22.24} &
  \textbf{ 21.41} &
  {\ul 67} &
  \textbf{32.72} &
  \textbf{37.78} &
  26.2 &
  \textbf{64.69} &
  \textbf{29.86} &
  \textbf{69.1} &
  \textbf{32.29} &
  \textbf{38.38} &
  {\ul 51.3} &
  44.93 \\
Naive-Loop-EE $(3 \otimes 4)$ &
  3x / 12x &
  11.66 &
  26.74 &
  25.81 &
  65 &
  29.35 &
  31.28 &
  26 &
  61.75 &
  28.32 &
  64.4 &
  31.15 &
  36.8 &
  51.85 &
  42.59 \\
Base-Loop-EE-Cons $(3 \otimes 4)$ &
  3x / 12x &
  12.0 &
  27.72 &
  26.68 &
  63 &
  29.72 &
  25.6 &
  \textbf{27} &
  61.26 &
  28.04 &
  58.4 &
  30.06 &
  36.23 &
  51.7 &
  41.1 \\
TMLT-EE $(3 \otimes 4)$ &
  3x / 12x &
  12.18 &
  28.28 &
  27.11 &
  60 &
  29.88 &
  27.73 &
  {\ul 26.4} &
  62.02 &
  {\ul 28.8} &
  61.4 &
  30.59 &
  36.69 &
  51.54 &
  41.5 \\
\rowcolor[HTML]{DDEEFF} 
LoopFormer - Ours $(3 \otimes 4)$ &
  3x / 12x &
  {\ul 11.12} &
  {\ul 25.02} &
  {\ul 24.21} &
  \textbf{68} &
  {\ul 31} &
  {\ul 32.35} &
  25.4 &
  {\ul 63.06} &
  28.23 &
  {\ul 66.3} &
  {\ul 31.78} &
  {\ul 37.77} &
  \textbf{53.43} &
  43.73 \\ \hline
\rowcolor[HTML]{F5F5F5} 


\midrule
\rowcolor[HTML]{F5F5F5}
\textbf{Budget: 6x} \\
\midrule

Base $(6 \otimes 1)$ &
  6x / 6x &
  \textbf{11.13} &
  \textbf{25.28} &
  \textbf{24.28} &
  \textbf{64} &
  \textbf{30.45} &
  \textbf{33.45} &
  {\ul 25.6} &
  \textbf{62.51} &
  \textbf{28.04} &
  \textbf{67.5} &
  \textbf{31.07} &
  \textbf{36.18} &
  48.54 &
  42.73 \\
Naive-Loop-EE $(3 \otimes 2)$ &
  3x / 6x &
  {\ul 12.61} &
  {\ul 29.36} &
  {\ul 28.38} &
  {\ul 63} &
  28.64 &
  {\ul 27.28} &
  24.6 &
  {\ul 61.48} &
  {\ul 26.41} &
  {\ul 62.3} &
  29.91 &
  {\ul 35.41} &
  {\ul 50.67} &
  40.97 \\
Base-Loop-EE-Cons $(3 \otimes 2)$ &
  3x / 6x &
  15.07 &
  35.95 &
  34.88 &
  59 &
  28.28 &
  18.49 &
  25.6 &
  59.52 &
  25.16 &
  53.8 &
  28.94 &
  34.65 &
  \textbf{52.72} &
  38.62 \\
TMLT-EE $(3 \otimes 2)$ &
  3x / 6x &
  15.79 &
  37.83 &
  37.84 &
  57 &
  28.18 &
  17.09 &
  25.08 &
  58.68 &
  {\ul 26.41} &
  50.01 &
  28.26 &
  34.9 &
  50.27 &
  37.59 \\
\rowcolor[HTML]{DDEEFF} 
LoopFormer - Ours $(3 \otimes 2)$ &
  3x / 6x &
  14.3 &
  33.45 &
  32.46 &
  {\ul 63} &
  {\ul 28.69} &
  26.14 &
  \textbf{26.6} &
  60.1 &
  25.74 &
  58.6 &
  {\ul 29.29} &
  35.26 &
  50.2 &
  40.36 \\


\bottomrule
\end{tabular}%
}

\end{table*}
```
#### Evaluation metrics and benchmarks.

Following [@saunshi2025reasoning; @geiping2025scaling; @bae2025mixture], we report perplexity and downstream zero-shot accuracy. Perplexity is measured on FineWeb-Edu [@penedo2024fineweb], OpenWebText [@Gokaslan2019OpenWeb], and The Pile. For latent reasoning, we report zero-shot accuracy on ten established benchmarks spanning a range of reasoning difficulty: COPA [@roemmele2011choice], HellaSwag (HS) [@zellers2019hellaswag], LAMBADA (LB) [@paperno2016lambada], OpenBookQA (OBQA) [@mihaylov2018can], PIQA [@bisk2020piqa], RACE [@lai2017race], Social IQA (SIQA) [@sap2019socialiqa], ARC [@clark2018think], SciQ [@welbl2017crowdsourcing], and WinoGrande (WG) [@sakaguchi2021winogrande]. Where available, we use normalized accuracy; for ARC we report the average over Easy and Challenge.

#### Baselines.

We compare LoopFormer against two groups of baselines.

*Fixed-depth models:* (a) **Base**: a non-looped Transformer; (b) **Base-Loop**: a standard looped model as in [@saunshi2025reasoning]; (c) **TMLT**: a looped model with timestep conditioning [@xu2024expressive].

*Depth-elastic models:* (i) **Base-Loop-EE** (listed as Naive-Loop-EE in `\autoref{tab:main_table}`{=latex}): naive early exiting applied to the basic looped model;[^1] (ii) **Base-Loop-EE-Cons**: (i) augmented with our consistency loss during training; (iii) **TMLT-EE**: early exiting and consistency training applied to TMLT to enable depth elasticity.

#### Findings.

`\autoref{tab:main_table}`{=latex} highlights three trends:

-   Consistent with [@mohtashami2023cotformer; @saunshi2025reasoning], looped models trail non-looped baselines on perplexity, reflecting the role of parameters in memorization. However, LoopFormer closes much of this gap, surpassing even fixed-depth looped variants.

-   In zero-shot reasoning, looped models benefit from iterative refinement and can approach iso-FLOP non-looped baselines; LoopFormer is the most competitive among looped models.

-   Under reduced budgets, LoopFormer maintains high utility: at $12\times$ it reaches 11.12 perplexity on The Pile and 43.73 average accuracy, against 9.98 and 44.93 for Base $(12\otimes 1)$, indicating that budget-conditioned trajectories preserve informative representations rather than collapsing.

We next examine the effect of the number of layers $k$ and the number of loops $L$, and compare against depth-elastic alternatives under more compute configurations.

```{=latex}
\begin{figure*}[t]
  

    \includegraphics[width=\linewidth]{assets/perplexity_by_k.pdf}
    

  \caption{Scaling with layers and loops.
    Perplexity on The Pile across $(k,L)$ and inference budgets $M\le L$; larger $k$ lowers perplexity at fixed $M$, and additional loops further reduce it.}

\label{fig:perplexity_graph}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]
  

    \includegraphics[width=\linewidth]{assets/reasoning_by_k.pdf}
    

  \caption{Scaling with layers and loops.
    Average zero-shot reasoning accuracy on 10 tasks for the same settings; more loops improve reasoning at fixed $k$.
    Across budgets, LoopFormer scales smoothly without collapse and consistently outperforms other looped baselines.}
\label{fig:reasoning_graph}
\end{figure*}
```
Number of Layers vs. Number of Loops
------------------------------------

We study how perplexity and reasoning change as the number of Transformer layers per block ($k$) varies. We train $k\in\{1,2,3\}$ with maximum loops $L\in\{8,12,24\}$, then evaluate across multiple inference budgets ($M\le L$). `\autoref{fig:perplexity_graph}`{=latex} reports perplexity and `\autoref{fig:reasoning_graph}`{=latex} reports average zero-shot reasoning accuracy across benchmarks.

As the plots show, both perplexity and reasoning improve with larger $k$ and more iterations. **LoopFormer** preserves these trends under budgets ($M\le L$): shorter trajectories remain informative and added loops yield smooth gains without collapse. Our consistency-augmented training also improves the scaling of Base-Loop and TMLT in the depth-elastic regime.

```{=latex}
\begin{figure*}[t]
  

  % Panels
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/curvature.pdf}
    \subcaption{Curvature}
    \label{fig:curvature}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/cosine_distance.pdf}
    \subcaption{Anisotropy}
    \label{fig:anisotropy}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/prompt_entropy.pdf}
    \subcaption{Prompt Entropy}
    \label{fig:prompt_entropy}
  \end{subfigure}

      \caption{Representation metrics over normalized depth. Panels show \textbf{(a)} curvature, \textbf{(b)} anisotropy, and \textbf{(c)} normalized prompt entropy. Early-exit baselines remain flat, indicating minimal change with additional loop steps, whereas LoopFormer exhibits sustained evolution that rises through mid-depths and tapers near the end, suggesting useful depth-elastic dynamics.}

  \label{fig:representation_metrics}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]
  
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/shortcut_cka.pdf}
    \subcaption{LoopFormer}
    \label{fig:cka_loopformer}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/albert_ee_cka.pdf}
    \subcaption{Base-Loop-EE-Cons}
    \label{fig:cka_base_loop_ee}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/tmlt_ee_cka.pdf}
    \subcaption{TMLT-EE}
    \label{fig:cka_tmlt_ee}
  \end{subfigure}

  \caption{CKA similarity across loop steps. Each heatmap reports cross-step CKA within a model family. Depth-elastic baselines show high CKA (indicating little change across loops), suggesting stagnation. LoopFormer exhibits progressive drift, especially toward the later steps.}


  \label{fig:cka}
\end{figure*}
```
Analyzing Representation Collapse in Looped Transformers
--------------------------------------------------------

The role of depth in Transformers has been widely examined through scaling laws and theory [@kaplan2020scaling; @csordas2025language], with a recurring observation that very deep stacks can exhibit *representation degeneration* (or *collapse*), where hidden states change little across layers [@ethayarajh2019contextual; @dong2021attention; @godey2024anisotropy]. Several works attribute this to an inductive bias of self-attention toward uniformity, including rank decay in attention maps with depth [@dong2021attention]. Related studies of *anisotropy* [@razzhigaev2023shape; @godey2024anisotropy] report pronounced effects in language modeling, and these effects could be amplified in looped architectures where a single block is repeatedly applied [@dong2021attention].

We analyze token dynamics along the computation depth using four complementary metrics: (i) **Anisotropy** [@godey2024anisotropy] within a layer, measured as the average pairwise cosine similarity among tokens in the prompt; higher values indicate more aligned (less diverse) directions. (ii) **Curvature** [@hosseini2023large], a local geometric measure of how rapidly token representations change direction across neighboring positions. (iii) **Prompt entropy** [@skean2025layer], a matrix-based estimate of how spread-out token embeddings are across feature dimensions; higher entropy suggests greater diversity and lower redundancy. (iv) **CKA similarity** [@kornblith2019similarity] across loop steps, quantifying representational similarity between different iterations of the shared block. While prior work studies these metrics in various architectures and their correlation with downstream performance [@garrido2023rankme; @skean2025layer], here we use them primarily as diagnostics to assess whether looped models *use* additional computation (iterative depth) or *stagnate*.

`\autoref{fig:representation_metrics}`{=latex} summarizes curvature, anisotropy, and prompt entropy across loop steps; `\autoref{fig:cka}`{=latex} reports cross-step CKA. Early-exit looped baselines show flat trajectories on all three metrics and high CKA, indicating stagnation. In contrast, LoopFormer evolves: early steps have lower curvature, weaker angular alignment, and lower entropy; as steps progress, curvature and entropy rise and alignment increases, then all three taper near the final step as the model prepares for unembedding.

Overall, these patterns suggest that LoopFormer maintains *useful* representational dynamics across loops, with shorter trajectories remaining informative and longer trajectories providing additional refinement, rather than converging prematurely to a static state. We view this as evidence that shortcut conditioning and consistency training help avert collapse in depth-elastic looped models.

How to choose trajectories under a fixed budget
-----------------------------------------------

```{=latex}
\begin{figure*}[t]
  
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/shortcut_3_ppl.pdf}
    \subcaption{Perplexity ($3 \otimes 8$; $M=4$)}
    \label{fig:space_ppl_3_8}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/shortcut_3_acc.pdf}
    \subcaption{Accuracy ($3 \otimes 8$; $M=4$)}
    \label{fig:space_acc_3_8}
  \end{subfigure}
  \hfill
  \begin{subfigure}[t]{0.32\textwidth}
    
    \includegraphics[width=\linewidth]{assets/shortcut_2_ppl.pdf}
    \subcaption{Perplexity ($2 \otimes 12$; $M=6$)}
    \label{fig:space_ppl_2_12}
  \end{subfigure}

  \caption{Performance across length-$M$ trajectories. \textbf{(a,b)} enumerate all $M{=}4$ schedules for LoopFormer $(3\otimes 8)$, reporting perplexity and average zero-shot accuracy; \textbf{(c)} shows perplexity for $M{=}6$ schedules of $(2\otimes 12)$. Even at fixed budget, trajectory choice matters: spreads are large, and top schedules allocate coarser steps early and finer steps late.}
  \label{fig:time_space}
\end{figure*}
```
A practical question for our depth-elastic models is: *given a budget $M\le L$, how should we choose $\boldsymbol{\Delta}_M$?* Beyond average gains, we ask which *trajectories* work best and which parts of time contribute most. The representation analyses in `\autoref{fig:representation_metrics}`{=latex} and `\autoref{fig:cka}`{=latex} suggest a pattern: early steps produce more similar states, activity rises mid/late, then tapers near the end.

To probe this directly, we run a toy yet exhaustive study. For LoopFormer $(3\otimes8)$ with budget $M=4$, we enumerate all trajectories with step sizes summing to 1 (aligned to training granularity) and measure perplexity and average zero-shot accuracy (`\autoref{fig:space_ppl_3_8}`{=latex}, `\autoref{fig:space_acc_3_8}`{=latex}). Despite identical compute, performance varies across schedules (spread $\sim$1.4 perplexity and $\sim$1.3 accuracy points). Repeating for $(2\otimes12)$ with $M=6$, the perplexity spread grows to nearly 3 (`\autoref{fig:space_ppl_2_12}`{=latex}).
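
The enumeration can be sketched as follows, assuming that "aligned to training granularity" means step boundaries fall on the $1/L$ grid; the helper `enumerate_schedules` is a hypothetical name. Under that assumption there are $\binom{L-1}{M-1}$ ordered schedules, i.e. 35 for $(3\otimes8)$ with $M=4$ and 462 for $(2\otimes12)$ with $M=6$.

```python
from fractions import Fraction
from itertools import combinations

def enumerate_schedules(max_loops, budget):
    """All ordered step-size tuples on the 1/max_loops grid with `budget` steps."""
    schedules = []
    # Each schedule corresponds to a choice of budget-1 interior boundaries.
    for cuts in combinations(range(1, max_loops), budget - 1):
        edges = (0,) + cuts + (max_loops,)
        schedules.append(
            tuple(Fraction(b - a, max_loops) for a, b in zip(edges, edges[1:]))
        )
    return schedules

schedules_3x8 = enumerate_schedules(8, 4)     # candidate schedules for (3 x 8), M = 4
schedules_2x12 = enumerate_schedules(12, 6)   # candidate schedules for (2 x 12), M = 6
```

Every tuple sums to one, so each candidate can be passed directly to the model as $\boldsymbol{\Delta}_M$ and scored at identical compute.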

#### Findings.

(1) Even under uniform training over budgets and times, some schedules outperform others by wide margins. (2) The best schedules for perplexity and for downstream reasoning are close but not identical; both favor allocating larger steps early and finer steps late.

Discussion and Future Directions
================================

`\noindent `{=latex}We introduced LoopFormer, a looped Transformer that conditions each iteration on normalized time $t$ and step size $\Delta t$, trained across trajectories with a shortcut-consistency loss. This framing induces *loop trajectories* in hidden space: under a fixed budget, shorter routes yield useful intermediate states, and additional steps refine them toward a shared $t{=}1$ endpoint, realizing latent reasoning while supporting budget-conditioned (elastic) inference without retraining. Empirically, LoopFormer delivers strong performance-per-compute on perplexity and consistently improves downstream zero-shot reasoning, while avoiding the representational stagnation seen in naive early-exit baselines. Limitations include global (sequence-level) rather than instance/token-adaptive budgeting, added training overhead from multi-trajectory consistency, and correlational (not causal) representation analyses. Promising directions include instance-conditioned schedule policies and deeper theory/diagnostics of the representation space.

Acknowledgements
================

`\noindent `{=latex}We thank Prof. Colin Raffel (University of Toronto) and Prof. Ali Etemad (Queen's University) for valuable feedback throughout this project. We also thank Negin Baghbanzadeh and Dev Shah for their help during the early stages of our work on looped vision models.

`\noindent `{=latex}Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, the [Digital Research Alliance of Canada](https://www.alliancecan.ca), and companies sponsoring the [Vector Institute](https://www.vectorinstitute.ai/partnerships/current-partners/).

```{=latex}
\bibliographystyle{iclr2026_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
Appendix {#appendix .unnumbered}
========

Training and Implementation Details {#sec:supp_exp_details}
===================================

#### Hardware and framework.

All models are trained on 4$\times$H100 (80 GB) GPUs using the open-source NanoGPT training stack as a reference implementation.

#### Data and tokens.

Unless otherwise specified, we run each experiment for 50,000 optimizer steps with a global batch size of 48 sequences and block size 1024 (context length). This corresponds to approximately 25B training tokens in total.

#### Optimization.

We use AdamW with weight decay $2\times 10^{-1}$, cosine learning-rate decay (per NanoGPT), peak learning rate $\texttt{lr}=6\times 10^{-4}$, minimum learning rate $\texttt{min\_lr}=6\times 10^{-5}$, and $4{,}000$ warmup steps, which we found important for stability. Similar to observations in [@geiping2025scaling], we occasionally observe training instabilities at depth; warmup and cosine decay mitigate these in practice. Unless noted, other optimizer and training defaults follow NanoGPT.

#### Model hyperparameters.

Following [@saunshi2025reasoning], we use hidden size $d=2048$ and $n_{\text{heads}}=32$ for all $(k\otimes L)$ configurations. The feed-forward dimension is $d_{\text{ff}}=5120$ with a standard two-layer GELU MLP. All normalizations are RMSNorm. We use learned positional embeddings added to token embeddings (NanoGPT default). We also tested RoPE [@su2024roformer] and observed slightly better small-scale performance, but we use learned positions for simplicity and efficiency. We tie input and output embeddings (weight tying).

#### LoopFormer conditioning.

We use two embedding modules, one for normalized time $t\in[0,1]$ and one for step size $\Delta t\in(0,1]$. Each maps a scalar input to a $d$-dimensional conditioning vector via (i) fixed sinusoidal Fourier features of width $D_f=256$ and max period $10{,}000$, followed by (ii) a 2-layer MLP with hidden size $d$ and `SiLU` nonlinearity: $$\textstyle \phi(\tau)\;=\;\mathrm{MLP}\big([\cos(\tau\omega_1),\sin(\tau\omega_1),\ldots,\cos(\tau\omega_{D_f/2}),\sin(\tau\omega_{D_f/2})]\big)\in\mathbb{R}^{d},$$ where $\omega_k = \exp\!\big(-\tfrac{k-1}{D_f/2}\log 10{,}000\big)$ for $k=1,\ldots,D_f/2$. Given batchwise scalars $t$ and $\Delta t$, we compute $e_t=\phi(t)$ and $e_{\Delta}=\phi(\Delta t)$ and sum them to obtain the per-iteration conditioning signal $c=e_t+e_{\Delta}\in\mathbb{R}^{d}$.
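
The fixed Fourier features inside $\phi$ can be written out directly from the formula above. The sketch below covers only that feature map (the 2-layer MLP is omitted), and the function name `fourier_features` is an illustrative choice.

```python
import math

def fourier_features(tau, width=256, max_period=10_000.0):
    """[cos(tau*w_1), sin(tau*w_1), ..., cos(tau*w_{width/2}), sin(tau*w_{width/2})]."""
    half = width // 2
    feats = []
    for k in range(1, half + 1):
        # w_k = exp(-(k-1)/(D_f/2) * log(max_period)); w_1 = 1 is the highest frequency.
        omega = math.exp(-(k - 1) / half * math.log(max_period))
        feats.extend((math.cos(tau * omega), math.sin(tau * omega)))
    return feats

features_t = fourier_features(0.25)       # features for normalized time t = 0.25
features_delta = fourier_features(0.125)  # features for step size 1/8
```

Each scalar is expanded to $D_f = 256$ features; in the model these would then pass through the embedding module's own MLP before $e_t$ and $e_\Delta$ are summed.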

Conditioning is applied inside each LoopFormer block via an AdaLN-style modulator: a small MLP takes $c$ and outputs $4d$ parameters, which we split into $(\alpha_{\mathrm{msa}},\;\alpha_{\mathrm{mlp}},\;\gamma_{\mathrm{msa}},\;\gamma_{\mathrm{mlp}})$. We use RMSNorm (with no learned affine parameters) before MHSA and FFN, and apply multiplicative scaling and residual gating as $$\begin{aligned}
x &\leftarrow x + \alpha_{\mathrm{msa}}\odot \mathrm{MHSA}\big(\mathrm{RMSNorm}(x)\odot (1+\gamma_{\mathrm{msa}})\big),\\
x &\leftarrow x + \alpha_{\mathrm{mlp}}\odot \mathrm{FFN}\big(\mathrm{RMSNorm}(x)\odot (1+\gamma_{\mathrm{mlp}})\big),
\end{aligned}$$ broadcast over the sequence length. The modulator is a `SiLU` followed by a linear layer with output size $4d$, and is *zero-initialized* (weights and bias): at initialization $\gamma=0$ leaves the normalized activations unscaled and $\alpha=0$ closes both residual branches, so each block starts as the identity map and conditioning is learned stably.
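
The two update equations can be traced on a single token vector with the toy sketch below. It is a simplification under stated assumptions: `mhsa` and `ffn` are arbitrary stand-in callables acting on one vector (real attention mixes tokens), and all function names are hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without learned scale or bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def modulated_sublayer(x, sublayer, alpha, gamma):
    """x + alpha * sublayer(RMSNorm(x) * (1 + gamma)), elementwise per feature."""
    normed = [n * (1.0 + g) for n, g in zip(rms_norm(x), gamma)]
    out = sublayer(normed)
    return [xi + a * o for xi, a, o in zip(x, alpha, out)]

def loopformer_block(x, mhsa, ffn, mod):
    """`mod` is the 4-way split (alpha_msa, alpha_mlp, gamma_msa, gamma_mlp)."""
    alpha_msa, alpha_mlp, gamma_msa, gamma_mlp = mod
    x = modulated_sublayer(x, mhsa, alpha_msa, gamma_msa)
    return modulated_sublayer(x, ffn, alpha_mlp, gamma_mlp)

# With an all-zero modulator output, both gates are closed and x passes through.
x = [1.0, -2.0, 3.0]
zeros = [0.0, 0.0, 0.0]
toy = lambda v: [2.0 * t for t in v]          # stand-in for MHSA / FFN
y = loopformer_block(x, toy, toy, (zeros, zeros, zeros, zeros))
```

Because the gates $\alpha$ multiply the sublayer outputs, an all-zero modulator output returns the input unchanged, whatever the sublayers compute.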

#### TMLT baseline.

For Time-Modulated Looped Transformers, we follow the authors' setup and condition each iteration on the loop index, implementing the timestep modulation as described in their paper.

Additional Experiments and Ablations {#sec:supp_ablations}
====================================

Computational overhead of LoopFormer {#sec:supp_overhead}
------------------------------------

Algorithm `\ref{alg:training}`{=latex} uses a dual-trajectory objective: for every batch we compute (i) the full $L$-loop trajectory for the main LM loss, and (ii) a sampled shorter $M$-loop trajectory for shortcut-consistency. This design adds overhead during training; at inference, however, the FLOPs and latency of LoopFormer scale similarly to those of a vanilla Transformer. Below we quantify the training overhead.

For clarity, we decompose the per-batch FLOPs into a loop-independent overhead plus a loop-dependent term. Let $C_{\text{io}}$ denote the FLOPs of the embedding / unembedding (non-loop compute), and let $C_{1}$ denote the FLOPs of one loop through the shared $k$-block stack. Then running $\ell$ loops costs $$C(\ell) \;=\; C_{\text{io}} \;+\; \ell\,C_{1}.$$ LoopFormer computes a full $L$-loop trajectory plus one sampled shorter trajectory of length $M$ (the shortcut length $S$ of Algorithm `\ref{alg:training}`{=latex}), where $M \sim \text{Unif}\{1,\dots,L-1\}$. Thus the expected per-batch FLOPs are $$C(L) + \mathbb{E}_{M}[C(M)]
\;=\; (C_{\text{io}} + L C_{1}) \;+\; \bigl(C_{\text{io}} + \mathbb{E}[M]\,C_{1}\bigr).$$ Since $M$ is uniform on $\{1,\dots,L-1\}$, we have $\mathbb{E}[M]=L/2$. Plugging in, $$C(L) + \mathbb{E}_{M}[C(M)]
\;=\; 2C_{\text{io}} \;+\; \left(L+\tfrac{L}{2}\right)C_{1}
\;=\; 2C_{\text{io}} \;+\; \tfrac{3L}{2}C_{1}.$$ Relative to the fixed-loop/vanilla training cost $C(L)=C_{\text{io}}+LC_{1}$, the multiplicative overhead is $$\frac{C(L) + \mathbb{E}_{M}[C(M)]}{C(L)}
\;=\;
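\frac{2C_{\text{io}} + \tfrac{3L}{2}C_{1}}{C_{\text{io}} + L C_{1}}.$$ This ratio lies between $1.5$ and $2$ and approaches $1.5$ when the loop term dominates ($L C_{1} \gg C_{\text{io}}$); for the regimes considered in this work, it is approximately $1.5\times$ the FLOPs of fixed-loop training.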
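The accounting above can be checked numerically. The following is a minimal sketch, not the authors' measurement code: the function names are hypothetical, and the values passed for $C_{\text{io}}$ and $C_{1}$ are illustrative placeholders rather than measured FLOPs.

```python
from fractions import Fraction


def expected_shortcut_length(L: int) -> Fraction:
    """E[M] for M ~ Unif{1, ..., L-1}, computed by direct enumeration."""
    if L < 2:
        raise ValueError("L must be at least 2 so that a shorter trajectory exists")
    lengths = range(1, L)
    return Fraction(sum(lengths), len(lengths))


def training_overhead(c_io: float, c_1: float, L: int) -> float:
    """Expected dual-trajectory cost divided by the fixed-loop cost C(L)."""
    full = c_io + L * c_1                                       # C(L)
    short = c_io + float(expected_shortcut_length(L)) * c_1    # E[C(M)]
    return (full + short) / full


# Illustrative placeholder costs (assumptions, not measurements).
ratio_loop_dominated = training_overhead(c_io=0.0, c_1=1.0, L=8)
ratio_with_io = training_overhead(c_io=1.0, c_1=1.0, L=8)
```

With a negligible $C_{\text{io}}$ the ratio is exactly $1.5$; any non-zero embedding / unembedding cost pushes it slightly above $1.5$, since that cost is paid twice.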

We confirm this empirically. Across our settings, LoopFormer requires about $1.5\times$ the training FLOPs of a fixed-loop or vanilla Transformer baseline trained with the same token budget and number of iterations. In wall-clock time, this corresponds to an approximately $1.3\times$ slowdown in our setup (measured on 4$\times$H100 GPUs with identical batch size, optimizer, and data). The other depth-elastic baselines studied in this work (e.g., TMLT-EE and Base-Loop-EE) incur the same overhead as LoopFormer.

Overall, LoopFormer pays a modest training overhead to enable elastic-depth inference: a single shared-parameter model that remains robust under aggressive truncation and continues to refine representations as more loops are allocated. Inference, on the other hand, costs about the same FLOPs as the vanilla or fixed-loop baselines.

Comparison with Vanilla Transformers under a similar parameter budget
---------------------------------------------------------------------

Here we compare LoopFormer to a vanilla Transformer under a similar parameter budget. We fix the shared stack to $K=3$ layers and evaluate LoopFormer at three inference budgets: $(3\otimes2)$, $(3\otimes4)$, and $(3\otimes8)$. `\autoref{tab:supp:params_matched}`{=latex} reports perplexity and downstream language-task accuracy for LoopFormer and a 3-layer vanilla Transformer with a similar number of parameters. At the smallest budget $(3\otimes2)$, LoopFormer is roughly on par with the vanilla baseline in average accuracy (40.36 vs. 40.93), though at higher perplexity. Increasing the loop budget yields consistent improvements: at $(3\otimes8)$, our depth-elastic model substantially outperforms the vanilla Transformer (44.81 vs. 40.93 average accuracy; 10.28 vs. 12.93 Pile perplexity).

```{=latex}
\renewcommand{\arraystretch}{2.4}
```
```{=latex}
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccccccccccc}
\hline
 &
   &
  \multicolumn{3}{c}{\textbf{Perplexity $\downarrow$}} &
  \multicolumn{11}{c}{\textbf{Language Tasks (Acc) $\uparrow$}} \\ \cline{3-16} 
 &
  \multirow{-2}{*}{\textbf{Params / FLOPs}} &
  \textbf{Pile} &
  \textbf{FineWeb-Edu} &
  \textbf{OpenWebText} &
  \textbf{COPA} &
  \textbf{HS} &
  \textbf{LB} &
  \textbf{OBQA} &
  \textbf{PIQA} &
  \textbf{Race} &
  \textbf{SciQ} &
  \textbf{ARC} &
  \textbf{SIQA} &
  \textbf{WG} &
  \textbf{Avg Acc} \\ \hline

Base $(3 \otimes 1)$ &
  3x / 3x &
  12.93 &
  30.23 &
  29.20 &
  59 &
  28.7 &
  26.47 &
  26.4 &
  59.85 &
  27.08 &
  64.1 &
  30.7 &
  35.57 &
  51.46 &
  40.93 \\


LoopFormer - Ours $(3 \otimes 2)$ &
  3x / 6x &
  14.3 &
  33.45 &
  32.46 &
  {63} &
  { 28.69} &
  26.14 &
  {26.6} &
  60.1 &
  25.74 &
  58.6 &
  { 29.29} &
  35.26 &
  50.2 &
  40.36 \\


LoopFormer - Ours $(3 \otimes 4)$ &
  3x / 12x &
  {11.12} &
  {25.02} &
  {24.21} &
  {68} &
  { 31} &
  { 32.35} &
  25.4 &
  { 63.06} &
  28.23 &
  { 66.3} &
  { 31.78} &
  { 37.77} &
  {53.43} &
  43.73 \\


LoopFormer - Ours $(3 \otimes 8)$ &
  3x / 24x &
  {10.28} &
  { 22.87} &
  { 21.98} &
  {66} &
  32.3 &
  38.27 &
  26.8 &
  {63.33} &
  {30.81} &
  68 &
  { 32.71} &
  37.97 &
  {51.94} &
  44.81 \\ \hline

\bottomrule
\end{tabular}%
}
```
Comparison under FLOPs-matched training
---------------------------------------

As discussed in `\autoref{sec:supp_overhead}`{=latex}, elastic-depth training incurs approximately $1.5\times$ the training FLOPs of vanilla or fixed-loop Transformers due to the additional sampled shortcut trajectory. To evaluate LoopFormer under a FLOPs-matched training regime, we train a LoopFormer model for 34k iterations (all other hyperparameters unchanged), resulting in roughly the same total training FLOPs as the baselines trained for 50k iterations. This compute-matched LoopFormer therefore sees about 8B fewer Pile tokens during training. `\autoref{tab:supp:flops_matched}`{=latex} reports perplexity and zero-shot reasoning when both training and inference FLOPs are matched. Even under this reduced training budget, LoopFormer remains close to TMLT (44.21 vs. 44.69 average accuracy; 10.71 vs. 10.38 Pile perplexity) and comparable to the other baselines, while additionally providing depth-elastic inference.
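The iteration count follows from dividing the baseline schedule by the training overhead. The sketch below reproduces that arithmetic; the helper name is hypothetical, and rounding up to the next thousand iterations is our assumption (it is consistent with the 34k figure, but the exact rounding rule is not stated).

```python
import math


def flops_matched_iterations(baseline_iters: int, overhead: float,
                             round_to: int = 1000) -> int:
    """Iterations that give roughly the same total training FLOPs as the baseline.

    `round_to` controls the granularity of the result; rounding up is an
    assumption made for illustration.
    """
    exact = baseline_iters / overhead
    return math.ceil(exact / round_to) * round_to


# Baselines train for 50k iterations; elastic-depth training costs ~1.5x per step.
matched = flops_matched_iterations(baseline_iters=50_000, overhead=1.5)
```

The unrounded value is about 33.3k iterations, so a 34k-iteration run uses marginally more FLOPs than an exact match would.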

```{=latex}
\renewcommand{\arraystretch}{1.8}
```
```{=latex}
\begin{table*}[t]\small
\caption{Perplexity and zero-shot reasoning under FLOPs-matched training and inference. We compare $(3\otimes 8)$ looped models as well as a Base $(24\otimes 1)$. To match total training FLOPs, LoopFormer is trained for 34k iterations (seeing $\sim$8B fewer Pile tokens), while baselines are trained for 50k iterations. Despite the reduced token budget, LoopFormer matches TMLT and remains competitive on reasoning across budgets, while retaining depth-elastic behavior.}
\label{tab:supp:flops_matched}
\setlength{\tabcolsep}{4pt}
\fontsize{12pt}{14pt}\selectfont
\resizebox{\textwidth}{!}{%
\begin{tabular}{cccccccccccccccc}
\toprule
&  & \multicolumn{3}{c}{\textbf{Perplexity} $\downarrow$} & \multicolumn{11}{c}{\textbf{Language Tasks (Accuracy)} $\uparrow$} \\
\cmidrule(lr){3-5} \cmidrule(lr){6-16}
 & \multirow{-2}{*}{\textbf{Params / FLOPs}} & \textbf{Pile} & \textbf{FineWeb-Edu} & \textbf{OpenWebText} & \textbf{COPA} & \textbf{HS} & \textbf{LB} & \textbf{OBQA} & \textbf{PIQA} & \textbf{Race} & \textbf{SciQ} & \textbf{ARC} & \textbf{SIQA} & \textbf{WG} & \textbf{Avg Acc} \\
\midrule

Base $(24 \otimes 1)$ &
  24x / 24x &
   {9.49} &
   {20.7} &
   {20.08} &
  61 &
   {35.04} &
   {41.96} &
  27.6 &
   {66} &
  29 &
   {70.1} &
   {33.43} &
  {  38.18} &
  50.4 &
  45.27 \\
Base-Loop $(3 \otimes 8)$ &
  3x / 24x &
  10.91 &
  24.53 &
  24.53 &
  61 &
  30.46 &
  34.68 &
  27 &
  63.22 &
  28.8 &
  63.7 &
  31.71 &
   {38.43} &
  49.8 &
  42.88 \\
TMLT  $(3 \otimes 8)$ &
  3x / 24x &
  10.38 &
  {  22.87} &
  21.99 &
  65 &
  {  32.34} &
  {  39.06} &
  {  27.8} &
  63.11 &
  {  29.67} &
  {  69.8} &
  31.67 &
  36.95 &
  51.54 &
  44.69 \\

\rowcolor[HTML]{DDEEFF} 
LoopFormer - Ours $(3 \otimes 8)$ &
  3x / 24x &
  {10.71} &
  {  23.66} &
  {  22.78} &
   {64} &
  31.96 &
  37.11 &
  25.6 &
  {  62.8} &
   {29.91} &
  68.5 &
  {  31.68} &
  37.93 &
  {  52.57} &
  44.21 \\ \hline


\bottomrule
\end{tabular}%
}
\end{table*}
```
PyTorch Pseudocode for LoopFormer {#appendix:implementation:loopformer}
---------------------------------

Algorithm 3 presents a self-contained PyTorch-style pseudocode snippet for the core LoopFormer shared stack. The module consists of a `LoopFormerBlock`, which applies AdaLN-style modulation to the attention and MLP residual branches using the loop-conditioning vector $c$ (derived from $(t,\Delta t)$), and a `SharedBlock` that stacks $K$ such blocks to form the shared Transformer $\Phi_k$.

Within each loop iteration, `SharedBlock` applies all $K$ blocks sequentially to the token states $x$; this entire $K$-block stack is then reused across loops and repeated for $M$ or $L$ iterations during training or inference. This snippet clarifies how LoopFormer handles multiple blocks per loop and how the same shared stack supports variable compute budgets.

```{=latex}
\newpage
```
`\label{alg:loopformer_block}`{=latex}

`\noindent`{=latex}

------------------------------------------------------------------------

\
**Algorithm 3: PyTorch implementation of LoopFormer shared block**\
`\noindent`{=latex}

------------------------------------------------------------------------

\

``` {.python language="Python" frame="none" numbers="none" xleftmargin="0pt" xrightmargin="0pt"}
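# Inputs:
#   x      : torch.Tensor [B, T, D] - token states.
#   c      : torch.Tensor [B, D]    - loop-conditioning vector.
#   config : model config with n_embd, etc.
#   depth  : int (K) - number of distinct Transformer blocks.
# CausalSelfAttention and MLP are assumed to be defined elsewhere
# (standard GPT-style modules mapping [B, T, D] -> [B, T, D]).
import torch
import torch.nn as nn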


class LoopFormerBlock(nn.Module):

    def __init__(self, config):
        super().__init__()
        d = config.n_embd

        self.ln_1 = nn.RMSNorm(d, elementwise_affine=False)
        self.attn = CausalSelfAttention(config)

        self.ln_2 = nn.RMSNorm(d, elementwise_affine=False)
        self.mlp  = MLP(config)

        # Conditioning -> (gate_msa, gate_mlp, scale_msa, scale_mlp)
        self.adaLN_modulation = nn.Sequential(
            nn.SiLU(),
            nn.Linear(d, 4 * d, bias=True),
        )

        # AdaLN-Zero init: start from identity updates
        nn.init.zeros_(self.adaLN_modulation[1].weight)
        nn.init.zeros_(self.adaLN_modulation[1].bias)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        gate_msa, gate_mlp, scale_msa, scale_mlp = \
            self.adaLN_modulation(c).chunk(4, dim=1)

        # Multi-head attention branch
        x = x + gate_msa.unsqueeze(1) * self.attn(
            self.ln_1(x) * (1.0 + scale_msa.unsqueeze(1))
        )

        # Feed-forward branch
        x = x + gate_mlp.unsqueeze(1) * self.mlp(
            self.ln_2(x) * (1.0 + scale_mlp.unsqueeze(1))
        )
        return x


class SharedBlock(nn.Module):

    def __init__(self, depth: int, config):
        super().__init__()
        self.blocks = nn.ModuleList(
            [LoopFormerBlock(config) for _ in range(depth)]
        )

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        for blk in self.blocks:
            x = blk(x, c)
        return x
```

`\noindent`{=latex}

------------------------------------------------------------------------
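To make the reuse of the shared stack across loops concrete, the following sketch shows one way an outer loop could drive Algorithm 3 under a given budget. It is an illustration under stated assumptions, not the authors' implementation: `uniform_schedule` and `run_loops` are hypothetical names, time is assumed to run over $[0,1]$ with uniform steps $\Delta t = 1/n$ for an $n$-loop budget, and `cond` stands in for whatever network maps $(t,\Delta t)$ to the conditioning vector $c$. The code is duck-typed, so it works with plain numbers as well as tensors.

```python
from typing import Any, Callable, List, Tuple


def uniform_schedule(num_loops: int) -> List[Tuple[float, float]]:
    """Return (t, dt) for each loop, assuming uniform steps on [0, 1]."""
    if num_loops < 1:
        raise ValueError("num_loops must be >= 1")
    dt = 1.0 / num_loops
    return [(i * dt, dt) for i in range(num_loops)]


def run_loops(stack: Callable[[Any, Any], Any],
              x: Any,
              num_loops: int,
              cond: Callable[[float, float], Any]) -> Any:
    """Apply the same shared stack `num_loops` times.

    `stack(x, c)` plays the role of SharedBlock.forward; `cond(t, dt)` is a
    hypothetical conditioning function producing c for the current loop.
    """
    for t, dt in uniform_schedule(num_loops):
        c = cond(t, dt)   # loop-conditioning from current time and step size
        x = stack(x, c)   # one pass through all K shared blocks
    return x


# Toy check with scalars: each loop adds its step size, so any budget sums to 1.
toy_stack = lambda x, c: x + c
toy_cond = lambda t, dt: dt
short = run_loops(toy_stack, 0.0, num_loops=2, cond=toy_cond)
full = run_loops(toy_stack, 0.0, num_loops=8, cond=toy_cond)
```

In a model, `stack` would be a `SharedBlock` instance and `x` a `[B, T, D]` tensor; a shorter budget then takes fewer, larger steps through the same parameters, which is what allows the $M$-loop and $L$-loop trajectories to share one set of weights.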

```{=latex}
\clearpage
```

[^1]: Akin to Recurrent Depth [@geiping2025scaling] without sandwich normalizations or a randomly initialized recurrent state. We trained both variants; the simpler one performs better.