---
author:
- Sahil Goyal
- Swayam Agrawal
- Gautham Govind Anil
- Prateek Jain
- Sujoy Paul
- Aditya Kusupati
title: '`\ours`{=latex}: `\oursfull`{=latex} for Visual Generation'
---

```{=latex}
\renewcommand{\internalonly}{}
```
```{=latex}
\newcommand{\ours}{ELT}
```
```{=latex}
\newcommand{\oursfull}{Elastic Looped Transformers}
```
```{=latex}
\newcommand{\oursall}{Elastic Looped Transformers (ELT)}
```
```{=latex}
\newcommand{\distill}{ILSD}
```
```{=latex}
\newcommand{\distillfull}{Intra-Loop Self Distillation (ILSD)}
```
```{=latex}
\renewcommand{\today}{}
```
```{=latex}
\maketitle
```
Introduction {#sec:intro}
============

Traditional techniques to increase compute capacity in deep learning models, such as stacking deeper layers or increasing network width, inevitably lead to a proportionally larger memory footprint. Recurrence offers a powerful alternative paradigm, enabling models to utilize large amounts of compute while maintaining a minimal memory footprint by leveraging the same set of parameters repeatedly. This architectural efficiency parallels biological visual systems [@kar2019evidence; @kietzmann2019recurrence], where recurrent processing, rather than strictly feedforward pathways, is essential for resolving complex visual inputs.

While looping of transformers was popularized by Universal Transformers [@dehghani2018universal] and has recently empowered language models with stronger reasoning capabilities [@saunshi2025reasoninglatentthoughtspower; @yang2023looped], its potential for high-fidelity visual generation remains largely untapped. From a practical standpoint, compared to traditional models, Looped Transformers (a) are extremely parameter efficient and can perform significantly more compute (FLOPs) per parameter; (b) can achieve higher throughput by minimizing the "memory wall" bottleneck: they use a compact set of shared parameters and keep most of their parameter footprint on-chip or adjacent to the chip, which avoids the cost of repeated transfers between different units of the accelerator (GPUs/TPUs) required in large transformers; and (c) can exhibit robustness against overfitting in data-constrained settings.

```{=latex}
\resizebox{0.8\linewidth}{!}{
        % Standalone TikZ document; preserve it in the source archive and omit
        % expansion from the Markdown conversion input.
    }
```
Despite the parameter efficiency of looped architectures, their training remains challenging because intermediate representations often remain uninterpretable until the final loop (see `\Cref{fig:ilsd_trajectory}`{=latex}). We address this by introducing `\oursfull`{=latex} (`\ours`{=latex}), a class of generative models designed for progressive refinement. Unlike traditional recurrent transformers, `\ours`{=latex} is optimized to provide meaningful, high-quality synthesis even at intermediate repeats. This enables Any-Time (elastic) inference, where a single model can scale its compute based on available resources without sacrificing much generation quality. We present a pictorial representation of our proposed method in `\Cref{fig:looping_main}`{=latex} and generation results for Any-Time inference in `\Cref{fig:gen_dit_ilsd}`{=latex}.

To achieve this functional flexibility across loops, we propose an `\distillfull`{=latex} algorithm for training looped transformers. Rather than treating the loop as a fixed-depth process, our framework operates as a dual-path system: a teacher path executes the full loop count to provide a high-fidelity target, while a student path, defined strictly as a subset of the teacher's trajectory, learns to produce comparable results with fewer iterations. Note that both paths have the same parameter count. By framing the student process as a literal subset of the teacher's forward propagation, we ensure there is no additional overhead during training. This approach forces the shared parameters to compress complex transformations into earlier loops. Consequently, the model does not just fix the output at the end; it learns an efficient, progressive refinement process that maintains generation quality regardless of the exit point, as motivated in `\Cref{fig:ilsd_trajectory}`{=latex}. Our contributions and findings are summarized as follows:

`\flushleft `{=latex}**State-of-the-Art Parameter Efficiency**: Through the reuse of a block of transformer layers across loops, `\ours`{=latex} achieves a competitive FID of *2.0* on class-conditional ImageNet $256 \times 256$ and FVD of *72.8* on class-conditional UCF-101. This represents a *4*$\times$ reduction in parameters compared to the baselines MaskGIT [@chang2022maskgit] (image generation) and MAGVIT [@yu2023magvit] (video generation), while matching or improving their performance under iso-inference-compute settings.

`\flushleft `{=latex}**Elastic/Any-Time Inference**: By treating the looped block as an iterative refiner, our models enable Any-Time inference [@zilberstein1996using], allowing users to traverse the Pareto frontier of quality versus compute at test time without retraining. This makes it possible to serve diverse deployment tiers, from latency-critical on-device generation (few loops) to high-fidelity cloud rendering (more loops).

`\flushleft `{=latex}**Scalability**: While model size remains a primary driver of quality, recursive looping provides a unique test-time compute lever that scales predictably across both Masked Generative Transformers [@chang2022maskgit; @yu2023magvit; @yu2023language] and Diffusion Transformers [@peebles2023scalablediffusionmodelstransformers].

Related Work
============

`\flushleft `{=latex}**Recursive Architectures:** The principle of recursively applying a shared block of parameters to enhance model efficiency is a well-established concept. This approach, often referred to as *looping*, has been shaped by Universal Transformers [@dehghani2018universal], which introduced the idea of iterating over a single transformer layer. Notably, there has been a resurgence of interest in this area. @saunshi2025reasoninglatentthoughtspower demonstrated the power of looping for sophisticated reasoning tasks. @gatmiry2024loopedtransformerslearnimplement showed that Looped Transformers can learn to implement multi-step gradient descent for in-context learning, providing a deeper understanding of their capabilities. Mixture-of-Recursions [@bae2025mixtureofrecursionslearningdynamicrecursive] further explored input-dependent dynamic and adaptive depth in looped transformers. Mixture-of-Recursions-VIT [@li2025morvitefficientvisiontransformer] extends it to image understanding. @fan2024looped utilized looping for length generalization. @geiping2025scalingtesttimecomputelatent demonstrated that scaling test-time compute via recurrent depth allows language models to perform complex latent reasoning.

Deep Equilibrium Models (DEQs) [@bai2019deepequilibriummodels; @pokle2022deepequilibriumapproachesdiffusion; @geng2023onestepdiffusiondistillationdeep; @mccallum2025reversibledeepequilibriummodels; @anil2022pathindependentequilibriummodels; @gabor2024positiveconcavedeepequilibrium], instead of unrolling a weight-tied layer for a fixed number of iterations, define the output as the fixed point of a non-linear transformation. Unlike DEQs that rely on black-box solvers for an analytical fixed point, our `\ours`{=latex} framework explicitly optimizes unrolled intermediate states via `\distillfull`{=latex}, retaining the flexibility of Any-Time inference without requiring the network to reach a strict analytical fixed point.

`\flushleft `{=latex}**Parameter-Efficient Visual Generation:** Standard efficiency techniques [@menghani2023efficient] that work across deep learning models also help in speeding up visual generation models. MobileDiffusion [@zhao2024mobilediffusioninstanttexttoimagegeneration] prunes redundant residual blocks and replaces standard layers with separable convolutions in UNet to get an optimized architecture ($\sim$400M) for on-device visual generation. EdgeFusion [@castells2024edgefusionondevicetexttoimagegeneration] employs BK-SDM, a lightweight Stable Diffusion variant, and refines the step distillation process of Latent Consistency Model. MaGNeTS [@goyal2025maskedgenerativenestedtransformers] trains a family of nested transformers [@kudugunta2023matformer; @kusupati2022matryoshka] with schedules of model sizes over the generation process, without increasing the parameter count.

`\flushleft `{=latex}**Elastic Visual Generation:** The paradigm of Any-Time or elastic generation focuses on decoupling a model's parameter count from its computational depth. E-DiT [@wang2026elasticdiffusiontransformer] introduces adaptive block skipping and MLP width reduction, allowing a single model to traverse varying computational budgets without retraining. In the context of visual reasoning, LoopViT [@shu2026loopvitscalingvisualarc] uses a weight-tied recursive architecture, employing a parameter-free dynamic exit mechanism to halt inference based on the uncertainty of the prediction. EvoSearch [@he2025scalingimagevideogeneration] proposes a search-based strategy that optimizes sampling trajectories at inference time. Unlike these methods that rely on architectural skipping or external search, our `\ours`{=latex} framework, equipped with the `\distillfull`{=latex} algorithm, directly regularizes the recursive process, ensuring stability across the entire loop spectrum. Methods like ALIT [@duggal2024adaptivelengthimagetokenization], FlexTok [@bachmann2025flextokresamplingimages1d], One-D-Piece [@miwa2025onedpieceimagetokenizermeets], ElasticTok [@yan2025elastictokadaptivetokenizationimage], and CAT [@shen2025catcontentadaptiveimagetokenization] use tail dropping to allow elasticity in token sequence length.

`\flushleft `{=latex}**Relation to Few-Step and Consistency Models**: Recent approaches such as Consistency Models [@song2023consistencymodels] and progressive distillation [@salimans2022progressivedistillationfastsampling] address inference efficiency by reducing the number of sampling steps (inter-step acceleration). `\ours`{=latex} is fundamentally orthogonal: it reduces compute *within* each sampling step by varying the loop count $L$ (intra-step acceleration). These two axes are complementary: `\ours`{=latex} can be combined with few-step methods to achieve further efficiency gains. Moreover, unlike consistency models, which require specialized training objectives and architectural constraints to enable variable-step inference, `\ours`{=latex}'s elastic capability emerges naturally from `\distill`{=latex} applied to standard training objectives. We note that `\ours`{=latex} is particularly compelling for one-step generative paradigms, where the loop count $L$ becomes the *sole* lever for controlling the compute-quality trade-off at inference time.

Preliminaries {#sec:pre}
=============

`\flushleft `{=latex}**Masked Generative Models**: Masked Generative Image Transformer (MaskGIT) [@chang2022maskgit] introduced a novel approach to image generation that significantly differs from traditional autoregressive models. In autoregressive decoding, images are generated sequentially, one pixel/token at a time, following a raster scan order [@esser2021taming; @kondratyuk2023videopoet; @wang2024parallelizedautoregressivevisualgeneration; @li2024autoregressiveimagegenerationvector]. This sequential approach is computationally inefficient, as each token is conditioned only on the previously generated tokens, leading to a bottleneck in processing time. MaskGIT generates all tokens of an image simultaneously, while iteratively refining them. This method enables significant acceleration in the decoding process. The tokens are discrete and obtained using Vector Quantized (VQ) autoencoders, learned with self-reconstruction and photo-realism losses [@yu2023magvit]. The iterative parallel decoding process is represented as: $$\mathbf{X}_{k} \leftarrow \mathrm{Mask \circ Sample}({M}({\mathbf{X}}_{k-1}, c), k)
    \label{eqn:maskgit}$$ where $\mathbf{X} \in \mathbb{Z}^N_{\geq 0}$ are the input tokens, $N$ is the number of tokens, $k \in [1, K]$ denotes the iteration number with $K$ being the total number of iterations, $\mathbf{X}_0$ is either completely masked for full generation or partially masked for conditional generation tasks such as frame prediction, and $c$ is the category of the image/video being generated. The $\mathrm{Sample}$ function uses the logits predicted by the model ${M(.)}$, introduces some randomness, and sorts the predictions by confidence, unmasking only top-p tokens while masking the rest.
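The masking half of $\mathrm{Mask \circ Sample}$ can be illustrated with a small sketch. It assumes a cosine schedule of the form $\cos(\tfrac{\pi}{2} \cdot k/K)$ for the fraction of tokens that stay masked after iteration $k$; the function names, the rounding rule, and the use of plain Python lists are hypothetical choices for illustration, not the implementation used in the experiments.

```python
import math

def num_masked_after_step(k, K, n_tokens):
    """Tokens left masked after iteration k of K (assumed cosine schedule)."""
    ratio = math.cos(math.pi / 2 * k / K)
    # floor so that the final iteration (k == K) leaves nothing masked
    return max(0, int(math.floor(ratio * n_tokens)))

def mask_step(confidences, n_keep_masked):
    """Re-mask the n_keep_masked least confident tokens.

    Returns a list of booleans, True where the token stays masked.
    Simplification: a full decoder would also protect tokens that were
    already unmasked in earlier iterations.
    """
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    masked = set(order[:n_keep_masked])
    return [i in masked for i in range(len(confidences))]

# Example: 16 x 16 = 256 tokens decoded in K = 8 iterations.
schedule = [num_masked_after_step(k, 8, 256) for k in range(9)]
```

The schedule starts fully masked and ends fully unmasked, with few tokens committed in early iterations and many in late ones.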

`\flushleft `{=latex}**Diffusion Models**: Diffusion models [@ho2020denoising; @song2020score] generate data by learning to reverse a process that gradually corrupts a signal $\mathbf{X}_0$ into Gaussian noise $\mathbf{X}_T$ through a predefined noise schedule. At its core, diffusion denoising/sampling proceeds iteratively as follows: $$\begin{aligned}
    \mathbf{X}_{t-1} \leftarrow \alpha_1(t) \cdot \mathbf{X}_{t} + \alpha_2(t) \cdot M(\mathbf{X}_{t}, t, c) + \alpha_3(t) \cdot \mathbf{z}\end{aligned}$$ where $\mathbf{X}_{t}$ is the (partially) denoised vector at time $t$, $\alpha_1(t), \alpha_2(t), \alpha_3(t)$ are time-dependent scalars defined by the noise schedule, and $\mathbf{z}$ is a sample drawn from the standard normal distribution. The model $M$ takes as input $\mathbf{X}_{t}$ along with time $t$ and class label $c$. From the perspective of the forward diffusion process, the generative task can be thought of as predicting the noise added to a latent representation $\mathbf{X}_t$ at timestep $t$, conditioned on a class label $c$. While architectures based on U-Nets [@ronneberger2015unetconvolutionalnetworksbiomedical] have traditionally been used, the Diffusion Transformer (DiT) [@peebles2023scalablediffusionmodelstransformers] architecture shifts away from this design by treating image latents as sequences of tokens and using transformer blocks to process these tokens. As with MaskGIT, we explore replacing the typical DiT transformers with `\oursfull`{=latex} for denoising at time $t$.

In summary, both masked generative models and diffusion models involve recursive refinement over multiple sampling steps, sharing parameters across sampling steps. Unlike standard transformers, our `\ours`{=latex} framework aligns the architecture of model $M$ with the progressive refinement process by implementing $M$ as recurrent, weight-shared transformer blocks. This allows the model to perform recursive refinement within each sampling step, providing a test-time compute lever to trade off inference speed and generation quality.

Method {#sec:method}
======

`\flushleft `{=latex}**Looping Mechanism**: Let the number of transformer layers to be looped be $N$ and the number of loops per sampling step be $L$. The total effective depth for a single sampling or denoising step of the generation process is then $N \times L$. Let $f_{\theta_i}(\mathbf{x})$ denote a single transformer layer with parameters $\theta_i$. We define a composite block $g_{\Theta}(\mathbf{x})$ consisting of $N$ unique transformer layers with parameters $\Theta = \{\theta_1, \theta_2, \dots, \theta_N\}$ as follows: $$\begin{aligned}
    g_{\Theta}(\mathbf{x}) = f_{\theta_N}(f_{\theta_{N-1}}(\cdots f_{\theta_1}(\mathbf{x})))\end{aligned}$$

```{=latex}
\resizebox{0.9\linewidth}{!}{
        % Standalone TikZ document; preserve it in the source archive and omit
        % expansion from the Markdown conversion input.
    }
```
In a standard transformer model with total depth $\mathcal{D} = N \times L$, the effective transformation $F_{\mathcal{D}}(\mathbf{x})$ would require $\mathcal{D}$ sets of unique parameters. In *looping*, we reuse the same block of parameters $\Theta$ for $L$ successive applications, resulting in only $N$ unique layers of parameters. The effective transformation for an $N \times L$ configuration is given by: $$\begin{aligned}
    F_{(N, L)} (\mathbf{x}) =   \underbrace{g_{\Theta}\left(g_{\Theta} (\cdots g_{\Theta}(\mathbf{x}))\right)}_{L \text{ loops}} \equiv g_{\Theta}^L(\mathbf{x})\end{aligned}$$

This looping architecture decouples physical model size from computational depth (see `\Cref{fig:looping_main}`{=latex} for a visual overview). The parameter count ($|\Theta|$) is constrained by the number of unique layers ($N$), while representational capacity and depth ($\mathcal{D}$) scale with loop count ($L$). This setup provides two primary advantages: extreme parameter efficiency and high throughput. See `\Cref{sec:image_gen}`{=latex} for details.
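As a minimal sketch of the definitions of $g_{\Theta}$ and $F_{(N, L)}$, the toy code below replaces each transformer layer with a hypothetical affine map on a list of floats. Only the control flow is meant to be representative: `make_layer`, `block`, and `looped_forward` are illustrative names, and a real layer would be an attention plus MLP block.

```python
def make_layer(scale, shift):
    """Hypothetical stand-in for one transformer layer f_theta."""
    def layer(x):
        return [scale * v + shift for v in x]
    return layer

def block(layers, x):
    """g_Theta: apply the N unique layers once, in order."""
    for f in layers:
        x = f(x)
    return x

def looped_forward(layers, x, num_loops):
    """F_(N, L): apply the same block num_loops times.

    Returns the state after every loop, so any prefix can serve as an
    early-exit output.
    """
    states = []
    for _ in range(num_loops):
        x = block(layers, x)
        states.append(x)
    return states

layers = [make_layer(0.5, 1.0), make_layer(1.0, -0.5)]  # N = 2 unique layers
states = looped_forward(layers, [3.0], num_loops=3)     # L = 3 loops
effective_depth = len(layers) * len(states)             # N x L = 6
```

Here two unique layers are traversed six times in total, so the depth grows with the loop count while the number of stored layers stays fixed.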

`\flushleft `{=latex}**`\distillfull`{=latex}**: In a standard weight-tied (looped) transformer, the model is typically optimized only for its final output after a fixed $L_{\text{max}}$ iterations, i.e., the loop count for which it is trained. However, this creates a "black box" internal trajectory where intermediate loops ($L < L_{\text{max}}$) may produce suboptimal representations until the final projection layer. By treating the looped block $g_{\Theta}$ as an iterative refiner, we can motivate a training objective that ensures the model remains useful at multiple depths. This not only encourages the model to learn a more robust, incremental transformation but also allows for elastic inference, where the model can be exited early with minimal performance drop. Towards this end, we propose `\distillfull`{=latex}.

From a distillation perspective, `\distill`{=latex} leverages the fact that a model with more loops ($L_{\text{max}}$) naturally possesses a more mature and refined representational space than its shallower version ($L_{\text{int}}$), even though they have the same unique parameters. By treating the full-depth model as an internal teacher, we provide a high-fidelity, low-variance signal for the shallower student to follow (see `\Cref{fig:looping_main}`{=latex}). This forces the shared parameters $\Theta$ to compress complex transformations into fewer steps, effectively distilling the knowledge of the deep model into the early stages of the computation.

`\flushleft `{=latex}**Stochastic Student Sampling ($S^3$)**: We train with a fixed number of loops, $L_{\text{max}}$, for a block of $N$ layers with unique parameters. During each training step, we treat the model as a dual-path system sharing a single set of parameters $\Theta$. We define a teacher path that executes the full $L_{\text{max}}$ loops and a stochastic student path that exits early at an intermediate loop $L_{\text{int}}$. The intermediate path receives supervision from the ground truth labels as well as the online teacher with maximum training loop count $L_{\text{max}}$.

We denote the training loss for the output of a configuration $N \times L$ as $\mathcal{L}_{\Theta}(F_{(N, L)}(\mathbf{x}), \mathbf{y})$, where $\mathbf{y}$ denotes the ground truth for a given input masking or noise level. At every training iteration, we randomly sample an intermediate loop count $L_{\text{int}}$ from a uniform distribution such that $L_{\text{min}} \leq L_{\text{int}} < L_{\text{max}}$. Note that $L_{\text{min}}$ is used only to constrain the student sampling distribution. The effective joint loss, $\mathcal{L}^{\text{\distill}}_{\Theta}$, is computed as:

$$\begin{aligned}
\mathcal{L}^{\text{\distill}}_{\Theta} = & \mathcal{L}^\text{GT}(F_{(N, L_{\text{max}})}(\mathbf{x}), \mathbf{y}) && \text{\small (1) Ground-truth for teacher} \\
+ &\lambda \mathcal{L}^\text{GT}(F_{(N, L_{\text{int}})}(\mathbf{x}),\mathbf{y}) && \text{\small (2) Ground-truth for student} \\
+ & (1 - \lambda) \mathcal{L}^{\text{dist}}(F_{(N, L_{\text{int}})}(\mathbf{x}), \text{sg}(F_{(N, L_{\text{max}})}(\mathbf{x}))) && \text{\small (3) Intra-Loop Self Distillation} \\
\text{with} \quad &L_\text{int} \sim \mathcal{U}\{L_{\text{min}}, \dots, L_{\text{max}} - 1\}
&& \text{\small  Stochastic Student Sampling}\end{aligned}$$ `\noindent`{=latex} where sg is the stop-gradient applied to the teacher ($L_{\text{max}}$) in `\distill`{=latex}, and $\lambda$ is a hyperparameter controlling the weight between the ground-truth and distillation losses. We introduce a curriculum for $\lambda$ and linearly decay it from $1$ to $0$ as training progresses. This initially anchors the student to reliable ground-truth labels while the teacher is still untrained and gradually shifts it toward mimicking the teacher's predictions once they have matured. We found this linear schedule to be effective across all our experiments and did not observe sensitivity to the decay rate, as long as the transition is gradual.
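The three ingredients of the joint objective (student sampling, the $\lambda$ curriculum, and the weighted sum) can be sketched as follows. This is an illustrative outline under stated assumptions: the decay is taken to be linear in the training step, `total_steps` is a hypothetical parameter, and the individual loss values are passed in as plain numbers rather than computed from a model.

```python
import random

def lambda_schedule(step, total_steps):
    """Linear decay of lambda from 1 (start of training) to 0 (end)."""
    return max(0.0, 1.0 - step / total_steps)

def sample_student_loops(l_min, l_max, rng):
    """Draw L_int uniformly from the integers with l_min <= L_int < l_max."""
    return rng.randrange(l_min, l_max)

def ilsd_loss(gt_teacher, gt_student, dist, lam):
    """Joint loss: teacher GT + lam * student GT + (1 - lam) * distillation."""
    return gt_teacher + lam * gt_student + (1.0 - lam) * dist

rng = random.Random(0)
l_int = sample_student_loops(l_min=1, l_max=4, rng=rng)
lam = lambda_schedule(step=50, total_steps=100)
loss = ilsd_loss(gt_teacher=1.0, gt_student=2.0, dist=4.0, lam=lam)
```

Early in training the student term is dominated by ground-truth supervision; by the end it is driven entirely by the teacher's output.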

`\flushleft `{=latex}**Loss Formulation**: The exact forms of $\mathcal{L}^\text{GT}$ and $\mathcal{L}^\text{dist}$ depend on the algorithm used for training. For masked generative models with discrete tokens, we use the cross-entropy loss for both ground-truth and distillation: $$\begin{aligned}
\mathcal{L}^\text{GT} &= -\sum_{i \in \text{Mask}} \log P_{(N, L_{\text{int}})}(y_i \mid \mathbf{x}_{\text{mask}}) \\
\mathcal{L}^\text{dist} &= -\sum_{i \in \text{Mask}} \sum_{v \in \mathcal{V}} P_{(N, L_{\text{max}})}(v \mid \mathbf{x}_{\text{mask}}) \log P_{(N, L_{\text{int}})}(v \mid \mathbf{x}_{\text{mask}})\end{aligned}$$ where $\mathbf{x}_{\text{mask}}$ is the masked input, $y = \{y_i\}_{i \in \text{Mask}}$ represents the ground-truth tokens for the masked positions, and $\mathcal{V}$ is the full vocabulary of the tokenizer. The ground-truth loss is written here for the student; the teacher term replaces $L_{\text{int}}$ with $L_{\text{max}}$. For diffusion, we use the sigmoid-weighted Mean Squared Error (MSE) for both ground-truth and distillation losses. Let $\mathbf{x}_t$ be the noised version of the ground-truth latent $\mathbf{x}_0$. The ground-truth loss is: $$\begin{aligned}
    \mathcal{L}^{\text{GT}}= w(t) \norm{F_{(N, L)}(\mathbf{x}_t) - \mathbf{x}_0}_2^2\end{aligned}$$ where $L$ is $L_{\text{int}}$ for student and $L_{\text{max}}$ for teacher. $w(t)$ is a time-dependent sigmoid weighting term [@hoogeboom2025simpler]. The distillation loss is: $$\begin{aligned}
    \mathcal{L}^{\text{dist}}= w(t) \norm{F_{(N, L_{\text{max}})}(\mathbf{x}_t) - F_{(N, L_{\text{int}})}(\mathbf{x}_t)}_2^2\end{aligned}$$
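The loss forms above can be written out directly. The sketch below is a plain-Python illustration, assuming per-position logits are given as lists of floats and that the weight $w(t)$ has already been evaluated and is passed in as a number; the function names are hypothetical, and the teacher quantities are treated as constants to mirror the stop-gradient.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def gt_cross_entropy(student_logits, targets):
    """L^GT (discrete): negative log-likelihood of the ground-truth tokens."""
    return -sum(log_softmax(row)[y] for row, y in zip(student_logits, targets))

def dist_cross_entropy(student_logits, teacher_logits):
    """L^dist (discrete): cross-entropy of student against teacher probs."""
    total = 0.0
    for s_row, t_row in zip(student_logits, teacher_logits):
        s_logp = log_softmax(s_row)
        t_prob = [math.exp(v) for v in log_softmax(t_row)]  # constant target
        total -= sum(p * lp for p, lp in zip(t_prob, s_logp))
    return total

def weighted_mse(pred, target, weight):
    """Diffusion case: weight * squared L2 distance between pred and target."""
    return weight * sum((p - t) ** 2 for p, t in zip(pred, target))

# One masked position, vocabulary of size 2, uninformative logits.
ce_gt = gt_cross_entropy([[0.0, 0.0]], [0])
ce_dist = dist_cross_entropy([[0.0, 0.0]], [[0.0, 0.0]])
mse = weighted_mse([1.0, 2.0], [0.0, 0.0], weight=0.5)
```

For the diffusion case, `weighted_mse` covers both terms: the target is $\mathbf{x}_0$ for the ground-truth loss and the teacher output for the distillation loss.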

Note that gradients from both computational paths, corresponding to $L_{\text{int}}$ and $L_{\text{max}}$, update the single, shared set of block parameters $\Theta$. This joint optimization provides a significantly richer training signal; the shared block $g_{\Theta}$ is forced to learn a transformation that is not only effective at $L_{\text{int}}$ loops but remains incrementally useful for the subsequent iterations up to $L_{\text{max}}$ loops. This constraint prevents the model from learning a *shortcut* that might minimize loss at a specific depth but fail when composed further. Consequently, the shared block generalizes better to lower depths, leading to higher performance even with fewer loops. Unlike traditional distillation, which requires separate forward passes through the student and teacher models, `\distillfull`{=latex} adds minimal training overhead: computing $F_{(N, L_{\text{int}})}(\mathbf{x})$ is a required intermediate step in computing $F_{(N, L_{\text{max}})}(\mathbf{x})$, i.e., the student trajectory ($L_{\text{int}}$) is a strict prefix of the teacher trajectory ($L_{\text{max}}$). Refer to `\Cref{alg:elt_train}`{=latex} and `\Cref{alg:elt_infer}`{=latex} for details of our training and inference algorithms, respectively.
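The prefix property can be made concrete with a short sketch: a single pass of $L_{\text{max}}$ loops yields both the student and the teacher states, so the block is never evaluated more than $L_{\text{max}}$ times. The names `forward_with_student_exit` and `counting_block` are hypothetical, and the block is a toy scalar map; a real model would additionally apply its output head to both states.

```python
def forward_with_student_exit(block_fn, x, l_int, l_max):
    """Run l_max loops once and capture the state after loop l_int."""
    assert 1 <= l_int < l_max
    student = None
    for loop in range(1, l_max + 1):
        x = block_fn(x)
        if loop == l_int:
            student = x  # student output is a prefix of the teacher pass
    return student, x

calls = {"n": 0}

def counting_block(x):
    """Toy shared block that also counts how often it is evaluated."""
    calls["n"] += 1
    return 0.5 * x + 1.0

student_out, teacher_out = forward_with_student_exit(
    counting_block, 4.0, l_int=2, l_max=4
)
```

Running the student and teacher as two separate models would cost `l_int + l_max` block evaluations; here the counter ends at `l_max`.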

Experiments and Results
=======================

We conduct extensive experiments using masked generative transformers and diffusion transformers to demonstrate the generality and efficacy of our approach on class-conditional image generation. We also experiment with class-conditional video generation using masked generative transformers. We first detail our experimental setup and then present the results.

```{=latex}
\resizebox{0.8\columnwidth}{!}{%
\begin{tabular}{lcccccc}
    \toprule
    Model & AR & FID $\downarrow$ & IS $\uparrow$ & \# params & \# steps & \# GFLOPs\\
    \midrule
    BigGAN-deep$^{\Delta}$ \citep{brock2018large}&  & 7.0 & 171.4 & 160M & 1 & - \\
    StyleGAN-XL$^{\Delta g}$ \citep{sauer2022styleganxlscalingstyleganlarge}&  & 2.3 & 265.1 & 166M & 1 & - \\
    \midrule
    Improved DDPM$^{\Delta}$ \citep{nichol2021improved}&  & 12.3 & - & 280M & 250 & $>$150k\\
    ADM + Upsample$^{\Delta g}$ \citep{dhariwal2021diffusion}&  & 3.9 & 215.8 & 554M & 250 &  371k\\
    LDM-4$^{\Delta g*}$  \citep{ldm}& & 3.6 & 247.7 & 400M & 250 &  51.5k\\
    DiT-XL/2$^{\Delta g*}$ \citep{peebles2023scalablediffusionmodelstransformers}& & 2.3 & 278.2 & 675M & 250 & 59.5k \\
    MDT$^{\Delta g* }$ \citep{gao2023masked}&  & 1.8 & 283.0 & 676M & 250 & $>$59k \\
    MaskDiT$^{\Delta g* }$ \citep{zheng2023fast}&  & 2.3 & 276.6 & 736M & 250 & $>$28k   \\
    CDM$^{\Delta}$ \citep{ho2022cascaded}&  & 4.9 & 158.7 & - & 8100 & -  \\
    RIN$^{\Delta}$ \citep{jabri2022scalable}&  & 3.4 & 182.0 & 410M & 1000 & 334k  \\
    Simple Diffusion$^{\Delta g}$ \citep{hoogeboom2023simple}&  & 2.4 & 256.3 & 2B & 512 & - \\
    VDM++$^{\Delta g}$  \citep{kingma2023understanding}& & 2.1 & 267.7 & 2B & 512 & - \\
    EDiff$^{\Delta g}$  \citep{hang2024efficientdiffusiontrainingminsnr}&  & 2.1 & - & 450M & 50 & 119k \\
    LPDM-ADM$^{\Delta g}$  \citep{wang2023patchdiffusionfasterdataefficient}&  & 2.7 & - & - & 50 & 7.8k \\
    MAR$^{\Delta g}$  \citep{li2024autoregressiveimagegenerationvector}& $\checkmark$ & 1.8 & 296.0 & 479M & 128 & - \\
    \midrule
    VQVAE-2$^{\Delta}$ \citep{razavi2019generating}& $\checkmark$ & 31.1 & $\sim$45 & 13.5B & 5120 & -\\
    VQGAN$^{\Delta}$ \citep{esser2021taming}& $\checkmark$ & 15.8 & 78.3 & 1.4B & 256 & -\\
    MaskGIT$^{\Delta}$\citep{chang2022maskgit}& & 6.2 & 182.1 & 227M & 8 & 647 \\
    Mo-VQGAN$^{\Delta}$ \citep{zheng2022movqmodulatingquantizedvectors}& & 7.2 & 130.1 & 389M & 12 & $\sim$1k\\
    MaskBit$^{\Delta g}$ \citep{weber2024maskbitembeddingfreeimagegeneration}&  & 1.7 & 341.8 & 305M & 64 & 10.3k\\
    PAR-$4\times^{\Delta}$ \citep{wang2024parallelizedautoregressivevisualgeneration}&$\checkmark$ & 3.8 & 218.9 & 343M & 147 & -\\
    PAR-$16\times^{\Delta}$ \citep{wang2024parallelizedautoregressivevisualgeneration}&$\checkmark$ & 2.9 & 262.5 & 3.1B & 51 & -\\
    \midrule
    MaskGIT-L$^g$ & & 2.1 & 270.1 & 303M & 24 & 3.7k \\
    MaskGIT-XL$^g$ & & 2.0 & 294.8 & 446M & 24 & 3.9k \\
    \midrule
    \textbf{\ours-L} ($\bm{8N \times 3L}$)& & 2.2 & 254.3 & \textbf{101M} & 24 & 3.7k \\
    \textbf{\ours-L} ($\bm{12N \times 2L}$)& & 2.1 & 281.8 & \textbf{152M} & 24 & 3.7k \\
    \textbf{\ours-XL} ($\bm{7N \times 4L}$)& & 2.0 & 266.1 & \textbf{111M} & 24 & 3.9k \\
    \bottomrule
\end{tabular}
}
```
`\label{tab:img_results}`{=latex}

Experimental Setup {#sec:exp_setup}
------------------

`\flushleft `{=latex}**Datasets**: We experiment on ImageNet $256\times256$ [@deng2009imagenet] for image generation and UCF-101 [@soomro2012ucf101dataset101human] for class-conditional video generation.

`\flushleft `{=latex}**Implementation Details**: **(i) Masked Generative Transformers**: We use pretrained tokenizers from MaskGIT [@chang2022maskgit] (images) and MAGVIT [@yu2023magvit] (videos) with a codebook size of 1024 tokens. Image models are trained at $256\times256$ resolution, compressed to $16\times16$ discrete tokens with an embedding dimension of 1024. Video models are trained on $16 \times 128 \times 128$ sequences, compressed to $4 \times 16 \times 16$ tokens. Following MaskGIT, we adopt the BERT [@devlin2019bertpretrainingdeepbidirectional] architecture as the transformer backbone and perform experiments at several model scales to understand the scaling behavior of `\ours`{=latex}. We train all models for 270 epochs unless otherwise specified. We employ a cosine schedule for unmasking tokens during inference. For image generation, we use classifier-free guidance by dropping class condition labels for $10\%$ of the training batches. **(ii) Diffusion Transformers:** We use a pretrained Stable Diffusion v1.4 VAE [@ldm] model to map $256 \times 256$ images into a continuous $32 \times 32 \times 4$ latent space ($8\times$ spatial downsampling). We train a DDPM-style diffusion model which operates on these latents using a DiT architecture. We employ a shifted cosine noise schedule and sigmoid-weighted MSE loss for training [@hoogeboom2025simpler]. Models are trained for 500K steps with a batch size of $512$ using Adam. Unless mentioned otherwise, sampling uses 512-step DDPM with classifier-free guidance scale $3.0$. See the Appendix for more details.

`\flushleft `{=latex}**Evaluation Metrics and Efficiency**: To evaluate the quality of synthesized content, we employ Fréchet Inception Distance (FID) [@heusel2017gans; @dhariwal2021diffusion] and Inception Score (IS) [@salimans2016improvedtechniquestraininggans] for image generation tasks, and Fréchet Video Distance (FVD) [@unterthiner2019fvd] for video generation. Beyond generative quality, we evaluate model efficiency using inference-time GFLOPs and throughput (samples generated per second). In the proposed $N \times L$ design space, for a fixed model scale and block size $N$, both computational complexity and latency scale linearly with the number of loops $L$. This relationship allows us to precisely navigate the trade-off between generation quality and inference speed by modulating the loop count.

![`\small `{=latex}**Pareto front of FID vs. Inference GFLOPs**. The black curve denotes the fit (FID = $1922.5 \cdot G^{-0.95} + 1.48$) over Pareto-optimal configurations, representing the best achievable FID for a given computational budget. Points are labeled as $N \text{ layers} \times L \text{ loops}$. Results demonstrate that `\ours`{=latex} scales as effectively as the baseline while remaining significantly more parameter-efficient. Scaling both the model dimension ($d$) and the number of loops ($L$) follows a predictable efficiency frontier, where larger models with fewer loops often compete with smaller models with more loops at specific target GFLOPs. Trends across faded points show scaling along the number of loops $L$ at inference, from a single run trained with a given $L_{\text{max}}$. Refer to `\cref{tab:model_scaling}`{=latex} for the exact baseline configuration, including the specific number of layers for each comparison point.](figures/fid_vs_gflops_imagenet.png){#fig:fid_vs_gflops_imagenet width="90%"}

![`\small `{=latex}**Scaling with Parameters**. We show the best-achievable FID as a function of model parameters (log scale). Each series corresponds to an iso-inference-compute configuration for a specific model width ($d$). Points are labeled as $N \ \text{layers} \times L \ \text{loops}$. Results show that increasing parameter count via model width yields superior FID. The proposed `\ours`{=latex} mechanism shows that even parameter-constrained models can achieve similar performance through recursive depth. Each `\ours`{=latex} point in the figure is the best inference configuration ($N \times L$) chosen from its corresponding training run ($N \times L_{\text{max}}$). Refer to `\Cref{tab:model_scaling}`{=latex} for the exact baseline configuration, including the specific number of layers for each comparison point.](figures/isocompute.png){#fig:fid_vs_params_imagenet width="\\textwidth"}

Image Generation {#sec:image_gen}
----------------

`\flushleft `{=latex}**Comparison with Baselines**: We present the results for $256 \times 256$ image generation on ImageNet-1k in `\Cref{tab:img_results}`{=latex}. Despite using **4**$\times$ fewer parameters, `\ours`{=latex}-XL achieves the same FID of 2.0 as MaskGIT-XL, which is the base setup for `\ours`{=latex}. As shown in recent literature, using a superior tokenizer [@zhao2024imagevideotokenizationbinary; @yu2023language; @weber2024maskbitembeddingfreeimagegeneration] or optimized training and inference configurations [@ni2024revisitingnonautoregressivetransformersefficient; @ni2024enatrethinkingspatialtemporalinteractions] can further boost `\ours`{=latex}'s performance.

We also compare `\ours`{=latex} against baselines within the diffusion framework in `\Cref{tab:dit}`{=latex}. Notably, `\ours`{=latex} outperforms the baseline model with 32 layers (FID 3.43) in the iso-inference-compute $8N \times 4L$ (FID 3.16) and $16N \times 2L$ (FID 2.83) settings, achieving $\textbf{4} \times$ and $\textbf{2} \times$ reductions in parameter count, respectively. The $1N \times 32L$ configuration (FID 10.30) reveals that a single unique transformer layer, despite 32 effective passes, lacks the representational capacity for high-fidelity generation, highlighting that a minimum block size $N$ is necessary to provide sufficient architectural expressiveness within each iteration. While vanilla looping (without `\distill`{=latex}) gives competitive performance when running inference with the same number of loops as training ($L=L_{max}$), performance degrades drastically for a lower number of loops; `\distill`{=latex} mitigates this degradation, as shown in `\Cref{fig:ilsd_dit}`{=latex}.
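The iso-inference-compute comparison above rests on simple arithmetic: the effective depth is $\mathcal{D} = N \times L$, and the parameter reduction is the ratio of baseline to `\ours`{=latex} parameter counts. The following is a minimal sketch that recomputes both from the values reported in the DiT table. The variable and function names are illustrative only, and the parameter counts are the rounded figures from the table, so the resulting ratios are approximate.

```python
# Rounded values copied from the DiT comparison table; all names are illustrative.
BASELINE_PARAMS_M = 2100  # DiT with 32 layers (2.1B parameters), in millions

# (unique layers N, loops L, parameters in millions, FID)
ELT_CONFIGS = [
    (1, 32, 69, 10.30),
    (4, 8, 271, 3.96),
    (8, 4, 539, 3.16),
    (16, 2, 1100, 2.83),
]


def effective_depth(n_layers: int, n_loops: int) -> int:
    """Number of layer applications in one forward pass: D = N * L."""
    return n_layers * n_loops


def param_reduction(elt_params_m: float,
                    baseline_params_m: float = BASELINE_PARAMS_M) -> float:
    """How many times fewer parameters the looped model has than the baseline."""
    return baseline_params_m / elt_params_m


for n, l, params_m, fid in ELT_CONFIGS:
    print(f"{n}N x {l}L: D={effective_depth(n, l)}, "
          f"~{param_reduction(params_m):.1f}x fewer params, FID {fid}")
```

Every configuration has the same effective depth as the 32-layer baseline, which is what makes the comparison iso-inference-compute; only the number of unique parameters changes.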

`\flushleft `{=latex}**Qualitative Results**: `\Cref{fig:gen_dit_ilsd}`{=latex}, `\Cref{fig:dit_qual_supp1}`{=latex}, and `\Cref{fig:dit_qual_supp2}`{=latex} compare `\ours`{=latex} against vanilla looped transformers within a diffusion framework. These comparisons show that `\ours`{=latex} unlocks Any-Time inference, providing a dynamic trade-off between generation quality and inference speed. `\Cref{fig:dit_final_gen_qual_supp}`{=latex} and `\Cref{fig:qual_elt_maskgit}`{=latex} visualize the generation results of `\ours`{=latex} in the diffusion and masked generative frameworks, respectively.

`\flushleft `{=latex}**Scaling Inference GFLOPs and Pareto Front**:  `\Cref{fig:fid_vs_gflops_imagenet}`{=latex} illustrates the trade-off between generation quality (FID) and inference compute (GFLOPs). The Pareto front (black curve) reveals that while increasing the loop count ($L$) consistently improves FID (faded points) for a fixed number of unique layers ($N$), the gains eventually diminish, at which point transitioning to the next architecture scale becomes more performant than over-looping smaller models. Crucially, ELT allows for Any-Time inference: we can traverse the Pareto curve at test time by simply adjusting $L$ to meet specific hardware constraints, without retraining.
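The fitted curve reported in the caption of `\Cref{fig:fid_vs_gflops_imagenet}`{=latex}, FID $= 1922.5 \cdot G^{-0.95} + 1.48$, can be evaluated directly to estimate the best achievable FID for a compute budget, or inverted to estimate the budget needed for a target FID. The sketch below does both. The function names and example budgets are hypothetical, and the fit should only be trusted within the range of configurations shown in the figure; extrapolating beyond it is an assumption.

```python
# Coefficients of the fitted Pareto front from the figure caption:
# FID = A * G**(-B) + C, where G is the inference compute in GFLOPs.
A, B, C = 1922.5, 0.95, 1.48


def pareto_fid(gflops: float) -> float:
    """Best FID predicted by the fit for a given inference budget (GFLOPs)."""
    if gflops <= 0:
        raise ValueError("gflops must be positive")
    return A * gflops ** (-B) + C


def gflops_for_fid(target_fid: float) -> float:
    """Invert the fit: compute needed to reach target_fid (must exceed C)."""
    if target_fid <= C:
        raise ValueError("target FID is at or below the fitted asymptote")
    return ((target_fid - C) / A) ** (-1.0 / B)


for g in (500, 1000, 2000, 4000):  # hypothetical budgets, for illustration
    print(f"{g:>5} GFLOPs -> predicted best FID {pareto_fid(g):.2f}")
```

The constant term $1.48$ acts as the asymptote of the fit: under this functional form, no amount of additional inference compute drives the predicted FID below it.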

```{=latex}
\captionof{table}{\small \textbf{Class-conditional Image Generation} on ImageNet $256\times256$ using DiT. We report the $N \times L$ configuration used for inference in parentheses in the Model column.}
```
```{=latex}
\resizebox{0.8\columnwidth}{!}{
    \begin{tabular}{lccc}
        \toprule
        {Model} & {FID $\downarrow$} & {\# params} & $\mathcal{D}$\\
        \midrule
        DiT - 16 layers & $3.87$  & 1.1B & 16 \\
        DiT - 32 layers &  $3.43$  & 2.1B & 32 \\
        \midrule
        \ours\ ($ 1N \times 32L$)& $10.30$ &69M & 32 \\
        \ours\ ($4N \times 8L$)&  $3.96$  & 271M & 32\\
        \textbf{\ours}\ ($\bm{8N \times 4L}$)&  $\textbf{3.16}$ & \textbf{539M}& 32\\
        \textbf{\ours}\ ($\bm{16N \times 2L}$)&  $\textbf{2.83}$ &\textbf{1.1B}& 32\\
        \bottomrule
    \end{tabular}
}
```
`\label{tab:dit}`{=latex}

```{=latex}
\hfill
```
```{=latex}
\captionof{table}{\small \textbf{Throughput gains for \ours}. We report throughput (images/sec) ratio (\ours\ /\ Baseline) on a Google Cloud TPU v6e~\citep{googleclouduv6e} with $1\times1$ topology and inference batch size of 8. The speedup arises from \ours's compact shared parameters fitting on-chip, reducing repeated HBM-to-SRAM transfers. \ours\ achieves a peak speedup of $\emph{3.5}\times$ for model scale H.}
```
```{=latex}
\resizebox{0.8\columnwidth}{!}{
    \begin{tabular}{lcc}
    \toprule
    \ours\ & $d_{model}$ & Throughput Ratio \\
    \midrule
    $6N \times 2L$ (B)  & 768  & 1.0 \\
    $8N \times 3L$ (L)  & 1024 & 2.9 \\
    $7N \times 4L$ (XL) & 1152 & 3.3 \\
    $8N \times 4L$ (H)  & 1280 & \textbf{3.5} \\
    % G  & 1536 & 1.1 \\
    \bottomrule
    \end{tabular}
    }
```
`\label{tab:thru}`{=latex}

`\flushleft `{=latex}**Scaling Parameters**: We further investigate the relationship between model capacity and performance in `\Cref{fig:fid_vs_params_imagenet}`{=latex}. By plotting the best FID achieved at each parameter budget, we observe a clear power-law scaling trend across all model widths. While increasing the number of unique layers ($N$) reduces FID, the gains are most pronounced when accompanied by an increase in model width ($d$). Specifically, the $d=1536$ configuration ($G$) achieves the lowest overall FID with only $8$ unique layers, while the full $G$ model has $48$ unique layers. However, for on-device budgets, smaller looped variants (L and XL) remain highly efficient, providing a flexible scale-to-performance ratio that is critical for resource-constrained visual generation.

![`\small `{=latex}**Faster Training Convergence with `\ours`{=latex}**. While using $2\times$ fewer parameters, `\ours`{=latex} improves training efficiency in the diffusion framework.](figures/supplementary/16_2_fast.png "fig:"){#fig:supp_faster_training_convergence height="45mm"} `\label{fig:supp_16_2_faster_training_convergence}`{=latex}

`\hfill `{=latex}

![`\small `{=latex}**Faster Training Convergence with `\ours`{=latex}**. While using $2\times$ fewer parameters, `\ours`{=latex} improves training efficiency in the diffusion framework.](figures/supplementary/8_4_fast.png "fig:"){#fig:supp_faster_training_convergence height="45mm"} `\label{fig:supp_8_4_faster_training_convergence}`{=latex}

**`\flushleft{Faster Training Convergence}`{=latex}**: Our method converges significantly faster than standard DiT architectures [@peebles2023scalablediffusionmodelstransformers]. As illustrated in `\Cref{fig:supp_faster_training_convergence}`{=latex}, `\ours`{=latex}-based diffusion models achieve $2\times$ and $1.4\times$ speedups when using configurations of $16N \times 2L_{\text{max}}$ and $8N \times 4L_{\text{max}}$, respectively, in iso-inference-compute settings with the $N=32$ DiT baseline. Note that the effective depth $\mathcal{D}$ is the same for both `\ours`{=latex} and the DiT baseline (32).

`\flushleft `{=latex}**`\ours`{=latex} has high Throughput**: `\ours`{=latex} utilizes a compact set of shared parameters and maintains its major weight footprint closer to the accelerator computation unit. This avoids the cost of repeated memory transfers typically required in standard models. We evaluate the efficiency of our proposed architecture by measuring the throughput ratio relative to the baseline across various model scales with an inference batch size of 8, on 1 TPU slice. We choose the best performing `\ours`{=latex} inference configuration from each model scale (refer to `\Cref{fig:fid_vs_params_imagenet}`{=latex}) and report their throughput gains in `\Cref{tab:thru}`{=latex}. Our method delivers significant throughput gains across model scales L, XL and H. Note that these gains are contingent on the model size: as long as the shared parameters remain within memory capacity, the architecture eliminates redundant memory movement across iterations. For model scale B, the baseline (MaskGIT [@chang2022maskgit]) is small enough to fit entirely within device memory capacity, eliminating the transfer penalty. More generally, ELT offers extreme parameter efficiency, which would significantly reduce memory transfers even for bigger models that must be sharded across multiple devices.

Video Generation
----------------

```{=latex}
\resizebox{\textwidth}{!}
```
  Method                                                                              Class                FVD $\downarrow$                          IS$\uparrow$                 \# params   \# steps   \# GFlops
  ------------------------------------------------------------------------------- -------------- ------------------------------------- ----------------------------------------- ----------- ---------- ------------
  RaMViD$^{\Delta*}$ [@hoppe2022diffusion]                                                                        \-                    21.71 $\pm$ [0.21]{style="color: gray"}     308M        500          \-
  StyleGAN-V$^{\Delta*}$[@skorokhodov2022stylegan]                                                                \-                    23.94 $\pm$ [0.73]{style="color: gray"}      \-          1           \-
  DIGAN$^{\Delta}$ [@yu2022generating]                                                             577$\pm$[21]{style="color: gray"}    32.70$\pm$ [0.35]{style="color: gray"}       \-          1       $\sim$148
  DVD-GAN$^{\Delta}$ [@clark2019adversarial]                                       $\checkmark$                   \-                    32.97$\pm$ [1.70]{style="color: gray"}       \-          1           \-
  Video Diffusion$^{\Delta*}$ [@ho2022video]                                                                                            57.00$\pm$ [0.62]{style="color: gray"}      1.1B        256          \-
  TATS$^{\Delta}$ [@ge2022longvideogenerationtimeagnostic]                                        420$\pm$ [18]{style="color: gray"}    57.63$\pm$ [0.24]{style="color: gray"}      321M        1024         \-
  CCVS+StyleGAN$^{\Delta}$ [@le2021ccvs]                                                          386$\pm$ [15]{style="color: gray"}    24.47$\pm$ [0.13]{style="color: gray"}       \-          \-          \-
  Make-A-Video$^{\Delta*}$ [@singer2022make]                                       $\checkmark$                   367                                    33.00                       \-          \-          \-
  TATS$^{\Delta}$ [@ge2022longvideogenerationtimeagnostic]                         $\checkmark$   332$\pm$ [18]{style="color: gray"}    79.28$\pm$ [0.38]{style="color: gray"}      321M        1024         \-
  [CogVideo]{style="color: gray"}$^{\Delta*}$ [@hong2022cogvideo]                  $\checkmark$                   626                                    50.46                      9.4B         \-          \-
  [Make-A-Video]{style="color: gray"}$^{\Delta*}$ [@singer2022make]                $\checkmark$                   81                                     82.55                    $\gg$3.5B   $\gg$250       \-
  PAR-4$\times^{\Delta}$ [@wang2024parallelizedautoregressivevisualgeneration]     $\checkmark$                  99.5                                     \-                        792M        323          \-
  PAR-16$\times^{\Delta}$ [@wang2024parallelizedautoregressivevisualgeneration]    $\checkmark$                  103.4                                    \-                        792M         95          \-
  MaGNeTS$^{\Delta}$ [@goyal2025maskedgenerativenestedtransformers]                $\checkmark$    96.4$\pm$[2]{style="color: gray"}     88.53$\pm$[0.20]{style="color: gray"}      306M         12      $\sim$1.7k
  MAGVIT-L$^{\Delta}$ [@yu2023magvit]                                              $\checkmark$    76$\pm$ [2]{style="color: gray"}     89.27$\pm$ [0.15]{style="color: gray"}      306M         12      $\sim$4.3k
  **`\ours`{=latex}** ($\bm{6N \times 4L}$)                                        $\checkmark$   72.8$\pm$[2.5]{style="color: gray"}    88.27$\pm$[0.33]{style="color: gray"}     **76M**       12      $\sim$4.3k
  **`\ours`{=latex}** ($\bm{6N \times 6L}$)                                        $\checkmark$   60.8$\pm$[2.7]{style="color: gray"}    87.88$\pm$[0.39]{style="color: gray"}     **76M**       24      $\sim$13k

}

```{=latex}
\captionof{table}{\textbf{Video Generation on UCF-101.} Methods in \textcolor{gray}{gray} are pretrained on additional large video data. $\checkmark$ denotes class-conditional. $^{*}$ indicates custom resolutions. $^\Delta$ denotes values from prior publications. No guidance is used.}
```
`\label{tab:ucf101_quant}`{=latex}

```{=latex}
\hfill
```
![image](figures/ucf101_sjo_final.png){width="100%"} `\captionof{figure}{\textbf{Impact of \distill\ on class-conditional UCF-101 generation.} \distill\ significantly improves performance for all values of $L$ in inference especially when $L \neq L_{max}$.}`{=latex} `\label{fig:ucf101_sjo}`{=latex}

We use the MAGVIT [@yu2023magvit] framework to train parallel-decoding-based video generation models. We summarize the results for class-conditional video generation on UCF-101 in `\Cref{tab:ucf101_quant}`{=latex}. Our compact 76M `\ours`{=latex} model outperforms the MAGVIT baseline (FVD 72.8 vs 76) in the data-constrained setting of UCF-101 ($\sim$13.7M training tokens) at iso-inference-compute. Scaling the compute further with the number of loops and sampling steps gives a boost in performance, reaching an FVD of 60.8. This suggests that looped transformers can exhibit robustness against overfitting in data-constrained regimes like UCF-101, effectively regularizing the learning process while maintaining the expressive capacity for high-quality generation.

```{=latex}
\captionsetup{justification=centering}
```
![Masked Generative Models](figures/sjo_effect_maskgit.png){#fig:ilsd_effect_maskgit height="45mm"}

`\hfill `{=latex}

```{=latex}
\captionsetup{justification=centering}
```
![Diffusion Transformers](figures/dit_ilsd_final.png){#fig:ilsd_dit height="45mm"}

`\distillfull`{=latex} drives Elasticity
----------------------------------------

We analyze the impact of `\distill`{=latex} for image generation across masked generative models (refer to `\Cref{fig:ilsd_effect_maskgit}`{=latex}) and diffusion models (refer to `\Cref{fig:ilsd_dit}`{=latex}). Models trained without `\distill`{=latex} degrade significantly once the inference loop count departs from their fixed training depth ($L_{max}$). In contrast, `\ours`{=latex} maintains stable, high-quality generation across the entire inference loop spectrum (see `\Cref{fig:gen_dit_ilsd}`{=latex}). We further analyze class-conditional video generation on UCF-101 in `\Cref{fig:ucf101_sjo}`{=latex}. Interestingly, `\distill`{=latex} enables the model to maintain reasonable quality even at unseen depths ($L > L_{\text{max}}$). On UCF-101, the model achieves its best FVD of *69.20* at $L=6$ despite being trained with $L_{\text{max}}=4$, suggesting that `\distill`{=latex} regularizes the iterative process sufficiently for modest extrapolation beyond training depth. We note that this extrapolation behavior warrants further investigation across datasets and scales.
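To make the notion of elasticity concrete, the sketch below shows only the inference-time mechanics: the same block of $N$ unique layers is applied $L$ times, and $L$ can be chosen freely at test time. It does not implement `\distill`{=latex} training, and the layers are toy element-wise residual updates rather than transformer layers; every name here (`make_toy_block`, `apply_block`, `looped_forward`) is hypothetical.

```python
import math
import random


def make_toy_block(n_layers, width, seed=0):
    """Stand-in for N unique layers: one random weight vector per layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(width)]
            for _ in range(n_layers)]


def apply_block(block, x):
    """Apply the N unique layers once, each as a toy residual update."""
    for w in block:
        x = [xi + 0.1 * math.tanh(wi * xi) for wi, xi in zip(w, x)]
    return x


def looped_forward(block, x, n_loops):
    """Reuse the same block n_loops times; effective depth is N * n_loops."""
    for _ in range(n_loops):
        x = apply_block(block, x)
    return x


block = make_toy_block(n_layers=4, width=8)  # 4 unique layers, shared by all loops
x0 = [1.0] * 8

# One set of parameters, several inference depths (including L > L_max).
for n_loops in (1, 2, 4, 6):
    out = looped_forward(block, x0, n_loops)
    print(f"L={n_loops}: effective depth {len(block) * n_loops}, "
          f"first activation {out[0]:.4f}")
```

The parameter count is fixed by `block` regardless of `n_loops`; what the distillation objective adds, per the results above, is that the intermediate outputs at each loop count are themselves usable predictions.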

Conclusion
==========

We proposed a novel parameter-efficient approach to visual generation using recurrent transformers called `\oursfull`{=latex} (`\ours`{=latex}). Our approach achieves strong empirical performance, similar to baselines, with $4\times$ fewer parameters in the iso-inference-compute setting in both image and video generation tasks. Beyond significantly improved performance per parameter, we identified fundamental scaling properties of looped transformers: while increasing model width remains a primary driver of quality, recursive looping provides a unique \`\`test-time" compute lever. Through our proposed `\distillfull`{=latex} strategy, we train a single model that is performant across a variable number of iterations. This strategy effectively yields a family of models from a single training run, enabling Any-Time inference where practitioners can traverse the Pareto front to balance image quality and GFLOPs dynamically.

![`\small `{=latex}**Class-conditional Image Generation on ImageNet $\bm{256 \times 256}$**. Comparing our proposed `\oursfull`{=latex} (left) against vanilla looped transformers (right), which simply repeat transformer layers during a forward pass of diffusion. As shown, standard looping only produces a coherent image when the inference loop count exactly matches its training setup ($L=8$), with quality severely degrading at all other values. In contrast, our proposed `\ours`{=latex} enhances vanilla looping with Self Distillation, maintaining a high-fidelity generation across several evaluated compute budgets.](figures/supplementary/supp_dit_qual_part1_compressed.png){#fig:dit_qual_supp1 width="\\textwidth"}

![`\small `{=latex}**Class-conditional Image Generation on ImageNet $\bm{256 \times 256}$**. Comparing our proposed `\oursfull`{=latex} (left) against vanilla looped transformers (right), which simply repeat transformer layers during a forward pass of diffusion. As shown, standard looping only produces a coherent image when the inference loop count exactly matches its training setup ($L=8$), with quality severely degrading at all other values. In contrast, our proposed `\ours`{=latex} enhances vanilla looping with Self Distillation, maintaining a high-fidelity generation across several evaluated compute budgets.](figures/supplementary/supp_dit_qual_part2_compressed.png){#fig:dit_qual_supp2 width="\\textwidth"}

Looking forward, we note that ELTs can potentially unlock more efficient inference for diffusion models. While existing approaches rely on the same network (and hence allocate the same compute) across denoising steps, ELT can dynamically allocate compute across denoising steps, spending more compute where it matters the most. Additionally, in the context of recent one-step generative modeling paradigms such as consistency models [@song2023consistencymodels] and drifting models [@deng2026generative], ELT can enable true elasticity: since there is only one sampling step, one can dynamically control the quality of the model at inference by varying the number of loops, without having to pre-determine the number of sampling steps as is the case with traditional multi-step diffusion models. We believe this paradigm of flexible, weight-efficient scaling offers a promising direction for deploying high-fidelity generative models on resource-constrained hardware.
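As a purely illustrative sketch of per-step compute allocation, the snippet below compares a uniform loop schedule with a non-uniform one that spends the same total compute. The schedules, the choice of which steps receive more loops, and the function name are all assumptions made for illustration; which denoising steps actually benefit from extra loops is an open question that this sketch does not address.

```python
def total_layer_passes(n_layers, loops_per_step):
    """Total layer applications over a sampling run: N * sum of per-step loops."""
    return n_layers * sum(loops_per_step)


N = 8  # unique layers (hypothetical configuration)

# 12 sampling steps with a fixed loop count, as in standard looped inference.
uniform = [4] * 12

# Same total budget, but with loops redistributed across steps (arbitrary choice).
non_uniform = [6, 6, 6, 6, 4, 4, 4, 4, 2, 2, 2, 2]

print("uniform     :", total_layer_passes(N, uniform))
print("non-uniform :", total_layer_passes(N, non_uniform))
```

Because the weights are shared across loops, changing the schedule requires no additional parameters; only the loop count per step varies.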

![`\small `{=latex}**Class-conditional Image Generation on ImageNet $\bm{256 \times 256}$**. Qualitative results for `\ours`{=latex} in diffusion framework, $16N \times 2L$ inference configuration & trained with $L_{\text{max}}=2$. ](figures/supplementary/qualitative_dit_final_images_compressed.png){#fig:dit_final_gen_qual_supp width="90%"}

![`\small `{=latex}**Class-conditional Image Generation on ImageNet $\bm{256 \times 256}$**. Qualitative results for `\ours`{=latex} in masked generative framework. Model scale: `\ours`{=latex}-G, $8N \times 3L$ inference config & trained with $L_{\text{max}}=4$ (FID=1.9). ](figures/collage_compressed.png){#fig:qual_elt_maskgit width="90%"}

```{=latex}
\clearpage
```
```{=latex}
\begin{thebibliography}{85}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
  \providecommand{\doi}[1]{doi: #1}\else
  \providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi

\bibitem[Anil et~al.(2022)Anil, Pokle, Liang, Treutlein, Wu, Bai, Kolter, and Grosse]{anil2022pathindependentequilibriummodels}
C.~Anil, A.~Pokle, K.~Liang, J.~Treutlein, Y.~Wu, S.~Bai, Z.~Kolter, and R.~Grosse.
\newblock Path independent equilibrium models can better exploit test-time computation, 2022.
\newblock URL \url{https://arxiv.org/abs/2211.09961}.

\bibitem[Bachmann et~al.(2025)Bachmann, Allardice, Mizrahi, Fini, Kar, Amirloo, El-Nouby, Zamir, and Dehghan]{bachmann2025flextokresamplingimages1d}
R.~Bachmann, J.~Allardice, D.~Mizrahi, E.~Fini, O.~F. Kar, E.~Amirloo, A.~El-Nouby, A.~Zamir, and A.~Dehghan.
\newblock Flextok: Resampling images into 1d token sequences of flexible length, 2025.
\newblock URL \url{https://arxiv.org/abs/2502.13967}.

\bibitem[Bae et~al.(2025)Bae, Kim, Bayat, Kim, Ha, Schuster, Fisch, Harutyunyan, Ji, Courville, and Yun]{bae2025mixtureofrecursionslearningdynamicrecursive}
S.~Bae, Y.~Kim, R.~Bayat, S.~Kim, J.~Ha, T.~Schuster, A.~Fisch, H.~Harutyunyan, Z.~Ji, A.~Courville, and S.-Y. Yun.
\newblock Mixture-of-recursions: Learning dynamic recursive depths for adaptive token-level computation, 2025.
\newblock URL \url{https://arxiv.org/abs/2507.10524}.

\bibitem[Bai et~al.(2019)Bai, Kolter, and Koltun]{bai2019deepequilibriummodels}
S.~Bai, J.~Z. Kolter, and V.~Koltun.
\newblock Deep equilibrium models, 2019.
\newblock URL \url{https://arxiv.org/abs/1909.01377}.

\bibitem[Brock et~al.(2018)Brock, Donahue, and Simonyan]{brock2018large}
A.~Brock, J.~Donahue, and K.~Simonyan.
\newblock Large scale gan training for high fidelity natural image synthesis.
\newblock \emph{arXiv preprint arXiv:1809.11096}, 2018.

\bibitem[Castells et~al.(2024)Castells, Song, Piao, Choi, Kim, Yim, Lee, Kim, and Kim]{castells2024edgefusionondevicetexttoimagegeneration}
T.~Castells, H.-K. Song, T.~Piao, S.~Choi, B.-K. Kim, H.~Yim, C.~Lee, J.~G. Kim, and T.-H. Kim.
\newblock Edgefusion: On-device text-to-image generation, 2024.
\newblock URL \url{https://arxiv.org/abs/2404.11925}.

\bibitem[Chang et~al.(2022)Chang, Zhang, Jiang, Liu, and Freeman]{chang2022maskgit}
H.~Chang, H.~Zhang, L.~Jiang, C.~Liu, and W.~T. Freeman.
\newblock Maskgit: Masked generative image transformer.
\newblock In \emph{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages 11315--11325, 2022.

\bibitem[Clark et~al.(2019)Clark, Donahue, and Simonyan]{clark2019adversarial}
A.~Clark, J.~Donahue, and K.~Simonyan.
\newblock Adversarial video generation on complex datasets.
\newblock \emph{arXiv preprint arXiv:1907.06571}, 2019.

\bibitem[Dehghani et~al.(2018)Dehghani, Gouws, Vinyals, Uszkoreit, and Kaiser]{dehghani2018universal}
M.~Dehghani, S.~Gouws, O.~Vinyals, J.~Uszkoreit, and {\L}.~Kaiser.
\newblock Universal transformers.
\newblock \emph{arXiv preprint arXiv:1807.03819}, 2018.

\bibitem[Deng et~al.(2009)Deng, Dong, Socher, Li, Li, and Fei-Fei]{deng2009imagenet}
J.~Deng, W.~Dong, R.~Socher, L.-J. Li, K.~Li, and L.~Fei-Fei.
\newblock Imagenet: A large-scale hierarchical image database.
\newblock In \emph{2009 IEEE conference on computer vision and pattern recognition}, pages 248--255. Ieee, 2009.

\bibitem[Deng et~al.(2026)Deng, Li, Li, Du, and He]{deng2026generative}
M.~Deng, H.~Li, T.~Li, Y.~Du, and K.~He.
\newblock Generative modeling via drifting.
\newblock \emph{arXiv preprint arXiv:2602.04770}, 2026.

\bibitem[Devlin et~al.(2019)Devlin, Chang, Lee, and Toutanova]{devlin2019bertpretrainingdeepbidirectional}
J.~Devlin, M.-W. Chang, K.~Lee, and K.~Toutanova.
\newblock Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
\newblock URL \url{https://arxiv.org/abs/1810.04805}.

\bibitem[Dhariwal and Nichol(2021)]{dhariwal2021diffusion}
P.~Dhariwal and A.~Nichol.
\newblock Diffusion models beat gans on image synthesis.
\newblock \emph{Advances in neural information processing systems}, 34:\penalty0 8780--8794, 2021.

\bibitem[Duggal et~al.(2024)Duggal, Isola, Torralba, and Freeman]{duggal2024adaptivelengthimagetokenization}
S.~Duggal, P.~Isola, A.~Torralba, and W.~T. Freeman.
\newblock Adaptive length image tokenization via recurrent allocation, 2024.
\newblock URL \url{https://arxiv.org/abs/2411.02393}.

\bibitem[Esser et~al.(2021)Esser, Rombach, and Ommer]{esser2021taming}
P.~Esser, R.~Rombach, and B.~Ommer.
\newblock Taming transformers for high-resolution image synthesis.
\newblock In \emph{CVPR}, pages 12873--12883, 2021.

\bibitem[Fan et~al.(2024)Fan, Du, Ramchandran, and Lee]{fan2024looped}
Y.~Fan, Y.~Du, K.~Ramchandran, and K.~Lee.
\newblock Looped transformers for length generalization.
\newblock \emph{arXiv preprint arXiv:2409.15647}, 2024.

\bibitem[Gabor et~al.(2024)Gabor, Piotrowski, and Cavalcante]{gabor2024positiveconcavedeepequilibrium}
M.~Gabor, T.~Piotrowski, and R.~L.~G. Cavalcante.
\newblock Positive concave deep equilibrium models, 2024.
\newblock URL \url{https://arxiv.org/abs/2402.04029}.

\bibitem[Gao et~al.(2023)Gao, Zhou, Cheng, and Yan]{gao2023masked}
S.~Gao, P.~Zhou, M.-M. Cheng, and S.~Yan.
\newblock Masked diffusion transformer is a strong image synthesizer.
\newblock In \emph{Proceedings of the IEEE/CVF International Conference on Computer Vision}, pages 23164--23173, 2023.

\bibitem[Gatmiry et~al.(2024)Gatmiry, Saunshi, Reddi, Jegelka, and Kumar]{gatmiry2024loopedtransformerslearnimplement}
K.~Gatmiry, N.~Saunshi, S.~J. Reddi, S.~Jegelka, and S.~Kumar.
\newblock Can looped transformers learn to implement multi-step gradient descent for in-context learning?, 2024.
\newblock URL \url{https://arxiv.org/abs/2410.08292}.

\bibitem[Ge et~al.(2022)Ge, Hayes, Yang, Yin, Pang, Jacobs, Huang, and Parikh]{ge2022longvideogenerationtimeagnostic}
S.~Ge, T.~Hayes, H.~Yang, X.~Yin, G.~Pang, D.~Jacobs, J.-B. Huang, and D.~Parikh.
\newblock Long video generation with time-agnostic vqgan and time-sensitive transformer, 2022.
\newblock URL \url{https://arxiv.org/abs/2204.03638}.

\bibitem[Geiping et~al.(2025)Geiping, McLeish, Jain, Kirchenbauer, Singh, Bartoldson, Kailkhura, Bhatele, and Goldstein]{geiping2025scalingtesttimecomputelatent}
J.~Geiping, S.~McLeish, N.~Jain, J.~Kirchenbauer, S.~Singh, B.~R. Bartoldson, B.~Kailkhura, A.~Bhatele, and T.~Goldstein.
\newblock Scaling up test-time compute with latent reasoning: A recurrent depth approach, 2025.
\newblock URL \url{https://arxiv.org/abs/2502.05171}.

\bibitem[Geng et~al.(2023)Geng, Pokle, and Kolter]{geng2023onestepdiffusiondistillationdeep}
Z.~Geng, A.~Pokle, and J.~Z. Kolter.
\newblock One-step diffusion distillation via deep equilibrium models, 2023.
\newblock URL \url{https://arxiv.org/abs/2401.08639}.

\bibitem[{Google Cloud}(2024)]{googleclouduv6e}
{Google Cloud}.
\newblock {TPU v6e (Trillium) Documentation}.
\newblock \url{https://cloud.google.com/tpu/docs/v6e}, 2024.
\newblock Accessed: 2024-05-22.

\bibitem[Goyal et~al.(2025)Goyal, Tula, Jain, Shenoy, Jain, and Paul]{goyal2025maskedgenerativenestedtransformers}
S.~Goyal, D.~Tula, G.~Jain, P.~Shenoy, P.~Jain, and S.~Paul.
\newblock Masked generative nested transformers with decode time scaling, 2025.
\newblock URL \url{https://arxiv.org/abs/2502.00382}.

\bibitem[Hang et~al.(2024)Hang, Gu, Li, Bao, Chen, Hu, Geng, and Guo]{hang2024efficientdiffusiontrainingminsnr}
T.~Hang, S.~Gu, C.~Li, J.~Bao, D.~Chen, H.~Hu, X.~Geng, and B.~Guo.
\newblock Efficient diffusion training via min-snr weighting strategy, 2024.
\newblock URL \url{https://arxiv.org/abs/2303.09556}.

\bibitem[He et~al.(2025)He, Liang, Wang, Wan, Zhang, Gai, and Pan]{he2025scalingimagevideogeneration}
H.~He, J.~Liang, X.~Wang, P.~Wan, D.~Zhang, K.~Gai, and L.~Pan.
\newblock Scaling image and video generation via test-time evolutionary search, 2025.
\newblock URL \url{https://arxiv.org/abs/2505.17618}.

\bibitem[Heusel et~al.(2017)Heusel, Ramsauer, Unterthiner, Nessler, and Hochreiter]{heusel2017gans}
M.~Heusel, H.~Ramsauer, T.~Unterthiner, B.~Nessler, and S.~Hochreiter.
\newblock Gans trained by a two time-scale update rule converge to a local nash equilibrium.
\newblock \emph{Advances in neural information processing systems}, 30, 2017.

\bibitem[Ho and Salimans(2022)]{ho2022classifierfreediffusionguidance}
J.~Ho and T.~Salimans.
\newblock Classifier-free diffusion guidance, 2022.
\newblock URL \url{https://arxiv.org/abs/2207.12598}.

\bibitem[Ho et~al.(2020)Ho, Jain, and Abbeel]{ho2020denoising}
J.~Ho, A.~P. Jain, and P.~Abbeel.
\newblock Denoising diffusion probabilistic models.
\newblock \emph{Advances in neural information processing systems}, 33:\penalty0 6840--6851, 2020.

\bibitem[Ho et~al.(2022{\natexlab{a}})Ho, Saharia, Chan, Fleet, Norouzi, and Salimans]{ho2022cascaded}
J.~Ho, C.~Saharia, W.~Chan, D.~J. Fleet, M.~Norouzi, and T.~Salimans.
\newblock Cascaded diffusion models for high fidelity image generation.
\newblock \emph{Journal of Machine Learning Research}, 23\penalty0 (47):\penalty0 1--33, 2022{\natexlab{a}}.

\bibitem[Ho et~al.(2022{\natexlab{b}})Ho, Salimans, Gritsenko, Chan, Norouzi, and Fleet]{ho2022video}
J.~Ho, T.~Salimans, A.~Gritsenko, W.~Chan, M.~Norouzi, and D.~J. Fleet.
\newblock Video diffusion models.
\newblock \emph{Advances in Neural Information Processing Systems}, 35:\penalty0 8633--8646, 2022{\natexlab{b}}.

\bibitem[Hong et~al.(2022)Hong, Ding, Zheng, Liu, and Tang]{hong2022cogvideo}
W.~Hong, M.~Ding, W.~Zheng, X.~Liu, and J.~Tang.
\newblock Cogvideo: Large-scale pretraining for text-to-video generation via transformers.
\newblock \emph{arXiv preprint arXiv:2205.15868}, 2022.

\bibitem[Hoogeboom et~al.(2023)Hoogeboom, Heek, and Salimans]{hoogeboom2023simple}
E.~Hoogeboom, J.~Heek, and T.~Salimans.
\newblock simple diffusion: End-to-end diffusion for high resolution images.
\newblock In \emph{International Conference on Machine Learning}, pages 13213--13232. PMLR, 2023.

\bibitem[Hoogeboom et~al.(2025)Hoogeboom, Mensink, Heek, Lamerigts, Gao, and Salimans]{hoogeboom2025simpler}
E.~Hoogeboom, T.~Mensink, J.~Heek, K.~Lamerigts, R.~Gao, and T.~Salimans.
\newblock Simpler diffusion: 1.5 fid on imagenet512 with pixel-space diffusion.
\newblock In \emph{Proceedings of the Computer Vision and Pattern Recognition Conference}, pages 18062--18071, 2025.

\bibitem[H{\"o}ppe et~al.(2022)H{\"o}ppe, Mehrjou, Bauer, Nielsen, and Dittadi]{hoppe2022diffusion}
T.~H{\"o}ppe, A.~Mehrjou, S.~Bauer, D.~Nielsen, and A.~Dittadi.
\newblock Diffusion models for video prediction and infilling.
\newblock \emph{arXiv preprint arXiv:2206.07696}, 2022.

\bibitem[Jabri et~al.(2022)Jabri, Fleet, and Chen]{jabri2022scalable}
A.~Jabri, D.~Fleet, and T.~Chen.
\newblock Scalable adaptive computation for iterative generation.
\newblock \emph{arXiv preprint arXiv:2212.11972}, 2022.

\bibitem[Kar et~al.(2019)Kar, Kubilius, Schmidt, Issa, and DiCarlo]{kar2019evidence}
K.~Kar, J.~Kubilius, K.~Schmidt, E.~B. Issa, and J.~J. DiCarlo.
\newblock Evidence that recurrent circuits are critical to the ventral stream's execution of core object recognition behavior.
\newblock \emph{Nature neuroscience}, 22\penalty0 (6):\penalty0 974--983, 2019.

\bibitem[Kietzmann et~al.(2019)Kietzmann, Spoerer, S{\"o}rensen, Cichy, Hauk, and Kriegeskorte]{kietzmann2019recurrence}
T.~C. Kietzmann, C.~J. Spoerer, L.~K. S{\"o}rensen, R.~M. Cichy, O.~Hauk, and N.~Kriegeskorte.
\newblock Recurrence is required to capture the representational dynamics of the human visual system.
\newblock \emph{Proceedings of the National Academy of Sciences}, 116\penalty0 (43):\penalty0 21854--21863, 2019.

\bibitem[Kingma and Gao(2023)]{kingma2023understanding}
D.~P. Kingma and R.~Gao.
\newblock Understanding the diffusion objective as a weighted integral of elbos.
\newblock \emph{arXiv preprint arXiv:2303.00848}, 2, 2023.

\bibitem[Kondratyuk et~al.(2024)Kondratyuk, Yu, Gu, Lezama, Huang, Hornung, Adam, Akbari, Alon, Birodkar, et~al.]{kondratyuk2023videopoet}
D.~Kondratyuk, L.~Yu, X.~Gu, J.~Lezama, J.~Huang, R.~Hornung, H.~Adam, H.~Akbari, Y.~Alon, V.~Birodkar, et~al.
\newblock Videopoet: A large language model for zero-shot video generation.
\newblock \emph{ICML}, 2024.

\bibitem[Devvrit et~al.(2023)Devvrit, Kudugunta, Kusupati, Dettmers, Chen, Dhillon, Tsvetkov, Hajishirzi, Kakade, Farhadi, Jain, et~al.]{kudugunta2023matformer}
Devvrit, S.~Kudugunta, A.~Kusupati, T.~Dettmers, K.~Chen, I.~Dhillon, Y.~Tsvetkov, H.~Hajishirzi, S.~Kakade, A.~Farhadi, P.~Jain, et~al.
\newblock Matformer: Nested transformer for elastic inference.
\newblock \emph{Advances in Neural Information Processing Systems}, 2024.

\bibitem[Kusupati et~al.(2022)Kusupati, Bhatt, Rege, Wallingford, Sinha, Ramanujan, Howard-Snyder, Chen, Kakade, Jain, et~al.]{kusupati2022matryoshka}
A.~Kusupati, G.~Bhatt, A.~Rege, M.~Wallingford, A.~Sinha, V.~Ramanujan, W.~Howard-Snyder, K.~Chen, S.~Kakade, P.~Jain, et~al.
\newblock Matryoshka representation learning.
\newblock \emph{Advances in Neural Information Processing Systems}, 35:\penalty0 30233--30249, 2022.

\bibitem[Le~Moing et~al.(2021)Le~Moing, Ponce, and Schmid]{le2021ccvs}
G.~Le~Moing, J.~Ponce, and C.~Schmid.
\newblock Ccvs: Context-aware controllable video synthesis.
\newblock \emph{Advances in Neural Information Processing Systems}, 34:\penalty0 14042--14055, 2021.

\bibitem[Li et~al.(2024)Li, Tian, Li, Deng, and He]{li2024autoregressiveimagegenerationvector}
T.~Li, Y.~Tian, H.~Li, M.~Deng, and K.~He.
\newblock Autoregressive image generation without vector quantization, 2024.
\newblock URL \url{https://arxiv.org/abs/2406.11838}.

\bibitem[Li(2025)]{li2025morvitefficientvisiontransformer}
Y.~Li.
\newblock MoR-ViT: Efficient vision transformer with mixture-of-recursions, 2025.
\newblock URL \url{https://arxiv.org/abs/2507.21761}.

\bibitem[Loshchilov and Hutter(2019)]{loshchilov2019decoupledweightdecayregularization}
I.~Loshchilov and F.~Hutter.
\newblock Decoupled weight decay regularization, 2019.
\newblock URL \url{https://arxiv.org/abs/1711.05101}.

\bibitem[McCallum et~al.(2025)McCallum, Arora, and Foster]{mccallum2025reversibledeepequilibriummodels}
S.~McCallum, K.~Arora, and J.~Foster.
\newblock Reversible deep equilibrium models, 2025.
\newblock URL \url{https://arxiv.org/abs/2509.12917}.

\bibitem[Menghani(2023)]{menghani2023efficient}
G.~Menghani.
\newblock Efficient deep learning: A survey on making deep learning models smaller, faster, and better.
\newblock \emph{ACM Computing Surveys}, 55\penalty0 (12):\penalty0 1--37, 2023.

\bibitem[Miwa et~al.(2025)Miwa, Sasaki, Arai, Takahashi, and Yamaguchi]{miwa2025onedpieceimagetokenizermeets}
K.~Miwa, K.~Sasaki, H.~Arai, T.~Takahashi, and Y.~Yamaguchi.
\newblock One-D-Piece: Image tokenizer meets quality-controllable compression, 2025.
\newblock URL \url{https://arxiv.org/abs/2501.10064}.

\bibitem[Ni et~al.(2024{\natexlab{a}})Ni, Wang, Zhou, Guo, Hu, Liu, Song, Yao, and Huang]{ni2024revisitingnonautoregressivetransformersefficient}
Z.~Ni, Y.~Wang, R.~Zhou, J.~Guo, J.~Hu, Z.~Liu, S.~Song, Y.~Yao, and G.~Huang.
\newblock Revisiting non-autoregressive transformers for efficient image synthesis, 2024{\natexlab{a}}.
\newblock URL \url{https://arxiv.org/abs/2406.05478}.

\bibitem[Ni et~al.(2024{\natexlab{b}})Ni, Wang, Zhou, Han, Guo, Liu, Yao, and Huang]{ni2024enatrethinkingspatialtemporalinteractions}
Z.~Ni, Y.~Wang, R.~Zhou, Y.~Han, J.~Guo, Z.~Liu, Y.~Yao, and G.~Huang.
\newblock ENAT: Rethinking spatial-temporal interactions in token-based image synthesis, 2024{\natexlab{b}}.
\newblock URL \url{https://arxiv.org/abs/2411.06959}.

\bibitem[Nichol and Dhariwal(2021)]{nichol2021improved}
A.~Q. Nichol and P.~Dhariwal.
\newblock Improved denoising diffusion probabilistic models.
\newblock In \emph{International Conference on Machine Learning}, pages 8162--8171. PMLR, 2021.

\bibitem[Peebles and Xie(2023)]{peebles2023scalablediffusionmodelstransformers}
W.~Peebles and S.~Xie.
\newblock Scalable diffusion models with transformers, 2023.
\newblock URL \url{https://arxiv.org/abs/2212.09748}.

\bibitem[Pokle et~al.(2022)Pokle, Geng, and Kolter]{pokle2022deepequilibriumapproachesdiffusion}
A.~Pokle, Z.~Geng, and Z.~Kolter.
\newblock Deep equilibrium approaches to diffusion models, 2022.
\newblock URL \url{https://arxiv.org/abs/2210.12867}.

\bibitem[Razavi et~al.(2019)Razavi, Van~den Oord, and Vinyals]{razavi2019generating}
A.~Razavi, A.~Van~den Oord, and O.~Vinyals.
\newblock Generating diverse high-fidelity images with VQ-VAE-2.
\newblock \emph{Advances in Neural Information Processing Systems}, 32, 2019.

\bibitem[Rombach et~al.(2022{\natexlab{a}})Rombach, Blattmann, Lorenz, Esser, and Ommer]{ldm}
R.~Rombach, A.~Blattmann, D.~Lorenz, P.~Esser, and B.~Ommer.
\newblock High-resolution image synthesis with latent diffusion models, 2022{\natexlab{a}}.
\newblock URL \url{https://arxiv.org/abs/2112.10752}.

\bibitem[Ronneberger et~al.(2015)Ronneberger, Fischer, and Brox]{ronneberger2015unetconvolutionalnetworksbiomedical}
O.~Ronneberger, P.~Fischer, and T.~Brox.
\newblock U-Net: Convolutional networks for biomedical image segmentation, 2015.
\newblock URL \url{https://arxiv.org/abs/1505.04597}.

\bibitem[Salimans and Ho(2022)]{salimans2022progressivedistillationfastsampling}
T.~Salimans and J.~Ho.
\newblock Progressive distillation for fast sampling of diffusion models, 2022.
\newblock URL \url{https://arxiv.org/abs/2202.00512}.

\bibitem[Salimans et~al.(2016)Salimans, Goodfellow, Zaremba, Cheung, Radford, and Chen]{salimans2016improvedtechniquestraininggans}
T.~Salimans, I.~Goodfellow, W.~Zaremba, V.~Cheung, A.~Radford, and X.~Chen.
\newblock Improved techniques for training GANs, 2016.
\newblock URL \url{https://arxiv.org/abs/1606.03498}.

\bibitem[Sauer et~al.(2022)Sauer, Schwarz, and Geiger]{sauer2022styleganxlscalingstyleganlarge}
A.~Sauer, K.~Schwarz, and A.~Geiger.
\newblock StyleGAN-XL: Scaling StyleGAN to large diverse datasets, 2022.
\newblock URL \url{https://arxiv.org/abs/2202.00273}.

\bibitem[Saunshi et~al.(2025)Saunshi, Dikkala, Li, Kumar, and Reddi]{saunshi2025reasoninglatentthoughtspower}
N.~Saunshi, N.~Dikkala, Z.~Li, S.~Kumar, and S.~J. Reddi.
\newblock Reasoning with latent thoughts: On the power of looped transformers, 2025.
\newblock URL \url{https://arxiv.org/abs/2502.17416}.

\bibitem[Shen et~al.(2025)Shen, Tirumala, Yasunaga, Misra, Zettlemoyer, Yu, and Zhou]{shen2025catcontentadaptiveimagetokenization}
J.~Shen, K.~Tirumala, M.~Yasunaga, I.~Misra, L.~Zettlemoyer, L.~Yu, and C.~Zhou.
\newblock CAT: Content-adaptive image tokenization, 2025.
\newblock URL \url{https://arxiv.org/abs/2501.03120}.

\bibitem[Shu et~al.(2026)Shu, Qiu, Zhu, Chen, Liu, and Yang]{shu2026loopvitscalingvisualarc}
W.-J. Shu, X.~Qiu, R.-J. Zhu, H.~H. Chen, Y.~Liu, and H.~Yang.
\newblock LoopViT: Scaling visual ARC with looped transformers, 2026.
\newblock URL \url{https://arxiv.org/abs/2602.02156}.

\bibitem[Singer et~al.(2022)Singer, Polyak, Hayes, Yin, An, Zhang, Hu, Yang, Ashual, Gafni, et~al.]{singer2022make}
U.~Singer, A.~Polyak, T.~Hayes, X.~Yin, J.~An, S.~Zhang, Q.~Hu, H.~Yang, O.~Ashual, O.~Gafni, et~al.
\newblock Make-A-Video: Text-to-video generation without text-video data.
\newblock \emph{arXiv preprint arXiv:2209.14792}, 2022.

\bibitem[Skorokhodov et~al.(2022)Skorokhodov, Tulyakov, and Elhoseiny]{skorokhodov2022stylegan}
I.~Skorokhodov, S.~Tulyakov, and M.~Elhoseiny.
\newblock StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2.
\newblock In \emph{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages 3626--3636, 2022.

\bibitem[Song et~al.(2020)Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole]{song2020score}
Y.~Song, J.~Sohl-Dickstein, D.~P. Kingma, A.~Kumar, S.~Ermon, and B.~Poole.
\newblock Score-based generative modeling through stochastic differential equations.
\newblock \emph{arXiv preprint arXiv:2011.13456}, 2020.

\bibitem[Song et~al.(2023)Song, Dhariwal, Chen, and Sutskever]{song2023consistencymodels}
Y.~Song, P.~Dhariwal, M.~Chen, and I.~Sutskever.
\newblock Consistency models, 2023.
\newblock URL \url{https://arxiv.org/abs/2303.01469}.

\bibitem[Soomro et~al.(2012)Soomro, Zamir, and Shah]{soomro2012ucf101dataset101human}
K.~Soomro, A.~R. Zamir, and M.~Shah.
\newblock UCF101: A dataset of 101 human actions classes from videos in the wild, 2012.
\newblock URL \url{https://arxiv.org/abs/1212.0402}.

\bibitem[Unterthiner et~al.(2018)Unterthiner, Van~Steenkiste, Kurach, Marinier, Michalski, and Gelly]{unterthiner2019fvd}
T.~Unterthiner, S.~Van~Steenkiste, K.~Kurach, R.~Marinier, M.~Michalski, and S.~Gelly.
\newblock Towards accurate generative models of video: A new metric \& challenges.
\newblock \emph{arXiv preprint arXiv:1812.01717}, 2018.

\bibitem[Wang et~al.(2026)Wang, Lai, Chen, Guo, Guo, Li, Yue, and Guo]{wang2026elasticdiffusiontransformer}
J.~Wang, Z.~Lai, J.~Chen, J.~Guo, H.~Guo, X.~Li, X.~Yue, and C.~Guo.
\newblock Elastic diffusion transformer, 2026.
\newblock URL \url{https://arxiv.org/abs/2602.13993}.

\bibitem[Wang et~al.(2024)Wang, Ren, Lin, Han, Guo, Yang, Zou, Feng, and Liu]{wang2024parallelizedautoregressivevisualgeneration}
Y.~Wang, S.~Ren, Z.~Lin, Y.~Han, H.~Guo, Z.~Yang, D.~Zou, J.~Feng, and X.~Liu.
\newblock Parallelized autoregressive visual generation, 2024.
\newblock URL \url{https://arxiv.org/abs/2412.15119}.

\bibitem[Wang et~al.(2023)Wang, Jiang, Zheng, Wang, He, Wang, Chen, and Zhou]{wang2023patchdiffusionfasterdataefficient}
Z.~Wang, Y.~Jiang, H.~Zheng, P.~Wang, P.~He, Z.~Wang, W.~Chen, and M.~Zhou.
\newblock Patch diffusion: Faster and more data-efficient training of diffusion models, 2023.
\newblock URL \url{https://arxiv.org/abs/2304.12526}.

\bibitem[Weber et~al.(2024)Weber, Yu, Yu, Deng, Shen, Cremers, and Chen]{weber2024maskbitembeddingfreeimagegeneration}
M.~Weber, L.~Yu, Q.~Yu, X.~Deng, X.~Shen, D.~Cremers, and L.-C. Chen.
\newblock MaskBit: Embedding-free image generation via bit tokens, 2024.
\newblock URL \url{https://arxiv.org/abs/2409.16211}.

\bibitem[Yan et~al.(2025)Yan, Mnih, Faust, Zaharia, Abbeel, and Liu]{yan2025elastictokadaptivetokenizationimage}
W.~Yan, V.~Mnih, A.~Faust, M.~Zaharia, P.~Abbeel, and H.~Liu.
\newblock ElasticTok: Adaptive tokenization for image and video, 2025.
\newblock URL \url{https://arxiv.org/abs/2410.08368}.

\bibitem[Yang et~al.(2023)Yang, Lee, Nowak, and Papailiopoulos]{yang2023looped}
L.~Yang, K.~Lee, R.~Nowak, and D.~Papailiopoulos.
\newblock Looped transformers are better at learning learning algorithms.
\newblock \emph{arXiv preprint arXiv:2311.12424}, 2023.

\bibitem[Yu et~al.(2023{\natexlab{a}})Yu, Cheng, Sohn, Lezama, Zhang, Chang, Hauptmann, Yang, Hao, Essa, et~al.]{yu2023magvit}
L.~Yu, Y.~Cheng, K.~Sohn, J.~Lezama, H.~Zhang, H.~Chang, A.~G. Hauptmann, M.-H. Yang, Y.~K. Hao, I.~Essa, et~al.
\newblock MAGVIT: Masked generative video transformer.
\newblock In \emph{Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition}, pages 10459--10469, 2023{\natexlab{a}}.

\bibitem[Yu et~al.(2023{\natexlab{b}})Yu, Lezama, Gundavarapu, Versari, Sohn, Minnen, Cheng, Gupta, Gu, Hauptmann, et~al.]{yu2023language}
L.~Yu, J.~Lezama, N.~B. Gundavarapu, L.~Versari, K.~Sohn, D.~Minnen, Y.~Cheng, A.~Gupta, X.~Gu, A.~G. Hauptmann, et~al.
\newblock Language model beats diffusion--tokenizer is key to visual generation.
\newblock \emph{arXiv preprint arXiv:2310.05737}, 2023{\natexlab{b}}.

\bibitem[Yu et~al.(2022)Yu, Tack, Mo, Kim, Kim, Ha, and Shin]{yu2022generating}
S.~Yu, J.~Tack, S.~Mo, H.~Kim, J.~Kim, J.-W. Ha, and J.~Shin.
\newblock Generating videos with dynamics-aware implicit generative adversarial networks.
\newblock \emph{arXiv preprint arXiv:2202.10571}, 2022.

\bibitem[Zhao et~al.(2024{\natexlab{a}})Zhao, Xiong, and Krähenbühl]{zhao2024imagevideotokenizationbinary}
Y.~Zhao, Y.~Xiong, and P.~Krähenbühl.
\newblock Image and video tokenization with binary spherical quantization, 2024{\natexlab{a}}.
\newblock URL \url{https://arxiv.org/abs/2406.07548}.

\bibitem[Zhao et~al.(2024{\natexlab{b}})Zhao, Xu, Xiao, Jia, and Hou]{zhao2024mobilediffusioninstanttexttoimagegeneration}
Y.~Zhao, Y.~Xu, Z.~Xiao, H.~Jia, and T.~Hou.
\newblock MobileDiffusion: Instant text-to-image generation on mobile devices, 2024{\natexlab{b}}.
\newblock URL \url{https://arxiv.org/abs/2311.16567}.

\bibitem[Zheng et~al.(2022)Zheng, Vuong, Cai, and Phung]{zheng2022movqmodulatingquantizedvectors}
C.~Zheng, L.~T. Vuong, J.~Cai, and D.~Phung.
\newblock MoVQ: Modulating quantized vectors for high-fidelity image generation, 2022.
\newblock URL \url{https://arxiv.org/abs/2209.09002}.

\bibitem[Zheng et~al.(2023)Zheng, Nie, Vahdat, and Anandkumar]{zheng2023fast}
H.~Zheng, W.~Nie, A.~Vahdat, and A.~Anandkumar.
\newblock Fast training of diffusion models with masked transformers.
\newblock \emph{arXiv preprint arXiv:2306.09305}, 2023.

\bibitem[Zilberstein(1996)]{zilberstein1996using}
S.~Zilberstein.
\newblock Using anytime algorithms in intelligent systems.
\newblock \emph{AI Magazine}, 17\penalty0 (3):\penalty0 73--83, 1996.

\end{thebibliography}
```