---
abstract: |
  Representation Autoencoders (RAE) replace the traditional VAE with a pretrained vision encoder. In this paper, we systematically investigate several design choices and distill three insights that simplify and improve RAE. First, we study a generalized formulation in which the representation is defined as the sum of the last $k$ encoder layers rather than the final layer alone. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we examine the prevalent assumption that RAE (which uses a pretrained representation as the encoder) replaces representation alignment (REPA), which instead distills the *same representation* into intermediate layers. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit *complementary* working mechanisms, allowing the same representation to serve as both the encoder and the target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in the RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 converges more than $10\times$ faster than the original RAE, achieving a state-of-the-art gFID of $1.06$ in just 80 epochs on ImageNet-256. On FDr$^k$, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs, compared to the previous best of 3.26 (800 epochs), without any post-training. This motivates $\epfidk$ (epochs to reach unguided gFID $\le k$) as a measure of training efficiency. RAEv2 attains an $\epfid$ of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings, such as text-to-image generation and navigation world models, showing consistent improvements. We hope that this work provides useful insights for the practical adoption of representation autoencoders.
author:
-
bibliography:
- main.bib
title: '`\center{Improved Baselines with Representation Autoencoders}`{=latex}'
---

```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters~\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\providecommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\zw}[1]{{\color{red}[\textbf{ZW:} #1]}}
```
```{=latex}
\newcommand{\bad}[1]{\textcolor{bad}{#1}}
```
```{=latex}
\newcommand{\good}[1]{\textcolor{good}{#1}}
```
```{=latex}
\newcommand{\cmark}{\ding{51}}
```
```{=latex}
\newcommand{\xmark}{\ding{55}}
```
```{=latex}
\newcommand{\Sref}[1]{\S\ref{#1}}
```
```{=latex}
\newcommand{\sref}[1]{\S\ref{#1}}
```
```{=latex}
\newcommand{\fref}[1]{Fig.~\ref{#1}}
```
```{=latex}
\newcommand{\tref}[1]{Tab.~\ref{#1}}
```
```{=latex}
\newcommand{\todo}[1]{
{\colorbox{red}{\bfseries\sffamily\scriptsize\textcolor{white}{TODO}}}
 {\textcolor{red}{\sf\small\textit{#1}}}
}
```
```{=latex}
\newcommand{\findingstylename}{finding}
```
```{=latex}
\newcommand{\findingstyle}[1]{\renewcommand{\findingstylename}{#1}}
```
```{=latex}
\newcommand{\findingstylefinding}[3]{%
    \begin{tcolorbox}[
        colback=white!90!gray,
        colframe=teal!60!black,
        arc=5pt,
        boxsep=5pt,
        left=10pt, right=10pt, top=2pt, bottom=2pt,
        boxrule=0.8pt,
        drop shadow=gray!50!white,
        enhanced jigsaw,
        before skip=8pt, after skip=8pt
    ]
    \noindent\faBookmark\textbf{Finding #1: #2} #3
    \end{tcolorbox}
}
```
```{=latex}
\newcommand{\findingstyletakeaway}[3]{%
    \begin{tcolorbox}[
        colback=blue!3!white,
        colframe=blue!40!black,
        fonttitle=\bfseries,
        title={\faLightbulb\ Finding #1: #2},
        arc=2mm,
        boxrule=0.5pt,
        left=5pt, right=5pt, top=2pt, bottom=2pt,
        before skip=10pt, after skip=10pt
    ]
    #3
    \end{tcolorbox}
}
```
```{=latex}
\newcommand{\finding}[3]{%
    \ifx\findingstylename\findingstylefindingname\findingstylefinding{#1}{#2}{#3}%
    \else\findingstyletakeaway{#1}{#2}{#3}\fi
}
```
```{=latex}
\newcommand{\findingstylefindingname}{finding}
```
```{=latex}
\newcommand{\fancybreak}[0]{
    \noindent~\hfill\noindent\rule{0.9\linewidth}{0.8pt}~\hfill~
}
```
```{=latex}
\newcommand{\rebuttal}[1]{{#1}}
```
```{=latex}
\newcommand{\vI}{{\bm{I}}}
```
```{=latex}
\newcommand{\richard}[1]{\textcolor{orange}{rz: #1}}
```
```{=latex}
\newcommand{\jas}[1]{\textcolor{gray!60}{J: #1}}
```
```{=latex}
\newcommand{\eli}[1]{\textcolor{gptgreen}{es: #1}}
```
```{=latex}
\newcommand{\str}[1]{\ensuremath{\mathtt{#1}}}
```
```{=latex}
\newcommand{\epfid}{\ensuremath{\mathrm{EP}_{\mathrm{FID@2}}}}
```
```{=latex}
\newcommand{\epfidk}{\ensuremath{\mathrm{EP}_{\mathrm{FID@}k}}}
```
```{=latex}
\newcommand{\Paragraph}[1]{\par\noindent\textbf{#1}}
```
```{=latex}
\newcommand{\Subparagraph}[1]{\smallskip\noindent{\emph{#1}}}
```
```{=latex}
\newcommand{\styledquote}[3][scholarblue!8]{%
\begin{center}
\colorbox{#1}{%
    %


    \small\itshape
    #2
    \def\temp{#3}%
    \ifx\temp\empty
        % Empty author, no attribution line
    \else
        \begin{flushright}
        \small\normalfont
        ---#3
        \end{flushright}
    \fi

    %
    %
}
\end{center}
%
}
```
```{=latex}
\newcommand{\smallsim}{\smallsym{\mathrel}{\sim}}
```
```{=latex}
\newcommand{\smallsym}[2]{#1{\mathpalette\make@small@sym{#2}}}
```
```{=latex}
\newcommand{\make@small@sym}[2]{%
  \vcenter{\hbox{$\m@th\downgrade@style#1#2$}}%
}
```
```{=latex}
\newcommand{\downgrade@style}[1]{%
  \ifx#1\displaystyle\scriptstyle\else
    \ifx#1\textstyle\scriptstyle\else
      \scriptscriptstyle
  \fi\fi
}
```
```{=latex}
\newcommand{\nbc}[3]{
 {\colorbox{#3}{\bfseries\sffamily\scriptsize\textcolor{white}{#1}}}
 {\textcolor{#3}{\sf\small$\blacktriangleright$\textit{#2}~$\blacktriangleleft$}}
 }
```
```{=latex}
\newcommand{\version}{\emph{\scriptsize\id}}
```
```{=latex}
\newcommand{\ms}[1]{\nbc{MS}{#1}{mscolor}}
```
```{=latex}
\newcommand{\nj}[1]{\nbc{NJ}{#1}{scholarpurple}}
```
```{=latex}
\newcommand{\ks}[1]{\nbc{KS}{#1}{red}}
```
```{=latex}
\newcommand{\maybe}[1]{\nbc{MAYBE}{#1}{orange}}
```
```{=latex}
\newcommand{\edit}[2]{
  \textcolor{gray}{\st{#1}} % Grey out old text
  \textcolor{mscolor}{#2} % Highlight new text
}
```
```{=latex}
\newcommand{\numrepos}{$10$\xspace}
```
```{=latex}
\newcommand{\numreposlarge}{$10$\xspace}
```
```{=latex}
\newcommand{\numproblems}{$4578$\xspace}
```
```{=latex}
\newcommand{\numproblemslarge}{$10000!$\xspace}
```
```{=latex}
\newcommand{\numtests}{IDK\xspace}
```
```{=latex}
\newcommand{\issuelength}{IDK\xspace}
```
```{=latex}
\newcommand{\goldpatchsize}{IDK\xspace}
```
```{=latex}
\newcommand{\goldpatchlines}{IDK\xspace}
```
```{=latex}
\newcommand{\failtopasstests}{IDK\xspace}
```
```{=latex}
\newcommand{\alltests}{IDK\xspace}
```
```{=latex}
\newcommand{\numtrajectories}{$3321$\xspace}
```
```{=latex}
\newcommand{\numtrajectoriestasks}{$2048$\xspace}
```
```{=latex}
\newcommand{\smalltextsc}[1]{\textsc{\small #1}}
```
```{=latex}
\newcommand{\scripttextsc}[1]{\textsc{\scriptsize #1}}
```
```{=latex}
\newcommand{\tinytextsc}[1]{\textsc{\tiny #1}}
```
```{=latex}
\newcommand{\agentsysname}{\textsc{AgentHub}\xspace}
```
```{=latex}
\newcommand{\sysname}{\envsysname{}\agentsysname{}\xspace}
```
```{=latex}
\newcommand{\gymname}{\textsc{AgentGym}\xspace}
```
```{=latex}
\newcommand{\agentname}{\textsc{AgentHub}\xspace}
```
```{=latex}
\newcommand{\swetext}{software engineering\xspace}
```
```{=latex}
\newcommand{\swe}{\textsc{Swe}\xspace}
```
```{=latex}
\newcommand{\dataset}{R2E-Gym\xspace}
```
```{=latex}
\newcommand{\envsysname}{\textsc{R2E-Gym}\xspace}
```
```{=latex}
\newcommand{\syngen}{\textsc{SweGen}\xspace}
```
```{=latex}
\newcommand{\scaffold}{AgentHub\xspace}
```
```{=latex}
\newcommand{\framework}{\envsysname}
```
```{=latex}
\newcommand{\benchmarkInstance}{$\gI$}
```
```{=latex}
\newcommand{\benchmarkDocstring}{$\gD$}
```
```{=latex}
\newcommand{\benchmarkRepo}{$\gR$}
```
```{=latex}
\newcommand{\benchmarkHarness}{$\gT$}
```
```{=latex}
\newcommand{\model}{SWE-LM\xspace}
```
```{=latex}
\newcommand{\llm}{\textsc{LLM}\xspace}
```
```{=latex}
\newcommand{\llamafactory}{\textsc{LLaMA-Factory}\xspace}
```
```{=latex}
\newcommand{\llms}{\textsc{LLMs}\xspace}
```
```{=latex}
\newcommand{\astree}{\textsc{AST}\xspace}
```
```{=latex}
\newcommand{\docker}{\textsc{Docker}\xspace}
```
```{=latex}
\newcommand{\github}{\textsc{GitHub}\xspace}
```
```{=latex}
\newcommand{\stackoverflow}{\textsc{StackOverflow}\xspace}
```
```{=latex}
\newcommand{\python}{\smalltextsc{Python}\xspace}
```
```{=latex}
\newcommand{\api}{\smalltextsc{API}\xspace}
```
```{=latex}
\newcommand{\apis}{\smalltextsc{APIs}\xspace}
```
```{=latex}
\newcommand{\pycg}{\smalltextsc{PyCG}\xspace}
```
```{=latex}
\newcommand{\cotprompt}{\smalltextsc{COT}\xspace}
```
```{=latex}
\newcommand{\cllama}{\smalltextsc{CodeLLaMa}\xspace}
```
```{=latex}
\newcommand{\cllamashort}{\smalltextsc{CL}\xspace}
```
```{=latex}
\newcommand{\cllamaB}[1]{\smalltextsc{CodeLLaMa-#1B}\xspace}
```
```{=latex}
\newcommand{\wizard}{\smalltextsc{WizardCoder-34B}\xspace}
```
```{=latex}
\newcommand{\phind}{\smalltextsc{PhindCode-34B}\xspace}
```
```{=latex}
\newcommand{\codedavinci}{\smalltextsc{Code-Davinci-002}\xspace}
```
```{=latex}
\newcommand{\gptfamily}{\smalltextsc{GPT}\xspace}
```
```{=latex}
\newcommand{\gptthreefiveturbo}{\smalltextsc{GPT-3.5-turbo}\xspace}
```
```{=latex}
\newcommand{\gptfour}{\smalltextsc{GPT-4}\xspace}
```
```{=latex}
\newcommand{\gptfouro}{\smalltextsc{GPT-4o}\xspace}
```
```{=latex}
\newcommand{\gptfourturbo}{\smalltextsc{GPT-4-turbo}\xspace}
```
```{=latex}
\newcommand{\oone}{\smalltextsc{O1}\xspace}
```
```{=latex}
\newcommand{\oonemini}{\smalltextsc{O1-Mini}\xspace}
```
```{=latex}
\newcommand{\alphacode}{\smalltextsc{AlphaCode}\xspace}
```
```{=latex}
\newcommand{\alphacodeB}[1]{\smalltextsc{AlphaCode-#1B}\xspace}
```
```{=latex}
\newcommand{\claude}{\smalltextsc{Claude}\xspace}
```
```{=latex}
\newcommand{\claudetwo}{\smalltextsc{Claude-2}\xspace}
```
```{=latex}
\newcommand{\claudeinstantone}{\smalltextsc{Claude-Instant-1}\xspace}
```
```{=latex}
\newcommand{\sonnetthree}{\smalltextsc{Sonnet}\xspace}
```
```{=latex}
\newcommand{\sonnetthreefive}{\smalltextsc{Sonnet-3.5}\xspace}
```
```{=latex}
\newcommand{\sonnetthreesix}{\smalltextsc{Sonnet-3.5-v2}\xspace}
```
```{=latex}
\newcommand{\sonnetthreeseven}{\smalltextsc{Sonnet-3.7}\xspace}
```
```{=latex}
\newcommand{\geminipro}{\smalltextsc{Gemini-Pro}\xspace}
```
```{=latex}
\newcommand{\geminiultra}{\smalltextsc{Gemini-Ultra}\xspace}
```
```{=latex}
\newcommand{\deepseekcode}{\smalltextsc{DeepSeek-Coder}\xspace}
```
```{=latex}
\newcommand{\deepseekcodeB}[1]{\smalltextsc{DeepSeek-Coder-#1B}\xspace}
```
```{=latex}
\newcommand{\qwen}{\smalltextsc{Qwen}\xspace}
```
```{=latex}
\newcommand{\qwencoder}{\smalltextsc{Qwen-Coder}\xspace}
```
```{=latex}
\newcommand{\thestack}{\smalltextsc{Stack}\xspace}
```
```{=latex}
\newcommand{\humaneval}{\smalltextsc{HumanEval}}
```
```{=latex}
\newcommand{\mbpp}{\smalltextsc{MBPP}}
```
```{=latex}
\newcommand{\humanevalplus}{\smalltextsc{HumanEval+}}
```
```{=latex}
\newcommand{\apps}{\smalltextsc{APPS}\xspace}
```
```{=latex}
\newcommand{\contests}{\smalltextsc{Code-Contests}\xspace}
```
```{=latex}
\newcommand{\dsthousand}{\smalltextsc{DS-1000}}
```
```{=latex}
\newcommand{\pandaseval}{\smalltextsc{PandasEval}}
```
```{=latex}
\newcommand{\numpyeval}{\smalltextsc{NumpyEval}}
```
```{=latex}
\newcommand{\arcade}{\smalltextsc{Arcade}}
```
```{=latex}
\newcommand{\repoeval}{\smalltextsc{RepoEval}}
```
```{=latex}
\newcommand{\repoevalfunc}{\smalltextsc{RepoEval-Func}}
```
```{=latex}
\newcommand{\crosscodeeval}{\smalltextsc{CrossCodeEval}}
```
```{=latex}
\newcommand{\repobench}{\smalltextsc{RepoBench}}
```
```{=latex}
\newcommand{\swebench}{{SWE-Bench}}
```
```{=latex}
\newcommand{\swebenchlite}{\smalltextsc{SWEBench-Lite}}
```
```{=latex}
\newcommand{\swebenchverified}{\smalltextsc{SWEBench-Verified}}
```
```{=latex}
\newcommand{\swebv}{\smalltextsc{SWEB-V}}
```
```{=latex}
\newcommand{\intercode}{\smalltextsc{InterCode}}
```
```{=latex}
\newcommand{\webarena}{\smalltextsc{WebArena}}
```
```{=latex}
\newcommand{\odex}{\smalltextsc{ODEX}}
```
```{=latex}
\newcommand{\codeplan}{\smalltextsc{CodePlan}}
```
```{=latex}
\newcommand{\codet}{\smalltextsc{CodeT}}
```
```{=latex}
\newcommand{\lever}{\smalltextsc{LEVER}}
```
```{=latex}
\newcommand{\speakverify}{\smalltextsc{SpeakVerify}}
```
```{=latex}
\newcommand{\alphacodesearch}{\smalltextsc{AlphaCode-Search}}
```
```{=latex}
\newcommand{\funsearch}{\smalltextsc{FunSearch}}
```
```{=latex}
\newcommand{\reflexion}{\smalltextsc{Reflexion}}
```
```{=latex}
\newcommand{\alphacodium}{\smalltextsc{AlphaCodium}}
```
```{=latex}
\newcommand{\parsel}{\smalltextsc{Parsel}}
```
```{=latex}
\newcommand{\tot}{\smalltextsc{ToT}}
```
```{=latex}
\newcommand{\react}{\smalltextsc{ReAct}}
```
```{=latex}
\newcommand{\onedot}{.\xspace}
```
```{=latex}
\newcommand{\eg}{\emph{e.g}\onedot}
```
```{=latex}
\newcommand{\Eg}{\emph{E.g}\onedot}
```
```{=latex}
\newcommand{\ie}{\emph{i.e}\onedot}
```
```{=latex}
\newcommand{\Ie}{\emph{I.e}\onedot}
```
```{=latex}
\newcommand{\cf}{\emph{cf}\onedot}
```
```{=latex}
\newcommand{\Cf}{\emph{Cf}\onedot}
```
```{=latex}
\newcommand{\etc}{\emph{etc}\onedot}
```
```{=latex}
\newcommand{\wrt}{w.r.t\onedot}
```
```{=latex}
\newcommand{\dof}{d.o.f\onedot}
```
```{=latex}
\newcommand{\iid}{i.i.d\onedot}
```
```{=latex}
\newcommand{\wolog}{w.l.o.g\onedot}
```
```{=latex}
\newcommand{\etal}{\emph{et al}\onedot}
```
```{=latex}
\newcommand{\passmetric}[1]{\smalltextsc{Pass@}#1\xspace}
```
```{=latex}
\newcommand{\bestmetric}[1]{\smalltextsc{Best@}#1\xspace}
```
```{=latex}
\newcommand{\placeholder}[1]{%
\fcolorbox{black!15}{black!2}{%
%
\textcolor{black!25}{\scriptsize [placeholder]}%
}%
}
```
```{=latex}
\newcommand{\blue}[1]{\textcolor{blue}{#1}}
```
```{=latex}
\newcommand{\suggestion}[1]{%
    \begin{tcolorbox}[
        colback=white!90!gray,
        colframe=teal!60!black,
        arc=5pt,
        boxsep=5pt,
        left=10pt, right=10pt, top=2pt, bottom=2pt,
        boxrule=0.8pt,
        drop shadow=gray!50!white,
        enhanced jigsaw,
        before skip=8pt, after skip=8pt
    ]
    \noindent\faBookmark\textbf{Suggestion.} #1
    \end{tcolorbox}
}
```
```{=latex}
\renewcommand{\abscontent}{
    \noindent
    \centerline{\fontsize{14pt}{14pt}\selectfont\textbf{Abstract}}
    \parbox{\dimexpr\linewidth}{\absfont \theabstract}
    \@ifundefined{@keywords}{}{\vskip1em \noindent \keywordsfont Keywords: \@keywords}
}
```
```{=latex}
\renewcommand{\maketitle}{\bgroup\setlength{\parindent}{0pt}

    \begin{adjustwidth}{0pt}{0pt}
        \begin{flushleft}
            {{ \titlefont \@title\par}%
             \vskip10pt
             { \@author\par}
             \vskip10pt}%
        \end{flushleft}
    \end{adjustwidth}
    \egroup
    {{\abscontent}}%
    \thispagestyle{firststyle}
}
```
```{=latex}
\newcommand{\wwwicon}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{Logos/internet-icon.pdf}}\xspace}
```
```{=latex}
\newcommand{\ghicon}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{Logos/github-logo.pdf}}\xspace}
```
```{=latex}
\newcommand{\hficon}{\raisebox{-1.5pt}{\includegraphics[height=1.05em]{Logos/hf-logo.pdf}}\xspace}
```
```{=latex}
\renewcommand\Authfont{\normalfont\fontsize{11}{15}\selectfont}
```
```{=latex}
\renewcommand\Affilfont{\normalfont\fontsize{11}{15}\selectfont}
```
```{=latex}
\fancypagestyle{firststyle}{%
    \fancyhead[L]{}\fancyhead[C]{}\fancyhead[R]{}%
    \fancyfoot[L]{\footerfont\textbf{Project Page:}~\href{https://raev2.github.io}{https://raev2.github.io}}%
    \fancyfoot[C]{}\fancyfoot[R]{}%
    \renewcommand{\headrulewidth}{1pt}%
    \renewcommand{\footrulewidth}{1pt}%
}
```
```{=latex}
\maketitle
```
```{=latex}
\fancyhead[C]{\footerfont Improved Baselines with Representation Autoencoders}
```
`\noindent`{=latex}

![image](assets/fig1a_rfid_gfid.png){width="\\textwidth"}

```{=latex}
\hfill
```
![image](assets/fig_raev2_gfid_convergence_cfg.png){width="\\textwidth"}

```{=latex}
\vskip -0.05in
```
```{=latex}
\captionof{figure}{\textbf{Improved Representation Autoencoders.} \textbf{Left:} RAEv2 exhibits Pareto-optimal reconstruction-generation performance at half the encoder FLOPs. $\ddagger$ denotes VAE / RAE / RAEv2 trained only on ImageNet. Training on more data (e.g., text) can further help reconstruction \cite{raet2i} (see \fref{fig:recon_qualitative_additional_data}). \textbf{Right:} Over $10\times$ faster convergence, achieving a state-of-the-art gFID of 1.06 in just 80 epochs. }
```
`\label{fig:hero}`{=latex}

```{=latex}
\newpage
```
`\hypersetup{linkcolor=black}`{=latex} `\tableofcontents`{=latex} `\newpage`{=latex}

Introduction {#sec:intro}
============

```{=latex}
\begin{figure*}[t]

\begin{subfigure}[b]{0.24\textwidth}

    \includegraphics[width=\textwidth]{assets/fig2a_better_generation_fdr.pdf}
    \caption{Better generation}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.24\textwidth}

    \includegraphics[width=\textwidth]{assets/fig2b_faster_convergence.pdf}
    \caption{Faster convergence}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.24\textwidth}

    \includegraphics[width=\textwidth]{assets/fig2c_reconstruction.pdf}
    \caption{Better reconstruction}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.24\textwidth}

    \includegraphics[width=\textwidth]{assets/fig2d_nfe.pdf}
    \caption{Efficient inference}
\end{subfigure}
\caption{\textbf{Improved performance.} RAEv2 improves over RAE on (a) generation performance: achieving FDr$^6$ \cite{fdr} of 2.17 in just 80 epochs versus 3.26 for RAE (800 epochs), without any post-training; (b) convergence speed: improving $\epfid$ (epochs to reach unguided gFID $\le 2$) from 177 to 35 (see \sref{subsec:convergence}, \tref{tab:fd_eval}); (c) reconstruction quality; and (d) inference efficiency: reusing the REPA head for guidance eliminates the need for a separate model (AutoGuidance) or an extra forward pass (CFG).}
% on FD$_r^6$ ($3.26 \rightarrow 2.17$). (b) Over $5\times$ faster training measured by eFID-$k$ at $k{=}2$ (epochs to reach gFID $\le 2$; \sref{sec:experiments}). (c) Better reconstruction over RAE on rFID ($0.570 \rightarrow 0.190$), comparable to popular VAEs. (d) Half the inference cost using the REPA head for self-guidance (\sref{subsec:rae_x_prediction}), removing the extra forward pass required by CFG.}
\label{fig:highlights}
\end{figure*}
```
Representation Autoencoders (RAE) [@rae] have emerged as a powerful framework for replacing traditional VAEs in diffusion transformer training [@repa; @rae; @repae; @irepa; @fae], moving a step closer to unified tokenization for both understanding and generation. However, several obstacles to practical adoption remain: 1) reconstruction performance lags behind specialized VAEs; 2) RAE is incompatible with traditional classifier-free guidance (CFG) [@rae] and instead requires training a secondary, weaker diffusion model for AutoGuidance [@autoguidance], which adds compute and complexity; and 3) the encoder representations themselves remain underexplored, with prior work defaulting to final-layer features.

In this paper, we systematically investigate several design choices and find three key insights which significantly simplify and accelerate RAE training.

```{=latex}
\footnotetext{RAEv2 trains within $\sim$10.5 hours on our setup, compared to $>$1 week for 800 epochs in RAE \cite{rae}.}
```
`\noindent`{=latex} **Generalized Representation Autoencoder.** Prior works typically consider only the final layer output of a pretrained vision encoder as the representation for RAE. However, the representation from a pretrained encoder is not just its final layer; rich and diverse abstractions exist across all layers. We propose a generalized, training-free formulation that simply defines the encoder output as the sum of its last $k$ layers. We find that simply varying $k$ allows easy control over reconstruction quality, leading to Pareto-optimal performance for both reconstruction and generation (Fig. `\ref{fig:hero}`{=latex}, `\ref{fig:highlights}`{=latex}, `\ref{fig:recon_qualitative}`{=latex}).

`\noindent`{=latex} **RAE and REPA exhibit complementary working mechanisms.** We next study the prevailing assumption [@rae; @riprepa; @chang2026dino] that RAE (using a pretrained representation as the latent space encoder) eliminates the need for REPA [@repa], which distills the *same representation* to intermediate diffusion layers. Since RAE already uses encoder features as input, distilling them again to intermediate layers appears to be a wasteful skip connection. We perform a large-scale empirical analysis across 27 vision encoders to study the working mechanisms of RAE and REPA. The results are surprising: RAE and REPA operate through complementary mechanisms. RAE provides a more semantically rich latent space, while REPA improves the spatial structure of intermediate diffusion features [@irepa]. This motivates using the same representation as both the encoder (RAE) and the target for intermediate layers (REPA). Furthermore, this complementarity means that stronger encoders (e.g., DINOv3-L), which perform well on both global and spatial measures [@irepa; @simeoni2025dinov3], also yield better generation performance (`\sref{subsec:rae_repa_orthogonal}`{=latex}).

`\noindent`{=latex} **REPA is x-prediction in RAE latent space.** The original RAE struggles with traditional classifier-free guidance (CFG) and instead relies on AutoGuidance [@autoguidance], which requires training a secondary, weaker diffusion model, adding compute and complexity. We observe a key property: when used with RAE, the REPA prediction head performs x-prediction in the target representation space. By reformulating the output head to also perform x-prediction [@jit], we find that the REPA head itself can serve as the weaker baseline for internal-guidance [@internalguidance]. This eliminates the need for a separate model (AG) entirely. Moreover, unlike CFG, which requires an additional unconditional forward pass (doubling the number of function evaluations at inference), internal-guidance [@internalguidance] with the REPA head in x-prediction space is computed within the same forward pass, effectively halving the NFEs.

`\noindent`{=latex} **Training efficiency.** We combine these insights into an improved baseline, RAEv2, which converges over $10\times$ faster than the original RAE and achieves a state-of-the-art gFID of 1.06 in just 80 epochs. On the recently proposed FDr$^k$ metric [@fdr], RAEv2 achieves 2.17 in just 80 epochs, compared to the previous best of 3.26 (800 epochs), without any post-training. Given the improved convergence speed of RAEv2, we believe that incremental improvements in gFID may provide little signal for practical applications; the training efficiency of a method is a more useful signal. Motivated by recent speedruns in the language domain [@modded_nanogpt_2024], we therefore report $\epfidk$ (epochs to reach unguided gFID $\le k$) as a measure of training efficiency (`\tref{tab:fd_eval}`{=latex}). Notably, RAE marks a large jump over prior work, reducing $\epfid$ from 480 to 177. RAEv2 further improves training efficiency, reaching an $\epfid$ of just 35 epochs. We also validate our approach across diverse settings, including text-to-image generation and navigation world models [@bar2024nwm] (`\sref{subsec:generalization}`{=latex}), showing consistent improvements.
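As a minimal sketch of how $\epfidk$ could be read off a training curve, the snippet below returns the first evaluated epoch at which the unguided gFID falls to $k$ or below. The function name `epochs_to_reach_gfid` and the toy curve are hypothetical and purely illustrative; they are not results from this paper.

```python
from typing import Optional, Sequence, Tuple


def epochs_to_reach_gfid(
    curve: Sequence[Tuple[int, float]], k: float
) -> Optional[int]:
    """Return the first evaluated epoch whose unguided gFID is <= k.

    `curve` holds (epoch, unguided_gfid) pairs. Returns None when the
    threshold is never reached within the evaluated epochs.
    """
    for epoch, gfid in sorted(curve):
        if gfid <= k:
            return epoch
    return None


# Made-up numbers for illustration only (not measurements from the paper).
toy_curve = [(10, 5.2), (20, 3.1), (40, 1.9), (80, 1.6)]
print(epochs_to_reach_gfid(toy_curve, k=2.0))  # 40
print(epochs_to_reach_gfid(toy_curve, k=1.0))  # None
```

Under this reading, the resolution of the metric is limited by how often gFID is evaluated during training.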

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/fig_recon_comparison_traffic-sign-05-main.pdf}
\vskip -0.1in
\includegraphics[width=\textwidth]{assets/fig_recon_comparison_handwritten-text-v2-cropped-main-uncapped.pdf}
\vskip -0.05in
\caption{\textbf{Qualitative reconstruction comparison.} $^\spadesuit$ denotes trained only on ImageNet.
RAEv2, despite being trained only on ImageNet, performs competitively with proprietary VAEs. Training on more data (e.g., text) can further help reconstruction \cite{raet2i} (see \fref{fig:recon_qualitative_additional_data}). Results use DINOv3-L (K=23) for RAEv2.}
\label{fig:recon_qualitative}
\vskip -0.1in
\end{figure*}
```
```{=latex}
\suggestion{Incremental improvements in absolute gFID values may provide limited signal for practical applications. Inspired by recent speedruns in the language domain, we also report \emph{training convergence} using $\epfidk$ (epochs to reach unguided gFID $\le k$) (see Table~\ref{tab:fd_eval}).}
```
Improved Representation Autoencoders {#sec:method}
====================================

We next discuss the improved baseline, analyzing three insights for improving and simplifying RAE. First, in `\sref{subsec:generalized_rae}`{=latex} we generalize the RAE formulation to treat the encoder representation not as a single final-layer feature but as a signal distributed across all layers. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces) [@raet2i]. Next, in `\sref{subsec:rae_repa_orthogonal}`{=latex} we perform a large-scale empirical analysis and find that RAE and REPA exhibit complementary working mechanisms. As a result, using the same representation as both encoder and intermediate target not only consistently improves generation, but also allows stronger encoders (e.g., DINOv3-L), which excel in both global and spatial performance, to achieve better generation with RAEv2. Finally, in `\sref{subsec:rae_x_prediction}`{=latex} we show that REPA, when applied with RAE, can be viewed as performing x-prediction [@jit] in the target latent space. We therefore propose a simple reformulation that allows the REPA prediction head itself to be used for guidance.

Generalized Representation Encoder {#subsec:generalized_rae}
----------------------------------

Prior work on RAE usually takes the encoder output to be the final-layer feature of a pretrained vision encoder. However, different layers of a pretrained encoder capture complementary features [@bolya2025PerceptionEncoder]. As shown in `\fref{fig:per_layer_props}`{=latex}, feature visualizations and spatial self-similarity patterns vary substantially across depth, with later layers emphasizing global semantics and earlier-to-middle layers retaining finer spatial structure. The final layer alone is therefore not always the most informative signal for generation. A natural question arises: *\`\`instead of just relying on the final layer features, can we leverage features *across all layers* without introducing additional parameters or training cost?"*

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/fig_encoder_var.pdf}
\vskip -0.05in
\caption{\textbf{RAE does not eliminate need for REPA.} Prevailing assumptions \cite{rae, riprepa, chang2026dino} say that using the pretrained representation (\eg, DINOv2) as both encoder and target of intermediate representations wastes model capacity by introducing a skip connection. Surprisingly, we instead find that RAE and REPA when used together work through complementary working mechanisms (\sref{subsec:rae_repa_orthogonal}). This leads to consistent improvements in generation performance across all pretrained representations.
}
\label{fig:encoder_sweep}
\vskip -0.1in
\end{figure*}
```
`\Paragraph{Naive concatenation is impractical.}`{=latex} A direct way to use multi-layer features is to concatenate them along the sequence or channel dimension. For an encoder with $L$ layers, each producing $N$ tokens of dimension $d$, concatenating along the sequence dimension yields an $LN \times d$ latent sequence. While lossless, this greatly increases the latent sequence length, making the resulting latent space substantially more expensive for the diffusion model. Concatenating along the channel dimension instead yields an $N \times Ld$ latent, which significantly increases the latent dimension and makes the diffusion model harder to learn [@ldit].

We instead consider two approaches that combine features across the last $K$ layers while preserving the original latent shape $N \times d$. Let $\vz_\ell \in \mathbb{R}^{N \times d}$ denote the feature map at layer $\ell$ of an $L$-layer encoder.
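To make the size blow-up concrete, the short sketch below computes the latent shape under both naive concatenation schemes. The helper name and the values $L{=}24$, $N{=}256$, $d{=}1024$ are illustrative assumptions rather than a configuration reported here.

```python
from typing import Tuple


def concat_latent_shapes(
    num_layers: int, num_tokens: int, dim: int
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Latent shapes produced by the two naive concatenation schemes."""
    sequence_concat = (num_layers * num_tokens, dim)  # LN x d
    channel_concat = (num_tokens, num_layers * dim)   # N x Ld
    return sequence_concat, channel_concat


# Assumed, illustrative sizes: L = 24 layers, N = 256 tokens, d = 1024.
seq_shape, ch_shape = concat_latent_shapes(24, 256, 1024)
print(seq_shape)  # (6144, 1024)
print(ch_shape)   # (256, 24576)
```

With these sizes, the sequence is 24 times longer in the first case and each token is 24 times wider in the second, whereas the original latent is $256 \times 1024$.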

-   **Simple addition.** The encoder output is defined as the sum of the last $K$ layer features. In high-dimensional spaces, addition preserves the geometric structure of the underlying subspaces [@wiki:dimreduction]: $$\vx \;=\; \sum_{\ell = L-K+1}^{L} \vz_\ell \;\in\; \mathbb{R}^{N \times d}.$$

-   **Random-matrix projection.** We concatenate the last $K$ layer features along the channel dimension and project back to $d$ with a fixed random matrix $\mR \in \mathbb{R}^{Kd \times d}$ (sampled once at initialization, `\eg `{=latex}i.i.d. Gaussian, and held fixed). Random projections are a standard tool in dimensionality reduction [@wiki:dimreduction] and preserve pairwise distances in expectation: $$\vx \;=\; \big[\, \vz_{L-K+1} \,\big\|\, \cdots \,\big\|\, \vz_L \,\big]\, \mR \;\in\; \mathbb{R}^{N \times d}.$$

The original RAE is thus a special case of this generalized formulation with $K{=}1$, i.e., just the final layer. Both approaches keep the latent footprint identical to the original RAE and add no extra learned parameters. We defer a head-to-head empirical comparison of the two to `\sref{sec:experiments}`{=latex}.
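The two aggregation schemes above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the per-layer features are assumed to be stacked into an array of shape $(L, N, d)$, and the $1/\sqrt{d}$ scaling of the random matrix is an assumption (the text only specifies a fixed i.i.d. Gaussian matrix).

```python
import numpy as np


def aggregate_sum(layer_feats: np.ndarray, k: int) -> np.ndarray:
    """Simple addition: sum the last k layer features.

    layer_feats has shape (L, N, d); the result has shape (N, d).
    """
    return layer_feats[-k:].sum(axis=0)


def make_random_projection(k: int, d: int, seed: int = 0) -> np.ndarray:
    """Fixed Gaussian matrix R of shape (k*d, d), sampled once and never trained.

    The 1/sqrt(d) scaling is an assumption of this sketch; it keeps squared
    norms unchanged in expectation.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k * d, d)) / np.sqrt(d)


def aggregate_random_projection(
    layer_feats: np.ndarray, k: int, proj: np.ndarray
) -> np.ndarray:
    """Concatenate the last k layers along channels, then project back to d."""
    last_k = layer_feats[-k:]                            # (k, N, d)
    concat = np.concatenate(list(last_k), axis=-1)       # (N, k*d)
    return concat @ proj                                 # (N, d)


# Toy features: L = 4 layers, N = 3 tokens, d = 8 channels.
rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 3, 8))
proj = make_random_projection(k=3, d=8)

x_sum = aggregate_sum(feats, k=3)
x_proj = aggregate_random_projection(feats, k=3, proj=proj)
print(x_sum.shape, x_proj.shape)  # (3, 8) (3, 8)
```

Setting `k=1` in `aggregate_sum` returns the final-layer feature unchanged, which corresponds to the original RAE.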

```{=latex}
\finding{1}{Generalized Representation Encoders.}{Pretrained vision encoders are more than their final layer. Simply aggregating features across layers of a pretrained vision encoder greatly improves reconstruction without encoder finetuning or specialized data
(e.g., text, faces).
% (Fig.~\ref{fig:hero},\ref{fig:rfid_gfid})
% without any specialized training data e.g., text, faces
}
```
RAE and REPA exhibit Complementary Working Mechanisms {#subsec:rae_repa_orthogonal}
-----------------------------------------------------

```{=latex}
\Paragraph{Empirical results.}
```
We next study the prevailing assumption [@rae; @riprepa; @chang2026dino] that RAE eliminates the need for REPA. Since RAE already uses encoder features as input, distilling them again to intermediate layers appears to be a wasteful skip connection. To test this, we first perform a large-scale empirical analysis, using the same representation as both the encoder and the target at intermediate diffusion layers (see `\fref{fig:encoder_sweep}`{=latex}). The results are surprising. Across all encoders, instead of hurting performance, using REPA with RAE consistently leads to better generation performance. This suggests a fundamental difference in how representation alignment (REPA) and RAE benefit diffusion training.

```{=latex}
\Paragraph{Working mechanism.}
```
We next analyze how REPA impacts diffusion features when combined with RAE. As shown in `\fref{fig:working_mechanism}`{=latex}, adding REPA on top of RAE has minimal impact on the peak *global semantic information* (measured through linear probing) of diffusion features. Instead, we observe that REPA improves the spatial self-similarity structure of the learned diffusion features (i.e., how different tokens attend to each other), an intriguing phenomenon recently identified in iREPA [@irepa]. This suggests complementary working mechanisms for REPA and RAE: RAE provides a semantically rich latent space for diffusion, while REPA regularizes the token-token similarity structure in intermediate diffusion features.
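One simple way to inspect token-token similarity structure is a cosine self-similarity matrix over the tokens of a feature map. The sketch below is a generic illustration under that assumption; the exact quantity visualized in the figure (and the LDS score of iREPA) may be defined differently.

```python
import numpy as np


def token_self_similarity(feats: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity between every pair of tokens.

    feats has shape (N, d); the result has shape (N, N).
    """
    norms = np.linalg.norm(feats, axis=-1, keepdims=True)
    normed = feats / (norms + eps)  # eps guards against zero-norm tokens
    return normed @ normed.T


# Toy feature map with N = 5 tokens and d = 16 channels.
rng = np.random.default_rng(0)
toy_feats = rng.standard_normal((5, 16))
sim = token_self_similarity(toy_feats)
print(sim.shape)  # (5, 5)
```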

```{=latex}
\Paragraph{Correlation analysis.}
```
To further validate the complementary mechanisms of RAE and REPA, we follow the practice in iREPA [@irepa] and analyze the ImageNet linear probing accuracy (LP) and local distance similarity score (LDS) [@irepa] across 27 vision encoders, reporting their Pearson correlation $r$ with generation quality (gFID, lower is better). As shown in `\fref{fig:repr_correlation}`{=latex}, for REPA alone (with VAE), LDS is highly predictive ($|r|{=}0.89$) while LP is actually anticorrelated with generation quality ($r{=}{+}0.34$ with gFID), consistent with findings in [@irepa]. In contrast, when using RAE alone, LP dominates ($|r|{=}0.81$) while LDS barely correlates ($|r|{=}0.13$). When combining RAE with REPA, neither metric alone is strongly predictive, but the average of LP (global semantics) and LDS (spatial structure) achieves the highest correlation ($|r|{=}0.83$). This supports the view that RAE and REPA operate through complementary mechanisms: RAE leverages global semantics while REPA regularizes spatial structure.
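The correlation analysis can be sketched as follows. For illustration only, the snippet uses the five encoder rows listed in `\tref{tab:encoder}`{=latex} (RAEv2 column), so its output will not match the 27-encoder values quoted above. The normalization LP$'$ = LP/100 and LDS$'$ = LDS is an assumption; it happens to reproduce the Avg(LP', LDS') column of that table up to rounding.

```python
import numpy as np


def pearson_r(x, y) -> float:
    """Pearson correlation coefficient between two 1-D sequences."""
    return float(np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1])


# Rows: MoCov3-B, WebSSL-1B, DINOv3-B, DINOv2-B, DINOv3-L.
lp = np.array([76.4, 84.1, 84.5, 83.9, 87.0])    # linear probing accuracy (%)
lds = np.array([0.15, 0.18, 0.38, 0.41, 0.42])   # local distance similarity
gfid = np.array([8.35, 4.16, 2.76, 2.81, 2.61])  # RAEv2 gFID at 20 epochs

# Assumed normalization: LP' = LP / 100 and LDS' = LDS.
avg = 0.5 * (lp / 100.0 + lds)

# A negative r means a higher score goes with a lower (better) gFID.
for name, values in [("LP", lp), ("LDS", lds), ("Avg", avg)]:
    print(f"{name}: r = {pearson_r(values, gfid):+.2f}")
```

Because Pearson correlation is invariant to positive affine rescaling, the normalization only affects the averaged score, not the individual LP and LDS correlations.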

```{=latex}
\begin{figure*}[t]

\begin{subfigure}[b]{0.29\textwidth}

\includegraphics[width=\linewidth]{assets/fig_wm_lp.pdf}
\caption{Impact of REPA on global semantics with RAE (DINOv2-B)}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.69\textwidth}

\includegraphics[width=\linewidth]{assets/fig_wm_selfsim-v1.pdf}
\caption{Impact of REPA on spatial structure with RAE (DINOv2-B)}
\end{subfigure}
% \vskip -0.1in
\caption{\textbf{Working mechanism of REPA with RAE.}
While REPA applied with RAE  has minimal impact on global semantics, it significantly improves spatial structure \cite{irepa} of diffusion features.}
% Interestingly, unlike REPA with VAE \cite{repa, irepa}, REPA with RAE has minimal impact on global semantics. While impact on global semantics is minimal, }
\label{fig:working_mechanism}
\vskip -0.2in
\end{figure*}
```
```{=latex}
\vskip -0.1in
```
![REPA alone (SD-VAE)](assets/fig_corr_repa.png){width="\\linewidth"}

```{=latex}
\hfill
```
![RAE alone](assets/fig_corr_rae.png){width="\\linewidth"}

```{=latex}
\vskip 0.05in
```
![RAE + REPA](assets/fig_corr_rae_repa.png){width="\\linewidth"}

```{=latex}
\hfill
```
```{=latex}
\scriptsize
```
```{=latex}
\setlength{\tabcolsep}{1.5mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
  Method                                     LP ($r$)$\downarrow$   LDS ($r$)$\downarrow$   Avg ($r$)$\downarrow$
  ----------------------------------------- ---------------------- ----------------------- -----------------------
  REPA alone                                        +0.34                 **-0.89**                 -0.56
  RAE alone                                       **-0.81**                 -0.13                   -0.55
  `\rowcolor{black!5}`{=latex} RAE + REPA           -0.64                   -0.53                 **-0.83**

```{=latex}
\vskip -0.2in
```
`\Paragraph{Selecting the best representation.}`{=latex} The above complementarity also means that stronger representations (e.g., DINOv3-L), which perform well on both global (LP) and spatial (LDS) measures, yield better generation with RAEv2. We defer the detailed encoder-selection study to `\sref{subsec:ablations}`{=latex}.

```{=latex}
\finding{2}{RAE and REPA exhibit complementary working mechanisms.}{RAE leverages semantic quality  while REPA regularizes spatial structure.
% Their combination achieves the best correlation with generation quality.
This complementary nature allows using the same pretrained representation as both encoder (RAE) and target for intermediate diffusion features (REPA). This also explains why stronger representations like DINOv3-L, which excel in both global and spatial performance, achieve the best generation with RAEv2 (see \tref{tab:encoder}).}
```
Reformulating REPA as x-prediction with RAE {#subsec:rae_x_prediction}
-------------------------------------------

We next show that, when used with RAE, the REPA head itself can be used for guidance, eliminating the need to train a second, weaker diffusion model (AutoGuidance) or to run an additional forward pass (CFG).

```{=latex}
\begin{wraptable}{r}{0.45\textwidth}

\vskip -0.15in
\scriptsize
% \small
\setlength{\tabcolsep}{2.5mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{lcc}
\toprule
Guidance & gFID$\downarrow$ & IS$\uparrow$ \\
\midrule
w/o Guidance              & 3.75 & 198.7 \\
CFG~\cite{cfg}             & 3.86 & 276.4 \\
\rowcolor{black!5} Autoguidance (AG)~\cite{autoguidance} & \textbf{3.31} & 219.1 \\
\bottomrule
\end{tabular}
\vskip -0.05in
\caption{RAE DiT$^{DH}$-XL, 20 epochs.}
\vskip -0.15in
% \caption{\textbf{Original RAE struggles with traditional CFG.} CFG fails to improve gFID; AG helps but requires a separate weaker model and an extra forward pass. Original RAE setup: DINOv2-B, DiT$^{DH}$-XL, 20 epochs.}
\label{tab:guidance}
\end{wraptable}
```
```{=latex}
\Paragraph{RAE struggles with traditional CFG.}
```
As shown in `\tref{tab:guidance}`{=latex}, we confirm that the original RAE [@rae] struggles with standard classifier-free guidance [@cfg]: CFG does not improve gFID over the unguided model (3.86 vs. 3.75). RAE therefore relies on AutoGuidance [@autoguidance], a separate, smaller model trained to serve as the \`\`weaker" baseline, adding compute and complexity.

```{=latex}
\Paragraph{REPA is x-prediction in RAE latent space.}
```
We observe a key connection. In RAE, the clean latent *is* the encoder representation: $\vx = E(\vI)$. The REPA projection head $h_\phi$ maps early-layer intermediate features $\vh$ to predict the clean latent $\hat{\vx}_{\text{repa}} = h_\phi(\vh)$. This is exactly x-prediction [@jit] in the RAE latent space. Importantly, because $h_\phi$ is a lightweight MLP that only accesses early-layer features, its prediction is inherently weaker than the full model's, playing the same role as the separately trained smaller model in AutoGuidance [@autoguidance].

```{=latex}
\Paragraph{Reformulating REPA head for guidance.}
```
If we reformulate the full model output to also give x-prediction instead of velocity [@dit], both outputs live in the same space. Let $\hat{\vx}_{\text{full}}$ denote the full model's x-prediction (all layers) and $\hat{\vx}_{\text{repa}}$ the REPA head's x-prediction (early layers only). We can then apply internal-guidance [@internalguidance] directly as $$\hat{\vx}_{\text{guided}} = \hat{\vx}_{\text{full}} + w \cdot (\hat{\vx}_{\text{full}} - \hat{\vx}_{\text{repa}}),
\label{eq:self_guidance}$$ and convert back to velocity for sampling or loss computation: $\vv = (\vx_t - \hat{\vx}_{\text{guided}}) / t$. The REPA head runs during the same forward pass as the main model, so this requires neither a separately trained weaker model (AG [@autoguidance]) nor an additional forward pass (CFG [@cfg]).
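A minimal NumPy sketch of the guidance step and the conversion back to velocity is given below. The function names are hypothetical, and the two predictions are random stand-ins for the outputs of a single forward pass. The interpolation $\vx_t = (1-t)\,\vx + t\,\boldsymbol{\epsilon}$ is an assumption of this sketch: it is the convention under which $(\vx_t - \vx)/t$ equals the velocity $\boldsymbol{\epsilon} - \vx$.

```python
import numpy as np


def internal_guidance(x_full: np.ndarray, x_repa: np.ndarray, w: float) -> np.ndarray:
    """Extrapolate away from the weaker REPA-head prediction.

    Implements x_guided = x_full + w * (x_full - x_repa).
    """
    return x_full + w * (x_full - x_repa)


def x_pred_to_velocity(x_t: np.ndarray, x_pred: np.ndarray, t: float) -> np.ndarray:
    """Convert an x-prediction to a velocity, v = (x_t - x_pred) / t, for t > 0."""
    return (x_t - x_pred) / t


rng = np.random.default_rng(0)
x_clean = rng.standard_normal((4, 8))  # toy clean latent of shape (N, d)
noise = rng.standard_normal((4, 8))
t = 0.7
x_t = (1.0 - t) * x_clean + t * noise  # assumed interpolation convention

# Random stand-ins for the two predictions produced in one forward pass.
x_full = x_clean + 0.05 * rng.standard_normal((4, 8))  # stronger prediction
x_repa = x_clean + 0.30 * rng.standard_normal((4, 8))  # weaker prediction

x_guided = internal_guidance(x_full, x_repa, w=1.5)
v_guided = x_pred_to_velocity(x_t, x_guided, t)

# Sanity check: a perfect x-prediction recovers the velocity noise - x_clean.
assert np.allclose(x_pred_to_velocity(x_t, x_clean, t), noise - x_clean)
```

With `w=0` the guided prediction reduces to the full model's prediction, so the guidance scale interpolates between no guidance and stronger extrapolation.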

Thus, *when used with RAE*, our formulation is equivalent to a deeply supervised network [@lee2015deeply] or internal-guidance [@internalguidance], with an additional reparameterization to x-prediction. The reparameterization to x-prediction [@jit] is important because it allows the REPA head both to supervise the spatial structure of intermediate layers (`\sref{subsec:rae_repa_orthogonal}`{=latex}) and to act as a weaker baseline for guidance. Please see `\tref{tab:xpred_ablation}`{=latex} for an ablation on the importance of the reparameterization to x-prediction.

```{=latex}
\finding{3}{REPA enables self-guidance.}{
REPA is x-prediction in RAE latent space. By reformulating the output head as x-prediction as well, the REPA head itself can be used for internal-guidance, eliminating the need for a separate model (AG) or an extra forward pass (CFG).}
```
Experiments {#sec:experiments}
===========

We validate the performance of our approach through extensive experiments on ImageNet, text-to-image generation and world models. In particular, we investigate the following research questions:

-   Does the improved training recipe consistently improve convergence speed with representation autoencoders across diverse settings, model scales, etc.? (Fig. `\ref{fig:hero}`{=latex}, `\ref{fig:highlights}`{=latex}, `\ref{fig:encoder_sweep}`{=latex}, `\ref{fig:convergence}`{=latex}, `\ref{fig:rfid_gfid}`{=latex}; Tab. `\ref{tab:encoder}`{=latex}, `\ref{tab:convergence_guidance}`{=latex}, `\ref{tab:model_scale}`{=latex}, `\ref{tab:fd_eval}`{=latex})

-   Can the generalized RAE formulation improve the reconstruction performance of representation autoencoders in a training-free manner? (Fig. `\ref{fig:hero}`{=latex}, `\ref{fig:recon_qualitative}`{=latex}, `\ref{fig:k_sweep}`{=latex}, `\ref{fig:rfid_gfid}`{=latex}; Tab. `\ref{tab:genrae_formulation}`{=latex}, `\ref{tab:rec_comparison}`{=latex})

-   Does the proposed approach generalize to diverse training settings including text-to-image generation and world models? (Fig. `\ref{fig:qualitative_t2i_main}`{=latex}, `\ref{fig:nwm_horizon}`{=latex}, `\ref{fig:nwm_qualitative}`{=latex}; Tab. `\ref{tab:nwm_fvd}`{=latex}, `\ref{tab:t2i}`{=latex})

Ablation Studies {#subsec:ablations}
----------------

We first ablate different design choices for different components proposed in `\sref{sec:method}`{=latex} on ImageNet-256. Unless otherwise specified we use DiT$^{DH}$-XL, DINOv3-L as encoder and batch size 1024.

`\noindent`{=latex} `\sethlcolor{green!10}`{=latex}`\hl{\textbf{Encoder selection.}}`{=latex} Results are shown in `\tref{tab:encoder}`{=latex}. The original RAE [@rae] uses DINOv2-B as its encoder because it gives the best generation among the encoders tested under the RAE recipe. With RAEv2, however, the picture changes: stronger representations such as DINOv3-B [@simeoni2025dinov3] yield better generation, despite performing worse than DINOv2-B under the original RAE recipe. This is consistent with the correlation analysis in `\sref{subsec:rae_repa_orthogonal}`{=latex}: stronger representations (e.g., DINOv3-L), which excel in both global semantics and spatial structure, lead to the best generation with RAEv2. Based on this finding, we use DINOv3-L as the default encoder for all subsequent RAEv2 experiments.

`\noindent`{=latex} `\sethlcolor{Plum!10}`{=latex}`\hl{\textbf{Formulation for generalized RAE.}}`{=latex} In `\tref{tab:genrae_formulation}`{=latex}, we compare the two parameter-free aggregation schemes from `\sref{subsec:generalized_rae}`{=latex} on DINOv3-L: simple addition of the last $K$ encoder layers (**MLS**) versus fixed random-matrix projection of their channel-wise concatenation (**MLR**). Interestingly, while both methods are effectively tied on Stage-1 reconstruction, MLS consistently wins on Stage-2 performance. We therefore use MLS as the default aggregation in the rest of the paper.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{3mm}
```
::: {#tab:encoder}
  Encoder                                                      LP $\uparrow$   LDS $\uparrow$   Avg(LP', LDS') $\uparrow$   RAE gFID $\downarrow$   RAEv2 ($k{=}1$) gFID $\downarrow$
  ------------------------------------------------------------ --------------- ---------------- --------------------------- ----------------------- -----------------------------------
  MoCov3-B [@mocov3]                                           76.4            0.15             0.46                        13.84                   8.35
  WebSSL-1B [@fan2025scaling]                                  84.1            0.18             0.51                        8.60                    4.16
  DINOv3-B [@simeoni2025dinov3]                                84.5            0.38             0.61                        4.25                    2.76
  DINOv2-B [@dinov2]                                           83.9            0.41             0.62                        3.75                    2.81
  `\rowcolor{black!5}`{=latex} DINOv3-L [@simeoni2025dinov3]   **87.0**        **0.42**         **0.65**                    **3.30**                **2.61**

  : `\sethlcolor{green!10}`{=latex}`\hl{\textbf{Ablation on choice of pretrained vision encoder.}}`{=latex} Encoder properties (LP, LDS, and their average) and gFID at 20 epochs (DiT$^{\text{DH}}$-XL). We observe that with RAEv2, stronger encoders, e.g., DINOv3-L, with both better global (LP) and spatial (LDS) [@irepa] representations achieve the best performance. Please refer to `\tref{tab:encoder_appendix}`{=latex} for further results.
:::

```{=latex}
\vskip 0.1in
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{2.5mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
```{=latex}
\newcommand{\gb}{\cellcolor{gray!15}}
```
::: {#tab:genrae_formulation}
   $K$  Method                 rFID $\downarrow$        gFID $\downarrow$
  ----- -------------------- ---------------------- --------------------------
    2   MLR                          0.570                    3.085
        `\gb{}`{=latex}MLS    `\gb{}`{=latex}0.532   `\gb{}`{=latex}**2.586**
    8   MLR                          0.268                    3.580
        `\gb{}`{=latex}MLS    `\gb{}`{=latex}0.264   `\gb{}`{=latex}**2.688**

  : `\sethlcolor{Plum!10}`{=latex}`\hl{\textbf{Ablation on formulation for generalized RAE.}}`{=latex} Simple addition (MLS) versus random-matrix projection (MLR) over the last $K$ layers of DINOv3-L. The two are close on reconstruction (rFID), while MLS gives consistently better generation (gFID).
:::

```{=latex}
\vskip 0.05in
```
```{=latex}
\hfill
```
```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{1.5mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
::: {#tab:convergence_guidance}
+-----------------------------------+-----------------------------+------------------------------+
| Guidance                          | gFID ($K{=}7$) $\downarrow$ | gFID ($K{=}23$) $\downarrow$ |
+:==================================+:===========================:+:============================:+
| w/o Guidance                      | 1.65                        | 3.01                         |
+-----------------------------------+-----------------------------+------------------------------+
| CFG [@cfg]                        | 1.49                        | 2.83                         |
+-----------------------------------+-----------------------------+------------------------------+
| Autoguidance (AG) [@autoguidance] | 1.14                        | 1.37                         |
+-----------------------------------+-----------------------------+------------------------------+
| ```{=latex}                       | **1.06**                    | **1.25**                     |
| \rowcolor{black!5}                |                             |                              |
| ```                               |                             |                              |
| REPA Guidance                     |                             |                              |
+-----------------------------------+-----------------------------+------------------------------+

: `\sethlcolor{yellow!10}`{=latex}`\hl{\textbf{Ablation on Guidance mechanism in RAEv2.}}`{=latex} Guidance with REPA and x-prediction achieves best results at no extra inference cost. Full results in `\tref{tab:convergence_guidance_appendix}`{=latex}.
:::

```{=latex}
\vskip 0.05in
```
`\noindent`{=latex} `\sethlcolor{cyan!10}`{=latex}`\hl{\textbf{Choice of $K$ for generalized RAE.}}`{=latex} We sweep $K \in \{1, \dots, 10, 23\}$ for generalized RAE on DINOv3-L (`\fref{fig:k_sweep}`{=latex}). Stage-1 reconstruction (rFID, PSNR) improves monotonically with $K$, reaching rFID 0.18 and PSNR 27.03 at $K{=}23$, well past the standard RAE baseline (rFID 0.60, PSNR 18.93). Stage-2 generation behaves differently: at just 80 epochs, the unguided gFID is best near $K{=}1$ (1.50), while the guided gFID is best at $K{=}7$ (1.06). Thus, the generalized RAE not only improves reconstruction but also leads to better generation performance with guidance.

![`\sethlcolor{cyan!10}`{=latex}`\hl{\textbf{Ablation on choice of $K$ for generalized RAE.}}`{=latex} (a, b) Stage-2 generation quality without and with guidance. (c, d) Stage-1 reconstruction (rFID and PSNR). All results with DINOv3-L (24 layers), DDT-XL and 80 epochs. Stage-1 reconstruction (rFID, PSNR) improves monotonically with $K$. Interestingly, the generalized RAE not only improves reconstruction performance but also leads to better generation performance with guidance (best at $K=7$).](assets/fig_k_sweep_gfid_noguide.png "fig:"){#fig:k_sweep width="\\linewidth"} `\subcaption{gFID $\downarrow$ (no guidance)}`{=latex} `\label{fig:k_sweep_gfid_noguide}`{=latex}

```{=latex}
\hfill
```
![](assets/fig_k_sweep_gfid_guide.png "fig:"){#fig:k_sweep width="\\linewidth"} `\subcaption{gFID $\downarrow$ (with guidance)}`{=latex} `\label{fig:k_sweep_gfid_guide}`{=latex}

```{=latex}
\hfill
```
![](assets/fig_k_sweep_rfid.png "fig:"){#fig:k_sweep width="\\linewidth"} `\subcaption{rFID $\downarrow$}`{=latex} `\label{fig:k_sweep_rfid}`{=latex}

```{=latex}
\hfill
```
![](assets/fig_k_sweep_psnr.png "fig:"){#fig:k_sweep width="\\linewidth"} `\subcaption{PSNR $\uparrow$}`{=latex} `\label{fig:k_sweep_psnr}`{=latex}

```{=latex}
\vskip -0.15in
```
```{=latex}
\begin{wraptable}{r}{0.35\textwidth}

\vskip -0.15in
\scriptsize
\setlength{\tabcolsep}{2mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l c}
\toprule
$K$ & LP top-1 (\%) $\uparrow$ \\
\midrule
1 (last layer; RAE) & 85.39 \\
4                   & 85.15 \\
7                   & 85.10 \\
23 (full MLS)       & 85.24 \\
\bottomrule
\end{tabular}
\vskip -0.05in
\caption{Linear probing on ImageNet across $K$ (DINOv3-L) after 30 epochs of LP training; further training may improve scores. Full sweep in \tref{tab:genrae_lp}.}
\label{tab:genrae_lp_main}
\vskip -0.2in
\end{wraptable}
```
`\noindent`{=latex} `\sethlcolor{Lavender!30}`{=latex}`\hl{\textbf{Impact of generalized RAE on understanding performance.}}`{=latex} A key advantage of RAE is that it provides a unified tokenization for both understanding and generation. We study how the generalized formulation affects the encoder's understanding performance for different $K$ in `\tref{tab:genrae_lp_main}`{=latex} ($K{=}1$ is the original RAE). The generalized formulation improves reconstruction and guided generation (`\fref{fig:k_sweep}`{=latex}) while preserving linear probing performance on ImageNet. The full sweep over $K \in \{1, \dots, 10, 23\}$ is in `\tref{tab:genrae_lp}`{=latex}.

```{=latex}
\finding{4}{}{
The generalized formulation of RAE improves reconstruction and guided generation performance (\fref{fig:k_sweep}) while preserving the global semantics of the representation space (\tref{tab:genrae_lp_main}). This enables its use as a unified tokenization for both understanding and generation.}
```
`\noindent`{=latex} `\sethlcolor{yellow!10}`{=latex}`\hl{\textbf{Guidance mechanism in RAEv2.}}`{=latex} We ablate four guidance options for RAEv2 in `\tref{tab:convergence_guidance}`{=latex}: (i) no guidance, (ii) classifier-free guidance (CFG) [@cfg], (iii) AutoGuidance (AG) [@autoguidance], and (iv) internal guidance [@internalguidance] with REPA-head and x-prediction (`\sref{subsec:rae_x_prediction}`{=latex}). CFG yields only a small gain over no guidance (gFID 1.65 $\rightarrow$ 1.49 at $K{=}7$), and AG helps (1.14) but requires an additional model and forward pass. In contrast, internal guidance with the REPA-head achieves the best gFID (1.06) at no extra inference cost.
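
The guided variants compared above can be read as instances of one extrapolation rule, which the NumPy sketch below illustrates. It assumes the standard form used by CFG and AutoGuidance, $x_{\text{guided}} = x_{\text{weak}} + w\,(x_{\text{strong}} - x_{\text{weak}})$. The function name `extrapolate`, the scale `w`, and the idea that an auxiliary-head x-prediction plays the role of the weak prediction are illustrative assumptions; the exact rule used by RAEv2 is the one defined in `\sref{subsec:rae_x_prediction}`{=latex}.

```python
import numpy as np

def extrapolate(x_strong, x_weak, w):
    """Guidance by extrapolation (assumed CFG/AutoGuidance-style rule).

    w = 1 returns x_strong unchanged; w > 1 pushes the prediction
    further away from the weaker one.
    """
    x_strong = np.asarray(x_strong, dtype=np.float64)
    x_weak = np.asarray(x_weak, dtype=np.float64)
    return x_weak + w * (x_strong - x_weak)

# Toy x-predictions in latent space (tokens x channels); values are made up.
rng = np.random.default_rng(0)
x_main = rng.normal(size=(256, 8))                  # main model output
x_head = x_main + 0.1 * rng.normal(size=(256, 8))   # hypothetical weaker prediction
x_guided = extrapolate(x_main, x_head, w=1.5)
```

Under this reading, CFG takes the weaker prediction from an unconditional pass and AG from a separately trained model, whereas an internal head would supply it from the same forward pass, which is consistent with the no-extra-cost observation above.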

Impact on Convergence Speed {#subsec:convergence}
---------------------------

`\noindent`{=latex} `\sethlcolor{black!10}`{=latex}`\hl{\textbf{Convergence speed.}}`{=latex} Results are shown in `\fref{fig:convergence}`{=latex}. We observe that across various vision encoders, RAEv2 consistently improves convergence speed over the original RAE.

```{=latex}
\begin{figure*}[t]

\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_dinov2b-fixed.pdf}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_dinov3b-fixed.pdf}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_dinov3l-fixed.pdf}
\end{subfigure}
% \\[0.5em]
\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_eupeb-fixed.pdf}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_webssl1b-fixed.pdf}
\end{subfigure}
\hfill
\begin{subfigure}[b]{0.32\textwidth}

\includegraphics[width=\linewidth]{assets/fig_convergence_exp_spatialpeb-fixed.pdf}
\end{subfigure}
\vskip -0.05in
\caption{\sethlcolor{black!10}\hl{\textbf{Convergence comparison with original RAE.}} Across DINOv2-B~\cite{dinov2}, DINOv3-B/L~\cite{simeoni2025dinov3}, EUPE-B~\cite{eupe}, WebSSL-1B~\cite{fan2025scaling}, and SpatialPE-B~\cite{bolya2025PerceptionEncoder}, the improved training recipe (RAEv2) consistently leads to faster convergence. All results: DiT$^{DH}$-XL, $K=1$ for RAEv2, 1024 batch-size.}
\label{fig:convergence}
% \vskip -0.2in
\end{figure*}
```
```{=latex}
\begin{wraptable}{r}{0.42\textwidth}

\vskip -0.2in
% \scriptsize
\small
\setlength{\tabcolsep}{1.5mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l c cc}
\toprule
Scale & \#Params & gFID (RAE) $\downarrow$ & gFID (RAEv2) $\downarrow$ \\
\midrule
B  & 165M & 5.48 & \textbf{3.37} \\
L  & 470M & 3.80 & \textbf{2.76} \\
\rowcolor{black!5} XL & 839M & 3.75 & \textbf{2.61} \\
\bottomrule
\end{tabular}
\vskip -0.05in
\caption{{Variation in model scale.}
% RAEv2 improves over RAE across all model scales. 20 epochs, no guidance.
}
\label{tab:model_scale}
\vskip -0.4in
\end{wraptable}
```
`\noindent`{=latex} `\sethlcolor{blue!10}`{=latex}`\hl{\textbf{Variation in Model scale.}}`{=latex} We further validate that the gains from RAEv2 transfer across model scales. `\tref{tab:model_scale}`{=latex} compares RAE and RAEv2 on DiT$^{DH}$-B, -L, and -XL backbones at 20 epochs: RAEv2 consistently improves generation performance across different scales.

`\noindent`{=latex} `\sethlcolor{Apricot!30}`{=latex}`\hl{\textbf{Training efficiency.}}`{=latex} Given the improved convergence speed of RAEv2 (1.06 gFID in just 80 epochs), we believe that incremental improvements in the gFID metric provide little signal for practical applications. Instead, the training efficiency of a method provides a much more useful signal (e.g., for T2I and world models; `\sref{subsec:generalization}`{=latex}). Motivated by recent speedrun efforts in the language domain [@modded_nanogpt_2024], we therefore report $\epfidk$, the number of epochs needed to reach an unguided gFID $\le k$, with $k{=}2$ in `\tref{tab:fd_eval}`{=latex}. Compared to absolute gFID, which varies little across methods, $\epfidk$ provides a much clearer signal of training efficiency. Notably, RAE marks a large jump over prior work, reducing $\epfid$ from 480 to 177. RAEv2 further improves training efficiency, achieving an $\epfid$ of just 35 epochs.
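
As a concrete reading of $\epfidk$, the Python sketch below scans a curve of (epoch, unguided gFID) checkpoints and returns the first epoch at which gFID drops to $k$ or below. The function name `epochs_to_reach`, the optional linear interpolation between checkpoints, and the toy curve are illustrative assumptions; the evaluation granularity behind the reported numbers is not specified here.

```python
def epochs_to_reach(epochs, gfids, k=2.0, interpolate=True):
    """Earliest epoch at which unguided gFID <= k, or None if never reached.

    epochs, gfids: equal-length sequences of checkpoint epochs and gFID values.
    interpolate:   if True, linearly interpolate between the two checkpoints
                   that bracket the threshold (an assumption of this sketch).
    """
    prev = None
    for ep, fid in sorted(zip(epochs, gfids)):
        if fid <= k:
            if interpolate and prev is not None:
                e0, f0 = prev  # last checkpoint still above the threshold
                return e0 + (f0 - k) * (ep - e0) / (f0 - fid)
            return float(ep)
        prev = (ep, fid)
    return None

# Made-up curve for illustration only (not data from the paper).
toy_epochs = [10, 20, 40, 80]
toy_gfids = [6.0, 3.0, 1.8, 1.5]
ep_interp = epochs_to_reach(toy_epochs, toy_gfids, k=2.0)
ep_ckpt = epochs_to_reach(toy_epochs, toy_gfids, k=2.0, interpolate=False)
```

Reporting the checkpoint-level value (`interpolate=False`) is the more conservative choice when checkpoints are sparse.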

`\noindent`{=latex} `\sethlcolor{Goldenrod!30}`{=latex}`\hl{\textbf{Evaluation with alternate metrics.}}`{=latex} We also evaluate generation quality using the recently proposed Representation Fréchet Distance (FD$_r$) [@fdr], which scores sample fidelity in six different feature spaces: Inception, ConvNeXt, DINOv2, MAE, SigLIP, and CLIP. As shown in `\tref{tab:fd_eval}`{=latex}, despite training for only 80 epochs and without any post-training, RAEv2 achieves state-of-the-art performance on both gFID and FD$_r^6$, surpassing prior baselines trained for 800 epochs.
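
The FD$_r^6$ column in `\tref{tab:fd_eval}`{=latex} appears consistent with an unweighted mean of the six per-space distances. The short Python check below illustrates this reading on two rows of that table. The averaging rule is an inference from the tabulated numbers rather than a definition taken from [@fdr], and the helper name `fd_r6` is hypothetical.

```python
def fd_r6(per_space):
    """Unweighted mean of the six per-space FD_r values (assumed aggregation)."""
    if len(per_space) != 6:
        raise ValueError("expected six feature spaces")
    return sum(per_space) / 6.0

# Rows copied from the table: Incep., ConvNeXt, DINOv2, MAE, SigLIP, CLIP.
rae_xl = [0.69, 1.79, 2.11, 3.30, 3.79, 7.87]   # reported FD_r^6: 3.26
raev2 = [0.64, 0.77, 1.15, 2.67, 2.54, 5.21]    # reported FD_r^6: 2.17

mean_rae_xl = fd_r6(rae_xl)
mean_raev2 = fd_r6(raev2)
```

Both means agree with the reported column to within the rounding of the tabulated entries.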

Impact on Reconstruction Performance {#subsec:rec_gen_analysis}
------------------------------------

`\noindent`{=latex} `\sethlcolor{purple!15}`{=latex}`\hl{\textbf{Qualitative Results.}}`{=latex} `\fref{fig:recon_qualitative}`{=latex} compares RAEv2 reconstructions qualitatively with RAE and proprietary VAEs (Flux-VAE, SDVAE, SDXL-VAE). Despite being trained only on ImageNet, RAEv2 is competitive with proprietary VAEs for reconstruction. We further compare reconstruction quality after using additional data from [@raet2i] to train the decoder (the encoder remains frozen). We find that RAEv2 reconstructs better than SDVAE and SDXL-VAE while performing competitively with Flux-VAE.

`\noindent`{=latex}

```{=latex}
\begin{wrapfigure}[8]{r}{0.45\textwidth}

\vskip -0.15in
\includegraphics[width=\linewidth]{assets/fig_rfid_gfid_20ep.pdf}
\vskip -0.05in
\caption{\sethlcolor{BurntOrange!15}\hl{\textbf{Reconstruction-generation trade-off.}}}
\label{fig:rfid_gfid}
\vskip -0.3in
\end{wrapfigure}
```
```{=latex}
\sethlcolor{BurntOrange!15}
```
```{=latex}
\hl{\textbf{Reconstruction and generation tradeoff.}}
```
Results are shown in `\fref{fig:rfid_gfid}`{=latex}. RAEv2 achieves a Pareto-optimal trade-off between generation (gFID) and reconstruction (rFID) without encoder finetuning or specialized data (e.g., text) [@raet2i]. All results are reported with the DINOv3-L encoder, DDT-XL, and 20 epochs. Please also see `\tref{tab:rec_comparison}`{=latex} for further comparisons.

```{=latex}
\begin{table*}[t]

\vskip -0.4in
% \scriptsize
\small
\setlength{\tabcolsep}{4pt}
% \renewcommand{\arraystretch}{1.15}
\begin{tabular}{l c cc cccccc c}
\toprule
Method & Epochs
  & \multicolumn{2}{c}{Training Efficiency}
  & \multicolumn{6}{c}{Representation Fr\'echet Distance (FD$_r$)~\cite{fdr} $\downarrow$}
  & FD$_r^6$$\downarrow$ \\
\cmidrule(lr){3-4} \cmidrule(lr){5-10}
 & & $\epfid\downarrow$ & gFID$\downarrow$ & Incep. & ConvNeXt & DINOv2 & MAE & SigLIP & CLIP & \\
\midrule
SiT-XL/2~\cite{sit}                   & 800 & $>$800        & 2.12          & 1.26          & 2.02          & 7.89          & 5.62          & 16.14         & 17.69         & 8.44 \\
DDT-XL~\cite{ddt}                     & 800 & --            & 1.26          & 0.75          & 1.02          & 4.26          & 4.11          & 10.16         & 13.86         & 5.70 \\
SiT-XL/2-REPA~\cite{repa}             & 800 & $>$800        & 1.42          & 0.85          & 1.22          & 4.27          & 3.85          &  9.87         & 12.65         & 5.45 \\
LightningDiT~\cite{lgt}               & 800 & $>$800        & 1.42          & 0.85          & 1.09          & 3.76          & 3.02          &  8.47         & 10.21         & 4.57 \\
REG~\cite{reg}                        & 800 & 560           & 1.54          & 0.92          & 1.14          & 3.45          & 3.02          &  8.42         & 10.86         & 4.64 \\
REPA-E~\cite{repae}                   & 800 & 480           & 1.12          & 0.70          & 1.28          & 2.44          & \textbf{2.52} &  5.04         &  6.28         & 3.04 \\
RAE-XL~\cite{rae}                     & 800 & 177           & 1.13          & 0.69          & 1.79          & 2.11          & 3.30          &  3.79         &  7.87         & 3.26 \\
\rowcolor{black!5} \textbf{RAEv2 ($K{=}7$, ours)} & \textbf{80}  & \textbf{35} & \textbf{1.06} & \textbf{0.64} & \textbf{0.77} & \textbf{1.15} & 2.67 & \textbf{2.54} & \textbf{5.21} & \textbf{2.17} \\
\bottomrule
\end{tabular}
\vskip -0.05in
\caption{\sethlcolor{Goldenrod!30}\hl{\textbf{Training efficiency and evaluation under alternative metrics.}}
\textbf{Left:} Training efficiency comparisons reporting $\epfid$ (epochs to reach unguided gFID $\le 2$) and the final guided gFID. Unlike incremental improvements in gFID, $\epfid$ provides a much better signal for training convergence across methods.
\textbf{Right:} Representation Fr\'echet Distance (FD$_r$)~\cite{fdr} computed in six feature spaces (Inception, ConvNeXt, DINOv2, MAE, SigLIP, CLIP), and FD$_r^6$.
Compared to prior works trained for 800 epochs, RAEv2 attains the best $\epfid$, gFID, and FD$_r^6$ in just 80 epochs, without any post-training.
}
\label{tab:fd_eval}
% \vskip -0.35in
\end{table*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/fig_recon_comparison_rendered-text-10-v3-main.pdf}
\vskip -0.1in
\includegraphics[width=\textwidth]{assets/fig_recon_comparison_handwritten-text-v2-cropped-v3-main-uncapped.pdf}
\vskip -0.05in
\caption{\textbf{Qualitative reconstruction comparisons} with additional data for decoder training \cite{raet2i} (pretrained vision encoder remains frozen). RAEv2 performs competitively with proprietary VAEs. Results use DINOv3-L as encoder for RAEv2. Please see \tref{tab:rec_comparison} for further quantitative results.}
\label{fig:recon_qualitative_additional_data}
\vskip -0.1in
\end{figure*}
```
Generalization to Other Tasks {#sec:generalization}
=============================

`\label{subsec:generalization}`{=latex}

We further validate the generalization of our improved baseline (RAEv2) on text-to-image generation and navigation world model [@bar2024nwm] tasks. Please refer to `\sref{sec:appendix_t2i}`{=latex} and `\ref{sec:appendix_nwm}`{=latex} for detailed task setup and additional results.

Text-to-Image Generation {#subsec:t2i_main}
------------------------

![`\sethlcolor{Plum!10}`{=latex}`\hl{\textbf{Text-to-image qualitative samples at 256$\times$256.}}`{=latex} Qualitative samples from RAEv2 (0.9B) trained for 100K iterations with batch size 1024 (equivalent to $\sim$80 epochs on ImageNet at the same batch size), evaluated on MJHQ test set prompts. Additional samples and full prompts are provided in `\fref{fig:qualitative_t2i}`{=latex}--`\fref{fig:qualitative_t2i_prompts3}`{=latex}.](assets/t2i_qualitative_p1.png){#fig:qualitative_t2i_main width="\\linewidth"}

`\Paragraph{Training setup.}`{=latex} We first adapt the DiT$^{DH}$-XL backbone for T2I generation. We follow the same in-context architecture as in the ImageNet experiments (`\sref{sec:experiments}`{=latex}), replacing the 8 in-context class-conditional embedding tokens with 256 text-embedding tokens for input captions encoded by Qwen3-0.6B [@qwen2]. We pretrain on JourneyDB [@journeydb] together with the long-caption and short-caption subsets of BLIP3o [@blip3o] for 150K iterations at batch size 1024, and then finetune on BLIP3o-60k for 50 epochs at the same batch size.
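
To relate iteration budgets like the ones quoted for T2I to ImageNet epochs, the small Python sketch below converts iterations at a given batch size into equivalent passes over the ImageNet-1K training split (1,281,167 images). The helper name `equivalent_epochs` is hypothetical, and the conversion is only a sample-count comparison, not a statement about the size of the T2I training data.

```python
IMAGENET_1K_TRAIN_IMAGES = 1_281_167  # size of the ImageNet-1K training split

def equivalent_epochs(iterations, batch_size, dataset_size=IMAGENET_1K_TRAIN_IMAGES):
    """Number of passes over a dataset implied by an iteration budget."""
    return iterations * batch_size / dataset_size

# 100K iterations at batch size 1024 is roughly 80 ImageNet epochs.
epochs_100k = equivalent_epochs(100_000, 1024)
# The 150K-iteration pretraining budget is roughly 120 ImageNet epochs.
epochs_150k = equivalent_epochs(150_000, 1024)
```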

```{=latex}
\Paragraph{Evaluation.}
```
Following [@raet2i], we report results on GenEval [@geneval], DPG-Bench [@dpgbench], and GenAI-Bench [@li2024genai]. All samples are generated with the ODE (Euler) sampler at 50 steps using the EMA model.

```{=latex}
\begin{wraptable}{r}{0.5\textwidth}

\vskip -0.15in
% \scriptsize
\small
\setlength{\tabcolsep}{1.5mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l cc cc}
\toprule
 & \multicolumn{2}{c}{Pretraining} & \multicolumn{2}{c}{Finetuning} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
Method & GenEval $\uparrow$ & DPG $\uparrow$ & GenEval $\uparrow$ & DPG $\uparrow$ \\
\midrule
Flux-VAE~\cite{flux} & 41.7          & 77.6          & 78.3          & 79.2          \\
RAE~\cite{rae}       & 58.4          & 80.1          & 81.5          & 80.6          \\
\rowcolor{black!5} RAEv2 (ours) & \textbf{62.4} & \textbf{81.7} & \textbf{82.7} & \textbf{82.3} \\
\bottomrule
\end{tabular}
\vskip -0.05in
\caption{Text-to-image generation.}
\label{tab:t2i_main}
\vskip -0.4in
\end{wraptable}
```
```{=latex}
\Paragraph{Results.}
```
RAEv2 leads to consistent improvements over Flux-VAE and the original RAE (`\tref{tab:t2i_main}`{=latex}). After pretraining, GenEval improves from 41.7 (Flux-VAE) and 58.4 (RAE) to 62.4 (RAEv2). Similarly, after finetuning, RAEv2 reaches 82.7 on GenEval compared to 78.3 and 81.5 for Flux-VAE and RAE, respectively. `\fref{fig:qualitative_t2i_main}`{=latex} shows qualitative samples. The results show strong prompt adherence across diverse subjects despite the short training schedule (comparable to 80 epochs on ImageNet). Please see `\sref{sec:appendix_t2i}`{=latex} for detailed results.

Navigation World Models {#subsec:nwm_main}
-----------------------

```{=latex}
\Paragraph{Training setup.}
```
We further validate our approach for action-conditioned future-frame prediction [@bar2024nwm], which stress-tests the latent space along two axes: 1) substantially longer conditioning context, and 2) autoregressive rollouts that compound error over time. The model conditions on $N{=}4$ past frames at $256\times 256$ resolution; each frame is encoded by the RAE encoder into a $16\times 16$ patch grid, giving $N \times 256 = 1024$ context tokens (compared to $8$ for class-conditional ImageNet and $256$ for T2I). We additionally append $4$ action tokens (encoding the egocentric action $(\Delta x, \Delta y, \Delta\psi)$) and a Fourier-embedded rollout time token. Following [@bar2024nwm], we train on RECON [@sridhar2023nomad] at 4 FPS, reusing the DiT$^{DH}$-XL backbone and flow-matching recipe from our ImageNet experiments.
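
The conditioning length follows directly from the counts above; the Python sketch below simply adds them up. The function and argument names are illustrative assumptions, and the resulting total agrees with the conditioning-token entry for world models in `\tref{tab:hyperparameters}`{=latex}.

```python
def nwm_context_tokens(num_frames=4, grid=16, action_tokens=4, time_tokens=1):
    """Total conditioning tokens for the navigation world model setup."""
    frame_tokens = num_frames * grid * grid  # N frames, each a 16x16 patch grid
    return frame_tokens + action_tokens + time_tokens

total_tokens = nwm_context_tokens()
print(total_tokens)  # 1029
```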

```{=latex}
\Paragraph{Evaluation.}
```
Following [@bar2024nwm], we evaluate predicted frames against ground truth at horizons of $\{1, 2, 4, 8, 16\}$ seconds. Given an FPS of $f$, we obtain the prediction at a target horizon $T$ via $T \cdot f$ autoregressive rollout steps: at each step the model predicts the next frame conditioned on the current sliding window of $N$ context frames and the next ground-truth action, and the predicted RGB is re-encoded and fed back as context. We report FVD [@fvd], FID [@fid] and LPIPS [@lpips] over the RECON validation split.
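
The rollout protocol described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: `predict_next` and `encode` are hypothetical callables standing in for the diffusion model and the RAE encoder, and the toy stand-ins at the bottom operate on integers rather than images.

```python
from collections import deque

def rollout(past_frames, actions, predict_next, encode, horizon_s, fps):
    """Autoregressive rollout to a target horizon of horizon_s seconds.

    past_frames:  the N most recent RGB frames (the initial context).
    actions:      ground-truth actions, one per rollout step.
    predict_next: (context_latents, action) -> predicted RGB frame.
    encode:       RGB frame -> latent used as context.
    """
    steps = int(round(horizon_s * fps))  # T * f rollout steps
    if len(actions) < steps:
        raise ValueError("need one ground-truth action per rollout step")
    # Sliding window of N context latents; appending drops the oldest entry.
    window = deque((encode(f) for f in past_frames), maxlen=len(past_frames))
    frame = None
    for t in range(steps):
        frame = predict_next(list(window), actions[t])
        window.append(encode(frame))  # re-encode the prediction as new context
    return frame

# Toy stand-ins: "frames" are integers and each action adds one.
final = rollout([0, 0, 0, 0], [1] * 64,
                predict_next=lambda ctx, a: ctx[-1] + a,
                encode=lambda f: f,
                horizon_s=16, fps=4)
```

Because every step consumes the model's own previous output, errors compound with the horizon, which is the stress test this evaluation is designed to expose.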

```{=latex}
\begin{wraptable}{r}{0.3\textwidth}

\vskip -0.1in
% \scriptsize
\small
\setlength{\tabcolsep}{2.5mm}
\renewcommand{\arraystretch}{1.05}
\begin{tabular}{l c}
% \toprule
Method & FVD~\cite{fvd} $\downarrow$ \\
\midrule
DIAMOND~\cite{alonso2024diffusion} & 762.73 \\
NWM~\cite{bar2024nwm}              & 200.97 \\
RAE~\cite{rae}                     & 312.01 \\
\rowcolor{black!5} RAEv2 (ours)    & \textbf{105.61} \\
% \bottomrule
\end{tabular}
\vskip -0.05in
\caption{Video prediction quality up to 16s on RECON \cite{sridhar2023nomad}.}
\label{tab:nwm_fvd}
\vskip -0.4in
\end{wraptable}
```
```{=latex}
\Paragraph{Video generation quality.}
```
On RECON [@sridhar2023nomad], RAEv2-NWM achieves an FVD of 105.61, substantially better than DIAMOND (762.73), NWM (200.97), and RAE (312.01) (`\tref{tab:nwm_fvd}`{=latex}). The same ordering holds at every horizon from 1 to 16 seconds on both FID and LPIPS (`\fref{fig:nwm_horizon}`{=latex}). Furthermore, we observe that qualitative rollouts also exhibit much less flickering between consecutive frames (`\fref{fig:nwm_qualitative}`{=latex}).

![`\sethlcolor{black!10}`{=latex}`\hl{\textbf{Future state prediction across rollout horizons.}}`{=latex} Comparing generation accuracy and quality of NWM [@bar2024nwm] and DIAMOND [@alonso2024diffusion] at 1 and 4 FPS as a function of time, up to 16 seconds of generated video on the RECON dataset.](assets/fig_horizon_compare_selected.png "fig:"){#fig:nwm_horizon width="\\linewidth"} `\hfill`{=latex}

![`\sethlcolor{green!10}`{=latex}`\hl{\textbf{Qualitative rollouts with and without the generalized representation autoencoder.}}`{=latex} Consecutive frames at $t$ and $t{+}0.25$s for ground truth, RAE, and RAEv2-NWM (ours, with the generalized RAE of `\sref{subsec:generalized_rae}`{=latex}). RAE leads to flickering between consecutive frame predictions (e.g., a different number of windows between consecutive frames). In contrast, RAEv2-NWM better retains low-level detail and preserves scene structure across time, which translates into substantially better FVD (`\tref{tab:nwm_fvd}`{=latex}).](assets/qualitative-nwm-vis.png){#fig:nwm_qualitative width="\\linewidth"}

```{=latex}
\Paragraph{Importance of generalized representation autoencoders.}
```
A large fraction of these gains comes from the generalized RAE formulation (`\sref{subsec:generalized_rae}`{=latex}), which aggregates the encoder's last $K$ layers rather than relying on the final layer alone. Earlier layers retain low-level texture and geometry that are critical for temporally consistent navigation rollouts. This leads to better future-state prediction and video quality across rollout horizons, which translates into substantially lower FVD (`\tref{tab:nwm_fvd}`{=latex}).
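
The aggregation over the last $K$ layers can be illustrated with a short NumPy sketch. It follows the description of the representation as a sum of the last $K$ encoder layers; the function name `generalized_representation`, the plain unweighted sum, and the absence of any per-layer normalization are assumptions of this sketch rather than details confirmed here.

```python
import numpy as np

def generalized_representation(hidden_states, k):
    """Sum the outputs of the last k encoder layers.

    hidden_states: list of arrays, one per layer (first to last),
                   each of shape (tokens, channels).
    k = 1 recovers the final-layer representation of the original RAE.
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    return np.sum(np.stack(hidden_states[-k:], axis=0), axis=0)

# Toy encoder with three layers whose outputs are constant 1, 2 and 3.
layers = [np.full((4, 2), float(i)) for i in (1, 2, 3)]
last_only = generalized_representation(layers, k=1)
last_two = generalized_representation(layers, k=2)
```

Since the sum keeps the token count and channel width unchanged, the latent shape seen by the diffusion model is the same for every $K$.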

```{=latex}
\begin{wrapfigure}{r}{0.5\textwidth}

\vskip -0.15in
\includegraphics[width=\linewidth]{assets/fig_training_convergence_nwm.pdf}
\vskip -0.05in
\caption{\sethlcolor{blue!10}\hl{\textbf{Convergence speed on RECON validation set.}} FID and LPIPS over training steps for RAE and RAEv2-NWM (ours), evaluated under the single-shot prediction protocol with random target offset $\in [1, 8]$ frames at 4 FPS.}
\label{fig:nwm_convergence}
\vskip -0.2in
\end{wrapfigure}
```
```{=latex}
\Paragraph{Impact on convergence speed.}
```
`\fref{fig:nwm_convergence}`{=latex} shows training curves under the single-shot protocol (no autoregressive rollout) with random target offset $\in [1, 8]$ frames at 4 FPS, i.e., predictions $0.25$--$2$ seconds into the future. RAEv2-NWM converges much faster than RAE: it matches RAE's final FID within the first 10K iterations and converges within $\sim$30K iterations to substantially lower FID (7.5 vs. 18.0) and LPIPS (0.24 vs. 0.29). This mirrors the speedup observed on ImageNet (`\sref{subsec:convergence}`{=latex}), indicating that the improved recipe transfers to navigation world models without modification.

Related Work {#sec:related}
============

We discuss the most relevant work here and provide more detailed discussion in the appendix.

```{=latex}
\Paragraph{Pretrained encoders as latent spaces.}
```
A growing line of work replaces VAE latents with pretrained vision encoders for diffusion [@rae; @raet2i; @svg; @chen2025aligning; @fae; @maetok; @reals; @flatdino]. We show that the original RAE recipe can be significantly improved with a few simple insights, leading to more than 10$\times$ faster convergence.

`\Paragraph{Representation alignment}`{=latex} distills pretrained representations to intermediate diffusion layers [@repa; @irepa; @repae; @reg; @ddt]. We study the prevalent assumption [@rae; @riprepa; @chang2026dino] that RAE (using pretrained representation as encoder) eliminates the need for REPA. We find that RAE and REPA work through complementary mechanisms. Their combination is not only useful but also simplifies guidance with RAE.

`\Paragraph{Reconstruction quality of vision encoders.}`{=latex} [@raet2i] tries to improve RAE reconstruction using specialized data (text, faces), while [@lvrae; @psvae; @uae; @dinotok; @chang2026dino; @aligntok; @rpiae; @vfmvae] finetune the pretrained encoder itself for reconstruction. We find that frozen vision encoders already contain the low-level details needed for reconstruction, achieving Pareto-optimal reconstruction-generation performance without any additional training.

Conclusion {#sec:conclusion}
==========

We study an improved baseline that simplifies and improves RAE. We find that frozen vision encoders themselves contain low-level details for reconstruction: simply aggregating the last $K$ layers leads to Pareto-optimal reconstruction-generation performance. We next perform a large-scale empirical analysis showing that RAE and REPA exhibit complementary working mechanisms. Their combination is not only useful but also simplifies guidance with RAE. Furthermore, it enables stronger representations (e.g., DINOv3-L), which excel in both spatial and global performance, to also give better generation performance. Overall, RAEv2 achieves 10$\times$ faster convergence over RAE, improves reconstruction, and achieves state-of-the-art gFID and FD$_r^6$ in just 80 epochs without any post-training. We hope our work provides useful insights for practical adoption of representation autoencoders.

```{=latex}
\clearpage
```
```{=latex}
\small
```
```{=latex}
\bibliographystyle{plainnat}
```
```{=latex}
\appendix
```
```{=latex}
\newpage
```
Implementation Details {#sec:appendix_implementation}
======================

We provide detailed implementation configurations for reproducibility. `\tref{tab:hyperparameters}`{=latex} and `\tref{tab:hyperparameters2}`{=latex} summarize the hyperparameters for the class-conditional ImageNet, text-to-image, and navigation world model experiments.

```{=latex}
\begin{table*}[t]

\scriptsize
\renewcommand{\arraystretch}{1.15}
\setlength{\tabcolsep}{6pt}
\begin{tabular}{l ccc}
\toprule
Configuration & ImageNet 256$\times$256 & Text-to-Image & World Models \\
\midrule
\rowcolor{black!8} Architecture \\
Backbone & DiT$^{DH}$-XL & DiT$^{DH}$-XL & DiT$^{DH}$-XL \\
Encoder blocks / Hidden dim / Heads & 28 / 1152 / 16 & 28 / 1152 / 16 & 28 / 1152 / 16 \\
Decoder blocks / Hidden dim / Heads & 2 / 2048 / 16 & 2 / 2048 / 16 & 2 / 2048 / 16 \\
MLP ratio & 4.0 & 4.0 & 4.0 \\
Patch size (latent) & 1 & 1 & 1 \\
Input channels & 768 & 768 & 768 \\
Conditioning & In-context & In-context & In-context \\
Conditioning tokens & 4 + 8 & 4 + 256 & 1029 \\
Positional embedding & APE + RoPE & APE + RoPE & APE + RoPE \\
Normalization & RMSNorm & RMSNorm & RMSNorm \\
FFN activation & SwiGLU & SwiGLU & SwiGLU \\
\midrule
\rowcolor{black!8} RAE Encoder \\
Vision encoder & DINOv3-L & SigLIP2-B & DINOv3-L \\
Encoder input resolution & 256 & 224 & 256 \\
Encoder patch size & 16 & 16 & 16 \\
Latent shape & $1024 \times 16 \times 16$ & $768 \times 16 \times 16$ & $1024 \times 16 \times 16$ \\
Encoder normalization & Layer norm & Layer norm & Layer norm \\
\midrule
\rowcolor{black!8} REPA \\
Target encoder & Same as RAE encoder & Same as RAE encoder & Same as RAE encoder \\
Alignment layer depth & 8 & 8 & 8 \\
Projection type & Linear & Linear & Linear \\
REPA coefficient ($\lambda$) & 0.5 & 0.5 & 0.5 \\
\bottomrule
\end{tabular}
\vskip 0.1in
\caption{\textbf{Architecture and model configurations.} Model architecture, RAE encoder, and REPA settings for class-conditional ImageNet 256$\times$256, text-to-image, and navigation world models. All settings share the same backbone and differ primarily in the conditioning. Continued in \tref{tab:hyperparameters2}.}
\label{tab:hyperparameters}
\end{table*}
```
```{=latex}
\begin{table*}[t]

\scriptsize
\renewcommand{\arraystretch}{1.15}
\setlength{\tabcolsep}{6pt}
\begin{tabular}{l ccc}
\toprule
Configuration & ImageNet 256$\times$256 & Text-to-Image & World Models \\
\midrule
\rowcolor{black!8} Training \\
Dataset & ImageNet-1K & JourneyDB + BLIP3o & RECON \\
Base learning rate & $2 \times 10^{-4}$ & $2 \times 10^{-4}$ & $2 \times 10^{-4}$ \\
Final learning rate & $2 \times 10^{-5}$ & $2 \times 10^{-5}$ & $2 \times 10^{-5}$ \\
LR schedule & Linear decay & Linear decay & Linear decay \\
Warmup epochs / iterations & 25 epochs & 50K iter & 25K iter \\
Warmup decay end (LR reaches final) & 50 epochs & 150K iter & 60K iter \\
Weight decay & 0.0 & 0.0 & 0.0 \\
Global batch size & 1024 & 1024 & 256 \\
Mixed precision & bfloat16 & bfloat16 & bfloat16 \\
Gradient clipping (max norm) & 1.0 & 1.0 & 1.0 \\
EMA decay & 0.9995 & 0.9995 & 0.9995 \\
Training epochs / iterations & 80 & 150K iter (pretrain) + 50 ep (finetune) & 100K iter \\
CFG dropout probability & 0.1 & 0.1 & -- \\
\midrule
\rowcolor{black!8} Flow Matching \\
Base prediction type & $x$-prediction & $x$-prediction & $x$-prediction \\
REPA head prediction type & $x$-prediction & $x$-prediction & $x$-prediction \\
Time distribution & Logit-normal & Logit-normal & Logit-normal \\
\midrule
\rowcolor{black!8} Sampling \\
Sampler & ODE (Euler) & ODE (Euler) & ODE (Euler) \\
Number of steps & 50 & 50 & 50 \\
Guidance interval & $[0.0, 1.0]$ & $[0.0, 1.0]$ & -- \\
\midrule
\rowcolor{black!8} Text Conditioning (T2I only) \\
Text encoder & -- & Qwen3-0.6B & -- \\
Max sequence length & -- & 256 & -- \\
Finetuning dataset & -- & BLIP3o-60k & -- \\
\bottomrule
\end{tabular}
\vskip 0.1in
\caption{\textbf{Training and sampling configurations (continued).} Training hyperparameters, flow matching, sampling, and text conditioning settings. Continuation of \tref{tab:hyperparameters}.}
\label{tab:hyperparameters2}
\end{table*}
```
```{=latex}
\Paragraph{Architecture.}
```
We use the DDT [@ddt] backbone (DiT$^{DH}$-XL), which consists of a 28-block transformer encoder with hidden dimension 1152 and 16 attention heads, followed by a 2-block DDT decoder with hidden dimension 2048. All layers use RMSNorm, SwiGLU activation in the feed-forward network (MLP ratio 4.0), and rotary positional embeddings (RoPE) combined with absolute positional embeddings (APE). The latent patch size is 1, producing a sequence of $16 \times 16 = 256$ tokens from the encoder output.

```{=latex}
\Paragraph{RAE encoder.}
```
For ImageNet and navigation world model experiments, we use DINOv3-L [@simeoni2025dinov3] as the default encoder. The encoder processes $256 \times 256$ images with patch size 16, producing $16 \times 16 = 256$ patch tokens of dimension 1024, giving a latent representation of shape $1024 \times 16 \times 16$. For text-to-image experiments, we use SigLIP2-B [@siglip] following [@raet2i], with the same $16 \times 16$ patch grid and a feature dimension of 768. We discard \[CLS\] and register tokens and apply layer normalization to the patch outputs. The RAE decoder is pretrained separately for 16 epochs following [@rae] and kept frozen during diffusion training.
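
The token handling described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `to_latent` is hypothetical, the prefix count of 5 (one \[CLS\] token plus four register tokens) is used only as an example, and the layer normalization is written without learned scale or shift.

```python
import numpy as np

def to_latent(tokens, num_prefix_tokens, grid=16, eps=1e-6):
    """Turn encoder output tokens into a (channels, grid, grid) latent.

    tokens: array of shape (num_prefix_tokens + grid * grid, channels),
            with [CLS] and register tokens first.
    """
    patches = tokens[num_prefix_tokens:]  # drop [CLS] and register tokens
    if patches.shape[0] != grid * grid:
        raise ValueError("unexpected number of patch tokens")
    mean = patches.mean(axis=-1, keepdims=True)
    var = patches.var(axis=-1, keepdims=True)
    normed = (patches - mean) / np.sqrt(var + eps)  # layer norm, no affine
    # (tokens, channels) -> (channels, grid, grid), row-major patch order
    return normed.T.reshape(-1, grid, grid)

# Random stand-in for a DINOv3-L output on a 256x256 image.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5 + 256, 1024))
latent = to_latent(tokens, num_prefix_tokens=5)
```

With a feature dimension of 768 in place of 1024, the same function would give the $768 \times 16 \times 16$ latent used for text-to-image.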

```{=latex}
\Paragraph{Vision encoders.}
```
We evaluate pretrained vision encoders across 8 families following [@irepa]: DINOv2 [@dinov2], DINOv3 [@simeoni2025dinov3], WebSSL [@fan2025scaling], Perception Encoders [@bolya2025PerceptionEncoder], MoCov3 [@mocov3], CLIP [@clip], I-JEPA [@ijepa], and MAE [@mae]. Each encoder is wrapped in a unified interface that extracts patch tokens, discards any \[CLS\] or register tokens, and applies layer normalization. The full encoder-sweep results, comparing RAE and RAEv2 on every variant, are reported in `\tref{tab:encoder_appendix}`{=latex}.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{3mm}
```
::: {#tab:encoder_appendix}
  Encoder                                                         LP $\uparrow$                      LDS $\uparrow$                    Avg(LP', LDS') $\uparrow$     RAE      RAEv2 ($k{=}1$)
  ------------------------------------------------------------ -------------------- ------------------------------------------------- --------------------------- ---------- ----------------- --
  MoCov3-B [@mocov3]                                                   76.4                               0.15                                   0.46               13.84          8.35
  CLIP-L [@clip]                                                       84.5                               0.14                                   0.49                7.85          4.38
  PE-L [@bolya2025PerceptionEncoder]                                   85.5                               0.14                                   0.50                7.06          4.09
  PE-B [@bolya2025PerceptionEncoder]                                   80.7                               0.20                                   0.50                6.22          3.88
  WebSSL-1B [@fan2025scaling]                                          84.1                               0.18                                   0.51                8.60          4.16
  LangPE-L [@bolya2025PerceptionEncoder]                               83.0                               0.20                                   0.52                5.04           --
  SpatialPE-B [@bolya2025PerceptionEncoder]                            70.8                               0.33                                   0.52               11.24          5.04
  JEPA-H [@ijepa]                                                      77.5                               0.33                                   0.55               12.46          4.48
  SpatialPE-L [@bolya2025PerceptionEncoder]                            78.4                               0.34                                   0.56                8.77          3.97
  DINOv3-B [@simeoni2025dinov3]                                        84.5                               0.38                                   0.61                4.25          2.76
  DINOv2-B [@dinov2]                                                   83.9                               0.41                                   0.62                3.75          2.81
  `\rowcolor{black!5}`{=latex} DINOv3-L [@simeoni2025dinov3]         **87.0**                           **0.42**                               **0.65**            **3.30**      **2.61**
  ------------------------------------------------------------ -------------------- ------------------------------------------------- --------------------------- ---------- ----------------- --

  : **Full ablation on choice of pretrained vision encoder.** Extended version of `\tref{tab:encoder}`{=latex}. gFID at 20 epochs (DiT$^{\text{DH}}$-XL), sorted by the composite score Avg(LP', LDS'). LP denotes ImageNet linear-probing accuracy (a measure of global semantic quality), and LDS denotes the local-distance similarity score from iREPA [@irepa] (a measure of spatial structure); LP' and LDS' are the min-max normalized values used to form the composite score. RAEv2 consistently improves over RAE across all encoder families. Stronger encoders (e.g., DINOv3-L) which excel at both global and spatial performance achieve the best generation quality. All results are reported without guidance and at batch size 1024.
:::
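
The composite score in the table can be illustrated with a small sketch. The normalization bounds are not stated in this section, so the code assumes LP is rescaled over its theoretical range $[0, 100]$ and LDS over $[0, 1]$. Under that assumption the rows checked below match the reported Avg column to within rounding, but the authors' exact procedure may differ. The function name `composite_score` is hypothetical.

```python
def composite_score(lp_percent: float, lds: float) -> float:
    """Average of LP and LDS after rescaling both to [0, 1].

    Assumption: LP is normalized over [0, 100] and LDS over [0, 1];
    the bounds are not given in the text.
    """
    lp_norm = lp_percent / 100.0
    lds_norm = lds
    return 0.5 * (lp_norm + lds_norm)


# (LP, LDS, reported Avg) copied from three rows of the table.
rows = {
    "MoCov3-B": (76.4, 0.15, 0.46),
    "JEPA-H": (77.5, 0.33, 0.55),
    "DINOv3-L": (87.0, 0.42, 0.65),
}
scores = {name: composite_score(lp, lds) for name, (lp, lds, _) in rows.items()}
```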

```{=latex}
\vskip 0.1in
```
```{=latex}
\Paragraph{REPA configuration.}
```
We apply representation alignment at encoder block depth 8 with a single linear projection layer mapping from the transformer hidden dimension (1152) to the target encoder dimension (1024 for DINOv3-L, 768 for SiGLIP2-B). The REPA loss coefficient is set to $\lambda = 0.5$ following [@repa; @irepa]. The target encoder is the same as the RAE encoder (self-REPA), as we show in `\sref{subsec:rae_repa_orthogonal}`{=latex} that this consistently improves generation across various pretrained encoders.
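
A minimal sketch of the alignment term follows. This section does not state the loss, so the code assumes the patch-wise negative cosine similarity of the original REPA formulation. The toy dimensions and all function names are illustrative; only the coefficient $\lambda = 0.5$ comes from the text.

```python
import math


def linear(x, weight, bias):
    """Single linear layer; weight has shape (out_dim, in_dim)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def cosine_similarity(a, b, eps=1e-8):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b + eps)


def repa_loss(hidden_tokens, target_tokens, weight, bias):
    """Mean negative cosine similarity between projected and target tokens."""
    sims = [
        cosine_similarity(linear(h, weight, bias), z)
        for h, z in zip(hidden_tokens, target_tokens)
    ]
    return -sum(sims) / len(sims)


REPA_COEFF = 0.5  # lambda from the text

# Toy sizes: hidden dim 3 -> target dim 2, two patch tokens.
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
hidden = [[1.0, 2.0, 5.0], [0.0, 3.0, -1.0]]
target = [[2.0, 4.0], [0.0, 1.0]]
total_aux = REPA_COEFF * repa_loss(hidden, target, weight, bias)
```

In the toy example the projected tokens are parallel to their targets, so the weighted term is close to $-0.5$, its minimum.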

```{=latex}
\Paragraph{Conditioning.}
```
We replace adaLN-Zero [@dit] with in-context conditioning. The timestep is embedded via Gaussian Fourier features into 4 tokens, and the class label (or text) is embedded into 8 tokens (or up to 256 tokens for T2I). These are concatenated with the image token sequence and processed jointly through self-attention. The DDT decoder strips the conditioning tokens before producing the final output corresponding to the 256 image latent tokens.
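
The token bookkeeping for in-context conditioning can be sketched as follows. The ordering (conditioning tokens first, image tokens last) is an assumption made for illustration, since the text only says the tokens are concatenated. Tokens are represented by placeholder tuples rather than embeddings.

```python
NUM_TIME_TOKENS = 4
NUM_CLASS_TOKENS = 8
NUM_IMAGE_TOKENS = 16 * 16


def build_sequence(time_tokens, cond_tokens, image_tokens):
    """Concatenate conditioning and image tokens into one sequence."""
    return list(time_tokens) + list(cond_tokens) + list(image_tokens)


def strip_conditioning(sequence, num_image_tokens):
    """Keep only the trailing image tokens before producing the output."""
    return sequence[-num_image_tokens:]


time_tokens = [("t", i) for i in range(NUM_TIME_TOKENS)]
class_tokens = [("c", i) for i in range(NUM_CLASS_TOKENS)]
image_tokens = [("x", i) for i in range(NUM_IMAGE_TOKENS)]

sequence = build_sequence(time_tokens, class_tokens, image_tokens)
output_tokens = strip_conditioning(sequence, NUM_IMAGE_TOKENS)
```

For class-conditional ImageNet this gives a joint sequence of $4 + 8 + 256 = 268$ tokens, of which only the 256 image tokens are decoded.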

```{=latex}
\Paragraph{Training.}
```
We use a base learning rate of $2 \times 10^{-4}$, linearly decayed to $2 \times 10^{-5}$ by epoch 50, with 25 epochs of warmup, and no weight decay. Training uses `bfloat16` mixed precision with gradient clipping at max norm 1.0. We apply EMA with decay 0.9995 and report all results using the EMA model. All models are trained with a global batch size of 1024.
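
The two per-step operations named above, gradient clipping and the EMA update, can be sketched with scalar parameters. This is a simplified illustration rather than the training code: real parameters are tensors, and both helper names are hypothetical.

```python
import math

EMA_DECAY = 0.9995
MAX_GRAD_NORM = 1.0


def clip_by_global_norm(grads, max_norm=MAX_GRAD_NORM):
    """Rescale gradients so that their global L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads.values()))
    scale = min(1.0, max_norm / (total + 1e-12))
    return {k: g * scale for k, g in grads.items()}


def ema_update(ema_params, params, decay=EMA_DECAY):
    """Return the EMA weights after one optimizer step."""
    return {k: decay * ema_params[k] + (1.0 - decay) * params[k] for k in params}


clipped = clip_by_global_norm({"w": 3.0, "b": 4.0})
ema = ema_update({"w": 1.0, "b": 0.0}, {"w": 0.0, "b": 2.0})
```

With decay 0.9995 each step moves the EMA weights only 0.05% of the way toward the current weights, which is why evaluation uses the smoother EMA model.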

```{=latex}
\Paragraph{Flow matching.}
```
We use continuous-time flow matching with velocity prediction and logit-normal time sampling following [@sit]. For self-guidance (`\sref{subsec:rae_x_prediction}`{=latex}), we convert the model output to $x$-prediction at inference time and apply guidance via the REPA head prediction as defined in Eq. `\ref{eq:self_guidance}`{=latex}.
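
The conversion from velocity prediction to $x$-prediction can be made concrete under one common convention. The sketch assumes the linear interpolant $x_t = (1 - t)\,x + t\,\epsilon$ with velocity $v = \epsilon - x$, and a logit-normal sampler with zero mean and unit variance. Neither the convention nor the sampler parameters are confirmed by this section, and the self-guidance rule of Eq. `\ref{eq:self_guidance}`{=latex} is not reproduced here.

```python
import math
import random


def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a Gaussian sample through a sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def interpolate(x, eps, t):
    """Noisy latent x_t = (1 - t) * x + t * eps (assumed convention)."""
    return [(1.0 - t) * xi + t * ei for xi, ei in zip(x, eps)]


def velocity_target(x, eps):
    """Velocity v = d x_t / d t = eps - x under the convention above."""
    return [ei - xi for xi, ei in zip(x, eps)]


def v_to_x(x_t, v, t):
    """Recover the clean-latent estimate from a velocity prediction."""
    return [xti - t * vi for xti, vi in zip(x_t, v)]


rng = random.Random(0)
t = sample_logit_normal(rng)
x = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x]
x_hat = v_to_x(interpolate(x, eps, t), velocity_target(x, eps), t)
```

With the exact velocity, `x_hat` equals `x` up to floating-point error, since $x_t - t\,v = x$ under this interpolant.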

```{=latex}
\Paragraph{Sampling and evaluation.}
```
We use the ODE solver with Euler discretization for all experiments. We follow the online evaluation protocol from [@jit] and report gFID [@fid] and Inception Score (IS) [@is] on 50K generated images. Following recent work, we additionally report FD$_r$ [@fdr] computed across six representation feature spaces (Inception, ConvNeXt, DINOv2, MAE, SigLIP, CLIP), and the geometric mean FD$_r^6$. As a measure of training efficiency, we report $\epfidk$, the number of training epochs to reach unguided gFID $\le k$; we report $k{=}2$. We generate 50 images per class (balanced sampling) following [@rae].
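
Both summary statistics are simple to compute once the per-epoch and per-feature-space numbers are available. The sketch below uses made-up values purely for illustration, and the function names are hypothetical.

```python
import math


def epochs_to_fid(curve, k):
    """Return the first epoch whose unguided gFID is <= k, or None."""
    for epoch, gfid in sorted(curve):
        if gfid <= k:
            return epoch
    return None


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical training curve: (epoch, unguided gFID) pairs.
toy_curve = [(10, 6.1), (20, 3.2), (30, 2.3), (40, 1.9), (50, 1.8)]
ep_fid_2 = epochs_to_fid(toy_curve, k=2.0)

# Hypothetical FD_r values in six feature spaces.
toy_fdr = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
fdr_6 = geometric_mean(toy_fdr)
```

The resolution of the epoch count is limited by how often gFID is evaluated during training.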

```{=latex}
\Paragraph{Text-to-image.}
```
We adapt the DiT$^{DH}$-XL backbone for T2I generation. We follow the same in-context architecture as in the ImageNet experiments (`\sref{sec:experiments}`{=latex}), replacing the 8 in-context class-conditional embedding tokens with 256 text-embedding tokens for input captions encoded by Qwen3-0.6B [@qwen2]. We pretrain on JourneyDB [@journeydb] together with the long-caption and short-caption subsets of BLIP3o [@blip3o] for 150K iterations at batch size 1024, and then finetune on BLIP3o-60k for 50 epochs at the same batch size. We evaluate on GenEval [@geneval], DPG-Bench [@dpgbench], and GenAI-Bench [@li2024genai].

```{=latex}
\Paragraph{Navigation world models.}
```
We use the same DiT$^{DH}$-XL backbone as in the ImageNet and T2I settings, only altering the conditioning tokens to handle navigation inputs. The model conditions on $N=4$ past frames at $256\times 256$ resolution; each frame is encoded by the RAE encoder into a $16\times 16$ patch grid, giving $N \times 256 = 1024$ context tokens. We additionally append $4$ action tokens (encoding the egocentric action $(\Delta x, \Delta y, \Delta\psi)$) and a single Fourier-embedded time token for the rollout offset, for a total of $1029$ conditioning tokens (compared to $8$ for class-conditional ImageNet and $256$ for T2I). Following [@shah2022gnm; @shah2023vint; @sridhar2023nomad], we use the RECON [@bar2024nwm; @sridhar2023nomad] dataset with the same flow-matching objective, learning-rate schedule, and EMA recipe as our ImageNet experiments. We train for 100K iterations at batch size 256. For evaluation, following [@bar2024nwm], we compare predicted frames against ground truth at horizons of $\{1, 2, 4, 8, 16\}$ seconds. Given a frame rate of $f$ FPS, we obtain the prediction at a target horizon of $T$ seconds via $T \cdot f$ autoregressive rollout steps: at each step the model predicts the next frame conditioned on the current sliding window of $N$ context frames and the next ground-truth action, and the predicted RGB frame is re-encoded and fed back as context. Following [@bar2024nwm], we report FID [@fid] and LPIPS [@lpips] at each horizon, computed over rollout episodes sampled from the held-out RECON [@sridhar2023nomad] validation split. We also report FVD as a measure of video generation quality for autoregressive rollouts up to 16 s.
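
The conditioning-token count and the sliding-window rollout described above can be sketched as below. The `predict_next` callable stands in for the full pipeline (encode context, sample with the diffusion model, decode, re-encode). It and the toy integer "frames" are placeholders, not part of the actual system.

```python
from collections import deque

NUM_CONTEXT_FRAMES = 4
TOKENS_PER_FRAME = 16 * 16
NUM_ACTION_TOKENS = 4
NUM_TIME_TOKENS = 1


def num_conditioning_tokens():
    """Context-frame tokens plus action tokens plus the time token."""
    return NUM_CONTEXT_FRAMES * TOKENS_PER_FRAME + NUM_ACTION_TOKENS + NUM_TIME_TOKENS


def rollout(predict_next, context_frames, actions, horizon_s, fps):
    """Run horizon_s * fps autoregressive steps and return the last frame."""
    steps = int(round(horizon_s * fps))
    window = deque(context_frames, maxlen=NUM_CONTEXT_FRAMES)
    frame = window[-1]
    for step in range(steps):
        frame = predict_next(list(window), actions[step])
        window.append(frame)  # the oldest frame drops out automatically
    return frame


# Stand-in model: the next "frame" is the latest one plus the action.
def toy_predict(frames, action):
    return frames[-1] + action


final = rollout(toy_predict, [0, 1, 2, 3], actions=[1] * 8, horizon_s=2, fps=4)
```

Because each prediction is fed back as context, errors compound with the horizon, which is why metrics are reported separately per horizon.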

![**Different layers of a pretrained encoder provide complementary features.** Aggregating across layers yields richer representations than using the final layer alone.](assets/fig_per_layer_viz_dinov2b-v2.png){#fig:per_layer_props width="\\linewidth"}

```{=latex}
\Paragraph{Compute.}
```
We report results using a $4\times 8$ H100 setup, which trains RAEv2 to gFID 1.06 in roughly $12$ hours, compared to over a week for the original RAE (800 epochs) under the same setup.

Extended Related Work {#sec:appendix_related}
=====================

We provide a more comprehensive discussion of related work extending `\sref{sec:related}`{=latex}.

```{=latex}
\Paragraph{Representation alignment for generation.}
```
REPA [@repa] aligns intermediate DiT features with pretrained encoders (`\eg `{=latex}DINOv2) via a projection head, accelerating convergence. iREPA [@irepa] showed that spatial structure in the alignment target matters more than global information. REPA-E [@repae] extends this to end-to-end VAE tuning. Orthogonally, RAE [@rae] replaces the VAE latent space entirely with pretrained encoder features. LDiT [@ldit] studies the tension between reconstruction and generation in the latent space. A common assumption is that RAE subsumes REPA since both use the same encoder. We find that RAE and REPA exhibit complementary working mechanisms. Their combination is not only useful but also simplifies guidance with RAE.

`\Paragraph{Guidance Mechanisms.}`{=latex} Classifier-free guidance (CFG) [@cfg] has become the standard technique for improving sample quality at the cost of diversity, by extrapolating from the unconditional prediction toward the conditional one. CFG Interval [@cfg_interval] showed that applying guidance only at specific noise levels improves both sample and distribution quality. Autoguidance [@autoguidance] replaced the unconditional model with a weaker conditional model, demonstrating that guidance fundamentally works by contrasting a stronger model against a weaker one rather than requiring unconditional training.

Self-Representation Alignment (SRA) [@sra] showed that diffusion transformers can provide representation guidance by themselves, using intermediate features to steer generation without external models. The dispersive loss [@dispersiveloss] regularizes representations during diffusion training itself to improve generation quality. Internal dynamics guidance [@internalguidance] further explores how a model's own internal representations can substitute for external guidance signals. Recent work on guidance-free generation [@guidancefree] aims to eliminate the need for guidance entirely by incorporating its benefits into training. Our improved self-guidance approach relates to SRA and autoguidance: we show that the REPA prediction head, when combined with $x$-prediction, can serve as an internal guidance signal, avoiding both the unconditional forward pass of CFG and the separate weaker model of autoguidance.
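
The shared structure of these guidance schemes, contrasting a stronger prediction against a weaker one, can be written as a single extrapolation. The sketch uses one common parameterization of the guidance scale, with made-up prediction values. It does not reproduce the self-guidance rule of this paper, which is defined in the main text.

```python
def guided_prediction(strong, weak, scale):
    """Extrapolate from a weak prediction toward (and past) a strong one.

    scale = 1 returns the strong prediction unchanged; scale > 1 amplifies
    the difference between the two predictions.
    """
    return [w + scale * (s - w) for s, w in zip(strong, weak)]


conditional = [1.0, 2.0]
unconditional = [0.5, 1.0]   # CFG: same model with the condition dropped
weaker_model = [0.8, 1.5]    # autoguidance: a weaker conditional model

cfg_out = guided_prediction(conditional, unconditional, scale=2.0)
ag_out = guided_prediction(conditional, weaker_model, scale=2.0)
```

CFG needs a second forward pass for the unconditional branch, whereas autoguidance needs a second model. Both costs motivate reusing an internal prediction head as the weak signal.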

```{=latex}
\Paragraph{Representation Learning and Generation.}
```
Würstchen [@wurstchen] demonstrated that operating in a highly compressed semantic latent space (rather than pixel-level VAE latents) enables efficient large-scale text-to-image synthesis. This insight is closely related to the RAE idea of using pretrained encoder features as the diffusion latent space. Several recent works explore how best to construct latent spaces that serve both reconstruction and semantic tasks. FAE [@fae] proposes single-layer adaptation of pretrained features for latent diffusion, showing that minimal fine-tuning of a frozen encoder can yield effective generation latents. MAETok [@maetok] uses a masked autoencoder tokenizer to bridge self-supervised features and discrete token-based generation. FlatDINO [@flatdino] compresses DINOv2 patch features into flatter distributions better suited for diffusion training. ReaLS [@reals] injects semantic priors from pretrained models into the VAE latent space, while SVG [@svg] directly uses frozen DINO features as the generation target.

Unified Latents [@unifiedlatents] jointly trains the encoder, diffusion prior, and decoder with MSE regularization, showing that end-to-end optimization of the full latent pipeline can improve over separately trained components. PS-VAE [@psvae] addresses the tension between semantic richness and pixel-level reconstruction by training representation encoders that excel at both, making them ready for text-to-image generation and editing. These works share a common theme with our approach: the latent space is not merely a compression bottleneck but an active design choice that shapes generation quality, training efficiency, and downstream flexibility. Several works have extended representation alignment beyond static image generation. VideoREPA [@zhang2025videorepa] applies relational alignment with foundation models to video generation, while Geometry Forcing [@wu2025geometry] marries video diffusion with 3D representations. JanusFlow [@ma2025janusflow] unifies multimodal understanding and generation through shared representations and rectified flow.

In this work, we show that pretrained vision encoders themselves have rich representations across different layers. Simply aggregating these features (e.g., through simple addition) enables better generation and reconstruction performance without affecting the understanding performance (measured through linear probing) of the vision encoder.

Additional Results {#sec:appendix_additional_results}
==================

Comparisons with original RAE {#sec:appendix_comparisons}
-----------------------------

```{=latex}
\Paragraph{Additional results on generation-reconstruction performance.}
```
`\tref{tab:rec_comparison}`{=latex} compares RAEv2 against recent representation-based autoencoders (`\sref{sec:related}`{=latex}) that target improved reconstruction. All prior works rely on auxiliary losses, encoder fine-tuning, or architectural modifications to the pretrained encoder; in contrast, our generalized RAE formulation (MLS) requires no encoder training, yet achieves both the best generation quality (gFID and IS, at $K{=}7$) and the best reconstruction rFID (at $K{=}23$).

```{=latex}
\begin{table*}[t]

\small
% \footnotesize
% \setlength{\tabcolsep}{3mm}
\renewcommand{\arraystretch}{1.15}
\begin{tabular}{l c c c c c c c}
\toprule
Encoder & \shortstack{Training-free \\ Encoder} & \shortstack{Recon-Gen \\ Tradeoff Control} & Epochs & \multicolumn{2}{c}{Stage 1} & \multicolumn{2}{c}{Stage 2} \\
\cmidrule(lr){5-6} \cmidrule(lr){7-8}
 & & & & rFID$\downarrow$ & PSNR$\uparrow$ & gFID$\downarrow$ & IS$\uparrow$ \\
\midrule
DINO-Tok~\cite{dinotok}        & \xmark & \xmark & 80 & 0.32  & \textbf{28.54} & 5.94          & 152.6          \\
DINO-SAE~\cite{chang2026dino}  & \xmark & \xmark & 80 & 0.37  & 26.20          & 3.07          & 209.7          \\
VFM-VAE~\cite{vfmvae}          & \xmark & \xmark & 80 & 0.52  & --             & 3.41          & 160.4          \\
AlignTok~\cite{aligntok}       & \xmark & \xmark & 80 & 0.26  & 25.83          & 3.71          & 148.9          \\
RPiAE~\cite{rpiae}             & \xmark & \xmark & 80 & 0.50  & 21.30          & 2.25          & 208.7          \\

RAE~\cite{rae}                 & \cmark & \xmark & 80 & 0.602 & 18.93          & 2.23          & 214.8          \\
\midrule
\rowcolor{black!5}
\textbf{RAEv2 ($K{=}7$, ours)}  & \cmark & \cmark & 80 & 0.29          & 22.57 & \textbf{1.65} & \textbf{228.0} \\
\rowcolor{black!5}
\textbf{RAEv2 ($K{=}23$, ours)} & \cmark & \cmark & 80 & \textbf{0.18} & 27.03 & 3.02          & 206.0          \\
\bottomrule
\end{tabular}
\vskip 0.05in
\caption{\textbf{Reconstruction vs Generation Comparison.} RAEv2 in its generalized form improves both reconstruction and generation performance over recent representation-based autoencoders~\cite{lvrae,psvae,uae,dinotok,chang2026dino,vfmvae,aligntok,rpiae} \emph{without} fine-tuning the pretrained encoder. Furthermore, varying $K$ (the number of final encoder layers aggregated) provides an easy way to control the reconstruction--generation trade-off; on this benchmark RAEv2 achieves the best generation quality (gFID, IS) at $K{=}7$ and the best reconstruction rFID at $K{=}23$.}
\label{tab:rec_comparison}
\end{table*}
```
`\Paragraph{Impact of additional decoder training data for reconstruction.}`{=latex} `\tref{tab:decoder_data}`{=latex} reports reconstruction performance for the RAEv2 decoder trained on ImageNet alone and with additional training data from [@raet2i]. Adding more data consistently improves reconstruction. All results use only 16 epochs of decoder training; training for longer and with more data (similar to proprietary VAEs) may further improve reconstruction performance.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{3mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
::: {#tab:decoder_data}
  Decoder                                 PSNR $\uparrow$   SSIM $\uparrow$   LPIPS $\downarrow$   rFID $\downarrow$
  -------------------------------------- ----------------- ----------------- -------------------- -------------------
  DINOv3-L ($K{=}7$)                           22.58            0.6257              0.1531               0.299
  `\;`{=latex}`\;`{=latex} + more data         24.18            0.6946              0.1209               0.276
  DINOv3-L ($K{=}23$)                          27.04            0.8062              0.0874               0.185
  `\;`{=latex}`\;`{=latex} + more data         29.13            0.8625              0.0654               0.158

  : **Impact of additional data on RAEv2 decoder training.** Results with and without additional decoder training data from [@raet2i]. Adding more data consistently improves reconstruction. All results use only 16 epochs of decoder training with a frozen pretrained vision encoder; training for longer and with more data (similar to proprietary VAEs) may further improve reconstruction performance.
:::

```{=latex}
\vskip 0.05in
```
`\Paragraph{Generalized RAE formulation.}`{=latex} `\tref{tab:genrae_formulation_appendix}`{=latex} extends `\tref{tab:genrae_formulation}`{=latex} (main paper) with all swept $K \in \{2, 4, 6, 8\}$ values and all five reconstruction/generation metrics (PSNR, SSIM, rFID, gFID, IS). MLS consistently dominates MLR on Stage-2 generation (gFID) across every $K$, while the two methods are essentially tied on the Stage-1 reconstruction metrics.

```{=latex}
\setlength{\tabcolsep}{4mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
```{=latex}
\newcommand{\gb}{\cellcolor{gray!15}}
```
::: {#tab:genrae_formulation_appendix}
                      Last $K$ layers                     Method                  PSNR $\uparrow$        SSIM $\uparrow$       rFID $\downarrow$        gFID $\downarrow$            IS $\uparrow$
  ------------------------------------------------------- -------------------- ---------------------- ---------------------- ---------------------- -------------------------- --------------------------
                             2                            MLR                          19.72                  0.509                  0.570                    3.085                        --
                                                          `\gb{}`{=latex}MLS    `\gb{}`{=latex}19.44   `\gb{}`{=latex}0.502   `\gb{}`{=latex}0.532   `\gb{}`{=latex}**2.586**   `\gb{}`{=latex}**243.6**
                             4                            MLR                          20.86                  0.558                  0.425                    2.954                        --
                                                          `\gb{}`{=latex}MLS    `\gb{}`{=latex}20.50   `\gb{}`{=latex}0.545   `\gb{}`{=latex}0.435   `\gb{}`{=latex}**2.622**   `\gb{}`{=latex}**230.9**
                             6                            MLR                          21.97                  0.607                  0.342                    3.118                        --
                                                          `\gb{}`{=latex}MLS    `\gb{}`{=latex}21.92   `\gb{}`{=latex}0.605   `\gb{}`{=latex}0.336   `\gb{}`{=latex}**2.637**   `\gb{}`{=latex}**223.3**
                             8                            MLR                          23.36                  0.669                  0.268                    3.580                        --
                                                          `\gb{}`{=latex}MLS    `\gb{}`{=latex}23.30   `\gb{}`{=latex}0.663   `\gb{}`{=latex}0.264   `\gb{}`{=latex}**2.688**   `\gb{}`{=latex}**220.8**
  ------------------------------------------------------- -------------------- ---------------------- ---------------------- ---------------------- -------------------------- --------------------------

  : **Full ablation on formulation for generalized RAE.** Extended version of `\tref{tab:genrae_formulation}`{=latex}. We compare two parameter-free ways of combining the last $K$ encoder layers (`\sref{subsec:generalized_rae}`{=latex}): **MLS** (multi-layer sum) is a simple addition $\vx = \sum_{\ell} \vz_\ell$; **MLR** (multi-layer random projection) concatenates the layers and projects back with a fixed random matrix. Encoder is DINOv3-L (24 layers); Stage-1 reports decoder reconstruction (PSNR, SSIM, rFID); Stage-2 reports DiT$^{DH}$-XL training (gFID, IS at 20 epochs). MLS dominates MLR on Stage-2 gFID at every $K$, while the two are essentially tied on Stage-1 reconstruction.
:::
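
The two parameter-free aggregation rules from the caption can be sketched for a single patch token. The Gaussian entries and the $1/\sqrt{d_\text{in}}$ scaling of the random matrix are assumptions made for illustration, as are the function names. The caption only specifies that MLR concatenates the layers and projects back with a fixed random matrix.

```python
import random


def multi_layer_sum(layers):
    """MLS: elementwise sum of the per-layer feature vectors."""
    return [sum(vals) for vals in zip(*layers)]


def make_random_projection(in_dim, out_dim, seed=0):
    """Fixed Gaussian matrix of shape (out_dim, in_dim); never trained."""
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]


def multi_layer_random_projection(layers, proj):
    """MLR: concatenate the layers, then project back to the feature dim."""
    concat = [v for layer in layers for v in layer]
    return [sum(w * c for w, c in zip(row, concat)) for row in proj]


# Toy example: last K = 3 layers of one patch token, feature dim 4.
last_k = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.0, 0.0, 1.0], [0.0, -1.0, 1.0, 3.0]]
mls = multi_layer_sum(last_k)
proj = make_random_projection(in_dim=3 * 4, out_dim=4)
mlr = multi_layer_random_projection(last_k, proj)
```

Both rules keep the latent dimension equal to the encoder feature dimension, so the downstream decoder and diffusion model need no architectural change when $K$ varies.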

```{=latex}
\vskip 0.05in
```
`\Paragraph{Ablation on guidance mechanism.}`{=latex} `\tref{tab:convergence_guidance_appendix}`{=latex} extends `\tref{tab:convergence_guidance}`{=latex} (main paper) with the additional Inception Score (IS) column for both $K{=}7$ and $K{=}23$. REPA Guidance achieves the best gFID and IS in both configurations while requiring no separate model and no extra forward pass.

```{=latex}
\setlength{\tabcolsep}{4mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.1}
```
::: {#tab:convergence_guidance_appendix}
+:------------------------------------------------------+:-----------------:+:----------------:+:-----------------:+:-------------:+
| Guidance                                              | gFID $\downarrow$ | IS $\uparrow$    | gFID $\downarrow$ | IS $\uparrow$ |
|                                                       | ($K{=}7$)         | ($K{=}7$)        | ($K{=}23$)        | ($K{=}23$)    |
+-------------------------------------------------------+-------------------+------------------+-------------------+---------------+
| w/o Guidance                                          | 1.65              | 228.0            | 3.01              | 206.0         |
+-------------------------------------------------------+-------------------+------------------+-------------------+---------------+
| CFG [@cfg]                                            | 1.49              | 242.1            | 2.83              | 220.1         |
+-------------------------------------------------------+-------------------+------------------+-------------------+---------------+
| Autoguidance (AG) [@autoguidance]                     | 1.14              | 255.3            | 1.37              | 252.0         |
+-------------------------------------------------------+-------------------+------------------+-------------------+---------------+
| ```{=latex}                                           | **1.06**          | **255.3**        | **1.25**          | **256.8**     |
| \rowcolor{black!5}                                    |                   |                  |                   |               |
| ```                                                   |                   |                  |                   |               |
| REPA Guidance (Ours)                                  |                   |                  |                   |               |
+-------------------------------------------------------+-------------------+------------------+-------------------+---------------+

: **Full ablation on guidance mechanism in RAEv2.** Extended version of `\tref{tab:convergence_guidance}`{=latex}, additionally reporting Inception Score (IS). We compare four guidance options for RAEv2 across two encoder-layer aggregation choices ($K{=}7$ and $K{=}23$). REPA Guidance (`\sref{subsec:rae_x_prediction}`{=latex}) achieves the best gFID and IS while requiring no additional model (unlike AG) and no extra forward pass (unlike CFG). DiT$^{DH}$-XL backbone with DINOv3-L encoder.
:::

```{=latex}
\vskip 0.05in
```
```{=latex}
\Paragraph{Importance of x-prediction for self-guidance.}
```
We further ablate the choice of reparameterization, verifying the importance of x-prediction (`\sref{subsec:rae_x_prediction}`{=latex}) for self-guidance. `\tref{tab:xpred_ablation}`{=latex} compares v-prediction and x-prediction at $K{=}7$ without guidance. We observe that x-prediction, which corresponds to using REPA at intermediate layers, leads to the best generation performance with the generalized RAEv2.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{4mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.05}
```
::: {#tab:xpred_ablation}
  Parameterization                                                                     gFID $\downarrow$   IS $\uparrow$
  ----------------------------------------------------------------------------------- ------------------- ---------------
  Internal Guidance [@internalguidance]                                                      1.87             220.19
  `\rowcolor{black!5}`{=latex} Internal Guidance w/ x-prediction + REPA-head (ours)        **1.65**         **228.00**

  : **Importance of reparameterization to x-prediction with internal guidance [@internalguidance] for RAEv2.** Generation performance at $K{=}7$ and 80 epochs with DINOv3-L, without guidance. x-prediction (equivalent to using REPA at intermediate layers, `\sref{subsec:rae_x_prediction}`{=latex}) outperforms default internal guidance [@internalguidance]. Reparameterization to x-prediction is therefore important for achieving the best generation performance with RAEv2.
:::

```{=latex}
\vskip 0.05in
```
```{=latex}
\Paragraph{Impact of generalized RAE on understanding (linear probing).}
```
A key advantage of RAE is that it provides a unified tokenization for both understanding and generation. While the generalized RAE formulation greatly improves both reconstruction and generation performance (`\sref{subsec:generalized_rae}`{=latex}), it is important to understand its impact on the encoder's understanding performance (linear probing). We compare the original DINOv3-L final-layer encoder ($K{=}1$) against the generalized multi-layer-sum (MLS) variants used in RAEv2 ($K{=}7$ and $K{=}23$) in `\tref{tab:genrae_lp}`{=latex}. Despite significantly improving reconstruction and generation performance, the generalized RAE formulation does not meaningfully degrade the encoder's understanding performance, as measured by linear probing accuracy on ImageNet.

```{=latex}
\small
```
```{=latex}
\setlength{\tabcolsep}{3mm}
```
```{=latex}
\renewcommand{\arraystretch}{1.05}
```
::: {#tab:genrae_lp}
  Encoder                           Feature dim   LP top-1 (%) $\uparrow$
  -------------------------------- ------------- -------------------------
  DINOv3-L ($K{=}1$, last layer)       1024                85.39
  DINOv3-L MLS ($K{=}2$)               1024                85.29
  DINOv3-L MLS ($K{=}3$)               1024                85.28
  DINOv3-L MLS ($K{=}4$)               1024                85.15
  DINOv3-L MLS ($K{=}5$)               1024                85.13
  DINOv3-L MLS ($K{=}6$)               1024                85.14
  DINOv3-L MLS ($K{=}7$)               1024                85.10
  DINOv3-L MLS ($K{=}8$)               1024                85.10
  DINOv3-L MLS ($K{=}9$)               1024                85.10
  DINOv3-L MLS ($K{=}10$)              1024                85.12
  DINOv3-L MLS ($K{=}23$, full)        1024                85.24

  : **Impact of generalized RAE on understanding (linear probing).** Linear probing top-1 accuracy on ImageNet across all $K \in \{1, \dots, 10, 23\}$ for the generalized multi-layer-sum (MLS) variant on DINOv3-L. $K{=}1$ corresponds to the original RAE (final-layer feature). The generalized formulation (`\sref{subsec:generalized_rae}`{=latex}) improves reconstruction without meaningfully impacting global semantic performance, enabling unified tokenization for both understanding and generation. All values are computed at 30 epochs of LP training with learning rate $1\times 10^{-2}$; continued training may further improve linear probing scores.
:::

```{=latex}
\vskip 0.05in
```
```{=latex}
\Paragraph{Evaluation under the Monge Distance.}
```
Following the recent MIND framework [@mind], we additionally evaluate RAEv2 against RAE and REPA-E using the Monge Distance, an optimal-transport based alternative to the Fréchet Distance. `\tref{tab:monge_eval}`{=latex} reports the Representation Monge Distance (MD$_r$) computed across the same six feature spaces used for FD$_r$ in `\tref{tab:fd_eval}`{=latex}. RAEv2 attains the best MD$_r$ in five of six feature spaces in just 80 epochs, further corroborating the strong results under alternative evaluation metrics.

```{=latex}
\begin{table*}[t]

\small
\setlength{\tabcolsep}{4pt}
\begin{tabular}{l c cccccc}
\toprule
Method & Epochs
  & \multicolumn{6}{c}{Representation Monge Distance (MD$_r$)~\cite{mind} $\downarrow$} \\
\cmidrule(lr){3-8}
 & & Incep. & ConvNeXt & DINOv2 & MAE & SigLIP & CLIP \\
\midrule
REPA-E~\cite{repae}  & 800 & 1.112 $\pm$0.08 & 56.63 $\pm$1.69 & 26.82 $\pm$0.55 & 0.196 $\pm$0.01 &  4.44 $\pm$0.12 & 44.75 $\pm$0.14 \\
RAE-XL~\cite{rae}    & 800 & \textbf{0.808 $\pm$0.04} & 70.29 $\pm$1.87 & 19.70 $\pm$0.32 & 0.230 $\pm$0.01 &  2.96 $\pm$0.17 & 68.46 $\pm$1.18 \\
\rowcolor{black!5} \textbf{RAEv2 ($K{=}7$, ours)} & \textbf{80}  & 0.997 $\pm$0.04 & \textbf{31.71 $\pm$0.58} & \textbf{7.27 $\pm$0.20} & \textbf{0.133 $\pm$0.00} & \textbf{1.71 $\pm$0.08} & \textbf{41.68 $\pm$2.66} \\
\bottomrule
\end{tabular}
\caption{\textbf{Evaluation under the Monge Distance.} Following~\cite{mind}, we additionally evaluate methods using the Monge Distance as an alternative to the Fr\'echet Distance. Analogous to FD$_r$ \cite{fdr}, we report the Representation Monge Distance (MD$_r$) computed in six feature spaces (Inception, ConvNeXt, DINOv2, MAE, SigLIP, CLIP). Compared to prior baselines trained for 800 epochs, RAEv2 attains the best MD$_r$ in five of the six feature spaces in just 80 epochs, without any post-training. All results use 50K evaluation samples.}
\label{tab:monge_eval}
\end{table*}
```
Text-to-Image Generation {#sec:appendix_t2i}
------------------------

`\label{sec:appendix_generalization}`{=latex}

```{=latex}
\Paragraph{Training setup.}
```
We pretrain a text-to-image model from scratch on JourneyDB [@journeydb] together with the long-caption and short-caption subsets of BLIP3o [@blip3o], for 150K iterations at a global batch size of 1024. Following [@raet2i], we use SigLIP2-B [@siglip] as the RAE encoder and adapt the DiT$^{DH}$-XL backbone for text conditioning. Text captions are encoded by Qwen3-0.6B [@qwen2] with a maximum sequence length of 256 tokens. Optimization mirrors the ImageNet recipe (lr $2\times 10^{-4}$ linearly decayed to $2\times 10^{-5}$, `bfloat16`, EMA decay 0.9995). We then finetune on the BLIP3o-60k subset for 50 epochs at the same batch size.
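
The optimization settings above can be sketched as follows. This is a minimal illustration rather than the actual training code: the function names are hypothetical, and it assumes that the linear decay spans all 150K pretraining iterations with no warmup, which the text does not specify.

```python
def linear_decay_lr(step, total_steps=150_000, lr_start=2e-4, lr_end=2e-5):
    """Linearly interpolate the learning rate from lr_start to lr_end.

    Assumption: the decay covers the whole run (no warmup phase).
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lr_start + frac * (lr_end - lr_start)


def ema_update(ema_params, params, decay=0.9995):
    """One EMA step per parameter: ema <- decay * ema + (1 - decay) * param."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


if __name__ == "__main__":
    for step in (0, 75_000, 150_000):
        print(step, linear_decay_lr(step))
    print(ema_update([1.0], [0.0]))
```

With a decay of 0.9995, each EMA step moves the averaged weights only 0.05% of the way toward the current weights, so the EMA model used for sampling changes slowly relative to the online model.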

```{=latex}
\Paragraph{Evaluation.}
```
Following [@raet2i], we report results on GenEval [@geneval], DPG-Bench [@dpgbench], and GenAI-Bench [@li2024genai], covering compositional, dense-prompt, and human-preference axes. Samples are generated with the ODE (Euler) sampler at 50 steps using the EMA model.

`\noindent`{=latex} `\sethlcolor{Plum!10}`{=latex}`\hl{\textbf{Pretraining.}}`{=latex} Results are shown in Tab. `\ref{tab:t2i}`{=latex}. Compared to the widely used Flux-VAE [@flux], representation autoencoders yield significant improvements for text-to-image generation, and the improved RAEv2 training recipe yields further gains on all three benchmarks. For instance, Flux-VAE and RAE reach GenEval scores of 41.7 and 58.4 respectively, while RAEv2 reaches 62.4.

`\noindent`{=latex} `\sethlcolor{cyan!10}`{=latex}`\hl{\textbf{Finetuning.}}`{=latex} Following [@raet2i], we also finetune our pretrained model on the 60k finetuning dataset from BLIP3o [@blip3o], using a batch size of 1024 for 50 epochs. Consistent with the findings of [@raet2i], finetuning significantly increases performance, reaching 82.7 on GenEval with RAEv2. Furthermore, while finetuning narrows the gap between methods, RAEv2 still outperforms the original RAE and Flux-VAE.

::: {#tab:t2i}
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| Method                                   | Model                | Params | GenEval$\uparrow$ | GenAI-Bench$\uparrow$ | DPG-Bench$\uparrow$ |
+:=========================================+:=====================+:======:+:=================:+:=====================:+:===================:+
| `\rowcolor{Plum!10}`{=latex} Pretraining |                      |        |                   |                       |                     |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| Flux-VAE [@flux]                         | DiT$^{\text{DH}}$-XL | 0.9B   | 41.7              | 57.3                  | 77.6                |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| RAE [@rae]                               | DiT$^{\text{DH}}$-XL | 0.9B   | 58.4              | 63.2                  | 80.1                |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| `\rowcolor{black!5}`{=latex} RAEv2       | DiT$^{\text{DH}}$-XL | 0.9B   | **62.4**          | **63.8**              | **81.7**            |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| `\rowcolor{cyan!10}`{=latex} Finetuning  |                      |        |                   |                       |                     |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| Flux-VAE [@flux]                         | DiT$^{\text{DH}}$-XL | 0.9B   | 78.3              | 63.9                  | 79.2                |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| RAE [@rae]                               | DiT$^{\text{DH}}$-XL | 0.9B   | 81.5              | 67.2                  | 80.6                |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+
| `\rowcolor{black!5}`{=latex} RAEv2       | DiT$^{\text{DH}}$-XL | 0.9B   | **82.7**          | **68.0**              | **82.3**            |
+------------------------------------------+----------------------+--------+-------------------+-----------------------+---------------------+

: `\sethlcolor{Plum!10}`{=latex}`\hl{\textbf{Text-to-image generation.}}`{=latex} Comparison of the proposed RAEv2 with the original RAE [@rae] and Flux-VAE [@flux]. Pretraining results are reported at 150K steps with a batch size of 1024 on JourneyDB and the long-caption and short-caption subsets of the BLIP3o pretraining data [@blip3o]. For finetuning, following [@raet2i], we use the 60k subset from [@blip3o] with a batch size of 1024. Across all settings, RAEv2 outperforms the original RAE and Flux-VAE under the same training budget.
:::

```{=latex}
\vskip 0.1in
```
Navigation World Models {#sec:appendix_nwm}
-----------------------

We follow the navigation world modeling setup of NWM [@bar2024nwm]. In this setting, the model is conditioned on the last $N=4$ egocentric video frames together with an action sequence, and is trained to predict the next frame in the trajectory. At inference time, the model rolls out future frames *autoregressively*: at each step, the predicted frame is fed back into the context window so that long-horizon predictions can be produced from a short history.

```{=latex}
\Paragraph{Training setup.}
```
We use the same DiT$^{DH}$-XL backbone as in the previous sections, only altering the conditioning tokens to handle navigation inputs. The model conditions on $N=4$ past frames at $256\times 256$ resolution; each frame is encoded by the RAE encoder into a $16\times 16$ patch grid, giving $N \times 256 = 1024$ context tokens. We additionally append $4$ action tokens (encoding the egocentric action $(\Delta x, \Delta y, \Delta\psi)$) and a single Fourier-embedded time token for the rollout offset, for a total of $1029$ conditioning tokens (compared to $8$ for class-conditional ImageNet and $256$ for T2I). Following [@shah2022gnm; @shah2023vint; @sridhar2023nomad], we use the RECON [@bar2024nwm; @sridhar2023nomad] dataset with the same flow-matching, learning-rate schedule, and EMA recipe as our ImageNet experiments. We train for 100K iterations at a batch size of 256, on the same $4\times 8$ H100 setup used for the ImageNet experiments.
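
The conditioning-token count above follows from simple arithmetic, which the snippet below reproduces as a sanity check. The function and argument names are illustrative and not taken from the released code.

```python
def nwm_conditioning_tokens(num_frames=4, grid=16, action_tokens=4, time_tokens=1):
    """Count the conditioning tokens for the navigation world model.

    Each context frame is encoded into a grid x grid patch grid, and the
    action and time tokens are appended to the frame tokens.
    """
    frame_tokens = num_frames * grid * grid  # 4 * 256 = 1024
    return frame_tokens + action_tokens + time_tokens


if __name__ == "__main__":
    print(nwm_conditioning_tokens())  # 1024 + 4 + 1 = 1029
```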

```{=latex}
\Paragraph{Evaluation.}
```
Following [@bar2024nwm], we evaluate predicted frames against ground truth at horizons of $\{1, 2, 4, 8, 16\}$ seconds. Given an FPS of $f$, we obtain the prediction at a target horizon of $T$ seconds via $T \cdot f$ autoregressive rollout steps: at each step the model predicts the next frame conditioned on the current sliding window of $N$ context frames and the next ground-truth action, and the predicted RGB is re-encoded and fed back as context. We report FID [@fid], LPIPS [@lpips], PSNR, and DreamSim [@fu2023dreamsim] at each horizon, computed over rollout episodes sampled from the RECON validation split.
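
The rollout procedure described above can be sketched as a sliding-window loop. This is a minimal illustration under stated assumptions: `predict_next` is a hypothetical stand-in for the full encode, sample, and decode pipeline, and frames and actions are treated as opaque Python objects.

```python
from collections import deque


def rollout(predict_next, context_frames, actions, horizon_s, fps):
    """Autoregressively predict horizon_s * fps future frames.

    predict_next(context, action) -> next frame (hypothetical callable).
    context_frames: the N most recent frames, oldest first.
    actions: ground-truth actions, one per rollout step.
    """
    num_steps = int(round(horizon_s * fps))
    if len(actions) < num_steps:
        raise ValueError("need one action per rollout step")
    # A bounded deque drops the oldest frame when a new one is appended.
    window = deque(context_frames, maxlen=len(context_frames))
    predictions = []
    for t in range(num_steps):
        frame = predict_next(list(window), actions[t])
        predictions.append(frame)
        window.append(frame)  # feed the prediction back as context
    return predictions


if __name__ == "__main__":
    # Toy model: the "frame" is a number, and each action increments it.
    toy_model = lambda context, action: context[-1] + action
    frames = rollout(toy_model, [0, 0, 0, 0], [1] * 4, horizon_s=1, fps=4)
    print(frames)
```

The last element of the returned list is the prediction at the target horizon; intermediate elements correspond to the shorter horizons along the same trajectory.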

`\noindent`{=latex} `\sethlcolor{black!10}`{=latex}`\hl{\textbf{Future state prediction and synthesis.}}`{=latex} Across rollout horizons, RAEv2-NWM produces noticeably more accurate and temporally stable predictions than DIAMOND, NWM, and the RAE baseline. Quantitatively, RAEv2-NWM achieves an FVD of 105.61 on the RECON validation set, compared to 762.73 for DIAMOND, 200.97 for NWM, and 312.01 for RAE (`\tref{tab:nwm_fvd_appendix}`{=latex}); the same ordering holds at every horizon from 1 to 16 seconds on both FID and LPIPS (`\fref{fig:nwm_horizon}`{=latex}). Qualitatively, the rollouts also exhibit much less flickering between consecutive frames (`\fref{fig:nwm_qualitative}`{=latex}).

```{=latex}
\setlength{\tabcolsep}{8pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.15}
```
::: {#tab:nwm_fvd_appendix}
                             DIAMOND [@alonso2024diffusion]   NWM [@bar2024nwm]   RAE [@rae]   RAEv2 (ours)
  ------------------------- -------------------------------- ------------------- ------------ --------------
  \#Params                                 1B                        1B              622M          622M
  FVD [@fvd] $\downarrow$                762.73                    200.97           312.01      **105.61**

  : **Video prediction quality on RECON [@sridhar2023nomad].** FVD computed over autoregressive rollouts up to 16s. Reference values for DIAMOND and NWM are from [@bar2024nwm].
:::

```{=latex}
\vskip 0.05in
```
`\noindent`{=latex} `\sethlcolor{green!10}`{=latex}`\hl{\textbf{Importance of generalized representation autoencoders.}}`{=latex} A large fraction of these gains comes from using the generalized RAE formulation (`\sref{subsec:generalized_rae}`{=latex}), which aggregates the encoder's last $K$ layers rather than relying on the final layer alone. The earlier layers retain low-level texture and geometry that are critical for temporally consistent navigation rollouts. As a result, the generalized formulation converges substantially faster during training (`\fref{fig:nwm_convergence}`{=latex}; reaching the RAE baseline's final error within roughly 10K iterations), and produces better future-state prediction and video quality across rollout horizons, translating into the substantially lower FVD reported in `\tref{tab:nwm_fvd}`{=latex}.
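
As a rough illustration of the multi-layer aggregation referred to above, the snippet below sums the token features of the last $K$ encoder layers. It is a sketch under assumptions: the per-layer features are taken to be arrays of identical shape, the sum is unweighted with no per-layer normalization, and the function name is hypothetical.

```python
import numpy as np


def multi_layer_sum(hidden_states, k):
    """Sum the token features of the last k encoder layers.

    hidden_states: list of arrays, one per layer, each [num_tokens, dim].
    k=1 returns the final-layer feature (the original RAE setting).
    """
    if not 1 <= k <= len(hidden_states):
        raise ValueError("k must be between 1 and the number of layers")
    return np.sum(np.stack(hidden_states[-k:], axis=0), axis=0)


if __name__ == "__main__":
    # Four dummy layers, each with 2 tokens of dimension 3.
    layers = [np.full((2, 3), float(i)) for i in range(1, 5)]
    print(multi_layer_sum(layers, k=1))
    print(multi_layer_sum(layers, k=3))
```

The output keeps the same token grid and feature dimension as a single layer, so the downstream decoder and diffusion model see an input of unchanged shape regardless of $K$.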

`\noindent`{=latex} `\sethlcolor{blue!10}`{=latex}`\hl{\textbf{Impact on convergence speed.}}`{=latex} `\fref{fig:nwm_convergence}`{=latex} shows training curves on RECON under the online single-shot protocol with random offset $\in [1, 8]$ frames at 4 FPS, i.e. predictions $0.25$--$2$ seconds into the future. RAEv2-NWM converges within $\sim$30K iterations to noticeably lower FID and LPIPS than the RAE baseline (FID 7.5 vs. 18.0, LPIPS 0.24 vs. 0.29), and matches the RAE baseline's final FID within the first 10K iterations. This mirrors the speedup we observe on ImageNet (`\sref{subsec:convergence}`{=latex}) and indicates that the improved recipe transfers to navigation world models without modification.

Qualitative Results {#sec:appendix_qualitative}
===================

```{=latex}
\Paragraph{Text-to-image generation.}
```
We additionally show text-to-image samples from RAEv2 (0.9B) in `\fref{fig:qualitative_t2i}`{=latex}--`\fref{fig:qualitative_t2i3}`{=latex}. The model is trained for 100K iterations with batch size 1024 and evaluated on MJHQ test set prompts, generating at 256$\times$256 resolution using self-guidance with the REPA head. Despite the relatively short training schedule and small model size, the samples demonstrate strong prompt adherence across a range of subjects including animals, landscapes, and stylized scenes. The corresponding text prompts are listed in `\fref{fig:qualitative_t2i_prompts}`{=latex}--`\fref{fig:qualitative_t2i_prompts3}`{=latex}.

```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_p1.pdf}
\vskip -0.05in
\caption{\textbf{Text-to-image qualitative examples at 256$\times$256 resolution (1/3).} RAEv2 (0.9B) trained for 100K iterations with batch size 1024, evaluated on MJHQ test set prompts. Corresponding prompts are listed in \fref{fig:qualitative_t2i_prompts}.}
\label{fig:qualitative_t2i}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_p2.pdf}
\vskip -0.05in
\caption{\textbf{Text-to-image qualitative examples at 256$\times$256 resolution (2/3).} RAEv2 (0.9B) trained for 100K iterations with batch size 1024, evaluated on MJHQ test set prompts. Corresponding prompts are listed in \fref{fig:qualitative_t2i_prompts2}.}
\label{fig:qualitative_t2i2}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_p3.pdf}
\vskip -0.05in
\caption{\textbf{Text-to-image qualitative examples at 256$\times$256 resolution (3/3).} RAEv2 (0.9B) trained for 100K iterations with batch size 1024, evaluated on MJHQ test set prompts. Corresponding prompts are listed in \fref{fig:qualitative_t2i_prompts3}.}
\label{fig:qualitative_t2i3}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_prompts_p1.pdf}
\vskip -0.05in
\caption{\textbf{Text prompts for T2I qualitative samples (1/3).} Prompts corresponding to the generated images in \fref{fig:qualitative_t2i}--\fref{fig:qualitative_t2i3}.}
\label{fig:qualitative_t2i_prompts}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_prompts_p2.pdf}
\vskip -0.05in
\caption{\textbf{Text prompts for T2I qualitative samples (2/3).}}
\label{fig:qualitative_t2i_prompts2}
\end{figure*}
```
```{=latex}
\begin{figure*}[t]

\includegraphics[width=\textwidth]{assets/t2i_qualitative_prompts_p3.pdf}
\vskip -0.05in
\caption{\textbf{Text prompts for T2I qualitative samples (3/3).}}
\label{fig:qualitative_t2i_prompts3}
\end{figure*}
```
Discussion and Limitations {#sec:discussion}
==========================

We next discuss some limitations of the current work, which may motivate further research in this area. We only consider very simple approaches for Generalized Representation Autoencoders. In particular, we only consider simple addition and random projection for aggregating features across different layers of a pretrained vision encoder. In the future, better optimization of the aggregation recipe may provide further gains for both generation and reconstruction.

Also, similar to iREPA [@irepa], we identify the best representation for RAE through an empirical search over a discrete set of pretrained encoders. In future work, we would like to directly optimize the representation itself for better generation through end-to-end learning [@repae].

```{=latex}
\addtocontents{toc}{\protect\setcounter{tocdepth}{-1}}
```
Note on LLM Usage {#sec:appendix_llm_usage}
=================

All figures in the paper are directly generated from our experiment logs and checkpoints using Claude Code (Anthropic, 2025). Additionally, we use LLM help for searching and formulating relevant work in `\sref{sec:appendix_related}`{=latex}. We also use Cursor in some parts to help with paper writing.
