---
abstract: |
  End-to-end backpropagation requires storing activations throughout all layers, creating memory bottlenecks that limit model scalability. Existing block-wise training methods offer means to alleviate this problem, but they rely on ad-hoc local objectives and remain largely unexplored beyond classification tasks. We propose *DiffusionBlocks*, a principled framework for transforming transformer-based networks into genuinely independent trainable blocks that maintain competitive performance with end-to-end training. Our key insight leverages the fact that residual connections naturally correspond to updates in a dynamical system. With minimal modifications to this system, we can convert the updates to those of a denoising process, where each block can be learned independently by leveraging the score matching objective. This independence enables training with gradients for only one block at a time, thereby reducing memory requirements in proportion to the number of blocks. Our experiments on a range of transformer architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion) demonstrate that DiffusionBlocks training matches the performance of end-to-end training while enabling scalable block-wise training on practical tasks beyond small-scale classification. DiffusionBlocks provides a theoretically grounded approach that successfully scales to modern generative tasks across diverse architectures.

  Code is available at: `\small`{=latex}<https://github.com/SakanaAI/DiffusionBlocks>.
author:
- |
  Makoto Shing^1^, Masanori Koyama^2^, Takuya Akiba^1^\
  ^1^Sakana AI, ^2^The University of Tokyo\
  `{mkshing,takiba}@sakana.ai`, `masanori.koyama@weblab.t.u-tokyo.ac.jp`
bibliography:
- iclr2026\_conference.bib
title: 'DiffusionBlocks: Block-wise Neural Network Training via Diffusion Interpretation'
---

```{=latex}
\PassOptionsToPackage{table}{xcolor}
```
```{=latex}
\newcommand{\cZ}{\mathcal{Z}}
```
```{=latex}
\newcommand{\bz}{\mathbf{z}}
```
```{=latex}
\newcommand{\cF}{\mathcal{F}}
```
```{=latex}
\newcommand{\RR}{\mathbb{R}}
```
```{=latex}
\newcommand{\cN}{\mathcal{N}}
```
```{=latex}
\newcommand{\cM}{\mathcal{M}}
```
```{=latex}
\newcommand{\cT}{\mathcal{T}}
```
```{=latex}
\newcommand{\method}{DiffusionBlocks}
```
```{=latex}
\newcommand{\DF}{DiffusionBlock}
```
```{=latex}
\newcommand{\fblock}{\bar{f}}
```
```{=latex}
\renewcommand{\algorithmicrequire}{\textbf{Input:}}
```
```{=latex}
\renewcommand{\algorithmicensure}{\textbf{Output:}}
```
```{=latex}
\renewcommand{\algorithmicreturn}{\textbf{Given:}}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\newcommand{\todo}[1]{\textcolor{red}{\textbf{TODO: }#1}}
```
```{=latex}
\newcommand{\takuya}[1]{\textcolor{orange}{\textbf{Takuya: }#1}}
```
```{=latex}
\newcommand{\makoto}[1]{\textcolor{cyan}{\textbf{Makoto: }#1}}
```
```{=latex}
\newcommand{\koyama}[1]{\textcolor{magenta}{\textbf{}#1}}
```
```{=latex}
\newcommand{\ja}[1]{\begin{CJK}{UTF8}{min}#1\end{CJK}}
```
```{=latex}
\maketitle
```
Introduction
============

#### The memory bottleneck in neural network training.

Modern AI led by generative models [@brown2020languagemodelsfewshotlearners; @rombach2022high; @touvron2023llama2openfoundation; @peebles2023dit] has become integral to everyday life. These models rely on *end-to-end backpropagation*, which requires storing intermediate activations across network layers during training. This fundamental requirement causes memory consumption to grow linearly with network depth, creating computational bottlenecks that limit both research flexibility and practical deployment.

#### Block-wise training: promises and limitations.

*Block-wise training* methods[^1] partition networks into smaller components that can be trained independently, promising dramatic memory savings. Despite this potential, existing approaches [@hinton2022forwardforward; @bengio2006greedy; @nokland2019training; @belilovsky2019greedy; @siddiqui2024blockwise] consistently underperform end-to-end training. The core challenge is twofold: (1) lack of theoretical grounding, as existing methods rely on ad-hoc local objectives without principled coordination between blocks; and (2) limited applicability, as they require paradigm-specific designs and task-specific objectives that do not naturally extend beyond classification. Their results are typically demonstrated only on custom architectures, with no systematic procedure for applying them to modern architectures such as Transformers [@vaswani2017attention] (Section `\ref{sec:relatedworks}`{=latex}), which leaves their applicability to modern generative AI largely unexplored. Without a systematic framework grounded in theory, block-wise training remains an unfulfilled promise.

#### Diffusion models: a mathematical foundation for decomposition.

Score-based diffusion models [@song2019; @song2021scorebased] model the data distribution through a continuous-time process that gradually adds noise, then learns to reverse this process by estimating the score function at each noise level. Crucially, the denoising step at each noise level can be optimized independently from other noise levels. This independence property provides the theoretical foundation that has been missing from block-wise training approaches: it allows us to partition networks into blocks, each responsible for a specific noise level range, without compromising global coherence.

#### Our approach: interpreting networks as diffusion processes.

We propose **`\method`{=latex}**, a framework that enables principled block-wise training by interpreting sequential layer updates in transformer-based networks as discretized steps of a continuous-time diffusion process. Building on the established connection between residual networks and differential equations [@haber2017stable; @chen2018neural], we leverage the fact that residual connections naturally correspond to Euler discretization of the probability flow ODE in diffusion models. This correspondence allows us to partition networks with residual connections, particularly transformer-based networks, into blocks that each handle specific noise-level ranges. These blocks can be trained completely independently, requiring gradients for only one block at a time. Figure `\ref{fig:overview}`{=latex} illustrates the core concept of `\method{}`{=latex}. Unlike previous block-wise methods with ad-hoc objectives, our framework derives each block's training objective from score matching theory. As a result, consistent local optimization at each noise level collectively yields a faithful approximation of the global reverse process, while also allowing practitioners to seamlessly adopt techniques such as those of @karras2022edm to further enhance training.

![**Overview of DiffusionBlocks.** **Left:** Standard networks require backpropagation through all layers. **Center:** DiffusionBlocks partitions networks into blocks, each trained independently to denoise within assigned noise ranges. **Right:** Applications. For diffusion models (top), inference requires only the relevant block per denoising step. For recurrent-depth models (bottom), our framework replaces iterative training with single-pass training, eliminating the computational overhead of backpropagation through time.](figures/DiffusionBlocks.png){#fig:overview width=".8\\textwidth"}

Our main contributions are:

-   **Block-wise training via continuous-time diffusion interpretation:** We show that transformer-based networks can be interpreted as implementing discretized steps of continuous-time diffusion processes (Section `\ref{sec:residual}`{=latex}), enabling genuinely independent block training. Each block learns to denoise within its assigned noise level range, requiring gradients for only one block at a time during training (Section `\ref{sec:method-core}`{=latex}).

-   **Equi-probability partitioning for balanced learning:** We propose a principled, diffusion-theoretic strategy that partitions noise levels based on equal cumulative probability mass, ensuring balanced parameter utilization across blocks (Section `\ref{sec:partitioning}`{=latex}).

-   **Broad applicability with maintained performance:** We conduct extensive experiments (Section `\ref{sec:experiments}`{=latex}), demonstrating that DiffusionBlocks successfully applies to diverse architectures (vision, diffusion, autoregressive, recurrent-depth, and masked diffusion), achieving performance competitive with end-to-end backpropagation while requiring gradients for only one block at a time. Additionally, our framework naturally extends to recurrent-depth models, transforming their multiple-iteration training into single-pass training (Section `\ref{sec:exp-recurrent}`{=latex}).

-   **Significant efficiency gains:** During training, only one block requires gradient computation, reducing memory requirements in proportion to the number of blocks. For diffusion models, inference requires only the one relevant block per denoising step (Section `\ref{sec:exp-edm}`{=latex}). For recurrent-depth models, our framework removes the need for $K$ recurrent iterations during training, yielding up to a $K$-fold reduction in training computation (Section `\ref{sec:exp-recurrent}`{=latex}).

Preliminaries {#sec:preliminaries}
=============

Score-based diffusion models
----------------------------

We adopt the Variance Exploding (VE) formulation [@song2021scorebased; @karras2022edm], in which a clean data point $\mathbf{y} \sim p_{\text{data}}$ is perturbed with Gaussian noise at noise level $\sigma$: $\mathbf{z}_{\sigma} = \mathbf{y} + \sigma \boldsymbol{\epsilon}$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. For generation, we use the deterministic probability flow ODE that reverses the noising process: $$\frac{\mathrm{d}\mathbf{z}_{\sigma}}{\mathrm{d}\sigma} = -\sigma \nabla_{\mathbf{z}} \log p_{\sigma}(\mathbf{z}_{\sigma}),
\label{eq:pf_ode_sigma}$$ where $\nabla_{\mathbf{z}} \log p_\sigma(\mathbf{z}_\sigma)$ is the score function. Using Tweedie's formula, the score is approximated via a denoiser $D_{\boldsymbol{\theta}}(\mathbf{z}_{\sigma}, \sigma)$ that predicts clean data from noisy input: $\nabla_{\mathbf{z}} \log p_{\sigma}(\mathbf{z}_{\sigma}) \approx \frac{D_{\boldsymbol{\theta}}(\mathbf{z}_{\sigma}, \sigma) - \mathbf{z}_{\sigma}}{\sigma^2}$ [@Robbins1992; @hyvarinen2005estimation; @vincent2011dsm]. The denoiser is trained by minimizing: $$\mathcal{L}(\boldsymbol{\theta}) := \mathbb{E}_{\mathbf{y} \sim p_{\text{data}}, \sigma \sim p_{\text{noise}}, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ w(\sigma) \|D_{\boldsymbol{\theta}}(\mathbf{y} + \sigma\boldsymbol{\epsilon}, \sigma) - \mathbf{y}\|_2^2 \right],
\label{eq:denoising-loss}$$ where $w(\sigma)$ weights different noise levels and $p_{\text{noise}}$ is the noise level distribution used during training. The choice of $p_{\text{noise}}$ determines which noise levels are emphasized during training. @karras2022edm use a log-normal distribution to concentrate training on perceptually important intermediate noise levels where image structure emerges. The weighting $w(\sigma)$ is designed to counteract the sampling bias from $p_{\text{noise}}$, ensuring balanced gradient magnitudes across all noise levels [@karras2022edm].
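
As a small sanity check of Tweedie's formula and the loss in Eq. (`\ref{eq:denoising-loss}`{=latex}), the sketch below uses a one-dimensional Gaussian data distribution, for which the optimal denoiser is known in closed form. The data scale `S_DATA`, the log-normal parameters `P_MEAN` and `P_STD`, and the particular form of `w` are illustrative assumptions in the style of @karras2022edm, not values taken from this paper.

```python
import math
import random

S_DATA = 0.5                # assumed std of the toy data distribution N(0, S_DATA^2)
P_MEAN, P_STD = -1.2, 1.2   # assumed log-normal parameters for p_noise


def sample_sigma(rng):
    """Draw a noise level sigma from a log-normal p_noise."""
    return math.exp(rng.gauss(P_MEAN, P_STD))


def ideal_denoiser(z, sigma):
    """E[y | z] when y ~ N(0, S_DATA^2) and z = y + sigma * eps."""
    return z * S_DATA**2 / (S_DATA**2 + sigma**2)


def score_from_denoiser(denoiser, z, sigma):
    """Tweedie's formula: score(z) = (D(z, sigma) - z) / sigma^2."""
    return (denoiser(z, sigma) - z) / sigma**2


def w(sigma):
    """Assumed loss weighting; any positive function of sigma would do here."""
    return (sigma**2 + S_DATA**2) / (sigma * S_DATA) ** 2


def denoising_loss(denoiser, rng, n_samples=2000):
    """Monte Carlo estimate of E[w(sigma) * (D(y + sigma * eps, sigma) - y)^2]."""
    total = 0.0
    for _ in range(n_samples):
        y = rng.gauss(0.0, S_DATA)
        sigma = sample_sigma(rng)
        z = y + sigma * rng.gauss(0.0, 1.0)
        total += w(sigma) * (denoiser(z, sigma) - y) ** 2
    return total / n_samples


z, sigma = 0.8, 1.5
exact_score = -z / (S_DATA**2 + sigma**2)   # score of p_sigma = N(0, S_DATA^2 + sigma^2)
tweedie_score = score_from_denoiser(ideal_denoiser, z, sigma)

# Same seed for both estimates, so they are evaluated on identical samples.
loss_ideal = denoising_loss(ideal_denoiser, random.Random(0))
loss_identity = denoising_loss(lambda z, sigma: z, random.Random(0))
```

For this toy distribution, the Tweedie estimate built from the ideal denoiser coincides with the analytic score, and the ideal denoiser attains a lower weighted loss than a denoiser that simply returns its noisy input. Under the assumed weighting, the expected weighted error of the ideal denoiser equals 1 at every $\sigma$, which is one way to read the statement that $w(\sigma)$ balances the contribution of different noise levels.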

Residual connections as Euler steps of the reverse diffusion process {#sec:residual}
--------------------------------------------------------------------

The connection between residual networks and differential equations has been established in prior works [@haber2017stable; @chen2018neural]. We extend this perspective to show that residual networks naturally implement discretized steps of the reverse diffusion process. Applying Euler discretization to Eq. (`\ref{eq:pf_ode_sigma}`{=latex}) with noise levels $\sigma_0 > \sigma_1 > \cdots > \sigma_T$, we define $\Delta\sigma_\ell := \sigma_{\ell-1} - \sigma_\ell > 0$ and obtain: $$\begin{aligned}
\mathbf{z}_{\sigma_\ell} &= \mathbf{z}_{\sigma_{\ell-1}} + \Delta\sigma_\ell \cdot \sigma_{\ell-1} \nabla_{\mathbf{z}}\log p_{\sigma_{\ell-1}}(\mathbf{z}_{\sigma_{\ell-1}})\\
&= \mathbf{z}_{\sigma_{\ell-1}} + \frac{\Delta\sigma_\ell}{\sigma_{\ell-1}} \left(D_{\boldsymbol{\theta}}(\mathbf{z}_{\sigma_{\ell-1}}, \sigma_{\ell-1}) - \mathbf{z}_{\sigma_{\ell-1}}\right).
\label{eq:euler-step}\end{aligned}$$ As has long been exploited in the design of networks with sequential updates, this update rule closely resembles a skip connection. Indeed, modern architectures such as Transformers [@vaswani2017attention] employ residual connections in which each block updates its input through an additive transformation: $\mathbf{z}_{\ell} = \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})$, where $\mathbf{z}_{\ell} \in \mathbb{R}^d$ denotes the intermediate output of block $\ell$ and $f_{\theta_\ell}$ is the block transformation parameterized by $\theta_\ell$. This structure appears in ResNets [@he2016resnet], Transformers, and other modern architectures [@peebles2023dit; @touvron2023llama2openfoundation; @deepseekr1]. The same scheme is used in recent recurrent-depth models [@dehghani2018universal; @fan2025looped; @geiping2025scalingtesttimecomputelatent], which apply the same network parameters $\boldsymbol{\theta}$ recursively $K$ times: $\mathbf{z}_{k} = \mathbf{z}_{k-1} + f_{\boldsymbol{\theta}}(\mathbf{z}_{k-1})$ for $k \in [K]$. However, these models rely on expensive *backpropagation through time (BPTT)*, and various measures have been taken to reduce its computational burden, such as gradient truncation [@williams1995gradient; @mikolov2010recurrent; @geiping2025scalingtesttimecomputelatent]. The critical observation is that, in the diffusion setting introduced in the previous section, $D_{\boldsymbol{\theta}}$ in Eq. (`\ref{eq:euler-step}`{=latex}) can be trained with Eq. (`\ref{eq:denoising-loss}`{=latex}) without BPTT, which provides a theoretically sound way to optimize a dynamical system through an ensemble of local optimizations.
In the next section, we provide a recipe for converting networks with skip connections into diffusion models, thereby replacing *backpropagation through layers* with an optimization scheme analogous to Eq. (`\ref{eq:denoising-loss}`{=latex}).
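
To make the residual reading concrete, the following sketch applies Euler steps of the probability flow ODE in Eq. (`\ref{eq:pf_ode_sigma}`{=latex}), with the score replaced by its denoiser-based estimate, so that each step adds a correction pointing from the current state toward the denoiser output. The toy data distribution (a point mass at `Y_TARGET`), the noise schedule, and all function names are assumptions made for illustration only.

```python
def euler_step(z, sigma_prev, sigma_next, denoiser):
    """One Euler step of the probability flow ODE from sigma_prev down to sigma_next."""
    delta = sigma_prev - sigma_next                      # Delta sigma > 0
    residual = (delta / sigma_prev) * (denoiser(z, sigma_prev) - z)
    return z + residual                                  # same shape as z_l = z_{l-1} + f(z_{l-1})


def sample(z_start, sigmas, denoiser):
    """Chain Euler steps along a decreasing noise schedule, like stacked residual blocks."""
    z = z_start
    for sigma_prev, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        z = euler_step(z, sigma_prev, sigma_next, denoiser)
    return z


Y_TARGET = 3.0                                   # toy data distribution: a point mass at 3.0


def point_mass_denoiser(z, sigma):
    """Optimal denoiser for the point-mass toy distribution: always predict Y_TARGET."""
    return Y_TARGET


sigmas = [80.0, 20.0, 5.0, 1.0, 0.2, 0.0]        # assumed schedule sigma_0 > ... > sigma_T
z_start = Y_TARGET + sigmas[0] * 0.7             # noisy start with a fixed noise draw eps = 0.7
z_final = sample(z_start, sigmas, point_mass_denoiser)
```

In this toy case every step shrinks the distance to the target by the factor $\sigma_\ell / \sigma_{\ell-1}$, so the chain of residual updates ends at `Y_TARGET` once the schedule reaches $\sigma = 0$.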

Method {#sec:method}
======

![**Three-step conversion of a standard neural network to DiffusionBlocks for training.** **Step 1:** Partition $L$ layers into $B$ blocks. **Step 2:** Define a noise distribution $p_\sigma$ (e.g., log-normal) and partition the range $[\sigma_{\min}, \sigma_{\max}]$ into $B$ intervals $\{[\sigma_{b}, \sigma_{b-1}]\}_{b=1}^B$, assigning each block a specific noise range (Section `\ref{sec:partitioning}`{=latex}). **Step 3:** Augment blocks with noise conditioning: extend the input to $\tilde{\mathbf{x}} = (\mathbf{x}, \mathbf{z}_\sigma)$, where $\mathbf{z}_\sigma = \mathbf{y} + \sigma\boldsymbol{\epsilon}$, and incorporate noise-level conditioning (e.g., via AdaLN). Each block is then trained independently of the other blocks to predict the target $\mathbf{y}$ within its assigned noise range.](figures/meta-algo.png){#fig:meta-algo width=".85\\linewidth"}
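
Step 2 of the figure can be sketched as follows, assuming a log-normal noise distribution truncated to $[\sigma_{\min}, \sigma_{\max}]$ and boundaries placed at equally spaced quantiles, so that every block receives the same probability mass. The function names and the parameter values in the usage lines are hypothetical choices for illustration; the partitioning actually used is specified in Section `\ref{sec:partitioning}`{=latex}.

```python
import math
from statistics import NormalDist


def equal_mass_boundaries(num_blocks, p_mean, p_std, sigma_min, sigma_max):
    """Return sigma_0 > sigma_1 > ... > sigma_B with equal p_noise mass per interval."""
    log_sigma = NormalDist(mu=p_mean, sigma=p_std)   # log(sigma) is Gaussian for a log-normal
    cdf_lo = log_sigma.cdf(math.log(sigma_min))
    cdf_hi = log_sigma.cdf(math.log(sigma_max))
    boundaries = []
    for b in range(num_blocks + 1):
        # Walk down from the upper quantile in equal steps of probability mass.
        q = cdf_hi - (cdf_hi - cdf_lo) * b / num_blocks
        boundaries.append(math.exp(log_sigma.inv_cdf(q)))
    return boundaries


def block_for_sigma(sigma, boundaries):
    """1-based index b of the block whose range [sigma_b, sigma_{b-1}] contains sigma."""
    for b in range(1, len(boundaries)):
        if sigma >= boundaries[b]:
            return b
    return len(boundaries) - 1                       # clamp values below sigma_min


# Illustrative values only: B = 4 blocks over [0.002, 80].
bounds = equal_mass_boundaries(4, p_mean=-1.2, p_std=1.2, sigma_min=0.002, sigma_max=80.0)
high_noise_block = block_for_sigma(50.0, bounds)
low_noise_block = block_for_sigma(0.01, bounds)
```

With these assumed settings the intervals are far from equal in width: most of the boundaries cluster around intermediate noise levels where the log-normal density is concentrated, while the first and last blocks cover the long tails.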

```{=latex}
\small
```
```{=latex}
\definecolor{myblue}{HTML}{70A3D2}
```
```{=latex}
\definecolor{myorange}{HTML}{FF9966}
```
```{=latex}
\begin{tcolorbox}[
title=Standard Network -- Training,
    left=0mm, right=2mm,
    colback=myblue!5!white,
    colframe=myblue!75!black,
    halign title=center,
]\begin{algorithmic}[1]
\small
\State \textbf{Given:} Network with parameters $\boldsymbol{\theta}$
\State Sample data $(\mathbf{x}, \mathbf{y})$
\State $\mathbf{z}_{0} \leftarrow \mathbf{x}$
\For{$\ell = 1$ to $L$}
    \State $\mathbf{z}_{\ell} \leftarrow \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})$
\EndFor
\State $\hat{\mathbf{y}} \leftarrow \mathbf{z}_{L}$
\State $\mathcal{L} \leftarrow \text{Loss}(\hat{\mathbf{y}}, \mathbf{y})$
\State Update all $\boldsymbol{\theta}$ via backprop
\end{algorithmic}
\end{tcolorbox}
```
```{=latex}
\hfill
```
```{=latex}
\begin{tcolorbox}[
title=DiffusionBlocks -- Training,
    left=0mm, right=2mm,
    colback=myorange!5!white,
    colframe=myorange!75!black,
    halign title=center,
]\begin{algorithmic}[1]
\small
\State \textbf{Given:} A single block $b \in [B]$ with parameters $\boldsymbol{\theta}_b$
\State Sample data $(\mathbf{x}, \mathbf{y})$
\State Sample $\sigma \sim p_{\text{noise}}^{(b)}$ from $[\sigma_{b}, \sigma_{b-1}]$ 
\State $\hat{\mathbf{y}} \leftarrow \fblock_{\boldsymbol{\theta}_b \mid \sigma}(\mathbf{x}, \mathbf{y} + \sigma\boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ \Comment{Apply block $b$ to denoise}
\State $\mathcal{L} \leftarrow w(\sigma) \cdot  \text{Loss}(\hat{\mathbf{y}}, \mathbf{y})$ \Comment{Weighted loss}
\State Update only $\boldsymbol{\theta}_b$ via backprop
\end{algorithmic}
\end{tcolorbox}
```
```{=latex}
\begin{tcolorbox}[title=Standard Network -- Inference,
    left=0mm, right=2mm,
    colback=myblue!5!white,
    colframe=myblue!75!black,
    halign title=center,
    ]\begin{algorithmic}[1]
\small
\State \textbf{Input:} $\mathbf{x}$
\State $\mathbf{z}_{0} \leftarrow \mathbf{x}$
\For{$\ell = 1$ to $L$}
    \State $\mathbf{z}_{\ell} \leftarrow \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})$
\EndFor
\State \textbf{Output:} $\mathbf{z}_{L}$
\end{algorithmic}
\end{tcolorbox}
```
```{=latex}
\hfill
```
```{=latex}
\begin{tcolorbox}[
title=DiffusionBlocks -- Inference,
    left=0mm, right=2mm,
    colback=myorange!5!white,
    colframe=myorange!75!black,
    halign title=center,
    ]\begin{algorithmic}[1]
\footnotesize
\State \textbf{Input:} $\mathbf{x}$, noise levels $\{\sigma_i\}_{i=0}^T$ with $\sigma_0 = \sigma_{\max}$ \Comment{Typically,  $T=B$}
\State $\mathbf{z}_0 \sim \mathcal{N}(\mathbf{0}, \sigma_{\max}^2\mathbf{I})$
\For{$i = 1$ to $T$}
\State Select block $b$ where $\sigma_i \in [\sigma_{b}, \sigma_{b-1}]$
\State $\hat{\mathbf{y}} \leftarrow \fblock_{\boldsymbol{\theta}_b \mid \sigma_{i-1}}(\mathbf{x}, \mathbf{z}_{i-1})$
\State $\mathbf{z}_{i} \leftarrow \text{Euler step}(\mathbf{z}_{i-1}, \hat{\mathbf{y}}, \sigma_{i-1}, \sigma_i)$ \Comment{Eq.~(\ref{eq:new_update})}
\EndFor
\State \textbf{Output:} $\mathbf{z}_T$
\end{algorithmic}
\end{tcolorbox}
```
Converting a neural network to `\method{}`{=latex} {#sec:method-core}
--------------------------------------------------

Our goal in this section is to transform a given feedforward system into a discretized version of the recursive denoising steps of a diffusion model. Throughout this paper, we denote by $(\mathbf{x}, \mathbf{y})$ the input-output pairs, where $\mathbf{x}$ represents the network input (e.g., images for classification) and $\mathbf{y}$ is the target output (e.g., class labels for classification). Figure `\ref{fig:overview}`{=latex} provides an overview: instead of backpropagating through all layers, we partition the network into blocks that independently learn to denoise within assigned noise-level ranges. Consider a neural network in the form of a stack of set-to-set maps (e.g., transformer-based networks) $\cF = \{f_{\theta_\ell} \mid \ell \in [L] \}$ with the same input and output dimensions, so that $f_{\theta_\ell}$ maps a variable-size set of tokens in $\RR^d$ to the same number of tokens in $\RR^d$. The original network therefore processes the input with $f_{\theta_{L}} \circ \cdots \circ f_{\theta_{1}}$, possibly followed by a readout module. In the more conventional formulation with residual connections, the original network updates the layer input $\mathbf{z}_{\ell-1}$ via the rule $\mathbf{z}_{\ell} = \mathbf{z}_{\ell-1} + f_{\theta_\ell}(\mathbf{z}_{\ell-1})$. We transform this network into a stack of diffusion blocks through the following three steps (Figure `\ref{fig:meta-algo}`{=latex}).

#### Step 1: Block partitioning.

We partition $\cF$ into $B$ blocks $\cF = \uplus_{b=1}^B \cF_b$, where $\cF_b$ contains layers indexed by $\{\ell_{b-1}+1, \ldots, \ell_b\}$. Let $\fblock_{\boldsymbol{\theta}_b} := f_{\theta_{\ell_b}} \circ \cdots \circ f_{\theta_{\ell_{b-1}+1}}$ be the composition of layers in $\cF_b$.

#### Step 2: Noise range assignment.

We define a noise distribution $p_{\text{noise}}$ and a noise range $[\sigma_{\min}, \sigma_{\max}]$, and partition the range into $B$ intervals $\{[\sigma_{b}, \sigma_{b-1}]\}_{b=1}^B$, so that $\sigma_0 = \sigma_{\max}$ and $\sigma_B = \sigma_{\min}$. We recommend a log-normal $p_{\text{noise}}$, following @karras2022edm, along with the partitioning strategy in Section `\ref{sec:partitioning}`{=latex}.

#### Step 3: Augmenting blocks with noise conditioning.

Finally, we adapt $\{\fblock_{\boldsymbol{\theta}_b} \}_b$ to the update rule in Eq. (`\ref{eq:euler-step}`{=latex}) by letting $\fblock_{\boldsymbol{\theta}_b}$ play the role of $D_{\boldsymbol{\theta}_b}$. Leveraging the assumption that $\fblock_{\boldsymbol{\theta}_b}$ maps a set of tokens to a set of tokens, we change the input of $\fblock_{\boldsymbol{\theta}_b}$ from $\mathbf{x}$ to $\tilde{\mathbf{x}} = (\mathbf{x}, \mathbf{z})$. Additionally, we extend each block $\fblock_{\boldsymbol{\theta}_b}$ to incorporate noise-level conditioning, for example via adaptive layer normalization (AdaLN) [@peebles2023dit]. We denote this noise-conditioned version by $\fblock_{\boldsymbol{\theta}_b \mid \sigma}$. Altogether, the update of the diffusion block constructed from $\cF$ is given by: $$\begin{aligned}
\bz_b  = \bz_{b-1} + \frac{\Delta\sigma_b}{\sigma_{b-1}} \left(\mathbf{z}_{b-1} - [\fblock_{\boldsymbol{\theta}_b \mid \sigma_{b-1}}(\textbf{x},  \bz_{b-1} )]_\bz \right),
\label{eq:new_update} \end{aligned}$$ where $[\fblock(\cdot)]_\bz$ is the set of output tokens corresponding to $\bz$ (i.e., $\fblock(\cdot) = ([\fblock(\cdot)]_{\mathbf{x}}, [\fblock(\cdot)]_\bz)$). More abstractly, the modified update rule in Eq. (`\ref{eq:new_update}`{=latex}) can be rewritten as $\bz_b = \alpha \bz_{b-1} + \beta [\fblock_{\boldsymbol{\theta}_b \mid \sigma_{b-1}}(\mathbf{x}, \bz_{b-1} )]_\bz$, where $\alpha$ and $\beta$ are constants that depend on the ratio of consecutive noise levels. Our modification of the network into a stack of diffusion blocks therefore preserves most of the original structure, particularly when the original update rule already uses skip connections, $\bz_\ell = \bz_{\ell-1} + f_{\theta_\ell}(\bz_{\ell-1})$. At inference time, $\bz_b$ serves as the intermediate estimate of the target variable, with $\bz_0 = \sigma_{\max} \boldsymbol{\epsilon}$ being pure noise. Please see Figure `\ref{fig:meta-algo-inference}`{=latex} in Appendix `\ref{app:architectures}`{=latex} for the conversion of this inference process.
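
To make the update in Eq. (`\ref{eq:new_update}`{=latex}) concrete, the following Python sketch implements one Euler step, its equivalent affine form, and a toy inference loop that chains the steps. It is an illustration under stated assumptions rather than the reference implementation: we assume the sign convention $\Delta\sigma_b = \sigma_b - \sigma_{b-1}$ (negative when moving toward lower noise), under which $\alpha = \sigma_b/\sigma_{b-1}$ and $\beta = 1 - \alpha$. The names `euler_step`, `euler_step_affine`, and `sample` are hypothetical, and each block is modeled as a plain callable `block(x, z, sigma)` that returns only the $\bz$-tokens of its output.

```python
import random


def euler_step(z, y_hat, sigma_prev, sigma_next):
    """z_b = z_{b-1} + (delta_sigma / sigma_{b-1}) * (z_{b-1} - y_hat)."""
    # Assumed convention: delta_sigma = sigma_b - sigma_{b-1} (negative while denoising).
    delta_sigma = sigma_next - sigma_prev
    return [zi + (delta_sigma / sigma_prev) * (zi - yi) for zi, yi in zip(z, y_hat)]


def euler_step_affine(z, y_hat, sigma_prev, sigma_next):
    """The same step written as alpha * z + beta * y_hat."""
    alpha = sigma_next / sigma_prev
    beta = 1.0 - alpha
    return [alpha * zi + beta * yi for zi, yi in zip(z, y_hat)]


def sample(blocks, boundaries, x, dim, rng):
    """Run one Euler step per block (the T = B case).

    boundaries = [sigma_0, ..., sigma_B] with sigma_0 = sigma_max;
    blocks[b - 1] is the denoiser assigned to [sigma_b, sigma_{b-1}].
    """
    z = [boundaries[0] * rng.gauss(0.0, 1.0) for _ in range(dim)]  # z_0: pure noise
    for b in range(1, len(boundaries)):
        sigma_prev, sigma_next = boundaries[b - 1], boundaries[b]
        y_hat = blocks[b - 1](x, z, sigma_prev)  # block b predicts the clean target
        z = euler_step(z, y_hat, sigma_prev, sigma_next)
    return z


if __name__ == "__main__":
    target = [1.0, -2.0, 0.5]
    oracle = lambda x, z, sigma: target  # toy stand-in for a perfectly trained block
    out = sample([oracle] * 3, [80.0, 1.0, 0.1, 0.002], None, 3, random.Random(0))
    print(out)  # close to target, up to residual noise scaled by sigma_B / sigma_0
```

With a perfect denoiser the coefficients telescope, so the final state equals $(\sigma_B/\sigma_0)\,\bz_0 + (1 - \sigma_B/\sigma_0)\,\mathbf{y}$. In this idealized case the contribution of the initial noise shrinks by the ratio of the smallest to the largest noise level.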

Block-independent training of the diffusion blocks
--------------------------------------------------

By the network modification recipe in the previous section, we transform the original feedforward map into the recursive denoising map of a diffusion process. The advantage of this modification is that the objective in Eq. (`\ref{eq:denoising-loss}`{=latex}) can be optimized at any noise level $\sigma$ independently, without knowledge of other noise levels. This allows us to define a training objective for each block $b$: $$\mathcal{L}_b(\boldsymbol{\theta}_b) := \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim p_{\text{data}}, \sigma \sim p_{\text{noise}}^{(b)}, \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ w(\sigma)\cdot \text{Loss}( \fblock_{\boldsymbol{\theta}_b \mid \sigma}(\mathbf{x}, \mathbf{y} + \sigma \boldsymbol{\epsilon}) , \mathbf{y})\right], \label{eq:Loss}$$ where $p_{\text{noise}}^{(b)}$ is the noise distribution $p_{\text{noise}}$ restricted to the support $[\sigma_{b}, \sigma_{b-1}]$ and renormalized, and $\text{Loss}(\cdot, \cdot)$ is the inner loss function, typically the L2 loss as in Eq. (`\ref{eq:denoising-loss}`{=latex}). Each block independently learns to denoise within its assigned range, with training samples drawn according to the original distribution $p_{\text{noise}}$ restricted to that range. Collectively, the $B$ blocks cover the entire noise range: $\bigcup_{b=1}^B [\sigma_{b}, \sigma_{b-1}] = [\sigma_{\min}, \sigma_{\max}]$, ensuring that the complete network can denoise at any noise level while each block specializes in its designated range. This independence enables training with memory requirements for only $L/B$ layers: activations are stored only for the active block, compared to all $L$ layers required by standard training. Compared with the original network, we gain this block-wise independence because $\fblock_{\boldsymbol{\theta}_b \mid \sigma}(\mathbf{x},  \mathbf{y} + \sigma \boldsymbol{\epsilon})$ is now trained to predict $\mathbf{y}$ directly for each $b$.
This way, each block can be trained without waiting for the output of the previous block. Please see Figure `\ref{fig:meta-algo-training}`{=latex} in Appendix `\ref{app:architectures}`{=latex} for the training process in the specific adaptations for different architectures. Figure `\ref{fig:diffusionblocks-algo}`{=latex} summarizes the training and inference procedures. This approach achieves a $B\times$ memory reduction during training, as gradients are computed for only one block at a time.
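
The per-block objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the helper names `sample_sigma_in_block` and `block_loss` are hypothetical, $p_{\text{noise}}$ is taken to be log-normal, the truncated distribution $p_{\text{noise}}^{(b)}$ is sampled by inverse-CDF sampling, the inner loss is a mean squared error, and $w(\sigma)$ is assumed to be the EDM-style weight $(\sigma^2 + \sigma_{\text{data}}^2)/(\sigma\,\sigma_{\text{data}})^2$ with an illustrative $\sigma_{\text{data}} = 0.5$. The gradient step is omitted; in practice an automatic-differentiation framework would update only $\boldsymbol{\theta}_b$.

```python
import math
import random
from statistics import NormalDist


def sample_sigma_in_block(sigma_lo, sigma_hi, p_mean, p_std, rng):
    """Draw sigma from the log-normal p_noise restricted to [sigma_lo, sigma_hi]."""
    log_sigma_dist = NormalDist(mu=p_mean, sigma=p_std)
    q_lo = log_sigma_dist.cdf(math.log(sigma_lo))
    q_hi = log_sigma_dist.cdf(math.log(sigma_hi))
    u = rng.uniform(q_lo, q_hi)  # inverse-CDF sampling on the truncated range
    sigma = math.exp(log_sigma_dist.inv_cdf(u))
    return min(max(sigma, sigma_lo), sigma_hi)  # guard against round-off


def block_loss(block, x, y, sigma, rng, sigma_data=0.5):
    """Single-sample estimate of w(sigma) * MSE(block(x, y + sigma * eps, sigma), y)."""
    z = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]  # noisy target
    y_hat = block(x, z, sigma)
    mse = sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)
    # Assumed EDM-style weight; the weighting used in practice may differ.
    weight = (sigma ** 2 + sigma_data ** 2) / (sigma * sigma_data) ** 2
    return weight * mse


if __name__ == "__main__":
    rng = random.Random(0)
    identity_block = lambda x, z, sigma: z  # untrained stand-in: returns its noisy input
    sigma = sample_sigma_in_block(0.1, 1.0, p_mean=-1.2, p_std=1.2, rng=rng)
    print(sigma, block_loss(identity_block, None, [1.0, -2.0, 0.5], sigma, rng))
```

Nothing in `block_loss` depends on any other block: the noisy input is built directly from $\mathbf{y}$, which is why the blocks can be trained in any order or in parallel.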

Block partitioning strategy {#sec:partitioning}
---------------------------

A critical design choice in `\method{}`{=latex} is how to partition the noise level range $[\sigma_{\min}, \sigma_{\max}]$ into $B$ intervals. A naive approach would divide the range uniformly: $\sigma_b = \sigma_{\max} - b \cdot (\sigma_{\max} - \sigma_{\min})/B$. However, this fails to account for the varying difficulty of denoising at different noise levels. Following @karras2022edm, we adopt a log-normal distribution for sampling noise levels during training: $\log \sigma \sim \mathcal{N}(P_{\text{mean}}, P_{\text{std}}^2)$. This distribution concentrates probability mass at intermediate noise levels, which empirically contribute most to generation quality.

```{=latex}
\begin{wrapfigure}{r}{0.45\textwidth}
    
    \includegraphics[width=.95\linewidth]{figures/boundaries.pdf}
\caption{{\footnotesize \textbf{Equi-probability partitioning ($B=3$).} 
Blocks partition the log-normal $p_\sigma$ by equal probability mass (orange boundaries), not uniform spacing (gray), concentrating capacity where denoising is most challenging.
}}
\label{fig:partitioning}

\end{wrapfigure}
```
To preserve this distribution across the entire network while ensuring each block handles equal denoising difficulty, we partition based on cumulative probability mass. Specifically, we choose boundaries $\{\sigma_b\}_{b=0}^B$, with $\sigma_0 = \sigma_{\max}$ and $\sigma_B = \sigma_{\min}$, such that each block handles exactly $1/B$ of the probability mass within $[\sigma_{\min}, \sigma_{\max}]$: $\int_{\sigma_{b}}^{\sigma_{b-1}} p_{\text{noise}}(\sigma) d\sigma = (q_{\max} - q_{\min})/B$. The block boundaries are computed as $\sigma_b = \exp(P_{\text{mean}} + P_{\text{std}} \cdot \Phi^{-1}(q_b))$, where $\Phi^{-1}$ is the inverse standard normal CDF and $q_b = q_{\max} - \frac{b}{B} (q_{\max} - q_{\min})$, with $q_{\min/\max} = \Phi\left(\frac{\log \sigma_{\min/\max} - P_{\text{mean}}}{P_{\text{std}}}\right)$. This *equi-probability partitioning* ensures that each block handles an equal share of the training distribution's probability mass, leading to balanced parameter utilization. As shown in Figure `\ref{fig:partitioning}`{=latex}, blocks assigned to intermediate noise levels, where denoising is most challenging, receive narrower intervals, while blocks handling very high or low noise levels receive wider intervals. This strategy is intended to balance learning across all blocks. In Section `\ref{sec:ablations}`{=latex}, we demonstrate that this strategy contributes significantly to the training of `\method`{=latex}. Also, see Appendix `\ref{app:implementation}`{=latex} for implementation details.
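
The boundary computation can be sketched with the Python standard library alone. This is a minimal example under stated assumptions: the function names are hypothetical, the boundaries are indexed in descending order ($\sigma_0 = \sigma_{\max}$, $\sigma_B = \sigma_{\min}$) to match the intervals $[\sigma_b, \sigma_{b-1}]$, and the numerical values in the usage example are the EDM defaults of @karras2022edm, used purely for illustration.

```python
import math
from statistics import NormalDist


def equi_probability_boundaries(num_blocks, sigma_min, sigma_max, p_mean, p_std):
    """Return [sigma_0, ..., sigma_B], descending from sigma_max to sigma_min."""
    std_normal = NormalDist()  # provides Phi (cdf) and Phi^{-1} (inv_cdf)
    q_min = std_normal.cdf((math.log(sigma_min) - p_mean) / p_std)
    q_max = std_normal.cdf((math.log(sigma_max) - p_mean) / p_std)
    boundaries = []
    for b in range(num_blocks + 1):
        q_b = q_max - (b / num_blocks) * (q_max - q_min)
        boundaries.append(math.exp(p_mean + p_std * std_normal.inv_cdf(q_b)))
    boundaries[0], boundaries[-1] = sigma_max, sigma_min  # pin endpoints exactly
    return boundaries


def block_mass(sigma_lo, sigma_hi, p_mean, p_std):
    """Probability that log-normal sigma falls in [sigma_lo, sigma_hi]."""
    dist = NormalDist(mu=p_mean, sigma=p_std)
    return dist.cdf(math.log(sigma_hi)) - dist.cdf(math.log(sigma_lo))


if __name__ == "__main__":
    # Illustrative values only (EDM defaults); B = 3 blocks.
    sigmas = equi_probability_boundaries(3, 0.002, 80.0, p_mean=-1.2, p_std=1.2)
    for b in range(1, len(sigmas)):
        lo, hi = sigmas[b], sigmas[b - 1]
        print(f"block {b}: [{lo:.4g}, {hi:.4g}] mass={block_mass(lo, hi, -1.2, 1.2):.4f}")
```

Each printed block carries the same probability mass, while the interval widths differ: the block covering intermediate noise levels spans a much narrower range of $\sigma$ than the two outer blocks.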

Related works {#sec:relatedworks}
=============

#### Block-wise training methods.

Various block-wise training approaches [@hinton2022forwardforward; @bengio2006greedy; @nokland2019training; @belilovsky2019greedy; @siddiqui2024blockwise] partition networks into independently trainable components but lack theoretical grounding, relying on heuristic objectives that fail to guarantee global performance when optimized locally. Approaches like Forward-Forward algorithm [@hinton2022forwardforward] rely on contrastive objectives, which fundamentally limit them to classification tasks and make adaptation to generation non-trivial. In contrast, `\method{}`{=latex} leverages denoising score matching theory, which naturally decomposes into independent local objectives without task-specific constructs, enabling application to both classification and generative tasks.

#### Comparison with NoProp.

Concurrently with our submission, @li2025noprop have also released a backpropagation-free strategy closely related to our philosophy. However, they present their technique together with a custom CNN-based architecture in one package and evaluate only on classification tasks, making it unclear how to apply their approach to modern architectures or to tasks other than classification. In contrast, DiffusionBlocks provides a systematic procedure for converting residual networks, particularly modern transformers, into block-wise trainable models with minimal modifications. We partition the continuous noise range using equi-probability partitioning and demonstrate success on both generative tasks and classification tasks. In Section `\ref{sec:exp-noprop}`{=latex}, we apply `\method{}`{=latex} to their architecture, and demonstrate that our continuous-time block-wise training with equi-probability partitioning is more effective.

#### Stage-specific diffusion models.

Several works train specialized models for different noise levels in diffusion [@balaji2023ediffi; @fang2024remixdit; @park2024switchdit; @reuss2025mode]. However, these approaches train models jointly or fine-tune from shared parameters. `\method{}`{=latex} trains blocks independently, with no shared parameters or joint fine-tuning, achieving complete isolation.

Experimental results {#sec:experiments}
====================

We evaluate `\method{}`{=latex} across diverse architectures and tasks to demonstrate its generality and effectiveness. Detailed experimental configurations are provided in Appendix `\ref{app:exp-details}`{=latex}. For each architecture, we report task performance alongside the memory reduction factor $B$, where only $L/B$ layers require gradients during training.

#### Baselines.

Because `\method{}`{=latex} is a framework for transforming networks into block-wise trainable models, we evaluate its efficacy by comparing the modified network (trained block-wise) against the original network (trained with end-to-end backpropagation). Other block-wise training methods in practice today also include Forward-Forward (FF) [@hinton2022forwardforward] and the concurrent NoProp [@li2025noprop]. Fair comparison against these methods warrants careful experimental design. Firstly, we compare against FF only on classification tasks (Section `\ref{sec:exp-vit}`{=latex}) since its contrastive objective does not naturally extend to generation. Also, because NoProp is proposed together with a custom architectural design rather than with a principled transformation procedure to be applied to a vanilla network, the adaptation of NoProp to other architectures involves nontrivial design choices and freedom. To enable fair comparison with NoProp, we therefore use their specific architecture as the base diffusion model on which to apply our `\method{}`{=latex} (Section `\ref{sec:exp-noprop}`{=latex}).

Vision transformers for image classification {#sec:exp-vit}
--------------------------------------------

We first validate DiffusionBlocks on classification tasks using Vision Transformer (ViT) [@dosovitskiy2021vit] on CIFAR-100 [@krizhevsky2009cifar10]. A 12-layer ViT is partitioned into $B$=3 blocks, with noise added to class label embeddings during training. We compare against the Forward-Forward algorithm, a representative block-wise training method that uses contrastive objectives. Table `\ref{tab:vit}`{=latex} shows that DiffusionBlocks stays close to the baseline accuracy (59.30% vs. 60.25%) while requiring gradients for only $4$ layers at a time. Notably, Forward-Forward achieves only 7.85% accuracy, highlighting the importance of principled denoising objectives over ad-hoc contrastive approaches.

Diffusion models for image generation {#sec:exp-edm}
-------------------------------------

```{=latex}
\resizebox{1\linewidth}{!}{
\footnotesize
\begin{tabular}{lr}
\toprule
\textbf{Method} & \textbf{Accuracy ($\uparrow$)} \\
\midrule
ViT & 60.25 \\
+ Forward-Forward & 7.85 \\
\rowcolor{lightgray}
\textbf{+ DiffusionBlocks} & \textbf{59.30} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\hfill
```
```{=latex}
\resizebox{1\linewidth}{!}{
\footnotesize
\begin{tabular}{llr}
\toprule
\textbf{Dataset} & \textbf{Method} & \textbf{FID ($\downarrow$)} \\
\midrule
CIFAR-10 & DiT & 32.84 / 39.83 \\
% \rowcolor{lightgray}
& \cellcolor{lightgray} + \textbf{DiffusionBlocks} & \cellcolor{lightgray} \textbf{30.59 / 37.20} \\
\midrule
ImageNet & DiT & 9.01 / 12.09 \\
% \rowcolor{lightgray}
& \cellcolor{lightgray} + \textbf{DiffusionBlocks} & \cellcolor{lightgray} \textbf{9.00 / 10.63} \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\hfill
```
```{=latex}
\resizebox{1\linewidth}{!}{
\footnotesize
\begin{tabular}{lr}
\toprule
\textbf{Method} & \textbf{BPC ($\downarrow$)}  \\
\midrule
MDM & 1.56 \\
\rowcolor{lightgray}
+ \textbf{DiffusionBlocks} & \textbf{1.45} \\
\bottomrule
\end{tabular}
}
```
Having established its effectiveness on classification tasks, we now turn to generative models. We begin with image generation, where DiffusionBlocks provides both training and inference efficiency benefits. We apply DiffusionBlocks to DiT [@peebles2023dit] within the EDM [@karras2022edm] framework. We evaluate a 12-layer DiT (DiT-S/2) on CIFAR-10 [@krizhevsky2009cifar10] and a 24-layer DiT (DiT-L/2) on ImageNet at $256\times256$ resolution [@deng2009imagenet], both with $B$=3 blocks. During inference, we use Euler sampling with 50 steps and classifier-free guidance (scale 2.0) [@ho2021classifierfree]. Table `\ref{tab:img-gen}`{=latex} shows that DiffusionBlocks achieves comparable or lower FID scores with 3$\times$ memory reduction. Additionally, inference requires only one block per denoising step, so each step evaluates $L/B$ layers rather than all $L$, and the total savings accumulate over the sampling steps.

Masked diffusion models for text generation {#sec:exp-mdm}
-------------------------------------------

We extend DiffusionBlocks to masked diffusion language models using MD4 [@shi2024md4] on the text8 dataset [@text8]. While continuous diffusion models naturally map to our framework through noise levels $\sigma$, extending DiffusionBlocks to discrete masked diffusion requires careful adaptation. Specifically, we partition the masking schedule rather than continuous noise levels, ensuring each block handles an equal share of the demasking work (details in Appendix `\ref{app:diffusion-lm}`{=latex}). We use a 12-layer DiT-based transformer [@lou2024sedd; @sahoo2024mdlm] partitioned into $B$=3 blocks. Table `\ref{tab:mdm}`{=latex} shows that DiffusionBlocks achieves 1.45 bits-per-character (BPC) compared to MD4's 1.56, while using 3$\times$ less memory. This improvement confirms that our principled noise-level partitioning effectively extends to discrete diffusion processes.

Autoregressive models for text generation {#sec:exp-ar}
-----------------------------------------

```{=latex}
\small
```
```{=latex}
\resizebox{.8\columnwidth}{!}{
    \footnotesize
    \begin{tabular}{llrrr}
    \toprule
    \textbf{Dataset} & \textbf{Method} & \textbf{MAUVE ($\uparrow$)} & \textbf{PPL {\footnotesize(\texttt{Llama-2})} ($\downarrow$)} & \textbf{PPL {\footnotesize(\texttt{GPT2-XL})} ($\downarrow$)} \\
    \midrule
    LM1B & AR & 0.50 & 14.58 & 38.87 \\
    % \rowcolor{lightgray}
    & \cellcolor{lightgray} + \textbf{DiffusionBlocks}
    &  \cellcolor{lightgray}  \textbf{0.71} &  \cellcolor{lightgray}  \textbf{12.32} &  \cellcolor{lightgray}  \textbf{30.99} \\
    \midrule
    OWT & AR & \textbf{0.85} & 15.05 & \textbf{25.24} \\
    % \rowcolor{lightgray}
    &  \cellcolor{lightgray}  + \textbf{DiffusionBlocks}
    &  \cellcolor{lightgray}  0.82 &  \cellcolor{lightgray}  \textbf{14.99} &  \cellcolor{lightgray}  26.33 \\
    \bottomrule
    \end{tabular}
    }
```
```{=latex}
\small
```
```{=latex}
\resizebox{.8\columnwidth}{!}{
    \footnotesize
    \begin{tabular}{lrrr}
    \toprule
    \textbf{Method} & \textbf{MAUVE ($\uparrow$)} & \textbf{PPL {\footnotesize(\texttt{Llama-2})} ($\downarrow$)} & \textbf{PPL {\footnotesize(\texttt{GPT2-XL})} ($\downarrow$)} \\
    \midrule
    \texttt{Huginn}~{\footnotesize \citep{geiping2025scalingtesttimecomputelatent}} & 0.49 & 17.04 & 46.73 \\
    \rowcolor{lightgray}
    + \textbf{DiffusionBlocks}
    &  \textbf{0.70} &  \textbf{16.08} &  \textbf{42.43} \\
    \bottomrule
    \end{tabular}
    }
```
```{=latex}
\resizebox{.6\columnwidth}{!}{
\begin{tabular}{lccr}
\toprule
 \textbf{Method} & \textbf{Continuous} & \textbf{Block-wise} & \textbf{Accuracy ($\uparrow$)} \\
\midrule
Backprop & & & 47.80 \\
\midrule
\texttt{NoProp-DT} & & $\checkmark$ & 46.06 \\
\texttt{NoProp-CT} & $\checkmark$ & & 21.31 \\
\texttt{NoProp-FM} & $\checkmark$ & & 37.57 \\
\rowcolor{lightgray}
 \textbf{(Ours) DiffusionBlocks} & $\checkmark$ & $\checkmark$ & \textbf{46.88} \\
\bottomrule
\end{tabular}
}
```
We demonstrate that DiffusionBlocks can also transform standard autoregressive (AR) models, architectures originally designed for next-token prediction rather than denoising. Using 12-layer Llama-2-style transformers [@touvron2023llama2openfoundation] with $B$=4 blocks, we evaluate on the 1 Billion Words dataset (LM1B) [@chelba2014lm1b] and OpenWebText (OWT) [@Gokaslan2019OpenWeb]. While AR models are typically evaluated using perplexity, computing traditional perplexity is non-trivial for our diffusion framework because our objective is not derived from an ELBO. Instead, we evaluate using MAUVE [@pillutla2021mauve] scores, following SEDD [@lou2024sedd], to measure the similarity between generated and real text. We also report generative perplexity under two teacher models, `Llama-2-7B` and `GPT2-XL` [@radford2019language], following @lou2024sedd [@sahoo2024mdlm]. Table `\ref{tab:lm}`{=latex} shows that DiffusionBlocks achieves comparable performance despite training only 3 layers at a time, demonstrating the framework's broad applicability beyond diffusion-native architectures.

Recurrent-depth models for text generation {#sec:exp-recurrent}
------------------------------------------

We now showcase a different application of `\method{}`{=latex} beyond block-wise training. As noted in Section `\ref{sec:residual}`{=latex}, the updates in recurrent-depth models naturally correspond to diffusion steps. Following Section `\ref{sec:method-core}`{=latex}, we apply `\method{}`{=latex} to `Huginn` [@geiping2025scalingtesttimecomputelatent], a recurrent-depth model that applies the same network multiple times, starting from noise. While `Huginn` uses 8-step truncated BPTT to avoid full BPTT over 32 iterations, `\method{}`{=latex} makes this optimization even more efficient, because it requires only a single forward pass per training step. Table `\ref{tab:recurrent}`{=latex} shows better performance on LM1B for text generation while eliminating the 32-iteration unrolling during training. This demonstrates that our framework enables fundamental training transformations beyond block-wise training.

Analysis {#sec:ablations}
--------

### Comparison with NoProp {#sec:exp-noprop}

We compare DiffusionBlocks with NoProp as an ablation study, applying it to their custom CNN-based architecture to isolate the effect of our continuous-time block-wise training with equi-probability partitioning. Table `\ref{tab:noprop_comparison}`{=latex} shows results on CIFAR-100 classification (details in Appendix `\ref{app:exp-noprop}`{=latex}). DiffusionBlocks outperforms all NoProp variants. Notably, while maintaining performance comparable to backpropagation, DiffusionBlocks is the only method that successfully combines a continuous-time formulation with block-wise training. This suggests that our equi-probability partitioning with independent denoisers per block is crucial for continuous-time block-wise training.

### Ablation Studies on Design Choices

We conduct ablation studies to analyze key design choices in DiffusionBlocks. All experiments follow the configurations described elsewhere in this paper and are designed to assess the effectiveness of each component.[^2]

```{=latex}
\resizebox{.95\columnwidth}{!}{
\begin{tabular}{lcr}
\toprule
\textbf{Partitioning Strategy} & \textbf{Layer Distribution} & \textbf{FID} ($\downarrow$) \\
\midrule
Uniform & [4,4,4] & 43.53 \\
Uniform & [3,6,3] & 43.59 \\
Uniform & [6,4,2] & 47.49 \\
Uniform & [2,4,6] & 42.37 \\
\rowcolor{lightgray}
Equi-Probability & [4,4,4] & \textbf{38.03} \\
Equi-Probability & [3,6,3] & 41.64 \\
Equi-Probability & [6,4,2] & 45.42 \\
Equi-Probability & [2,4,6] & 40.40 \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\hfill
```
```{=latex}
\resizebox{.95\columnwidth}{!}{
\begin{tabular}{lrrr}
\toprule
\textbf{Number of Blocks} & \textbf{FID} ($\downarrow$) & \textbf{L/B} ($\downarrow$) & \textbf{Relative Speed} \\
\midrule
$B=1$ & 12.09 & 24 & 1.0$\times$ \\
\midrule
\rowcolor{lightgray}
$B=2$ & \textbf{9.90} & 12 & 2.0$\times$ \\
\rowcolor{lightgray}
$B=3$ & \textbf{11.11} & 8 & 3.0$\times$ \\
\rowcolor{lightgray}
$B=4$ & \textbf{11.90} & 6 & 4.0$\times$ \\
$B=6$ & 14.43 & 4 & 6.0$\times$ \\
\bottomrule
\end{tabular}
}
```
#### Block partitioning strategy.

Table `\ref{tab:ablation_partitioning}`{=latex} compares our equi-probability partitioning with uniform partitioning on CIFAR-10. Equi-probability partitioning achieves lower FID for every layer distribution (e.g., 38.03 vs. 43.53 for [4,4,4]). We attribute the improvement to allocating computational resources according to denoising difficulty: equi-probability partitioning assigns more blocks to the challenging intermediate noise levels where most learning occurs, while uniform partitioning spends capacity on the easier very high and very low noise regions. Notably, within equi-probability partitioning, the uniform layer distribution [4,4,4] achieves the best FID, suggesting that practitioners can simply divide layers equally without tuning, since the noise-based partitioning already balances learning difficulty across blocks.

#### Number of blocks $B$.

Table `\ref{tab:ablation_blocks}`{=latex} summarizes the effect of varying the number of blocks on ImageNet (see Appendix `\ref{app:block_count_cifar10}`{=latex} for the results on CIFAR-10), revealing a trade-off between generation quality and efficiency. Notably, moderate block counts ($B$=2 or $B$=3) achieve better FID than end-to-end training ($B$=1), suggesting that moderate block partitioning can actually improve performance through specialization. As $B$ increases further, quality gradually declines (FID 14.43 at $B$=6), which we attribute to reduced capacity per block, though inference speed improves linearly. The optimal $B$ varies across tasks (see Appendix `\ref{app:block_count_language}`{=latex} for language modeling results).

Conclusion
==========

We introduced DiffusionBlocks, a theoretically grounded framework that transforms residual networks into independently trainable blocks through continuous-time diffusion interpretation. By recognizing that residual connections naturally implement discretized diffusion steps, we provide a systematic recipe requiring minimal modifications that maintains competitive performance across diverse architectures while achieving $B\times$ memory reduction during training.

#### Future works.

Our work opens several important directions for future research. First, while we consistently used Euler discretization to match residual connections, other diffusion samplers [@song2021ddim; @lu2023dpm++; @zhao2023unipc] could be employed within blocks with modified inter-block connections. Second, `\method{}`{=latex} currently requires matching input-output dimensions, which limits its application to architectures like U-Net [@ronneberger2015unet]. Third, while we demonstrate DiffusionBlocks' effectiveness on models trained from scratch, scaling to even larger models would further demonstrate its practical impact. Particularly, a promising direction is to convert pre-trained large models to `\method{}`{=latex} through fine-tuning rather than training from scratch. Fourth, determining the optimal granularity of block partitioning presents an interesting theoretical and practical challenge. While our experiments demonstrate that treating entire architectural blocks (e.g., complete ViT blocks) as single denoising units works well, a principled method for selecting the ideal partitioning granularity based on architecture and task characteristics could further enhance the framework's applicability. Finally, understanding why moderate block partitioning sometimes outperforms end-to-end training warrants theoretical investigation. We hypothesize two contributing factors: (1) DiffusionBlocks employs a different optimization structure in which each block is directly linked to the target through a denoising objective in Eq. (`\ref{eq:block_loss}`{=latex}), creating a learning signal that differs from standard end-to-end training; and (2) assigning different noise ranges to different blocks may induce beneficial specialization effects. Combined with equi-probability partitioning, this introduces a natural form of curriculum learning [@bengio2009curriculum] by allocating balanced difficulty across blocks.
Developing a formal theory and analysis for these effects could reveal new principles for scalable and structured neural network optimization beyond memory efficiency.

DiffusionBlocks represents a step toward democratizing large-scale model training by reducing computational requirements without sacrificing performance, making advanced AI capabilities more accessible.

```{=latex}
\ificlrfinal
```
Author contributions {#author-contributions .unnumbered}
====================

Makoto Shing conceptualized the DiffusionBlocks framework, developed its diffusion-theoretic formulation connecting residual networks and continuous-time diffusion processes, implemented the method, conducted all experiments, and wrote the manuscript. Masanori Koyama provided theoretical insights into the diffusion-based interpretation and contributed to refining both the manuscript and the conceptual positioning of the work. Takuya Akiba supervised the research and provided technical guidance and feedback throughout the project. All authors contributed to the interpretation of results and manuscript revision.

Acknowledgement {#acknowledgement .unnumbered}
===============

The authors would like to thank Stefano Peluchetti for helpful feedback on an earlier version of the draft.

```{=latex}
\else
```
#### Ethics Statement.

We acknowledge that reducing computational barriers in AI research raises both opportunities and responsibilities. While DiffusionBlocks democratizes access to large-scale model training through $B\times$ memory reduction, we recognize that improved accessibility must be balanced with responsible use. Our experiments used only publicly available datasets and models to ensure transparency and reproducibility. We note that models trained with our framework inherit any biases or limitations present in their training data and architectures. The environmental benefits of reduced computational requirements should be weighed against the potential for increased model deployment. Researchers using DiffusionBlocks should consider the broader implications of their specific applications.

#### Reproducibility statement.

We provide comprehensive details to ensure reproducibility of our results. We provide a complete implementation of DiffusionBlocks applied to Vision Transformer for image classification (Section `\ref{sec:exp-vit}`{=latex}) in the supplementary materials, demonstrating the core concepts and practical application of our framework. All experiments use publicly available datasets and open-source model architectures. Detailed experimental configurations are provided in Section `\ref{sec:experiments}`{=latex} and Appendix `\ref{app:exp-details}`{=latex}, including model architectures, hyperparameters, and training protocols. The DiffusionBlocks algorithm is fully described in Section `\ref{sec:method}`{=latex}, with mathematical foundations in Section `\ref{sec:preliminaries}`{=latex}. All baseline results from other methods are either reproduced using their official implementations or taken directly from their published papers as clearly indicated. `\fi`{=latex}

```{=latex}
\bibliographystyle{iclr2026_conference}
```
```{=latex}
\appendix
```
Notations
=========

In this section, we provide the notations that we will be using in the ensuing mathematical formulations and statements.

  **Notation**                                                 **Description**
  ------------------------------------------------------------ --------------------------------------------------------------------------------------------------------------
  $x \sim \mathcal{X}$                                         Conditioning/Input to the network (task-dependent: see below)
  $y \in \mathcal{Y}$                                          Clean target data (task-dependent: see below)
  $\sigma \in \mathbb{R}$                                      Noise level in continuous diffusion.
  $z_\sigma \in \mathbb{R}^d$                                  Noisy data at noise level $\sigma$: $z_\sigma = y + \sigma\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
  $z_\ell \in \mathbb{R}^d$                                    Intermediate activation at layer/block $\ell$
  $D_\theta: \mathbb{R}^d \times \mathbb{R} \to \mathcal{Y}$   Denoiser network with parameters $\theta$
  $f_{\theta_\ell}: \mathbb{R}^d \to \mathbb{R}^d$             Layer/block transformation with parameters $\theta_\ell$
  $B$                                                          Number of blocks
  $L$                                                          Total number of layers
  **Examples of $(x,y)$ on a task:**                           
  Image classification                                         $x$: input image, $y$: class label
  Image generation                                             $x$: noisy image (and optionally a class label), $y$: clean image
  Text Generation (AR)                                         $x$: previous tokens, $y$: next token
  Text Generation (masked diffusion)                           $x$: sequence with mask tokens, $y$: unmasked sequence

Extension to diverse architectures {#app:architectures}
==================================

While we have described DiffusionBlocks for standard residual networks where inputs and outputs naturally live in the same $d$-dimensional space, the framework extends to specialized architectures. Figures `\ref{fig:meta-algo-training}`{=latex} and `\ref{fig:meta-algo-inference}`{=latex} illustrate how different model types can be converted to DiffusionBlocks for training and inference, respectively.

![**Converting different architectures to DiffusionBlocks: Training.** During training, noise is added to target outputs (labels, embeddings, or images) and each block learns to denoise within its assigned noise range. Blocks are sampled randomly and trained independently, requiring gradients for only one block at a time.](figures/meta-algo-training.png){#fig:meta-algo-training width=".8\\textwidth"}

![**Converting different architectures to DiffusionBlocks: Inference.** During inference, blocks are applied sequentially from $\sigma_{\text{max}}$ to $\sigma_{\text{min}}$. The figure shows the first denoising step where block $b=1$ transforms pure noise $\bz_0$ into the next state $\bz_1$. Only the relevant block is active at each noise level, providing memory efficiency. $\oplus$ denotes the Euler step in Eq. (`\ref{eq:new_update}`{=latex}).](figures/meta-algo-inference.png){#fig:meta-algo-inference width=".8\\textwidth"}

For Vision Transformers (ViT) [@dosovitskiy2021vit] in classification tasks (top left), we adapt DiffusionBlocks by adding noise to the class label embeddings while maintaining the standard ViT architecture. Specifically, we create the input sequence by concatenating the `[CLS]` token, patch embeddings $\mathbf{x}$, and the noisy label embedding $\mathbf{z}_\sigma$, where $\mathbf{z}_\sigma = \mathbf{y}_{\text{emb}} + \sigma\boldsymbol{\epsilon}$ and $\mathbf{y}_{\text{emb}} \in \mathbb{R}^d$ is the learnable continuous embedding of the class label $y$. Each block $b$ learns to denoise this label representation conditioned on the patch embeddings $\mathbf{x}$. The training loss is the standard cross-entropy between the classification head's output logits (applied to the `[CLS]` token) and the true class labels, following the conventional ViT training procedure.

For diffusion models (top right), DiffusionBlocks provides a natural fit: these models already operate by denoising, so partitioning simply assigns different noise ranges to different blocks without architectural modifications. The standard denoiser $D_{\boldsymbol{\theta}}(\mathbf{z}_\sigma, \sigma)$ becomes $D_{\boldsymbol{\theta}_b}(\mathbf{z}_\sigma, \sigma)$ for block $b$.

For discrete output spaces like language modeling (bottom left), we operate in the embedding space following prior works [@dieleman2022cdcd; @li2022diffusionlm; @gulrajani2023plaid; @lovelace2023latent]. Noise is added after the embedding layer: given input tokens $\mathbf{x}$, we compute $\mathbf{z} = f_{\text{in}}(\mathbf{x})$, then add noise $\mathbf{z}_\sigma = \mathbf{z} + \sigma\boldsymbol{\epsilon}$. For autoregressive models, the denoiser $D_{\boldsymbol{\theta}_b}(\mathbf{z}_{i,\sigma}, \mathbf{z}_{<i}, \sigma)$ recovers the clean embedding of token $i$ from its noisy version, conditioned on previous clean token embeddings $\mathbf{z}_{<i}$. We minimize cross-entropy loss instead of L2 loss.

For recurrent-depth architectures that apply the same network $K$ times (bottom right), we interpret the entire recurrence as a diffusion process. Instead of training with $K$ forward passes through recurrent iterations, we train the network as a denoiser $D_\theta(\bz_\sigma, \mathbf{x}, \sigma)$ by sampling $\sigma \sim p_\sigma$ and performing a single forward pass to map noisy input to clean output, reducing computational cost by a factor of $K$ while maintaining the original $K$-iteration inference procedure.

Beyond these adaptations, DiffusionBlocks also applies to diffusion language models [@austin2021structured; @lou2024sedd; @sahoo2024mdlm; @shi2024md4], where the framework provides additional benefits for text generation. We provide a detailed treatment of this application in Appendix `\ref{app:diffusion-lm}`{=latex}. These diverse applications demonstrate that DiffusionBlocks provides a general recipe for transforming various architectures into memory-efficient, independently trainable components.

Implementation details in DiffusionBlocks {#app:implementation}
=========================================

We introduce several practical considerations for effective training and inference.

#### Overlap between blocks.

To smooth transitions across block boundaries, we slightly extend each block's noise interval in log-$\sigma$ space. For a block $b$ responsible for $[\sigma_b, \sigma_{b-1}]$ with $\sigma_{b-1}>\sigma_b$, we define $\alpha_b := \left(\sigma_{b-1}/\sigma_b\right)^{\gamma}$, where $\gamma \ge 0$, and train over the expanded range $\left[\sigma_b/\alpha_b, \alpha_b \sigma_{b-1}\right]$. Here $\gamma$ controls the degree of overlap: $\gamma$=0 recovers non-overlapping intervals, while $\gamma>0$ yields smoother transitions between blocks. In practice, we found $\gamma\in[0.0, 0.1]$ effective, and we use $0.05$ by default and $0.1$ for text generation.
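
The expanded range follows directly from the block boundaries. The snippet below is a minimal sketch of that computation; the function name `expanded_interval` and the example boundaries are illustrative assumptions and are not taken from the released code.

```python
def expanded_interval(sigma_lo, sigma_hi, gamma=0.05):
    """Return the training range [sigma_lo / alpha, alpha * sigma_hi] of one block.

    sigma_lo, sigma_hi: block boundaries (sigma_b, sigma_{b-1}), sigma_hi > sigma_lo > 0.
    gamma: overlap ratio; gamma = 0 leaves the interval unchanged.
    """
    if not (sigma_hi > sigma_lo > 0):
        raise ValueError("expected sigma_hi > sigma_lo > 0")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    alpha = (sigma_hi / sigma_lo) ** gamma
    return sigma_lo / alpha, alpha * sigma_hi


# Hypothetical block responsible for [0.5, 2.0] with the default overlap.
lo, hi = expanded_interval(0.5, 2.0, gamma=0.05)
```

Because $\alpha_b$ is a power of the ratio $\sigma_{b-1}/\sigma_b$, the interval widens by the same amount on both sides in log-$\sigma$ space, and its log-width grows by a factor of $1+2\gamma$.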

#### Weighting and preconditioning.

Following the EDM framework [@karras2022edm], we use the weighting function $w(\sigma) = (\sigma^2 + \sigma_{\text{data}}^2)/(\sigma \cdot \sigma_{\text{data}})^2$, where $\sigma_{\text{data}} = 0.5$ for all experiments. The weighting is crucial for equi-probability partitioning to work effectively, as it counteracts the sampling bias introduced by the log-normal distribution $p_\sigma$. We also adopt EDM's preconditioning scheme, which involves input scaling to ensure stable training dynamics across all noise levels. See @karras2022edm for more details.
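
For concreteness, the weighting and the preconditioning coefficients can be written as below. This is a sketch under the assumption that the coefficients are exactly those published by @karras2022edm, where the denoiser is $D(\mathbf{z}; \sigma) = c_{\text{skip}}\,\mathbf{z} + c_{\text{out}}\,F(c_{\text{in}}\,\mathbf{z}; c_{\text{noise}})$ for a raw network $F$; the function names are hypothetical.

```python
import math

SIGMA_DATA = 0.5  # value stated in the text


def edm_weight(sigma, sigma_data=SIGMA_DATA):
    """Loss weight w(sigma) = (sigma^2 + sigma_data^2) / (sigma * sigma_data)^2."""
    return (sigma**2 + sigma_data**2) / (sigma * sigma_data) ** 2


def edm_preconditioning(sigma, sigma_data=SIGMA_DATA):
    """Return (c_skip, c_out, c_in, c_noise) as defined in the EDM paper."""
    total = sigma**2 + sigma_data**2
    c_skip = sigma_data**2 / total
    c_out = sigma * sigma_data / math.sqrt(total)
    c_in = 1.0 / math.sqrt(total)
    c_noise = math.log(sigma) / 4.0
    return c_skip, c_out, c_in, c_noise
```

With these definitions $w(\sigma) = 1/c_{\text{out}}^2$, so the weighted loss on the raw network output has unit scale at every noise level.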

#### Normalizing embeddings.

For tasks where the target variables are discrete (e.g. class labels in image classification or token ids in text generation), `\method{}`{=latex} operates the diffusion process in the continuous embedding space (see Appendix `\ref{app:architectures}`{=latex}). A known issue in continuous relaxation of discrete variables is *embedding collapse*, where all learned embeddings correspond to the same vector [@dieleman2022cdcd]. To prevent this, we follow the regularization strategy introduced by @dieleman2022cdcd and apply L2 normalization to the embeddings.

#### Training and inference details.

For training efficiency, blocks are randomly sampled per iteration, requiring memory for only $L/B$ layers. Blocks can alternatively be trained in parallel across multiple GPUs when available. During inference, we generate samples by sequentially applying blocks from $\sigma_{\max}$ to $\sigma_{\min}$. While we use Euler steps in our experiments due to the natural correspondence between residual connections and Euler discretization (Section `\ref{sec:residual}`{=latex}), our framework is not limited to this choice. By modifying the inter-block connections to match the discretization scheme of another solver, any diffusion sampling method [@song2021ddim; @lu2023dpm++; @zhao2023unipc] can be employed. We leave this exploration for future work.
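
The inference procedure can be sketched as a loop that, at each noise level, evaluates only the block responsible for that level. The sketch assumes the EDM form of the Euler step, $\mathbf{z} \leftarrow \mathbf{z} + (\sigma_{\text{next}} - \sigma)\,(\mathbf{z} - D(\mathbf{z}, \sigma))/\sigma$; the helper names, block boundaries, and sampling schedule are hypothetical, and the toy denoisers stand in for trained blocks.

```python
def select_block(sigma, boundaries):
    """Return the 0-based index of the block responsible for noise level sigma.

    boundaries is decreasing: boundaries[0] = sigma_max, boundaries[-1] = sigma_min,
    and block b covers [boundaries[b + 1], boundaries[b]].
    """
    for b in range(len(boundaries) - 1):
        if sigma >= boundaries[b + 1]:
            return b
    return len(boundaries) - 2  # below sigma_min: fall back to the last block


def euler_sample(z, sigmas, denoisers, boundaries):
    """Integrate from sigmas[0] down to sigmas[-1], one Euler step per pair."""
    used = []
    for s, s_next in zip(sigmas[:-1], sigmas[1:]):
        b = select_block(s, boundaries)
        used.append(b)
        denoised = denoisers[b](z, s)  # only this block is evaluated
        slope = [(zi - di) / s for zi, di in zip(z, denoised)]
        z = [zi + (s_next - s) * gi for zi, gi in zip(z, slope)]
    return z, used


# Toy check: every "block" is a perfect denoiser for a single target point.
target = [1.0, -2.0]
denoisers = [lambda z, s: target] * 3
boundaries = [80.0, 2.0, 0.2, 0.002]  # hypothetical block boundaries
sigmas = [80.0, 10.0, 1.0, 0.1, 0.0]  # hypothetical sampling schedule
sample, used = euler_sample([40.0, -55.0], sigmas, denoisers, boundaries)
```

In this toy setting the sample contracts toward `target` by the ratio $\sigma_{\text{next}}/\sigma$ at each step, and `used` records which block handled each step.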

Masked diffusion language models as DiffusionBlocks {#app:diffusion-lm}
===================================================

Continuous-time formulation
---------------------------

We first recall the continuous-time formulation of masked diffusion language models [@sahoo2024mdlm; @shi2024md4]. Let $\mathbf{x}_0 = (x_{01},\dots,x_{0n})$ denote a sequence of tokens and let $\alpha(t) : [0,1] \rightarrow [0,1]$ denote the masking schedule at continuous time $t \in [0,1]$, a monotonically decreasing function with $\alpha(0)=1$ and $\alpha(1)=0$, where $\alpha(t)$ represents the probability of remaining unmasked. The forward process progressively masks tokens as: $$q(\mathbf{x}_t \mid \mathbf{x}_0) = \prod_{i=1}^n q(x_{ti} \mid x_{0i}) \quad\text{where}\quad 
x_{ti} = 
\begin{cases}
x_{0i}, & \text{with prob. } \alpha(t),\\
\texttt{[MASK]}, & \text{with prob. } 1-\alpha(t).
\end{cases}$$

The training objective in continuous form is: $$\label{eq:mdm_loss}
\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}_0}
\int_0^1 \frac{-\alpha'(t)}{1-\alpha(t)}
\mathbb{E}_{\mathbf{x}_t \sim q(\mathbf{x}_t\mid \mathbf{x}_0)} 
\left[
\sum_{i : x_{ti}=\texttt{[MASK]}}
\text{CE}\!\left(f_\theta(\mathbf{x}_t, t)_i, x_{0i}\right)
\right] dt,$$ where $\alpha'(t) = d\alpha/dt < 0$ and CE denotes cross-entropy loss. This form is equivalent to the continuous-time NELBO [@shi2024md4; @sahoo2024mdlm], but expressed with a nonnegative weight multiplying $\text{CE}$, which avoids sign ambiguity.

Partitioning into DiffusionBlocks
---------------------------------

To enable block-wise training, we partition the objective in Eq. (`\ref{eq:mdm_loss}`{=latex}) into $B$ disjoint intervals in $t$. The expected number of masked positions at time $t$ is $n(1-\alpha(t))$, so, up to the constant factor $n$, the effective density of contributions is $$\frac{-\alpha'(t)}{1-\alpha(t)} \cdot (1-\alpha(t)) \;=\; -\alpha'(t).$$ Hence, the contribution of interval $[t_a,t_b]$ is $$\int_{t_a}^{t_b} -\alpha'(t)\,dt = \alpha(t_a)-\alpha(t_b).$$ This shows that the training mass is distributed uniformly in $\alpha$, not in $t$.

Therefore, the natural partition boundaries are defined by equal decrements of $\alpha$: $$\alpha_b = 1 - \tfrac{b}{B}, \quad b=0,\dots,B,$$ with corresponding time boundaries obtained by inversion: $$t_b = \alpha^{-1}\!\left(1 - \tfrac{b}{B}\right).$$ For a linear schedule $\alpha(t)=1-t$, this simply yields $t_b=b/B$. Each block $b$ is then trained independently on its assigned interval: $$\label{eq:block_loss}
\mathcal{L}_b(\boldsymbol{\theta}_b) =
\mathbb{E}_{\mathbf{x}_0}
\int_{t_{b-1}}^{t_b}
\frac{-\alpha'(t)}{1-\alpha(t)}
\mathbb{E}_{\mathbf{x}_t\sim q(\mathbf{x}_t\mid \mathbf{x}_0)}
\left[\sum_{i:\,x_{ti}=\texttt{[MASK]}}
\text{CE} \left(D_{\boldsymbol{\theta}_b}(\mathbf{x}_t,t)_i,\,x_{0i}\right)\right] dt,$$ where $D_{\boldsymbol{\theta}_b}$ denotes the denoiser assigned to block $b$. The global loss decomposes as $\mathcal{L}=\sum_{b=1}^B \mathcal{L}_b$.
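
The time boundaries $t_b$ can be computed for any invertible schedule. The following sketch evaluates them for the linear schedule from the text and for a cosine schedule that is included purely as a hypothetical example; `time_boundaries` and `cosine_alpha` are illustrative names.

```python
import math


def time_boundaries(num_blocks, alpha_inverse):
    """Return t_0, ..., t_B such that alpha(t_b) = 1 - b / B."""
    return [alpha_inverse(1.0 - b / num_blocks) for b in range(num_blocks + 1)]


# Linear schedule alpha(t) = 1 - t, whose inverse is t = 1 - a.
linear_t = time_boundaries(3, lambda a: 1.0 - a)


def cosine_alpha(t):
    """Hypothetical schedule alpha(t) = cos(pi * t / 2), with alpha(0)=1, alpha(1)=0."""
    return math.cos(math.pi * t / 2.0)


# Its inverse is t = (2 / pi) * acos(a).
cosine_t = time_boundaries(3, lambda a: 2.0 / math.pi * math.acos(a))

# Each block carries the same mass alpha(t_{b-1}) - alpha(t_b) = 1 / B.
masses = [cosine_alpha(a) - cosine_alpha(b) for a, b in zip(cosine_t[:-1], cosine_t[1:])]
```

For the linear schedule the boundaries are equally spaced in $t$; for the cosine schedule they are not, but every block still receives mass $1/B$.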

This derivation shows that DiffusionBlocks in masked diffusion models amounts to partitioning the masking schedule $\alpha(t)$ rather than time. Each block is responsible for an equal decrement in $\alpha(t)$, i.e. an equal share of the total "demasking work", which ensures balanced parameter utilization and true independence across blocks. This construction is directly analogous to the equi-probability partitioning in continuous diffusion models described in Section `\ref{sec:partitioning}`{=latex}.

Experimental details {#app:exp-details}
====================

Unless otherwise specified, all experiments use the following settings. For DiffusionBlocks, we adopt the EDM framework [@karras2022edm] with default parameters: log-normal noise distribution with $P_{\text{mean}} = -1.2$ and $P_{\text{std}} = 1.2$, noise range $[\sigma_{\min}, \sigma_{\max}] = [0.002, 80]$, and preconditioning following the recommended configuration. Inference uses Euler sampling with 50 steps unless stated otherwise. During training, blocks are sampled uniformly at random for each iteration.
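
With these defaults, block boundaries for the continuous case can be obtained from quantiles of the noise distribution. The sketch below assumes that equi-probability partitioning assigns each block an equal share of the probability mass of the log-normal $p_\sigma$ restricted to $[\sigma_{\min}, \sigma_{\max}]$; the function names are hypothetical and the code is not taken from the released implementation.

```python
import math
from statistics import NormalDist


def equiprob_boundaries(num_blocks, p_mean=-1.2, p_std=1.2,
                        sigma_min=0.002, sigma_max=80.0):
    """Boundaries sigma_0 > ... > sigma_B that split the truncated log-normal
    distribution of sigma into num_blocks intervals of equal probability."""
    log_sigma = NormalDist(mu=p_mean, sigma=p_std)  # ln(sigma) is Gaussian
    cdf_lo = log_sigma.cdf(math.log(sigma_min))
    cdf_hi = log_sigma.cdf(math.log(sigma_max))
    boundaries = [sigma_max]
    for b in range(1, num_blocks):
        q = cdf_hi - (cdf_hi - cdf_lo) * b / num_blocks
        boundaries.append(math.exp(log_sigma.inv_cdf(q)))
    boundaries.append(sigma_min)
    return boundaries


def block_mass(sigma_lo, sigma_hi, p_mean=-1.2, p_std=1.2):
    """Probability that sigma falls in [sigma_lo, sigma_hi] under the log-normal."""
    log_sigma = NormalDist(mu=p_mean, sigma=p_std)
    return log_sigma.cdf(math.log(sigma_hi)) - log_sigma.cdf(math.log(sigma_lo))


bounds = equiprob_boundaries(3)
masses = [block_mass(lo, hi) for hi, lo in zip(bounds[:-1], bounds[1:])]
```

The interior boundaries cluster around $e^{P_{\text{mean}}}$, where most training samples fall, rather than being evenly spaced in log-$\sigma$.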

Vision transformers for image classification {#app:vit}
--------------------------------------------

For image classification experiments in Section `\ref{sec:exp-vit}`{=latex}, we use a 12-layer ViT with patch size 4, 128 hidden dimensions, 4 attention heads, and 0.1 dropout, partitioned into $B=3$ blocks (4 layers each). We train for 500 epochs with batch size 128 and AdamW optimizer with learning rate $5\times10^{-4}$. We employ a cosine learning rate scheduler with a 10-epoch linear warmup. As data augmentation, we apply random horizontal flipping ($p=0.5$) and *RandAugment* [@cubuk2020randaug].

Figure `\ref{fig:meta-algo-training}`{=latex} (top left) illustrates the DiffusionBlocks adaptation for ViT. We add noise to the class label embeddings and concatenate them with the patch embeddings. Each block learns to denoise the label embedding conditioned on the patch embeddings. We use an overlap ratio $\gamma = 0.05$ and perform 4 denoising steps during inference (matching $L/B = 12/3 = 4$). The classification head is applied after the final denoising step to produce class predictions. We minimize cross-entropy loss between predicted and true class labels during training. For the Forward-Forward baseline, we adapt the Contrastive Forward-Forward (FF) [@aghagolzadeh2025contrastiveff] implementation to ViT[^3].

Diffusion models for image generation {#app:exp-edm}
-------------------------------------

For image generation experiments in Section `\ref{sec:exp-edm}`{=latex}, we use DiT-S/2 (12 layers) for CIFAR-10 and DiT-L/2 (24 layers) for ImageNet-256. Both models are partitioned into $B=3$ blocks. Training follows the EDM framework with classifier-free guidance [@ho2021classifierfree] (10% label dropout). For CIFAR-10, we train for 100 epochs with batch size 512 and AdamW optimizer with learning rate $10^{-4}$. For ImageNet, we resize images to 256$\times$256 and encode them with a pre-trained VAE [@peebles2023dit][^4]. We again train for 100 epochs with batch size 512 and AdamW optimizer with learning rate $5 \times 10^{-5}$. The overlap ratio is set to $\gamma=0.05$.

In evaluation, we apply Euler sampling with 50 steps and classifier-free guidance (scale 2.0) on both CIFAR-10 and ImageNet experiments. FID is computed using 50,000 generated samples against the training and test sets, with the minimum of three evaluations reported following @karras2022edm. For the training set, we use the official ADM [@dhariwal2021adm] evaluation suite, which computes FID against the entire training set as the reference distribution. For the test split, we compute FID using `clean-fid` [@parmar2022cleanfid].

Masked diffusion models for text generation {#app:exp-mdm}
-------------------------------------------

In Section `\ref{sec:exp-mdm}`{=latex}, we follow MD4's training protocol with 256 sequence length, AdamW optimizer with learning rate $3 \times 10^{-4}$, weight decay 0.03, and 2,000 linear warmup steps. Training runs for 100 epochs with batch size 256. The 12-layer DiT-based transformer [@lou2024sedd; @sahoo2024mdlm] uses 768 hidden dimensions and 12 attention heads, partitioned into $B=3$ blocks with overlap ratio $\gamma=0.05$. The masking schedule follows MD4's linear schedule. For block partitioning in discrete diffusion, we apply equi-probability partitioning to the masking ratio distribution rather than to continuous noise levels, as described in Appendix `\ref{app:diffusion-lm}`{=latex}. Bits-per-character (BPC) is evaluated on the text8 test set following @shi2024md4.

Autoregressive models for text generation {#app:exp-ar}
-----------------------------------------

In Section `\ref{sec:exp-ar}`{=latex}, we use a 12-layer Llama-2-style transformer [@touvron2023llama2openfoundation] augmented with time conditioning as in DiT [@peebles2023dit] with 768 hidden dimensions, 12 attention heads, and the Llama-2 tokenizer with 32K vocabulary size. The model is partitioned into $B=4$ blocks with an overlap ratio $\gamma=0.1$. Training runs for 10 epochs with sequence length 256 for LM1B and 3072 for OWT, batch size 256, AdamW with learning rate $3 \times 10^{-4}$, and 2,500 warmup steps.

Since DiffusionBlocks is not derived from ELBO-based objectives, computing traditional perplexity is non-trivial. Instead, we evaluate using MAUVE scores following SEDD [@lou2024sedd], which measures the similarity between generated and real text distributions. For each test sample, we generate 5 continuations of 50 tokens from 1K prompts and compute MAUVE against 1K reference samples with the scaling factor 0.2. Additionally, we report generative perplexity, commonly used in diffusion language models [@lou2024sedd; @sahoo2024mdlm], by computing the perplexity of generated text using teacher models (`Llama-2-7B`[^5] and `GPT2-XL` [@radford2019language][^6]). For generations, we use top-p sampling (0.95) for the baseline and 4 diffusion steps with greedy sampling for DiffusionBlocks. The OWT test set is created by splitting 10% of the data since no official test set exists.

Applying DiffusionBlocks to autoregressive models requires maintaining causal consistency during training. When denoising future tokens, the model must condition on clean past tokens rather than noisy ones to preserve the autoregressive property. Following Block Diffusion [@arriola2025block], we implement this using sequence concatenation: noisy and clean sequences are concatenated with a modified causal attention mask that allows noisy tokens to attend to their corresponding clean past tokens while preventing information leakage. This approach doubles sequence memory but maintains single forward pass efficiency. An alternative implementation computes key-value pairs separately for clean and noisy sequences, combining them during attention computation. This requires two forward passes but uses standard sequence memory. We adopt the concatenation approach for computational efficiency.
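
The attention pattern of the concatenation approach can be illustrated with a small boolean mask. The sketch below makes two assumptions that are not specified above: the clean sequence is placed before the noisy one, and masking is applied per token (Block Diffusion itself works at the level of token blocks). The function name `concat_causal_mask` is hypothetical.

```python
def concat_causal_mask(n):
    """Boolean mask for the concatenation [clean_0..clean_{n-1}, noisy_0..noisy_{n-1}].

    mask[q][k] is True when query position q may attend to key position k.
    """
    size = 2 * n
    mask = [[False] * size for _ in range(size)]
    for i in range(n):
        # Clean token i: ordinary causal attention over clean tokens 0..i.
        for j in range(i + 1):
            mask[i][j] = True
        # Noisy token i: clean tokens strictly before i, plus itself.
        for j in range(i):
            mask[n + i][j] = True
        mask[n + i][n + i] = True
    return mask


mask = concat_causal_mask(3)
```

Under this layout no clean query attends to a noisy key, and noisy token $i$ never sees the clean embedding of token $i$ or of any later token, which is the information leakage that the modified mask is meant to prevent.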

Recurrent-depth models {#app:recurrent}
----------------------

For `Huginn` [@geiping2025scalingtesttimecomputelatent] described in Section `\ref{sec:exp-recurrent}`{=latex}, we use the default configuration: 2 prelude layers, 4-layer recurrent block, and 2 coda layers following `Pythia-70M` [@biderman2023pythiasuiteanalyzinglarge][^7] architecture with 512 hidden dimensions and 8 attention heads. Unlike other architectures, recurrent-depth models do not require block partitioning since the entire network is applied recurrently. Instead, we train the full network as a denoiser by sampling different noise levels $\sigma$ at each training step. While baseline `Huginn` uses stochastic recurrence depth (average 32 iterations) with truncated BPTT (8 steps), DiffusionBlocks trains with single-pass diffusion. We train on LM1B for 15 epochs compared to `Huginn`'s 5 epochs. Despite this, our approach uses approximately 10$\times$ less total computation since we avoid the 32$\times$ recurrent iterations during training.

Ablation studies {#app:ablation}
----------------

### Comparison with NoProp {#app:exp-noprop}

We follow the experimental protocol of NoProp [@li2025noprop]. In the absence of publicly available code, we implemented their `NoProp-DT` architecture augmented with time conditioning from `NoProp-CT`, following their specifications (Figure 5 in their paper). Training follows `NoProp-CT`'s hyperparameters with AdamW optimizer, learning rate $10^{-4}$, batch size 128, and 1000 epochs on CIFAR-100. For DiffusionBlocks, we use $B$=3 blocks with overlap ratio $\gamma = 0.1$. Following `NoProp-CT`'s evaluation protocol, we use 1000 Euler sampling steps instead of our default 50.

We attempted to adapt the Forward-Forward (FF) algorithm [@hinton2022forwardforward] as an additional baseline to NoProp's architecture for Table `\ref{tab:noprop_comparison}`{=latex}. However, without publicly available code and with no specified adaptation procedure, the implementation requires numerous design decisions. Our attempts achieved only 1% accuracy, highlighting the fundamental incompatibility: NoProp's architecture is specifically designed for their method (type (e) in their Figure 2), while FF requires contrastive positive/negative samples (type (d)). Successfully bridging these paradigms may require innovations beyond straightforward adaptation. This highlights a key distinction between approaches. NoProp does not provide guidance for adapting to other methods or architectures. DiffusionBlocks instead offers a systematic procedure for converting existing Transformer-based networks into block-wise trainable models. This recipe enabled successful application to modern architectures with minimal modifications, demonstrating the generality of our framework.

### Design choice analysis

All ablation studies follow the configurations described in Appendix `\ref{app:exp-edm}`{=latex}. We report FID scores on the test splits. For partitioning experiments, we test both uniform partitioning (equal intervals in log-space) and our equi-probability method. Layer distribution indicates the number of layers in each block. For block count experiments, we vary $B$ from 2 to 6 while keeping total layers fixed at 12. We disable the block overlap described in Appendix `\ref{app:implementation}`{=latex} (i.e., $\gamma=0.0$) to isolate the effectiveness of each component.

Additional experiments
======================

Image Classification Experiment on Tiny ImageNet
------------------------------------------------

```{=latex}
\footnotesize
```
::: {#tab:vit-tinyimagenet}
+-----------------------+---------------------------+
| **Method**            | **Accuracy ($\uparrow$)** |
+:======================+:=========================:+
| ViT                   | 35.32                     |
+-----------------------+---------------------------+
| ```{=latex}           | **36.16**                 |
| \rowcolor{lightgray}  |                           |
| ```                   |                           |
| **+ DiffusionBlocks** |                           |
+-----------------------+---------------------------+

: **ViT results on Tiny ImageNet.** DiffusionBlocks shows consistent performance on an intermediate-scale classification dataset.
:::

To further evaluate the effectiveness of DiffusionBlocks on classification tasks beyond CIFAR-100, we conducted an additional experiment on the Tiny ImageNet dataset [@le2015tinyimagenet]. This dataset consists of 200 classes and 100,000 training images, each resized to 64$\times$64 resolution. Tiny ImageNet provides a more challenging and higher-resolution benchmark than CIFAR-100.

We trained a 12-layer Vision Transformer (ViT) with patch size 4, hidden size 768, and 12 attention heads. Both the baseline ViT and DiffusionBlocks models were trained for 100 epochs using a batch size of 256 and the AdamW optimizer with a learning rate of $10^{-4}$. For DiffusionBlocks, we used $B=2$ blocks (each containing 6 layers).

Table `\ref{tab:vit-tinyimagenet}`{=latex} demonstrates that DiffusionBlocks maintains competitive performance relative to the baseline ViT, consistent with our findings on CIFAR-100 as well as our large-scale classification experiments in language modeling (LM1B and OpenWebText in Tables `\ref{tab:mdm}`{=latex}, `\ref{tab:lm}`{=latex}, and `\ref{tab:recurrent}`{=latex}). These results further indicate that DiffusionBlocks remains effective as a classifier across different data modalities, resolutions, and dataset scales.

Effect of block count on CIFAR-10 {#app:block_count_cifar10}
---------------------------------

To examine whether the design trends observed on ImageNet in Table `\ref{tab:ablation_blocks}`{=latex} generalize to different datasets, we additionally evaluate the effect of the number of blocks on CIFAR-10. This experiment allows us to assess whether the behavior of DiffusionBlocks remains consistent across datasets of different scales and complexities. We use the same DiT-S/2 architecture described in Section `\ref{sec:exp-edm}`{=latex}, training under the EDM framework while varying the number of blocks $B \in \{1,2,3,4,6\}$ and disabling block overlap ($\gamma=0.0$) to isolate the effect of the number of blocks $B$.

```{=latex}
\resizebox{1\linewidth}{!}{
\begin{tabular}{lrrr}
\toprule
Number of Blocks & FID ($\downarrow$) & Layers per Block ($\downarrow$) & Relative Speed \\
\midrule
$B=1$ & 39.83 & 12 & 1.0$\times$ \\
\midrule
\rowcolor{lightgray}
$B=2$ & \textbf{35.47} & 6 & 2.0$\times$ \\
\rowcolor{lightgray}
$B=3$ & \textbf{38.03} & 4 & 3.0$\times$ \\
$B=4$ & 45.43 & 3 & 4.0$\times$ \\
$B=6$ & 53.32 & 2 & 6.0$\times$ \\
\bottomrule
\end{tabular}
}
```
```{=latex}
\hfill
```
```{=latex}
\resizebox{1\linewidth}{!}{
\begin{tabular}{lrrr}
\toprule
Number of Blocks & MAUVE ($\uparrow$) & Layers per Block ($\downarrow$) & Relative Speed \\
\midrule
$B=2$ & 0.61 & 6 & 2.0$\times$ \\
$B=3$ & 0.65 & 4 & 3.0$\times$ \\
\rowcolor{lightgray}
$B=4$ & \textbf{0.67} & 3 & 4.0$\times$ \\
$B=6$ & 0.62 & 2 & 6.0$\times$ \\
\bottomrule
\end{tabular}
}
```
As shown in Table `\ref{tab:ablation_blocks_cifar10}`{=latex}, smaller block counts tend to achieve better FID scores, and $B=2$ or $B=3$ provides strong performance. This trend matches the observations in Table `\ref{tab:ablation_blocks}`{=latex}. These results indicate that the effectiveness of using a moderate number of blocks is consistent across datasets of varying scale, supporting the validity of the design choices analyzed in Section `\ref{sec:ablations}`{=latex}.

Effect of block count on text generation {#app:block_count_language}
----------------------------------------

Table `\ref{tab:ablation_blocks_lm}`{=latex} shows the effect of varying the number of blocks for autoregressive language modeling on LM1B with overlap ratio $\gamma=0.0$.

The optimal number of blocks differs between tasks: image generation achieves best FID with $B$=2 or $B$=3 (Table `\ref{tab:ablation_blocks}`{=latex}), while language modeling achieves best MAUVE with $B$=4. This motivated our choice of $B$=4 for language modeling experiments in the main paper.

Comparison with Activation Checkpointing
========================================

DiffusionBlocks and activation checkpointing (also known as activation recomputation, gradient checkpointing, or rematerialization) offer fundamentally different trade-offs and can be combined.

The key distinction lies in what each method reduces. Activation checkpointing reduces only activation memory, leaving parameters, gradients, and optimizer states unchanged. In contrast, DiffusionBlocks reduces all memory components by a factor of $B$. This distinction becomes increasingly critical as modern models grow larger.

To illustrate this difference, consider an $L$-layer network where each layer has parameter size $P$ and activation size $A$. With the Adam optimizer (requiring $2P$ for momentum and variance), each layer needs $4P$ memory for parameters, gradients, and optimizer states. Standard training thus requires $(4P + A)L$ total memory. Activation checkpointing reduces this to $4PL + A$ by rematerializing activations only when needed (though this is an optimistic estimate that ignores the memory cost of the checkpoints). DiffusionBlocks, by training $B$ independent blocks, requires $(4P + A)(L/B)$. Combining DiffusionBlocks with activation checkpointing requires only $4P(L/B) + A$ under the same optimistic estimate, which is the smallest of these four settings whenever $B > 1$ and $L > B$.
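
The four settings can be compared numerically with the simplified memory model above. This is only a sketch: the function name and the example values of $P$ and $A$ are hypothetical, and the combined setting is taken to be $4P(L/B) + A$, i.e. the same optimistic checkpointing estimate applied within a single block.

```python
def memory_units(P, A, L, B=1, checkpointing=False):
    """Peak training memory under the simplified per-layer model.

    P: parameter size per layer, A: activation size per layer,
    L: number of layers, B: number of independently trained blocks.
    Parameters, gradients, and Adam states cost 4P per resident layer.
    """
    resident_layers = L / B
    state = 4 * P * resident_layers
    # Optimistic estimate: checkpointing keeps a single layer's activations.
    activations = A if checkpointing else A * resident_layers
    return state + activations


P, A, L, B = 1.0, 2.0, 12, 3  # hypothetical sizes in arbitrary units
standard = memory_units(P, A, L)                        # (4P + A) L      = 72
ckpt_only = memory_units(P, A, L, checkpointing=True)   # 4PL + A         = 50
blocks_only = memory_units(P, A, L, B=B)                # (4P + A) L / B  = 24
combined = memory_units(P, A, L, B=B, checkpointing=True)  # 4PL / B + A  = 18
```

In this example checkpointing alone is bounded below by the $4PL$ term, whereas block-wise training also shrinks that term by a factor of $B$.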

Regarding computational costs, it is empirically known that activation checkpointing increases the training time by a factor of approximately 4/3, and this holds true when combined with the proposed method. This is justified as follows. With a forward pass computation cost of $F$, a backward pass requires approximately $2F$ (computing Jacobians and weight gradients). Standard training uses $3F$ cost per iteration, while activation checkpointing increases this to $4F$ due to recomputation. DiffusionBlocks maintains this ratio when combined with checkpointing.

Beyond memory reduction, DiffusionBlocks offers an advantage in training time: because the blocks are independent, they can be trained in an embarrassingly parallel manner, with no communication between blocks during training. This is an additional advantage over activation checkpointing, especially when computational resources are abundant.

Training and inference efficiency
=================================

This section provides a detailed analysis of the computational efficiency and wall-time characteristics of DiffusionBlocks.

#### Training efficiency.

Consider an $L$-layer network trained for $K$ iterations. Standard end-to-end backpropagation performs $K \times L$ layer evaluations. DiffusionBlocks trains only $L/B$ layers at a time; training all $B$ blocks for $K$ iterations each performs $(L/B) \times B \times K = L \times K$ layer evaluations. Thus, DiffusionBlocks requires the *same* total amount of computation as standard training, while reducing memory usage by a factor of $B$.

::: {#tab:walltime-one}
  **Method**                                              **Wall time (sec/iter)**
  ------------------------------------------------------ --------------------------
  ViT                                                              0.0507
  DiffusionBlocks: per-block time (4 layers)                       0.0181
  DiffusionBlocks: aggregated time ($0.0181 \times 3$)             0.0543

  : Wall-time comparison on ViT. The aggregated DiffusionBlocks time is computed by multiplying the measured per-block iteration time by $B=3$.
:::

To validate this theoretical equivalence, we measured the per-iteration wall time using a 12-layer ViT on a single H100 80GB GPU, averaging over 100 iterations. As summarized in Table `\ref{tab:walltime-one}`{=latex}, standard training requires 0.0507 seconds per iteration for all 12 layers. Under DiffusionBlocks with $B=3$, each block (4 layers) takes 0.0181 seconds per iteration (measured). The total per-iteration wall time for DiffusionBlocks is therefore obtained by summing the independently trained blocks, computed as $0.0181 \times 3 = 0.0543$ seconds. The resulting end-to-end wall time is thus comparable to standard training, with the small difference attributable to the noise-level conditioning introduced during the DiffusionBlocks conversion (Section `\ref{sec:method-core}`{=latex}).

#### Inference efficiency.

For inference, we ensure that the total amount of computation matches that of the baseline model. For a 12-layer network, the baseline performs a single forward pass through all 12 layers. Under DiffusionBlocks with $B=3$, we perform three denoising steps, each invoking the corresponding 4-layer block once. The total compute therefore corresponds to the same 12 layer evaluations as in standard inference.

For diffusion models used in image generation, the computational benefit is even more pronounced. Standard diffusion models must apply the full network for every denoising step. With 50 denoising steps, a 12-layer DiT requires $12 \times 50 = 600$ layer evaluations. In DiffusionBlocks, each denoising step applies only the block responsible for that noise level, which contains 4 layers when $B=3$. This reduces the total compute to $4 \times 50 = 200$ layer evaluations, a $B$-fold reduction in inference cost. The 50 denoising steps are assigned to blocks according to the equi-probability partitioning in Section `\ref{sec:partitioning}`{=latex}, so that each block is used approximately the same number of times during inference. We use Euler sampling for simplicity; as shown in Section `\ref{sec:residual}`{=latex}, an Euler step is computationally equivalent to a residual update and therefore adds no overhead.
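The inference counts above can be sketched as follows. All names are hypothetical, and `assign_steps_evenly` is only a stand-in: it splits the steps into contiguous, near-equal groups, whereas the actual assignment follows the equi-probability partitioning over noise levels.

```python
from collections import Counter
from typing import List


def full_network_evals(num_layers: int, num_steps: int) -> int:
    """Standard diffusion sampling: every step runs all layers."""
    return num_layers * num_steps


def blockwise_evals(num_layers: int, num_blocks: int, num_steps: int) -> int:
    """Block-wise sampling: every step runs a single block of L/B layers."""
    if num_layers % num_blocks != 0:
        # Assumption for this sketch: blocks have equal size.
        raise ValueError("num_layers must be divisible by num_blocks")
    return (num_layers // num_blocks) * num_steps


def assign_steps_evenly(num_steps: int, num_blocks: int) -> List[int]:
    """Map each step index to a block index in contiguous, near-equal groups.

    Stand-in for the paper's partitioning; it only illustrates that each
    block handles roughly num_steps / num_blocks steps.
    """
    return [step * num_blocks // num_steps for step in range(num_steps)]


if __name__ == "__main__":
    print(full_network_evals(12, 50), blockwise_evals(12, 3, 50))
    usage = Counter(assign_steps_evenly(50, 3))
    print(sorted(usage.items()))
```

Setting the number of steps equal to $B$ recovers the classification case: `blockwise_evals(12, 3, 3)` counts 12 layer evaluations, the same as one forward pass of the baseline.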

[^1]: We use *block-wise training* to encompass all approaches that partition networks into independently trainable components. This includes *layer-wise training* as the special case where each block contains one layer.

[^2]: These ablations disable block overlap in Appendix `\ref{app:implementation}`{=latex} to isolate the effectiveness of each component, resulting in the FID difference from Table `\ref{tab:img-gen}`{=latex}.

[^3]: <https://github.com/HosseinAghagol/ContrastiveFF>

[^4]: <https://huggingface.co/stabilityai/sd-vae-ft-ema>

[^5]: <https://huggingface.co/meta-llama/Llama-2-7b-hf>

[^6]: <https://huggingface.co/openai-community/gpt2-xl>

[^7]: <https://huggingface.co/EleutherAI/pythia-70m>
