---
author:
- |
  `\fontsize{10.5pt}{18pt}`{=latex}\
  `\selectfont`{=latex}\
  Ali Hatamizadeh    Yejin Choi    Jan Kautz\
  `\small `{=latex}`{ahatamizadeh, yejinc, jkautz}@nvidia.com`
bibliography:
- ref.bib
title: |
  Gated DeltaNet-2: Decoupling Erase and Write in\
  Linear Attention
---

```{=latex}
\pdfoutput=1
```
```{=latex}
\let\cite\citep                  
```
```{=latex}
\newcommand{\white}[1]{\textcolor{white}{#1}}
```
```{=latex}
\newcommand{\gray}[1]{\textcolor{gray}{#1}}
```
```{=latex}
\newcommand{\green}[1]{\textcolor{green}{#1}}
```
```{=latex}
\newcommand{\brickred}[1]{\textcolor{brickred}{#1}}
```
```{=latex}
\newcommand{\midnightblue}[1]{\textcolor{midnightblue}{#1}}
```
```{=latex}
\newcommand{\salmon}[1]{\textcolor{salmon}{#1}}
```
```{=latex}
\newcommand{\junglegreen}[1]{\textcolor{junglegreen}{#1}}
```
```{=latex}
\newcommand{\forestgreen}[1]{\textcolor{forestgreen}{#1}}
```
```{=latex}
\newcommand{\pinegreen}[1]{\textcolor{pinegreen}{#1}}
```
```{=latex}
\newcommand{\seagreen}[1]{\textcolor{seagreen}{#1}}
```
```{=latex}
\newcommand{\violet}[1]{\textcolor{violet}{#1}}
```
```{=latex}
\newcommand{\pastelviolet}[1]{\textcolor{pastelviolet}{#1}}
```
```{=latex}
\newcommand{\darkcyan}[1]{\textcolor{darkcyan}{#1}}
```
```{=latex}
\newcommand{\cmark}{\text{\ding{51}}}
```
```{=latex}
\newcommand{\xmark}{\text{\ding{55}}}
```
```{=latex}
\newcommand{\A}{\bm{A}}
```
```{=latex}
\providecommand{\rmA}{\mathbf{A}}
```
```{=latex}
\providecommand{\rmB}{\mathbf{B}}
```
```{=latex}
\providecommand{\rmD}{\mathbf{D}}
```
```{=latex}
\providecommand{\rmE}{\mathbf{E}}
```
```{=latex}
\providecommand{\rmI}{\mathbf{I}}
```
```{=latex}
\providecommand{\rmK}{\mathbf{K}}
```
```{=latex}
\providecommand{\rmM}{\mathbf{M}}
```
```{=latex}
\providecommand{\rmO}{\mathbf{O}}
```
```{=latex}
\providecommand{\rmQ}{\mathbf{Q}}
```
```{=latex}
\providecommand{\rmR}{\mathbf{R}}
```
```{=latex}
\providecommand{\rmS}{\mathbf{S}}
```
```{=latex}
\providecommand{\rmT}{\mathbf{T}}
```
```{=latex}
\providecommand{\rmU}{\mathbf{U}}
```
```{=latex}
\providecommand{\rmV}{\mathbf{V}}
```
```{=latex}
\providecommand{\rmW}{\mathbf{W}}
```
```{=latex}
\providecommand{\rmY}{\mathbf{Y}}
```
```{=latex}
\providecommand{\rmZ}{\mathbf{Z}}
```
```{=latex}
\providecommand{\vk}{\bm{k}}
```
```{=latex}
\providecommand{\vv}{\bm{v}}
```
```{=latex}
\providecommand{\vq}{\bm{q}}
```
```{=latex}
\providecommand{\vo}{\bm{o}}
```
```{=latex}
\providecommand{\vw}{\bm{w}}
```
```{=latex}
\providecommand{\vu}{\bm{u}}
```
```{=latex}
\newcommand{\figleft}{{\em (Left)}}
```
```{=latex}
\newcommand{\figcenter}{{\em (Center)}}
```
```{=latex}
\newcommand{\figright}{{\em (Right)}}
```
```{=latex}
\newcommand{\figtop}{{\em (Top)}}
```
```{=latex}
\newcommand{\figbottom}{{\em (Bottom)}}
```
```{=latex}
\newcommand{\captiona}{{\em (a)}}
```
```{=latex}
\newcommand{\captionb}{{\em (b)}}
```
```{=latex}
\newcommand{\captionc}{{\em (c)}}
```
```{=latex}
\newcommand{\captiond}{{\em (d)}}
```
```{=latex}
\newcommand{\newterm}[1]{{\bf #1}}
```
```{=latex}
\def\figref#1{figure~\ref{#1}}
```
```{=latex}
\def\Figref#1{Figure~\ref{#1}}
```
```{=latex}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
```
```{=latex}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
```
```{=latex}
\def\secref#1{section~\ref{#1}}
```
```{=latex}
\def\Secref#1{Section~\ref{#1}}
```
```{=latex}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
```
```{=latex}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
```
```{=latex}
\def\eqref#1{equation~\ref{#1}}
```
```{=latex}
\def\Eqref#1{Equation~\ref{#1}}
```
```{=latex}
\def\plaineqref#1{\ref{#1}}
```
```{=latex}
\def\chapref#1{chapter~\ref{#1}}
```
```{=latex}
\def\Chapref#1{Chapter~\ref{#1}}
```
```{=latex}
\def\rangechapref#1#2{chapters\ref{#1}--\ref{#2}}
```
```{=latex}
\def\algref#1{algorithm~\ref{#1}}
```
```{=latex}
\def\Algref#1{Algorithm~\ref{#1}}
```
```{=latex}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
```
```{=latex}
\def\partref#1{part~\ref{#1}}
```
```{=latex}
\def\Partref#1{Part~\ref{#1}}
```
```{=latex}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
```
```{=latex}
\def\ceil#1{\lceil #1 \rceil}
```
```{=latex}
\def\floor#1{\lfloor #1 \rfloor}
```
```{=latex}
\def\1{\bm{1}}
```
```{=latex}
\newcommand{\train}{\mathcal{D}}
```
```{=latex}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
```
```{=latex}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
```
```{=latex}
\def\eps{{\epsilon}}
```
```{=latex}
\def\reta{{\textnormal{$\eta$}}}
```
```{=latex}
\def\ra{{\textnormal{a}}}
```
```{=latex}
\def\rb{{\textnormal{b}}}
```
```{=latex}
\def\rc{{\textnormal{c}}}
```
```{=latex}
\def\rd{{\textnormal{d}}}
```
```{=latex}
\def\re{{\textnormal{e}}}
```
```{=latex}
\def\rf{{\textnormal{f}}}
```
```{=latex}
\def\rg{{\textnormal{g}}}
```
```{=latex}
\def\rh{{\textnormal{h}}}
```
```{=latex}
\def\ri{{\textnormal{i}}}
```
```{=latex}
\def\rj{{\textnormal{j}}}
```
```{=latex}
\def\rk{{\textnormal{k}}}
```
```{=latex}
\def\rl{{\textnormal{l}}}
```
```{=latex}
\def\rn{{\textnormal{n}}}
```
```{=latex}
\def\ro{{\textnormal{o}}}
```
```{=latex}
\def\rp{{\textnormal{p}}}
```
```{=latex}
\def\rq{{\textnormal{q}}}
```
```{=latex}
\def\rr{{\textnormal{r}}}
```
```{=latex}
\def\rs{{\textnormal{s}}}
```
```{=latex}
\def\rt{{\textnormal{t}}}
```
```{=latex}
\def\ru{{\textnormal{u}}}
```
```{=latex}
\def\rv{{\textnormal{v}}}
```
```{=latex}
\def\rw{{\textnormal{w}}}
```
```{=latex}
\def\rx{{\textnormal{x}}}
```
```{=latex}
\def\ry{{\textnormal{y}}}
```
```{=latex}
\def\rz{{\textnormal{z}}}
```
```{=latex}
\def\rvepsilon{{\mathbf{\epsilon}}}
```
```{=latex}
\def\rvtheta{{\mathbf{\theta}}}
```
```{=latex}
\def\rva{{\mathbf{a}}}
```
```{=latex}
\def\rvb{{\mathbf{b}}}
```
```{=latex}
\def\rvc{{\mathbf{c}}}
```
```{=latex}
\def\rvd{{\mathbf{d}}}
```
```{=latex}
\def\rve{{\mathbf{e}}}
```
```{=latex}
\def\rvf{{\mathbf{f}}}
```
```{=latex}
\def\rvg{{\mathbf{g}}}
```
```{=latex}
\def\rvh{{\mathbf{h}}}
```
```{=latex}
\def\rvi{{\mathbf{i}}}
```
```{=latex}
\def\rvj{{\mathbf{j}}}
```
```{=latex}
\def\rvk{{\mathbf{k}}}
```
```{=latex}
\def\rvl{{\mathbf{l}}}
```
```{=latex}
\def\rvm{{\mathbf{m}}}
```
```{=latex}
\def\rvn{{\mathbf{n}}}
```
```{=latex}
\def\rvo{{\mathbf{o}}}
```
```{=latex}
\def\rvp{{\mathbf{p}}}
```
```{=latex}
\def\rvq{{\mathbf{q}}}
```
```{=latex}
\def\rvr{{\mathbf{r}}}
```
```{=latex}
\def\rvs{{\mathbf{s}}}
```
```{=latex}
\def\rvt{{\mathbf{t}}}
```
```{=latex}
\def\rvu{{\mathbf{u}}}
```
```{=latex}
\def\rvv{{\mathbf{v}}}
```
```{=latex}
\def\rvw{{\mathbf{w}}}
```
```{=latex}
\def\rvx{{\mathbf{x}}}
```
```{=latex}
\def\rvy{{\mathbf{y}}}
```
```{=latex}
\def\rvz{{\mathbf{z}}}
```
```{=latex}
\def\erva{{\textnormal{a}}}
```
```{=latex}
\def\ervb{{\textnormal{b}}}
```
```{=latex}
\def\ervc{{\textnormal{c}}}
```
```{=latex}
\def\ervd{{\textnormal{d}}}
```
```{=latex}
\def\erve{{\textnormal{e}}}
```
```{=latex}
\def\ervf{{\textnormal{f}}}
```
```{=latex}
\def\ervg{{\textnormal{g}}}
```
```{=latex}
\def\ervh{{\textnormal{h}}}
```
```{=latex}
\def\ervi{{\textnormal{i}}}
```
```{=latex}
\def\ervj{{\textnormal{j}}}
```
```{=latex}
\def\ervk{{\textnormal{k}}}
```
```{=latex}
\def\ervl{{\textnormal{l}}}
```
```{=latex}
\def\ervm{{\textnormal{m}}}
```
```{=latex}
\def\ervn{{\textnormal{n}}}
```
```{=latex}
\def\ervo{{\textnormal{o}}}
```
```{=latex}
\def\ervp{{\textnormal{p}}}
```
```{=latex}
\def\ervq{{\textnormal{q}}}
```
```{=latex}
\def\ervr{{\textnormal{r}}}
```
```{=latex}
\def\ervs{{\textnormal{s}}}
```
```{=latex}
\def\ervt{{\textnormal{t}}}
```
```{=latex}
\def\ervu{{\textnormal{u}}}
```
```{=latex}
\def\ervv{{\textnormal{v}}}
```
```{=latex}
\def\ervw{{\textnormal{w}}}
```
```{=latex}
\def\ervx{{\textnormal{x}}}
```
```{=latex}
\def\ervy{{\textnormal{y}}}
```
```{=latex}
\def\ervz{{\textnormal{z}}}
```
```{=latex}
\def\rmA{{\mathbf{A}}}
```
```{=latex}
\def\rmB{{\mathbf{B}}}
```
```{=latex}
\def\rmC{{\mathbf{C}}}
```
```{=latex}
\def\rmD{{\mathbf{D}}}
```
```{=latex}
\def\rmE{{\mathbf{E}}}
```
```{=latex}
\def\rmF{{\mathbf{F}}}
```
```{=latex}
\def\rmG{{\mathbf{G}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\rmI{{\mathbf{I}}}
```
```{=latex}
\def\rmJ{{\mathbf{J}}}
```
```{=latex}
\def\rmK{{\mathbf{K}}}
```
```{=latex}
\def\rmL{{\mathbf{L}}}
```
```{=latex}
\def\rmM{{\mathbf{M}}}
```
```{=latex}
\def\rmN{{\mathbf{N}}}
```
```{=latex}
\def\rmO{{\mathbf{O}}}
```
```{=latex}
\def\rmP{{\mathbf{P}}}
```
```{=latex}
\def\rmQ{{\mathbf{Q}}}
```
```{=latex}
\def\rmR{{\mathbf{R}}}
```
```{=latex}
\def\rmS{{\mathbf{S}}}
```
```{=latex}
\def\rmT{{\mathbf{T}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmV{{\mathbf{V}}}
```
```{=latex}
\def\rmW{{\mathbf{W}}}
```
```{=latex}
\def\rmX{{\mathbf{X}}}
```
```{=latex}
\def\rmY{{\mathbf{Y}}}
```
```{=latex}
\def\rmZ{{\mathbf{Z}}}
```
```{=latex}
\def\ermA{{\textnormal{A}}}
```
```{=latex}
\def\ermB{{\textnormal{B}}}
```
```{=latex}
\def\ermC{{\textnormal{C}}}
```
```{=latex}
\def\ermD{{\textnormal{D}}}
```
```{=latex}
\def\ermE{{\textnormal{E}}}
```
```{=latex}
\def\ermF{{\textnormal{F}}}
```
```{=latex}
\def\ermG{{\textnormal{G}}}
```
```{=latex}
\def\ermH{{\textnormal{H}}}
```
```{=latex}
\def\ermI{{\textnormal{I}}}
```
```{=latex}
\def\ermJ{{\textnormal{J}}}
```
```{=latex}
\def\ermK{{\textnormal{K}}}
```
```{=latex}
\def\ermL{{\textnormal{L}}}
```
```{=latex}
\def\ermM{{\textnormal{M}}}
```
```{=latex}
\def\ermN{{\textnormal{N}}}
```
```{=latex}
\def\ermO{{\textnormal{O}}}
```
```{=latex}
\def\ermP{{\textnormal{P}}}
```
```{=latex}
\def\ermQ{{\textnormal{Q}}}
```
```{=latex}
\def\ermR{{\textnormal{R}}}
```
```{=latex}
\def\ermS{{\textnormal{S}}}
```
```{=latex}
\def\ermT{{\textnormal{T}}}
```
```{=latex}
\def\ermU{{\textnormal{U}}}
```
```{=latex}
\def\ermV{{\textnormal{V}}}
```
```{=latex}
\def\ermW{{\textnormal{W}}}
```
```{=latex}
\def\ermX{{\textnormal{X}}}
```
```{=latex}
\def\ermY{{\textnormal{Y}}}
```
```{=latex}
\def\ermZ{{\textnormal{Z}}}
```
```{=latex}
\def\vzero{{\bm{0}}}
```
```{=latex}
\def\vone{{\bm{1}}}
```
```{=latex}
\def\valpha{{\bm{\alpha}}}
```
```{=latex}
\def\vmu{{\bm{\mu}}}
```
```{=latex}
\def\vtheta{{\bm{\theta}}}
```
```{=latex}
\def\va{{\bm{a}}}
```
```{=latex}
\def\vb{{\bm{b}}}
```
```{=latex}
\def\vc{{\bm{c}}}
```
```{=latex}
\def\vd{{\bm{d}}}
```
```{=latex}
\def\ve{{\bm{e}}}
```
```{=latex}
\def\vf{{\bm{f}}}
```
```{=latex}
\def\vg{{\bm{g}}}
```
```{=latex}
\def\vh{{\bm{h}}}
```
```{=latex}
\def\vi{{\bm{i}}}
```
```{=latex}
\def\vj{{\bm{j}}}
```
```{=latex}
\def\vk{{\bm{k}}}
```
```{=latex}
\def\vl{{\bm{l}}}
```
```{=latex}
\def\vm{{\bm{m}}}
```
```{=latex}
\def\vn{{\bm{n}}}
```
```{=latex}
\def\vo{{\bm{o}}}
```
```{=latex}
\def\vp{{\bm{p}}}
```
```{=latex}
\def\vq{{\bm{q}}}
```
```{=latex}
\def\vr{{\bm{r}}}
```
```{=latex}
\def\vs{{\bm{s}}}
```
```{=latex}
\def\vt{{\bm{t}}}
```
```{=latex}
\def\vu{{\bm{u}}}
```
```{=latex}
\def\vv{{\bm{v}}}
```
```{=latex}
\def\vw{{\bm{w}}}
```
```{=latex}
\def\vx{{\bm{x}}}
```
```{=latex}
\def\vy{{\bm{y}}}
```
```{=latex}
\def\vz{{\bm{z}}}
```
```{=latex}
\def\evalpha{{\alpha}}
```
```{=latex}
\def\evbeta{{\beta}}
```
```{=latex}
\def\evepsilon{{\epsilon}}
```
```{=latex}
\def\evlambda{{\lambda}}
```
```{=latex}
\def\evomega{{\omega}}
```
```{=latex}
\def\evmu{{\mu}}
```
```{=latex}
\def\evpsi{{\psi}}
```
```{=latex}
\def\evsigma{{\sigma}}
```
```{=latex}
\def\evtheta{{\theta}}
```
```{=latex}
\def\eva{{a}}
```
```{=latex}
\def\evb{{b}}
```
```{=latex}
\def\evc{{c}}
```
```{=latex}
\def\evd{{d}}
```
```{=latex}
\def\eve{{e}}
```
```{=latex}
\def\evf{{f}}
```
```{=latex}
\def\evg{{g}}
```
```{=latex}
\def\evh{{h}}
```
```{=latex}
\def\evi{{i}}
```
```{=latex}
\def\evj{{j}}
```
```{=latex}
\def\evk{{k}}
```
```{=latex}
\def\evl{{l}}
```
```{=latex}
\def\evm{{m}}
```
```{=latex}
\def\evn{{n}}
```
```{=latex}
\def\evo{{o}}
```
```{=latex}
\def\evp{{p}}
```
```{=latex}
\def\evq{{q}}
```
```{=latex}
\def\evr{{r}}
```
```{=latex}
\def\evs{{s}}
```
```{=latex}
\def\evt{{t}}
```
```{=latex}
\def\evu{{u}}
```
```{=latex}
\def\evv{{v}}
```
```{=latex}
\def\evw{{w}}
```
```{=latex}
\def\evx{{x}}
```
```{=latex}
\def\evy{{y}}
```
```{=latex}
\def\evz{{z}}
```
```{=latex}
\def\mA{{\bm{A}}}
```
```{=latex}
\def\mB{{\bm{B}}}
```
```{=latex}
\def\mC{{\bm{C}}}
```
```{=latex}
\def\mD{{\bm{D}}}
```
```{=latex}
\def\mE{{\bm{E}}}
```
```{=latex}
\def\mF{{\bm{F}}}
```
```{=latex}
\def\mG{{\bm{G}}}
```
```{=latex}
\def\mH{{\bm{H}}}
```
```{=latex}
\def\mI{{\bm{I}}}
```
```{=latex}
\def\mJ{{\bm{J}}}
```
```{=latex}
\def\mK{{\bm{K}}}
```
```{=latex}
\def\mL{{\bm{L}}}
```
```{=latex}
\def\mM{{\bm{M}}}
```
```{=latex}
\def\mN{{\bm{N}}}
```
```{=latex}
\def\mO{{\bm{O}}}
```
```{=latex}
\def\mP{{\bm{P}}}
```
```{=latex}
\def\mQ{{\bm{Q}}}
```
```{=latex}
\def\mR{{\bm{R}}}
```
```{=latex}
\def\mS{{\bm{S}}}
```
```{=latex}
\def\mT{{\bm{T}}}
```
```{=latex}
\def\mU{{\bm{U}}}
```
```{=latex}
\def\mV{{\bm{V}}}
```
```{=latex}
\def\mW{{\bm{W}}}
```
```{=latex}
\def\mX{{\bm{X}}}
```
```{=latex}
\def\mY{{\bm{Y}}}
```
```{=latex}
\def\mZ{{\bm{Z}}}
```
```{=latex}
\def\mBeta{{\bm{\beta}}}
```
```{=latex}
\def\mPhi{{\bm{\Phi}}}
```
```{=latex}
\def\mLambda{{\bm{\Lambda}}}
```
```{=latex}
\def\mSigma{{\bm{\Sigma}}}
```
```{=latex}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
```
```{=latex}
\def\tA{{\tens{A}}}
```
```{=latex}
\def\tB{{\tens{B}}}
```
```{=latex}
\def\tC{{\tens{C}}}
```
```{=latex}
\def\tD{{\tens{D}}}
```
```{=latex}
\def\tE{{\tens{E}}}
```
```{=latex}
\def\tF{{\tens{F}}}
```
```{=latex}
\def\tG{{\tens{G}}}
```
```{=latex}
\def\tH{{\tens{H}}}
```
```{=latex}
\def\tI{{\tens{I}}}
```
```{=latex}
\def\tJ{{\tens{J}}}
```
```{=latex}
\def\tK{{\tens{K}}}
```
```{=latex}
\def\tL{{\tens{L}}}
```
```{=latex}
\def\tM{{\tens{M}}}
```
```{=latex}
\def\tN{{\tens{N}}}
```
```{=latex}
\def\tO{{\tens{O}}}
```
```{=latex}
\def\tP{{\tens{P}}}
```
```{=latex}
\def\tQ{{\tens{Q}}}
```
```{=latex}
\def\tR{{\tens{R}}}
```
```{=latex}
\def\tS{{\tens{S}}}
```
```{=latex}
\def\tT{{\tens{T}}}
```
```{=latex}
\def\tU{{\tens{U}}}
```
```{=latex}
\def\tV{{\tens{V}}}
```
```{=latex}
\def\tW{{\tens{W}}}
```
```{=latex}
\def\tX{{\tens{X}}}
```
```{=latex}
\def\tY{{\tens{Y}}}
```
```{=latex}
\def\tZ{{\tens{Z}}}
```
```{=latex}
\def\gA{{\mathcal{A}}}
```
```{=latex}
\def\gB{{\mathcal{B}}}
```
```{=latex}
\def\gC{{\mathcal{C}}}
```
```{=latex}
\def\gD{{\mathcal{D}}}
```
```{=latex}
\def\gE{{\mathcal{E}}}
```
```{=latex}
\def\gF{{\mathcal{F}}}
```
```{=latex}
\def\gG{{\mathcal{G}}}
```
```{=latex}
\def\gH{{\mathcal{H}}}
```
```{=latex}
\def\gI{{\mathcal{I}}}
```
```{=latex}
\def\gJ{{\mathcal{J}}}
```
```{=latex}
\def\gK{{\mathcal{K}}}
```
```{=latex}
\def\gL{{\mathcal{L}}}
```
```{=latex}
\def\gM{{\mathcal{M}}}
```
```{=latex}
\def\gN{{\mathcal{N}}}
```
```{=latex}
\def\gO{{\mathcal{O}}}
```
```{=latex}
\def\gP{{\mathcal{P}}}
```
```{=latex}
\def\gQ{{\mathcal{Q}}}
```
```{=latex}
\def\gR{{\mathcal{R}}}
```
```{=latex}
\def\gS{{\mathcal{S}}}
```
```{=latex}
\def\gT{{\mathcal{T}}}
```
```{=latex}
\def\gU{{\mathcal{U}}}
```
```{=latex}
\def\gV{{\mathcal{V}}}
```
```{=latex}
\def\gW{{\mathcal{W}}}
```
```{=latex}
\def\gX{{\mathcal{X}}}
```
```{=latex}
\def\gY{{\mathcal{Y}}}
```
```{=latex}
\def\gZ{{\mathcal{Z}}}
```
```{=latex}
\def\sA{{\mathbb{A}}}
```
```{=latex}
\def\sB{{\mathbb{B}}}
```
```{=latex}
\def\sC{{\mathbb{C}}}
```
```{=latex}
\def\sD{{\mathbb{D}}}
```
```{=latex}
\def\sF{{\mathbb{F}}}
```
```{=latex}
\def\sG{{\mathbb{G}}}
```
```{=latex}
\def\sH{{\mathbb{H}}}
```
```{=latex}
\def\sI{{\mathbb{I}}}
```
```{=latex}
\def\sJ{{\mathbb{J}}}
```
```{=latex}
\def\sK{{\mathbb{K}}}
```
```{=latex}
\def\sL{{\mathbb{L}}}
```
```{=latex}
\def\sM{{\mathbb{M}}}
```
```{=latex}
\def\sN{{\mathbb{N}}}
```
```{=latex}
\def\sO{{\mathbb{O}}}
```
```{=latex}
\def\sP{{\mathbb{P}}}
```
```{=latex}
\def\sQ{{\mathbb{Q}}}
```
```{=latex}
\def\sR{{\mathbb{R}}}
```
```{=latex}
\def\sS{{\mathbb{S}}}
```
```{=latex}
\def\sT{{\mathbb{T}}}
```
```{=latex}
\def\sU{{\mathbb{U}}}
```
```{=latex}
\def\sV{{\mathbb{V}}}
```
```{=latex}
\def\sW{{\mathbb{W}}}
```
```{=latex}
\def\sX{{\mathbb{X}}}
```
```{=latex}
\def\sY{{\mathbb{Y}}}
```
```{=latex}
\def\sZ{{\mathbb{Z}}}
```
```{=latex}
\def\emLambda{{\Lambda}}
```
```{=latex}
\def\emA{{A}}
```
```{=latex}
\def\emB{{B}}
```
```{=latex}
\def\emC{{C}}
```
```{=latex}
\def\emD{{D}}
```
```{=latex}
\def\emE{{E}}
```
```{=latex}
\def\emF{{F}}
```
```{=latex}
\def\emG{{G}}
```
```{=latex}
\def\emH{{H}}
```
```{=latex}
\def\emI{{I}}
```
```{=latex}
\def\emJ{{J}}
```
```{=latex}
\def\emK{{K}}
```
```{=latex}
\def\emL{{L}}
```
```{=latex}
\def\emM{{M}}
```
```{=latex}
\def\emN{{N}}
```
```{=latex}
\def\emO{{O}}
```
```{=latex}
\def\emP{{P}}
```
```{=latex}
\def\emQ{{Q}}
```
```{=latex}
\def\emR{{R}}
```
```{=latex}
\def\emS{{S}}
```
```{=latex}
\def\emT{{T}}
```
```{=latex}
\def\emU{{U}}
```
```{=latex}
\def\emV{{V}}
```
```{=latex}
\def\emW{{W}}
```
```{=latex}
\def\emX{{X}}
```
```{=latex}
\def\emY{{Y}}
```
```{=latex}
\def\emZ{{Z}}
```
```{=latex}
\def\emSigma{{\Sigma}}
```
```{=latex}
\newcommand{\etens}[1]{\mathsfit{#1}}
```
```{=latex}
\def\etLambda{{\etens{\Lambda}}}
```
```{=latex}
\def\etA{{\etens{A}}}
```
```{=latex}
\def\etB{{\etens{B}}}
```
```{=latex}
\def\etC{{\etens{C}}}
```
```{=latex}
\def\etD{{\etens{D}}}
```
```{=latex}
\def\etE{{\etens{E}}}
```
```{=latex}
\def\etF{{\etens{F}}}
```
```{=latex}
\def\etG{{\etens{G}}}
```
```{=latex}
\def\etH{{\etens{H}}}
```
```{=latex}
\def\etI{{\etens{I}}}
```
```{=latex}
\def\etJ{{\etens{J}}}
```
```{=latex}
\def\etK{{\etens{K}}}
```
```{=latex}
\def\etL{{\etens{L}}}
```
```{=latex}
\def\etM{{\etens{M}}}
```
```{=latex}
\def\etN{{\etens{N}}}
```
```{=latex}
\def\etO{{\etens{O}}}
```
```{=latex}
\def\etP{{\etens{P}}}
```
```{=latex}
\def\etQ{{\etens{Q}}}
```
```{=latex}
\def\etR{{\etens{R}}}
```
```{=latex}
\def\etS{{\etens{S}}}
```
```{=latex}
\def\etT{{\etens{T}}}
```
```{=latex}
\def\etU{{\etens{U}}}
```
```{=latex}
\def\etV{{\etens{V}}}
```
```{=latex}
\def\etW{{\etens{W}}}
```
```{=latex}
\def\etX{{\etens{X}}}
```
```{=latex}
\def\etY{{\etens{Y}}}
```
```{=latex}
\def\etZ{{\etens{Z}}}
```
```{=latex}
\newcommand{\pdata}{p_{\rm{data}}}
```
```{=latex}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
```
```{=latex}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
```
```{=latex}
\newcommand{\pmodel}{p_{\rm{model}}}
```
```{=latex}
\newcommand{\Pmodel}{P_{\rm{model}}}
```
```{=latex}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
```
```{=latex}
\newcommand{\pencode}{p_{\rm{encoder}}}
```
```{=latex}
\newcommand{\pdecode}{p_{\rm{decoder}}}
```
```{=latex}
\newcommand{\precons}{p_{\rm{reconstruct}}}
```
```{=latex}
\newcommand{\laplace}{\mathrm{Laplace}}
```
```{=latex}
\newcommand{\E}{\mathbb{E}}
```
```{=latex}
\newcommand{\Ls}{\mathcal{L}}
```
```{=latex}
\newcommand{\R}{\mathbb{R}}
```
```{=latex}
\newcommand{\emp}{\tilde{p}}
```
```{=latex}
\newcommand{\lr}{\alpha}
```
```{=latex}
\newcommand{\reg}{\lambda}
```
```{=latex}
\newcommand{\rect}{\mathrm{rectifier}}
```
```{=latex}
\newcommand{\softmax}{\mathrm{softmax}}
```
```{=latex}
\newcommand{\sigmoid}{\sigma}
```
```{=latex}
\newcommand{\softplus}{\zeta}
```
```{=latex}
\newcommand{\KL}{D_{\mathrm{KL}}}
```
```{=latex}
\newcommand{\Var}{\mathrm{Var}}
```
```{=latex}
\newcommand{\standarderror}{\mathrm{SE}}
```
```{=latex}
\newcommand{\Cov}{\mathrm{Cov}}
```
```{=latex}
\newcommand{\normlzero}{L^0}
```
```{=latex}
\newcommand{\normlone}{L^1}
```
```{=latex}
\newcommand{\normltwo}{L^2}
```
```{=latex}
\newcommand{\normlp}{L^p}
```
```{=latex}
\newcommand{\normmax}{L^\infty}
```
```{=latex}
\newcommand{\parents}{Pa}
```
```{=latex}
\DeclareMathOperator*{\argmax}{arg\,max}
```
```{=latex}
\DeclareMathOperator*{\argmin}{arg\,min}
```
```{=latex}
\DeclareMathOperator{\sign}{sign}
```
```{=latex}
\DeclareMathOperator{\Tr}{Tr}
```
```{=latex}
\let\ab\allowbreak
```
```{=latex}
\def\bcdot{{\bm{\cdot}}}
```
```{=latex}
\def\balpha{{\bm{\alpha}}}
```
```{=latex}
\def\bbeta{{\bm{\beta}}}
```
```{=latex}
\def\bgamma{{\bm{\gamma}}}
```
```{=latex}
\def\bmu{{\bm{\mu}}}
```
```{=latex}
\def\blambda{{\bm{\lambda}}}
```
```{=latex}
\def\bLambda{{\mathbf{\Lambda}}}
```
```{=latex}
\def\bGamma{{\mathbf{\Gamma}}}
```
```{=latex}
\def\none{\text{None}}
```
```{=latex}
\newcommand{\C}{\mathbb{C}}
```
```{=latex}
\providecommand{\scan}{\text{SCAN}}
```
```{=latex}
\providecommand{\reduce}{\text{REDUCE}}
```
```{=latex}
\providecommand{\loss}{\mathcal L}
```
```{=latex}
\providecommand{\lnorm}{\operatorname{LN}}
```
```{=latex}
\providecommand{\gla}{\text{GLA}}
```
```{=latex}
\providecommand{\ffn}{\text{FFN}}
```
```{=latex}
\providecommand{\swiglu}{\text{SwiGLU}}
```
```{=latex}
\providecommand{\cumprod}{\text{cumprod}}
```
```{=latex}
\providecommand{\revcum}{\text{revcum}}
```
```{=latex}
\def\rmdC{{\mathbf{dC}}}
```
```{=latex}
\def\rmdO{{\mathbf{dO}}}
```
```{=latex}
\def\rmdQ{{\mathbf{dQ}}}
```
```{=latex}
\def\rmdK{{\mathbf{dK}}}
```
```{=latex}
\def\rmdtQ{{\mathbf{d\tilde Q}}}
```
```{=latex}
\def\rmdtK{{\mathbf{d \tilde K}}}
```
```{=latex}
\def\rmdP{{\mathbf{dP}}}
```
```{=latex}
\def\rmdS{{\mathbf{dS}}}
```
```{=latex}
\def\rmdX{{\mathbf{dX}}}
```
```{=latex}
\def\rmdG{{\mathbf{dG}}}
```
```{=latex}
\def\rmdF{{\mathbf{dF}}}
```
```{=latex}
\def\rmdA{{\mathbf{dA}}}
```
```{=latex}
\def\rmdB{{\mathbf{dB}}}
```
```{=latex}
\def\rmdV{{\mathbf{dV}}}
```
```{=latex}
\def\rmU{{\mathbf{U}}}
```
```{=latex}
\def\rmH{{\mathbf{H}}}
```
```{=latex}
\def\vda{{\mathbf{d}\bm{a}}}
```
```{=latex}
\def\vlogb{{\log\bm{b}}}
```
```{=latex}
\def\vlogalpha{{\log\bm{\alpha}}}
```
```{=latex}
\def\vdlogb{{\mathbf{d}\log\bm{b}}}
```
```{=latex}
\def\vdk{{\mathbf{d}\bm{k}}}
```
```{=latex}
\def\vdx{{\mathbf{d}\bm{x}}}
```
```{=latex}
\def\vw{\mathbf{w}}
```
```{=latex}
\def\vu{\mathbf{u}}
```
```{=latex}
\def\vdb{{\mathbf{d}\bm{b}}}
```
```{=latex}
\def\vdq{{\mathbf{d}\bm{q}}}
```
```{=latex}
\def\vdv{{\mathbf{d}\bm{v}}}
```
```{=latex}
\def\vdo{{\mathbf{d}\bm{o}}}
```
```{=latex}
\def\dbalpha{{\mathbf{d}\bm{\alpha}}}
```
```{=latex}
\def\dblogalpha{{\mathbf{d}\log\bm{\alpha}}}
```
```{=latex}
\newcommand{\sysname}{\textsc{GLA}\xspace}
```
```{=latex}
\maketitle
```
# Introduction {#sec_introduction}

The Transformer architecture has become the dominant backbone for large language models because self-attention gives each token direct access to its history and maps naturally to parallel training on modern accelerators. Its cost, however, still grows quadratically with sequence length. This cost becomes a central obstacle for long-context training and high-throughput inference, where the model must repeatedly process histories that are much longer than the dimension of a single attention head.

Linear recurrent attention takes a different path. It replaces the explicit attention matrix with a fixed-size recurrent state, turning sequence mixing into a linear-time recurrence whose memory does not grow with context length [@katharopoulos_transformers_2020]. The appeal is clear, but so is the constraint. The state is a compressed key-value memory, and long contexts force many associations to share the same finite space, making exact retrieval difficult [@linear-xmr-fastweight; @zoology; @arora_simple_2024; @jelassi_repeat_2024; @wen_rnns_2024; @akyurek_-context_2024]. Recent work has improved this memory by giving the recurrence more control over what persists. Mamba-2 uses data-dependent decay to regulate the memory horizon [@pmlr-v235-dao24a]. DeltaNet replaces additive writes with the delta rule, enabling targeted overwrite of the association addressed by the current key [@widrow_adaptive_1988; @linear-xmr-fastweight; @yang2024parallelizing]. Gated DeltaNet combines the delta rule with a learned decay gate, giving the state both global forgetting and targeted editing [@yang2025gated]. Kimi Delta Attention (KDA) refines the decay side with channel-wise forgetting over the key dimension [@team2025kimi]. In parallel, Mamba-3 advances the state-space route through exponential-trapezoidal discretization, complex-valued state transitions, and a multi-input, multi-output formulation for stronger and more efficient recurrence [@lahoti2026mamba3]. These advances have pushed recurrent linear models forward, while making the remaining bottleneck in delta-rule memory more visible. The active edit still uses one scalar gate to control both erasing old content and writing new content.

We propose *Gated DeltaNet-2*, a recurrent attention layer that decouples erase and write in the delta rule. The scalar tie is a modeling restriction because erasing and writing act on different axes of the state. Erasing is a key-side operation that decides which coordinates of the old read should be removed, while writing is a value-side operation that decides which coordinates of the incoming value should be committed. Gated DeltaNet-2 preserves KDA's channel-wise decay, but replaces the tied scalar delta gate with a channel-wise erase gate on the key axis and a channel-wise write gate on the value axis. The model can clear broad context through decay, remove selected stale associations through erase, and insert only the value channels that should persist through write. When the erase and write gates are tied to the same scalar, Gated DeltaNet-2 recovers KDA. If the decay is tied to a scalar as well, it recovers Gated DeltaNet.

This change preserves the efficient training path. By absorbing cumulative channel-wise decay into the rank-one erase factors, the recurrence admits a compact WY form with the same high-level chunkwise structure used by efficient delta-rule kernels [@bischof_wy_1985; @hua_transformer_2022; @sun2023retentive; @yang_gated_2023]. The main text gives the modeling equations and the chunkwise algorithm. Kernel-level details are deferred to the supplement.

Empirically, Gated DeltaNet-2 improves the recurrent attention frontier, with the clearest gains on long-context retrieval. On the RULER needle-in-a-haystack tasks in Table `\ref{tab_niah_results}`{=latex}, it remains strong as context length grows and is especially effective on the evaluated multi-key case, where a fixed-size state must separate competing associations. This advantage also appears in real-world recall, where Gated DeltaNet-2 gives the strongest overall retrieval profile in both recurrent and hybrid settings. Together with gains in language modeling, commonsense reasoning, and in-context retrieval, these results suggest that decoupling the active memory edit directly targets the main pressure point of fixed-state recurrence: interference among many compressed associations.

# Preliminary {#sec_preliminary}

## Linear attention as a recurrent state

We work with one attention head and omit layer indices. Let $\vq_t, \vk_t \in \mathbb{R}^{d_k}$ and $\vv_t \in \mathbb{R}^{d_v}$ denote the query, key, and value at position $t$. A recurrent linear attention layer stores a matrix state $\rmS_t \in \mathbb{R}^{d_k \times d_v}$ and reads it with the query, $$\begin{aligned}
    \rmS_t &= \rmS_{t-1} + \vk_t \vv_t^\top,
    &
    \vo_t &= \rmS_t^\top \vq_t .
\label{eq_linear_attention_recurrence}
\end{aligned}$$ This is the recurrent form of linear attention [@katharopoulos_transformers_2020]. Expanding the recurrence over a length $L$ sequence gives the familiar causal matrix form $$\begin{aligned}
    \rmO = (\rmQ\rmK^\top \odot \rmM)\rmV,
\label{eq_linear_attention_parallel}
\end{aligned}$$ where $\rmQ$, $\rmK$, and $\rmV$ stack the queries, keys, and values row-wise and $\rmM \in \{0,1\}^{L \times L}$ is the lower-triangular causal mask, diagonal included. The state size is independent of $L$, and the parallel form replaces the tokenwise recurrence with matrix multiplications. The limitation is equally direct. Every outer product is added to the state and none is removed, so old associations persist and can only be obscured indirectly as later writes are superposed on them.
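As a sanity check on the two forms, the following NumPy sketch runs Eq. `\ref{eq_linear_attention_recurrence}`{=latex} token by token and compares it with the masked matrix form of Eq. `\ref{eq_linear_attention_parallel}`{=latex}. The function names, shapes, and random inputs are illustrative assumptions and do not correspond to any released implementation.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Token-by-token form: S_t = S_{t-1} + k_t v_t^T, o_t = S_t^T q_t."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # fixed-size state, independent of L
    O = np.zeros((L, d_v))
    for t in range(L):
        S = S + np.outer(K[t], V[t])  # additive write, nothing is removed
        O[t] = S.T @ Q[t]             # read with the query
    return O

def linear_attention_parallel(Q, K, V):
    """Matrix form: O = (Q K^T * M) V with a lower-triangular causal mask M."""
    L = Q.shape[0]
    M = np.tril(np.ones((L, L)))      # diagonal included: o_t sees k_t v_t^T
    return ((Q @ K.T) * M) @ V

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))

O_rec = linear_attention_recurrent(Q, K, V)
O_par = linear_attention_parallel(Q, K, V)
```

The two outputs agree up to floating-point error, which is the equivalence the chunkwise algorithms below rely on.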

#### Chunkwise form

Efficient linear recurrent layers use a chunkwise schedule during training [@hua_transformer_2022; @sun2023retentive; @yang_gated_2023]. Split the sequence into chunks of size $C$. For chunk $n$, let $\rmQ_{[n]}, \rmK_{[n]}, \rmV_{[n]}$ be the query, key, and value blocks, and let $\rmS_{[n]}$ be the state at the start of the chunk. Partial expansion gives $$\begin{aligned}
    \rmS_{[n+1]} &= \rmS_{[n]} + \rmK_{[n]}^\top \rmV_{[n]},
    &
    \rmO_{[n]} &= \rmQ_{[n]}\rmS_{[n]} + (\rmQ_{[n]}\rmK_{[n]}^\top \odot \rmM_C)\rmV_{[n]} .
\label{eq_linear_attention_chunk}
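\end{aligned}$$ Here the blocks stack tokens row-wise, so $\rmK_{[n]} \in \mathbb{R}^{C \times d_k}$, and $\rmM_C$ is the $C \times C$ causal mask. The recurrence remains only across chunks, while all token interactions inside a chunk are expressed as dense matrix products. With a fixed $C$, this keeps linear complexity in sequence length and maps well to tensor cores.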
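The chunkwise schedule of Eq. `\ref{eq_linear_attention_chunk}`{=latex} can be sketched in a few lines of NumPy. This is a readability-oriented sketch under the assumption that the sequence length is a multiple of $C$; the function name and toy sizes are hypothetical, and no kernel-level optimization is attempted.

```python
import numpy as np

def linear_attention_chunkwise(Q, K, V, C):
    """Chunkwise linear attention; assumes len(Q) is a multiple of C."""
    L, d_k = K.shape
    d_v = V.shape[1]
    if L % C != 0:
        raise ValueError("sequence length must be a multiple of the chunk size")
    M_C = np.tril(np.ones((C, C)))    # causal mask inside one chunk
    S = np.zeros((d_k, d_v))          # state at the start of the current chunk
    O = np.zeros((L, d_v))
    for start in range(0, L, C):
        Qn = Q[start:start + C]
        Kn = K[start:start + C]
        Vn = V[start:start + C]
        # Inter-chunk term reads the carried state; intra-chunk term is dense.
        O[start:start + C] = Qn @ S + ((Qn @ Kn.T) * M_C) @ Vn
        # Carry the state across the chunk boundary.
        S = S + Kn.T @ Vn
    return O

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_k, d_v, C = 8, 4, 3, 4
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))

O_chunk = linear_attention_chunkwise(Q, K, V, C)
# Reference: the full masked matrix form over the whole sequence.
O_full = ((Q @ K.T) * np.tril(np.ones((L, L)))) @ V
```

Only the loop over chunks is sequential; every operation inside it is a matrix product of size governed by $C$, $d_k$, and $d_v$.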

## Forgetting and overwriting

Mamba-2 adds a data-dependent scalar decay before each write [@pmlr-v235-dao24a], $$\begin{aligned}
    \rmS_t = \alpha_t\rmS_{t-1} + \vk_t\vv_t^\top,
    \qquad
    \alpha_t \in (0, 1] .
\label{eq_mamba2_recurrence}
\end{aligned}$$ The decay gives the model a global forgetting operation. If $\gamma_t = \prod_{i=1}^t \alpha_i$, then each earlier write is read at time $t$ with factor $\gamma_t / \gamma_i$. This yields a decay-aware attention mask and preserves the chunkwise structure of Eq. `\ref{eq_linear_attention_chunk}`{=latex}.
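The decay-aware mask can be checked numerically. The sketch below, with assumed function names and toy inputs, compares the recurrence of Eq. `\ref{eq_mamba2_recurrence}`{=latex} with a masked matrix form whose entry $(t, i)$ is $\gamma_t / \gamma_i$ for $i \le t$.

```python
import numpy as np

def decay_attention_recurrent(Q, K, V, alpha):
    """S_t = alpha_t * S_{t-1} + k_t v_t^T, read with o_t = S_t^T q_t."""
    L, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    O = np.zeros((L, d_v))
    for t in range(L):
        S = alpha[t] * S + np.outer(K[t], V[t])
        O[t] = S.T @ Q[t]
    return O

def decay_attention_parallel(Q, K, V, alpha):
    """Masked matrix form with mask[t, i] = gamma_t / gamma_i for i <= t."""
    gamma = np.cumprod(alpha)                         # gamma_t = prod_{i<=t} alpha_i
    mask = np.tril(gamma[:, None] / gamma[None, :])   # zero above the diagonal
    return ((Q @ K.T) * mask) @ V

# Hypothetical toy sizes and decay range, chosen only for illustration.
rng = np.random.default_rng(0)
L, d_k, d_v = 6, 4, 3
Q = rng.normal(size=(L, d_k))
K = rng.normal(size=(L, d_k))
V = rng.normal(size=(L, d_v))
alpha = rng.uniform(0.5, 1.0, size=L)

O_rec = decay_attention_recurrent(Q, K, V, alpha)
O_par = decay_attention_parallel(Q, K, V, alpha)
```

The direct division is used here only for readability. For long sequences the ratio would more safely be formed from cumulative sums of $\log \alpha_i$, since $\gamma_t$ itself can underflow.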

DeltaNet instead gives the state an active edit operation [@widrow_adaptive_1988; @linear-xmr-fastweight; @yang2024parallelizing]. Before writing $\vv_t$, the model reads the value currently associated with $\vk_t$ and subtracts it from the state. With a scalar step size $\beta_t \in [0,1]$, the update is $$\begin{aligned}
    \rmS_t
    &= \rmS_{t-1} + \beta_t\vk_t(\vv_t - \rmS_{t-1}^\top\vk_t)^\top
     = (\rmI - \beta_t\vk_t\vk_t^\top)\rmS_{t-1} + \beta_t\vk_t\vv_t^\top .
\label{eq_delta_recurrence}
\end{aligned}$$ When $\|\vk_t\|_2=1$, the matrix $\vk_t\vk_t^\top$ is a projector, so $\beta_t=1$ overwrites the association at key $\vk_t$ and $\beta_t=0$ leaves it unchanged. In the fast-weight view [@Irie2022TheDF; @ttt], Eq. `\ref{eq_delta_recurrence}`{=latex} is one online gradient step with step size $\beta_t$, taken from $\rmS = \rmS_{t-1}$, on the local regression loss $\frac{1}{2}\|\rmS^\top\vk_t - \vv_t\|_2^2$.
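The three claims above (overwrite at $\beta_t=1$, no change at $\beta_t=0$, and the gradient-step reading) can be verified on a toy state. The helper name `delta_step` and the random inputs are assumptions made for this sketch.

```python
import numpy as np

def delta_step(S, k, v, beta):
    """Delta rule: S_t = S_{t-1} + beta * k (v - S_{t-1}^T k)^T."""
    return S + beta * np.outer(k, v - S.T @ k)

# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_prev = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k = k / np.linalg.norm(k)             # unit-norm key, so k k^T is a projector
v_new = rng.normal(size=d_v)

# beta = 1 replaces the value stored at k; beta = 0 leaves the state untouched.
S_overwrite = delta_step(S_prev, k, v_new, beta=1.0)
S_unchanged = delta_step(S_prev, k, v_new, beta=0.0)

# Factored form: (I - beta k k^T) S_{t-1} + beta k v^T.
beta = 0.3
S_factored = (np.eye(d_k) - beta * np.outer(k, k)) @ S_prev + beta * np.outer(k, v_new)

# Gradient of 0.5 * ||S^T k - v||^2 with respect to S is k (S^T k - v)^T.
grad = np.outer(k, S_prev.T @ k - v_new)
S_gradient_step = S_prev - beta * grad
```

After the $\beta_t=1$ step, reading the state with $\vk_t$ returns the new value exactly, while the gradient step and the factored form both coincide with `delta_step` at the same $\beta_t$.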

Gated DeltaNet combines these two operations [@yang2025gated], $$\begin{aligned}
    \rmS_t
    = \alpha_t(\rmI - \beta_t\vk_t\vk_t^\top)\rmS_{t-1} + \beta_t\vk_t\vv_t^\top .
\label{eq_gdn_recurrence}
\end{aligned}$$ The decay clears the state uniformly, while the delta rule edits a selected association. This is a useful division of labor, but both gates are scalar per head.

KDA refines the decay side by replacing the scalar $\alpha_t$ with a channel-wise vector $\valpha_t \in (0,1]^{d_k}$ [@team2025kimi]. With $\rmD_t = \operatorname{Diag}(\valpha_t)$, its update can be written as $$\begin{aligned}
    \rmS_t
    = (\rmI - \beta_t\vk_t\vk_t^\top)\rmD_t\rmS_{t-1} + \beta_t\vk_t\vv_t^\top .
\label{eq_kda_recurrence}
\end{aligned}$$ KDA lets each key channel decay at its own rate and retains the efficient WY-based chunkwise algorithm of DeltaNet [@bischof_wy_1985; @yang2024parallelizing]. Yet the active gate $\beta_t$ is still a single scalar. It controls both how much old content is erased from the read direction and how much new value is written. Gated DeltaNet-2 starts from this remaining tie.

# Gated DeltaNet-2 {#sec_method}

## Decoupling erase and write {#subsec_decoupling}

KDA refines Gated DeltaNet by making the decay channel-wise, but the scalar $\beta_t$ in Eq. `\ref{eq_kda_recurrence}`{=latex} still carries two decisions that need not agree. One decision lives on the key side and determines which coordinates of the current read should be erased. The other lives on the value side and determines which coordinates of the candidate value should be written. Treating both decisions as one scalar is a restriction of the update, not a requirement of the delta rule.

Gated DeltaNet-2 separates the two decisions through *Gated Delta Rule-2*. Let $$\begin{aligned}
    \ve_t &= \vb_t \odot \vk_t,
    &
    \vz_t &= \vw_t \odot \vv_t,
\label{eq_gdn2_e_z}
\end{aligned}$$ where $\vb_t \in [0,1]^{d_k}$ is the erase gate and $\vw_t \in [0,1]^{d_v}$ is the write gate. The erase gate weights the key coordinates used to read old content, while the write gate weights the value coordinates being inserted. Let $\rmD_t=\operatorname{Diag}(\valpha_t)$. Applying decay before the active edit gives $$\begin{aligned}
    \bar{\rmS}_t &= \rmD_t\rmS_{t-1},
    &
    \vr_t &= \bar{\rmS}_t^\top\ve_t,
    &
    \rmS_t &= \bar{\rmS}_t + \vk_t(\vz_t - \vr_t)^\top .
\label{eq_gdn2_residual}
\end{aligned}$$ Equivalently, $$\boxed{
    \rmS_t
    = \bigl(\rmI - \vk_t(\vb_t \odot \vk_t)^\top\bigr)\rmD_t\rmS_{t-1}
    + \vk_t(\vw_t \odot \vv_t)^\top
    }
\label{eq_gdn2_recurrence}$$ We refer to Eq. `\ref{eq_gdn2_recurrence}`{=latex} as *Gated Delta Rule-2*. The output is $\vo_t = \rmS_t^\top\vq_t$. The left factor of the erase matrix remains $\vk_t$, which preserves the write direction of the delta rule. The right factor becomes $\vb_t \odot \vk_t$, which makes the read direction channel selective. The write term becomes $\vk_t\vz_t^\top$, which makes the value update channel selective.
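
As a minimal illustrative sketch, the residual form in Eq. `\ref{eq_gdn2_residual}`{=latex} and the boxed form in Eq. `\ref{eq_gdn2_recurrence}`{=latex} can be checked against each other numerically. The NumPy code below assumes a single head with toy dimensions and random inputs. The function names are hypothetical and do not refer to any released implementation.

```python
import numpy as np

def gdr2_step_residual(S_prev, k, v, alpha, b, w):
    """One Gated Delta Rule-2 step in the residual form.

    Shapes: S_prev (d_k, d_v); k, alpha, b (d_k,); v, w (d_v,).
    """
    e = b * k                          # gated erase direction e_t
    z = w * v                          # gated write target z_t
    S_bar = alpha[:, None] * S_prev    # D_t S_{t-1}: decay each key channel (row)
    r = S_bar.T @ e                    # old content read along e_t
    return S_bar + np.outer(k, z - r)  # write the residual along k_t

def gdr2_step_boxed(S_prev, k, v, alpha, b, w):
    """The same step written as the boxed recurrence."""
    erase = np.eye(k.shape[0]) - np.outer(k, b * k)
    return erase @ (np.diag(alpha) @ S_prev) + np.outer(k, w * v)

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S_prev = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)                 # L2-normalized key
v = rng.standard_normal(d_v)
q = rng.standard_normal(d_k)
alpha = rng.uniform(0.5, 1.0, d_k)     # channel-wise decay
b = rng.uniform(0.0, 1.0, d_k)         # erase gate
w = rng.uniform(0.0, 1.0, d_v)         # write gate

S_res = gdr2_step_residual(S_prev, k, v, alpha, b, w)
S_box = gdr2_step_boxed(S_prev, k, v, alpha, b, w)
o = S_res.T @ q                        # output o_t = S_t^T q_t
assert np.allclose(S_res, S_box)
```

The residual form avoids materializing the $d_k \times d_k$ erase matrix, which is why it is the more natural one for a recurrent decoding step.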

Gated Delta Rule-2 recovers KDA exactly when $\vb_t = \beta_t\mathbf{1}_{d_k}$ and $\vw_t = \beta_t\mathbf{1}_{d_v}$. It recovers Gated DeltaNet by further setting $\valpha_t = \alpha_t\mathbf{1}_{d_k}$. Thus the model preserves the known scalar-gated updates as tied subspaces, while learning outside those subspaces when erase and write require different channel structure.
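
These two reductions can be checked numerically. The sketch below is illustrative only: it assumes a single head, toy dimensions, and random inputs, and the helper names are hypothetical. It ties both gates to a scalar $\beta_t$ and compares against Eq. `\ref{eq_kda_recurrence}`{=latex}, then also ties the decay to a scalar and compares against Eq. `\ref{eq_gdn_recurrence}`{=latex}.

```python
import numpy as np

def gdr2_step(S, k, v, alpha, b, w):
    """Gated Delta Rule-2 with vector decay alpha, erase gate b, write gate w."""
    S_bar = alpha[:, None] * S
    return S_bar + np.outer(k, w * v - S_bar.T @ (b * k))

def kda_step(S, k, v, alpha, beta):
    """KDA update: channel-wise decay alpha, scalar beta."""
    I = np.eye(k.shape[0])
    return (I - beta * np.outer(k, k)) @ (alpha[:, None] * S) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, alpha, beta):
    """Gated DeltaNet update: scalar decay alpha, scalar beta."""
    I = np.eye(k.shape[0])
    return alpha * (I - beta * np.outer(k, k)) @ S + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.standard_normal((d_k, d_v))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
beta, alpha_scalar = 0.7, 0.9
alpha_vec = rng.uniform(0.5, 1.0, d_k)

# Tie both gates to the same scalar: b = beta * 1, w = beta * 1.
b_tied = np.full(d_k, beta)
w_tied = np.full(d_v, beta)

same_as_kda = np.allclose(
    gdr2_step(S, k, v, alpha_vec, b_tied, w_tied),
    kda_step(S, k, v, alpha_vec, beta),
)
# Additionally tie the decay: alpha = alpha_scalar * 1.
same_as_gdn = np.allclose(
    gdr2_step(S, k, v, np.full(d_k, alpha_scalar), b_tied, w_tied),
    gated_deltanet_step(S, k, v, alpha_scalar, beta),
)
assert same_as_kda and same_as_gdn
```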

The layer produces the two gates with independent projections of the token representation, $$\begin{aligned}
    \vb_t &= \sigma(\mathbf{W}_b\vx_t),
    &
    \vw_t &= \sigma(\mathbf{W}_w\vx_t) .
\label{eq_gdn2_gate_param}
\end{aligned}$$ The log-decay follows the Gated DeltaNet parameterization, $$\begin{aligned}
    \boldsymbol{g}_t
    = -\exp(\mathbf{a}) \odot \operatorname{softplus}(\mathbf{W}_f\vx_t + \boldsymbol{\delta}),
    \qquad
    \valpha_t = \exp(\boldsymbol{g}_t) .
\label{eq_gdn2_decay_param}
\end{aligned}$$ In practice this decay activation is computed in fp32 before the kernel consumes it, which avoids precision loss in the cumulative log-decay. We also support the negative-eigenvalue variant of [@Grazzi2024UnlockingSI] by scaling only the erase gate to $[0,2]^{d_k}$. The write gate remains in $[0,1]^{d_v}$ because the spectral effect concerns the state transition, not the value magnitude.
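
The gate and decay parameterization can be sketched as follows. This is a minimal single-head illustration, not the layer's actual code. The shapes of $\mathbf{W}_b$, $\mathbf{W}_w$, $\mathbf{W}_f$, $\mathbf{a}$, and $\boldsymbol{\delta}$ are assumptions, as is realizing the $[0,2]^{d_k}$ range by multiplying the sigmoid by two. The function name `gdn2_gates` and its flag are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(np.float32(0.0), x)

def gdn2_gates(x, W_b, W_w, W_f, a, delta, negative_eigenvalues=False):
    """Erase gate b, write gate w, log-decay g, and decay alpha for one token."""
    b = sigmoid(W_b @ x)                       # erase gate in (0, 1)^{d_k}
    w = sigmoid(W_w @ x)                       # write gate in (0, 1)^{d_v}
    if negative_eigenvalues:
        b = 2.0 * b                            # assumption: rescale erase gate only
    pre = (W_f @ x + delta).astype(np.float32) # decay activation kept in fp32
    g = -np.exp(a.astype(np.float32)) * softplus(pre)
    alpha = np.exp(g)                          # alpha in (0, 1]
    return b, w, g, alpha

# Toy shapes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 8, 4, 3
x = rng.standard_normal(d_model)
W_b = 0.1 * rng.standard_normal((d_k, d_model))
W_w = 0.1 * rng.standard_normal((d_v, d_model))
W_f = 0.1 * rng.standard_normal((d_k, d_model))
a = np.zeros(d_k)
delta = np.zeros(d_k)

b, w, g, alpha = gdn2_gates(x, W_b, W_w, W_f, a, delta)
b2, w2, _, _ = gdn2_gates(x, W_b, W_w, W_f, a, delta, negative_eigenvalues=True)
```

Because the softplus is positive and $\exp(\mathbf{a})$ is positive, $\boldsymbol{g}_t$ is always negative, which keeps every decay channel in $(0,1]$ without clipping.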

## Fast-weight update perspective

We can interpret Gated Delta Rule-2 as an online update of a fast-weight memory state [@longhorn]. The state $\rmS_t$ stores transient key-value associations. At each token, the model first forms a decayed state $\bar{\rmS}_t=\rmD_t\rmS_{t-1}$, reads the old content through the gated erase direction $\ve_t$, and writes a correction toward the gated value target $\vz_t$.

More formally, Eq. `\ref{eq_gdn2_residual}`{=latex} is the solution of the local online problem $$\begin{aligned}
    \rmS_t
    =
    \operatorname*{arg\,min}_{\rmS}
    \boldsymbol{L}_t(\rmS),
    \qquad
    \boldsymbol{L}_t(\rmS)
    =
    \|\rmS-\bar{\rmS}_t\|_F^2
    -
    2
    \left\langle
    \rmS^\top\vk_t,
    \vz_t-\bar{\rmS}_t^\top\ve_t
    \right\rangle .
\label{eq_gdn2_online_objective}
\end{aligned}$$ The first term keeps the new state close to the decayed memory. The second term applies an associative edit whose residual compares the gated write target $\vz_t$ against the content read from $\bar{\rmS}_t$ along $\ve_t$. Since $$\begin{aligned}
    \nabla_{\rmS}\boldsymbol{L}_t(\rmS)
    =
    2(\rmS-\bar{\rmS}_t)
    -
    2\vk_t
    \left(
    \vz_t-\bar{\rmS}_t^\top\ve_t
    \right)^\top ,
\end{aligned}$$ the minimizer is $$\begin{aligned}
    \rmS_t
    =
    \bar{\rmS}_t
    +
    \vk_t
    \left(
    \vz_t-\bar{\rmS}_t^\top\ve_t
    \right)^\top ,
\end{aligned}$$ which is exactly Eq. `\ref{eq_gdn2_residual}`{=latex}.
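
That this stationary point is the unique global minimizer can be seen by completing the square. In the short derivation below, $\boldsymbol{\rho}_t = \vz_t-\bar{\rmS}_t^\top\ve_t$ is shorthand introduced only for this step, and we use $\langle \rmS^\top\vk_t, \boldsymbol{\rho}_t\rangle = \langle \rmS, \vk_t\boldsymbol{\rho}_t^\top\rangle_F$: $$\begin{aligned}
    \boldsymbol{L}_t(\rmS)
    &=
    \|\rmS-\bar{\rmS}_t\|_F^2
    -
    2\left\langle \rmS, \vk_t\boldsymbol{\rho}_t^\top \right\rangle_F
    \\
    &=
    \|\rmS-\bar{\rmS}_t-\vk_t\boldsymbol{\rho}_t^\top\|_F^2
    -
    \|\vk_t\|_2^2\,\|\boldsymbol{\rho}_t\|_2^2
    -
    2\left\langle \bar{\rmS}_t^\top\vk_t, \boldsymbol{\rho}_t \right\rangle .
\end{aligned}$$ The last two terms do not depend on $\rmS$, because $\boldsymbol{\rho}_t$ is built from the decayed state $\bar{\rmS}_t$ rather than from the optimization variable. The objective is therefore a strictly convex quadratic in $\rmS$ whose only minimizer is $\rmS=\bar{\rmS}_t+\vk_t\boldsymbol{\rho}_t^\top$.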

Table `\ref{tab_gdn2_online_learning}`{=latex} compares this view with DeltaNet, Mamba-2, Gated DeltaNet, KDA, and Mamba-3. We write all updates in the state orientation used in this paper, where $\vo_t=\rmS_t^\top\vq_t$. Normalizer terms, kernel maps, output gates, and value projection gates are omitted for readability.

For the Mamba-3 row, we use the SISO exponential-trapezoidal recurrence [@lahoti2026mamba3]. Let $$\begin{aligned}
    \widetilde{\vk}_s &= \rmR_{1:s}^\top\vk_s,
    &
    \eta_t &= (1-\lambda_t)\Delta_t\alpha_t,
    &
    \zeta_t &= \lambda_t\Delta_t .
\end{aligned}$$ Here $\rmR_{1:s}$ is the cumulative data-dependent rotation from the complex SSM view, and the previous-token term is omitted at the beginning of a sequence. The MIMO version replaces each rank-one write with a sum over the MIMO rank and leaves the same online form intact.

```{=latex}
\begin{table*}[t]\caption{
Fast-weight update view of DeltaNet, Mamba-2, Gated DeltaNet, KDA, Mamba-3, and Gated DeltaNet-2.
All updates use the state orientation of this paper, where $\vo_t=\rmS_t^\top\vq_t$.
Mamba-2 and Mamba-3 add gated key-value correlation terms to a decayed state.
DeltaNet, Gated DeltaNet, KDA, and Gated DeltaNet-2 instead write a delta residual, the target value minus the value currently read from memory.
}
\scriptsize
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.35}
\begin{tabular*}{\linewidth}{@{}l@{}p{0.455\linewidth}@{}p{0.355\linewidth}@{}}
\toprule
\textbf{Method}
&
\textbf{Local objective $\boldsymbol{L}_t(\rmS)$}
&
\textbf{State update}
\\
\midrule

DeltaNet~\citep{yang2024parallelizing}
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\vk_t,
\beta_t
\left(
\vv_t-\rmS_{t-1}^\top\vk_t
\right)
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
(\rmI-\beta_t\vk_t\vk_t^\top)\rmS_{t-1}
\\
&\quad+
\beta_t\vk_t\vv_t^\top
\end{aligned}
\)
\\
\addlinespace[0.45em]

Mamba-2~\citep{pmlr-v235-dao24a}
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\alpha_t\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\vk_t,
\vv_t
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
\alpha_t\rmS_{t-1}
+
\vk_t\vv_t^\top
\end{aligned}
\)
\\
\addlinespace[0.45em]

Gated DeltaNet~\citep{yang2025gated}
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\alpha_t\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\vk_t,
\beta_t
\left(
\vv_t-(\alpha_t\rmS_{t-1})^\top\vk_t
\right)
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
\alpha_t
(\rmI-\beta_t\vk_t\vk_t^\top)
\rmS_{t-1}
\\
&\quad+
\beta_t\vk_t\vv_t^\top
\end{aligned}
\)
\\
\addlinespace[0.45em]

KDA~\citep{team2025kimi}
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\rmD_t\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\vk_t,
\beta_t
\left(
\vv_t-(\rmD_t\rmS_{t-1})^\top\vk_t
\right)
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
(\rmI-\beta_t\vk_t\vk_t^\top)
\rmD_t\rmS_{t-1}
\\
&\quad+
\beta_t\vk_t\vv_t^\top
\end{aligned}
\)
\\
\addlinespace[0.45em]

Mamba-3~\citep{lahoti2026mamba3}
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\alpha_t\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\widetilde{\vk}_{t-1},
\eta_t\vv_{t-1}
\right\rangle
\\
&-
2
\left\langle
\rmS^\top\widetilde{\vk}_{t},
\zeta_t\vv_t
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
\alpha_t\rmS_{t-1}
+
\eta_t\widetilde{\vk}_{t-1}\vv_{t-1}^\top
\\
&\quad+
\zeta_t\widetilde{\vk}_{t}\vv_t^\top
\end{aligned}
\)
\\
\addlinespace[0.45em]

Gated DeltaNet-2
&
\(\displaystyle
\begin{aligned}[t]
&\|\rmS-\rmD_t\rmS_{t-1}\|_F^2
\\
&-
2
\left\langle
\rmS^\top\vk_t,
\vz_t-(\rmD_t\rmS_{t-1})^\top\ve_t
\right\rangle
\end{aligned}
\)
&
\(\displaystyle
\begin{aligned}[t]
\rmS_t
&=
(\rmI-\vk_t\ve_t^\top)
\rmD_t\rmS_{t-1}
\\
&\quad+
\vk_t\vz_t^\top
\end{aligned}
\)
\\

\bottomrule
\end{tabular*}
\label{tab_gdn2_online_learning}
\end{table*}
```
The comparison separates two families. Mamba-2 and Mamba-3 write correlations into a decayed state. Mamba-3 makes this write more expressive through the exponential-trapezoidal input rule and data-dependent rotations, but it does not subtract a current read from the state. DeltaNet, Gated DeltaNet, and KDA instead perform a residual delta edit. KDA changes the decay from scalar to channel-wise while keeping the scalar-gated residual $\beta_t(\vv_t-\bar{\rmS}_t^\top\vk_t)$. Gated DeltaNet-2 changes the residual itself to $$\begin{aligned}
    \vz_t-\bar{\rmS}_t^\top\ve_t
    =
    \vw_t\odot\vv_t
    -
    (\rmD_t\rmS_{t-1})^\top(\vb_t\odot\vk_t),
\end{aligned}$$ which decouples the coordinates used to erase from the coordinates used to write.

## Chunkwise parallel training {#subsec_chunkwise_training}

We now show that Gated Delta Rule-2 keeps the same chunkwise structure as KDA. Consider one chunk and suppress the chunk index. Let $$\begin{aligned}
    \boldsymbol{G}_r = \sum_{i=1}^{r}\boldsymbol{g}_i,
    \qquad
    \boldsymbol{\gamma}_r = \exp(\boldsymbol{G}_r),
    \qquad
    \boldsymbol{\gamma}_0 = \mathbf{1}_{d_k} .
\label{eq_gdn2_gamma}
\end{aligned}$$ Define the decay-normalized state $\widehat{\rmS}_r$ by $\rmS_r = \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r$. With $\widehat{\rmS}_0=\rmS_{[n]}$, Eq. `\ref{eq_gdn2_recurrence}`{=latex} becomes a pure asymmetric delta recurrence, $$\begin{aligned}
    \widehat{\rmS}_r
    = \bigl(\rmI - \bar{\vk}_r\bar{\ve}_r^\top\bigr)\widehat{\rmS}_{r-1}
    + \bar{\vk}_r\vz_r^\top,
    \qquad
    \bar{\vk}_r = \boldsymbol{\gamma}_r^{-1}\odot\vk_r,
    \qquad
    \bar{\ve}_r = \boldsymbol{\gamma}_r\odot\ve_r .
\label{eq_gdn2_normalized_delta}
\end{aligned}$$ This normalization is the key to the efficient form. The channel-wise decay is absorbed into the two factors of each rank-one erase, so the state transition across a chunk remains a product of matrices of the form $\rmI - \bar{\vk}_r\bar{\ve}_r^\top$.
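
The step from Eq. `\ref{eq_gdn2_recurrence}`{=latex} to Eq. `\ref{eq_gdn2_normalized_delta}`{=latex} can be spelled out as follows. Since $\boldsymbol{\gamma}_r = \valpha_r\odot\boldsymbol{\gamma}_{r-1}$, we have $\rmD_r\operatorname{Diag}(\boldsymbol{\gamma}_{r-1}) = \operatorname{Diag}(\boldsymbol{\gamma}_r)$. Substituting $\rmS_r = \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r$ and $\rmS_{r-1} = \operatorname{Diag}(\boldsymbol{\gamma}_{r-1})\widehat{\rmS}_{r-1}$ gives $$\begin{aligned}
    \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r
    &=
    \bigl(\rmI - \vk_r\ve_r^\top\bigr)\operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_{r-1}
    + \vk_r\vz_r^\top ,
    \\
    \widehat{\rmS}_r
    &=
    \widehat{\rmS}_{r-1}
    - \bigl(\boldsymbol{\gamma}_r^{-1}\odot\vk_r\bigr)\bigl(\boldsymbol{\gamma}_r\odot\ve_r\bigr)^\top\widehat{\rmS}_{r-1}
    + \bigl(\boldsymbol{\gamma}_r^{-1}\odot\vk_r\bigr)\vz_r^\top ,
\end{aligned}$$ where the second line left-multiplies by $\operatorname{Diag}(\boldsymbol{\gamma}_r)^{-1}$ and uses $\ve_r^\top\operatorname{Diag}(\boldsymbol{\gamma}_r) = (\boldsymbol{\gamma}_r\odot\ve_r)^\top$. The inverse exists because every entry of $\boldsymbol{\gamma}_r$ is an exponential and hence positive.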

Let $\rmB\in\mathbb{R}^{C\times d_k}$ and $\rmW\in\mathbb{R}^{C\times d_v}$ contain rows $\vb_r^\top$ and $\vw_r^\top$, respectively. For compact matrix notation, let $\boldsymbol{\gamma}\in\mathbb{R}^{C\times d_k}$ contain rows $\boldsymbol{\gamma}_r^\top$. Let $\bar{\rmK}\in\mathbb{R}^{C\times d_k}$, $\bar{\rmE}\in\mathbb{R}^{C\times d_k}$, and $\rmZ\in\mathbb{R}^{C\times d_v}$ contain rows $\bar{\vk}_r^\top$, $\bar{\ve}_r^\top$, and $\vz_r^\top$. Equivalently, $$\begin{aligned}
    \bar{\rmE}
    =
    \boldsymbol{\gamma}\odot(\rmB\odot\rmK),
    \qquad
    \rmZ
    =
    \rmW\odot\rmV .
\label{eq_gdn2_gate_matrices}
\end{aligned}$$ Define the strictly lower triangular matrix $$\begin{aligned}
    \rmT = \operatorname{tril}(\bar{\rmE}\bar{\rmK}^\top, -1),
    \qquad
    \rmA = (\rmI + \rmT)^{-1} .
\label{eq_gdn2_wy_matrix}
\end{aligned}$$ The WY auxiliaries are $$\begin{aligned}
    \rmY = \rmA\bar{\rmE},
    \qquad
    \rmU = \rmA\rmZ .
\label{eq_gdn2_wy_aux}
\end{aligned}$$ Here $\rmY$ is the erase-side auxiliary and $\rmU$ is the write-side auxiliary. Since $\rmT$ is strictly lower triangular, $\rmI + \rmT$ is unit lower triangular and therefore always invertible, and $\rmA$ is obtained by a small forward substitution inside each chunk.

The end-of-chunk state is then $$\begin{aligned}
    \rmS_{[n+1]}
    =
    \operatorname{Diag}(\boldsymbol{\gamma}_C)\rmS_{[n]}
    +
    \rmK_{\mathrm{tail}}^\top(\rmU - \rmY\rmS_{[n]}),
\label{eq_gdn2_chunk_state}
\end{aligned}$$ where row $r$ of $\rmK_{\mathrm{tail}}$ is $(\boldsymbol{\gamma}_C / \boldsymbol{\gamma}_r)\odot\vk_r$. The output block is $$\begin{aligned}
    \rmO_{[n]}
    =
    \rmQ_{\gamma}\rmS_{[n]}
    +
    \rmA_{qk}(\rmU - \rmY\rmS_{[n]}),
\label{eq_gdn2_chunk_output}
\end{aligned}$$ where row $r$ of $\rmQ_{\gamma}$ is $\boldsymbol{\gamma}_r\odot\vq_r$ and $$\begin{aligned}
    (\rmA_{qk})_{rs}
    =
    \mathbf{1}_{r\ge s}\,
    \vq_r^\top
    \operatorname{Diag}(\boldsymbol{\gamma}_r / \boldsymbol{\gamma}_s)
    \vk_s .
\label{eq_gdn2_chunk_scores}
\end{aligned}$$ Equations `\ref{eq_gdn2_chunk_state}`{=latex} and `\ref{eq_gdn2_chunk_output}`{=latex} have the same shape as the KDA chunk equations. The only difference is how $\rmY$ and $\rmU$ are formed. The erase gate enters through row $r$ of $\bar{\rmE}$ as $\boldsymbol{\gamma}_r\odot(\vb_r\odot\vk_r)$. The write gate enters through row $r$ of $\rmZ$ as $\vw_r\odot\vv_r$. The rest of the computation is a triangular solve and dense matrix multiplication over fixed-size chunks. We use the UT transform [@Joffrain2006AccumulatingHT] and implement these equations with fused Triton kernels [@tillet_triton_2019]. Kernel schedules and precision choices are deferred to the supplement.
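
To make the chunk equations concrete, the following NumPy sketch evaluates Eqs. `\ref{eq_gdn2_gamma}`{=latex} through `\ref{eq_gdn2_chunk_scores}`{=latex} for one chunk and compares the result with the token-by-token recurrence. It is a reference-style illustration under stated assumptions (single head, toy sizes, random inputs, float64, a dense inverse in place of forward substitution), not the fused Triton kernels. The function names are hypothetical.

```python
import numpy as np

def sequential(S0, Q, K, V, B, W, G):
    """Token-by-token Gated Delta Rule-2 over one chunk (reference)."""
    S, outs = S0.copy(), []
    for q, k, v, b, w, g in zip(Q, K, V, B, W, G):
        S_bar = np.exp(g)[:, None] * S                  # channel-wise decay
        S = S_bar + np.outer(k, w * v - S_bar.T @ (b * k))
        outs.append(S.T @ q)                            # o_t = S_t^T q_t
    return np.stack(outs), S

def chunkwise(S0, Q, K, V, B, W, G):
    """Same computation via the WY-style chunk equations."""
    C = K.shape[0]
    gamma = np.exp(np.cumsum(G, axis=0))                # rows gamma_r, shape (C, d_k)
    K_bar = K / gamma                                   # rows gamma_r^{-1} * k_r
    E_bar = gamma * (B * K)                             # rows gamma_r * (b_r * k_r)
    Z = W * V                                           # rows w_r * v_r
    T = np.tril(E_bar @ K_bar.T, -1)                    # strictly lower triangular
    A = np.linalg.inv(np.eye(C) + T)                    # dense inverse, for clarity only
    Y, U = A @ E_bar, A @ Z                             # erase-side and write-side auxiliaries
    resid = U - Y @ S0                                  # per-token residuals, shape (C, d_v)
    K_tail = (gamma[-1] / gamma) * K                    # rows (gamma_C / gamma_r) * k_r
    S_next = gamma[-1][:, None] * S0 + K_tail.T @ resid
    Q_gamma = gamma * Q                                 # rows gamma_r * q_r
    A_qk = np.tril(Q_gamma @ K_bar.T)                   # keeps the diagonal r = s
    O = Q_gamma @ S0 + A_qk @ resid
    return O, S_next

# Toy chunk with random inputs, for illustration only.
rng = np.random.default_rng(0)
C, d_k, d_v = 8, 4, 3
Q = rng.standard_normal((C, d_k))
K = rng.standard_normal((C, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.standard_normal((C, d_v))
B = rng.uniform(0.0, 1.0, (C, d_k))
W = rng.uniform(0.0, 1.0, (C, d_v))
G = -rng.uniform(0.01, 0.3, (C, d_k))                   # log-decay, strictly negative
S0 = rng.standard_normal((d_k, d_v))

O_seq, S_seq = sequential(S0, Q, K, V, B, W, G)
O_chk, S_chk = chunkwise(S0, Q, K, V, B, W, G)
assert np.allclose(O_seq, O_chk) and np.allclose(S_seq, S_chk)
```

The sketch divides by $\boldsymbol{\gamma}_r$ directly, which is acceptable for a short chunk in float64. A production kernel would need to handle the growth of $\boldsymbol{\gamma}_r^{-1}$ more carefully, which is one reason the rewriting is applied per chunk.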

## Gate-aware backward {#subsec_gate_aware_backward}

The backward pass follows the same decomposition as the forward. Gradients first flow through the output equation and the inter-chunk state recurrence, both of which operate only on $\rmA_{qk}$, $\rmK_{\mathrm{tail}}$, $\rmY$, and $\rmU$. The only new accounting is the vector-Jacobian product through Eq. `\ref{eq_gdn2_wy_aux}`{=latex} and Eq. `\ref{eq_gdn2_wy_matrix}`{=latex}.

For scalar-gated delta rules, a factor $\beta_r$ can be moved outside the dot products that accumulate the gradient of $\rmA$. That shortcut breaks for Gated Delta Rule-2. The write side contains a different diagonal gate over value channels, and the erase side contains a different diagonal gate over key channels. Therefore the gate factors must be present at the accumulation sites. Let $\rmB$ and $\rmW$ contain rows $\vb_r^\top$ and $\vw_r^\top$, respectively, and let $\boldsymbol{\gamma}$ denote the row-stacked cumulative-decay vectors. Then $$\begin{aligned}
    \mathrm{d}\rmA &\mathrel{+}= \mathrm{d}\rmU\,\rmZ^\top,
    &
    \rmZ &= \rmW\odot\rmV,
\label{eq_gdn2_backward_u}\\
    \mathrm{d}\rmA &\mathrel{+}= \mathrm{d}\rmY\,\bar{\rmE}^\top,
    &
    \bar{\rmE} &= \boldsymbol{\gamma}\odot(\rmB\odot\rmK) .
\label{eq_gdn2_backward_y}
\end{aligned}$$ The inverse itself has the standard triangular vector-Jacobian product $$\begin{aligned}
    \mathrm{d}\rmT =
    -\operatorname{tril}
    \bigl(
    \rmA^\top\mathrm{d}\rmA\rmA^\top,
    -1
    \bigr).
\label{eq_gdn2_inverse_vjp}
\end{aligned}$$ From there, gradients to $\rmB$, $\rmW$, $\rmK$, $\rmV$, and the cumulative decay follow by ordinary elementwise products and reverse cumulative sums. This gate-aware accumulation is the main mathematical change required for training Gated Delta Rule-2. The remaining backward kernels retain the same matrix shapes as KDA and can reuse the same state and output vector-Jacobian product structure.
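
The accumulation in Eqs. `\ref{eq_gdn2_backward_u}`{=latex} and `\ref{eq_gdn2_backward_y}`{=latex} and the inverse vector-Jacobian product in Eq. `\ref{eq_gdn2_inverse_vjp}`{=latex} can be sanity-checked with finite differences. The sketch below is illustrative only. It treats $\bar{\rmE}$ and $\rmZ$ as fixed random matrices, uses random upstream gradients, and perturbs only the strictly lower-triangular entries of $\rmT$. All of these are toy assumptions rather than the actual backward kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_k, d_v = 5, 4, 3
E_bar = rng.standard_normal((C, d_k))     # stands in for gamma * (B * K)
Z = rng.standard_normal((C, d_v))         # stands in for W * V
T = np.tril(0.3 * rng.standard_normal((C, C)), -1)
dY = rng.standard_normal((C, d_k))        # upstream gradient w.r.t. Y
dU = rng.standard_normal((C, d_v))        # upstream gradient w.r.t. U

def scalar_loss(T_mat):
    """Linear probe <dY, Y> + <dU, U> with Y = A E_bar and U = A Z."""
    A_mat = np.linalg.inv(np.eye(C) + T_mat)
    return np.sum(dY * (A_mat @ E_bar)) + np.sum(dU * (A_mat @ Z))

# Analytic gradients: gate-aware accumulation, then the triangular inverse VJP.
A = np.linalg.inv(np.eye(C) + T)
dA = dU @ Z.T + dY @ E_bar.T
dT = -np.tril(A.T @ dA @ A.T, -1)

# Central finite differences over the strictly lower-triangular entries of T.
eps = 1e-6
dT_fd = np.zeros_like(T)
for r in range(C):
    for s in range(r):
        P = np.zeros_like(T)
        P[r, s] = eps
        dT_fd[r, s] = (scalar_loss(T + P) - scalar_loss(T - P)) / (2 * eps)

max_err = np.max(np.abs(dT - dT_fd))
assert np.allclose(dT, dT_fd, atol=1e-5)
```

Note that `dA` is formed from the full gated matrices `Z` and `E_bar`. Replacing them with ungated `V` and `K` and rescaling afterwards would only be valid when each gate is a per-token scalar.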

## Block design and hybrid models {#subsec_block_design}

```{=latex}
\definecolor{fgate_color}{RGB}{252,224,225}
```
```{=latex}
\definecolor{wgate_color}{RGB}{255,210,210}
```
```{=latex}
\definecolor{delta_color}{RGB}{242,243,193}
```
```{=latex}
\definecolor{swa_color}{RGB}{252,224,225}
```
```{=latex}
\definecolor{add_norm_color}{RGB}{252,226,187}
```
```{=latex}
\definecolor{glu_color}{RGB}{194,232,247}
```
```{=latex}
\definecolor{silu_color}{RGB}{203,231,207}
```
```{=latex}
\definecolor{linear_color}{RGB}{220,223,240}
```
```{=latex}
\definecolor{conv_color}{RGB}{252,224,225}
```
```{=latex}
\definecolor{l2_color}{RGB}{252,226,187}
```
```{=latex}
\definecolor{gray_bbox_color}{RGB}{243,243,244}
```
```{=latex}
\definecolor{oproj_color}{RGB}{220,223,240}
```
```{=latex}
\definecolor{operator_color}{RGB}{252,224,225}
```
<figure id="fig:gated_deltanet2_model">

<figcaption> Visualization of the hybrid architecture and block design of <strong>Gated DeltaNet-2</strong>. The <em>Hybrid Gated DeltaNet-2</em> model repeats a Gated DeltaNet-2 token mixer, an MLP, sliding-window attention (SWA), and another MLP. In the block design, query and key paths use linear projection, short convolution, SiLU, and L2 normalization. The value path uses linear projection, short convolution, and SiLU. The central recurrent operator is <em>Gated Delta Rule-2</em>. The decay branch produces <span class="math inline">$\valpha$</span> from the log-decay projection. The channel-wise erase gate <span class="math inline">$\vb$</span> and channel-wise write gate <span class="math inline">$\vw$</span> each use linear projection followed by sigmoid. The recurrent output is normalized, multiplied by a SiLU output gate, and passed through the output projection. </figcaption>
</figure>

#### Gated DeltaNet-2 token mixer.

Gated DeltaNet-2 is used as the recurrent token mixer in a standard Transformer-style block. Fig. `\ref{fig:gated_deltanet2_model}`{=latex} (right) shows its block design. For the Gated Delta Rule-2 in Eq. `\ref{eq_gdn2_recurrence}`{=latex}, $\{\vq_t,\vk_t,\vv_t\}$ are produced by linear projection, short causal convolution, and SiLU, with L2 normalization applied to $\vq_t$ and $\vk_t$ for stability. Separate branches produce the channel-wise decay $\valpha_t$, erase gate $\vb_t$, and write gate $\vw_t$. The recurrent output is RMS-normalized, multiplied by a separate SiLU output gate, and projected back to the model dimension. Throughout the paper, $\boldsymbol{g}$ denotes the log-decay tensor in Eq. `\ref{eq_gdn2_decay_param}`{=latex}, not the output gate. With grouped value heads, $\vq$, $\vk$, the log-decay tensor $\boldsymbol{g}$, and $\vb$ are repeated across value-head groups, while $\vv$ and $\vw$ remain on the value-head axis.

#### Model families.

We train both recurrent and hybrid models. The recurrent model stacks Gated DeltaNet-2 token mixers and MLPs under the standard residual block, isolating the fixed-state memory of Eq. `\ref{eq_gdn2_recurrence}`{=latex}. The hybrid model inserts sliding-window attention (SWA) after the recurrent mixer, as shown in Fig. `\ref{fig:gated_deltanet2_model}`{=latex} (left). A repeated cell contains Gated DeltaNet-2, an MLP, SWA, and another MLP. Gated DeltaNet-2 compresses long histories into constant-size memory, while SWA handles exact local interactions such as short shifts, comparisons, and local retrieval. With a fixed window, the hybrid retains linear sequence scaling and a bounded attention cache, following the recurrent-attention hybrid design pattern [@de_griffin_2024; @ren2024samba].

# Experiments {#sec_exp}

#### Setup

We evaluate each recurrent family in two forms, a recurrent-only model and a hybrid model that pairs the same recurrent token mixer with sliding-window attention as described in Section `\ref{subsec_block_design}`{=latex}. For Mamba-3, we include both SISO and MIMO variants and use MIMO rank $R=4$ following [@lahoti2026mamba3]. All models are trained with the same recipe. Unless stated otherwise, each model has 1.3B parameters and is trained on 100B tokens from FineWeb-Edu [@penedo2024fineweb]. We use AdamW with peak learning rate $4\times10^{-4}$, weight decay $0.1$, gradient clipping at $1.0$, cosine decay, a 1B-token warm-up, and a global batch size of 0.5M tokens. The training length is 4K tokens, and hybrid models use a 2K sliding-window attention size. Evaluation details are given in the appendix.
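
For orientation, the token budgets above translate into optimizer steps as sketched below. The shape of the schedule is not specified in the text, so this sketch assumes that 0.5M means exactly 500,000 tokens per step, that the warm-up is linear, and that the cosine decays to zero. All three are assumptions made for illustration, and `lr_at` is a hypothetical helper.

```python
import math

TOTAL_TOKENS = 100e9      # 100B training tokens
WARMUP_TOKENS = 1e9       # 1B-token warm-up
BATCH_TOKENS = 0.5e6      # assumption: exactly 500,000 tokens per step
PEAK_LR = 4e-4
MIN_LR = 0.0              # assumption: cosine decays to zero

total_steps = round(TOTAL_TOKENS / BATCH_TOKENS)
warmup_steps = round(WARMUP_TOKENS / BATCH_TOKENS)

def lr_at(step):
    """Learning rate at a 0-indexed optimizer step (linear warm-up, cosine decay)."""
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))
```

Under these assumptions the run has 200,000 optimizer steps, of which the first 2,000 are warm-up.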

```{=latex}
\begin{table*}[t!]


\footnotesize
\addtolength{\tabcolsep}{-3.2pt}
\begin{tabular}{l|cc|cccccccccc}
\toprule
\textbf{Model} & \textbf{Wiki.} & \textbf{LMB.} & \textbf{LMB.} & \textbf{PIQA} & \textbf{Hella.} & \textbf{Wino.} & \textbf{ARC-e} & \textbf{ARC-c} & \textbf{OBQA} & \textbf{SIQA} & \textbf{BoolQ} & \textbf{Avg.} \\
 & ppl $\downarrow$ & ppl $\downarrow$ & acc $\uparrow$ & acc $\uparrow$ & acc\_n $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ & acc $\uparrow$ \\
\midrule
\midrule
\textit{Recurrent models} \\
 Mamba-2 & 16.79 & 12.38 & 45.24 & \underline{72.58} & 55.51 & 55.33 & 70.68 & 35.26 & \underline{31.00} & 40.63 & \underline{60.19} & 51.82 \\
 Gated DeltaNet & 16.40 & 11.89 & \textbf{49.62} & 72.31 & \underline{56.50} & \underline{56.75} & 68.81 & 35.15 & 30.20 & 40.53 & 58.78 & 52.07 \\
 KDA & 16.81 & 11.68 & \underline{48.13} & 72.09 & 55.75 & 55.72 & 70.83 & 35.92 & 30.40 & \underline{40.99} & \textbf{60.67} & 52.28 \\
 Mamba-3 (SISO) & \underline{16.30} & 12.99 & 45.06 & 72.31 & 55.58 & 56.20 & 70.45 & 34.56 & \underline{31.00} & \textbf{41.76} & 55.90 & 51.42 \\
 Mamba-3 (MIMO) & 16.45 & \underline{11.66} & 47.82 & 72.36 & 56.49 & 55.78 & \underline{72.38} & \underline{38.07} & 30.00 & 40.89 & 57.74 & \underline{52.39} \\
 Gated DeltaNet-2 & \textbf{15.90} & \textbf{11.41} & 48.09 & \textbf{72.80} & \textbf{56.84} & \textbf{57.85} & \textbf{72.43} & \textbf{38.23} & \textbf{31.60} & 40.58 & 59.54 & \textbf{53.11} \\
\midrule
\textit{Attention or hybrid models} \\
 Transformer & 19.22 & 13.72 & 48.32 & 70.21 & 56.12 & 55.85 & 69.23 & 33.84 & 25.00 & 39.74 & 59.42 & 50.86 \\
 Mamba-2 & 17.46 & 11.29 & 48.05 & 71.47 & 57.52 & 56.17 & 70.50 & 34.73 & 29.80 & 40.35 & 59.31 & 51.99 \\
 Gated DeltaNet & 16.00 & 10.82 & 48.71 & 70.06 & 57.50 & 56.83 & 70.41 & 35.15 & 30.60 & 40.97 & 60.00 & 52.25 \\
 KDA & 16.01 & 10.66 & 49.21 & 71.06 & 56.89 & \underline{57.77} & \underline{71.59} & 35.07 & 30.00 & 40.53 & \underline{62.03} & 52.68 \\
 Mamba-3 (SISO) & \textbf{15.54} & \underline{10.65} & 49.19 & 71.01 & \textbf{58.75} & 57.30 & 70.54 & 36.35 & \underline{32.00} & \underline{41.20} & 57.86 & 52.69 \\
 Mamba-3 (MIMO) & 15.81 & 10.92 & \underline{49.82} & \underline{71.98} & 58.19 & 57.06 & 70.54 & \textbf{38.48} & 29.40 & 40.99 & 57.98 & \underline{52.72} \\
 Gated DeltaNet-2 & \underline{15.62} & \textbf{10.43} & \textbf{50.90} & \textbf{72.20} & \underline{58.46} & \textbf{58.56} & \textbf{71.89} & \underline{36.69} & \textbf{33.00} & \textbf{41.50} & \textbf{62.57} & \textbf{53.97} \\
\bottomrule
\end{tabular}
\addtolength{\tabcolsep}{3.2pt}
\caption{
Performance comparison on language modeling and zero-shot common-sense reasoning.
All accuracy values are reported as percentages.
Avg. is computed over LAMBADA accuracy and the listed reasoning accuracies.
Best values within each model group are bolded; second-best values are underlined.
}
\label{tab_commonsense_results}
\end{table*}
```
```{=latex}
\begin{table*}[t!]

\footnotesize
\setlength{\tabcolsep}{2.0pt}
\renewcommand{\arraystretch}{1.03}
\newcommand{\niahbest}[1]{\textbf{#1}}
\newcommand{\niahsecond}[1]{\underline{#1}}
\newcommand{\modelname}[1]{#1}
\begin{tabular}{@{}l@{}cccc@{}cccc@{}ccc@{}ccc@{}}
\toprule
\textbf{Model} &
\multicolumn{4}{c}{\textbf{S-NIAH-1}} &
\multicolumn{4}{c}{\textbf{S-NIAH-2}} &
\multicolumn{3}{c}{\textbf{S-NIAH-3}} &
\multicolumn{3}{c}{\textbf{MK-NIAH-1}} \\
\cmidrule(lr){2-5}
\cmidrule(lr){6-9}
\cmidrule(lr){10-12}
\cmidrule(l){13-15}
& \textbf{1K} & \textbf{2K} & \textbf{4K} & \textbf{8K}
& \textbf{1K} & \textbf{2K} & \textbf{4K} & \textbf{8K}
& \textbf{1K} & \textbf{2K} & \textbf{4K}
& \textbf{1K} & \textbf{2K} & \textbf{4K} \\
\midrule
\textit{Recurrent models} \\
\addlinespace[0.2ex]
\modelname{Mamba-2}
& \niahbest{100.0} & \niahbest{100.0} & 97.0 & 55.8
& 99.6 & \niahsecond{99.6} & 62.6 & 21.0
& 59.2 & 38.6 & 14.4
& 29.0 & 21.2 & 21.4 \\
\modelname{Gated DeltaNet}
& \niahsecond{99.8} & \niahbest{100.0} & \niahbest{100.0} & \niahsecond{97.6}
& \niahbest{100.0} & \niahbest{100.0} & 87.2 & \niahsecond{32.0}
& \niahsecond{89.8} & 54.2 & \niahbest{60.6}
& \niahsecond{58.0} & 37.0 & 27.8 \\
\modelname{KDA}
& \niahbest{100.0} & \niahbest{100.0} & \niahsecond{99.2} & 70.6
& \niahbest{100.0} & \niahbest{100.0} & \niahsecond{89.0} & 30.6
& 77.4 & 63.2 & 26.2
& 54.0 & \niahsecond{44.2} & \niahsecond{28.0} \\
\modelname{Mamba-3 (SISO)}
& \niahbest{100.0} & 99.0 & 63.4 & 27.8
& \niahsecond{99.8} & 99.0 & 59.4 & 25.2
& 60.2 & 35.6 & 12.2
& 44.8 & 27.4 & 20.2 \\
\modelname{Mamba-3 (MIMO)}
& \niahbest{100.0} & \niahsecond{99.8} & 93.0 & 35.6
& \niahsecond{99.8} & 98.8 & 64.2 & 27.2
& 89.2 & \niahsecond{72.4} & 29.2
& 49.4 & 19.2 & 18.0 \\
\modelname{Gated DeltaNet-2}
& \niahbest{100.0} & \niahbest{100.0} & \niahbest{100.0} & \niahbest{97.8}
& \niahbest{100.0} & \niahbest{100.0} & \niahbest{93.0} & \niahbest{39.2}
& \niahbest{92.0} & \niahbest{89.8} & \niahsecond{31.8}
& \niahbest{72.6} & \niahbest{51.4} & \niahbest{37.8} \\
\midrule
\textit{Attention or hybrid models} \\
\addlinespace[0.2ex]
\modelname{Transformer}
& \niahbest{100.0} & \niahbest{100.0} & 51.2 & 0.0
& \niahbest{100.0} & \niahbest{100.0} & 44.2 & 0.0
& 95.8 & 94.8 & 37.0
& 75.6 & 66.6 & 38.2 \\
\modelname{Mamba-2}
& \niahbest{100.0} & \niahbest{100.0} & \niahsecond{51.8} & 25.4
& \niahbest{100.0} & 99.6 & 52.4 & 25.8
& 97.8 & 86.8 & 48.0
& 82.0 & 58.6 & 39.0 \\
\modelname{Gated DeltaNet}
& \niahbest{100.0} & \niahbest{100.0} & 47.2 & 22.4
& \niahbest{100.0} & \niahsecond{99.8} & 57.3 & 25.6
& 94.8 & 91.2 & 47.2
& 91.0 & 78.4 & 44.8 \\
\modelname{KDA}
& \niahbest{100.0} & \niahbest{100.0} & \niahsecond{51.8} & \niahsecond{26.2}
& \niahbest{100.0} & \niahbest{100.0} & 56.0 & 23.0
& 97.2 & 93.4 & 51.6
& \niahsecond{91.4} & \niahsecond{84.0} & 40.4 \\
\modelname{Mamba-3 (SISO)}
& \niahbest{100.0} & \niahbest{100.0} & 49.6 & 26.0
& \niahbest{100.0} & \niahbest{100.0} & \niahbest{58.2} & \niahsecond{27.8}
& 95.0 & 90.4 & 44.0
& 78.8 & 65.6 & 33.6 \\
\modelname{Mamba-3 (MIMO)}
& \niahbest{100.0} & \niahbest{100.0} & 49.0 & 22.8
& \niahbest{100.0} & \niahbest{100.0} & 53.0 & \niahsecond{27.8}
& \niahsecond{99.4} & \niahsecond{98.4} & \niahsecond{54.2}
& 82.4 & 79.0 & \niahsecond{46.6} \\
\modelname{Gated DeltaNet-2}
& \niahbest{100.0} & \niahbest{100.0} & \niahbest{55.2} & \niahbest{27.4}
& \niahbest{100.0} & \niahbest{100.0} & \niahsecond{57.9} & \niahbest{29.2}
& \niahbest{99.6} & \niahbest{99.0} & \niahbest{55.6}
& \niahbest{93.0} & \niahbest{84.6} & \niahbest{48.0} \\
\bottomrule
\end{tabular}
\caption{
Accuracy on Single Needle-In-A-Haystack (S-NIAH) and Multi-Key Needle-In-A-Haystack (MK-NIAH) tasks from RULER.
Best values within each model family and context length are bolded; second-best values are underlined.
}
\label{tab_niah_results}
\end{table*}
```
#### Language modeling and common-sense reasoning

Table `\ref{tab_commonsense_results}`{=latex} reports WikiText and LAMBADA perplexity [@merity2016pointer; @paperno_lambada_2016], zero-shot LAMBADA accuracy, and the common-sense suite from PIQA through BoolQ [@bisk2020piqa; @zellers2019hellaswag; @sakaguchi2021winogrande; @arc-ce; @openbookqa; @sap2019social; @clark2019boolq]. Gated DeltaNet-2 achieves the best average accuracy in both settings, 53.11 for recurrent models and 53.97 for hybrids, against 52.39 and 52.72 for the next-best Mamba-3 (MIMO). Since recurrent state size is matched, the gain points to a stronger update rule rather than a larger memory. The trend persists with SWA. Hybrid Mamba-3 (SISO) reaches a slightly lower WikiText perplexity (15.54 versus 15.62), but Gated DeltaNet-2 has the lowest LAMBADA perplexity and the highest average accuracy in both settings, so it is more balanced than Mamba-3 across perplexity and zero-shot accuracy.

#### In-context retrieval on synthetic data

Table `\ref{tab_niah_results}`{=latex} reports S-NIAH and MK-NIAH from RULER [@hsieh2024ruler], which test retention, interference control, high-entropy value storage, and multi-key discrimination under fixed-state memory. Gated DeltaNet-2 is strongest where memory editing matters most. In the recurrent setting, it leads the interference-heavy S-NIAH-2 cases at 4K and 8K and all MK-NIAH-1 lengths. The hybrid model shows the same pattern, leading the 4K and 8K S-NIAH-1 cases, the 8K S-NIAH-2 case, all S-NIAH-3 lengths, and all MK-NIAH-1 lengths. These gains match the design of Gated Delta Rule-2. The key-side erase gate $\vb_t$ selectively protects or revises key channels, while the value-side write gate $\vw_t$ controls which value channels enter the state. With SWA handling local evidence, this decoupled recurrent update preserves longer-range associations more effectively than a scalar delta gate.

```{=latex}
\begin{wraptable}[16]{r}{0.62\linewidth}


\footnotesize
\setlength{\tabcolsep}{1.3pt}
\setlength{\abovecaptionskip}{2pt}
\setlength{\belowcaptionskip}{-4pt}
\begin{tabular}{l|ccccccc}
\toprule
\textbf{Models} & \textbf{SWDE} & \textbf{SQD} & \textbf{FDA} & \textbf{TQA} & \textbf{NQ} & \textbf{DROP} & \textbf{Avg.} \\
\midrule
\midrule
\textit{Recurrent models} \\
 Mamba-2 & 17.24 & 32.38 & 14.53 & 58.35 & 18.91 & 19.60 & 26.84 \\
 Gated DeltaNet & 17.90 & 32.67 & \underline{18.52} & \underline{59.60} & \textbf{20.16} & 19.69 & 28.09 \\
 Mamba-3 (SISO) & 17.62 & 35.07 & 11.08 & 58.89 & 18.18 & \underline{21.32} & 27.03 \\
 KDA & \underline{22.49} & 35.10 & 14.90 & 58.12 & 19.58 & \textbf{21.80} & \underline{28.67} \\
 Mamba-3 (MIMO) & 16.68 & \underline{36.65} & 17.44 & 59.06 & 19.16 & 21.08 & 28.35 \\
 Gated DeltaNet-2 & \textbf{23.65} & \textbf{36.75} & \textbf{19.98} & \textbf{61.37} & \underline{19.64} & 17.87 & \textbf{29.88} \\
\midrule
\textit{Attention or hybrid models} \\
 Transformer & 32.21 & 38.67 & 54.78 & 58.09 & 22.49 & 22.18 & 38.07 \\
 Mamba-2 & 34.67 & 40.74 & 52.31 & 60.13 & 25.91 & \textbf{24.68} & 39.74 \\
 Gated DeltaNet & 33.18 & 42.28 & 50.86 & \underline{60.60} & 25.78 & 21.95 & 39.11 \\
 Mamba-3 (SISO) & 35.30 & \textbf{46.42} & \underline{54.95} & 59.54 & 25.91 & \underline{23.96} & \underline{41.01} \\
 KDA & \underline{39.83} & 40.10 & 53.59 & 59.89 & 25.27 & 22.18 & 40.14 \\
 Mamba-3 (MIMO) & 32.33 & \underline{44.70} & \textbf{55.31} & 59.00 & \underline{26.26} & 23.08 & 40.11 \\
 Gated DeltaNet-2 & \textbf{41.96} & \underline{44.70} & 54.68 & \textbf{62.38} & \textbf{26.31} & 23.67 & \textbf{42.28} \\
\bottomrule
\end{tabular}
\caption{
Accuracy on real-world retrieval tasks with input length truncated to 2K tokens.
SQD denotes SQuAD. TQA denotes TriviaQA.
}
\label{tab_recall_results}

\end{wraptable}
```
#### In-context retrieval on real-world tasks

Table `\ref{tab_recall_results}`{=latex} reports recall-heavy real-world tasks from [@arora-2024-jrt], spanning extraction, question answering, and distractor-rich evidence. These tasks are less controlled than synthetic NIAH but better reflect fixed-state memory under realistic contexts. Gated DeltaNet-2 achieves the best average in both the recurrent (29.88) and hybrid (42.28) settings. Its recurrent gains are strongest on noisy association recovery, where selective erase and gated write are directly useful; it ranks first on SWDE, SQuAD, FDA, and TriviaQA. The remaining recurrent gaps on NQ and DROP point to formats that also need local evidence aggregation, which SWA supplies in the hybrid model.

```{=latex}
\footnotesize
```
```{=latex}
\addtolength{\tabcolsep}{-1.5pt}
```
::: {#tab_ablation}
  ------------------------------------------- ------------------ ------------------ ---------------- ----------------- ----------------- ----------------- ----------------
  **Variant**                                     **Wiki.**           **LMB.**        **Common.**      **S-NIAH-2**      **S-NIAH-3**      **MK-NIAH-1**      **Recall**
                                               ppl $\downarrow$   ppl $\downarrow$   avg $\uparrow$   \@4K $\uparrow$   \@2K $\uparrow$   \@4K $\uparrow$   avg $\uparrow$
  *Channel structure*                                                                                                                                      
  w-only, scalar $\vb_t$, channel $\vw_t$           16.55              11.62             52.45             90.6              71.4              30.6             28.92
  b-only, channel $\vb_t$, scalar $\vw_t$           16.12              11.50             52.79             92.1              84.6              35.2             29.51
  *Erase range*                                                                                                                                            
  Gated DeltaNet-2, $\vb_t \in [0,1]^{d_k}$       **15.90**          **11.41**         **53.11**           93.0            **89.8**          **37.8**         **29.88**
  expanded $\vb_t \in [0,2]^{d_k}$                  15.95              11.44             53.04           **93.1**            89.4              37.6             29.81
  ------------------------------------------- ------------------ ------------------ ---------------- ----------------- ----------------- ----------------- ----------------

  :  Gate structure and erase range ablations in the recurrent-only setting.
:::

```{=latex}
\addtolength{\tabcolsep}{1.5pt}
```
#### Gate structure and erase range ablations. {#para_ablation}

Table `\ref{tab_ablation}`{=latex} evaluates two aspects of the Gated Delta Rule-2 update, the channel structure of the erase and write gates, and the range of the erase gate. For the channel-structure ablations, we average either gate over its channel axis and broadcast the scalar back at runtime, while keeping the original projections unchanged. Thus the parameter count stays fixed and only channel-wise gate variation is removed. Both scalarized variants trail full Gated DeltaNet-2, showing that both gates use their channel degrees of freedom. The asymmetry is clear. Keeping channel structure only in $\vb_t$ recovers most of the full model on language modeling and retrieval, whereas keeping it only in $\vw_t$ recovers less. This matches Eq. `\ref{eq_gdn2_recurrence}`{=latex}, where $\vb_t$ changes the key-side erase factor $\vk_t(\vb_t\odot\vk_t)^\top$, while $\vw_t$ reweights the written value. Finally, expanding the erase range from $[0,1]^{d_k}$ to $[0,2]^{d_k}$ gives no consistent gain at this scale.
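
The scalarization used in the channel-structure ablations can be sketched in a few lines. This is a minimal illustration assuming gates stored as arrays with the channel axis last. The helper `scalarize` is a hypothetical name, and the toy gates are random.

```python
import numpy as np

def scalarize(gate):
    """Average a gate over its channel axis and broadcast the scalar back."""
    mean = gate.mean(axis=-1, keepdims=True)
    return np.broadcast_to(mean, gate.shape).copy()

# Toy gates for T tokens, for illustration only.
rng = np.random.default_rng(0)
T, d_k, d_v = 5, 4, 3
b = rng.uniform(0.0, 1.0, (T, d_k))   # channel-wise erase gate
w = rng.uniform(0.0, 1.0, (T, d_v))   # channel-wise write gate

# "w-only" variant: scalar erase gate, channel-wise write gate.
b_w_only, w_w_only = scalarize(b), w
# "b-only" variant: channel-wise erase gate, scalar write gate.
b_b_only, w_b_only = b, scalarize(w)
```

Because the projections that produce `b` and `w` are untouched, the scalarized variants have the same parameter count as the full model and differ only in the values passed to the recurrence.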

```{=latex}
\begin{wrapfigure}[11]{r}{0.54\linewidth}


\setlength{\intextsep}{4pt}
\setlength{\abovecaptionskip}{2pt}
\setlength{\belowcaptionskip}{-4pt}
\definecolor{gdngreen}{RGB}{92,114,50}
\definecolor{gdn2green}{RGB}{50,140,90}
\definecolor{mambaorange}{RGB}{230,108,32}
\definecolor{kdablue}{RGB}{31,119,180}
\begin{tikzpicture}
\begin{axis}[
    width=\linewidth,
    height=5.3cm,
    xlabel={Seq. length $\times$ batch},
    ylabel={Thousands of Tokens Per Second (Kt/s)},
    grid=major,
    major grid style={black!10},
    axis line style={black!60},
    tick style={black!60},
    xmin=0.9, xmax=4.22,
    ymin=15, ymax=48,
    xtick={1,2,3,4},
    xticklabels={2K$\times$8,4K$\times$4,8K$\times$2,16K$\times$1},
    ytick={25,30,35,40,45},
    tick label style={font=\tiny},
    label style={font=\scriptsize},
    every axis plot/.append style={line width=1.4pt, mark size=2.2pt},
    legend columns=2,
    legend cell align=left,
    legend style={
        at={(0.025,0.004)},
        anchor=south west,
        font=\fontsize{5.0}{5.6}\selectfont,
        draw=black!25,
        fill=white,
        fill opacity=0.96,
        text opacity=1,
        rounded corners=1pt,
        inner xsep=2.2pt,
        inner ysep=1.4pt,
        row sep=0.2pt,
        column sep=4.2pt
    },
    clip=true,
]

\addplot[blue, mark=*] coordinates {
    (1,45.83) (2,42.29) (3,36.72) (4,29.36)
};
\addlegendentry{Transformer}

\addplot[color=mambaorange, mark=pentagon*] coordinates {
    (1,44.42) (2,43.46) (3,43.30) (4,43.26)
};
\addlegendentry{Mamba-2}

\addplot[color=mambaorange!60, mark=pentagon*] coordinates {
    (1,44.34) (2,42.94) (3,42.28) (4,40.72)
};
\addlegendentry{Mamba-3 SISO}

\addplot[color=mambaorange!60!red, mark=diamond*] coordinates {
    (1,34.44) (2,31.90) (3,29.37) (4,26.86)
};
\addlegendentry{Mamba-3 MIMO}

\addplot[color=gdngreen, mark=triangle*] coordinates {
    (1,41.85) (2,40.86) (3,40.02) (4,39.49)
};
\addlegendentry{Gated DeltaNet}

\addplot[color=kdablue, mark=square*] coordinates {
    (1,39.81) (2,38.98) (3,38.85) (4,38.50)
};
\addlegendentry{KDA}

\addplot[color=gdn2green, mark=triangle*, very thick, line width=1.7pt] coordinates {
    (1,38.00) (2,37.25) (3,36.87) (4,36.11)
};
\addlegendentry{\textcolor{gdn2green}{Gated DeltaNet-2}}

\end{axis}
\end{tikzpicture}
\caption{Training throughput on an H100 GPU.}
\label{fig:throughput_hybrid}
% 
\end{wrapfigure}
```
#### Throughput comparison.

Fig. `\ref{fig:throughput_hybrid}`{=latex} reports single-H100 training throughput for the hybrid 1.3B models under a fixed token budget. Gated DeltaNet-2 preserves the near-flat scaling profile of recurrent mixers as sequence length grows, dropping only mildly from 38.0 to 36.1 Kt/s, while the Transformer degrades sharply from 45.8 to 29.4 Kt/s. The small gap relative to KDA reflects the added channel-wise erase and write gates. Gated DeltaNet-2 thus retains practical training efficiency while paying a modest constant cost for finer memory control.
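
The relative drops can be read off the plotted values directly. The short sketch below only does arithmetic on the endpoints of three curves in the figure; the dictionary layout and the helper name `relative_drop` are illustrative choices.

```python
# Throughput in Kt/s at 2K x 8 and at 16K x 1, copied from the figure data.
throughput = {
    "Transformer": (45.83, 29.36),
    "KDA": (39.81, 38.50),
    "Gated DeltaNet-2": (38.00, 36.11),
}

def relative_drop(short: float, long: float) -> float:
    """Fractional throughput loss from the shortest to the longest setting."""
    return (short - long) / short

drops = {name: relative_drop(*pair) for name, pair in throughput.items()}

# Gap between KDA and Gated DeltaNet-2 at both ends, relative to KDA.
kda, gdn2 = throughput["KDA"], throughput["Gated DeltaNet-2"]
gap_short = (kda[0] - gdn2[0]) / kda[0]
gap_long = (kda[1] - gdn2[1]) / kda[1]
```

With these numbers the Transformer loses roughly 36% of its throughput across the sweep, Gated DeltaNet-2 about 5%, and the gap to KDA stays between about 4.5% and 6.2%.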

# Related Work {#sec_related_work}

Efficient sequence models replace quadratic self-attention with recurrent or linear-time token mixers that maintain a fixed-size state. Early structured state-space and recurrent models used mostly data-independent transitions [@s4; @s5; @Orvieto2023ResurrectingRN; @sun2023retentive], while Mamba and Mamba-2 introduced data-dependent selective dynamics and the SSD framework [@gu_mamba_2023; @pmlr-v235-dao24a]. Gated linear attention and related linear RNNs further improve memory control with learned decay gates [@yang_gated_2023; @qin2024hgrn2]. Delta-rule models take a complementary fast-weight view, where the recurrent state is updated by correcting the current read before writing the new value, improving associative memory over Hebbian-style accumulation [@Gardner1988TheSO; @Prados1989NeuralNC; @irie2021going; @yang2024parallelizing]. Gated DeltaNet adds adaptive forgetting to this update [@yang2025gated], and KDA strengthens it with channel-wise decay and an efficient chunkwise algorithm, but still uses a scalar $\beta_t$ to control both erasing and writing [@team2025kimi]. Mamba-3 advances the SSM line instead, using exponential-trapezoidal discretization, complex-valued transitions implemented through data-dependent rotations, and a MIMO formulation for stronger modeling at efficient decoding latency [@lahoti2026mamba3]. Our work is complementary to these directions. Gated DeltaNet-2 keeps the delta-rule fast-weight structure of GDN and KDA, but replaces the tied scalar update strength with a channel-wise erase gate $\vb_t$ and a channel-wise write gate $\vw_t$. This recovers KDA when both gates are tied to the same scalar, while allowing old content and new values to be controlled along different channel patterns.

# Conclusion {#sec_conclusion}

We introduced Gated DeltaNet-2, a delta-rule recurrent attention layer that decouples the active memory edit into channel-wise erase and write decisions. The erase gate $\vb_t$ selects which key-side coordinates of the decayed state are read and removed, while the write gate $\vw_t$ selects which value-side coordinates are committed. This removes the scalar $\beta_t$ tie in Gated DeltaNet and KDA, recovers both as special cases, and preserves efficient chunkwise training through a WY form with gate-aware kernels. Under matched 1.3B training, Gated DeltaNet-2 improves the recurrent and hybrid frontier across language modeling, commonsense reasoning, synthetic retrieval, and real-world recall, while adding only a small constant throughput overhead. Ablations show that both gates contribute, with the erase gate accounting for most of the gain.

```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\onecolumn
```
```{=latex}
\clearpage
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\renewcommand\thefigure{S.\arabic{figure}}
```
```{=latex}
\setcounter{figure}{0}
```
```{=latex}
\renewcommand\thetable{S.\arabic{table}}
```
```{=latex}
\setcounter{table}{0}
```
# Chunkwise derivation for Gated DeltaNet-2 {#app_wy}

This appendix gives the exact chunkwise form used in Section `\ref{subsec_chunkwise_training}`{=latex}. We work inside one chunk of length $C$ and write $\rmS_0$ for the state at the start of the chunk. All vectors are for a single head. In particular, $\vk_r,\ve_r,\vq_r,\vb_r,\valpha_r,\boldsymbol{\gamma}_r\in\mathbb{R}^{d_k}$, $\vv_r,\vz_r,\vo_r,\vw_r\in\mathbb{R}^{d_v}$, and $\rmS_r\in\mathbb{R}^{d_k\times d_v}$. The chunk index is suppressed.

The Gated DeltaNet-2 recurrence is $$\begin{aligned}
    \rmS_r
    =
    \bigl(\rmI - \vk_r\ve_r^\top\bigr)
    \operatorname{Diag}(\valpha_r)\rmS_{r-1}
    + \vk_r\vz_r^\top,
    \qquad
    \ve_r = \vb_r \odot \vk_r,
    \qquad
    \vz_r = \vw_r \odot \vv_r .
\label{eq_app_gdn2_step}
\end{aligned}$$ Let $$\begin{aligned}
    \boldsymbol{G}_r = \sum_{i=1}^{r}\boldsymbol{g}_i,
    \qquad
    \boldsymbol{\gamma}_r = \exp(\boldsymbol{G}_r),
    \qquad
    \boldsymbol{\gamma}_0 = \mathbf{1}_{d_k},
    \qquad
    \valpha_r = \exp(\boldsymbol{g}_r).
\label{eq_app_gamma}
\end{aligned}$$ All exponentials, products, and ratios involving $\boldsymbol{\gamma}$ are elementwise over the key channel axis.
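
As a point of reference, the tokenwise recurrence in Eq. `\ref{eq_app_gdn2_step}`{=latex} can be written in a few lines of NumPy. This is a sketch for a single head in float64, with hypothetical function names; it is meant as a readable baseline rather than as the kernel.

```python
import numpy as np

def gdn2_step(S, k, v, b, w, g):
    """One token of the recurrence. S has shape (d_k, d_v).

    k, b, g have shape (d_k,); v, w have shape (d_v,); g is the log-decay.
    """
    e = b * k                                   # erase direction e_r = b_r * k_r
    z = w * v                                   # gated value     z_r = w_r * v_r
    S_decayed = np.exp(g)[:, None] * S          # Diag(alpha_r) S_{r-1}
    # (I - k e^T) S_decayed + k z^T, expressed with outer products
    return S_decayed - np.outer(k, e @ S_decayed) + np.outer(k, z)

def gdn2_recurrent(S0, K, V, B, W, G):
    """Run the recurrence over a chunk; rows of K, V, B, W, G are tokens."""
    S = S0.copy()
    for k, v, b, w, g in zip(K, V, B, W, G):
        S = gdn2_step(S, k, v, b, w, g)
    return S
```

The read `e @ S_decayed` and the write `np.outer(k, z)` use different gates, which is the decoupling the chunkwise form has to preserve.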

## Decay-normalized recurrence {#app_decay_normalized}

Define a normalized state $\widehat{\rmS}_r$ by $$\begin{aligned}
    \rmS_r = \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r .
\label{eq_app_normalized_state}
\end{aligned}$$ Because $\boldsymbol{\gamma}_0=\mathbf{1}_{d_k}$, the normalized initial state is also $\widehat{\rmS}_0=\rmS_0$. Substituting Eq. `\ref{eq_app_normalized_state}`{=latex} into Eq. `\ref{eq_app_gdn2_step}`{=latex} and using $\boldsymbol{\gamma}_r=\valpha_r\odot\boldsymbol{\gamma}_{r-1}$ gives $$\begin{aligned}
    \widehat{\rmS}_r
    =
    \bigl(\rmI - \bar{\vk}_r\bar{\ve}_r^\top\bigr)
    \widehat{\rmS}_{r-1}
    + \bar{\vk}_r\vz_r^\top,
    \qquad
    \bar{\vk}_r = \boldsymbol{\gamma}_r^{-1}\odot\vk_r,
    \qquad
    \bar{\ve}_r = \boldsymbol{\gamma}_r\odot\ve_r .
\label{eq_app_normalized_recurrence}
\end{aligned}$$ The channel-wise decay has disappeared from the recurrence. It is now carried by the left and right factors of each rank-one edit.
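
For completeness, the substitution can be spelled out. The lines below only restate the step above under the same definitions and introduce no new quantities. Inserting $\rmS_r=\operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r$ on both sides and using $\operatorname{Diag}(\valpha_r)\operatorname{Diag}(\boldsymbol{\gamma}_{r-1})=\operatorname{Diag}(\boldsymbol{\gamma}_r)$ gives $$\begin{aligned}
    \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_r
    &=
    \bigl(\rmI-\vk_r\ve_r^\top\bigr)
    \operatorname{Diag}(\boldsymbol{\gamma}_r)\widehat{\rmS}_{r-1}
    + \vk_r\vz_r^\top,
    \\
    \widehat{\rmS}_r
    &=
    \widehat{\rmS}_{r-1}
    -
    \bigl(\boldsymbol{\gamma}_r^{-1}\odot\vk_r\bigr)
    \bigl(\boldsymbol{\gamma}_r\odot\ve_r\bigr)^\top
    \widehat{\rmS}_{r-1}
    +
    \bigl(\boldsymbol{\gamma}_r^{-1}\odot\vk_r\bigr)\vz_r^\top ,
\end{aligned}$$ where the second line follows by left-multiplying with $\operatorname{Diag}(\boldsymbol{\gamma}_r)^{-1}$ and using $\ve_r^\top\operatorname{Diag}(\boldsymbol{\gamma}_r)=(\boldsymbol{\gamma}_r\odot\ve_r)^\top$.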

Let $\rmK$, $\rmV$, $\rmB$, and $\rmW$ contain rows $\vk_r^\top$, $\vv_r^\top$, $\vb_r^\top$, and $\vw_r^\top$, respectively. For compact matrix notation, let $\boldsymbol{\gamma}$ contain rows $\boldsymbol{\gamma}_r^\top$. Let $\bar{\rmK}$, $\bar{\rmE}$, and $\rmZ$ contain rows $\bar{\vk}_r^\top$, $\bar{\ve}_r^\top$, and $\vz_r^\top$. Equivalently, $$\begin{aligned}
    \bar{\rmK}
    =
    \boldsymbol{\gamma}^{-1}\odot\rmK,
    \qquad
    \bar{\rmE}
    =
    \boldsymbol{\gamma}\odot(\rmB\odot\rmK),
    \qquad
    \rmZ
    =
    \rmW\odot\rmV .
\label{eq_app_gate_matrices}
\end{aligned}$$ Define $$\begin{aligned}
    \rmT = \operatorname{tril}(\bar{\rmE}\bar{\rmK}^\top, -1),
    \qquad
    \rmA = (\rmI + \rmT)^{-1},
    \qquad
    \rmY = \rmA\bar{\rmE},
    \qquad
    \rmU = \rmA\rmZ .
\label{eq_app_wy_definitions}
\end{aligned}$$ Since $\rmT$ is strictly lower triangular, $\rmA$ is lower triangular with unit diagonal and is obtained by forward substitution.
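
The forward substitution behind $\rmA$ can be illustrated with a small NumPy sketch. The function names `wy_inverse` and `wy_auxiliaries` are hypothetical, and the row-by-row loop is chosen for readability, not to mirror the kernel schedule.

```python
import numpy as np

def wy_inverse(T):
    """Return A = (I + T)^{-1} for strictly lower triangular T."""
    C = T.shape[0]
    A = np.eye(C)
    for r in range(1, C):
        # Row r of (I + T) A = I reads A[r] = e_r - sum_{s<r} T[r, s] * A[s].
        A[r, :r] = -T[r, :r] @ A[:r, :r]
    return A

def wy_auxiliaries(E_bar, K_bar, Z):
    """Build T, then A, Y = A E_bar, and U = A Z from the gated factors."""
    T = np.tril(E_bar @ K_bar.T, k=-1)
    A = wy_inverse(T)
    return A, A @ E_bar, A @ Z
```

Each row only depends on earlier rows, so the cost is that of one triangular solve and the result is shared by both right-hand sides.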

## Compact state formula {#app_state_formula}

Define $$\begin{aligned}
    \rmR = \rmU - \rmY\rmS_0 .
\label{eq_app_residual_matrix}
\end{aligned}$$ Let row $r$ of $\rmR$ be $\boldsymbol{\rho}_r^\top$, where $\boldsymbol{\rho}_r\in\mathbb{R}^{d_v}$. Then the normalized state after any prefix of the chunk is $$\begin{aligned}
    \widehat{\rmS}_r
    =
    \rmS_0
    +
    \bar{\rmK}_{\le r}^{\top}\rmR_{\le r},
\label{eq_app_prefix_state}
\end{aligned}$$ where $\bar{\rmK}_{\le r}$ and $\rmR_{\le r}$ denote the first $r$ rows.

To prove Eq. `\ref{eq_app_prefix_state}`{=latex}, write the rank-one increment at step $r$ as $\bar{\vk}_r\boldsymbol{\rho}_r^\top$. From Eq. `\ref{eq_app_normalized_recurrence}`{=latex}, the residual row is $$\begin{aligned}
    \boldsymbol{\rho}_r^\top
    =
    \vz_r^\top
    -
    \bar{\ve}_r^\top\widehat{\rmS}_{r-1}.
\label{eq_app_residual_row}
\end{aligned}$$ Equivalently, in column-vector form, $\boldsymbol{\rho}_r=\vz_r-\widehat{\rmS}_{r-1}^{\top}\bar{\ve}_r$. Using the induction hypothesis $\widehat{\rmS}_{r-1}=\rmS_0+\sum_{s<r}\bar{\vk}_s\boldsymbol{\rho}_s^\top$ gives $$\begin{aligned}
    \boldsymbol{\rho}_r^\top
    =
    \vz_r^\top
    -
    \bar{\ve}_r^\top\rmS_0
    -
    \sum_{s<r}
    \bar{\ve}_r^\top\bar{\vk}_s\,\boldsymbol{\rho}_s^\top .
\label{eq_app_residual_row_expanded}
\end{aligned}$$ Since $\rmT_{rs}=\bar{\ve}_r^\top\bar{\vk}_s$ for $s<r$ and $\rmT_{rs}=0$ otherwise, stacking these residual rows over the chunk yields $$\begin{aligned}
    (\rmI+\rmT)\rmR
    =
    \rmZ-\bar{\rmE}\rmS_0 .
\label{eq_app_residual_linear_system}
\end{aligned}$$ Multiplying by $\rmA$ gives $\rmR=\rmA\rmZ-\rmA\bar{\rmE}\rmS_0=\rmU-\rmY\rmS_0$, which is Eq. `\ref{eq_app_residual_matrix}`{=latex}. Substituting the resulting increments into the normalized recurrence proves Eq. `\ref{eq_app_prefix_state}`{=latex}.

Evaluating Eq. `\ref{eq_app_prefix_state}`{=latex} at $r=C$ and left-multiplying by $\operatorname{Diag}(\boldsymbol{\gamma}_C)$, as in Eq. `\ref{eq_app_normalized_state}`{=latex}, gives the end-of-chunk state $$\begin{aligned}
    \rmS_C
    =
    \operatorname{Diag}(\boldsymbol{\gamma}_C)\rmS_0
    +
    \rmK_{\mathrm{tail}}^\top
    \bigl(\rmU-\rmY\rmS_0\bigr),
\label{eq_app_chunk_state}
\end{aligned}$$ where row $r$ of $\rmK_{\mathrm{tail}}$ is $$\begin{aligned}
    (\rmK_{\mathrm{tail}})_{r,:}
    =
    \bigl((\boldsymbol{\gamma}_C/\boldsymbol{\gamma}_r)
    \odot \vk_r\bigr)^\top .
\label{eq_app_ktail}
\end{aligned}$$ This is Eq. `\ref{eq_gdn2_chunk_state}`{=latex} in the main text.
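
The compact state formula can be checked numerically against the tokenwise recurrence. The sketch below assumes a single head and one chunk in float64; the function names are hypothetical and `np.linalg.inv` stands in for the triangular solve.

```python
import numpy as np

def chunk_state(S0, K, V, B, W, G_log):
    """End-of-chunk state from the compact formula."""
    gamma = np.exp(np.cumsum(G_log, axis=0))     # rows gamma_r, shape (C, d_k)
    K_bar = K / gamma                            # gamma^{-1} * K
    E_bar = gamma * B * K                        # gamma * (B * K)
    Z = W * V
    T = np.tril(E_bar @ K_bar.T, k=-1)
    A = np.linalg.inv(np.eye(len(K)) + T)        # a triangular solve in practice
    R = A @ Z - (A @ E_bar) @ S0                 # R = U - Y S_0
    K_tail = (gamma[-1] / gamma) * K             # rows (gamma_C / gamma_r) * k_r
    return gamma[-1][:, None] * S0 + K_tail.T @ R

def recurrent_state(S0, K, V, B, W, G_log):
    """Reference: apply the tokenwise recurrence over the chunk."""
    S = S0.copy()
    for k, v, b, w, g in zip(K, V, B, W, G_log):
        S = np.exp(g)[:, None] * S
        S = S - np.outer(k, (b * k) @ S) + np.outer(k, w * v)
    return S
```

Both functions should agree to floating-point tolerance for any gates in range, since the compact form is an exact rewriting of the recurrence.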

## Compact output formula {#app_output_formula}

The output at token $r$ is the column vector $\vo_r=\rmS_r^\top\vq_r$. It is convenient to write the corresponding row vector. Using Eq. `\ref{eq_app_normalized_state}`{=latex} and Eq. `\ref{eq_app_prefix_state}`{=latex}, $$\begin{aligned}
    \vo_r^\top
    =
    (\boldsymbol{\gamma}_r\odot\vq_r)^\top\rmS_0
    +
    \sum_{s\le r}
    \bigl[
    \vq_r^\top
    \operatorname{Diag}(\boldsymbol{\gamma}_r/\boldsymbol{\gamma}_s)
    \vk_s
    \bigr]
    \boldsymbol{\rho}_s^\top .
\label{eq_app_output_token}
\end{aligned}$$ Define $\rmQ_{\gamma}$ by row $(\rmQ_{\gamma})_{r,:}=(\boldsymbol{\gamma}_r\odot\vq_r)^\top$ and define the causal score matrix $$\begin{aligned}
    (\rmA_{qk})_{rs}
    =
    \mathbf{1}_{r\ge s}
    \vq_r^\top
    \operatorname{Diag}(\boldsymbol{\gamma}_r/\boldsymbol{\gamma}_s)
    \vk_s .
\label{eq_app_aqk}
\end{aligned}$$ Let $\rmO$ contain rows $\vo_r^\top$. Stacking Eq. `\ref{eq_app_output_token}`{=latex} over the chunk gives $$\begin{aligned}
    \rmO
    =
    \rmQ_{\gamma}\rmS_0
    +
    \rmA_{qk}
    \bigl(\rmU-\rmY\rmS_0\bigr),
\label{eq_app_chunk_output}
\end{aligned}$$ which is Eq. `\ref{eq_gdn2_chunk_output}`{=latex} in the main text.
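
A matching check for the output formula is sketched below, again for one head and one chunk in float64. The names `chunk_output` and `recurrent_output` are hypothetical, and the dense inverse is used only for brevity.

```python
import numpy as np

def chunk_output(S0, Q, K, V, B, W, G_log):
    """Chunk outputs O = Q_gamma S_0 + A_qk (U - Y S_0)."""
    gamma = np.exp(np.cumsum(G_log, axis=0))
    K_bar, E_bar, Z = K / gamma, gamma * B * K, W * V
    T = np.tril(E_bar @ K_bar.T, k=-1)
    A = np.linalg.inv(np.eye(len(K)) + T)
    R = A @ (Z - E_bar @ S0)                     # equals U - Y S_0
    Q_gamma = gamma * Q
    A_qk = np.tril(Q_gamma @ K_bar.T)            # causal, diagonal included
    return Q_gamma @ S0 + A_qk @ R

def recurrent_output(S0, Q, K, V, B, W, G_log):
    """Reference: o_r = S_r^T q_r from the tokenwise recurrence."""
    S, rows = S0.copy(), []
    for q, k, v, b, w, g in zip(Q, K, V, B, W, G_log):
        S = np.exp(g)[:, None] * S
        S = S - np.outer(k, (b * k) @ S) + np.outer(k, w * v)
        rows.append(S.T @ q)
    return np.stack(rows)
```

Note that the score matrix keeps its diagonal, because the output at token $r$ reads the state after the edit at token $r$.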

## Row recurrences {#app_row_recurrences}

The matrices $\rmY$ and $\rmU$ can also be written row by row. Let row $r$ of $\rmY$ be $\vy_r^\top$ and row $r$ of $\rmU$ be $\vu_r^\top$. Since $(\rmI+\rmT)\rmY=\bar{\rmE}$ and $(\rmI+\rmT)\rmU=\rmZ$, $$\begin{aligned}
    \vy_r^\top
    &=
    \bar{\ve}_r^\top
    -
    \sum_{s<r}
    \bar{\ve}_r^\top\bar{\vk}_s \, \vy_s^\top,
\label{eq_app_y_row}\\
    \vu_r^\top
    &=
    \vz_r^\top
    -
    \sum_{s<r}
    \bar{\ve}_r^\top\bar{\vk}_s \, \vu_s^\top .
\label{eq_app_u_row}
\end{aligned}$$ Both auxiliaries solve the same lower triangular system with different right-hand sides. This is why the same WY inverse can be shared by the erase-side and write-side computations.
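
The two row recurrences can be run side by side without forming $\rmA$ explicitly. The following sketch uses a hypothetical helper `wy_rows`; it shows that a single set of coefficients $\bar{\ve}_r^\top\bar{\vk}_s$ serves both auxiliaries.

```python
import numpy as np

def wy_rows(E_bar, K_bar, Z):
    """Compute Y and U row by row from the shared lower triangular system."""
    C = len(Z)
    Y = np.zeros_like(E_bar)
    U = np.zeros_like(Z)
    for r in range(C):
        coeff = K_bar[:r] @ E_bar[r]      # e_bar_r^T k_bar_s for all s < r
        Y[r] = E_bar[r] - coeff @ Y[:r]   # erase-side auxiliary
        U[r] = Z[r] - coeff @ U[:r]       # write-side auxiliary
    return Y, U
```

The coefficients are computed once per row and reused, which is the row-level counterpart of sharing the WY inverse.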

## Tied-gate reductions {#app_reductions}

If $\vb_r=\beta_r\mathbf{1}_{d_k}$ and $\vw_r=\beta_r\mathbf{1}_{d_v}$, then $\ve_r=\beta_r\vk_r$ and $\vz_r=\beta_r\vv_r$. Equation `\ref{eq_app_gdn2_step}`{=latex} becomes the KDA update. If the decay is also tied as $\valpha_r=\alpha_r\mathbf{1}_{d_k}$, the recurrence becomes Gated DeltaNet. Thus KDA and Gated DeltaNet are recovered by tying the channel gates rather than by changing the algorithm.

The same reduction holds for the chunkwise form. Under the KDA tying, the definitions in Eq. `\ref{eq_app_gate_matrices}`{=latex} give $$\begin{aligned}
    \bar{\ve}_r
    =
    \boldsymbol{\gamma}_r\odot(\beta_r\vk_r)
    =
    \beta_r(\boldsymbol{\gamma}_r\odot\vk_r)
    =
    \beta_r(\boldsymbol{\gamma}_r\odot\boldsymbol{\gamma}_r)\odot\bar{\vk}_r,
    \qquad
    \vz_r
    =
    \beta_r\vv_r .
\label{eq_app_tied_gate_factors}
\end{aligned}$$ Thus $\rmZ$ becomes a scalar row scaling of $\rmV$, while $\bar{\rmE}$ becomes the KDA decay-normalized erase factor. In general, $\bar{\rmE}$ is not a scalar row scaling of $\bar{\rmK}$ when the decay is channel-wise, since each key channel carries its own factor from $\boldsymbol{\gamma}_r$. Equations `\ref{eq_app_wy_definitions}`{=latex}, `\ref{eq_app_chunk_state}`{=latex}, and `\ref{eq_app_chunk_output}`{=latex} therefore reduce to the KDA chunk equations after substituting the tied factors in Eq. `\ref{eq_app_tied_gate_factors}`{=latex}. If the decay is further tied as in Gated DeltaNet, then $\boldsymbol{\gamma}_r=\gamma_r\mathbf{1}_{d_k}$ and the erase factor simplifies to $\bar{\ve}_r=\beta_r\gamma_r^2\bar{\vk}_r$, recovering the scalar-decay chunkwise form.

At the gradient level, if a thin wrapper sets $\vb_r=\beta_r\mathbf{1}_{d_k}$ and $\vw_r=\beta_r\mathbf{1}_{d_v}$, the scalar gradient is $$\begin{aligned}
    \frac{\partial \mathcal{L}}{\partial \beta_r}
    =
    \left\langle
    \frac{\partial \mathcal{L}}{\partial \vb_r},
    \mathbf{1}_{d_k}
    \right\rangle
    +
    \left\langle
    \frac{\partial \mathcal{L}}{\partial \vw_r},
    \mathbf{1}_{d_v}
    \right\rangle .
\label{eq_app_beta_gradient}
\end{aligned}$$
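
A thin tying wrapper of this kind might look as follows. The function names `tie_gates` and `tie_gates_backward` are hypothetical; the sketch only illustrates the broadcast in the forward pass and the channel sums in Eq. `\ref{eq_app_beta_gradient}`{=latex}.

```python
import numpy as np

def tie_gates(beta, d_k, d_v):
    """Broadcast one scalar per token to channel-wise erase and write gates."""
    b = np.repeat(beta[:, None], d_k, axis=1)    # shape (T, d_k)
    w = np.repeat(beta[:, None], d_v, axis=1)    # shape (T, d_v)
    return b, w

def tie_gates_backward(grad_b, grad_w):
    """Scalar gradient: sum both channel-wise gradients over their channels."""
    return grad_b.sum(axis=1) + grad_w.sum(axis=1)
```

The backward of a broadcast is a sum, so no change to the channel-wise kernels is needed to train a tied model through them.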

# Backward derivation {#app_backward}

We derive the vector-Jacobian products for one chunk using the notation of Appendix `\ref{app_wy}`{=latex}. Let upstream gradients be $\mathrm{d}\rmO$ for the chunk output and $\mathrm{d}\rmS_C$ for the end-of-chunk state. The forward equations are $$\begin{aligned}
    \rmR &= \rmU-\rmY\rmS_0,
\label{eq_app_bwd_R}\\
    \rmO &= \rmQ_{\gamma}\rmS_0+\rmA_{qk}\rmR,
\label{eq_app_bwd_O}\\
    \rmS_C &= \operatorname{Diag}(\boldsymbol{\gamma}_C)\rmS_0+\rmK_{\mathrm{tail}}^\top\rmR,
\label{eq_app_bwd_S}\\
    \rmU &= \rmA\rmZ,
    \qquad
    \rmY = \rmA\bar{\rmE},
    \qquad
    \rmA=(\rmI+\rmT)^{-1},
    \qquad
    \rmT=\operatorname{tril}(\bar{\rmE}\bar{\rmK}^\top,-1).
\label{eq_app_bwd_core}
\end{aligned}$$

## Output and state paths {#app_bwd_paths}

From Eq. `\ref{eq_app_bwd_O}`{=latex}, $$\begin{aligned}
    \mathrm{d}\rmA_{qk}
    &\mathrel{+}=
    \mathrm{d}\rmO\,\rmR^\top,
\label{eq_app_daqk_from_output}\\
    \mathrm{d}\rmR
    &\mathrel{+}=
    \rmA_{qk}^\top\mathrm{d}\rmO,
\label{eq_app_dR_from_output}\\
    \mathrm{d}\rmQ_{\gamma}
    &\mathrel{+}=
    \mathrm{d}\rmO\,\rmS_0^\top,
\label{eq_app_dQg_from_output}\\
    \mathrm{d}\rmS_0
    &\mathrel{+}=
    \rmQ_{\gamma}^\top\mathrm{d}\rmO .
\label{eq_app_dS_from_output}
\end{aligned}$$ The causal mask is applied to $\mathrm{d}\rmA_{qk}$.

From Eq. `\ref{eq_app_bwd_S}`{=latex}, $$\begin{aligned}
    \mathrm{d}\rmR
    &\mathrel{+}=
    \rmK_{\mathrm{tail}}\mathrm{d}\rmS_C,
\label{eq_app_dR_from_state}\\
    \mathrm{d}\rmK_{\mathrm{tail}}
    &\mathrel{+}=
    \rmR\,\mathrm{d}\rmS_C^\top,
\label{eq_app_dKtail_from_state}\\
    \mathrm{d}\rmS_0
    &\mathrel{+}=
    \operatorname{Diag}(\boldsymbol{\gamma}_C)\mathrm{d}\rmS_C,
\label{eq_app_dS_from_state}\\
    \mathrm{d}\boldsymbol{\gamma}_C
    &\mathrel{+}=
    \mathrm{rowsum}\bigl(\mathrm{d}\rmS_C\odot\rmS_0\bigr).
\label{eq_app_dgammaC_from_state}
\end{aligned}$$ Here $\mathrm{rowsum}$ sums over the value dimension and returns a $d_k$-dimensional vector.

The residual relation in Eq. `\ref{eq_app_bwd_R}`{=latex} gives $$\begin{aligned}
    \mathrm{d}\rmU
    &\mathrel{+}=
    \mathrm{d}\rmR,
\label{eq_app_dU_from_R}\\
    \mathrm{d}\rmY
    &\mathrel{+}=
    -\mathrm{d}\rmR\,\rmS_0^\top,
\label{eq_app_dY_from_R}\\
    \mathrm{d}\rmS_0
    &\mathrel{+}=
    -\rmY^\top\mathrm{d}\rmR .
\label{eq_app_dS_from_R}
\end{aligned}$$

## Gate-aware WY inverse path {#app_bwd_wy}

The two auxiliary products yield $$\begin{aligned}
    \mathrm{d}\rmA
    &\mathrel{+}=
    \mathrm{d}\rmU\,\rmZ^\top,
    &
    \mathrm{d}\rmZ
    &\mathrel{+}=
    \rmA^\top\mathrm{d}\rmU,
\label{eq_app_bwd_U}\\
    \mathrm{d}\rmA
    &\mathrel{+}=
    \mathrm{d}\rmY\,\bar{\rmE}^\top,
    &
    \mathrm{d}\bar{\rmE}
    &\mathrel{+}=
    \rmA^\top\mathrm{d}\rmY .
\label{eq_app_bwd_Y}
\end{aligned}$$ Equations `\ref{eq_app_bwd_U}`{=latex} and `\ref{eq_app_bwd_Y}`{=latex} are the gate-aware accumulation emphasized in Section `\ref{subsec_gate_aware_backward}`{=latex}. Since $\rmZ=\rmW\odot\rmV$ and $\bar{\rmE}=\boldsymbol{\gamma}\odot(\rmB\odot\rmK)$, the gates must appear inside the products that accumulate $\mathrm{d}\rmA$. A scalar post-scale is correct only in the tied-gate case.

For the inverse, $$\begin{aligned}
    \mathrm{d}\rmT
    =
    -\operatorname{tril}
    \bigl(
    \rmA^\top\mathrm{d}\rmA\rmA^\top,
    -1
    \bigr).
\label{eq_app_dT}
\end{aligned}$$ The construction of $\rmT$ gives $$\begin{aligned}
    \mathrm{d}\bar{\rmE}
    &\mathrel{+}=
    \mathrm{d}\rmT\,\bar{\rmK},
\label{eq_app_dE_from_T}\\
    \mathrm{d}\bar{\rmK}
    &\mathrel{+}=
    \mathrm{d}\rmT^\top\bar{\rmE}.
\label{eq_app_dKbar_from_T}
\end{aligned}$$ Only the strictly lower triangular part of $\mathrm{d}\rmT$ is used.
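
The inverse vector-Jacobian product in Eq. `\ref{eq_app_dT}`{=latex} can be sanity-checked with finite differences. The sketch below uses a toy linear loss $\mathcal{L}=\sum_{rs} M_{rs}A_{rs}$, so that $\mathrm{d}\rmA=M$; the helper names and the loss are assumptions made for the test.

```python
import numpy as np

def wy_inverse(T):
    """A = (I + T)^{-1}; a dense inverse is enough for a small check."""
    return np.linalg.inv(np.eye(T.shape[0]) + T)

def inverse_backward(A, dA):
    """Gradient w.r.t. T through A = (I + T)^{-1}, strictly lower part only."""
    return -np.tril(A.T @ dA @ A.T, k=-1)

def finite_difference(T, M, r, s, eps=1e-6):
    """Central difference of sum(M * (I + T)^{-1}) with respect to T[r, s]."""
    Tp, Tm = T.copy(), T.copy()
    Tp[r, s] += eps
    Tm[r, s] -= eps
    return (np.sum(M * wy_inverse(Tp)) - np.sum(M * wy_inverse(Tm))) / (2 * eps)
```

Comparing `inverse_backward(wy_inverse(T), M)[r, s]` with `finite_difference(T, M, r, s)` for every $s<r$ exercises exactly the entries that the kernel uses.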

## Scores and tail keys {#app_bwd_scores}

The score matrix satisfies $\rmA_{qk}=\operatorname{tril}(\rmQ_{\gamma}\bar{\rmK}^\top)$. Therefore $$\begin{aligned}
    \mathrm{d}\rmQ_{\gamma}
    &\mathrel{+}=
    \mathrm{d}\rmA_{qk}\,\bar{\rmK},
\label{eq_app_dQg_from_scores}\\
    \mathrm{d}\bar{\rmK}
    &\mathrel{+}=
    \mathrm{d}\rmA_{qk}^\top\rmQ_{\gamma}.
\label{eq_app_dKbar_from_scores}
\end{aligned}$$ The tail key matrix satisfies $(\rmK_{\mathrm{tail}})_{r}=\boldsymbol{\gamma}_C\odot\bar{\vk}_r$. Hence $$\begin{aligned}
    \mathrm{d}\bar{\rmK}
    &\mathrel{+}=
    \mathrm{d}\rmK_{\mathrm{tail}}\odot\boldsymbol{\gamma}_C,
\label{eq_app_dKbar_from_tail}\\
    \mathrm{d}\boldsymbol{\gamma}_C
    &\mathrel{+}=
    \sum_{r=1}^{C}
    \mathrm{d}(\rmK_{\mathrm{tail}})_{r}
    \odot
    \bar{\vk}_r .
\label{eq_app_dgammaC_from_tail}
\end{aligned}$$

## Elementwise gates and cumulative decay {#app_bwd_elementwise}

The write-side relation $\rmZ=\rmW\odot\rmV$ gives $$\begin{aligned}
    \mathrm{d}\rmW
    &\mathrel{+}=
    \mathrm{d}\rmZ\odot\rmV,
    &
    \mathrm{d}\rmV
    &\mathrel{+}=
    \mathrm{d}\rmZ\odot\rmW .
\label{eq_app_dW_dV}
\end{aligned}$$ The erase-side relation $\bar{\rmE}=\boldsymbol{\gamma}\odot(\rmB\odot\rmK)$ gives $$\begin{aligned}
    \mathrm{d}\rmB
    &\mathrel{+}=
    \mathrm{d}\bar{\rmE}\odot\boldsymbol{\gamma}\odot\rmK,
\label{eq_app_dB_from_E}\\
    \mathrm{d}\rmK
    &\mathrel{+}=
    \mathrm{d}\bar{\rmE}\odot\boldsymbol{\gamma}\odot\rmB,
\label{eq_app_dK_from_E}\\
    \mathrm{d}\boldsymbol{\gamma}
    &\mathrel{+}=
    \mathrm{d}\bar{\rmE}\odot\rmB\odot\rmK .
\label{eq_app_dgamma_from_E}
\end{aligned}$$ The normalized keys and queries are $$\begin{aligned}
    \bar{\rmK} = \boldsymbol{\gamma}^{-1}\odot\rmK,
    \qquad
    \rmQ_{\gamma} = \boldsymbol{\gamma}\odot\rmQ .
\label{eq_app_Kbar_Qg}
\end{aligned}$$ Their vector-Jacobian products are $$\begin{aligned}
    \mathrm{d}\rmK
    &\mathrel{+}=
    \mathrm{d}\bar{\rmK}\odot\boldsymbol{\gamma}^{-1},
\label{eq_app_dK_from_Kbar}\\
    \mathrm{d}\boldsymbol{\gamma}
    &\mathrel{+}=
    -\mathrm{d}\bar{\rmK}\odot\rmK\odot\boldsymbol{\gamma}^{-2},
\label{eq_app_dgamma_from_Kbar}\\
    \mathrm{d}\rmQ
    &\mathrel{+}=
    \mathrm{d}\rmQ_{\gamma}\odot\boldsymbol{\gamma},
\label{eq_app_dQ_from_Qg}\\
    \mathrm{d}\boldsymbol{\gamma}
    &\mathrel{+}=
    \mathrm{d}\rmQ_{\gamma}\odot\rmQ .
\label{eq_app_dgamma_from_Qg}
\end{aligned}$$ Finally, $\boldsymbol{\gamma}_r=\exp(\boldsymbol{G}_r)$ and $\boldsymbol{G}_r=\sum_{i\le r}\boldsymbol{g}_i$. Therefore $$\begin{aligned}
    \mathrm{d}\boldsymbol{G}_r
    =
    \mathrm{d}\boldsymbol{\gamma}_r
    \odot
    \boldsymbol{\gamma}_r,
    \qquad
    \mathrm{d}\boldsymbol{g}_i
    =
    \sum_{r\ge i}
    \mathrm{d}\boldsymbol{G}_r .
\label{eq_app_dg_reverse_cumsum}
\end{aligned}$$ In implementation this is a reverse cumulative sum over the chunk.
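
The reverse cumulative sum can be written in one line of NumPy. The function name `decay_backward` is hypothetical, and the sketch assumes that all contributions to $\mathrm{d}\boldsymbol{\gamma}$ have already been accumulated into a single array with one row per token.

```python
import numpy as np

def decay_backward(d_gamma, gamma):
    """Map gradients on gamma (rows gamma_r) to gradients on the log-decays g_i."""
    dG = d_gamma * gamma                         # through gamma_r = exp(G_r)
    # dg_i = sum_{r >= i} dG_r: flip, cumulative sum, flip back.
    return np.cumsum(dG[::-1], axis=0)[::-1]
```

For example, with $\boldsymbol{\gamma}=\mathbf{1}$ and per-token gradients $1,2,3$ on a channel, the log-decay gradients are $6,5,3$.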

## Why scalar post-scaling is invalid {#app_no_post_scaling}

In KDA, a scalar $\beta_s$ multiplies both the value right-hand side and the erase right-hand side of token $s$. With $\mathrm{d}\vu_r^\top$ denoting row $r$ of $\mathrm{d}\rmU$, the $(r,s)$ entry of the write-auxiliary contribution $\mathrm{d}\rmU\,\rmZ^\top$ to $\mathrm{d}\rmA$ can be factored as $$\begin{aligned}
    \mathrm{d}\vu_r^\top(\beta_s\vv_s)
    =
    \beta_s\,\mathrm{d}\vu_r^\top\vv_s .
\label{eq_app_scalar_factor}
\end{aligned}$$ The scalar factor can be applied after the dot product. Gated DeltaNet-2 replaces $\beta_s\vv_s$ by $\vw_s\odot\vv_s$. Since $\vw_s$ is a different diagonal operator for every row, there is no row scalar or column scalar that can recover $$\begin{aligned}
    \mathrm{d}\vu_r^\top(\vw_s\odot\vv_s)
\label{eq_app_vector_factor_write}
\end{aligned}$$ from $\mathrm{d}\vu_r^\top\vv_s$. The erase side has the same issue with $\vb_s\odot\vk_s$. The gates must be baked into the dot products in Eq. `\ref{eq_app_bwd_U}`{=latex} and Eq. `\ref{eq_app_bwd_Y}`{=latex}.
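
A small numerical experiment makes the distinction concrete. The sketch below uses random toy tensors and a hypothetical helper `is_column_scaling`; it tests whether a gated product can be obtained from the ungated one by scaling each column $s$ with a single number.

```python
import numpy as np

rng = np.random.default_rng(0)
C, d_v = 4, 5
dU = rng.normal(size=(C, d_v))            # upstream gradient of the write auxiliary
V = rng.normal(size=(C, d_v))
beta = rng.uniform(0.1, 0.9, size=C)      # scalar gate per token (tied case)
W = rng.uniform(0.1, 0.9, size=(C, d_v))  # channel-wise write gate

plain = dU @ V.T                          # ungated dot products
tied = dU @ (beta[:, None] * V).T         # scalar gate folded into the product
gated = dU @ (W * V).T                    # channel-wise gate folded into the product

def is_column_scaling(M, base):
    """True if M = base @ diag(c): the ratio is constant down each column."""
    ratio = M / base
    return bool(np.allclose(ratio, ratio[:1], rtol=1e-8))

tied_ok = is_column_scaling(tied, plain)
gated_ok = is_column_scaling(gated, plain)
```

In the tied case the ratio in column $s$ is exactly $\beta_s$, whereas with a channel-wise gate the ratio depends on the row, so no post-scale reproduces the gated product.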

# Layer and kernel implementation {#app_implementation}

This appendix records the implementation choices needed to reproduce Gated DeltaNet-2. The main text keeps the Triton details brief. Here we describe the computation at the level of kernels and tensor shapes.

## Layer parameterization {#app_layer}

The layer computes short-convolutional projections for $\vq$, $\vk$, and $\vv$, followed by head reshaping. The erase and write gates are produced by independent projections, $$\begin{aligned}
    \vb = \sigma(\operatorname{Proj}_{b}(\vx)),
    \qquad
    \vw = \sigma(\operatorname{Proj}_{w}(\vx)).
\label{eq_app_layer_gates}
\end{aligned}$$ The erase projection has shape $d_{\mathrm{model}}\to H d_k$. The write projection has shape $d_{\mathrm{model}}\to H_v d_v$. If grouped value attention is used with $H_v>H$, the key-side tensors $\vq$, $\vk$, the log-decay tensor $\boldsymbol{g}$, and $\vb$ are repeated across the value-head group, while $\vv$ and $\vw$ already live on the value-head axis.

The log-decay is computed outside the kernel in fp32, $$\begin{aligned}
    \boldsymbol{g}_t
    =
    -\exp(\mathbf{a})\odot
    \operatorname{softplus}(\operatorname{Proj}_{f}(\vx_t)+\boldsymbol{\delta}).
\label{eq_app_layer_decay}
\end{aligned}$$ The vector $\mathbf{a}$ is stored per key head and broadcast across the $d_k$ channels of that head. The bias $\boldsymbol{\delta}$ is stored per key channel. The kernel consumes $\boldsymbol{g}_t$ directly and forms the local cumulative sums in Eq. `\ref{eq_app_gamma}`{=latex}.

If negative eigenvalues are enabled, only the erase gate is scaled by $2$. This changes $\vb_t\in[0,1]^{d_k}$ into $\vb_t\in[0,2]^{d_k}$. The write gate remains in $[0,1]^{d_v}$.
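
The gate and decay parameterization can be sketched in NumPy. Several details are assumptions made for illustration: the projections are modeled as bias-free matrix products, short convolutions and head repetition are omitted, and the flag name `allow_neg_eigval` is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.logaddexp(0.0, x)      # log(1 + exp(x)) without overflow

def gates_and_decay(x, W_b, W_w, W_f, a, delta, H, d_k, allow_neg_eigval=False):
    """x: (T, d_model). W_b, W_f: (d_model, H*d_k). W_w: (d_model, H_v*d_v).

    a: (H,) per-head scale, broadcast over the d_k channels of each head.
    delta: (H*d_k,) per-channel bias.
    """
    b = sigmoid(x @ W_b)                          # erase gate in [0, 1]
    if allow_neg_eigval:
        b = 2.0 * b                               # erase gate in [0, 2]
    w = sigmoid(x @ W_w)                          # write gate stays in [0, 1]
    pre = (x @ W_f + delta).astype(np.float32).reshape(-1, H, d_k)
    g = -np.exp(a.astype(np.float32))[None, :, None] * softplus(pre)
    return b, w, g.reshape(-1, H * d_k)           # g is the fp32 log-decay, g <= 0
```

Since the softplus is nonnegative, the log-decay is never positive, so $\valpha_t=\exp(\boldsymbol{g}_t)$ stays in $(0,1]$ per channel.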

## Forward kernels {#app_forward_kernels}

The chunk size is fixed to $C=64$. Each chunk is processed by the following steps.

#### Intra-chunk products

The first kernel forms the causal score matrix $\rmA_{qk}$ and the strictly lower matrix $\rmT$. The Gated DeltaNet-2 specific computation is the row factor of $\rmT$, $$\begin{aligned}
    T_{rs}
    =
    \bar{\ve}_r^\top\bar{\vk}_s
    =
    (\boldsymbol{\gamma}_r\odot\vb_r\odot\vk_r)^\top
    (\boldsymbol{\gamma}_s^{-1}\odot\vk_s)
    \qquad
    s<r .
\label{eq_app_kernel_T}
\end{aligned}$$ Thus the erase gate is multiplied into the key tile before the dot product. The score matrix $\rmA_{qk}$ is unchanged apart from the same decay-normalized key factors.

#### WY solve

The second kernel computes $\rmA=(\rmI+\rmT)^{-1}$ by forward substitution. It then exposes the same lower triangular inverse to both right-hand sides. This is the compact WY step used in Eq. `\ref{eq_app_wy_definitions}`{=latex}.

#### Auxiliary construction

The third kernel builds $$\begin{aligned}
    \rmU = \rmA(\rmW\odot\rmV),
    \qquad
    \rmY = \rmA\bar{\rmE}.
\label{eq_app_kernel_aux}
\end{aligned}$$ The implementation stores $\rmY$ using the historical buffer name `w` because KDA used the same buffer for its erase-side auxiliary; this buffer is not the write-gate matrix $\rmW$. The mathematical role is $\rmY$ throughout this paper.

#### State and output

The inter-chunk state recurrence consumes $\rmK_{\mathrm{tail}}$, $\rmY$, and $\rmU$, and applies Eq. `\ref{eq_app_chunk_state}`{=latex}. The output kernel consumes $\rmQ_{\gamma}$, $\rmA_{qk}$, and $\rmR=\rmU-\rmY\rmS_0$, and applies Eq. `\ref{eq_app_chunk_output}`{=latex}. These two kernels do not depend on how the gate factors were produced, so they share the same matrix shapes as KDA.

## Backward kernels {#app_backward_kernels}

The backward pass mirrors Appendix `\ref{app_backward}`{=latex}.

#### Output vector-Jacobian product

The first backward kernel computes $\mathrm{d}\rmA_{qk}$ and the output-path contribution to $\mathrm{d}\rmR$ from Eq. `\ref{eq_app_daqk_from_output}`{=latex} and Eq. `\ref{eq_app_dR_from_output}`{=latex}. It is structurally equivalent to the KDA output vector-Jacobian product because it only sees $\rmR$.

#### State vector-Jacobian product

The second backward kernel propagates $\mathrm{d}\rmS_C$ through Eq. `\ref{eq_app_bwd_S}`{=latex}. It is also structurally equivalent to the KDA state vector-Jacobian product because it consumes $\rmK_{\mathrm{tail}}$, $\rmY$, and $\rmU$ as already formed tensors.

#### Gate-aware WY vector-Jacobian product

The third backward kernel implements Eq. `\ref{eq_app_bwd_U}`{=latex}, Eq. `\ref{eq_app_bwd_Y}`{=latex}, and Eq. `\ref{eq_app_dT}`{=latex}. This is the main Gated DeltaNet-2 specific kernel. It accumulates $\mathrm{d}\rmA$ with $\rmZ^\top=(\rmW\odot\rmV)^\top$ and $\bar{\rmE}^\top=(\boldsymbol{\gamma}\odot\rmB\odot\rmK)^\top$. It also emits the direct gradients $$\begin{aligned}
    \mathrm{d}\rmW
    =
    \mathrm{d}\rmZ\odot\rmV,
    \qquad
    \mathrm{d}\rmV
    =
    \mathrm{d}\rmZ\odot\rmW,
    \qquad
    \mathrm{d}\rmB
    \mathrel{+}=
    \mathrm{d}\bar{\rmE}\odot\boldsymbol{\gamma}\odot\rmK .
\label{eq_app_kernel_gate_grads}
\end{aligned}$$ The erase-gate gradient has shape $B\times T\times H\times d_k$. The write-gate gradient has shape $B\times T\times H_v\times d_v$.

#### Intra-chunk vector-Jacobian product

The fourth backward kernel propagates through $\rmA_{qk}$ and $\rmT$. It adds the remaining contributions to $\mathrm{d}\rmQ$, $\mathrm{d}\rmK$, $\mathrm{d}\rmB$, and $\mathrm{d}\boldsymbol{g}$. The dependence on the cumulative decay is reduced by a reverse cumulative sum, as in Eq. `\ref{eq_app_dg_reverse_cumsum}`{=latex}.

## Autotuning and hardware dispatch {#app_autotune}

The fused WY backward kernel uses the same matrix shapes as the forward solve, but it has a denser set of live accumulators because it emits gradients for $\rmB$ and $\rmW$. On Hopper GPUs we restrict the warp search for this kernel to two and four warps, since the eight-warp schedule can trigger a Triton WGMMA layout assertion for the $64\times64$ accumulator. On Ampere GPUs the full search space is retained. This restriction changes only the schedule, not the mathematical operation.

## Recurrent decoding kernel {#app_recurrent_kernel}

A forward-only recurrent kernel is provided for autoregressive decoding at short sequence lengths. It applies Eq. `\ref{eq_app_gdn2_step}`{=latex} token by token. The kernel keeps the state in fp32, multiplies it by $\exp(\boldsymbol{g}_t)$, reads the decayed state along $\vb_t\odot\vk_t$, writes $\vw_t\odot\vv_t$ along $\vk_t$, and returns $\rmS_t^\top\vq_t$. Training uses the chunk kernel.

## Variable-length sequences {#app_varlen}

Packed variable-length batches are represented with cumulative sequence lengths. The chunk index construction resets the recurrent state at every sequence boundary. The same layout is used by the chunk forward, the chunk backward, and the recurrent decoding kernel. Padding is removed before the layer and restored after the output projection.
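
One plausible way to build chunk indices from cumulative sequence lengths is sketched below. The function name `chunk_indices` and the tuple layout are assumptions for illustration and are not taken from the released kernels.

```python
def chunk_indices(cu_seqlens, chunk_size=64):
    """Return (sequence_id, start, end) for every chunk of a packed batch.

    cu_seqlens: cumulative lengths, e.g. [0, 100, 130] for lengths 100 and 30.
    Chunks never cross a sequence boundary, so the first chunk of each
    sequence can start from a fresh recurrent state.
    """
    chunks = []
    for seq_id in range(len(cu_seqlens) - 1):
        bos, eos = cu_seqlens[seq_id], cu_seqlens[seq_id + 1]
        for start in range(bos, eos, chunk_size):
            chunks.append((seq_id, start, min(start + chunk_size, eos)))
    return chunks
```

With `cu_seqlens = [0, 100, 130]` and the default chunk size, the first sequence yields a full chunk and a 36-token remainder, and the second sequence yields a single 30-token chunk.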

# Numerical details and verification {#app_numerics}

## Decay precision {#app_decay_precision}

The decay gate in Eq. `\ref{eq_app_layer_decay}`{=latex} is computed in explicit fp32 before entering the kernels. This matters because the local cumulative sum $\boldsymbol{G}_r=\sum_{i\le r}\boldsymbol{g}_i$ depends on the path length. A low-precision mantissa can perturb long products of decays even when each tokenwise gate is small. The kernels therefore receive the log-decay tensor and only compute local cumulative sums and exponentials.

## Query and key normalization {#app_qk_norm}

Queries and keys are L2-normalized per head before the recurrent update. With normalized keys, $\vk_t\vk_t^\top$ is a projector in the tied-gate limit. In Gated DeltaNet-2, the erase factor is asymmetric, but normalization still stabilizes the scale of both $\rmA_{qk}$ and $\rmT$. The backward applies the standard L2-normalization vector-Jacobian product.

## State and accumulator dtypes {#app_dtype}

The recurrent state is stored in fp32 across chunks and during recurrent decoding. Matrix multiplication accumulators use fp32. The layer output is cast back to the model dtype at the kernel boundary. The WY auxiliaries may be stored in the model dtype after fp32 accumulation, since they are recomputed when the memory-saving training path is used.

## WY solve precision {#app_solve_precision}

The triangular solve for $\rmA=(\rmI+\rmT)^{-1}$ is the most precision-sensitive part of the chunk computation. Errors in this solve are propagated through dependent forward-substitution steps. The implementation therefore exposes an explicit precision flag for the solve. The conservative IEEE fp32 path is used when required by the hardware check, while the remaining matrix products can use the faster tensor-core path.

## Initialization and output gate {#app_init_output_gate}

All linear layers are initialized with Xavier uniform weights and gain $2^{-2.5}$. Biases are initialized to zero when present. After the attention computation, the output is passed through an RMSNorm and SiLU gate before the final output projection. These choices match the training recipe used for the Gated DeltaNet family and keep the early recurrent state magnitudes controlled.
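
As a sketch of the initialization, Xavier uniform sampling draws each weight from $\mathcal{U}(-b,b)$ with $b=\text{gain}\cdot\sqrt{6/(\text{fan}_{\text{in}}+\text{fan}_{\text{out}})}$. The Python below uses the standard-library random generator as a stand-in for the framework initializer. The function names, the seed, and the small $16\times 2048$ example shape are illustrative assumptions.

```python
import math
import random

GAIN = 2.0 ** -2.5  # gain stated in the text


def xavier_bound(fan_in, fan_out, gain=GAIN):
    """Half-width of the Xavier uniform interval."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform(fan_out, fan_in, gain=GAIN, rng=None):
    """Sample a fan_out x fan_in weight matrix as nested lists."""
    rng = rng or random.Random(0)
    bound = xavier_bound(fan_in, fan_out, gain)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


# Bound for a square d_model x d_model projection with d_model = 2048.
square_bound = xavier_bound(2048, 2048)

# Small illustrative weight matrix (shape chosen only to keep this fast).
weights = xavier_uniform(fan_out=16, fan_in=2048, rng=random.Random(0))
small_bound = xavier_bound(2048, 16)
max_abs_weight = max(abs(w) for row in weights for w in row)
```

With gain $2^{-2.5}\approx 0.177$, the sampling interval is roughly a factor of 5.7 narrower than with unit gain.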

## Correctness checks {#app_correctness}

We verified the chunkwise forward pass against a tokenwise recurrent reference for random configurations covering different sequence lengths, head counts, key dimensions, value dimensions, initial states, packed layouts, and dtypes. We verified the backward pass against autograd through the same recurrent reference. In fp64 reference tests, gradients for $\rmQ$, $\rmK$, $\rmV$, $\rmB$, $\rmW$, the log-decay, and the initial state agree to machine precision. In production fp32, differences are at the level expected from tensor-core accumulation noise. In bfloat16, the error is at the level set by the bfloat16 mantissa.

# Experimental settings {#app_experimental_settings}

## Training

We evaluate Gated DeltaNet-2 against a Transformer baseline and recent recurrent architectures, including Mamba-2 [@pmlr-v235-dao24a], Gated DeltaNet [@yang2025gated], Kimi Delta Attention (KDA) [@team2025kimi], and Mamba-3 [@lahoti2026mamba3]. For each recurrent architecture, we train a recurrent-only model and a hybrid model. The hybrid model follows Section `\ref{subsec_block_design}`{=latex}, using the same recurrent token mixer together with sliding-window attention (SWA) under the same residual block structure. For Mamba-3, we evaluate both SISO and MIMO variants. The Mamba-3 MIMO model uses rank $R=4$.

For fair recurrent comparisons, we match both parameter count and main recurrent state size. Gated DeltaNet, KDA, and Gated DeltaNet-2 use $H=16$ heads with $d_k=128$ and $d_v=128$, giving a per-layer recurrent state of $$\begin{aligned}
    H d_k d_v
    =
    16 \cdot 128 \cdot 128
    =
    262{,}144
\end{aligned}$$ floats per batch element. Since $d_{\mathrm{model}}=2048$, this equals $128d_{\mathrm{model}}$. For Mamba-2 and Mamba-3, we use expansion factor $2$ and head dimension $64$, and set $d_{\mathrm{state}}=64$. Their main recurrent state size is therefore $$\begin{aligned}
    (2d_{\mathrm{model}})d_{\mathrm{state}}
    =
    4096 \cdot 64
    =
    262{,}144 .
\end{aligned}$$ Mamba-3 MIMO keeps the same main recurrent state size as the SISO variant while adding the rank-$R$ MIMO parameterization.
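
The state-size matching above can be checked with a few lines of arithmetic. The variable names are illustrative, and the per-head view of the Mamba state (inner width divided by head dimension) is included only as a cross-check.

```python
d_model = 2048

# Gated DeltaNet, KDA, and Gated DeltaNet-2: H heads of size d_k x d_v.
num_heads, d_k, d_v = 16, 128, 128
delta_state = num_heads * d_k * d_v

# Mamba-2 and Mamba-3: expansion factor 2, head dimension 64, d_state 64.
expansion, mamba_head_dim, d_state = 2, 64, 64
d_inner = expansion * d_model
mamba_state = d_inner * d_state

# Equivalent per-head view: (d_inner / head_dim) heads of head_dim x d_state.
mamba_heads = d_inner // mamba_head_dim
mamba_state_per_head_view = mamba_heads * mamba_head_dim * d_state

print(delta_state, mamba_state, delta_state // d_model)
```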

Unless stated otherwise, all models have 1.3B parameters and are trained on 100B tokens sampled from FineWeb-Edu [@penedo2024fineweb]. We use AdamW with peak learning rate $4\times10^{-4}$, weight decay $0.1$, and gradient clipping at $1.0$. The learning rate follows cosine annealing with a 1B-token warm-up. The global batch size is 0.5M tokens. The training sequence length is 4K tokens. Hybrid models use a 2K SWA window.
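
A minimal sketch of this schedule as a function of tokens seen is given below. The text fixes the peak learning rate, the warm-up length, and the token budget. The linear warm-up shape, the final learning rate of zero, and reading "0.5M" as exactly 500,000 tokens are assumptions of the sketch.

```python
import math

PEAK_LR = 4e-4
TOTAL_TOKENS = 100_000_000_000   # 100B training tokens
WARMUP_TOKENS = 1_000_000_000    # 1B-token warm-up
BATCH_TOKENS = 500_000           # assumption: "0.5M" read as 500,000


def learning_rate(tokens_seen, min_lr=0.0):
    """Linear warm-up (assumed) followed by cosine annealing to min_lr."""
    if tokens_seen < WARMUP_TOKENS:
        return PEAK_LR * tokens_seen / WARMUP_TOKENS
    progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))


# Step counts implied by the assumed batch size.
total_steps = TOTAL_TOKENS // BATCH_TOKENS
warmup_steps = WARMUP_TOKENS // BATCH_TOKENS

# Learning rate at the midpoint of the cosine phase.
midpoint_tokens = WARMUP_TOKENS + (TOTAL_TOKENS - WARMUP_TOKENS) // 2
midpoint_lr = learning_rate(midpoint_tokens)
```

Under these assumptions the run takes 200,000 optimizer steps, of which the first 2,000 are warm-up.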

## Evaluation

#### Language modeling and common-sense reasoning

We use the evaluation suite commonly adopted for pretrained recurrent language models [@gu_mamba_2023]. Language modeling quality is measured by perplexity on WikiText [Wiki. @merity2016pointer] and LAMBADA [LMB. @paperno_lambada_2016]. For zero-shot transfer, we report LAMBADA accuracy together with PIQA [@bisk2020piqa], HellaSwag [Hella. @zellers2019hellaswag], WinoGrande [Wino. @sakaguchi2021winogrande], ARC-Easy and ARC-Challenge [ARC-e and ARC-c @arc-ce], OpenBookQA [OBQA @openbookqa], Social IQa [SIQA @sap2019social], and BoolQ [@clark2019boolq]. This mix covers next-token prediction, physical and social reasoning, commonsense completion, and elementary science QA.

#### In-context retrieval

We evaluate retrieval in both controlled synthetic settings and real-data settings. For synthetic retrieval, we use Single Needle-In-A-Haystack (S-NIAH) and Multi-Key Needle-In-A-Haystack (MK-NIAH) tasks from RULER [@hsieh2024ruler]. The S-NIAH suite contains three progressively harder cases. S-NIAH-1 is passkey retrieval, S-NIAH-2 asks for a numerical needle, and S-NIAH-3 asks for a word-based needle. We additionally evaluate MK-NIAH-1 where several distractor key-value pairs are present and the model must return the value associated with one requested key.

For real-world retrieval, we follow [@arora-2024-jrt]. The suite includes SWDE [@lockard_openceres_2019] for structured relation extraction from HTML, FDA [@arora_language_2023] for key-value retrieval from PDFs, and question-answering datasets including SQuAD [@rajpurkar_know_2018], TriviaQA [@JoshiTriviaQA2017], DROP [@dua2019drop], and Natural Questions [@47761].

```{=latex}
\newpage
```
```{=latex}
\small
```
```{=latex}
\bibliographystyle{unsrt}
```
