---
abstract: |
  Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary---gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.\
  Code: <https://github.com/NVlabs/GatedDeltaNet>
author:
- |
  Songlin Yang [^1]\
  MIT CSAIL\
  `yangsl66@mit.edu`\
  `\And`{=latex} Jan Kautz\
  NVIDIA\
  `jkautz@nvidia.com`\
  `\And`{=latex} Ali Hatamizadeh $^\star$\
  NVIDIA\
  `ahatamizadeh@nvidia.com`\
bibliography:
- ref.bib
title: |
  Gated Delta Networks:\
  Improving Mamba2 with Delta Rule
---

# Introduction

The Transformer architecture has significantly advanced the capabilities of Large Language Models (LLMs), showcasing exceptional performance across a wide range of tasks due to its effective attention mechanism. This mechanism excels in precise sequence modeling and leverages the parallel processing capabilities of modern GPUs during training. However, the self-attention component scales quadratically with sequence length, leading to substantial computational demands that pose challenges for both training and inference.

To mitigate these issues, researchers have explored alternatives such as linear Transformers [@katharopoulos2020transformers], which replace traditional softmax-based attention with kernelized dot-product-based linear attention, substantially reducing memory requirements during inference by reframing attention as a linear RNN with matrix-valued states. While early versions of linear Transformers underperformed standard Transformers in language modeling, recent enhancements, such as data-dependent gating mechanisms akin to those in LSTMs (exemplified by GLA [@yang_gated_2023] and Mamba2 [@mamba2]), have shown promising improvements. However, challenges persist in managing information over long sequences, particularly for in-context retrieval tasks where traditional Transformers maintain their advantage [@zoology; @arora_simple_2024; @jelassi_repeat_2024; @wen_rnns_2024; @akyurek_-context_2024].

This phenomenon is not surprising: linear Transformers can be interpreted as implementing an outer-product-based key-value association memory, reminiscent of tensor product representation [@DBLP:journals/ai/Smolensky90]. However, the number of orthogonal key-value pairs they can store is *bounded* by the model's dimensionality. When the sequence length exceeds this dimension, "memory collisions" become inevitable, hindering exact retrieval [@linear-xmr-fastweight].
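The capacity limit can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the dimension `d`, the standard-basis keys, and the extra key `k_extra` are illustrative assumptions chosen to make the interference easy to see.

```python
import numpy as np

d = 4  # toy key/value dimension (illustrative assumption)

# d mutually orthogonal unit keys, each bound to a distinct value vector.
keys = np.eye(d)
values = np.arange(1.0, d * d + 1.0).reshape(d, d)

# Outer-product associative memory: S = sum_i v_i k_i^T
S = sum(np.outer(v, k) for v, k in zip(values, keys))

# With orthonormal keys, reading with k_i returns v_i exactly.
exact = all(np.allclose(S @ k, v) for k, v in zip(keys, values))

# A (d+1)-th key cannot be orthogonal to all d existing keys.
k_extra = np.ones(d) / np.sqrt(d)
v_extra = np.full(d, 100.0)
S_full = S + np.outer(v_extra, k_extra)

# Reading key 0 now returns v_0 plus (k_extra . k_0) * v_extra: a collision.
readout = S_full @ keys[0]
collision_error = np.linalg.norm(readout - values[0])
print(exact, collision_error)
```

Once more than `d` keys are written, any additional key overlaps with stored ones, so every affected readout picks up a contribution from the new value.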

Mamba2 addresses this limitation by introducing a simple gated update rule, $\rmS_t = \alpha_t \rmS_{t-1} + \vv_t\vk_t^\intercal$, which uniformly decays all key-value associations at each time step by a dynamic ratio, $\alpha_t \in (0,1)$. However, this approach does not account for the varying importance of different key-value associations, potentially leading to inefficient memory utilization. If the model needs to forget a specific key-value association, all key-value associations are equally forgotten, making the process less targeted and efficient.

In contrast, the linear Transformer with the delta rule [@widrow_adaptive_1988], known as DeltaNet [@linear-xmr-fastweight; @yang2024parallelizing], selectively updates memory by (softly) replacing an old key-value pair with the incoming one in a sequential manner. This method has demonstrated impressive performance in synthetic benchmarks for in-context retrieval. However, since this process only modifies a single key-value pair at a time, the model lacks the ability to rapidly clear outdated or irrelevant information, especially during context switches where previous data needs to be erased. Consequently, DeltaNet has been found to perform moderately on real-world tasks [@yang2024parallelizing], likely due to the absence of a robust memory-clearing mechanism.

Recognizing the complementary advantages of the gated update rule and the delta rule in memory management, we propose the *gated delta rule*, a simple and intuitive mechanism that combines both approaches. This unified rule enables flexible memory control: it can promptly clear memory by setting $\alpha_t \rightarrow 0$, while selectively updating specific content without affecting other information by setting $\alpha_t \rightarrow 1$ (effectively switching to the pure delta rule).

The remaining challenge lies in implementing the gated delta rule in a hardware-efficient manner. Building upon @yang2024parallelizing's efficient algorithm that parallelizes the delta rule computation using the WY representation [@bischof_wy_1985], we carefully extend their approach to incorporate the gating terms. Our extension preserves the benefits of chunkwise parallelism [@hua_transformer_2022; @sun2023retentive; @yang_gated_2023; @yang2024parallelizing], enabling hardware-efficient training.

Our resulting architecture, Gated DeltaNet, consistently outperforms both Mamba2 and DeltaNet across a comprehensive suite of benchmarks, including language modeling, commonsense reasoning, in-context retrieval, length extrapolation, and long-context understanding. Building on these results, we also develop hybrid architectures that strategically combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, further enhancing both training efficiency and model performance.

# Preliminary

## Mamba2: Linear Attention with decay

It is known that the linear transformer [@katharopoulos_transformers_2020] can be formulated as the following linear recurrence when excluding normalization and query/key activations: $$\begin{aligned}
    \rmS_t = \rmS_{t-1} + \vv_t \vk_t^\intercal  \in \mathbb{R}^{d_v \times d_k}, \qquad \qquad
    \vo_t = \rmS_t \vq_t \in \mathbb{R}^{d_v}
\end{aligned}$$ where $d_k$ and $d_v$ represent the (head) dimensions for query/key and value, respectively. By expanding the recurrence, we can express it in both vector form (left) and matrix form (right) as follows: $$\begin{aligned}
    \vo_t = \sum_{i=1}^t (\vv_i \vk_i^\intercal) \vq_t = \sum_{i=1}^t \vv_i (\vk_i^\intercal \vq_t) \in \mathbb{R}^{d_v},  \qquad
    \rmO = (\rmQ \rmK^\intercal  \odot \rmM) \rmV \in \mathbb{R}^{L \times d_v}
\end{aligned}$$ where $L$ is the sequence length, and $\rmM \in \mathbb{R}^{L\times L}$ is the causal mask defined by $\rmM_{ij} = 0$ when $i < j$, and $1$ otherwise.
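The equivalence of the recurrent and matrix forms can be checked on random inputs. This is a small sketch under assumed toy sizes (`L`, `d_k`, `d_v` are arbitrary), without normalization or query/key activations, matching the simplified recurrence above.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k, d_v = 6, 4, 3  # toy sizes (assumed for illustration)
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
V = rng.standard_normal((L, d_v))

# Recurrent form: S_t = S_{t-1} + v_t k_t^T,  o_t = S_t q_t
S = np.zeros((d_v, d_k))
O_recurrent = np.zeros((L, d_v))
for t in range(L):
    S = S + np.outer(V[t], K[t])
    O_recurrent[t] = S @ Q[t]

# Matrix form: O = (Q K^T ⊙ M) V, with M the lower-triangular causal mask
M = np.tril(np.ones((L, L)))
O_parallel = ((Q @ K.T) * M) @ V

print(np.allclose(O_recurrent, O_parallel))
```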

However, this vanilla linear attention underperforms Transformers in language modeling by a large margin. To address this, it is common to add a decay term to forget historical information. Here we take Mamba2 [@mamba2] as an example, which can be represented by the following linear recurrence (up to specific parameterization): $$\rmS_t = {\color{blue}\alpha_t} \rmS_{t-1} + \vv_t \vk_t^\intercal, \qquad \vo_t = \rmS_t \vq_t$$ where ${\color{blue}\alpha_t \in (0,1)}$ is a data-dependent scalar-valued decay term that varies with $t$. Define the cumulative decay product $\color{blue}{\gamma_j = \prod_{i=1}^j \alpha_i}$, and by expanding the recurrence, we can express the result in both a vector form (left) and a matrix parallel form (right): $$\vo_t = \sum_{i=1}^t \left({ {\color{blue}\frac{\gamma_t}{\gamma_i}}} \vv_i \vk_i^\intercal  \right) \vq_t = \sum_{i=1}^t \vv_i \left( {\color{blue} \frac{\gamma_t}{\gamma_i}} \vk_i^\intercal \vq_t \right), \qquad
\rmO = \left( \left(\rmQ \rmK^\intercal \right) \odot { {\color{blue}\Gamma}} \right) \rmV$$ Here, ${\color{blue}\Gamma \in \mathbb{R}^{L\times L}}$ is a decay-aware causal mask where $\color{blue}{\Gamma_{ij} = \frac{\gamma_i}{\gamma_j}}$ if $i \ge j$ and ${\color{blue}\Gamma_{ij} = 0}$ otherwise. The equivalence between these parallel and recurrent forms is also referred to as the state space duality (SSD) described in @mamba2. This recurrence structure appears in several other architectures including Gated RFA [@peng_random_2021], xLSTM [@beck2024xlstm], and Gated RetNet [@Sun2024YouOC]. When $\gamma_t$ is data-independent, the formulation reduces to RetNet [@sun2023retentive] and Lightning-Attention [@lightning2]. Furthermore, if $\gamma_t$ is extended to be matrix-valued rather than scalar-valued, efficient training algorithms remain possible when parameterized with an outer-product structure, as demonstrated by @yang_gated_2023 and adopted in subsequent work [@peng_eagle_2024; @qin2024hgrn2; @Zhang2024GatedSA; @chou2024metala; @he2025rodimus; @lu2025reglarefininggatedlinear].
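As a sanity check of the decay-aware mask, the sketch below compares the gated recurrence with the masked matrix form on random inputs. The sizes and the range from which `alpha` is sampled are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d_k, d_v = 6, 4, 3  # toy sizes (assumed)
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
V = rng.standard_normal((L, d_v))
alpha = rng.uniform(0.5, 1.0, size=L)  # scalar decay per step, in (0, 1)

# Recurrent form: S_t = alpha_t S_{t-1} + v_t k_t^T,  o_t = S_t q_t
S = np.zeros((d_v, d_k))
O_recurrent = np.zeros((L, d_v))
for t in range(L):
    S = alpha[t] * S + np.outer(V[t], K[t])
    O_recurrent[t] = S @ Q[t]

# Matrix form: Gamma_ij = gamma_i / gamma_j for i >= j, and 0 otherwise
gamma = np.cumprod(alpha)
Gamma = np.tril(gamma[:, None] / gamma[None, :])
O_parallel = ((Q @ K.T) * Gamma) @ V

print(np.allclose(O_recurrent, O_parallel))
```

In practice the ratio of cumulative products is usually computed in log space to avoid underflow on long sequences; the direct division here is only adequate for short toy inputs.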

#### Chunkwise training

However, neither the recurrent nor the parallel form is ideal for efficient training [@hua_transformer_2022; @yang_gated_2023], which motivates the chunkwise parallel form [@hua_transformer_2022; @sun2023retentive] for hardware-efficient, linear-time training, as introduced below. In short, the chunkwise parallel form splits inputs and outputs into chunks of size $C$, and computes the outputs of each chunk from the final state of the previous chunk and the query/key/value blocks of the current chunk. Following the notation of @sun_retentive_2023 [@yang_gated_2023; @yang2024parallelizing], we take the query block, $\vq$, as an example. We denote $\rmQ_{[t]} := \vq_{tC+1:(t+1)C+1}$ as the query block for chunk $t$, and $\vq_{[t]}^r := \vq_{tC+r}$ as the $r$-th query within chunk $t$. The initial state of chunk $t$ is defined as $\rmS_{[t]} := \rmS_{[t]}^0 = \rmS_{[t-1]}^C$. By partially expanding the recurrence, we have

$$\begin{aligned}
     \rmS_{[t]}^r = \rmS_{[t]} + \sum_{i=1}^r \vv_{[t]}^{i} \vk_{[t]}^{i\intercal} \in \mathbb{R}^{d_v\times d_k}, \qquad
     \vo_{[t]}^r = \rmS_{[t]}^r\vq_{[t]}^r = \rmS_{[t]}\vq_{[t]}^r + \sum_{i=1}^r \vv_{[t]}^{i} \left(\vk_{[t]}^{i\intercal} \vq_{[t]}^{r} \right)  \in \mathbb{R}^{d_v}
\end{aligned}$$

Equivalently, in matrix form:

$$\begin{aligned}
\rmS_{[t+1]} = \rmS_{[t]} + \rmV_{[t]}^\intercal \rmK_{[t]} \in \mathbb{R}^{d_v \times d_k}, \qquad
\rmO_{[t]} = \rmQ_{[t]} \rmS_{[t]}^\intercal  + \left(\rmQ_{[t]}\rmK_{[t]}^\intercal \odot \rmM\right) \rmV_{[t]}  \in \mathbb{R}^{C \times d_v}
\end{aligned}$$ where $\rmM \in \mathbb{R}^{C\times C}$ is the causal mask. The above equations are rich in matrix multiplications (matmuls), allowing for tensor-core-based hardware optimization. This chunkwise algorithm could be easily extended to linear attention with decay: $$\begin{aligned}
    \rmS_{[t+1]} = {\color{blue}
\overrightarrow{\rmS_{[t]}}}
+ \rmV_{[t]}^\intercal
    {\color{blue}
\overrightarrow{\rmK_{[t]}}} \in \mathbb{R}^{d_v \times d_k}
,  &&
    \rmO_{[t]} = {\color{blue}{\overleftarrow{ \rmQ_{[t]}}}} \rmS_{[t]}^\intercal + \left(\rmQ_{[t]} \rmK_{[t]}^\intercal \odot {\color{blue}\Gamma_{[t]}}\right)\rmV_{[t]} \in \mathbb{R}^{C\times d_v}
    \label{eq:mamba2-update-o}
\end{aligned}$$ where ${\color{blue}(\Gamma_{[t]})_{ij} = \frac{\gamma_{[t]}^i}{\gamma_{[t]}^j}}$ for $i \ge j$ (and $0$ otherwise), with ${\color{blue}\gamma_{[t]}^j = \prod_{i=tC+1}^{tC+j} \alpha_i}$. [^2] Here we use the left arrow ($\overleftarrow{\cdot}$) or the right arrow ($\overrightarrow{\cdot}$) to denote a variable decaying to the first position and the last position of each chunk, respectively, $$\begin{aligned}
    {\color{blue}\overleftarrow{\vq_{[t]}^r}} &= {\color{blue}\gamma_{[t]}^r} \vq_{[t]}^r && \text{decaying each vector to the first position of  chunk $t$} \nonumber \\
{\color{blue}\overrightarrow{\vk_{[t]}^r}} &= {\color{blue}\frac{\gamma_{[t]}^{C}}{\gamma_{[t]}^r}} \vk_{[t]}^r  && \text{decaying each vector to the last position of  chunk $t$} \nonumber \\
{\color{blue}\overrightarrow{\rmS_{[t]}}} &= {\color{blue}\gamma_{[t]}^C}\rmS_{[t]}  && \text{decaying the state matrix over the entire chunk $t$}
\label{eq:def_notation}
\end{aligned}$$ and likewise for other variables (e.g., ${\color{blue}\overrightarrow{\vv}}$). The SSD decomposition algorithm introduced in Mamba2 is largely equivalent to this chunkwise algorithm. For a more generalized approach, @yang_gated_2023 proposed an extended chunkwise algorithm for linear attention that incorporates fine-grained decay mechanisms.
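The chunkwise form with decay can be checked against the step-by-step recurrence. This is a minimal NumPy sketch under assumed toy sizes, with the sequence length a multiple of the chunk size `C`; the names `Q_left` and `K_right` are illustrative stand-ins for the left-arrow and right-arrow quantities, and the code is not the fused kernel used for training.

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, d_k, d_v = 8, 4, 4, 3  # toy sizes (assumed); L is a multiple of C
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
V = rng.standard_normal((L, d_v))
alpha = rng.uniform(0.5, 1.0, size=L)

# Reference: S_t = alpha_t S_{t-1} + v_t k_t^T,  o_t = S_t q_t
S = np.zeros((d_v, d_k))
O_ref = np.zeros((L, d_v))
for t in range(L):
    S = alpha[t] * S + np.outer(V[t], K[t])
    O_ref[t] = S @ Q[t]
S_ref = S

# Chunkwise: one state update per chunk, matmuls inside each chunk
S = np.zeros((d_v, d_k))
O_chunk = np.zeros((L, d_v))
for start in range(0, L, C):
    sl = slice(start, start + C)
    Qc, Kc, Vc = Q[sl], K[sl], V[sl]
    g = np.cumprod(alpha[sl])                 # gamma_[t]^r for r = 1..C
    Gamma = np.tril(g[:, None] / g[None, :])  # intra-chunk decay mask
    Q_left = g[:, None] * Qc                  # queries decayed to chunk start
    K_right = (g[-1] / g)[:, None] * Kc       # keys decayed to chunk end
    O_chunk[sl] = Q_left @ S.T + ((Qc @ Kc.T) * Gamma) @ Vc
    S = g[-1] * S + Vc.T @ K_right            # decayed state plus new writes

print(np.allclose(O_chunk, O_ref), np.allclose(S, S_ref))
```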

## Delta Networks: Linear Attention with Delta Rule

The delta update rule [@widrow_adaptive_1988; @schlag_linear_2021] *dynamically* erases the value ($\vv_t^{\text{old}}$) associated with the current input key ($\vk_t$) and writes a new value ($\vv_t^{\text{new}}$), which is a linear combination of the current input value and the old value based on the "writing strength" $\beta_t \in (0,1)$.[^3] $$\begin{aligned}
    \rmS_t &= \rmS_{t-1} - \underbrace{\left(\rmS_{t-1} \vk_t\right)}_{\vv_{t}^{\text{old}}}  \vk_t^\intercal + \underbrace{\left(\beta_t \vv_t + (1-\beta_t)\rmS_{t-1}\vk_t\right)}_{\vv_{t}^{\text{new}}}  \vk_t^\intercal
    % = \rmS_{t-1} + \underbrace{\vu_t}_{\vv_{t}^{\text{new}} - \vv_{t}^{\text{old}}} \vk_t^\intercal
= \rmS_{t-1} \left(\rmI - \beta_t \vk_t \vk_t^\intercal \right)  + \beta_t  \vv_t \vk_t^\intercal
% \in \mathbb{R}^{d_v\times d_k}
\end{aligned}$$ As shown above, DeltaNet implements a first-order linear recurrence with generalized Householder transition matrices $\left(\rmI - \beta_t \vk_t \vk_t^\intercal \right)$. Despite demonstrating superior associative recall and language modeling performance [@linear-xmr-fastweight], DeltaNet received limited attention due to computational inefficiency until @yang2024parallelizing introduced a hardware-efficient chunkwise training algorithm, as detailed below.
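To make the erase-then-write behavior concrete, the sketch below writes two values under the same key and reads the memory back. The keys, values, and the boundary choice $\beta_t = 1$ (full replacement) are illustrative assumptions, and `delta_rule_step` is a hypothetical helper name.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    """One delta-rule update: replace v_old = S k by v_new along key k."""
    v_old = S @ k
    v_new = beta * v + (1.0 - beta) * v_old
    return S + np.outer(v_new - v_old, k)

d_k, d_v = 4, 3
k_a = np.array([1.0, 0.0, 0.0, 0.0])  # unit-norm, mutually orthogonal keys
k_b = np.array([0.0, 1.0, 0.0, 0.0])
v_a1 = np.array([1.0, 2.0, 3.0])
v_a2 = np.array([-4.0, 0.5, 7.0])
v_b = np.array([9.0, 9.0, 9.0])

# Write (k_a, v_a1), then (k_b, v_b), then overwrite k_a with v_a2.
S = np.zeros((d_v, d_k))
for k, v in [(k_a, v_a1), (k_b, v_b), (k_a, v_a2)]:
    S = delta_rule_step(S, k, v, beta=1.0)

# Purely additive linear attention, for contrast: values under k_a pile up.
S_add = np.outer(v_a1, k_a) + np.outer(v_b, k_b) + np.outer(v_a2, k_a)

# The same step written with the generalized Householder transition matrix.
rng = np.random.default_rng(0)
S0 = rng.standard_normal((d_v, d_k))
beta = 0.3
householder = S0 @ (np.eye(d_k) - beta * np.outer(k_a, k_a)) + beta * np.outer(v_a2, k_a)

print(S @ k_a, S_add @ k_a)
```

With the delta rule, reading `k_a` returns only the latest value and `k_b` is untouched, whereas the additive memory returns the sum of both values written under `k_a`.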

#### Chunkwise parallel form.

By partially expanding the recurrence, we have $$\begin{aligned}
 \rmS_{[t]}^r = \rmS_{[t]} \underbrace{\left(\prod_{i=1}^r \left(\rmI - \beta_{[t]}^i \vk_{[t]}^i \vk_{[t]}^{i\intercal}\right) \right)}_{:= \rmP_{[t]}^r} + \underbrace{\sum_{i=1}^{r} \left( \beta^i_{[t]} \vv^i_{[t]} \vk_{[t]}^{i\intercal}\prod_{j=i+1}^{r}  \left(\rmI - \beta_{[t]}^j \vk^j_{[t]} \vk_{[t]}^{j\intercal} \right) \right)}_{:= \rmH_{[t]}^r}
 \label{eq:delta_rule_expand}
\end{aligned}$$ where $\rmP_{[t]}^r$ involves cumulative products of generalized Householder matrices, which can be computed compactly via the classical WY representation [@bischof_wy_1985]:$$\begin{aligned}
    \rmP_{[t]}^{r} &= \rmI - \sum_{i=1}^{r}\vw_{[t]}^i\vk_{[t]}^{i\intercal}  \in \mathbb{R}^{d_k \times d_k}
     &&\vw_{[t]}^r = \beta_{[t]}^r \left(\vk_{[t]}^r -  \sum_{i=1}^{r-1} \left(\vw_{[t]}^i (\vk_{[t]}^{i\intercal}\vk_{[t]}^r) \right) \right) \in \mathbb{R}^{d_k}
     \label{eq:wy-pw}
\end{aligned}$$ Likewise, $\rmH_{[t]}^r$ could be represented as:

$$\begin{aligned}
 \rmH_{[t]}^{r} &= \sum_{i=1}^{r} \vu_{[t]}^i \vk_{[t]}^{i\intercal}  \in \mathbb{R}^{d_v \times d_k}  &&   \vu_{[t]}^r = \beta_{[t]}^r \left(\vv_{[t]}^r -  \sum_{i=1}^{r-1} \left(\vu_{[t]}^i (\vk_{[t]}^{i\intercal}\vk_{[t]}^r) \right) \right)\in \mathbb{R}^{d_v}
 \label{eq:wy-ph}
\end{aligned}$$ and in matrix form: $\rmP_{[t]}=\rmI-\rmW_{[t]}^\top\rmK_{[t]} \in \mathbb{R}^{d_k \times d_k}$, $\rmH_{[t]}=\rmU_{[t]}^\top\rmK_{[t]} \in \mathbb{R}^{d_v\times d_k}$. By using the UT transform [@Joffrain2006AccumulatingHT], we can further write $\rmW$ and $\rmU$ in matrix form:$$\begin{aligned}
   \rmT_{[t]} = \left[\rmI + \operatorname{strictLower}\left(\operatorname{diag}(\beta_{[t]})\rmK_{[t]} \rmK_{[t]}^\intercal\right)\right]^{-1}\operatorname{diag}\left(\beta_{[t]}\right)
   \in \mathbb{R}^{C \times C}
   \label{eq:inverse}\\
   \rmW_{[t]}= \rmT_{[t]} \rmK_{[t]}
   \in \mathbb{R}^{C \times d_k}, \qquad
   \rmU_{[t]}=\rmT_{[t]}\rmV_{[t]}
   \in \mathbb{R}^{C \times d_v}
   \label{eq:wu=akv}
\end{aligned}$$ Substituting these back into Eq. `\ref{eq:delta_rule_expand}`{=latex} yields a hardware-efficient chunkwise algorithm for DeltaNet that leverages matmuls, enabling tensor core based GPU optimization:$$\begin{aligned}
\rmS_{[t+1]} &= \rmS_{[t]}\rmP_{[t]}+\rmH_{[t]} =  \rmS_{[t]} + \left(\rmU_{[t]} - \rmW_{[t]}\rmS_{[t]}^{\intercal}\right)^\intercal \rmK_{[t]} \label{eq:delta_chunk_h} & \in \mathbb{R}^{d_v \times d_k} \\
    \rmO_{[t]} &= \rmQ_{[t]} \rmS_{[t]}^\intercal + (\rmQ_{[t]} \rmK_{[t]}^{\intercal} \odot \rmM) \left(\rmU_{[t]} - \rmW_{[t]} \rmS_{[t]}^\intercal\right)  &\in \mathbb{R}^{C \times d_v}\label{eq:delta_chunk_o}
\end{aligned}$$
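The chunkwise DeltaNet equations can be verified against the sequential delta rule. The following NumPy sketch assumes toy sizes, L2-normalized keys, and a sequence length that is a multiple of `C`; it uses `np.linalg.solve` on the unit lower-triangular system in place of an explicit inverse, which is a convenience choice for the illustration rather than how an optimized kernel would do it.

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, d_k, d_v = 8, 4, 4, 3  # toy sizes (assumed); L is a multiple of C
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # L2-normalized keys
V = rng.standard_normal((L, d_v))
beta = rng.uniform(0.1, 0.9, size=L)
eye_k, eye_c = np.eye(d_k), np.eye(C)

# Reference: S_t = S_{t-1}(I - beta_t k_t k_t^T) + beta_t v_t k_t^T
S = np.zeros((d_v, d_k))
O_ref = np.zeros((L, d_v))
for t in range(L):
    S = S @ (eye_k - beta[t] * np.outer(K[t], K[t])) + beta[t] * np.outer(V[t], K[t])
    O_ref[t] = S @ Q[t]
S_ref = S

# Chunkwise form using the WY representation and the UT transform
M = np.tril(np.ones((C, C)))
S = np.zeros((d_v, d_k))
O_chunk = np.zeros((L, d_v))
for start in range(0, L, C):
    sl = slice(start, start + C)
    Qc, Kc, Vc, B = Q[sl], K[sl], V[sl], np.diag(beta[sl])
    A = np.tril(B @ (Kc @ Kc.T), k=-1)  # strictLower(diag(beta) K K^T)
    T = np.linalg.solve(eye_c + A, B)   # T = (I + A)^{-1} diag(beta)
    W, U = T @ Kc, T @ Vc
    delta = U - W @ S.T                 # per-position corrected writes
    O_chunk[sl] = Qc @ S.T + ((Qc @ Kc.T) * M) @ delta
    S = S + delta.T @ Kc

print(np.allclose(O_chunk, O_ref), np.allclose(S, S_ref))
```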

# Gated Delta Networks

## Formulation: Gated Delta Rule {#sec:online-learning}

The proposed gated delta rule is simple yet effective: $$\begin{aligned}
\rmS_t = \rmS_{t-1} \left( {\color{blue}{\alpha_t}}  (\rmI - \beta_t \vk_t\vk_t^\intercal) \right) + \beta_t \vv_t \vk_t^\intercal
\label{eq:gated_delta_rule}
\end{aligned}$$ where the data-dependent gating term $\color{blue}{\alpha_t} \in (0,1)$ controls state decay. This formulation unifies the advantages of both gating mechanisms and the delta rule: the gating term enables adaptive memory management, while the delta update structure facilitates effective key-value association learning.
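The two limiting behaviors of the gate can be seen directly from a single step. This is a minimal sketch with assumed toy sizes; `gated_delta_step` is a hypothetical helper name, and the boundary values $\alpha_t = 1$ and $\alpha_t = 0$ are used only to show the limits of the open interval.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """S_t = S_{t-1} (alpha (I - beta k k^T)) + beta v k^T."""
    eye = np.eye(k.shape[0])
    return S @ (alpha * (eye - beta * np.outer(k, k))) + beta * np.outer(v, k)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3  # toy sizes (assumed)
S_prev = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)
beta = 0.7

# alpha -> 1: the update reduces to the pure delta rule.
delta_only = S_prev @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
kept = gated_delta_step(S_prev, k, v, alpha=1.0, beta=beta)

# alpha -> 0: the previous memory is erased and only the new write remains.
cleared = gated_delta_step(S_prev, k, v, alpha=0.0, beta=beta)

print(np.allclose(kept, delta_only), np.allclose(cleared, beta * np.outer(v, k)))
```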

We present a formal analysis of the gated delta rule through the lens of the online learning framework introduced by @longhorn. In this framework, recurrent state updates emerge as *closed-form* solutions to an online learning problem, as shown in Table `\ref{tab:online-learning-rnn}`{=latex}. Recent linear RNN architectures typically incorporate a regularization term in their online learning objective that keeps the state from diverging from its previous value, thereby enabling memory retention. However, this retention mechanism becomes problematic when the state is saturated with information: each state then encodes a superposition of many pieces of information, making precise retrieval challenging. To address this limitation, Mamba2 and Gated DeltaNet introduce an adaptive scaling factor $\alpha_t$ that relaxes the regularization term, allowing controlled deviations between $\rmS_t$ and $\rmS_{t-1}$. This modification enables dynamic memory management through selective forgetting, which could be useful in filtering out irrelevant information (see §`\ref{sec:case_study}`{=latex}).

On the other hand, Linear Attention (LA) and Mamba2 use a simple negative inner-product loss $-\langle\rmS_t \vk_t, \vv_t\rangle$, while Longhorn [@longhorn] uses a more expressive online regression objective $\|\rmS_t\vk_t - \vv_t\|^2$ for better modeling of key-value associations. Longhorn's resulting update rule closely resembles the delta update rule, [^4] suggesting the superiority of the (gated) delta rule over Mamba2 in in-context associative recall.

From the perspective of fast weight programming [@Irie2022TheDF], test-time training [@ttt], and test-time regression [@wang2025testtimeregressionunifyingframework], the hidden state $\rmS$ can be interpreted as a (fast) weight matrix, with the delta rule optimizing the online regression objective $\mathcal{L}(\rmS_t)=\frac{1}{2} \| \rmS_t\vk_t - \vv_t \|^2$ via *test-time* stochastic gradient descent (SGD): $$\begin{aligned}
\rmS_{t+1} &= \rmS_{t} - \beta_t \nabla \mathcal{L}(\rmS_t)
= \rmS_{t} - \beta_t (\rmS_t\vk_t - \vv_t)\vk_t^\intercal = \rmS_{t}\left(\rmI-\beta_t\vk_t\vk_t^\intercal\right) + \beta_t \vv_t\vk_t^\intercal
\end{aligned}$$ where $\beta_t$ represents the (adaptive) learning rate. From this perspective, the gated delta rule can be viewed as incorporating an adaptive weight decay term $\alpha_t$ into the SGD update, a technique widely used in deep learning [@Krogh1991ASW; @Andriushchenko2023WhyDW]. Concurrently, Titans [@behrouz2024titanslearningmemorizetest] demonstrated the effectiveness of incorporating weight decay mechanisms in RNN test-time SGD updates.
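The SGD reading can be checked numerically. The sketch below verifies the gradient of the regression loss by finite differences, confirms that one SGD step equals the delta rule, and then applies the decay before the gradient step. That ordering (decay first, then a step evaluated at the decayed weights) is an assumption about how to read the weight-decay interpretation, chosen because it reproduces the gated delta rule exactly; the sizes and the values of `alpha` and `beta` are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v = 4, 3  # toy sizes (assumed)
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k)
v = rng.standard_normal(d_v)
alpha, beta = 0.9, 0.5
eye = np.eye(d_k)

def loss(S_):
    """Online regression loss 0.5 * ||S k - v||^2."""
    r = S_ @ k - v
    return 0.5 * r @ r

def grad(S_):
    """Analytic gradient of the loss with respect to S: (S k - v) k^T."""
    return np.outer(S_ @ k - v, k)

# Central finite differences as an independent check of the gradient.
eps = 1e-6
num_grad = np.zeros_like(S)
for i in range(d_v):
    for j in range(d_k):
        E = np.zeros_like(S)
        E[i, j] = eps
        num_grad[i, j] = (loss(S + E) - loss(S - E)) / (2 * eps)

# One SGD step with learning rate beta is the delta rule.
sgd_step = S - beta * grad(S)
delta_rule = S @ (eye - beta * np.outer(k, k)) + beta * np.outer(v, k)

# Decay the weights by alpha, then take the SGD step at the decayed weights.
S_decayed = alpha * S
decayed_sgd_step = S_decayed - beta * grad(S_decayed)
gated_delta_rule = S @ (alpha * (eye - beta * np.outer(k, k))) + beta * np.outer(v, k)

print(np.allclose(sgd_step, delta_rule), np.allclose(decayed_sgd_step, gated_delta_rule))
```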

```{=latex}
\scriptsize
```
```{=latex}
\setlength{\tabcolsep}{4pt}
```
```{=latex}
\renewcommand{\arraystretch}{1.5}
```
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  **Method**        **Online Learning Objective**                                                                                                             **Online Update**
  ----------------- ----------------------------------------------------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------------------------------------------------------------
  LA                $\displaystyle  \|\rmS_t - \rmS_{t-1}\|_F^2 - 2\langle\rmS_t \vk_t, \vv_t\rangle$                                                         $\displaystyle \rmS_t = \rmS_{t-1} + \vv_t \vk_t^T$

  Mamba2            $\displaystyle  \|\rmS_t - \alpha_t \rmS_{t-1}\|_F^2 - 2\langle\rmS_t \vk_t, \vv_t\rangle$                                                $\displaystyle \rmS_t = \alpha_t \rmS_{t-1} + \vv_t \vk_t^T$

  Longhorn          $\displaystyle \|\rmS_t - \rmS_{t-1}\|_F^2 + \beta_t \|\rmS_t \vk_t - \vv_t \|^2$                                                         $\displaystyle \rmS_t = \rmS_{t-1}(\rmI - \epsilon_t\vk_t \vk_t^T) + \epsilon_t \vv_t \vk_t^T, \epsilon_t=\frac{\beta_t}{1+\beta_t\vk_t^\top\vk_t}$

  DeltaNet          $\displaystyle  \|\rmS_t - \rmS_{t-1}\|_F^2 - 2\langle\rmS_t \vk_t, \beta_t\left(\vv_t- \rmS_{t-1}\vk_t \right)\rangle$                   $\displaystyle \rmS_t = \rmS_{t-1}(\rmI - \beta_t \vk_t \vk_t^T) + \beta_t \vv_t \vk_t^T$

  Gated DeltaNet    $\displaystyle \|\rmS_t - \alpha_t \rmS_{t-1}\|_F^2 - 2\langle\rmS_t \vk_t, \beta_t\left(\vv_t- \alpha_t\rmS_{t-1}\vk_t \right)\rangle$   $\displaystyle \rmS_t = \rmS_{t-1}\left(\alpha_t(\rmI - \beta_t \vk_t \vk_t^T)\right) + \beta_t \vv_t \vk_t^T$
  -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

  : Comparison of different linear RNN models and their corresponding online learning objectives using the framework of @longhorn. For convenience, we simplify Longhorn's vector-valued $\bbeta$ to scalar $\beta$.

`\label{tab:online-learning-rnn}`{=latex}
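As a worked instance of how an entry of the table arises, the Gated DeltaNet update follows from setting the gradient of its objective to zero. The symbol $J$ is introduced here only as shorthand for that objective; the step uses $\nabla_{\rmS}\langle \rmS\vk, \vu\rangle = \vu\vk^\intercal$ for a fixed vector $\vu$.

$$\begin{aligned}
J(\rmS_t) &= \|\rmS_t - \alpha_t \rmS_{t-1}\|_F^2 - 2\langle \rmS_t \vk_t, \beta_t\left(\vv_t - \alpha_t \rmS_{t-1}\vk_t\right)\rangle \\
\tfrac{1}{2}\nabla_{\rmS_t} J &= \rmS_t - \alpha_t\rmS_{t-1} - \beta_t\left(\vv_t - \alpha_t\rmS_{t-1}\vk_t\right)\vk_t^\intercal = 0 \\
\rmS_t &= \alpha_t \rmS_{t-1} - \alpha_t\beta_t \rmS_{t-1}\vk_t\vk_t^\intercal + \beta_t \vv_t\vk_t^\intercal
= \rmS_{t-1}\left(\alpha_t(\rmI - \beta_t\vk_t\vk_t^\intercal)\right) + \beta_t\vv_t\vk_t^\intercal
\end{aligned}$$

Setting $\alpha_t = 1$ in the same steps recovers the DeltaNet update.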

```{=latex}
\resizebox{0.7\textwidth}{!}{%
\begin{tabular}{ll|cccc|cccc|ccc}
\toprule
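& & \multicolumn{4}{c|}{S-NIAH-1} & \multicolumn{4}{c|}{S-NIAH-2} & \multicolumn{3}{c}{S-NIAH-3} \\
& & \multicolumn{4}{c|}{(pass-key retrieval)} & \multicolumn{4}{c|}{(number in haystack)} & \multicolumn{3}{c}{(uuid in haystack)} \\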
\cmidrule{3-13}
Model & & 1K & 2K & 4K & 8K & 1K & 2K & 4K & 8K & 1K & 2K & 4K \\
\midrule
DeltaNet & & 97.4 & 96.8 & \textbf{99.0} & \textbf{98.8} & 98.4 & 45.6 & 18.6 & 14.4 & 85.2 & 47.0 & 22.4 \\
Mamba2 & & \textbf{99.2} & \textbf{98.8} & 65.4 & 30.4 & 99.4 & 98.8 & 56.2 & 17.0 & 64.4 & 47.6 & 4.6 \\
\textbf{Gated DeltaNet} & & 98.4 & 88.4 & 91.4 & 91.8 & \textbf{100.0} & \textbf{99.8} & \textbf{92.2} & \textbf{29.6} & \textbf{86.6} & \textbf{84.2} & \textbf{27.6} \\
\bottomrule
\end{tabular}
}
```
`\label{tab:niah-results}`{=latex}

## Case study: Single Needle in a Haystack (S-NIAH) {#sec:case_study}

To better understand the complementary strengths of the delta rule and the gated rule, we present a case study on the Single Needle-In-A-Haystack (S-NIAH) benchmark suite from RULER [@hsieh2024ruler], where a key-value pair acts as a needle in the haystack (context) and the model must recall the value when given the key. Table `\ref{tab:niah-results}`{=latex} presents the results, from which we draw three main observations:

#### Decay hurts memory retention.

In the simplest S-NIAH-1 setting with repeated synthetic context, models need to memorize only minimal information, so the task tests long-term retention. DeltaNet achieves near-perfect performance across all sequence lengths. Mamba2 degrades significantly beyond 2K tokens because it decays historical information too quickly, while Gated DeltaNet's degradation is less severe thanks to the delta rule.

#### Gating facilitates filtering.

In S-NIAH-2/3 with real-world-essay context, models store all potentially relevant information, testing efficient memory management. With fixed state size, lack of clearance causes memory collision---information becomes superimposed and indistinguishable. DeltaNet's performance drops significantly at longer sequences due to poor memory clearance. Mamba2 and Gated DeltaNet maintain better performance through gating mechanisms that filter irrelevant information.

#### Delta rule helps memorization.

In S-NIAH-3, values change from numbers to UUIDs, testing complex pattern memorization. Mamba2's performance drops quickly, while Gated DeltaNet performs better, verifying that the delta rule indeed has better memorization ability.

## Algorithm: Hardware-efficient Chunkwise training

In this subsection, we derive a hardware-efficient chunkwise algorithm for training Gated DeltaNet. By partially expanding the recurrence in Eq. `\ref{eq:gated_delta_rule}`{=latex}, we have $$\begin{aligned}
 \rmS_{[t]}^r = \rmS_{[t]} \underbrace{\left(\prod_{i=1}^r
{\color{blue}{\alpha_{[t]}^i}}\left(\rmI - \beta_{[t]}^i \vk_{[t]}^i \vk_{[t]}^{i\intercal} \right)\right)}_{:= \mathbf{F}_{[t]}^r} + \underbrace{\sum_{i=1}^{r} \left( \beta^i_{[t]} \vv^i_{[t]} \vk_{[t]}^{i\intercal}\prod_{j=i+1}^{r} {\color{blue}{\alpha_{[t]}^j}} \left(\rmI - \beta_{[t]}^j \vk^j_{[t]} \vk_{[t]}^{j\intercal} \right) \right)}_{:= \rmG_{[t]}^r}
\end{aligned}$$

It is easy to see that $\mathbf{F}_{[t]}^r = {\color{blue}\gamma_{[t]}^r} \rmP_{[t]}^r = {\color{blue}\overleftarrow{\rmP_{[t]}^r}}$. As for $\rmG_{[t]}^r$, we adapt Eq. `\ref{eq:wy-ph}`{=latex} as follows, $$\begin{aligned}
\rmG_{[t]}^r = \sum_{i=1}^r {\color{blue} \frac{\gamma_{[t]}^r}{\gamma_{[t]}^i} } \tilde{\vu}_{[t]}^i \vk_{[t]}^{i\intercal} \in\mathbb{R}^{d_v \times d_k}
&&\tilde{\vu}_{[t]}^r = \beta_{[t]}^r \left(\vv_{[t]}^r - \sum_{i=1}^{r-1} \left( \tilde{\vu}_{[t]}^i ({\color{blue}\frac{\gamma_{[t]}^{r}}{\gamma_{[t]}^i}} \vk_{[t]}^{i\intercal}\vk_{[t]}^r)\right)\right) \in \mathbb{R}^{d_v}
\label{eq:wy_recurrent}
\end{aligned}$$ (see §`\ref{sec:extended_wy_proof}`{=latex} for a proof). By UT transform, we have the matrix form:$$\begin{aligned}
\widetilde{\rmU_{[t]}} = \left[\rmI + \operatorname{strictLower} \left(\operatorname{diag}\left(\beta_{[t]}\right) ({\color{blue}\Gamma_{[t]} } \odot \rmK_{[t]} \rmK_{[t]}^\intercal )\right) \right]^{-1} \operatorname{diag}\left(\beta_{[t]}\right) \rmV_{[t]} && \in \mathbb{R}^{C \times d_v}
\end{aligned}$$ Similar to how Mamba2 extends linear attention (Eq. `\ref{eq:mamba2-update-o}`{=latex}), we can adapt DeltaNet's chunkwise algorithm (Eq. `\ref{eq:delta_chunk_h}`{=latex}-`\ref{eq:delta_chunk_o}`{=latex}) for Gated DeltaNet to enable hardware-efficient training as follows: $$\begin{aligned}
\rmS_{[t+1]} &= {\color{blue}  \overrightarrow{\rmS_{[t]}}} +  \left({ \widetilde{\rmU_{[t]}}} - {\color{blue}
\overleftarrow{\rmW_{[t]}}} \rmS_{[t]}^\intercal\right)^\intercal {\color{blue}
\overrightarrow{\rmK_{[t]}}} &&\in \mathbb{R}^{d_v \times d_k}
% \label{eq:gated_delta_chunk_s}
\\
    \rmO_{[t]} &=
    {\color{blue}
\overleftarrow{\rmQ_{[t]}}}
    \rmS_{[t]}^\intercal + (\rmQ_{[t]} \rmK_{[t]}^{\intercal} \odot {\color{blue}\Gamma_{[t]}})
\left(\widetilde{\rmU_{[t]}} -
    {\color{blue}     \overleftarrow{\rmW_{[t]}}}\rmS_{[t]}^\intercal\right) &&\in \mathbb{R}^{C \times d_v}
% \label{eq:gated_delta_chunk_o}
\end{aligned}$$ where $\color{blue} \overleftarrow{\vq_{[t]}^r}=\gamma_{[t]}^r \color{black}\vq_{[t]}^r$, $\color{blue} \overleftarrow{\vw_{[t]}^r}=\gamma_{[t]}^r \color{black}\vw_{[t]}^r$, $\color{blue} \overrightarrow{\vk_{[t]}^r}=\frac{\gamma_{[t]}^C}{\gamma_{[t]}^r} \color{black}\vk_{[t]}^r$, and $\color{blue} \overrightarrow{\rmS_{[t]}}= \gamma_{[t]}^C \color{black}\rmS_{[t]}$, as defined in Eq. `\ref{eq:def_notation}`{=latex}.
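A chunkwise computation of the gated delta rule can be checked against the sequential recurrence of Eq. `\ref{eq:gated_delta_rule}`{=latex}. The NumPy sketch below makes several assumptions: toy sizes, L2-normalized keys, a sequence length that is a multiple of `C`, and linear solves in place of an explicit inverse. Expanding $\vo_{[t]}^r = \rmS_{[t]}^r \vq_{[t]}^r$ term by term gives an intra-chunk weight $\gamma_{[t]}^r/\gamma_{[t]}^i$ on each pair $(r, i)$, so the sketch applies the decay-aware mask $\Gamma_{[t]}$ to the intra-chunk scores, as in the decay case of Eq. `\ref{eq:mamba2-update-o}`{=latex}. It illustrates the algebra only and is not the optimized training kernel.

```python
import numpy as np

rng = np.random.default_rng(0)
L, C, d_k, d_v = 8, 4, 4, 3  # toy sizes (assumed); L is a multiple of C
Q = rng.standard_normal((L, d_k))
K = rng.standard_normal((L, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)  # L2-normalized keys
V = rng.standard_normal((L, d_v))
alpha = rng.uniform(0.5, 1.0, size=L)
beta = rng.uniform(0.1, 0.9, size=L)
eye_k, eye_c = np.eye(d_k), np.eye(C)

# Reference: S_t = S_{t-1}(alpha_t (I - beta_t k_t k_t^T)) + beta_t v_t k_t^T
S = np.zeros((d_v, d_k))
O_ref = np.zeros((L, d_v))
for t in range(L):
    S = S @ (alpha[t] * (eye_k - beta[t] * np.outer(K[t], K[t]))) + beta[t] * np.outer(V[t], K[t])
    O_ref[t] = S @ Q[t]
S_ref = S

# Chunkwise form
S = np.zeros((d_v, d_k))
O_chunk = np.zeros((L, d_v))
for start in range(0, L, C):
    sl = slice(start, start + C)
    Qc, Kc, Vc, B = Q[sl], K[sl], V[sl], np.diag(beta[sl])
    g = np.cumprod(alpha[sl])                 # gamma_[t]^r for r = 1..C
    Gamma = np.tril(g[:, None] / g[None, :])  # decay-aware causal mask
    KKt = Kc @ Kc.T
    # Ungated W (WY representation of P) and gated U-tilde via the UT transform
    W = np.linalg.solve(eye_c + np.tril(B @ KKt, k=-1), B @ Kc)
    U_tilde = np.linalg.solve(eye_c + np.tril(B @ (Gamma * KKt), k=-1), B @ Vc)
    W_left = g[:, None] * W                   # rows of W decayed to chunk start
    Q_left = g[:, None] * Qc
    K_right = (g[-1] / g)[:, None] * Kc       # keys decayed to chunk end
    delta = U_tilde - W_left @ S.T
    O_chunk[sl] = Q_left @ S.T + ((Qc @ Kc.T) * Gamma) @ delta
    S = g[-1] * S + delta.T @ K_right

print(np.allclose(O_chunk, O_ref), np.allclose(S, S_ref))
```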

## Gated Delta Networks and Hybrid Models {#sec:arch_design}

#### Token mixer block.

The basic Gated DeltaNet follows Llama's macro architecture, stacking token mixer layers with SwiGLU MLP layers, but replaces self-attention with gated delta rule token mixing. Fig. `\ref{fig:gated_deltanet_model}`{=latex} (right) shows its block design. For the gated delta rule (Eq. `\ref{eq:gated_delta_rule}`{=latex}), queries, keys and values $\{\vq, \vk, \vv\}$ are generated through linear projection, short convolution and SiLU, with L2 normalization applied to $\vq, \vk$ for training stability. $\alpha, \beta$ use linear projection only.[^5] Following @sun2023retentive, the output is processed through normalization and gating before applying output projection.

<figure id="fig:gated_deltanet_model">

<figcaption> Visualization of the (hybrid) architecture and block design of Gated DeltaNet models. Gated DeltaNet-H1 and H2 use Gated DeltaNet + SWA and Mamba2 + Gated DeltaNet + SWA patterns, respectively. In the block design, query/key paths consist of linear proj., shortconv., SiLU and L2 norm; value path includes linear proj., shortconv. and SiLU; alpha/beta use linear proj.; and output gate applies linear proj. with SiLU. </figcaption>
</figure>

#### Hybrid models.

Linear transformers have limitations in modeling local shifts and comparisons, and their fixed state size makes retrieval tasks difficult [@arora_simple_2024]. Following recent hybrid architectures like Griffin [@de_griffin_2024] and Samba [@ren2024samba], we combine linear recurrent layers with sliding window attention (SWA), resulting in Gated DeltaNet-H1. We also stack Mamba2, Gated DeltaNet and SWA layers, resulting in Gated DeltaNet-H2.
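For readers unfamiliar with SWA, the attention pattern it restricts to can be written as a banded causal mask. This is a generic sketch, not the paper's implementation: the helper name `sliding_window_causal_mask` is hypothetical, and whether the window counts the current token (as assumed here) is a convention that varies between implementations.

```python
import numpy as np

def sliding_window_causal_mask(seq_len, window):
    """mask[i, j] is True when query i may attend to key j: 0 <= i - j < window."""
    offsets = np.arange(seq_len)[:, None] - np.arange(seq_len)[None, :]
    return (offsets >= 0) & (offsets < window)

# Toy example: 6 tokens, each attending to itself and the 2 previous tokens.
mask = sliding_window_causal_mask(seq_len=6, window=3)
print(mask.astype(int))
```

Each query attends to at most `window` keys, so attention cost grows linearly with sequence length for a fixed window, which is what makes SWA a natural partner for recurrent layers.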

# Experiments {#sec:exp}

#### Setup {#sec:setup}

Our experiments encompass a comprehensive comparison of recent state-of-the-art architectures, including pure Transformer models, RNN-based approaches, and hybrid architectures. We evaluate against the following baselines: RetNet [@sun2023retentive], HGRN2 [@qin2024hgrn2], Mamba [@gu_mamba_2023], Mamba2 [@pmlr-v235-dao24a], Samba [@ren2024samba], and DeltaNet [@yang2024parallelizing]. For fair comparison, all models are trained under identical conditions with 1.3B parameters on 100B tokens sampled from the FineWeb-Edu dataset [@penedo2024fineweb]. We use the AdamW optimizer with a peak learning rate of 4e-4, weight decay of 0.1, and gradient clipping of 1.0. The learning rate follows a cosine annealing schedule with a 1B token warm-up period and batch size of 0.5M tokens. All models employ the Llama2 tokenizer with a vocabulary size of 32,000. For sequence modeling, we set the training length to 4K tokens, with Samba and our hybrid models using a sliding window size of 2K. See § `\ref{sec:evaluation}`{=latex} for evaluation settings and § `\ref{sec:ablation_study}`{=latex} for ablation studies.
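The token budgets above translate into optimizer steps as sketched below. Several details are assumptions not stated in the setup: "0.5M tokens" is read as exactly 500,000 tokens per batch, the warm-up is taken to be linear, and the cosine schedule is taken to decay to `min_lr = 0`. The function name `learning_rate` is hypothetical.

```python
import math

PEAK_LR = 4e-4
TOTAL_TOKENS = 100_000_000_000  # 100B training tokens
WARMUP_TOKENS = 1_000_000_000   # 1B warm-up tokens
BATCH_TOKENS = 500_000          # assumption: "0.5M" read as 5e5 tokens

TOTAL_STEPS = TOTAL_TOKENS // BATCH_TOKENS
WARMUP_STEPS = WARMUP_TOKENS // BATCH_TOKENS

def learning_rate(step, min_lr=0.0):
    """Linear warm-up to PEAK_LR, then cosine annealing to min_lr (both assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

print(TOTAL_STEPS, WARMUP_STEPS, learning_rate(WARMUP_STEPS))
```

Under these assumptions, warm-up covers 1% of the optimizer steps.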

```{=latex}
\begin{table*}[t!]


\scriptsize
\addtolength{\tabcolsep}{-2.5pt}
\begin{tabular}{l|cc|ccccccccc}
\toprule
% \text{} & & \midrule \\
\textbf{Model}  & \textbf{Wiki.}  &  \textbf{LMB.} &  \textbf{LMB.} & \textbf{PIQA} &    \textbf{Hella.} & \textbf{Wino.} & \textbf{ARC-e} &  \textbf{ARC-c} &  \textbf{SIQA}  & \textbf{BoolQ} &  \textbf{Avg.} \\
 & ppl $\downarrow$  &  ppl $\downarrow$  &  acc $\uparrow$  & acc $\uparrow$ &   acc\_n $\uparrow$  & acc $\uparrow$  & acc $\uparrow$ & acc\_n $\uparrow$ &  acc $\uparrow$  & acc $\uparrow$ &     \\
\midrule
\midrule
\textit{Recurrent models} \\
 RetNet & 19.08 & 17.27 & 40.52 & 70.07 & 49.16 &   54.14 & 67.34   & 33.78 &  \textbf{40.78}  & \underline{60.39}  & 52.02 \\
 HGRN2 & 19.10 & 17.69 & 39.54 & 70.45 & 49.53 &    52.80 & 69.40   & 35.32 &  \underline{40.63}  & 56.66  & 51.79 \\
 Mamba & 17.92 & 15.06 & 43.98 & 71.32 & 52.91 &    52.95 & 69.52   & 35.40 &  37.76  & \textbf{61.13}  & 53.12 \\
 Mamba2 & \underline{16.56} & \underline{12.56} & \underline{45.66} & \underline{71.87} & \underline{55.67} &   \underline{55.24} & \textbf{72.47}  & \underline{37.88} &  40.20  & 60.13  & \underline{54.89} \\
 DeltaNet & 17.71 & 16.88 & 42.46 & 70.72 & 50.93 & 53.35 & 68.47   & 35.66 &  40.22  & 55.29  & 52.14 \\
 Gated DeltaNet & \textbf{16.42} & \textbf{12.17} & \textbf{46.65} & \textbf{72.25} & \textbf{55.76} &\textbf{57.45} & \underline{71.21}    & \textbf{38.39} &  \underline{40.63}  & 60.24  & \textbf{55.32} \\
\midrule
\textit{Attention or hybrid models} \\
 Transformer++ & 18.53 & 18.32 & 42.60 & 70.02 & 50.23 &    53.51 & 68.83   & 35.10 &  40.66  & 57.09  & 52.25 \\
 Samba & 16.13 & 13.29 & 44.94 & 70.94 & 53.42 &    55.56 & 68.81   & 36.17 &  39.96  & \underline{62.11}  & 54.00 \\
 Gated DeltaNet-H1 & \underline{16.07} & \textbf{12.12} & \underline{47.73} & \textbf{72.57} & \underline{56.53} &\textbf{58.40} & \underline{71.75}    & \textbf{40.10} &  \underline{41.40}  & \textbf{63.21} &  \textbf{56.40} \\
  Gated DeltaNet-H2 & \textbf{15.91} &\underline{12.55} & \textbf{48.76} & \underline{72.19} & \textbf{56.88} & \underline{57.77} & \underline{71.33}    &\underline{39.07} &  \textbf{41.91}  & 61.55   & \underline{56.18}  \\
\bottomrule
\end{tabular}
\addtolength{\tabcolsep}{2.5pt}

\caption{
Performance comparison on language modeling and zero-shot common-sense reasoning.
}
\label{tab:commonsense_results}

\end{table*}
```
`\label{sec:lang_model}`{=latex}

```{=latex}
\begin{wraptable}{r}{0.55\linewidth}


\scriptsize
\addtolength{\tabcolsep}{-5pt}

\begin{tabular}{l|ccccccc}
\toprule
\textbf{Models} &  \textbf{SWDE} & \textbf{SQD} &    \textbf{{FDA}} & \textbf{TQA} & \textbf{NQ} & \textbf{Drop} & \textbf{Avg} \\
\midrule
\midrule
\textit{Recurrent models} \\
 RetNet  & 14.0 &28.5 & 7.0 & 54.4 & 16.2 & 17.3&22.9 \\
 HGRN2  & 8.3 & 25.3 &  4.8 & 51.2  & 14.2 &16.9& 20.1 \\
 Mamba   & 9.8 & 25.8 &3.7   & 54.3 &14.9 &17.4 & 21.0  \\
 Mamba2 &  \underline{19.1} &  \underline{33.6} & \textbf{25.3} & \textbf{61.0} &  \textbf{20.8} &  \underline{19.2} & \underline{29.8}\\
 DeltaNet  &17.9  &30.9 & 18.4 & 53.9 & 17.3& 18.6  & 26.2\\
 Gated DeltaNet & \textbf{25.4} & \textbf{34.8} & \underline{23.7} & \underline{60.0} & \underline{20.0} & \textbf{19.8}&\textbf{30.6}\\
\midrule
\textit{Attention or hybrid models} \\
 Transformer++ &  \text{29.5} & 38.0 & \textbf{52.2} & 58.3 & 22.5  & \text{21.6} & 37.0  \\
 Samba &  33.0 & 39.2 & 50.5 & 57.7 & 23.5 & 20.2 &37.3 \\
 Gated DeltaNet-H1 &  \underline{35.6} & \underline{39.7}& \underline{52.0} & \underline{60.1} & \underline{24.6} & \underline{22.2} & \underline{39.0}
\\
 Gated DeltaNet-H2 &   \textbf{38.2} & \textbf{40.4} & 50.7& \textbf{63.3}&\textbf{24.8} &\textbf{23.3} & \textbf{40.1}\\
\bottomrule
\end{tabular}
\addtolength{\tabcolsep}{5pt}
\caption{Accuracy on real-world recall-intensive tasks with inputs truncated to 2K tokens. SQD: SQuAD. TQA: TriviaQA.
}
\label{tab:recall_results}

\end{wraptable}
```
#### Common-sense reasoning

In Table `\ref{tab:commonsense_results}`{=latex}, we present the language modeling perplexity and **zero-shot** accuracy on commonsense reasoning benchmarks for models with 400M and 1.3B parameters. Gated DeltaNet consistently outperforms other linear models, including RetNet, HGRN2, Mamba, Mamba2, and DeltaNet, at both scales. As expected, the hybrid variant further enhances performance.

#### In-context retrieval on real-world data

Table `\ref{tab:recall_results}`{=latex} presents results on real-world recall-intensive tasks used by @arora-2024-jrt. As expected, linear recurrent models show a significant performance gap compared to Transformers, while hybrid models combining linear recurrence and attention outperform pure attention models in retrieval tasks.

For pure recurrent models, despite DeltaNet's superior performance on synthetic in-context retrieval tasks [@yang2024parallelizing], its real-world retrieval performance lags behind Mamba2, consistent with our observations in S-NIAH-2 and S-NIAH-3 (Table `\ref{tab:niah-results}`{=latex}). Gated DeltaNet outperforms both DeltaNet and Mamba2 thanks to its gated delta rule, though the improvement margin is smaller than in Table `\ref{tab:niah-results}`{=latex}. We attribute this reduced performance gap to instruction-unaligned small language models being prone to repetition errors, which are the primary source of errors in these tasks (cf. @arora-2024-jrt [Appendix E]). Since this issue is largely independent of the update rule choice, the performance differences between models are less pronounced compared to Table `\ref{tab:niah-results}`{=latex}.

<figure id="fig:ppl_ex">

<figcaption>Length extrapolation on six long-context benchmarks. </figcaption>
</figure>

#### Length extrapolation on long sequences.

As shown in Fig. `\ref{fig:ppl_ex}`{=latex}, we evaluate the models' capacity to extrapolate to sequences of up to 20K tokens across six long-context benchmarks. Gated DeltaNet achieves the lowest overall perplexity across tasks among RNN models. While we observe mixed results in length extrapolation, Gated DeltaNet exhibits relatively more robust performance, suggesting better memory management. The hybrid models further improve upon this by leveraging attention for local context modeling, which reduces the memory management burden on their recurrent components. Future work will explore these models' capabilities on even longer sequences.

#### Long context understanding

As shown in Table `\ref{tab:long_bench}`{=latex}, we evaluate the models on LongBench [@bai2023longbench]. Among recurrent models, Gated DeltaNet shows consistent advantages, especially in single-doc QA, few-shot in-context learning, and code tasks, demonstrating its superior capabilities in retrieval, in-context learning, and state tracking, respectively.

`\setlength{\tabcolsep}{3.5pt}`{=latex} `\scriptsize  `{=latex}

::: {#tab:long_bench}
  ------------------------------ --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------
                                     Single-Doc QA         Multi-Doc QA          Summarization           Few-shot                Code                   Avg
  Model                           `\tiny `{=latex}NQA   `\tiny `{=latex}QQA   `\tiny `{=latex}MFQ   `\tiny `{=latex}HQA   `\tiny `{=latex}2WM   `\tiny `{=latex}Mus   `\tiny `{=latex}GvR   `\tiny `{=latex}QMS   `\tiny `{=latex}MNs   `\tiny `{=latex}TRC   `\tiny `{=latex}TQA   `\tiny `{=latex}SSM   `\tiny `{=latex}LCC   `\tiny `{=latex}RBP
  *Recurrent models*
  RetNet                                 12.1                  10.7                  19.1                  10.7                **18.0**                 5.8                   4.8                  15.8                   7.9                  19.0                  18.0           [12.8]{.underline}           14.1                  17.9                  13.2
  HGRN2                                  10.7           [12.1]{.underline}           19.1                  11.3                  15.7            [6.0]{.underline}            5.2                  15.1                 **9.2**                16.0                  15.8                  10.3           [18.6]{.underline}    [20.8]{.underline}           13.5
  Mamba                           [13.0]{.underline}           10.1                  20.4                  10.1           [16.7]{.underline}     [6.0]{.underline}     [7.2]{.underline}    [15.9]{.underline}     [8.4]{.underline}    [23.1]{.underline}           21.9                  11.2                  17.9                  19.0           [14.6]{.underline}
  DeltaNet                               12.9                  10.8           [21.5]{.underline}    [10.9]{.underline}           13.2                   5.1                   6.5                  13.5                   7.2                  15.5           [23.3]{.underline}           11.6                  17.6                  20.3                  13.6
  Mamba2                                 11.1                  11.3                  18.6                  11.8                  15.1                 **6.7**                 6.7                  14.5                   7.4                  13.0                **23.6**                 8.4                  17.9                  20.6                  13.5
  **Gated DeltaNet**                   **14.1**              **14.0**              **23.3**              **13.7**                14.4                   5.8                 **7.5**              **16.4**                 7.9                **30.0**                22.4                **23.0**              **18.7**              **22.1**              **16.6**
  *Attention or hybrid models*
  Transformer++                          11.8                   9.3                  10.0                  10.9                   4.2                   6.1                   7.4                  15.8                   6.6                  16.9                  13.5                   3.9                  17.2                  18.7                  11.0
  Samba                                  12.5           [12.9]{.underline}           25.4                  11.2                  19.7            [6.8]{.underline}     [9.1]{.underline}           15.7                  11.0                  20.0           [22.7]{.underline}           22.8           [18.1]{.underline}    [21.1]{.underline}    [15.9]{.underline}
  **Gated DeltaNet-H1**                **14.5**                12.3           [26.6]{.underline}    [12.6]{.underline}         **23.6**                 6.1            [9.1]{.underline}    [16.1]{.underline}    [12.8]{.underline}    [33.5]{.underline}         **23.9**         [26.8]{.underline}           15.5                  19.2                  17.8
  **Gated DeltaNet-H2**           [12.7]{.underline}         **13.0**              **27.1**              **12.7**         [20.6]{.underline}          **7.5**              **10.4**              **16.2**              **13.0**              **40.5**         [22.7]{.underline}         **27.9**              **19.9**              **22.1**              **18.4**
  ------------------------------ --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------- --------------------

  : Accuracy on 14 tasks from LongBench [@bai2023longbench]: Narrative QA, QasperQA, MultiField QA, HotpotQA, 2WikiMulti QA, Musique, GovReport, QMSum, MultiNews, TRec, Trivia QA, SamSum, LCC, and RepoBench-P, in order.
:::

<figure id="fig:throughput">
<figcaption>Training throughput comparison of 1.3B models on a single H100 GPU.</figcaption>
</figure>

#### Throughput Comparison.

The training throughput comparison across different models is presented in Fig. `\ref{fig:throughput}`{=latex}. As our analysis shows, the proposed gated delta rule introduces only marginal overhead compared to the original delta rule, with Gated DeltaNet achieving essentially the same throughput as DeltaNet. Both are slightly slower than Mamba2 (by 2-3K tokens/sec) due to their more expressive transition matrices.

Transformer++ achieves the highest throughput at a 2K context length, thanks to the highly optimized Flash-Attention-2 kernel [@flashattention2]. Consequently, hybrid approaches that combine SWA with a 2K window size and other token mixers demonstrate higher throughput than standalone mixers: Samba outperforms Mamba, while Gated DeltaNet-H1 and -H2 outperform Gated DeltaNet. Notably, Gated DeltaNet-H1 maintains compelling training throughput across all sequence lengths, even on short sequences.

# Related Work

#### Gated linear RNN.

Large linear recurrent language models have attracted significant attention due to their training and inference efficiency. The field of linear RNNs has rapidly evolved from using data-independent decay mechanisms, as exemplified by models like S4 [@s4], S5 [@s5], LRU [@Orvieto2023ResurrectingRN], RWKV4/5 [@peng_rwkv_2023], and RetNet [@sun2023retentive], to incorporating data-dependent decay mechanisms in more recent architectures such as HGRN1/2 [@qin2024hgrn2; @HGRN], Mamba1/2 [@gu_mamba_2023; @mamba2], RWKV6 [@peng_eagle_2024], and GSA [@Zhang2024GatedSA]. This transition stems from the proven advantages of gating/forgetting mechanisms (termed selective mechanisms in Mamba)---a classical concept originating in the gated RNN literature [@DBLP:journals/neco/GersSC00] whose significance has been consistently reaffirmed [@Greff2015LSTMAS; @unreasonable-forget-gate; @qin2024hgrn2; @HGRN; @gu_mamba_2023].

Modern forget gates differ from traditional designs like those in LSTM by removing the dependency on the previous hidden state, relying solely on input data. This modification enables efficient parallelism across sequence lengths [@parallel-martin; @HGRN]. The absence of a forget gate has been a notable limitation in DeltaNet, and our gated extension addresses this gap in a natural, effective, and hardware-efficient way. We also note a recent concurrent work, RWKV-7 [^6], which uses a similar idea but with a more general formulation based on diagonal-plus-low-rank transitions: $\rmS_t = \rmS_{t-1} (\operatorname{diag}(\mathbf{d}_t) - \mathbf{a}_t \mathbf{b}_t^\top) + \vv_t \vk_t^\top$ where $\mathbf{d}_t, \mathbf{a}_t, \mathbf{b}_t \in \mathbb{R}^{d_k}$. The chunkwise algorithm could be similarly adapted to this case, as implemented in Flash Linear Attention [@yang_fla_2024]. [^7]
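
The relation between the two transition forms can be checked numerically. The sketch below is illustrative and not taken from either codebase: it verifies that the substitution $\mathbf{d}_t = \alpha_t \mathbf{1}$, $\mathbf{a}_t = \alpha_t \beta_t \vk_t$, $\mathbf{b}_t = \vk_t$, with the value rescaled to $\beta_t \vv_t$, turns the diagonal-plus-low-rank step into the gated delta rule step. The function names and the choice of substitution are assumptions of this example.

```python
import random


def matvec(S, x):
    """Return S @ x for a matrix stored as a list of rows."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]


def gated_delta_step(S, k, v, alpha, beta):
    """S_new = S (alpha (I - beta k k^T)) + beta v k^T."""
    Sk = matvec(S, k)
    return [[alpha * (S[r][c] - beta * Sk[r] * k[c]) + beta * v[r] * k[c]
             for c in range(len(k))] for r in range(len(v))]


def dplr_step(S, d, a, b, v, k):
    """S_new = S (diag(d) - a b^T) + v k^T."""
    Sa = matvec(S, a)
    return [[S[r][c] * d[c] - Sa[r] * b[c] + v[r] * k[c]
             for c in range(len(k))] for r in range(len(v))]


def max_abs_diff(A, B):
    """Largest entrywise absolute difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


rng = random.Random(0)
d_k, d_v = 4, 3
S = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(d_v)]
k = [rng.uniform(-1, 1) for _ in range(d_k)]
v = [rng.uniform(-1, 1) for _ in range(d_v)]
alpha, beta = 0.9, 0.5

# Substitution: d = alpha * 1, a = alpha * beta * k, b = k, value = beta * v.
gated = gated_delta_step(S, k, v, alpha, beta)
dplr = dplr_step(S, [alpha] * d_k, [alpha * beta * x for x in k], k,
                 [beta * x for x in v], k)
print(max_abs_diff(gated, dplr))
```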

#### Delta rule.

The delta learning rule demonstrates superior memory capacity compared to Hebbian learning [@Gardner1988TheSO; @Prados1989NeuralNC], an advantage DeltaNet leverages while linear transformers rely on Hebbian-like rules. This memory capacity advantage is evident in synthetic in-context learning tasks and extends to language modeling [@irie2021going; @yang2024parallelizing], reinforcement learning [@DBLP:conf/icml/IrieSCS22], and image generation [@DBLP:conf/iclr/IrieS23]. @yang2024parallelizing parallelized delta rule computation and demonstrated how DeltaNet's data-dependent identity-plus-low-rank structure ($\rmI - \beta_t \vk_t \vk_t^\intercal$) offers greater flexibility than Mamba2's data-dependent diagonal matrices ($\alpha_t \rmI$). This structural advantage could enable complex reasoning, including regular language recognition [@fan-etal-2024-advancing; @Grazzi2024UnlockingSI] and state-tracking beyond TC$^0$ complexity [@merrill_illusion_2024]---crucial for coding and reasoning applications.
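
A toy example makes the difference between the two update rules concrete. In this sketch (the helper names are hypothetical), the same unit-norm key is written twice with different values: the additive Hebbian-style update accumulates both values, whereas the delta rule with $\beta_t = 1$ replaces the old association with the new one.

```python
def matvec(S, x):
    """Return S @ x for a matrix stored as a list of rows."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in S]


def hebbian_update(S, k, v):
    """Additive write used by linear attention: S + v k^T."""
    return [[S[r][c] + v[r] * k[c] for c in range(len(k))] for r in range(len(v))]


def delta_update(S, k, v, beta=1.0):
    """Delta-rule write: S - beta (S k - v) k^T."""
    Sk = matvec(S, k)
    return [[S[r][c] - beta * (Sk[r] - v[r]) * k[c] for c in range(len(k))]
            for r in range(len(v))]


k = [0.6, 0.8]                       # unit-norm key
v_old, v_new = [1.0, 2.0], [5.0, -1.0]
zero = [[0.0, 0.0], [0.0, 0.0]]

S_hebb = hebbian_update(hebbian_update(zero, k, v_old), k, v_new)
S_delta = delta_update(delta_update(zero, k, v_old), k, v_new)

print(matvec(S_hebb, k))   # close to v_old + v_new: both writes are superimposed
print(matvec(S_delta, k))  # close to v_new: the old value has been overwritten
```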

Despite these significant advantages, the delta rule faces theoretical limitations [@irie-etal-2023-practical] and shows only moderate performance on real-world datasets [@yang2024parallelizing], suggesting room for improvement. Previous attempts to enhance expressiveness through nonlinear recurrence [@irie2021going; @DBLP:conf/icml/IrieSCS22] addressed some limitations but sacrificed training parallelism, creating a performance-efficiency tradeoff. Recent work proposes enhancements that improve state tracking performance without compromising parallelism, including negative eigenvalues [@Grazzi2024UnlockingSI] and products of multiple Householder transition matrices [@siems2025deltaproductincreasingexpressivitydeltanet], which enable high-rank transformations. These methods could be applied to Gated DeltaNet seamlessly.

From an (online) learning objective perspective, alternative formulations could further extend expressiveness: nonlinear regression ($\mathcal{L}(\rmS_t) = \frac{1}{2}||f_{\rmS_t}(\vk_t) - \vv_t||^2$) as in TTT [@ttt] and Titans [@behrouz2024titanslearningmemorizetest], where $f_\rmS$ is a nonlinear function parameterized by $\rmS$; or regression considering the entire history ($\mathcal{L}(\rmS_t) =\frac{1}{2} \sum_{i=1}^t||\rmS_t\vk_i - \vv_i||^2$) as in the Mesa layer [@vonoswald2024uncoveringmesaoptimizationalgorithmstransformers]---analogous to the difference between the Least Mean Squares and Recursive Least Squares algorithms. However, these more expressive variants introduce nonlinear recurrence and require workarounds, such as performing nonlinear updates only after processing entire chunks (as in TTT and Titans), or approximating the nonlinear recurrence with methods such as those of @lim2024parallelizingnonlinearsequentialmodels [@gonzalez2024towards; @schöne2025implicitlanguagemodelsrnns].
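
As a worked instance of the linear special case (a sketch in this paper's notation, not a new result), take $f_\rmS(\vk) = \rmS\vk$ and perform one gradient step with step size $\beta_t$. Starting from $\rmS_{t-1}$ recovers the delta rule, and starting from the decayed state $\alpha_t \rmS_{t-1}$ recovers the gated delta rule transition used in the appendix proof: $$\begin{aligned}
\mathcal{L}_t(\rmS) &= \frac{1}{2}\|\rmS\vk_t - \vv_t\|^2, \qquad \nabla_{\rmS}\mathcal{L}_t(\rmS) = (\rmS\vk_t - \vv_t)\vk_t^\intercal \\
\rmS_{t-1} - \beta_t \nabla_{\rmS}\mathcal{L}_t(\rmS_{t-1}) &= \rmS_{t-1}(\rmI - \beta_t \vk_t\vk_t^\intercal) + \beta_t \vv_t \vk_t^\intercal \\
\alpha_t\rmS_{t-1} - \beta_t \nabla_{\rmS}\mathcal{L}_t(\alpha_t\rmS_{t-1}) &= \rmS_{t-1}\left(\alpha_t(\rmI - \beta_t \vk_t\vk_t^\intercal)\right) + \beta_t \vv_t \vk_t^\intercal
\end{aligned}$$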

#### Hybrid models.

In this work, we explore hybrids that interleave different layer types across depth, a common design used, for example, in MiniMax-01 [@minimax2025minimax01scalingfoundationmodels] and Hybrid Mamba2-Attention [@waleffe2024empiricalstudymambabasedlanguage]. Combining linear and softmax attention within a single layer is another interesting direction [@GAU; @zancato2024bmojo; @munkhdalai2024leave; @nunez2024expansionspancombiningfading; @dong2025hymba; @zhang2025lolcats].

# Conclusion

In this work, we introduced Gated DeltaNet, which enables better key-value association learning compared to Mamba2 and more adaptive memory clearance than DeltaNet, leading to consistently better empirical results across various tasks. We extended the parallel algorithm of @yang2024parallelizing to enable hardware-efficient training of Gated DeltaNet. Our hybrid Gated DeltaNet models achieve even higher training throughput and overall performance, making them well-suited for practical deployment.

# Acknowledgment {#acknowledgment .unnumbered}

We thank Yu Zhang for assistance with figure creation and model evaluation; Kazuki Irie for providing valuable feedback on the draft; Simeng Sun and Zhixuan Lin for insightful discussions on long-sequence task evaluation settings; and Eric Alcaide and Volodymyr Kyrylov for their helpful discussions on the online learning perspective of DeltaNet.

```{=latex}
\bibliographystyle{iclr2025_conference}
```
```{=latex}
\newpage
```
```{=latex}
\appendix
```
```{=latex}
\renewcommand{\thesection}{\Alph{section}}
```
```{=latex}
\renewcommand\thefigure{S.\arabic{figure}}
```
```{=latex}
\setcounter{figure}{0}
```
```{=latex}
\renewcommand\thetable{S.\arabic{table}}
```
```{=latex}
\setcounter{table}{0}
```
# Extended WY representation for gated delta rule {#sec:extended_wy_proof}

To reduce notation clutter, we only consider the first chunk here.

For $\rmS_t$, the extended WY representation is $$\begin{aligned}
    \rmS_t = \sum_{i=1}^t {\color{blue}\frac{\gamma_{t}}{\gamma_i}} \vu_i \vk_i^\intercal, \qquad \vu_t = \beta_t \left( \vv_t - \sum_{i=1}^{t-1} {\color{blue} \frac{\gamma_{t}}{\gamma_i}} \vu_i \vk_i^\intercal \vk_t \right)
\end{aligned}$$ where $\gamma_t = \prod_{j=1}^t \alpha_j$ is the cumulative decay. We prove this by mathematical induction.

::: proof
*Proof.* The base case $t=1$ holds with $\rmS_0 = \mathbf{0}$, as the empty sum at $t=0$ requires: $\rmS_1 = \beta_1 \vv_1 \vk_1^\intercal = \vu_1 \vk_1^\intercal$. For the inductive step, assume the representation holds for $\rmS_t$ and use $\gamma_{t+1} = \alpha_{t+1}\gamma_t$: $$\begin{aligned}
\rmS_{t+1} &=\rmS_{t} \left({\color{blue}\alpha_{t+1}} (\rmI - \beta_{t+1} \vk_{t+1}\vk_{t+1}^\intercal) \right) + \beta_{t+1} \vv_{t+1}  \vk_{t+1}^\intercal \\ &= {\color{blue}\alpha_{t+1}} (\sum_{i=1}^t {\color{blue}\frac{\gamma_t}{\gamma_i}} \vu_i \vk_i^\intercal) - {\color{blue}\alpha_{t+1}} \beta_{t+1} (\sum_{i=1}^t {\color{blue}\frac{\gamma_t}{\gamma_i}} \vu_i \vk_i^\intercal \vk_{t+1}  \vk_{t+1}^\intercal) + \beta_{t+1} \vv_{t+1} \vk_{t+1}^\intercal \\
&= \sum_{i=1}^t {\color{blue}\frac{\gamma_{t+1}}{\gamma_{i}}} \vu_i \vk_i^\intercal + \underbrace{\beta_{t+1} \left( \vv_{t+1} - \sum_{i=1}^t {\color{blue}\frac{\gamma_{t+1}}{\gamma_{i}}}\vu_i \vk_i^\intercal \vk_{t+1} \right)}_{  \vu_{t+1}} \vk_{t+1}^\intercal \\
&= \sum_{i=1}^t {\color{blue}\frac{\gamma_{t+1}}{\gamma_{i}}} \vu_i \vk_i^\intercal + \underbrace{{\color{blue}\frac{\gamma_{t+1}}{\gamma_{t+1}}}}_{1} \vu_{t+1} \vk_{t+1}^\intercal
\\
&=
\sum_{i=1}^{t+1} {\color{blue}\frac{\gamma_{t+1}}{\gamma_i}} \vu_i \vk_i^\intercal
\end{aligned}$$ ◻
:::
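
The identity can also be checked numerically. The following sketch is a plain sequential reference with hypothetical function names, not the chunkwise hardware-efficient algorithm: it runs the gated delta rule recurrence from $\rmS_0 = \mathbf{0}$ and compares the final state with the one assembled from the extended WY representation on random inputs.

```python
import random


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def recurrent_state(ks, vs, alphas, betas):
    """Apply S_t = S_{t-1} (alpha_t (I - beta_t k_t k_t^T)) + beta_t v_t k_t^T from S_0 = 0."""
    d_k, d_v = len(ks[0]), len(vs[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    for k, v, alpha, beta in zip(ks, vs, alphas, betas):
        Sk = [dot(row, k) for row in S]
        S = [[alpha * (S[r][c] - beta * Sk[r] * k[c]) + beta * v[r] * k[c]
              for c in range(d_k)] for r in range(d_v)]
    return S


def wy_state(ks, vs, alphas, betas):
    """Build the final state from the extended WY representation."""
    d_k, d_v = len(ks[0]), len(vs[0])
    gammas, us, gamma = [], [], 1.0
    for t, (k, v, alpha, beta) in enumerate(zip(ks, vs, alphas, betas)):
        gamma *= alpha  # gamma_t = alpha_1 * ... * alpha_t
        gammas.append(gamma)
        # u_t = beta_t (v_t - sum_{i<t} (gamma_t / gamma_i) u_i (k_i . k_t))
        corr = [0.0] * d_v
        for i in range(t):
            w = gamma / gammas[i] * dot(ks[i], k)
            corr = [cr + w * ur for cr, ur in zip(corr, us[i])]
        us.append([beta * (vr - cr) for vr, cr in zip(v, corr)])
    # S_T = sum_i (gamma_T / gamma_i) u_i k_i^T
    S = [[0.0] * d_k for _ in range(d_v)]
    for g, u, k in zip(gammas, us, ks):
        w = gammas[-1] / g
        for r in range(d_v):
            for c in range(d_k):
                S[r][c] += w * u[r] * k[c]
    return S


rng = random.Random(0)
T, d_k, d_v = 6, 4, 3
ks = [[rng.uniform(-1, 1) for _ in range(d_k)] for _ in range(T)]
vs = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(T)]
alphas = [rng.uniform(0.8, 1.0) for _ in range(T)]
betas = [rng.uniform(0.0, 1.0) for _ in range(T)]

S_rec = recurrent_state(ks, vs, alphas, betas)
S_wy = wy_state(ks, vs, alphas, betas)
print(max(abs(a - b) for ra, rb in zip(S_rec, S_wy) for a, b in zip(ra, rb)))
```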

# Experiments Continued

## Evaluation {#sec:evaluation}

#### Commonsense reasoning

Following @gu_mamba_2023, we evaluate our model on multiple commonsense reasoning benchmarks: PIQA [@bisk2020piqa], HellaSwag [Hella.; @zellers2019hellaswag], WinoGrande [Wino.; @sakaguchi2021winogrande], ARC-easy (ARC-e) and ARC-challenge (ARC-c) [@arc-ce], SIQA [@sap2019social], BoolQ [@clark2019boolq], Wikitext [Wiki.; @merity2016pointer], and LAMBADA [LMB.; @paperno_lambada_2016].

#### In-context retrieval

Our evaluation comprises both synthetic and real-world tasks. For synthetic tasks, we utilize the Single Needle-In-A-Haystack (S-NIAH) benchmark suite from RULER [@hsieh2024ruler], which includes three increasingly complex tasks: S-NIAH-1 (passkey retrieval), S-NIAH-2 (numerical needle in haystack), and S-NIAH-3 (word-based needle in haystack). For real-world tasks, following @arora-2024-jrt, we evaluate on diverse datasets: SWDE [@lockard_openceres_2019] for structured HTML relation extraction, FDA [@arora_language_2023] for PDF key-value retrieval, and several question-answering datasets including SQuAD [@rajpurkar_know_2018], TriviaQA [@JoshiTriviaQA2017], Drop [@dua2019drop], and NQ [@47761]. Since our pretrained models lack instruction tuning, we employ the Cloze Completion Formatting prompts provided by @arora-2024-jrt, which better align with our models' next-word-prediction training objective.

#### Long context understanding

We evaluate on 14 tasks from LongBench [@bai2023longbench], encompassing: narrative comprehension (Narrative QA [@kocisky-etal-2018-narrativeqa]), scientific understanding (QasperQA [@dasigi2021qasper]), multi-hop reasoning (MultiField QA, HotpotQA [@yang2018hotpotqa], 2WikiMulti QA [@ho2020constructing], Musique [@trivedi2022musique]), document summarization (GovReport [@huang2021govreport], QMSum [@zhong2021qmsum], MultiNews [@fabbri2019multinews]), and various specialized tasks (TRec [@li2002learning], Trivia QA [@joshi2017triviaqa], SamSum [@gliwa2019samsum], LCC [@guo2023longcoder], and RepoBench-P [@liu2023repobench]).

## Ablation Study {#sec:ablation_study}

```{=latex}
\begin{table*}[t!]
    \caption{\small
        Ablation study on the Gated DeltaNet block. Avg-PPL and Avg-Acc denote average perplexity and zero-shot commonsense reasoning accuracy (as in Table~\ref{tab:commonsense_results}), respectively. All models have 400M parameters and are trained for 15B tokens on the same subset of FineWeb-Edu dataset~\citep{penedo2024fineweb}.
    }

    \small
    \renewcommand{\arraystretch}{1.1}
    % \addtolength{\tabcolsep}{-2pt}
    \resizebox{.53\textwidth}{!}{
    \begin{tabular}{lcc}
        \toprule
                {\textit{Gated DeltaNet Ablations (400M)}}                       & Avg-PPL  (${\downarrow}$) & Avg-Acc  (${\uparrow}$) \\
        \midrule
        Gated DeltaNet \textit{w.} Head Dim 128         & 27.35 & 47.26        \\
        \midrule
        \emph{Macro Design}                    \\
        \quad \textit{w.} naive Delta Rule  & 30.87 & 45.12                 \\
        \quad \textit{w/o.} Short Conv  & 28.95   & 46.16               \\
        \quad \textit{w/o.} Output Gate  & 29.12    & 45.46              \\
        \quad \textit{w/o.} Output Norm  & 27.55  & 47.07                \\
        \midrule
        \emph{Normalization \& Feature Map}                    \\
        \quad \textit{w.} $L_1$-norm \& ReLU  & 30.79    & 45.92              \\
        \quad \textit{w.} $L_1$-norm \& 1+ELU  & 30.34  & 46.05                \\
        \quad \textit{w.} $L_1$-norm \& SiLU  & 30.18  & 46.09                \\
        \quad \textit{w.} $L_2$-norm \& ReLU  & 27.67   & 46.94               \\
        \quad \textit{w.} $L_2$-norm \& 1+ELU  & 27.58  & 47.17                \\ \midrule
        \emph{Model Dimensions}                    \\
        \quad \textit{w.} Head Dim 64   & 28.31      & 46.35            \\
        \quad \textit{w.} Head Dim 256   & 27.13     & 47.38             \\
        \bottomrule
    \end{tabular}
    }
    \label{tab:ablations1}
\end{table*}
```
```{=latex}
\begin{table*}[!h]

\scriptsize
\addtolength{\tabcolsep}{-2.5pt}
\begin{tabular}{l|cc|ccccccccc}
\toprule
% \text{} & & \midrule \\
\textbf{Model}  & \textbf{Wiki.}  &  \textbf{LMB.} &  \textbf{LMB.} & \textbf{PIQA} &    \textbf{Hella.} & \textbf{Wino.} & \textbf{ARC-e} &  \textbf{ARC-c} &  \textbf{SIQA}  & \textbf{BoolQ} &  \textbf{Avg.} \\
 & ppl $\downarrow$  &  ppl $\downarrow$  &  acc $\uparrow$  & acc $\uparrow$ &   acc\_n $\uparrow$  & acc $\uparrow$  & acc $\uparrow$ & acc\_n $\uparrow$ &  acc $\uparrow$  & acc $\uparrow$ &     \\

\midrule
{\textit{Hybrid Ablations (500M/15B)}} \\
\midrule
%Gated DeltaNet + SWA + Mamba2
 Gated DeltaNet + SWA + Mamba2 & 24.02 & 28.20 & 34.77 & 67.08 & 40.84 &    50.74 & 60.35   & 28.83 &  38.94  & 61.49  & 47.88 \\
%Gated Gated DeltaNet + Mamba2 + SWA
 Gated DeltaNet + Mamba2 + SWA & 23.69 & 26.83 & 36.17 & 67.51 & 41.51 &    51.85 & 61.19   & 29.77 &  38.58  & 53.73  & 47.54 \\
%Mamba2 + SWA + Gated DeltaNet
 Mamba2 + SWA + Gated DeltaNet & 24.14 & 25.21 & 36.79 & 64.96 & 41.18 &    52.01 & 60.90   & 30.03 &  38.07  & 59.44  & 47.92 \\
%Mamba2 + Gated DeltaNet + SWA
  Mamba2 + Gated DeltaNet + SWA & \textbf{23.54} & \textbf{24.11} & 36.92 & 66.48 & 41.70 & 52.72 & 61.06   & 30.54 &  39.91  & 60.51  & \textbf{48.73} \\
\bottomrule
\end{tabular}
\addtolength{\tabcolsep}{2.5pt}

%
\caption{Ablation studies of Gated DeltaNet models. All evaluations are performed by using \texttt{lm-evaluation-harness} \citep{eval-harness}. All models use the Llama tokenizer and are trained on the same subset of the FineWeb-Edu dataset~\citep{penedo2024fineweb}.
}
%
\label{tab:hybrid_design_ablations}
\end{table*}
```
Table `\ref{tab:ablations1}`{=latex} presents ablation studies on the Gated DeltaNet block's components. Our experiments demonstrate that both the short convolution and output gate are crucial for model performance, while output normalization yields marginal improvements. Consistent with @yang2024parallelizing, we found L2 normalization to be essential for optimal performance, though the choice of feature map was less influential. Nevertheless, SiLU consistently outperformed other activation functions, aligning with observations from @Qin2023TransNormerLLMAF. Through empirical analysis, we determined that a head dimension of 128 provides an optimal trade-off between performance and computational efficiency. Additionally, Table `\ref{tab:hybrid_design_ablations}`{=latex} demonstrates that among various hybrid architectures, the combination of Mamba2, Gated DeltaNet, and SWA in this specific order produces superior results.
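
One way to read the normalization result, offered as an interpretation rather than a claim made above, is through the transition matrix: $(\rmI - \beta_t \vk_t\vk_t^\intercal)\vk_t = (1 - \beta_t\|\vk_t\|^2)\vk_t$, so a unit-norm key is scaled by exactly $1 - \beta_t$ along its own direction. The sketch below applies SiLU followed by L2 normalization, as in the query/key path, and checks this scaling. It omits the linear projection and short convolution, and the `eps` guard and function names are assumptions of the example.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def l2_normalize(x, eps=1e-6):
    """Scale a vector to unit Euclidean norm (eps is an assumed numerical guard)."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return [xi / (norm + eps) for xi in x]


def key_feature(x):
    """SiLU followed by L2 normalization, as in the query/key path."""
    return l2_normalize([silu(xi) for xi in x])


k = key_feature([0.5, -1.0, 2.0, 0.1])
beta = 0.7
# (I - beta k k^T) k = (1 - beta ||k||^2) k, so a unit-norm key is scaled by 1 - beta.
kk = sum(ki * ki for ki in k)
transformed = [ki - beta * kk * ki for ki in k]
print(kk, [t / ki for t, ki in zip(transformed, k)])
```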

[^1]: Equal contribution. Work done during SY's internship at NVIDIA.

[^2]: Here we slightly abuse the notation of $\gamma$ to denote the cumulative product for each chunk (starting with the first position of each chunk separately) instead of the entire sequence.

[^3]: It is possible to set $\beta_t \in (0, 2)$ to allow negative eigenvalues, which unlocks the state tracking abilities of DeltaNet [@Grazzi2024UnlockingSI; @siems2025deltaproductincreasingexpressivitydeltanet].

[^4]: The theoretical distinction lies in the optimization approach: Longhorn uses implicit online learning [@Kulis2010ImplicitOL] to derive closed-form globally optimal updates, while DeltaNet optimizes the same objective through one-step explicit gradient descent, as noted by @longhorn.

[^5]: We use Mamba2's parameterization for $\alpha$ but omit it for brevity.

[^6]: <https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v7>

[^7]: <https://github.com/fla-org/flash-linear-attention/tree/main/fla/ops/generalized_delta_rule>.