Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Source

Raw Markdown: paper_gated-deltanet-2-2026.md
PDF: paper_gated-deltanet-2-2026.pdf
Preprint: arXiv 2605.22791v1
Official NVIDIA page: Gated DeltaNet-2 publication page
Official code: NVlabs/GatedDeltaNet-2
Official X announcement: Ali Hatamizadeh post (compact note stored at papers/gated-deltanet-2-2026/x_ahatamiz1_announcement_note.md)
Gonzo ML discussion: Telegram post 5428 (local extract stored at papers/gated-deltanet-2-2026/telegram-post-gonzo-ml-5428.md)
Gonzo-linked review: ArXivIQ review
Podcast pointer: Gonzo ML Podcasts 3748
Local official-artifact metadata: papers/gated-deltanet-2-2026/official_artifacts_metadata.json

Status And Credibility

arXiv lists Gated DeltaNet-2 as a cs.AI technical report first submitted on 2026-05-21. The official NVIDIA Research publication page gives the same publication date and authors: Ali Hatamizadeh, Yejin Choi, and Jan Kautz.

Credibility is strong enough for an important ingest because the paper has an official NVIDIA Research page, an official NVlabs code repository, an author announcement, concrete matched-scale experiments, and implementation-level kernel details. It is still preprint evidence, not a peer-reviewed venue result at ingest time. The released code is under NVIDIA Source Code License-NC, so downstream reuse is limited to non-commercial research and evaluation rather than permissive commercial use.

The Gonzo post and ArXivIQ review list Model: N/A; official artifact search at ingest time found code and paper artifacts but no official released model checkpoints.

Core Claim

Gated DeltaNet-2 argues that fixed-size recurrent linear-attention state is bottlenecked not only by what to forget, but by how to edit a compressed associative memory without scrambling existing key-value associations. Prior delta-rule variants such as Gated DeltaNet and KDA use one scalar gate for both erasing old content and writing new value content. Gated DeltaNet-2 separates those roles into channel-wise key-side erase and value-side write gates while keeping linear-time recurrent inference and efficient chunkwise training.

Mechanism

The Gated Delta Rule-2 update is:

$$S_{t} = \left(I - k_{t} (b_{t} \odot k_{t})^{\top}\right) D_{t} S_{t-1} + k_{t} (w_{t} \odot v_{t})^{\top}.$$

Here $D_{t}$ is KDA-style channel-wise decay over the key axis, $b_{t} \in [0, 1]^{d_{k}}$ is a channel-wise erase gate that selects which key-side coordinates of the decayed state are read and removed, and $w_{t} \in [0, 1]^{d_{v}}$ is a channel-wise write gate that selects which value-side coordinates are committed.
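The update can be traced by hand on a tiny state. The following is a minimal, dependency-free sketch of one recurrent step, not the official Triton kernel. It assumes the state is stored as a $d_{k} \times d_{v}$ nested list, that $D_{t}$ is diagonal and passed as a vector, and that readout is $S^{\top} q$; the function names `gdn2_step` and `read_out` are hypothetical.

```python
def gdn2_step(S, k, v, decay, b, w):
    """Apply one Gated Delta Rule-2 update to a d_k x d_v state (nested lists).

    decay holds the diagonal of D_t, b is the key-side erase gate,
    and w is the value-side write gate.
    """
    d_k, d_v = len(k), len(v)
    # D_t S_{t-1}: scale each key-axis row by its own decay factor.
    decayed = [[decay[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # (b_t * k_t)^T D_t S_{t-1}: the erase-gated read of the decayed state.
    read = [sum(b[i] * k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # Remove the read content along k_t, then commit w_t * v_t along k_t.
    return [
        [decayed[i][j] - k[i] * read[j] + k[i] * w[j] * v[j] for j in range(d_v)]
        for i in range(d_k)
    ]


def read_out(S, q):
    """Query the state as o = S^T q (readout convention assumed for this sketch)."""
    return [sum(q[i] * S[i][j] for i in range(len(q))) for j in range(len(S[0]))]


# Toy example with d_k = d_v = 2, a unit-norm key, and no decay.
k = [1.0, 0.0]
ones = [1.0, 1.0]
S0 = [[0.0, 0.0], [0.0, 0.0]]
S1 = gdn2_step(S0, k, [2.0, 3.0], ones, ones, ones)

# Erase fully open, write fully open: the value stored under k is replaced.
overwritten = gdn2_step(S1, k, [5.0, 7.0], ones, ones, ones)
# Erase closed, write open: the new value is added on top of the old one.
accumulated = gdn2_step(S1, k, [5.0, 7.0], ones, [0.0, 0.0], ones)
# Erase open, write open on value channel 0 only.
partial = gdn2_step(S1, k, [5.0, 7.0], ones, ones, [1.0, 0.0])

print(read_out(overwritten, k))  # [5.0, 7.0]
print(read_out(accumulated, k))  # [7.0, 10.0]
print(read_out(partial, k))      # [5.0, 0.0]
```

The three calls differ only in the gates. A single shared scalar gate could not produce the second or third outcome, because erase strength and write strength would be tied together.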

flowchart LR
  History[Compressed recurrent state] --> Decay[Channel-wise decay]
  Decay --> Erase[Key-side erase gate b_t]
  Token[Current token state] --> Key[Key k_t]
  Token --> Value[Value v_t]
  Token --> Write[Value-side write gate w_t]
  Key --> Erase
  Value --> Write
  Erase --> Update[Gated Delta Rule-2 update]
  Write --> Update
  Update --> State[Updated fixed-size state]

The update strictly generalizes KDA: it reduces to KDA when $b_{t}$ and $w_{t}$ both collapse to the same scalar gate, and to Gated DeltaNet when the decay also collapses to a scalar. The paper derives a fast-weight view, a chunkwise WY training form that absorbs channel-wise decay into asymmetric erase factors, and a gate-aware backward pass implemented with Triton kernels.
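The two special cases can be checked by direct substitution into the update. This is a sketch in this note's own notation: the scalar gate $\beta_{t}$, the scalar decay $\alpha_{t}$, and the all-ones vector $\mathbf{1}$ are symbols introduced here for illustration, and the paper's layout conventions may differ.

$$
\begin{aligned}
b_{t} = \beta_{t}\mathbf{1},\ w_{t} = \beta_{t}\mathbf{1}
&\ \Rightarrow\ S_{t} = \left(I - \beta_{t} k_{t} k_{t}^{\top}\right) D_{t} S_{t-1} + \beta_{t} k_{t} v_{t}^{\top}
&& \text{(KDA-style)} \\
\text{additionally } D_{t} = \alpha_{t} I
&\ \Rightarrow\ S_{t} = \alpha_{t}\left(I - \beta_{t} k_{t} k_{t}^{\top}\right) S_{t-1} + \beta_{t} k_{t} v_{t}^{\top}
&& \text{(Gated DeltaNet-style)}
\end{aligned}
$$

In both reduced forms the same $\beta_{t}$ multiplies the erase term and the write term, which is the coupling that the separate $b_{t}$ and $w_{t}$ remove.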

Evidence

The main experiments train matched 1.3B-parameter language models on 100B FineWeb-Edu tokens with a 4K training length. Baselines include Mamba-2, Gated DeltaNet, KDA, and Mamba-3 SISO/MIMO, evaluated as recurrent-only models and as hybrids with sliding-window attention.

Key reported numbers:

Setting	Metric	Best baseline called out in paper	Gated DeltaNet-2
Recurrent language/reasoning	average accuracy	Mamba-3 MIMO: 52.39	53.11
Hybrid language/reasoning	average accuracy	Mamba-3 MIMO: 52.72	53.97
Recurrent WikiText	perplexity (lower is better)	Mamba-3 SISO: 16.30	15.90
Hybrid LAMBADA	perplexity (lower is better)	Mamba-3 SISO: 10.65	10.43
Recurrent real-world retrieval	average accuracy	KDA: 28.67	29.88
Hybrid real-world retrieval	average accuracy	Mamba-3 SISO: 41.01	42.28
Recurrent MK-NIAH-1 @4K	accuracy	KDA: 28.0	37.8
Hybrid MK-NIAH-1 @4K	accuracy	Mamba-3 MIMO: 46.6	48.0
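To make the size of each gap explicit, the margins can be computed directly from the rows above. This is a small arithmetic sketch using only the numbers in the table; the `margin` helper and the `higher_is_better` flag are conveniences of this note, with perplexity rows sign-flipped so that a positive margin always favors Gated DeltaNet-2.

```python
# (setting, best baseline score, Gated DeltaNet-2 score, higher_is_better)
rows = [
    ("Recurrent language/reasoning", 52.39, 53.11, True),
    ("Hybrid language/reasoning", 52.72, 53.97, True),
    ("Recurrent WikiText perplexity", 16.30, 15.90, False),
    ("Hybrid LAMBADA perplexity", 10.65, 10.43, False),
    ("Recurrent real-world retrieval", 28.67, 29.88, True),
    ("Hybrid real-world retrieval", 41.01, 42.28, True),
    ("Recurrent MK-NIAH-1 @4K", 28.0, 37.8, True),
    ("Hybrid MK-NIAH-1 @4K", 46.6, 48.0, True),
]


def margin(baseline, ours, higher_is_better):
    """Signed improvement; positive means Gated DeltaNet-2 is better."""
    delta = ours - baseline if higher_is_better else baseline - ours
    return round(delta, 2)


margins = {name: margin(base, ours, hib) for name, base, ours, hib in rows}
for name, value in margins.items():
    print(f"{name}: {value:+.2f}")
```

Most margins fall between 0.2 and 1.4 points. The recurrent MK-NIAH-1 result at 4K is the outlier at +9.8 points, so it carries much of the retrieval argument.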

Ablations support the mechanism: scalarizing either the erase or write gate reduces performance, and retaining channel structure in the erase gate recovers more of the full model than retaining it only in the write gate. Expanding the erase range from $[0, 1]^{d_{k}}$ to $[0, 2]^{d_{k}}$ gives no consistent gain at this scale.
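One way to see what the wider erase range changes is to apply the erase factor to $k_{t}$ itself. The following is a sketch that assumes a unit-norm key, $\lVert k_{t} \rVert_{2} = 1$; that normalization is an assumption of this note, not something stated above.

$$
\left(I - k_{t} (b_{t} \odot k_{t})^{\top}\right) k_{t}
= \Big(1 - \sum_{i=1}^{d_{k}} b_{t,i}\, k_{t,i}^{2}\Big)\, k_{t},
\qquad
1 - \sum_{i=1}^{d_{k}} b_{t,i}\, k_{t,i}^{2} \in
\begin{cases}
[0, 1] & \text{if } b_{t} \in [0, 1]^{d_{k}}, \\
[-1, 1] & \text{if } b_{t} \in [0, 2]^{d_{k}}.
\end{cases}
$$

Under that assumption, the $[0, 1]$ range can only shrink the state component along $k_{t}$, while the $[0, 2]$ range can also flip its sign. The ablation indicates that this extra freedom does not translate into a consistent gain at this scale.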

Training-throughput evidence is more mixed but still useful. On a single H100 in the hybrid 1.3B setting, Gated DeltaNet-2 keeps throughput nearly flat as the sequence length × batch size configuration moves from 2K × 8 to 16K × 1 (the same number of tokens per step), but it is slower than KDA because it adds channel-wise erase and write gates. Read the throughput claim as “modest constant overhead with stable long-context scaling,” not as an unconditional speed win over every recurrent baseline.

Relevance To This Wiki

Gated DeltaNet-2 is upstream language-model architecture evidence, not a time-series model. Its transferable contribution is the memory-editing interface: a compact recurrent state may need separate operations for erasing stale associations and committing new information.

For time-series and world-model work, the closest analogue is streaming latent-state update. A model that maintains state over never-ending multivariate observations, event streams, context, and action history may need to forget obsolete relationships without overwriting rare regimes or intervention-relevant details. Gated DeltaNet-2 suggests that one scalar retention/update gate can be too coarse when read/erase and write/commit live on different state axes.

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming latent state / long context	adjacent	Gated DeltaNet-2 maintains a fixed-size recurrent state and improves long-context retrieval under matched language-model scale.	Needs numeric time-series, event-stream, or trajectory tests with continuous updates, explicit state size, and eviction audits.
Native multivariate encoding	adjacent	Key-side and value-side channel gates are a plausible analogy for channel-selective retention in high-dimensional streams.	The paper’s channels are hidden-state coordinates, not observed numeric variables, sensors, or graph nodes.
Dynamic compute under a serving budget	adjacent	Linear-time recurrence and chunkwise training keep the mechanism in the efficient sequence-model family, and H100 throughput is reported.	Needs wall-clock TSFM serving benchmarks with dense observations, long histories, and hardware-aware baselines.
Control and counterfactuals	insufficient evidence	The mechanism could preserve action-relevant state if adapted.	No actions, control inputs, interventions, treatments, or counterfactual rollouts are evaluated.
Benchmark hygiene	warning	Matched 1.3B/100B-token comparisons and ablations are useful.	Language perplexity, commonsense QA, RULER, and recall tasks do not prove dense numeric fidelity, rare-regime preservation, or action-conditioned world-model utility.

Limitations

The paper is a 2026 arXiv technical report, not a peer-reviewed venue result at ingest time.
Evidence is language-model centric: FineWeb-Edu pretraining, language-model perplexity, commonsense reasoning, RULER needle retrieval, and text retrieval tasks.
It does not evaluate numeric time series, irregular observations, graph time series, telemetry, robotics trajectories, or action-conditioned world models.
No official model checkpoints had been released at ingest time; the official artifact is code, not pretrained weights.
The official code license is NVIDIA Source Code License-NC, so reuse and redistribution need license review.
RULER needle-in-a-haystack and recall tasks are valuable interference probes, but they are not a substitute for generation-heavy, calibration-heavy, or decision-utility benchmarks.
Throughput remains hardware/kernel dependent; the additional channel-wise gates add constant overhead compared with simpler recurrent mixers.

Links Into The Wiki

Open Questions

Does decoupling erase and write help multivariate time-series models preserve rare channel interactions better than scalar retention gates under the same state size?
Can a Gated DeltaNet-2-style update distinguish stale exogenous relationships from stable latent dynamics in always-on telemetry?
What is the right numeric analogue of RULER multi-key retrieval: delayed cross-channel lookup, rare event recall, intervention-effect recall, topology-dependent state, or regime-specific memory?
Does the erase-gate asymmetry remain dominant for numeric streams, or do value/write gates matter more when dense magnitude fidelity is central?
Can the chunkwise WY/Triton implementation path support irregular timestamps, missingness masks, and variable-length event chunks without losing the efficiency story?
