Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Source
- Raw Markdown: paper_gated-deltanet-2-2026.md
- PDF: paper_gated-deltanet-2-2026.pdf
- Preprint: arXiv 2605.22791v1
- Official NVIDIA page: Gated DeltaNet-2 publication page
- Official code: NVlabs/GatedDeltaNet-2
- Official X announcement: Ali Hatamizadeh post (compact note stored at papers/gated-deltanet-2-2026/x_ahatamiz1_announcement_note.md)
- Gonzo ML discussion: Telegram post 5428 (local extract stored at papers/gated-deltanet-2-2026/telegram-post-gonzo-ml-5428.md)
- Gonzo-linked review: ArXivIQ review
- Podcast pointer: Gonzo ML Podcasts 3748
- Local official-artifact metadata: papers/gated-deltanet-2-2026/official_artifacts_metadata.json
Status And Credibility
arXiv lists Gated DeltaNet-2 as a cs.AI technical report first submitted on 2026-05-21. The official NVIDIA Research publication page gives the same publication date and authors: Ali Hatamizadeh, Yejin Choi, and Jan Kautz.
Credibility is strong enough for an important ingest because the paper has an official NVIDIA Research page, an official NVlabs code repository, an author announcement, concrete matched-scale experiments, and implementation-level kernel details. It is still preprint evidence, not a peer-reviewed venue result at ingest time. The released code is under NVIDIA Source Code License-NC, so downstream reuse is research/evaluation oriented rather than permissive commercial reuse.
The Gonzo post and ArXivIQ review list Model: N/A; official artifact search at ingest time found code and paper artifacts but no official released model checkpoints.
Core Claim
Gated DeltaNet-2 argues that fixed-size recurrent linear-attention state is bottlenecked not only by what to forget, but by how to edit a compressed associative memory without scrambling existing key-value associations. Prior delta-rule variants such as Gated DeltaNet and KDA use one scalar gate for both erasing old content and writing new value content. Gated DeltaNet-2 separates those roles into channel-wise key-side erase and value-side write gates while keeping linear-time recurrent inference and efficient chunkwise training.
Mechanism
The Gated Delta Rule-2 update acts on the recurrent state through three channel-wise terms: a KDA-style decay over the key axis, an erase gate $b_t$ that selects which key-side coordinates of the decayed state are read and removed, and a write gate $w_t$ that selects which value-side coordinates are committed.
flowchart LR
    History[Compressed recurrent state] --> Decay[Channel-wise decay]
    Decay --> Erase[Key-side erase gate b_t]
    Token[Current token state] --> Key[Key k_t]
    Token --> Value[Value v_t]
    Token --> Write[Value-side write gate w_t]
    Key --> Erase
    Value --> Write
    Erase --> Update[Gated Delta Rule-2 update]
    Write --> Update
    Update --> State[Updated fixed-size state]
The update strictly generalizes KDA: it reduces to KDA when $b_t$ and $w_t$ both collapse to the same scalar gate, and to Gated DeltaNet when the decay also collapses to a scalar. The paper derives a fast-weight view, a chunkwise WY training form that absorbs channel-wise decay into asymmetric erase factors, and a gate-aware backward pass implemented with Triton kernels.
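To make the decoupling concrete, here is a minimal pure-Python sketch of one recurrent step. It assumes a state of shape d_k x d_v and one form consistent with the description above, $S_t = (I - k_t (b_t \odot k_t)^\top)\,\mathrm{Diag}(\alpha_t)\,S_{t-1} + k_t (w_t \odot v_t)^\top$. The symbol $\alpha_t$ for the decay, the exact placement of the erase gate, and the function names `gdr2_step` and `kda_step` are illustrative assumptions, not the paper's verified equation or the official kernels.

```python
# Sketch only: the placement of b and w is an assumption chosen to match
# the prose description, not the paper's verified update rule.

def gdr2_step(S, k, v, alpha, b, w):
    """One assumed Gated Delta Rule-2 step on a d_k x d_v state S (list of rows).

    alpha: per-key-channel decay, length d_k
    b:     per-key-channel erase gate, length d_k
    w:     per-value-channel write gate, length d_v
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay over the key axis: Diag(alpha) @ S.
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Gated read of the decayed state: (b * k)^T @ decayed, length d_v.
    read = [sum(b[i] * k[i] * decayed[i][j] for i in range(d_k))
            for j in range(d_v)]
    # 3) Remove the read content along k and commit the gated value along k.
    return [[decayed[i][j] + k[i] * (w[j] * v[j] - read[j])
             for j in range(d_v)] for i in range(d_k)]


def kda_step(S, k, v, alpha, beta):
    """KDA-style step: channel-wise decay, one scalar gate for erase and write."""
    d_k, d_v = len(k), len(v)
    decayed = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    read = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    return [[decayed[i][j] + beta * k[i] * (v[j] - read[j])
             for j in range(d_v)] for i in range(d_k)]


# Toy inputs (made up for illustration): d_k = 2, d_v = 3.
S0 = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
k, v = [0.6, 0.8], [1.0, -2.0, 0.5]
alpha, beta = [0.9, 0.7], 0.3

# Collapsing both gates to the same scalar should reproduce the KDA-style step.
collapsed = gdr2_step(S0, k, v, alpha, b=[beta] * 2, w=[beta] * 3)
reference = kda_step(S0, k, v, alpha, beta)
max_diff = max(abs(x - y)
               for row_a, row_b in zip(collapsed, reference)
               for x, y in zip(row_a, row_b))
print(max_diff)  # expected to be at floating-point noise level
```

Under these assumptions, passing a constant `alpha` additionally gives a Gated DeltaNet-style step, and setting `b` to zeros with a one-hot `w` changes only one value column of the state, which is the kind of selective edit a single scalar gate cannot express.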
Evidence
The main experiments train matched 1.3B-parameter language models on 100B FineWeb-Edu tokens with a 4K training length. Baselines include Mamba-2, Gated DeltaNet, KDA, and Mamba-3 SISO/MIMO, evaluated as recurrent-only models and as hybrids with sliding-window attention.
Key reported numbers:
| Setting | Metric | Best baseline called out in paper | Gated DeltaNet-2 |
|---|---|---|---|
| Recurrent language/reasoning | average accuracy | Mamba-3 MIMO: 52.39 | 53.11 |
| Hybrid language/reasoning | average accuracy | Mamba-3 MIMO: 52.72 | 53.97 |
| Recurrent WikiText perplexity | lower is better | Mamba-3 SISO: 16.30 | 15.90 |
| Hybrid LAMBADA perplexity | lower is better | Mamba-3 SISO: 10.65 | 10.43 |
| Recurrent real-world retrieval | average accuracy | KDA: 28.67 | 29.88 |
| Hybrid real-world retrieval | average accuracy | Mamba-3 SISO: 41.01 | 42.28 |
| Recurrent MK-NIAH-1 @4K | accuracy | KDA: 28.0 | 37.8 |
| Hybrid MK-NIAH-1 @4K | accuracy | Mamba-3 MIMO: 46.6 | 48.0 |
Ablations support the mechanism: scalarizing either the erase or write gate reduces performance, and retaining channel structure in the erase gate recovers more of the full model than retaining it only in the write gate. Expanding the range of the erase gate gives no consistent gain at this scale.
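The margins behind the table above can be recomputed directly from the reported numbers. The short script below is only a reading aid: the scores are copied from the table, while the row labels and the `margin` helper are illustrative names.

```python
# (setting, best baseline score, Gated DeltaNet-2 score, higher_is_better)
rows = [
    ("Recurrent language/reasoning avg acc", 52.39, 53.11, True),
    ("Hybrid language/reasoning avg acc", 52.72, 53.97, True),
    ("Recurrent WikiText ppl", 16.30, 15.90, False),
    ("Hybrid LAMBADA ppl", 10.65, 10.43, False),
    ("Recurrent real-world retrieval avg acc", 28.67, 29.88, True),
    ("Hybrid real-world retrieval avg acc", 41.01, 42.28, True),
    ("Recurrent MK-NIAH-1 @4K acc", 28.0, 37.8, True),
    ("Hybrid MK-NIAH-1 @4K acc", 46.6, 48.0, True),
]


def margin(baseline, ours, higher_is_better):
    """Signed improvement: positive means Gated DeltaNet-2 is better."""
    delta = ours - baseline if higher_is_better else baseline - ours
    return round(delta, 2)


margins = {name: margin(base, ours, hib) for name, base, ours, hib in rows}
for name, m in margins.items():
    print(f"{name:40s} {m:+.2f}")
```

Every margin is positive, but seven of the eight fall between roughly 0.2 and 1.4 points; the recurrent MK-NIAH-1 result (+9.8 accuracy points) is the one large gap, so it carries most of the weight in any claim about interference-robust retrieval.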
Training-throughput evidence is more mixed but still useful. On a single H100 in the hybrid 1.3B setting, Gated DeltaNet-2 keeps near-flat scaling as sequence length times batch changes from 2K x 8 to 16K x 1, but it is slower than KDA because it adds channel-wise erase and write gates. Read the throughput claim as “modest constant overhead with stable long-context scaling,” not as an unconditional speed win over every recurrent baseline.
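One way to read "near-flat scaling" is that the sweep trades batch size for sequence length at a fixed number of tokens per step. The sketch below checks that arithmetic. It assumes "2K" and "16K" mean 2048 and 16384 tokens, and the two intermediate settings are hypothetical points added for illustration; only the endpoints come from the text.

```python
def tokens_per_step(seq_len, batch_size):
    """Tokens processed in one training step on a single device."""
    return seq_len * batch_size


# Endpoints (2K x 8 and 16K x 1) are from the text.
# The middle two settings are assumed, not reported.
settings = [(2048, 8), (4096, 4), (8192, 2), (16384, 1)]
budgets = [tokens_per_step(seq_len, batch) for seq_len, batch in settings]

for (seq_len, batch), budget in zip(settings, budgets):
    print(f"{seq_len:>6} x {batch}: {budget} tokens/step")
```

Because the token budget per step is constant across these settings, flat throughput means the cost per token does not grow with sequence length, which is the expected behavior for a linear-time mixer and says nothing on its own about the constant-factor gap to KDA.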
Relevance To This Wiki
Gated DeltaNet-2 is upstream language-model architecture evidence, not a time-series model. Its transferable contribution is the memory-editing interface: a compact recurrent state may need separate operations for erasing stale associations and committing new information.
For time-series and world-model work, the closest analogue is streaming latent-state update. A model that maintains state over never-ending multivariate observations, event streams, context, and action history may need to forget obsolete relationships without overwriting rare regimes or intervention-relevant details. Gated DeltaNet-2 suggests that one scalar retention/update gate can be too coarse when read/erase and write/commit live on different state axes.
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming latent state / long context | adjacent | Gated DeltaNet-2 maintains a fixed-size recurrent state and improves long-context retrieval under matched language-model scale. | Needs numeric time-series, event-stream, or trajectory tests with continuous updates, explicit state size, and eviction audits. |
| Native multivariate encoding | adjacent | Key-side and value-side channel gates are a plausible analogy for channel-selective retention in high-dimensional streams. | The paper’s channels are hidden-state coordinates, not observed numeric variables, sensors, or graph nodes. |
| Dynamic compute under a serving budget | adjacent | Linear-time recurrence and chunkwise training keep the mechanism in the efficient sequence-model family, and H100 throughput is reported. | Needs wall-clock TSFM serving benchmarks with dense observations, long histories, and hardware-aware baselines. |
| Control and counterfactuals | insufficient evidence | The mechanism could preserve action-relevant state if adapted. | No actions, control inputs, interventions, treatments, or counterfactual rollouts are evaluated. |
| Benchmark hygiene | warning | Matched 1.3B/100B-token comparisons and ablations are useful. | Language perplexity, commonsense QA, RULER, and recall tasks do not prove dense numeric fidelity, rare-regime preservation, or action-conditioned world-model utility. |
Limitations
- The paper is a 2026 arXiv technical report, not a peer-reviewed venue result at ingest time.
- Evidence is language-model centric: FineWeb-Edu pretraining, language-model perplexity, commonsense reasoning, RULER needle retrieval, and text retrieval tasks.
- It does not evaluate numeric time series, irregular observations, graph time series, telemetry, robotics trajectories, or action-conditioned world models.
- No official model checkpoints had been released at ingest time; the official artifact is code, not pretrained weights.
- The official code license is NVIDIA Source Code License-NC, so reuse and redistribution need license review.
- RULER needle-in-a-haystack and recall tasks are valuable interference probes, but they are not a substitute for generation-heavy, calibration-heavy, or decision-utility benchmarks.
- Throughput remains hardware/kernel dependent; the additional channel-wise gates add constant overhead compared with simpler recurrent mixers.
Links Into The Wiki
- Gated DeltaNet-2
- Efficient Recurrent Sequence Models
- Streaming Latent-State Updates
- Latent-State Time-Series Modeling
- Time-Series Scaling And Efficiency
- Foundation Time-Series Model Research Agenda
- Mamba-3
- Mamba-2
- LT2
- RWKV-TS
Open Questions
- Does decoupling erase and write help multivariate time-series models preserve rare channel interactions better than scalar retention gates under the same state size?
- Can a Gated DeltaNet-2-style update distinguish stale exogenous relationships from stable latent dynamics in always-on telemetry?
- What is the right numeric analogue of RULER multi-key retrieval: delayed cross-channel lookup, rare event recall, intervention-effect recall, topology-dependent state, or regime-specific memory?
- Does the erase-gate asymmetry remain dominant for numeric streams, or do value/write gates matter more when dense magnitude fidelity is central?
- Can the chunkwise WY/Triton implementation path support irregular timestamps, missingness masks, and variable-length event chunks without losing the efficiency story?