Hybrid Associative Memories

Source

Raw Markdown: paper_hybrid-associative-memories-2026.md
PDF: paper_hybrid-associative-memories-2026.pdf
Preprint: arXiv 2603.22325
Official Zyphra page: Hybrid Associative Memories
Related Zyphra cookbook repository: Zyphra/zcookbook
Local source archive: papers/hybrid-associative-memories-2026/arXiv-2603.22325.tar.gz
X provenance snapshot from the user-provided Oryx discussion: papers/hybrid-associative-memories-2026/x_discussion_ham_in_oryx_thread_2067675034878193787.md and its JSON sibling.

Status And Credibility

arXiv lists Hybrid Associative Memories as a cs.LG / cs.AI preprint first submitted on 2026-03-20 and revised to v2 on 2026-03-27. The authors are Leon Lufkin, Tomás Figliolia, Beren Millidge, and Kamesh Krishnamurthy. Zyphra hosts an official research page for the paper.

The source is credible, current architecture evidence: it is a 2026 technical report from Zyphra’s hybrid-model research line, has an official research page, includes implementation-level discussion of routing and FlexAttention/FLA machinery, and directly studies the long-context tradeoff between the KV cache and the recurrent state. It was not peer-reviewed at ingest time, and no dedicated official HAM implementation repository was found during this pass. The Zyphra cookbook is related background for hybrid-model training, not a direct HAM code release.

The local X snapshot is provenance only: Kevin Li linked HAM in the user-provided Oryx discussion as relevant to token-level routing.

Core Claim

HAM argues that RNN-style recurrence and attention should not merely run in parallel or appear in fixed layer patterns. They should play complementary memory roles inside one layer:

the recurrent state summarizes predictable, compressible context;
the KV cache stores only tokens that are difficult for the recurrent state to predict;
a user-controlled or learned threshold sets the KV-cache growth rate.

This turns KV-cache size into a data-dependent memory budget instead of a fixed linear function of sequence length.
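
To make the budget framing concrete, the following is a back-of-the-envelope sketch of how KV-cache memory scales with the fraction of tokens kept. The model dimensions (`layers`, `kv_heads`, `head_dim`) are hypothetical placeholders, not values from the paper, and a single uniform kept fraction is a simplification, since HAM routes per layer.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for `tokens` positions.

    The leading factor 2 counts one key and one value vector per cached token.
    """
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem


# Hypothetical dimensions for illustration only (not taken from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 24, 8, 128
CONTEXT = 16_384  # context length used in the paper's main comparison

full = kv_cache_bytes(CONTEXT, LAYERS, KV_HEADS, HEAD_DIM)
for kept_fraction in (1.0, 0.5, 0.25):
    kept_tokens = int(CONTEXT * kept_fraction)
    size = kv_cache_bytes(kept_tokens, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"kept={kept_fraction:.2f}  cache={size / 2**30:.3f} GiB")
# kept=1.00  cache=1.500 GiB
# kept=0.50  cache=0.750 GiB
# kept=0.25  cache=0.375 GiB
```

With a full cache the size is fixed by sequence length alone; with routing, the kept fraction depends on the data and on the threshold.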

Mechanism

The paper frames sequence mixers as associative memories. A DeltaNet-style recurrent state predicts the current value from the current key:

$$S_t = S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top}, \qquad o_t^{\mathrm{RNN}} = S_t q_t.$$
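
The update can be sketched in NumPy as follows. This is a minimal single-head illustration, not the paper's kernel; the function name `delta_rule_step` and the shapes are assumptions. It uses the algebraically equivalent form $S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^{\top}$, which makes the "correct the state by its own prediction error" reading explicit.

```python
import numpy as np


def delta_rule_step(S, k, v, beta):
    """One delta-rule update: S_t = S_{t-1}(I - beta k k^T) + beta v k^T.

    S: (d_v, d_k) state, k: (d_k,) key, v: (d_v,) value, beta: scalar step size.
    """
    pred = S @ k  # value the previous state predicts for this key
    return S + beta * np.outer(v - pred, k)  # expanded form of the same update


rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_v, d_k))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)  # unit-norm key, so beta = 1 stores the pair exactly
v = rng.normal(size=d_v)

S = delta_rule_step(S, k, v, beta=1.0)
print(np.allclose(S @ k, v))  # True: the state now predicts v from k
```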

HAM computes a prediction-error or learned routing score. A token is routed to the KV scratchpad when the recurrent state cannot predict it well enough:

$$e_t = \min_{h} D\left(S_{t-1}^{(h)} k_t^{(h)},\, v_t^{(h)}\right), \qquad m_t = \mathbb{I}\left[e_t \geq \tau\right].$$
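
A minimal sketch of the prediction-error variant of this routing rule is shown below. The divergence $D$ is not specified in this note, so squared Euclidean distance is used here as an assumption; `route_to_kv` and the tensor layout are likewise hypothetical.

```python
import numpy as np


def route_to_kv(S_prev, k, v, tau):
    """Return (e_t, m_t) for one token.

    S_prev: (H, d_v, d_k) per-head states before the update,
    k: (H, d_k) per-head keys, v: (H, d_v) per-head values.
    """
    pred = np.einsum("hvk,hk->hv", S_prev, k)    # per-head predicted values
    per_head = np.sum((pred - v) ** 2, axis=-1)  # assumed D: squared L2 distance
    e_t = float(per_head.min())                  # minimum over heads, as in the formula
    return e_t, e_t >= tau


# Two heads with empty states: each head's error is the squared norm of its value.
S0 = np.zeros((2, 2, 3))
keys = np.ones((2, 3))
vals = np.array([[1.0, 0.0], [0.0, 2.0]])
print(route_to_kv(S0, keys, vals, tau=0.5))  # (1.0, True): written to the scratchpad
print(route_to_kv(S0, keys, vals, tau=2.0))  # (1.0, False): left to the recurrent state
```

Because the score is the minimum over heads, a token is routed only when every head predicts it poorly.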

The scratchpad attention reads only selected key/value pairs:

$$o_t^{\mathrm{KV}} = \mathrm{Attn}\left(q_t, \{k_i\}_{m_i = 1}, \{v_i\}_{m_i = 1}\right).$$
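
The scratchpad read can be illustrated with a dense NumPy sketch. It assumes standard scaled dot-product softmax attention and a zero output when nothing has been routed yet; both are assumptions, as is the helper name `scratchpad_attention`. A real layer would also restrict the mask to positions up to the current token and would use a sparse kernel instead of boolean indexing.

```python
import numpy as np


def scratchpad_attention(q, K, V, mask):
    """Softmax attention over only the key/value rows where mask is True.

    q: (d,), K: (T, d), V: (T, d_v), mask: (T,) booleans (the m_i above).
    """
    mask = np.asarray(mask, dtype=bool)
    K_sel, V_sel = K[mask], V[mask]
    if K_sel.shape[0] == 0:
        return np.zeros(V.shape[1])        # assumed fallback: empty scratchpad
    scores = K_sel @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ V_sel


K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[1.0], [2.0], [3.0]])
q = np.array([1.0, 0.0])
print(scratchpad_attention(q, K, V, [False, True, False]))  # [2.]: only token 1 is visible
```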

flowchart LR
  Token[current token] --> RNN[recurrent state update]
  RNN --> Predict[predict value from state]
  Token --> Error[prediction error or learned router]
  Predict --> Error
  Error -->|above threshold| KV[write token to KV scratchpad]
  RNN --> Mix[combine recurrent readout]
  KV --> Mix
  Mix --> Output[layer output]
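
The flow above can be traced end to end on a toy sequence. The sketch below is single-head, uses squared error as the divergence, and fixes the step size at 1; all of these, and the name `ham_route_sequence`, are assumptions for illustration. It omits the final mixing step, whose exact form is not given in this note.

```python
import numpy as np


def ham_route_sequence(keys, values, tau, beta=1.0):
    """Run a delta-rule state over a sequence and collect routed tokens."""
    S = np.zeros((values.shape[1], keys.shape[1]))
    kv_cache, mask = [], []
    for k, v in zip(keys, values):
        err = float(np.sum((S @ k - v) ** 2))  # error of S_{t-1}, before the update
        routed = err >= tau
        mask.append(routed)
        if routed:
            kv_cache.append((k, v))            # write to the KV scratchpad
        S = S + beta * np.outer(v - S @ k, k)  # every token still updates the state
    return S, kv_cache, mask


# Orthonormal keys: a repeated (key, value) pair is predicted exactly after one write.
basis = np.eye(3)
vals = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
order = [0, 0, 0, 1, 1, 2]
S, cache, mask = ham_route_sequence(basis[order], vals[order], tau=1e-6)
print(mask)                     # [True, False, False, True, False, True]
print(len(cache) / len(order))  # 0.5
```

Only the first occurrence of each pair reaches the cache, so the cache grows with novelty in the data and not with sequence length.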

Evidence

The main reported model comparison trains 800M-scale models on 50B tokens with a 16,384-token context. Baselines include a Transformer, pure Gated DeltaNet, and a GDN-GSA stacked hybrid.

Reported evidence that matters for this wiki:

HAM exposes smooth control over KV-cache usage through the threshold and $T_{KV} / T$ sweeps.
At 50% KV-cache usage, learned-router HAM reports the best standard benchmark average in the paper’s table: 49.1 versus 48.8 for the Transformer and 48.4 for GDN-GSA.
Different HAM layers converge to different KV fractions under a global target, suggesting that explicit-memory demand varies by depth.
Long-context results are heterogeneous rather than uniformly best; the method is strongest where selective exact memory helps and weaker where the router or cache budget misses needed details.
The paper includes analysis of routing behavior, including NIAH-style examples where routing scores spike near the hidden needle.

Read this as evidence for selective KV-cache growth and memory routing in language-model settings, not as proof for numeric time-series preservation.
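
As an illustration of how a threshold maps to a KV fraction, the sketch below calibrates $\tau$ offline from a batch of routing scores using a quantile. This is an assumed calibration procedure for intuition only, not the paper's mechanism for reaching a global target; `threshold_for_target` is a hypothetical helper.

```python
import numpy as np


def threshold_for_target(scores, target_fraction):
    """Pick tau so that roughly `target_fraction` of scores satisfy score >= tau."""
    return float(np.quantile(scores, 1.0 - target_fraction))


scores = np.arange(10, dtype=float)  # stand-in routing scores e_t
for target in (0.2, 0.5):
    tau = threshold_for_target(scores, target)
    achieved = float(np.mean(scores >= tau))
    print(f"target={target:.1f}  achieved={achieved:.1f}")
# target=0.2  achieved=0.2
# target=0.5  achieved=0.5
```

With tied or heavily skewed scores the achieved fraction can deviate from the target, which is one reason per-layer fractions are worth inspecting separately.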

Relevance To This Wiki

HAM is directly relevant to long-context and fixed-budget memory because it turns the choice of what to store exactly into a model decision. For time-series systems, the analogous question is whether a model can compress repeated normal behavior into state while preserving rare regimes, exogenous events, topology changes, and action/intervention windows in a higher-resolution memory.

The key transfer pattern is:

predictable spans -> compressed recurrent state
surprising/control-relevant spans -> explicit memory

This should be tested under state-utility probes rather than assumed from language retrieval results.
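
A toy numeric analogue of the transfer pattern is sketched below. It is not from the paper: an exponential moving average stands in for the compressed recurrent state, absolute forecast error stands in for the routing score, and `route_series`, `alpha`, and `tau` are hypothetical names and settings.

```python
def route_series(xs, alpha=0.5, tau=2.0):
    """Compress a univariate stream into an EMA; keep raw points the EMA mispredicts."""
    state = xs[0]
    kept = []  # explicit memory: (time index, raw value)
    for t, x in enumerate(xs[1:], start=1):
        if abs(x - state) >= tau:  # surprising relative to the compressed state
            kept.append((t, x))
        state = (1 - alpha) * state + alpha * x  # the state is always updated
    return state, kept


series = [0.0, 0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
state, kept = route_series(series)
print(kept)   # [(4, 5.0), (5, 0.0)]
print(state)  # 0.3125
```

The spike and the first recovery step are both kept, because the state is still perturbed right after the event. Whether that window is the useful one to preserve is exactly what a state-utility probe would have to measure.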

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Streaming state / long context | adjacent | HAM combines a fixed-size recurrent state with a selectively growing KV scratchpad. | Needs numeric streams, event timing, high-channel observations, and continuous online update tests. |
| Dynamic compute and fixed-budget hierarchy | adjacent | A threshold controls the explicit-memory budget and produces a performance/cache tradeoff. | Needs a global expected-FLOPs or latency objective, hard serving measurements, and time-series budget frontiers. |
| Benchmark hygiene | warning | Long-context gains are task-dependent; average scores hide which details are routed to the cache. | Needs selection-recall probes for rare regimes, cross-channel deviations, actions, and exogenous variables. |
| Native multivariate encoding | insufficient evidence | Layer-varying cache usage is suggestive for adaptive resolution. | No observed numeric variables, sensors, graph topology, or channel-specific preservation evidence. |
| Control and counterfactuals | insufficient evidence | A scratchpad could preserve action windows if adapted. | No actions, control inputs, interventions, rewards, or counterfactual rollouts. |

Limitations

Fresh 2026 arXiv preprint; no peer-reviewed venue result found at ingest time.
No dedicated official HAM code/model repository was found during this pass.
Evidence is language-model and retrieval evidence, not numeric time-series, observability, robotics, or action-conditioned world-model evidence.
The KV-cache threshold controls cache fraction, but serving cost also depends on sparse attention implementation, dynamic cache layout, batching, and kernel support.
Routing by prediction error can miss details that are predictable but still decision-relevant, such as planned actions, scheduled maintenance, or rare but regular events.
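
The last limitation can be made concrete with a toy example. The sketch below is an illustration, not an experiment from the paper: a seasonal-naive predictor (repeat the value from one period ago) stands in for a state that has learned a schedule, and `surprise_mask` is a hypothetical helper.

```python
def surprise_mask(xs, period, tau):
    """Flag points whose seasonal-naive prediction error is at least tau."""
    mask = []
    for t, x in enumerate(xs):
        pred = xs[t - period] if t >= period else 0.0  # no history yet: predict 0
        mask.append(abs(x - pred) >= tau)
    return mask


# A maintenance spike of size 9 recurs every 4 steps.
series = [0.0, 0.0, 0.0, 9.0] * 3
mask = surprise_mask(series, period=4, tau=1.0)
print([t for t, routed in enumerate(mask) if routed])  # [3]
```

Only the first spike is flagged. Later spikes are perfectly predictable, so a pure surprise router would not store them, even though each one may matter operationally.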

Links Into The Wiki

Open Questions

Can prediction-error routing preserve time-series events that are predictable but operationally critical?
Should a time-series HAM route by surprise, uncertainty, downstream utility, intervention relevance, or a learned mixture?
How should the KV-cache target be allocated across layers, channels, services, topology nodes, and event streams?
Can sparse KV scratchpads beat compact recurrent state, LCLM/CAT-style compression, or sparse attention under the same serving budget?
What preservation probes should replace NIAH for multivariate time series: delayed event recall, cross-channel binding, action-effect recall, rare-regime replay, or topology-dependent lookup?
