Hybrid Associative Memories
Source
- Raw Markdown: paper_hybrid-associative-memories-2026.md
- PDF: paper_hybrid-associative-memories-2026.pdf
- Preprint: arXiv 2603.22325
- Official Zyphra page: Hybrid Associative Memories
- Related Zyphra cookbook repository: Zyphra/zcookbook
- Local source archive: papers/hybrid-associative-memories-2026/arXiv-2603.22325.tar.gz
- X provenance snapshot from the user-provided Oryx discussion: papers/hybrid-associative-memories-2026/x_discussion_ham_in_oryx_thread_2067675034878193787.md and JSON sibling.
Status And Credibility
arXiv lists Hybrid Associative Memories as a cs.LG / cs.AI preprint first submitted on 2026-03-20 and revised to v2 on 2026-03-27. The authors are Leon Lufkin, Tomás Figliolia, Beren Millidge, and Kamesh Krishnamurthy. Zyphra hosts an official research page for the paper.
The source is credible evidence on current architectures: it is a 2026 technical report from Zyphra’s hybrid-model research line, has an official research page, includes implementation-level discussion of routing and FlexAttention/FLA machinery, and directly studies the long-context tradeoff between KV cache and recurrent state. It had not been peer reviewed at ingest time, and no dedicated official HAM implementation repository was found during this pass. The Zyphra cookbook is related background for hybrid-model training, not a direct HAM code release.
The local X snapshot is provenance only: Kevin Li linked HAM in the user-provided Oryx discussion as relevant to token-level routing.
Core Claim
HAM argues that RNN-style recurrence and attention should not merely run in parallel or appear in fixed layer patterns. They should play complementary memory roles inside one layer:
- the recurrent state summarizes predictable, compressible context;
- the KV cache stores only tokens that are difficult for the recurrent state to predict;
- a user-controlled or learned threshold sets the KV-cache growth rate.
This turns KV-cache size into a data-dependent memory budget instead of a fixed linear function of sequence length.
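As a rough illustration of the budget claim, the sketch below compares a full KV cache with one that stores only a fraction of tokens. The function name and the model dimensions (`n_layers`, `n_kv_heads`, `head_dim`, `bytes_per_elem`) are hypothetical placeholders chosen to make the arithmetic concrete; they are not the paper's configuration.

```python
def kv_cache_bytes(seq_len, kv_fraction, n_layers=24, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV-cache size when only `kv_fraction` of tokens are stored.

    All default dimensions are hypothetical, not values from the paper.
    """
    if not 0.0 <= kv_fraction <= 1.0:
        raise ValueError("kv_fraction must be in [0, 1]")
    stored_tokens = int(seq_len * kv_fraction)
    # Factor of 2: one key vector and one value vector per stored token.
    return 2 * stored_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    for frac in (1.0, 0.5, 0.1):
        mib = kv_cache_bytes(16_384, frac) / 2**20
        print(f"kv_fraction={frac:.1f}: {mib:.0f} MiB")
```

In a HAM-style model the fraction is a realized, data-dependent average rather than a fixed constant, and it can differ by layer, so a closer estimate would sum a per-layer version of this quantity.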
Mechanism
The paper frames sequence mixers as associative memories. A DeltaNet-style recurrent state predicts the current value from the current key:
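In the standard DeltaNet formulation, which is assumed here because the paper's own notation is not reproduced on this page, the state $S_{t-1}$ is a matrix that maps keys to values. The prediction and the delta-rule update can be written as:

```latex
\hat{v}_t = S_{t-1} k_t, \qquad
S_t = S_{t-1} + \beta_t \,\bigl(v_t - \hat{v}_t\bigr)\, k_t^{\top}
```

Here $\beta_t \in [0, 1]$ is a learned write strength. Gated DeltaNet additionally decays the previous state with a gate before the update; the paper's exact parameterization may differ from this form.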
HAM computes a prediction-error or learned routing score. A token is routed to the KV scratchpad when the recurrent state cannot predict it well enough:
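One simple way to write the prediction-error variant, offered as an assumed form rather than the paper's exact definition, uses a squared residual $e_t$, a threshold $\tau$, and a binary write decision $r_t$:

```latex
e_t = \bigl\lVert v_t - S_{t-1} k_t \bigr\rVert_2^2, \qquad
r_t = \mathbb{1}\bigl[\, e_t > \tau \,\bigr]
```

Token $t$ is written to the scratchpad when $r_t = 1$. A learned router would replace $e_t$ with a trained score, and any normalization of the residual is omitted here.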
The scratchpad attention reads only selected key/value pairs:
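Assuming ordinary softmax attention restricted to the routed set, the scratchpad readout can be written as below. The symbols are notation introduced for this note, not taken from the paper: $q_t$ is the query, $d$ the key dimension, and $r_i \in \{0, 1\}$ marks whether token $i$ was written to the scratchpad.

```latex
\mathcal{M}_t = \{\, i \le t : r_i = 1 \,\}, \qquad
o_t^{\mathrm{KV}} = \sum_{i \in \mathcal{M}_t}
  \frac{\exp\!\left(q_t^{\top} k_i / \sqrt{d}\right)}
       {\sum_{j \in \mathcal{M}_t} \exp\!\left(q_t^{\top} k_j / \sqrt{d}\right)} \, v_i
```

Under this form the cache grows with $|\mathcal{M}_t|$ rather than with $t$.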
    flowchart LR
      Token[current token] --> RNN[recurrent state update]
      RNN --> Predict[predict value from state]
      Token --> Error[prediction error or learned router]
      Predict --> Error
      Error -->|above threshold| KV[write token to KV scratchpad]
      RNN --> Mix[combine recurrent readout]
      KV --> Mix
      Mix --> Output[layer output]
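The data flow above can be sketched as a toy single-head pass in pure Python. Everything specific here is an assumption made for illustration: the function name `ham_layer_sketch`, the squared-error routing score, the constant write strength `beta`, and combining the two readouts by plain addition. None of it is the paper's implementation.

```python
import math


def matvec(S, k):
    """Multiply a (d_v x d_k) list-of-lists matrix by a length-d_k vector."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def ham_layer_sketch(queries, keys, values, tau, beta=0.5):
    """Toy HAM-style pass; returns per-token outputs and cached token indices."""
    d_k, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_k for _ in range(d_v)]   # fixed-size recurrent state
    cache = []                               # indices of tokens routed to the KV scratchpad
    outputs = []
    for t, (q, k, v) in enumerate(zip(queries, keys, values)):
        # 1. Predict the value from the previous state and measure the error.
        v_hat = matvec(S, k)
        resid = [vi - vh for vi, vh in zip(v, v_hat)]
        err = sum(r * r for r in resid)
        # 2. Route hard-to-predict tokens to the scratchpad.
        if err > tau:
            cache.append(t)
        # 3. Delta-rule state update: S += beta * (v - S k) k^T.
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += beta * resid[i] * k[j]
        # 4. Recurrent readout from the updated state.
        out = matvec(S, q)
        # 5. Softmax attention over cached tokens only.
        if cache:
            scores = [sum(a * b for a, b in zip(q, keys[i])) / math.sqrt(d_k)
                      for i in cache]
            m = max(scores)
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            for w, i in zip(weights, cache):
                for c in range(d_v):
                    out[c] += (w / z) * values[i][c]
        # Summing the two readouts is an assumption of this sketch.
        outputs.append(out)
    return outputs, cache


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    values = [[1.0], [1.0], [5.0]]
    _, cached = ham_layer_sketch(keys, keys, values, tau=0.5, beta=1.0)
    print(cached)  # prints [0, 2]: the repeated second token is predicted exactly
```

With an empty initial state the first token is always surprising, so it is cached in this toy; a trained model with a learned router need not behave that way.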
Evidence
The main reported model comparison trains 800M-scale models on 50B tokens with a 16,384-token context. Baselines include a Transformer, pure Gated DeltaNet, and a GDN-GSA stacked hybrid.
Reported evidence that matters for this wiki:
- HAM exposes smooth control over KV-cache usage through threshold sweeps.
- At 50% KV-cache usage, learned-router HAM reports the best standard benchmark average in the paper’s table: 49.1 versus 48.8 for the Transformer and 48.4 for GDN-GSA.
- Different HAM layers converge to different KV fractions under a global target, suggesting that explicit-memory demand varies by depth.
- Long-context results are heterogeneous rather than uniformly best; the method is strongest where selective exact memory helps and weaker where the router or cache budget misses needed details.
- The paper includes analysis of routing behavior, including NIAH-style examples where routing scores spike near the hidden needle.
Read this as evidence for selective KV-cache growth and memory routing in language-model settings, not as proof for numeric time-series preservation.
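To make the layer-varying observation concrete, the sketch below shows how a single shared threshold can meet a global cache target while leaving each layer's fraction free to differ. This is only an illustration with made-up scores: how the paper enforces its global target is not described on this page, and the function names are hypothetical.

```python
def global_threshold(scores_by_layer, target_fraction):
    """Pick one shared threshold so that about `target_fraction` of all
    (layer, token) routing scores lie strictly above it. Illustrative only."""
    pooled = sorted(s for layer in scores_by_layer for s in layer)
    n_keep = round(target_fraction * len(pooled))
    if n_keep <= 0:
        return float("inf")    # nothing is routed to the cache
    if n_keep >= len(pooled):
        return float("-inf")   # everything is routed to the cache
    # With distinct scores, exactly n_keep values exceed this threshold.
    return pooled[len(pooled) - n_keep - 1]


def per_layer_fractions(scores_by_layer, tau):
    """Fraction of tokens in each layer whose score exceeds the threshold."""
    return [sum(s > tau for s in layer) / len(layer) for layer in scores_by_layer]


if __name__ == "__main__":
    # Made-up routing scores for two layers of four tokens each.
    layers = [[0.1, 0.2, 0.3, 0.9], [0.5, 0.6, 0.7, 0.8]]
    tau = global_threshold(layers, target_fraction=0.5)
    print(tau, per_layer_fractions(layers, tau))  # prints 0.5 [0.25, 0.75]
```

The global fraction is 50% in this example, yet the two layers store 25% and 75% of their tokens, which is the kind of depth-dependent demand the bullet above describes.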
Relevance To This Wiki
HAM is directly relevant to long-context and fixed-budget memory because it turns the choice of which tokens to store exactly into a model decision. For time-series systems, the analogous question is whether a model can compress repeated normal behavior into state while preserving rare regimes, exogenous events, topology changes, and action/intervention windows in a higher-resolution memory.
The key transfer pattern is:
    predictable spans -> compressed recurrent state
    surprising/control-relevant spans -> explicit memory
This should be tested under state-utility probes rather than assumed from language retrieval results.
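A minimal sketch of this transfer pattern on a univariate numeric stream follows. An exponential moving average stands in for the recurrent state, which is a deliberate simplification and not the HAM architecture; the function name, `tau`, and `alpha` are hypothetical.

```python
def route_stream(xs, tau, alpha=0.2):
    """Split a numeric stream into a compressed state and an explicit memory.

    An exponential moving average stands in for the recurrent state; this is
    a simplification for illustration, not the HAM architecture.
    """
    state = xs[0]
    explicit_memory = []   # (index, value) pairs kept at full resolution
    for t, x in enumerate(xs[1:], start=1):
        surprise = abs(x - state)
        if surprise > tau:
            explicit_memory.append((t, x))
        # Every sample, surprising or not, is also folded into the state.
        state = (1 - alpha) * state + alpha * x
    return state, explicit_memory


if __name__ == "__main__":
    stream = [1.0] * 5 + [9.0] + [1.0] * 5
    _, kept = route_stream(stream, tau=3.0)
    print(kept)  # prints [(5, 9.0)]: only the spike is kept exactly
```

A rule like this stores nothing for an event that the state already predicts well, even if that event matters operationally, which is the failure a state-utility probe would need to detect.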
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming state / long context | adjacent | HAM combines a fixed-size recurrent state with a selectively growing KV scratchpad. | Needs numeric streams, event timing, high-channel observations, and continuous online update tests. |
| Dynamic compute and fixed-budget hierarchy | adjacent | A threshold controls the explicit-memory budget and produces a performance/cache tradeoff. | Needs a global expected-FLOPs or latency objective, hard serving measurements, and time-series budget frontiers. |
| Benchmark hygiene | warning | Long-context gains are task-dependent; average scores hide which details are routed to the cache. | Needs selection-recall probes for rare regimes, cross-channel deviations, actions, and exogenous variables. |
| Native multivariate encoding | insufficient evidence | Layer-varying cache usage is suggestive for adaptive resolution. | No observed numeric variables, sensors, graph topology, or channel-specific preservation evidence. |
| Control and counterfactuals | insufficient evidence | A scratchpad could preserve action windows if adapted. | No actions, control inputs, interventions, rewards, or counterfactual rollouts. |
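The table asks for selection-recall probes without defining them. One plausible reading, sketched here with hypothetical function names, scores a router by how many labeled event positions it wrote to explicit memory, with precision as a check against simply caching everything.

```python
def selection_recall(routed_indices, event_indices):
    """Fraction of labeled event positions that were routed to explicit memory."""
    events = set(event_indices)
    if not events:
        raise ValueError("need at least one labeled event")
    return len(events & set(routed_indices)) / len(events)


def selection_precision(routed_indices, event_indices):
    """Fraction of routed positions that are labeled events."""
    routed = set(routed_indices)
    if not routed:
        return 0.0
    return len(routed & set(event_indices)) / len(routed)


if __name__ == "__main__":
    routed = [0, 2, 5]   # positions a router chose to cache (made-up)
    events = [2, 5, 9]   # positions labeled as rare regimes or actions (made-up)
    print(selection_recall(routed, events), selection_precision(routed, events))
```

Reporting both numbers at a fixed cache fraction would allow routers to be compared under the same memory budget.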
Limitations
- Fresh 2026 arXiv preprint; no peer-reviewed venue result found at ingest time.
- No dedicated official HAM code/model repository was found during this pass.
- Evidence is language-model and retrieval evidence, not numeric time-series, observability, robotics, or action-conditioned world-model evidence.
- The KV-cache threshold controls cache fraction, but serving cost also depends on sparse attention implementation, dynamic cache layout, batching, and kernel support.
- Routing by prediction error can miss details that are predictable but still decision-relevant, such as planned actions, scheduled maintenance, or rare but regular events.
Links Into The Wiki
- Hybrid Associative Memories
- Gated DeltaNet
- Oryx
- Efficient Recurrent Sequence Models
- Extra-Long Context For Time Series
- Streaming Latent-State Updates
- Looped Transformers And Test-Time Memory
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Hierarchical Modeling with a Fixed FLOPs Budget
- Foundation Time-Series Model Research Agenda
Open Questions
- Can prediction-error routing preserve time-series events that are predictable but operationally critical?
- Should a time-series HAM route by surprise, uncertainty, downstream utility, intervention relevance, or a learned mixture?
- How should the KV-cache target be allocated across layers, channels, services, topology nodes, and event streams?
- Can sparse KV scratchpads beat compact recurrent state, LCLM/CAT-style compression, or sparse attention under the same serving budget?
- What preservation probes should replace NIAH for multivariate time series: delayed event recall, cross-channel binding, action-effect recall, rare-regime replay, or topology-dependent lookup?