Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations
Source
- Raw Markdown: paper_oryx-2026.md
- PDF: paper_oryx-2026.pdf
- Preprint: arXiv 2605.28769
- Local source archive: papers/oryx-2026/arXiv-2605.28769.tar.gz
- X provenance snapshot from the user-provided thread: papers/oryx-2026/x_thread_kevinyli_atrost3122_2067658710810603549.md and JSON sibling.
- Mentioned/background papers in that discussion: Mamba-2, Gated DeltaNet, and Hybrid Associative Memories.
Status And Credibility
arXiv lists this as a fresh cs.LG preprint first submitted on 2026-05-27. The author list is Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, and Ziteng Sun; the arXiv HTML notes Google Research / Google DeepMind affiliation for the work context.
The paper counts as credible current architecture evidence because it comes from a Google/CMU research team, includes language-model experiments up to 1.4B parameters under matched parameter and token budgets, and directly evaluates two linear recurrent mixer families already tracked in this wiki: Mamba-2 and Gated DeltaNet. It is still preprint evidence: no official code or model release was found during ingest, and similarly named Oryx projects on GitHub/OpenReview are unrelated.
The X thread is provenance for why this source entered the wiki. It is not used as the technical source of truth; arXiv and the paper artifacts are.
Core Claim
Oryx argues that transformer attention and linear recurrent mixers need not be mixed only across layers: they can also be mixed along the token sequence while sharing most of the model representation. A single Oryx block can run in attention mode or in linear recurrent mode, maintain compatible states for both, and switch modes by chunk or span.
The paper’s central design claim is that attention and linear recurrence can share key/value representations, with mixer-specific queries and supporting parameters, so the model can allocate richer attention only where the sequence needs it.
Mechanism
Oryx maintains both an attention KV cache and a linear recurrent state. For a selected mixer mode, the block computes its output from that mode's query applied to the shared keys and values.
The key and value projections are shared across modes; query projections are mixer-specific; the short convolution, gates, and output projection are arranged so that linear and attention modes remain compatible.
flowchart LR
    X[input tokens] --> KV[shared key/value projections]
    X --> QA[attention query]
    X --> QR[linear recurrent query]
    KV --> Cache[attention KV cache]
    KV --> State[linear recurrent state]
    QA --> Attn[attention mixer]
    QR --> Rec[linear mixer: Mamba-2 or GDN]
    Cache --> Attn
    State --> Rec
    Attn --> Select{mode selector}
    Rec --> Select
    Select --> Y[block output]
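The data flow above can be made concrete with a toy, single-head sketch in pure Python. It is not the paper's implementation: the class name `DualModeBlock`, the helper `matvec`, and the choice of plain linear attention (a running sum of outer products) as the recurrent update are all assumptions made for illustration, standing in for the Mamba-2 or Gated DeltaNet updates. The sketch also assumes both states are updated on every token, and it omits the short convolution, gates, and output projection.

```python
import math

def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class DualModeBlock:
    """Toy single-head block: shared K/V projections, one query per mode."""

    def __init__(self, Wk, Wv, Wq_attn, Wq_lin):
        self.Wk, self.Wv = Wk, Wv                    # shared across modes
        self.Wq = {"attn": Wq_attn, "lin": Wq_lin}   # mixer-specific queries
        self.cache = []                              # attention KV cache
        # Linear state S accumulates outer products k v^T (plain linear
        # attention, a stand-in for Mamba-2 or Gated DeltaNet updates).
        self.S = [[0.0] * len(Wv) for _ in range(len(Wk))]

    def step(self, x, mode):
        k, v = matvec(self.Wk, x), matvec(self.Wv, x)
        # Assumption: both states are updated on every token so that the
        # next token can be processed in either mode.
        self.cache.append((k, v))
        for i, ki in enumerate(k):
            for j, vj in enumerate(v):
                self.S[i][j] += ki * vj
        q = matvec(self.Wq[mode], x)
        if mode == "attn":
            # Softmax attention over every cached (k, v) pair.
            scores = [sum(a * b for a, b in zip(q, kc)) / math.sqrt(len(q))
                      for kc, _ in self.cache]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            return [sum(w * vc[j] for w, (_, vc) in zip(weights, self.cache)) / total
                    for j in range(len(v))]
        # Linear mode: read the fixed-size state, y = S^T q.
        return [sum(q[i] * self.S[i][j] for i in range(len(q)))
                for j in range(len(v))]

# Usage: identity projections, one token per mode.
eye = [[1.0, 0.0], [0.0, 1.0]]
block = DualModeBlock(eye, eye, eye, eye)
y_lin = block.step([1.0, 0.0], mode="lin")
y_attn = block.step([0.0, 1.0], mode="attn")
```

The point of the sketch is the sharing pattern: `Wk` and `Wv` feed both the cache and the recurrent state, and only the query projection changes with the mode.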
Training uses chunked mixed-mode assignments so the model learns to run under different mixer choices. The paper reports a 1:3 attention-to-linear chunk ratio in its main recipe.
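A minimal sketch of what a 1:3 chunk assignment could look like follows. This note does not record how the paper samples chunk modes, so the exact-ratio shuffle here is an assumption, and the function names `assign_chunk_modes` and `expand_to_tokens` are hypothetical.

```python
import random

def assign_chunk_modes(num_chunks, attn_parts=1, lin_parts=3, seed=0):
    """Return one mode label per chunk at an attn:lin ratio (assumed scheme)."""
    rng = random.Random(seed)
    n_attn = round(num_chunks * attn_parts / (attn_parts + lin_parts))
    modes = ["attn"] * n_attn + ["lin"] * (num_chunks - n_attn)
    rng.shuffle(modes)  # randomize which chunks get attention
    return modes

def expand_to_tokens(chunk_modes, chunk_size):
    """Repeat each chunk's mode for every token in that chunk."""
    return [mode for mode in chunk_modes for _ in range(chunk_size)]

chunk_modes = assign_chunk_modes(16)
token_modes = expand_to_tokens(chunk_modes, chunk_size=64)
```

With 16 chunks this yields 4 attention chunks and 12 linear chunks; resampling the seed per training sequence would expose the model to different placements of the attention chunks.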
Evidence
The main experiments train Transformer, Mamba-2, Gated DeltaNet, and Oryx variants at 130M, 380M, 810M, and 1.4B scale under matched parameter and training-token budgets.
Reported findings that matter for this wiki:
- Oryx shares more than 90% of parameters across mixer modes.
- At 1.4B scale, all Oryx instances beat their corresponding single-mixer baselines by at least 0.7 percentage points on the paper’s averaged language-model downstream tasks.
- On retrieval evaluations, Oryx can approach Transformer retrieval performance while processing less than 10% of tokens in attention mode.
- With mixed inference, Oryx exceeds linear baselines by at least 8.6 percentage points on real-world retrieval tasks and at least 38.6 percentage points on NIAH-style tests in the paper’s summary.
- The ablations make query sharing a design boundary: sharing keys and values works, while fully shared queries are weaker.
Read the evidence as language-model and retrieval evidence under a fixed token-budget training setup, not as direct multivariate time-series evidence.
Relevance To This Wiki
Oryx is important for long-context and fixed-FLOPs discussions because it exposes mixer choice as a sequence-axis allocation variable. A future time-series model could analogously choose, per span or event window, between exact attention over raw observations and compact recurrent updates.
The useful transfer is not “Oryx solves time-series context.” The useful transfer is a design pattern:
shared representations + multiple state/update mechanisms + budgeted mode routing
For long time-series streams, the analogous state contract would be:
- keep recent high-resolution windows or selected blocks exactly readable when needed;
- update a compact recurrent state for ordinary spans;
- spend attention, retrieval, or explicit memory budget on rare regimes, exogenous events, topology changes, and action or intervention windows;
- evaluate under fixed expected FLOPs, realized latency, state size, and preservation probes.
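The state contract above implies some router that spends a limited attention budget on the spans that need it. A hedged sketch, assuming a per-span score is already available (surprise, anomaly score, or similar) and that routing is done offline over a whole stream; the name `route_spans` and the top-k rule are illustrative assumptions, not anything taken from the paper.

```python
def route_spans(scores, attn_budget=0.10):
    """Mark the highest-scoring spans for attention, up to a budget fraction."""
    # Small epsilon guards against float rounding just below an integer.
    n_attn = int(len(scores) * attn_budget + 1e-9)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:n_attn])
    return ["attn" if i in chosen else "lin" for i in range(len(scores))]

# Eighteen ordinary spans followed by two high-score spans (rare regimes).
span_scores = [0.1] * 18 + [5.0, 9.0]
span_modes = route_spans(span_scores, attn_budget=0.10)
```

A streaming deployment could not rank future spans, so it would need a threshold or a running quota instead of a global top-k; that choice is part of the routing question this page leaves open.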
Foundation TSFM Relevance
| Agenda slot | Verdict | Evidence | Missing pieces |
|---|---|---|---|
| Streaming latent state / long context | adjacent | Oryx jointly maintains attention and recurrent state and can switch mixer mode across sequence chunks. | Needs continuous numeric streams, event histories, action histories, explicit online update costs, and eviction audits. |
| Dynamic compute / fixed-FLOPs hierarchy | adjacent | Mixer choice becomes an allocation variable under matched training-token and parameter budgets. | Needs an explicit expected-FLOPs or latency objective and hard-routing analysis; token budget is not the same as realized serving budget. |
| Benchmark hygiene | warning | Retrieval gains depend on which spans use attention mode; aggregate language-model loss hides the allocation story. | Needs capability slices for rare regimes, exact recall, repeated normal spans, cross-channel binding, and action-conditioned rollouts. |
| Native multivariate encoding | insufficient evidence | Shared key/value representations are a plausible analogy for shared channel/event state. | No numeric variables, sensors, graph nodes, or high-channel telemetry are evaluated. |
| Control and counterfactuals | insufficient evidence | Sequence-axis mode routing could protect action windows if adapted. | No actions, control inputs, interventions, rewards, or counterfactual outcomes. |
Limitations
- Fresh 2026 preprint; no peer-reviewed venue result was found at ingest time.
- No official Oryx code or model checkpoints were found during this pass.
- Evidence covers language modeling, retrieval, and synthetic long-context tasks, not numeric time-series, telemetry, robotics, or action-conditioned world-model settings.
- The paper maintains both attention and linear states; practical serving benefit depends on how state maintenance, mode routing, kernels, cache layout, and batching are implemented.
- The reported fixed token-budget setup is not a full fixed-FLOPs or fixed-latency frontier.
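To see why a fixed token budget does not pin down compute, a back-of-envelope cost model helps. The sketch below counts multiply-adds for the mixing step only, per head and per layer, under stated assumptions: causal attention at position t reads t cached keys and values, the linear state is a `d_head` x `d_head` matrix that is updated on every token, and attention-mode tokens are spread uniformly over the sequence. It ignores projections, MLPs, kernels, and memory traffic, and `expected_mixer_cost` is a hypothetical name.

```python
def expected_mixer_cost(seq_len, d_head, attn_fraction):
    """Rough multiply-add count for token mixing (toy model, see assumptions)."""
    # Full causal attention: position t does t score dot products and
    # t weighted value adds, each of width d_head.
    full_attn = d_head * seq_len * (seq_len + 1)
    # One pass over the d_head x d_head linear state per token.
    state_pass = seq_len * d_head * d_head
    attn_cost = attn_fraction * full_attn        # attention-mode reads
    update_cost = state_pass                     # state updated on every token
    read_cost = (1 - attn_fraction) * state_pass # linear-mode reads
    return attn_cost + update_cost + read_cost

# Same token count, very different mixing cost depending on the mode split.
cost_all_attn = expected_mixer_cost(4096, 64, attn_fraction=1.0)
cost_mixed = expected_mixer_cost(4096, 64, attn_fraction=0.1)
cost_all_lin = expected_mixer_cost(4096, 64, attn_fraction=0.0)
```

Under this model the attention term grows quadratically with sequence length while the linear terms grow linearly, so two runs with identical token budgets can differ widely in realized compute. Measured latency would additionally depend on kernels, cache layout, and batching.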
Links Into The Wiki
- Oryx
- Gated DeltaNet
- Mamba-2
- Hybrid Associative Memories
- Efficient Recurrent Sequence Models
- Extra-Long Context For Time Series
- Streaming Latent-State Updates
- Time-Series Scaling And Efficiency
- Time-Series Benchmark Hygiene
- Hierarchical Modeling with a Fixed FLOPs Budget
- Foundation Time-Series Model Research Agenda
Open Questions
- Can sequence-axis mixer routing be trained under a true expected-FLOPs or latency budget rather than a fixed token budget?
- What router signal should choose attention versus recurrence for time-series spans: surprise, uncertainty, anomaly score, action relevance, event type, or downstream control value?
- Can shared key/value representations span numeric observations, events, text context, topology, and action history without forcing one representation to erase another?
- Does maintaining both attention and recurrent states still save memory in production once state-update overhead, KV-cache layout, and batching are counted?
- What is the time-series analogue of “less than 10% attention-mode tokens” that preserves rare regimes and exact intervention history?