Multi-Mixer Models: Flexible Sequence Modeling with Shared Representations

Source

Raw Markdown: paper_oryx-2026.md
PDF: paper_oryx-2026.pdf
Preprint: arXiv 2605.28769
Local source archive: papers/oryx-2026/arXiv-2605.28769.tar.gz
X provenance snapshot from the user-provided thread: papers/oryx-2026/x_thread_kevinyli_atrost3122_2067658710810603549.md and JSON sibling.
Mentioned/background papers in that discussion: Mamba-2, Gated DeltaNet, and Hybrid Associative Memories.

Status And Credibility

arXiv lists this as a fresh cs.LG preprint first submitted on 2026-05-27. The author list is Kevin Y. Li, Asher Trockman, Ananda Theertha Suresh, and Ziteng Sun; the arXiv HTML notes Google Research / Google DeepMind affiliation for the work context.

The paper is credible evidence on current architectures: it comes from an established Google/CMU research team, includes language-model experiments at up to 1.4B parameters under matched parameter and token budgets, and directly evaluates two linear recurrent mixer families already tracked in this wiki, Mamba-2 and Gated DeltaNet. It is still preprint evidence: no official code or model release was found during ingest, and similarly named Oryx projects on GitHub/OpenReview are unrelated.

The X thread is provenance for why this source entered the wiki. It is not used as the technical source of truth; arXiv and the paper artifacts are.

Core Claim

Oryx argues that transformer attention and linear recurrent mixers need not be mixed only across layers: they can also be mixed along the token sequence while sharing most of the model representation. A single Oryx block can run in attention mode or in a linear recurrent mode, maintain compatible states for both, and switch modes by chunk or span.

The paper’s central design claim is that attention and linear recurrence can share key/value representations, with mixer-specific queries and supporting parameters, so the model can allocate richer attention only where the sequence needs it.

Mechanism

Oryx maintains both an attention KV cache and a linear recurrent state. For a selected mixer mode $m \in \{\text{attention}, \text{mamba}, \text{gdn}\}$, the block computes

$$O = \mathrm{Mixer}_{m}\left(X W^{Q_{m}},\ X W^{K},\ X W^{V},\ X;\ W^{\mathrm{sup.}}\right).$$

The key and value projections are shared across modes; query projections are mixer-specific; short convolution, gates, and output projection machinery are arranged so linear and attention modes remain compatible.

flowchart LR
  X[input tokens] --> KV[shared key/value projections]
  X --> QA[attention query]
  X --> QR[linear recurrent query]
  KV --> Cache[attention KV cache]
  KV --> State[linear recurrent state]
  QA --> Attn[attention mixer]
  QR --> Rec[linear mixer: Mamba-2 or GDN]
  Cache --> Attn
  State --> Rec
  Attn --> Select{mode selector}
  Rec --> Select
  Select --> Y[block output]
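
The shared-projection arrangement described above can be illustrated with a toy single-head block. This is a minimal sketch, not the paper's implementation: the class `ToyMultiMixerBlock`, the helper `matvec`, and the mode names are hypothetical, the linear mode is plain unnormalized linear attention (a stand-in for Mamba-2 or Gated DeltaNet, with no gating, decay, or short convolution), and updating both states on every token is an assumption about how the states stay compatible.

```python
import math

def matvec(W, x):
    """Multiply a matrix W (list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

class ToyMultiMixerBlock:
    """Hypothetical toy block: shared K/V projections, one query projection per mode."""

    def __init__(self, W_k, W_v, W_q_by_mode):
        self.W_k, self.W_v = W_k, W_v
        self.W_q = W_q_by_mode  # e.g. {"attention": ..., "linear": ...}
        self.kv_cache = []      # attention state: one (key, value) pair per token
        # Recurrent state S with shape (d_k, d_v), starting at zero.
        self.state = [[0.0] * len(W_v) for _ in range(len(W_k))]

    def step(self, x, mode):
        k = matvec(self.W_k, x)         # shared across modes
        v = matvec(self.W_v, x)         # shared across modes
        q = matvec(self.W_q[mode], x)   # mode-specific query
        # Assumption: both states are updated on every token, so the next
        # token can be processed in either mode.
        self.kv_cache.append((k, v))
        for i, ki in enumerate(k):
            for j, vj in enumerate(v):
                self.state[i][j] += ki * vj   # S += k v^T (no gating or decay)
        if mode == "attention":
            # Softmax attention over every cached (key, value) pair.
            scale = math.sqrt(len(q))
            scores = [sum(a * b for a, b in zip(q, kc)) / scale
                      for kc, _ in self.kv_cache]
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            return [sum(w * vc[j] for w, (_, vc) in zip(weights, self.kv_cache)) / total
                    for j in range(len(v))]
        # Linear mode: read the recurrent state with the query, o = S^T q.
        return [sum(q[i] * self.state[i][j] for i in range(len(q)))
                for j in range(len(v))]

eye = [[1.0, 0.0], [0.0, 1.0]]
block = ToyMultiMixerBlock(eye, eye, {"attention": eye, "linear": eye})
out_linear = block.step([1.0, 0.0], "linear")        # first token, linear mode
out_attention = block.step([0.0, 1.0], "attention")  # second token, attention mode
```

The second call can attend to the first token even though that token was processed in linear mode, because its key and value were still written to the cache. That is the property the mode switch depends on in this sketch.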

Training uses chunked mixed-mode assignments so the model learns to run under different mixer choices. The paper reports a 1:3 attention-to-linear chunk ratio in its main recipe.
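
One way to picture a 1:3 chunk ratio is to mark one chunk in every group of four as attention. The sketch below does that with a seeded random choice per group. The function names, the grouping scheme, and the random placement are assumptions for illustration; the summary above only states the ratio, not how the paper samples assignments.

```python
import random

def assign_chunk_modes(num_chunks, group=4, seed=0):
    """Label each chunk, with one attention chunk per full group of `group` chunks."""
    rng = random.Random(seed)
    modes = []
    for start in range(0, num_chunks, group):
        size = min(group, num_chunks - start)
        # The slot may fall outside a short final group, which then stays all-linear.
        attn_slot = rng.randrange(group)
        modes.extend("attention" if i == attn_slot else "linear" for i in range(size))
    return modes

def expand_to_tokens(modes, chunk_size, seq_len):
    """Give every token the mode of the chunk it belongs to."""
    return [modes[t // chunk_size] for t in range(seq_len)]

chunk_modes = assign_chunk_modes(num_chunks=8)
token_modes = expand_to_tokens(chunk_modes, chunk_size=4, seq_len=32)
```

With eight chunks this yields exactly two attention chunks and six linear chunks, so a quarter of the tokens are processed in attention mode.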

Evidence

The main experiments train Transformer, Mamba-2, Gated DeltaNet, and Oryx variants at 130M, 380M, 810M, and 1.4B scale under matched parameter and training-token budgets.

Reported findings that matter for this wiki:

Oryx shares more than 90% of parameters across mixer modes.
At 1.4B scale, all Oryx instances beat their corresponding single-mixer baselines by at least 0.7 percentage points on the paper’s averaged language-model downstream tasks.
On retrieval evaluations, Oryx can approach Transformer retrieval performance while processing less than 10% of tokens in attention mode.
With mixed inference, Oryx exceeds linear baselines by at least 8.6 percentage points on real-world retrieval tasks and at least 38.6 percentage points on NIAH-style tests in the paper’s summary.
The ablations make query sharing a design boundary: sharing keys and values works, while fully shared queries are weaker.

Read the evidence as language-model and retrieval evidence under a fixed token-budget training setup, not as direct multivariate time-series evidence.

Relevance To This Wiki

Oryx is important for long-context and fixed-FLOPs discussions because it exposes mixer choice as a sequence-axis allocation variable. A future time-series model could analogously choose, per span or event window, between exact attention over raw observations and compact recurrent updates.

The useful transfer is not “Oryx solves time-series context.” The useful transfer is a design pattern:

shared representations + multiple state/update mechanisms + budgeted mode routing

For long time-series streams, the analogous state contract would be:

keep recent high-resolution windows or selected blocks exactly readable when needed;
update a compact recurrent state for ordinary spans;
spend attention, retrieval, or explicit memory budget on rare regimes, exogenous events, topology changes, and action or intervention windows;
evaluate under fixed expected FLOPs, realized latency, state size, and preservation probes.
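
The budgeted mode routing in this contract can be sketched as a top-k selection over per-span scores. This is a hypothetical illustration, not something evaluated in the paper: `route_spans`, the score values, and the hard cap `max_attention_spans` are all assumptions, and the score could stand in for any router signal such as an anomaly or surprise estimate.

```python
def route_spans(scores, max_attention_spans):
    """Send the highest-scoring spans to attention, the rest to the recurrent path."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(order[:max_attention_spans])  # hard cap on attention spans
    return ["attention" if i in chosen else "recurrent" for i in range(len(scores))]

# Ten spans with hypothetical anomaly scores; at most two may use attention.
span_scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.1, 0.1, 0.3, 0.2, 0.1]
span_modes = route_spans(span_scores, max_attention_spans=2)
```

A hard cap keeps the realized attention share predictable. A learned or thresholded router would need a separate check that it stays within budget.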

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming latent state / long context	adjacent	Oryx jointly maintains attention and recurrent state and can switch mixer mode across sequence chunks.	Needs continuous numeric streams, event histories, action histories, explicit online update costs, and eviction audits.
Dynamic compute / fixed-FLOPs hierarchy	adjacent	Mixer choice becomes an allocation variable under matched training-token and parameter budgets.	Needs an explicit expected-FLOPs or latency objective and hard-routing analysis; token budget is not the same as realized serving budget.
Benchmark hygiene	warning	Retrieval gains depend on which spans use attention mode; aggregate language-model loss hides the allocation story.	Needs capability slices for rare regimes, exact recall, repeated normal spans, cross-channel binding, and action-conditioned rollouts.
Native multivariate encoding	insufficient evidence	Shared key/value representations are a plausible analogy for shared channel/event state.	No numeric variables, sensors, graph nodes, or high-channel telemetry are evaluated.
Control and counterfactuals	insufficient evidence	Sequence-axis mode routing could protect action windows if adapted.	No actions, control inputs, interventions, rewards, or counterfactual outcomes.

Limitations

Fresh 2026 preprint; no peer-reviewed venue result was found at ingest time.
No official Oryx code or model checkpoints were found during this pass.
Evidence is language modeling, retrieval, and synthetic long-context evidence, not numeric time-series, telemetry, robotics, or action-conditioned world-model evidence.
The paper maintains both attention and linear states; practical serving benefit depends on how state maintenance, mode routing, kernels, cache layout, and batching are implemented.
The reported fixed token-budget setup is not a full fixed-FLOPs or fixed-latency frontier.
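
To see why a fixed token budget does not pin down compute, a back-of-envelope count of mixing-step multiply-accumulates helps. The sketch below is illustrative only: the function `mixer_macs` and its cost model are assumptions (one head, projections and MLPs ignored, recurrent state updated on every token, attention tokens spread evenly over the sequence) and none of the numbers come from the paper.

```python
def mixer_macs(seq_len, attention_fraction, d_model, d_state):
    """Rough multiply-accumulate count for the mixing step under a simple cost model."""
    n_attn = round(seq_len * attention_fraction)
    n_lin = seq_len - n_attn
    avg_context = (seq_len + 1) / 2                    # mean causal context length
    attn_read = n_attn * 2 * avg_context * d_model     # scores plus weighted value sum
    state_update = seq_len * d_state * d_model         # assumed: updated on every token
    lin_read = n_lin * d_state * d_model               # state read for linear tokens
    return attn_read + state_update + lin_read

all_linear = mixer_macs(1000, 0.0, d_model=64, d_state=16)
tenth_attention = mixer_macs(1000, 0.1, d_model=64, d_state=16)
all_attention = mixer_macs(1000, 1.0, d_model=64, d_state=16)
```

Under these assumptions the three settings process the same number of tokens but differ widely in mixing cost, and the gap grows with sequence length because the attention term is quadratic.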

Links Into The Wiki

Open Questions

Can sequence-axis mixer routing be trained under a true expected-FLOPs or latency budget rather than a fixed token budget?
What router signal should choose attention versus recurrence for time-series spans: surprise, uncertainty, anomaly score, action relevance, event type, or downstream control value?
Can shared key/value representations span numeric observations, events, text context, topology, and action history without forcing one representation to erase another?
Does maintaining both attention and recurrent states still save memory in production once state-update overhead, KV-cache layout, and batching are counted?
What is the time-series analogue of “less than 10% attention-mode tokens” that preserves rare regimes and exact intervention history?
