Streaming Latent-State Updates

Summary

Streaming latent-state updates are the serving contract for models that operate on never-ending streams. The model receives new observations or events, updates a retained state, decides whether to emit an output or stay silent, and keeps enough information for future decisions without replaying the full history.

For this wiki, the target is a foundation time-series model that can update at least as fast as wall-clock data arrives while preserving rare regimes, context, event streams, and action history. A long context window alone does not make a model real-time; the model must also report update cost, latency, and memory, and it must define its state-refresh behavior and when it abstains or triggers.

Streaming Contract

A minimal contract should name the retained state, the update rule, the output decision, and the serving budget:

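$$
s_t = f(s_{t-1}, x_t, c_t, e_t, a_{<t}), \qquad d_t = g(s_t), \qquad y_t = h(s_t, d_t)
$$

Here $x_t$ is the current observation, $c_t$ is context, $e_t$ is an event-stream item, $a_{<t}$ is prior action or control-input history when it exists, $s_t$ is latent state, $d_t$ is a decision such as abstain, alert, respond, ask, forecast, or propose action, and $y_t$ is the emitted output.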

flowchart LR
  O[New observation or event] --> U[State update]
  C[Context and schema] --> U
  A[Action or control-input history] --> U
  U --> S[Retained latent state]
  S --> D{Decision}
  D -->|silent or abstain| O
  D -->|answer or forecast| Y[Output]
  D -->|candidate intervention| P[Planner or policy]
  Y --> O
  P --> A
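
The loop in the flowchart can be sketched in a few lines of Python. This is a minimal illustration, not a proposed architecture: the retained state is a hand-written running mean and variance standing in for a learned latent state, and the names `State`, `update`, `decide`, `step`, `run_stream`, `z_threshold`, and `warmup` are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class State:
    """Retained state: running statistics only, so memory stays O(1)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations (Welford's method)


def update(state: State, x: float) -> State:
    """State update: fold one observation in without replaying history."""
    n = state.n + 1
    delta = x - state.mean
    mean = state.mean + delta / n
    m2 = state.m2 + delta * (x - mean)
    return State(n, mean, m2)


def decide(state: State, x: float, z_threshold: float = 3.0, warmup: int = 5) -> str:
    """Decision: 'abstain' (stay silent) or 'alert'. Thresholds are assumptions."""
    if state.n < warmup:
        return "abstain"
    std = math.sqrt(state.m2 / (state.n - 1))
    if std == 0.0:
        return "alert" if x != state.mean else "abstain"
    return "alert" if abs(x - state.mean) / std > z_threshold else "abstain"


def step(state: State, x: float) -> Tuple[State, str, Optional[str]]:
    """One contract step: decide against the prior state, then update it."""
    decision = decide(state, x)
    output = f"anomaly: {x}" if decision == "alert" else None
    return update(state, x), decision, output


def run_stream(xs: List[float]) -> List[str]:
    """Run the loop over a finite prefix of the stream and collect decisions."""
    state, decisions = State(), []
    for x in xs:
        state, decision, _ = step(state, x)
        decisions.append(decision)
    return decisions
```

On a mostly flat stream with one spike, for example `run_stream([1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 10.0, 1.0])`, the sketch stays silent on every step except the spike. The per-step cost and state size are constant, which is the property the contract asks a real model to report.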

What The Wiki Currently Believes

Of the two audio analogues in this corpus, Moshi is the stronger one for streaming generation and artifact diagnostics. Audio Interaction Model is demoted to a context-level warning: it makes silence/response serving explicit, but depends on heavy curated-data construction and does not solve retained-state growth.

Moshi is the earlier full-duplex example: 80 ms Mimi codec frames, 160 ms theoretical latency, about 200 ms practical latency, two separate audio streams for user and system, and an Inner Monologue text stream for Moshi’s own speech. Its lesson for metrics work is that streaming generation needs temporal artifact diagnostics, not only aggregate quality scores: the paper uses token-entropy windows to flag repetitive text, background noise during intended silence, gibberish, and noisy audio.
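
A windowed entropy diagnostic of this kind might look like the following sketch. It is an assumption-laden simplification: it measures the empirical Shannon entropy of emitted token IDs per window, which may differ from the exact procedure in the Moshi paper, and the names `window_entropies`, `flag_low_entropy`, and `min_bits` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def window_entropies(tokens: Sequence[int], window: int, stride: int) -> List[float]:
    """Shannon entropy (bits) of the empirical token distribution per window."""
    entropies = []
    for start in range(0, len(tokens) - window + 1, stride):
        counts = Counter(tokens[start:start + window])
        h = -sum((c / window) * math.log2(c / window) for c in counts.values())
        entropies.append(h)
    return entropies


def flag_low_entropy(entropies: Sequence[float], min_bits: float) -> List[int]:
    """Indices of windows whose entropy falls below an assumed threshold."""
    return [i for i, h in enumerate(entropies) if h < min_bits]
```

Very low entropy in a window suggests repetition or a stuck stream; a matching upper threshold (not shown) could flag gibberish or noise. The point of the per-window view is that a localized collapse shows up even when the aggregate score over the whole stream looks healthy.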

Audio Interaction Model is still a useful explicit trigger-token example: 400 ms chunks, silence/response control tokens, FIFO asynchronous inference, first-chunk latency, stall rate, and proactive trigger evaluation. Its lesson is narrow: real-time benchmarks should score the decision not to emit an output. Its TFJP preprocessing, synthetic silence supervision, history-review prompts, and FIFO queue are mostly workarounds rather than a reusable streaming-state architecture.
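
To make first-chunk latency and stall rate concrete, the following sketch simulates a single FIFO worker fed by fixed-length chunks. The definitions are assumptions made for illustration, not the paper's: a chunk becomes available once its 400 ms of audio has fully arrived, and a chunk counts as stalled when its result is ready more than one chunk duration after that. The function name `simulate_fifo` and its parameters are hypothetical.

```python
from typing import Dict, List


def simulate_fifo(proc_ms: List[float], chunk_ms: float = 400.0) -> Dict[str, object]:
    """Simulate one worker processing chunks in arrival order."""
    worker_free_at = 0.0
    latencies = []
    for i, p in enumerate(proc_ms):
        arrival = (i + 1) * chunk_ms          # chunk i is complete at this time
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        done = start + p
        worker_free_at = done
        latencies.append(done - arrival)      # includes queueing delay
    stalls = sum(1 for lat in latencies if lat > chunk_ms)
    return {
        "first_chunk_latency_ms": latencies[0],
        "stall_rate": stalls / len(latencies),
        "latencies_ms": latencies,
    }
```

With processing times of `[100, 900, 100, 100]` ms, the slow second chunk stalls and also delays the third, so the stall rate is 0.5 even though only one chunk was slow to process. The queue hides nothing: it decouples ingestion from decoding, but it does not reduce the work or the state the model has to carry.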

Language Models Need Sleep is the strongest current analogy for finite-window eviction. The model spends extra consolidation compute before old context leaves the attention window, then resumes cheaper wake-time prediction. The time-series analogue is a learned state-refresh step over recent numeric observations, event streams, context, and action history before raw samples are dropped.
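
The consolidate-before-eviction pattern can be sketched with a bounded window and a summary that is updated just before a sample is dropped. This is only a structural illustration: the hand-written count, mean, and maximum stand in for a learned state-refresh step, and the class name `ConsolidatingWindow` and its fields are hypothetical.

```python
from collections import deque
from typing import Deque, Dict


class ConsolidatingWindow:
    """Bounded raw window plus a summary refreshed at each eviction."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.window: Deque[float] = deque()
        self.summary: Dict[str, float] = {"count": 0, "mean": 0.0, "max_abs": 0.0}

    def _consolidate(self, x: float) -> None:
        """Fold a sample into the summary before it leaves the window."""
        s = self.summary
        s["count"] += 1
        s["mean"] += (x - s["mean"]) / s["count"]
        s["max_abs"] = max(s["max_abs"], abs(x))  # keeps rare extremes visible

    def push(self, x: float) -> None:
        """Add a sample; consolidate the oldest one if the window is full."""
        if len(self.window) == self.capacity:
            self._consolidate(self.window.popleft())
        self.window.append(x)
```

After pushing `[1, 2, 100, 3, 4, 5]` into a window of capacity 3, the raw window holds only the last three samples, while the summary still records that a value of magnitude 100 once occurred. Whether a learned refresh preserves such rare values is exactly what an eviction audit would need to check.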

FADE adds the continual-learning warning: online systems need selective forgetting, not a single fixed retention horizon. The time-series version should forget stale mappings while retaining stable dynamics and rare safety-relevant state.

TurboQuant adds the serving-memory warning: compressed retained state must improve actual latency, throughput, memory pressure, and quality after dequantization or retrieval cost is counted.
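
The accounting this warning asks for can be illustrated with a toy round trip. The sketch below uses plain uniform int8 quantization with a single scale, which is not TurboQuant's method, and reports bytes, round-trip error, and dequantization time side by side. The function names and report keys are hypothetical.

```python
import math
import time
from array import array
from typing import Dict, List, Tuple


def quantize_int8(values: List[float]) -> Tuple[array, float]:
    """Uniform symmetric quantization to signed 8-bit integers."""
    peak = max((abs(v) for v in values), default=0.0)
    scale = peak / 127.0 if peak > 0 else 1.0
    q = array("b", (round(v / scale) for v in values))
    return q, scale


def dequantize(q: array, scale: float) -> List[float]:
    """Restore approximate float values from the quantized state."""
    return [x * scale for x in q]


def state_report(values: List[float]) -> Dict[str, float]:
    """Report memory, quality, and dequantization cost together."""
    raw = array("f", values)  # float32 baseline for the byte count
    q, scale = quantize_int8(values)
    t0 = time.perf_counter()
    restored = dequantize(q, scale)
    dequant_seconds = time.perf_counter() - t0
    return {
        "raw_bytes": raw.itemsize * len(raw),
        "compressed_bytes": q.itemsize * len(q) + 8,  # plus one float64 scale
        "scale": scale,
        "max_abs_error": max(abs(a - b) for a, b in zip(values, restored)),
        "dequant_seconds": dequant_seconds,
    }


report = state_report([math.sin(i / 10.0) for i in range(1000)])
```

The byte count alone shows roughly a 4x reduction, but the report only supports a serving claim once `dequant_seconds` and the error are weighed against the latency budget and downstream quality.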

Latent Context Language Models add adjacent evidence for prefill-time learned context compression: a smaller encoder compresses static prompt spans into soft latent tokens before decoder prefill, with optional exact expansion for selected chunks. For streaming TSFMs, this sharpens the question of whether compression can be updated online as observations and tool results arrive, or whether it belongs at eviction and consolidation boundaries.

Mamba-3, RWKV-TS, and RATE are architecture background for compact recurrent state, numeric time-series recurrence, and action-trajectory memory. They are not sufficient by themselves because the streaming-state target also needs context schemas, event streams, interventions, state-refresh probes, and serving measurements.

Design Requirements

  • State update cost MUST be reported per sample, per event, or per chunk.
  • Retained state size MUST be explicit: KV cache, recurrent hidden state, memory tokens, fast weights, external memory, or compressed summaries.
  • The model SHOULD expose an abstain/silence/trigger decision when real-time output is optional or costly.
  • Benchmarks SHOULD include benign no-op spans and rare critical spans, so false positives and false negatives are both measured.
  • Context eviction SHOULD be audited: what is lost when raw history leaves the retained window?
  • Compression SHOULD be evaluated against downstream state utility, not only byte reduction or reconstruction error.
  • Action or control-input history MUST be distinguished from passive events and exogenous variables.
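
The benchmark requirement above, that benign no-op spans and rare critical spans both be scored, can be sketched as a step-level tally. Scoring per step rather than per span is an assumption made to keep the example short, and the name `score_triggers` and its arguments are hypothetical.

```python
from typing import Dict, Sequence


def score_triggers(decisions: Sequence[bool], critical: Sequence[bool]) -> Dict[str, float]:
    """Compare emitted alerts against labeled critical steps.

    decisions[i] is True if the model emitted an alert at step i.
    critical[i] is True if step i lies inside a critical span.
    """
    if len(decisions) != len(critical):
        raise ValueError("decisions and critical must have the same length")
    tp = sum(1 for d, c in zip(decisions, critical) if d and c)
    fp = sum(1 for d, c in zip(decisions, critical) if d and not c)
    fn = sum(1 for d, c in zip(decisions, critical) if not d and c)
    tn = sum(1 for d, c in zip(decisions, critical) if not d and not c)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Alerts raised during benign no-op steps.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        # Critical steps on which the model stayed silent.
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

A benchmark with no benign spans leaves `false_positive_rate` undefined in practice (reported here as 0.0), which is why both kinds of span have to be present before silence can be credited as a correct decision.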

Relation To Foundation TSFM Agenda

This page fills the streaming-state gap in the Foundation Time-Series Model Research Agenda. It maps most directly to streaming state and long context, but also touches event streams, context interface, dynamic compute, and control/counterfactuals when action history is present.

The agenda-relevant test is:

Can the model keep useful state under continuous updates,
with bounded latency and memory, while preserving the variables
needed for future observations, alerts, and candidate interventions?

Audio Interaction Model does not answer this question. It gives a serving-contract and trigger-evaluation analogy outside numeric time series, but it does not provide bounded memory, eviction audits, learned state compression, multivariate observations, irregular event streams, graph time series, topology, exogenous variables, or typed actions and control inputs.

Open Questions

  • What is the minimal benchmark for always-on numeric time-series state updates?
  • Should a streaming TSFM emit an explicit abstain/alert/action token, or should that decision live in a separate policy head?
  • If no-op behavior is represented as explicit abstain/alert/action tokens, how can training avoid Audio-Interaction-style reliance on hand-curated silence supervision?
  • How should state-refresh quality be measured after a retained window is evicted?
  • Can FIFO-style ingestion/decoding decoupling transfer from audio to telemetry serving without being mistaken for context compression?
  • Which retained-state interface is best under real serving constraints: KV cache, recurrent state, memory tokens, fast weights, learned summaries, or compressed retrieval memory?
  • How should false-positive alerting cost and false-negative safety cost be weighted for rare operational events?
  • Can Moshi-style entropy-over-token-stream diagnostics become useful health checks for generated forecasts, alert streams, or latent-state summaries under quantization and long-running service?
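
For the question above about weighting false-positive alerting cost against false-negative safety cost, one standard starting point is the expected-cost decision rule: alert when the expected cost of staying silent exceeds the expected cost of alerting. The sketch below is not an answer to the open question; it assumes a calibrated event probability and fixed, known costs, both of which are strong assumptions for rare operational events. The function names are hypothetical.

```python
def alert_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which alerting has lower expected cost.

    Alerting costs (1 - p) * cost_fp in expectation; silence costs p * cost_fn.
    Alerting wins when p > cost_fp / (cost_fp + cost_fn).
    """
    if cost_fp < 0 or cost_fn < 0 or (cost_fp + cost_fn) == 0:
        raise ValueError("costs must be non-negative and not both zero")
    return cost_fp / (cost_fp + cost_fn)


def should_alert(p_event: float, cost_fp: float, cost_fn: float) -> bool:
    """Apply the expected-cost rule to a calibrated event probability."""
    return p_event > alert_threshold(cost_fp, cost_fn)
```

If a missed event is assumed to cost 99 times as much as a false alarm, the threshold is 0.01, so the rule alerts at an estimated probability of 0.02 and stays silent at 0.005. The open part of the question is how to set those costs and whether the model's probabilities stay calibrated on rare regimes.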