Moshi: a speech-text foundation model for real-time dialogue

Source

Credibility

This is a Kyutai technical report with arXiv v1 submitted on 2024-09-17 and v2 last revised on 2024-10-02. It is older than one year as of 2026-06-12, so it should not be treated as current SOTA by default. Its credibility for this wiki comes from being a historically important released system with official code, demo, weights, Mimi codec artifacts, an official release thread, and detailed ablations on latency, streaming codec design, audio quality, spoken QA, safety, and quantization.

Use it as an engineering and metrics anchor for full-duplex streaming audio, not as proof that the latest audio-language models or time-series world models are solved.

Core Claim

Moshi casts spoken dialogue as end-to-end speech-to-speech generation rather than a cascade of voice activity detection, ASR, text dialogue, and TTS. The model jointly handles:

  • Moshi’s own audio stream;
  • the user’s audio stream;
  • a time-aligned text stream for Moshi’s speech, called Inner Monologue.

This multi-stream interface lets the system listen and speak continuously, including silence, interruptions, overlap, and backchanneling, while targeting 160 ms theoretical latency and about 200 ms practical latency.

Key Contributions

  • Introduces Moshi, a 7B-class full-duplex spoken dialogue framework built on the Helium text language model.
  • Introduces Mimi, a causal streaming neural audio codec that maps 24 kHz speech into 12.5 Hz semantic-acoustic audio tokens at about 1.1 kbps.
  • Uses hierarchical autoregressive generation with a large Temporal Transformer across timesteps and a smaller Depth Transformer across codebooks/streams at a timestep.
  • Adds multi-stream modeling so the user and system speech streams are represented separately rather than forced into turn-segmented dialogue.
  • Adds Inner Monologue: time-aligned text tokens are predicted as a prefix to Moshi’s own semantic and acoustic audio tokens.
  • Shows that changing the delay between text and audio tokens can derive streaming ASR and streaming TTS from the same modeling setup.
  • Provides quantization and artifact analysis, including token-entropy diagnostics for degradation over generated time.

Method Notes

Mimi produces audio tokens every 80 ms. The first token level is trained to carry semantic information via WavLM distillation, while the remaining residual quantizer levels carry acoustic detail. This keeps the codec causal and streaming-compatible while reducing the number of autoregressive steps Moshi must run.
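
The 80 ms frame period, the roughly 1.1 kbps bitrate, and the 160 ms theoretical latency quoted in this note are linked by simple arithmetic. The sketch below is a back-of-envelope check, not code from the Moshi release. It assumes 8 codebooks of 2048 entries each, which is how the report describes Mimi but is not stated elsewhere in this note; the function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 12.5   # from this note: Mimi token frames per second
NUM_CODEBOOKS = 8      # assumption: quantizer levels emitted per frame
CODEBOOK_SIZE = 2048   # assumption: entries per codebook (11 bits each)


def frame_period_ms(frame_rate_hz: float) -> float:
    """Duration of one token frame in milliseconds."""
    return 1000.0 / frame_rate_hz


def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bits per second needed to transmit every codebook index."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_token


def theoretical_latency_ms(frame_rate_hz: float, acoustic_delay_frames: int) -> float:
    """One frame to buffer the input plus the acoustic delay, in milliseconds."""
    return frame_period_ms(frame_rate_hz) * (1 + acoustic_delay_frames)


if __name__ == "__main__":
    print(frame_period_ms(FRAME_RATE_HZ))                            # 80.0
    print(bitrate_bps(FRAME_RATE_HZ, NUM_CODEBOOKS, CODEBOOK_SIZE))  # 1100.0
    print(theoretical_latency_ms(FRAME_RATE_HZ, 1))                  # 160.0
```

Under these assumptions the bitrate comes out at 1100 bits per second, matching the "about 1.1 kbps" figure.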

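Moshi uses an RQ-Transformer-style factorization. A Temporal Transformer produces the context vector z_s for timestep s from all earlier timesteps:

z_s = Tr_Temp(V_0, ..., V_{s-1})

Then a Depth Transformer predicts each substream token at that timestep, conditioning on z_s and on the substreams already produced at step s, where l_{s,k} denotes the logits for substream k:

l_{s,k} = Tr_Depth(z_s, V_{s,1}, ..., V_{s,k-1})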

The final multi-stream sequence contains text, Moshi audio, and user audio:

V_s = [
  W_s,
  A^moshi_{s, semantic},
  A^moshi_{s - tau, acoustic},
  A^user_{s, semantic},
  A^user_{s - tau, acoustic}
]

At inference time, the user audio stream is observed from the microphone, while Moshi’s text and audio streams are sampled. Silence is represented inside the generated text/audio token streams rather than by a separate response/no-response classifier.
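
The acoustic delay in V_s is easiest to see by assembling one frame by hand. The sketch below is a minimal illustration, not the released implementation: the `PAD` placeholder, the helper names, and the list-of-lists layout for acoustic levels are all assumptions made for the example.

```python
from typing import List

PAD = -1  # hypothetical placeholder id for positions with no token yet


def delayed(stream: List[int], s: int, delay: int) -> int:
    """Token of `stream` at index s - delay, or PAD outside the stream."""
    idx = s - delay
    return stream[idx] if 0 <= idx < len(stream) else PAD


def build_frame(
    s: int,
    text: List[int],
    moshi_semantic: List[int],
    moshi_acoustic: List[List[int]],  # one token list per acoustic level
    user_semantic: List[int],
    user_acoustic: List[List[int]],
    tau: int,
) -> List[int]:
    """Assemble V_s: text, then Moshi audio, then user audio.

    Semantic tokens are taken at step s; acoustic tokens at step s - tau.
    """
    frame = [text[s], moshi_semantic[s]]
    frame += [delayed(level, s, tau) for level in moshi_acoustic]
    frame.append(user_semantic[s])
    frame += [delayed(level, s, tau) for level in user_acoustic]
    return frame


if __name__ == "__main__":
    text = [10, 11, 12]
    m_sem, m_ac = [20, 21, 22], [[30, 31, 32], [40, 41, 42]]
    u_sem, u_ac = [50, 51, 52], [[60, 61, 62], [70, 71, 72]]
    print(build_frame(0, text, m_sem, m_ac, u_sem, u_ac, tau=1))
    print(build_frame(2, text, m_sem, m_ac, u_sem, u_ac, tau=1))
```

With tau = 1, the frame at s = 0 carries placeholders in every acoustic slot, and the frame at s = 2 pairs semantic tokens from step 2 with acoustic tokens from step 1.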

Evidence And Results

The paper reports several results that matter for streaming systems:

  • Latency: final Moshi uses an acoustic delay of 1 frame, giving a theoretical latency of 160 ms (one 80 ms Mimi frame plus 80 ms of acoustic delay); the official code README reports practical overall latency as low as 200 ms on an L4 GPU.
  • Context: the experiments target conversations across several minutes, with 5 minutes used as a core setting.
  • Mimi codec: operates at 12.5 Hz and 1.1 kbps, with causal 80 ms frames; the paper reports much better subjective MUSHRA quality for adversarial-only Mimi training despite weak objective metric agreement.
  • Spoken QA: Moshi with Inner Monologue reports 26.6 on Web Questions, 62.3 on Llama Questions, and 22.8 on Audio TriviaQA, substantially above the paper’s listed speech-text baselines but below the text-only Helium topline.
  • Streaming ASR/TTS demonstration: the paper reports 5.7% WER for streaming ASR and 4.7% WER for streaming TTS on LibriSpeech test-clean under the paper’s limited demonstration setup.
  • Dialogue statistics: Moshi’s generated Fisher continuations are evaluated with IPU, pause, gap, and overlap metrics, making turn-taking behavior part of the evaluation surface.
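
The pause, gap, and overlap statistics in the last bullet can be sketched from two per-frame voice-activity tracks. This is a simplified illustration under stated assumptions, not the paper's evaluation code: a pause is taken to be silence between two stretches of the same speaker, a gap is silence between different speakers, and silence adjacent to overlapping speech is left uncounted. Real evaluations usually also apply a minimum-silence threshold when segmenting IPUs, which this sketch omits. The function name and the returned keys are hypothetical.

```python
from itertools import groupby
from typing import Dict, List


def turn_taking_frames(a: List[bool], b: List[bool]) -> Dict[str, int]:
    """Count solo-speech, overlap, pause, and gap frames for two speakers."""
    if len(a) != len(b):
        raise ValueError("activity tracks must have equal length")
    # Per-frame state: "A", "B", "AB" (both speaking), or "" (silence).
    states = [("A" if x else "") + ("B" if y else "") for x, y in zip(a, b)]
    # Collapse consecutive identical states into (state, length) runs.
    runs = [(state, sum(1 for _ in group)) for state, group in groupby(states)]
    stats = {"solo_speech": 0, "overlap": 0, "pause": 0, "gap": 0}
    for i, (state, n) in enumerate(runs):
        if state == "AB":
            stats["overlap"] += n
        elif state:
            stats["solo_speech"] += n
        elif 0 < i < len(runs) - 1:  # interior silence only
            before, after = runs[i - 1][0], runs[i + 1][0]
            if "AB" in (before, after):
                continue  # ambiguous: silence next to overlapping speech
            stats["pause" if before == after else "gap"] += n
    return stats


if __name__ == "__main__":
    a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 1]
    b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
    print(turn_taking_frames([bool(x) for x in a], [bool(x) for x in b]))
```

Multiplying the frame counts by the frame duration (80 ms for Mimi-rate tracks) converts them to seconds.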

Metrics Lessons

Moshi is especially useful for the wiki’s metrics agenda because the paper repeatedly shows that simple aggregate quality scores are insufficient for streaming generation.

The codec section reports poor correlation between objective audio metrics and subjective listening quality when changing the training objective. In particular, adversarial-only Mimi can look weak under VisQOL while sounding much better to listeners. The generative modeling section also says textless NLP metrics such as sWUGGY and sBLIMP do not consistently guide dialogue-model development under varied acoustic conditions.

The quantization appendix contains the most portable metrics idea. The authors compute token entropy over sliding windows for the generated text and audio codebook streams, then classify artifacts such as repetitive text, background noise during intended silence, gibberish, and noisy audio. This is closer to a time-series health monitor than a single endpoint score.
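
A sliding-window entropy monitor of this kind can be sketched in a few lines. The code below is a minimal illustration of the idea, not the authors' procedure: the window size, hop, threshold, and the rule "low entropy suggests repetition" are assumptions for the example, and the report's actual classification rules are not reproduced here.

```python
import math
from collections import Counter
from typing import List, Sequence


def window_entropy(tokens: Sequence[int]) -> float:
    """Shannon entropy in bits of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def sliding_entropy(tokens: Sequence[int], window: int, hop: int) -> List[float]:
    """Entropy of each full window of `window` tokens, advanced by `hop`."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [
        window_entropy(tokens[i:i + window])
        for i in range(0, len(tokens) - window + 1, hop)
    ]


def flag_low_entropy(entropies: Sequence[float], threshold: float) -> List[int]:
    """Indices of windows whose entropy falls below a hypothetical threshold."""
    return [i for i, h in enumerate(entropies) if h < threshold]


if __name__ == "__main__":
    stream = [1, 2, 3, 4, 5, 5, 5, 5]  # varied tokens, then a repetitive run
    entropies = sliding_entropy(stream, window=4, hop=4)
    print(entropies)                         # [2.0, 0.0]
    print(flag_low_entropy(entropies, 0.5))  # [1]
```

Running one such monitor per stream (text and each audio codebook) gives a per-window time series rather than a single score, which is the property the paragraph above highlights.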

For observability and multivariate time-series work, the lesson is to track temporal artifact structure, false silence, over-triggering, lag, stall, state drift, and degradation-over-time, not only final forecast error or one average generation score.

Relation To Audio Interaction Model And SoundFlow

Moshi and Audio Interaction Model are no longer symmetric streaming examples for the TSFM agenda. Moshi remains the cleaner streaming-generation and artifact-diagnostics anchor; Audio Interaction Model is a caveated trigger-policy contrast.

Moshi is the earlier speech-to-speech, full-duplex example. Its strongest contributions are stream representation, low-latency audio tokenization, multi-stream generation, continuous silence/speech generation, and artifact diagnostics over generated token streams.

Audio Interaction Model and SoundFlow are useful but demoted examples of explicit online decision policy. They model chunk-level silent/response decisions, build StreamAudio-2M and Proactive-Sound-Bench around proactive intervention, and use FIFO asynchronous inference to keep ingestion aligned while generation proceeds. The caveat is that this depends on heavy preprocessing, curated silence/noise handling, synthetic stream composition, history-review prompts, and scheduling; it is not a reusable long-context state solution.

Together, the pair suggests a metric stack for always-on models:

stream quality + retained context + abstain/trigger timing
  + output quality + latency/stall + artifact-over-time diagnostics

Boundary With JEPA And World Models

Moshi is not a JEPA system. It is an autoregressive discrete-token generator over audio and text streams. JEPA-style systems instead train predictors in representation space, usually to predict target embeddings rather than reconstructing or sampling the next raw token. The shared intuition is that semantic structure matters more than raw reconstruction, but the objective and serving interface are different.

Moshi is also not an action-conditioned world model in the strict wiki sense. It models dialogue dynamics conditioned on past audio/text streams and live user audio, but it does not learn a transition model for future system state under candidate actions, control inputs, or interventions. User speech is an observed input stream; Moshi’s response is generated behavior, not an evaluated intervention with downstream environment outcomes.

For time-series and metrics work, Moshi should therefore be treated as adjacent evidence for streaming state, event-stream representation, and serving metrics. It does not close the world-model slots that require typed action logs, intervention timing, counterfactual evaluation, or next-state dynamics under alternative actions.

Limitations

  • The report is from 2024 and should be read as a historical full-duplex streaming anchor rather than current SOTA.
  • Moshi remains much weaker than the text-only Helium model on knowledge-heavy QA, especially Audio TriviaQA.
  • Inner Monologue improves the linguistic quality of generated speech but also makes the architecture partly dependent on a text stream; this is a useful scaffold but not proof that audio-native reasoning is solved.
  • The streaming ASR/TTS experiments are demonstrations of interface flexibility, not specialized SOTA systems.
  • Objective audio metrics are unreliable in several reported settings; human evaluation and artifact-specific temporal diagnostics are still needed.
  • Silence is generated as part of the output stream, not evaluated as a calibrated abstain/response decision with false-positive and false-negative costs.
  • The model has no explicit action, control-input, or intervention channel for action-conditioned world modeling.
  • The safety analysis is mostly text-side and early for audio-specific harms such as voice misuse, re-encoding attacks, and watermark robustness.
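
To make the silence limitation concrete, the sketch below shows what a cost-weighted evaluation of respond/stay-silent decisions could look like. It is a hypothetical scoring function, not something the Moshi report provides: the per-step labels, the cost values, and the function name are all assumptions for illustration.

```python
from typing import Dict, Sequence


def trigger_cost(
    decisions: Sequence[bool],  # True = model responded at this step
    labels: Sequence[bool],     # True = a response was warranted
    fp_cost: float,             # hypothetical cost of an unwarranted response
    fn_cost: float,             # hypothetical cost of a missed response
) -> Dict[str, float]:
    """Count false positives/negatives and return the mean cost per step."""
    if len(decisions) != len(labels):
        raise ValueError("decisions and labels must have equal length")
    if not labels:
        raise ValueError("need at least one step")
    fp = sum(1 for d, y in zip(decisions, labels) if d and not y)
    fn = sum(1 for d, y in zip(decisions, labels) if not d and y)
    mean_cost = (fp * fp_cost + fn * fn_cost) / len(labels)
    return {"false_positives": fp, "false_negatives": fn, "mean_cost": mean_cost}


if __name__ == "__main__":
    decisions = [True, False, True, False]
    labels = [True, True, False, False]
    print(trigger_cost(decisions, labels, fp_cost=1.0, fn_cost=5.0))
```

In this example one unwarranted response and one missed response, weighted 1.0 and 5.0, give a mean cost of 1.5 per step.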

Foundation TSFM Relevance

| Agenda slot | Verdict | Evidence | Missing pieces |
| --- | --- | --- | --- |
| Streaming state and long context | adjacent | Maintains multi-minute audio dialogue context and runs an 80 ms codec / 160 ms theoretical generation loop. | Need numeric multivariate time series, graph time series, irregular event streams, state-refresh probes, and bounded-memory tests. |
| Event streams and abstention/trigger behavior | adjacent | Models continuous speech/silence and turn-taking without explicit speaker turns; evaluates pauses, gaps, and overlap. | Silence is not a calibrated alert/abstain/ask/act decision with costs and outcome labels. |
| Metrics and benchmark design | warning | Shows weak correlation between objective audio metrics and subjective quality; adds token-entropy artifact diagnostics over generated time. | Need analogous temporal artifact metrics for operational telemetry, false alerts, missed incidents, state drift, and intervention effects. |
| Unified multimodal interface | adjacent | Combines audio tokens and text tokens in one streaming model, with text as an Inner Monologue prefix to speech generation. | Does not cover numeric feature streams, channel metadata, topology, logs, or typed operational context. |
| Control and counterfactuals | insufficient evidence | The model generates dialogue behavior conditioned on observed streams. | No candidate action rollout, intervention log, control input, reward/outcome model, or counterfactual evaluation. |

Open Questions

  • What is the time-series analogue of Moshi’s generated silence: abstain, no alert, wait for more evidence, defer to an operator, or keep updating latent state?
  • Can token-entropy artifact diagnostics transfer to multivariate time-series state health, such as detecting repetitive forecasts, false silence, drift, or degradation under quantization?
  • Should streaming telemetry models use explicit abstain/alert tokens without inheriting Audio-Interaction-style curated silence supervision, or continuous generated no-op spans like Moshi?
  • What benchmark can jointly score latency, state retention, abstention, rare-event detection, and degradation-over-time for operational metric streams?