Moshi

Summary

Moshi is Kyutai’s 2024 full-duplex speech-to-speech dialogue model. It listens to the user's audio stream while generating its own text and audio streams, including silence, so the dialogue needs no explicit speaker turns.

Role In The Wiki

Moshi is an engineering example for continuous streaming data, low-latency serving, stream separation, and temporal artifact metrics. It is not a numeric time-series foundation model and should not be treated as an action-conditioned world model.

Its main value for metrics work lies in the combination of:

  • low-latency stream processing;
  • separate observed and generated streams;
  • silence as first-class generated behavior;
  • evaluation of turn-taking and dialogue timing;
  • token-entropy diagnostics for artifact patterns over generated time.
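To make the last item concrete, here is a minimal sketch of a token-entropy diagnostic over generated time. It is a generic illustration, not Moshi's own procedure: the function names, the threshold, the minimum run length, and the toy distributions are all hypothetical, and the choice to flag sustained low entropy (rather than high entropy) as the suspicious pattern is an assumption.

```python
import math
from typing import List, Sequence, Tuple


def frame_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one per-frame token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def low_entropy_runs(
    entropies: Sequence[float], threshold: float, min_len: int
) -> List[Tuple[int, int]]:
    """Find runs of at least min_len consecutive frames with entropy below threshold.

    Returns (start, end) frame indices with end exclusive.
    """
    runs: List[Tuple[int, int]] = []
    start = None
    for i, h in enumerate(entropies):
        if h < threshold:
            if start is None:
                start = i  # a candidate run begins here
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the final frame.
    if start is not None and len(entropies) - start >= min_len:
        runs.append((start, len(entropies)))
    return runs


# Toy data (hypothetical): six generated frames over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
frames = [uniform, uniform, peaked, peaked, peaked, uniform]

entropies = [frame_entropy(p) for p in frames]
flagged = low_entropy_runs(entropies, threshold=0.5, min_len=3)
print(flagged)  # [(2, 5)]
```

In a real setting the per-frame distributions would come from the model's sampling step for the generated stream, and the threshold would need calibration against listened-to examples before any flagged run is treated as an artifact.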

Official Artifacts

Evidence

Relation To Foundation TSFM Agenda

Use the source-level agenda mapping in moshi-2024 rather than duplicating verdict rows here.

At the entity level, Moshi is useful as a full-duplex audio event-stream analogue: it shows how an always-on model can maintain stream context, handle generated no-op/silence behavior, and expose temporal artifact metrics. It does not provide numeric observations, graph time series, topology, typed control inputs, interventions, or counterfactual next-state rollouts.