Moshi
Summary
Moshi is Kyutai’s 2024 full-duplex speech-to-speech dialogue model. It listens to the user’s audio stream while generating its own text and audio streams, including silence, without requiring explicit speaker turns.
Role In The Wiki
Moshi is an engineering example for continuous streaming data, low-latency serving, stream separation, and temporal artifact metrics. It is not a numeric time-series foundation model and should not be treated as an action-conditioned world model.
Its main value for metrics work is that it combines:
- low-latency stream processing;
- separate observed and generated streams;
- silence as first-class generated behavior;
- evaluation of turn-taking and dialogue timing;
- token-entropy diagnostics for artifact patterns over generated time.
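As a rough illustration of the last item, a windowed entropy over generated token IDs can flag stretches where the output collapses into repetition. This is a minimal sketch under stated assumptions, not Moshi's published procedure: the function names, the window and hop sizes, and the threshold are all hypothetical, and the input is assumed to be a flat sequence of integer token IDs from one generated stream.

```python
import math
from collections import Counter
from typing import List, Sequence


def window_entropy(tokens: Sequence[int]) -> float:
    """Empirical Shannon entropy, in bits, of the token IDs in one window."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sliding_entropy(tokens: Sequence[int], window: int, hop: int) -> List[float]:
    """Entropy of each full window, stepping `hop` tokens at a time."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    return [
        window_entropy(tokens[start:start + window])
        for start in range(0, len(tokens) - window + 1, hop)
    ]


def flag_low_entropy(entropies: Sequence[float], threshold: float) -> List[int]:
    """Indices of windows whose entropy falls below a hypothetical threshold."""
    return [i for i, h in enumerate(entropies) if h < threshold]


# Toy stream: eight varied tokens followed by eight repeats of one token.
stream = [1, 2, 3, 4, 1, 2, 3, 4] + [7] * 8
entropies = sliding_entropy(stream, window=4, hop=4)
suspect_windows = flag_low_entropy(entropies, threshold=1.0)
```

Low entropy alone does not distinguish an artifact from intended silence, so a diagnostic like this would normally be read alongside the timing signals listed above.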
Official Artifacts
- Preprint: arXiv 2410.00037
- Official technical report PDF: Moshi.pdf
- Official launch blog: Meet Moshi, the first real-time voice AI
- Official open-source release: Moshi open-source release: run Moshi locally!
- Official code: kyutai-labs/moshi
- Official demo: moshi.chat
- Official Hugging Face collection: Moshi v0.1 Release
- Official Mimi codec: kyutai/mimi
- Official X thread: Kyutai release thread
Evidence
Relation To Foundation TSFM Agenda
Use the source-level agenda mapping in moshi-2024 rather than duplicating verdict rows here.
At the entity level, Moshi is useful as a full-duplex audio event-stream analogue: it shows how an always-on model can maintain stream context, handle generated no-op/silence behavior, and expose temporal artifact metrics. It does not provide numeric observations, graph time series, topology, typed control inputs, interventions, or counterfactual next-state rollouts.
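The full-duplex framing described here, two time-aligned streams in which silence is an explicit output, can be sketched as a small timing summary. This is an illustrative sketch only: the `DuplexTiming` type, the `duplex_timing` function, and the per-frame boolean activity inputs are hypothetical, and the 0.08 s default frame duration is an assumption based on the 12.5 Hz frame rate reported for the Mimi codec.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DuplexTiming:
    overlap_s: float            # time where both streams are active
    both_silent_s: float        # time where neither stream is active
    model_silence_ratio: float  # fraction of frames the model stays silent


def duplex_timing(
    user_active: Sequence[bool],
    model_active: Sequence[bool],
    frame_s: float = 0.08,  # assumed frame duration in seconds
) -> DuplexTiming:
    """Summarise timing for two frame-aligned activity streams."""
    if len(user_active) != len(model_active):
        raise ValueError("streams must be frame-aligned and of equal length")
    n = len(user_active)
    if n == 0:
        return DuplexTiming(0.0, 0.0, 0.0)
    pairs = list(zip(user_active, model_active))
    overlap = sum(1 for u, m in pairs if u and m)
    both_silent = sum(1 for u, m in pairs if not u and not m)
    model_silent = sum(1 for m in model_active if not m)
    return DuplexTiming(
        overlap_s=overlap * frame_s,
        both_silent_s=both_silent * frame_s,
        model_silence_ratio=model_silent / n,
    )


# Toy example: the model starts speaking one frame before the user stops.
user = [True, True, True, False, False, False, False, False]
model = [False, False, True, True, True, False, False, False]
timing = duplex_timing(user, model, frame_s=0.5)
```

Treating silence as a counted outcome, instead of as missing data, is what lets the model's no-op behavior show up in the same summary as overlaps and gaps.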