---
abstract: |
  \looseness=-1

  We introduce `\ours{}`{=latex}, a speech-text foundation model and full-duplex spoken dialogue framework. Current systems for spoken dialogue rely on pipelines of independent components, namely voice activity detection, speech recognition, textual dialogue and text-to-speech. Such frameworks cannot emulate the experience of real conversations. First, their complexity induces a latency of several seconds between interactions. Second, because text is the intermediate modality for dialogue, non-linguistic information that modifies meaning, such as emotion or non-speech sounds, is lost in the interaction. Finally, they rely on a segmentation into speaker turns, which does not take into account overlapping speech, interruptions and interjections. `\ours{}`{=latex} solves these independent issues altogether by casting spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, `\ours{}`{=latex} generates speech as tokens from the residual quantizer of a neural audio codec, while separately modeling its own speech and that of the user as parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. We moreover extend the hierarchical semantic-to-acoustic token generation of previous work to first predict time-aligned text tokens as a prefix to audio tokens. Not only does this "`\interleaving{}`{=latex}" method significantly improve the linguistic quality of generated speech, but we also illustrate how it can provide streaming speech recognition and text-to-speech. Our resulting model is the first real-time full-duplex spoken large language model, with a theoretical latency of 160 ms (200 ms in practice), and is available at [github.com/kyutai-labs/moshi](https://github.com/kyutai-labs/moshi).
author:
- |
  `\name `{=latex}Alexandre Défossez$^{*}$ `\email `{=latex}alex@kyutai.org\
  `\AND`{=latex} `\name `{=latex}Laurent Mazaré$^{*}$\
  `\AND`{=latex} `\name `{=latex}Manu Orsini\
  `\AND`{=latex} `\name `{=latex}Amélie Royer\
  `\AND`{=latex} `\name `{=latex}Patrick Pérez\
  `\AND`{=latex} `\name `{=latex}Hervé Jégou\
  `\AND`{=latex} `\name `{=latex}Edouard Grave$^{*}$\
  `\AND`{=latex} `\name `{=latex}Neil Zeghidour$^{*}$ `\email `{=latex}neil@kyutai.org\
  `\addr `{=latex}Kyutai\
  `\addr `{=latex}$^*$Equal contribution\
bibliography:
- sample.bib
title: "Moshi: a speech-text foundation model for real-time dialogue"
---

\newcommand{\figleft}{{\em (Left)}}
\newcommand{\figcenter}{{\em (Center)}}
\newcommand{\figright}{{\em (Right)}}
\newcommand{\figtop}{{\em (Top)}}
\newcommand{\figbottom}{{\em (Bottom)}}
\newcommand{\captiona}{{\em (a)}}
\newcommand{\captionb}{{\em (b)}}
\newcommand{\captionc}{{\em (c)}}
\newcommand{\captiond}{{\em (d)}}
\newcommand{\newterm}[1]{{\bf #1}}
\def\figref#1{figure~\ref{#1}}
\def\Figref#1{Figure~\ref{#1}}
\def\twofigref#1#2{figures \ref{#1} and \ref{#2}}
\def\quadfigref#1#2#3#4{figures \ref{#1}, \ref{#2}, \ref{#3} and \ref{#4}}
\def\secref#1{section~\ref{#1}}
\def\Secref#1{Section~\ref{#1}}
\def\twosecrefs#1#2{sections \ref{#1} and \ref{#2}}
\def\secrefs#1#2#3{sections \ref{#1}, \ref{#2} and \ref{#3}}
\def\eqref#1{equation~\ref{#1}}
\def\Eqref#1{Equation~\ref{#1}}
\def\plaineqref#1{\ref{#1}}
\def\chapref#1{chapter~\ref{#1}}
\def\Chapref#1{Chapter~\ref{#1}}
\def\rangechapref#1#2{chapters~\ref{#1}--\ref{#2}}
\def\algref#1{algorithm~\ref{#1}}
\def\Algref#1{Algorithm~\ref{#1}}
\def\twoalgref#1#2{algorithms \ref{#1} and \ref{#2}}
\def\Twoalgref#1#2{Algorithms \ref{#1} and \ref{#2}}
\def\partref#1{part~\ref{#1}}
\def\Partref#1{Part~\ref{#1}}
\def\twopartref#1#2{parts \ref{#1} and \ref{#2}}
\def\ceil#1{\lceil #1 \rceil}
\def\floor#1{\lfloor #1 \rfloor}
\def\1{\bm{1}}
\newcommand{\train}{\mathcal{D}}
\newcommand{\valid}{\mathcal{D_{\mathrm{valid}}}}
\newcommand{\test}{\mathcal{D_{\mathrm{test}}}}
\def\eps{{\epsilon}}
\def\reta{{\textnormal{$\eta$}}}
\def\ra{{\textnormal{a}}}
\def\rb{{\textnormal{b}}}
\def\rc{{\textnormal{c}}}
\def\rd{{\textnormal{d}}}
\def\re{{\textnormal{e}}}
\def\rf{{\textnormal{f}}}
\def\rg{{\textnormal{g}}}
\def\rh{{\textnormal{h}}}
\def\ri{{\textnormal{i}}}
\def\rj{{\textnormal{j}}}
\def\rk{{\textnormal{k}}}
\def\rl{{\textnormal{l}}}
\def\rn{{\textnormal{n}}}
\def\ro{{\textnormal{o}}}
\def\rp{{\textnormal{p}}}
\def\rq{{\textnormal{q}}}
\def\rr{{\textnormal{r}}}
\def\rs{{\textnormal{s}}}
\def\rt{{\textnormal{t}}}
\def\ru{{\textnormal{u}}}
\def\rv{{\textnormal{v}}}
\def\rw{{\textnormal{w}}}
\def\rx{{\textnormal{x}}}
\def\ry{{\textnormal{y}}}
\def\rz{{\textnormal{z}}}
\def\rvepsilon{{\mathbf{\epsilon}}}
\def\rvtheta{{\mathbf{\theta}}}
\def\rva{{\mathbf{a}}}
\def\rvb{{\mathbf{b}}}
\def\rvc{{\mathbf{c}}}
\def\rvd{{\mathbf{d}}}
\def\rve{{\mathbf{e}}}
\def\rvf{{\mathbf{f}}}
\def\rvg{{\mathbf{g}}}
\def\rvh{{\mathbf{h}}}
\def\rvi{{\mathbf{i}}}
\def\rvj{{\mathbf{j}}}
\def\rvk{{\mathbf{k}}}
\def\rvl{{\mathbf{l}}}
\def\rvm{{\mathbf{m}}}
\def\rvn{{\mathbf{n}}}
\def\rvo{{\mathbf{o}}}
\def\rvp{{\mathbf{p}}}
\def\rvq{{\mathbf{q}}}
\def\rvr{{\mathbf{r}}}
\def\rvs{{\mathbf{s}}}
\def\rvt{{\mathbf{t}}}
\def\rvu{{\mathbf{u}}}
\def\rvv{{\mathbf{v}}}
\def\rvw{{\mathbf{w}}}
\def\rvx{{\mathbf{x}}}
\def\rvy{{\mathbf{y}}}
\def\rvz{{\mathbf{z}}}
\def\erva{{\textnormal{a}}}
\def\ervb{{\textnormal{b}}}
\def\ervc{{\textnormal{c}}}
\def\ervd{{\textnormal{d}}}
\def\erve{{\textnormal{e}}}
\def\ervf{{\textnormal{f}}}
\def\ervg{{\textnormal{g}}}
\def\ervh{{\textnormal{h}}}
\def\ervi{{\textnormal{i}}}
\def\ervj{{\textnormal{j}}}
\def\ervk{{\textnormal{k}}}
\def\ervl{{\textnormal{l}}}
\def\ervm{{\textnormal{m}}}
\def\ervn{{\textnormal{n}}}
\def\ervo{{\textnormal{o}}}
\def\ervp{{\textnormal{p}}}
\def\ervq{{\textnormal{q}}}
\def\ervr{{\textnormal{r}}}
\def\ervs{{\textnormal{s}}}
\def\ervt{{\textnormal{t}}}
\def\ervu{{\textnormal{u}}}
\def\ervv{{\textnormal{v}}}
\def\ervw{{\textnormal{w}}}
\def\ervx{{\textnormal{x}}}
\def\ervy{{\textnormal{y}}}
\def\ervz{{\textnormal{z}}}
\def\rmA{{\mathbf{A}}}
\def\rmB{{\mathbf{B}}}
\def\rmC{{\mathbf{C}}}
\def\rmD{{\mathbf{D}}}
\def\rmE{{\mathbf{E}}}
\def\rmF{{\mathbf{F}}}
\def\rmG{{\mathbf{G}}}
\def\rmH{{\mathbf{H}}}
\def\rmI{{\mathbf{I}}}
\def\rmJ{{\mathbf{J}}}
\def\rmK{{\mathbf{K}}}
\def\rmL{{\mathbf{L}}}
\def\rmM{{\mathbf{M}}}
\def\rmN{{\mathbf{N}}}
\def\rmO{{\mathbf{O}}}
\def\rmP{{\mathbf{P}}}
\def\rmQ{{\mathbf{Q}}}
\def\rmR{{\mathbf{R}}}
\def\rmS{{\mathbf{S}}}
\def\rmT{{\mathbf{T}}}
\def\rmU{{\mathbf{U}}}
\def\rmV{{\mathbf{V}}}
\def\rmW{{\mathbf{W}}}
\def\rmX{{\mathbf{X}}}
\def\rmY{{\mathbf{Y}}}
\def\rmZ{{\mathbf{Z}}}
\def\ermA{{\textnormal{A}}}
\def\ermB{{\textnormal{B}}}
\def\ermC{{\textnormal{C}}}
\def\ermD{{\textnormal{D}}}
\def\ermE{{\textnormal{E}}}
\def\ermF{{\textnormal{F}}}
\def\ermG{{\textnormal{G}}}
\def\ermH{{\textnormal{H}}}
\def\ermI{{\textnormal{I}}}
\def\ermJ{{\textnormal{J}}}
\def\ermK{{\textnormal{K}}}
\def\ermL{{\textnormal{L}}}
\def\ermM{{\textnormal{M}}}
\def\ermN{{\textnormal{N}}}
\def\ermO{{\textnormal{O}}}
\def\ermP{{\textnormal{P}}}
\def\ermQ{{\textnormal{Q}}}
\def\ermR{{\textnormal{R}}}
\def\ermS{{\textnormal{S}}}
\def\ermT{{\textnormal{T}}}
\def\ermU{{\textnormal{U}}}
\def\ermV{{\textnormal{V}}}
\def\ermW{{\textnormal{W}}}
\def\ermX{{\textnormal{X}}}
\def\ermY{{\textnormal{Y}}}
\def\ermZ{{\textnormal{Z}}}
\def\vzero{{\bm{0}}}
\def\vone{{\bm{1}}}
\def\vmu{{\bm{\mu}}}
\def\vtheta{{\bm{\theta}}}
\def\va{{\bm{a}}}
\def\vb{{\bm{b}}}
\def\vc{{\bm{c}}}
\def\vd{{\bm{d}}}
\def\ve{{\bm{e}}}
\def\vf{{\bm{f}}}
\def\vg{{\bm{g}}}
\def\vh{{\bm{h}}}
\def\vi{{\bm{i}}}
\def\vj{{\bm{j}}}
\def\vk{{\bm{k}}}
\def\vl{{\bm{l}}}
\def\vm{{\bm{m}}}
\def\vn{{\bm{n}}}
\def\vo{{\bm{o}}}
\def\vp{{\bm{p}}}
\def\vq{{\bm{q}}}
\def\vr{{\bm{r}}}
\def\vs{{\bm{s}}}
\def\vt{{\bm{t}}}
\def\vu{{\bm{u}}}
\def\vv{{\bm{v}}}
\def\vw{{\bm{w}}}
\def\vx{{\bm{x}}}
\def\vy{{\bm{y}}}
\def\vz{{\bm{z}}}
\def\evalpha{{\alpha}}
\def\evbeta{{\beta}}
\def\evepsilon{{\epsilon}}
\def\evlambda{{\lambda}}
\def\evomega{{\omega}}
\def\evmu{{\mu}}
\def\evpsi{{\psi}}
\def\evsigma{{\sigma}}
\def\evtheta{{\theta}}
\def\eva{{a}}
\def\evb{{b}}
\def\evc{{c}}
\def\evd{{d}}
\def\eve{{e}}
\def\evf{{f}}
\def\evg{{g}}
\def\evh{{h}}
\def\evi{{i}}
\def\evj{{j}}
\def\evk{{k}}
\def\evl{{l}}
\def\evm{{m}}
\def\evn{{n}}
\def\evo{{o}}
\def\evp{{p}}
\def\evq{{q}}
\def\evr{{r}}
\def\evs{{s}}
\def\evt{{t}}
\def\evu{{u}}
\def\evv{{v}}
\def\evw{{w}}
\def\evx{{x}}
\def\evy{{y}}
\def\evz{{z}}
\def\mA{{\bm{A}}}
\def\mB{{\bm{B}}}
\def\mC{{\bm{C}}}
\def\mD{{\bm{D}}}
\def\mE{{\bm{E}}}
\def\mF{{\bm{F}}}
\def\mG{{\bm{G}}}
\def\mH{{\bm{H}}}
\def\mI{{\bm{I}}}
\def\mJ{{\bm{J}}}
\def\mK{{\bm{K}}}
\def\mL{{\bm{L}}}
\def\mM{{\bm{M}}}
\def\mN{{\bm{N}}}
\def\mO{{\bm{O}}}
\def\mP{{\bm{P}}}
\def\mQ{{\bm{Q}}}
\def\mR{{\bm{R}}}
\def\mS{{\bm{S}}}
\def\mT{{\bm{T}}}
\def\mU{{\bm{U}}}
\def\mV{{\bm{V}}}
\def\mW{{\bm{W}}}
\def\mX{{\bm{X}}}
\def\mY{{\bm{Y}}}
\def\mZ{{\bm{Z}}}
\def\mBeta{{\bm{\beta}}}
\def\mPhi{{\bm{\Phi}}}
\def\mLambda{{\bm{\Lambda}}}
\def\mSigma{{\bm{\Sigma}}}
\newcommand{\tens}[1]{\bm{\mathsfit{#1}}}
\def\tA{{\tens{A}}}
\def\tB{{\tens{B}}}
\def\tC{{\tens{C}}}
\def\tD{{\tens{D}}}
\def\tE{{\tens{E}}}
\def\tF{{\tens{F}}}
\def\tG{{\tens{G}}}
\def\tH{{\tens{H}}}
\def\tI{{\tens{I}}}
\def\tJ{{\tens{J}}}
\def\tK{{\tens{K}}}
\def\tL{{\tens{L}}}
\def\tM{{\tens{M}}}
\def\tN{{\tens{N}}}
\def\tO{{\tens{O}}}
\def\tP{{\tens{P}}}
\def\tQ{{\tens{Q}}}
\def\tR{{\tens{R}}}
\def\tS{{\tens{S}}}
\def\tT{{\tens{T}}}
\def\tU{{\tens{U}}}
\def\tV{{\tens{V}}}
\def\tW{{\tens{W}}}
\def\tX{{\tens{X}}}
\def\tY{{\tens{Y}}}
\def\tZ{{\tens{Z}}}
\def\gA{{\mathcal{A}}}
\def\gB{{\mathcal{B}}}
\def\gC{{\mathcal{C}}}
\def\gD{{\mathcal{D}}}
\def\gE{{\mathcal{E}}}
\def\gF{{\mathcal{F}}}
\def\gG{{\mathcal{G}}}
\def\gH{{\mathcal{H}}}
\def\gI{{\mathcal{I}}}
\def\gJ{{\mathcal{J}}}
\def\gK{{\mathcal{K}}}
\def\gL{{\mathcal{L}}}
\def\gM{{\mathcal{M}}}
\def\gN{{\mathcal{N}}}
\def\gO{{\mathcal{O}}}
\def\gP{{\mathcal{P}}}
\def\gQ{{\mathcal{Q}}}
\def\gR{{\mathcal{R}}}
\def\gS{{\mathcal{S}}}
\def\gT{{\mathcal{T}}}
\def\gU{{\mathcal{U}}}
\def\gV{{\mathcal{V}}}
\def\gW{{\mathcal{W}}}
\def\gX{{\mathcal{X}}}
\def\gY{{\mathcal{Y}}}
\def\gZ{{\mathcal{Z}}}
\def\sA{{\mathbb{A}}}
\def\sB{{\mathbb{B}}}
\def\sC{{\mathbb{C}}}
\def\sD{{\mathbb{D}}}
\def\sF{{\mathbb{F}}}
\def\sG{{\mathbb{G}}}
\def\sH{{\mathbb{H}}}
\def\sI{{\mathbb{I}}}
\def\sJ{{\mathbb{J}}}
\def\sK{{\mathbb{K}}}
\def\sL{{\mathbb{L}}}
\def\sM{{\mathbb{M}}}
\def\sN{{\mathbb{N}}}
\def\sO{{\mathbb{O}}}
\def\sP{{\mathbb{P}}}
\def\sQ{{\mathbb{Q}}}
\def\sR{{\mathbb{R}}}
\def\sS{{\mathbb{S}}}
\def\sT{{\mathbb{T}}}
\def\sU{{\mathbb{U}}}
\def\sV{{\mathbb{V}}}
\def\sW{{\mathbb{W}}}
\def\sX{{\mathbb{X}}}
\def\sY{{\mathbb{Y}}}
\def\sZ{{\mathbb{Z}}}
\def\emLambda{{\Lambda}}
\def\emA{{A}}
\def\emB{{B}}
\def\emC{{C}}
\def\emD{{D}}
\def\emE{{E}}
\def\emF{{F}}
\def\emG{{G}}
\def\emH{{H}}
\def\emI{{I}}
\def\emJ{{J}}
\def\emK{{K}}
\def\emL{{L}}
\def\emM{{M}}
\def\emN{{N}}
\def\emO{{O}}
\def\emP{{P}}
\def\emQ{{Q}}
\def\emR{{R}}
\def\emS{{S}}
\def\emT{{T}}
\def\emU{{U}}
\def\emV{{V}}
\def\emW{{W}}
\def\emX{{X}}
\def\emY{{Y}}
\def\emZ{{Z}}
\def\emSigma{{\Sigma}}
\newcommand{\etens}[1]{\mathsfit{#1}}
\def\etLambda{{\etens{\Lambda}}}
\def\etA{{\etens{A}}}
\def\etB{{\etens{B}}}
\def\etC{{\etens{C}}}
\def\etD{{\etens{D}}}
\def\etE{{\etens{E}}}
\def\etF{{\etens{F}}}
\def\etG{{\etens{G}}}
\def\etH{{\etens{H}}}
\def\etI{{\etens{I}}}
\def\etJ{{\etens{J}}}
\def\etK{{\etens{K}}}
\def\etL{{\etens{L}}}
\def\etM{{\etens{M}}}
\def\etN{{\etens{N}}}
\def\etO{{\etens{O}}}
\def\etP{{\etens{P}}}
\def\etQ{{\etens{Q}}}
\def\etR{{\etens{R}}}
\def\etS{{\etens{S}}}
\def\etT{{\etens{T}}}
\def\etU{{\etens{U}}}
\def\etV{{\etens{V}}}
\def\etW{{\etens{W}}}
\def\etX{{\etens{X}}}
\def\etY{{\etens{Y}}}
\def\etZ{{\etens{Z}}}
\newcommand{\pdata}{p_{\rm{data}}}
\newcommand{\ptrain}{\hat{p}_{\rm{data}}}
\newcommand{\Ptrain}{\hat{P}_{\rm{data}}}
\newcommand{\pmodel}{p_{\rm{model}}}
\newcommand{\Pmodel}{P_{\rm{model}}}
\newcommand{\ptildemodel}{\tilde{p}_{\rm{model}}}
\newcommand{\pencode}{p_{\rm{encoder}}}
\newcommand{\pdecode}{p_{\rm{decoder}}}
\newcommand{\precons}{p_{\rm{reconstruct}}}
\newcommand{\laplace}{\mathrm{Laplace}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\Ls}{\mathcal{L}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\emp}{\tilde{p}}
\newcommand{\lr}{\alpha}
\newcommand{\reg}{\lambda}
\newcommand{\rect}{\mathrm{rectifier}}
\newcommand{\softmax}{\mathrm{softmax}}
\newcommand{\sigmoid}{\sigma}
\newcommand{\softplus}{\zeta}
\newcommand{\KL}{D_{\mathrm{KL}}}
\newcommand{\Var}{\mathrm{Var}}
\newcommand{\standarderror}{\mathrm{SE}}
\newcommand{\Cov}{\mathrm{Cov}}
\newcommand{\normlzero}{L^0}
\newcommand{\normlone}{L^1}
\newcommand{\normltwo}{L^2}
\newcommand{\normlp}{L^p}
\newcommand{\normmax}{L^\infty}
\newcommand{\parents}{Pa}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\sign}{sign}
\DeclareMathOperator{\Tr}{Tr}
\let\ab\allowbreak
\newcommand{\cmark}{\ding{51}}
\newcommand{\xmark}{}
\newcommand{\todo}[1]{\textcolor{red}{[TODO: #1]}}
\newcommand{\neilz}[1]{{\textcolor{blue}{[Neil: #1]}}}
\newcommand{\PPc}[1]{{\textcolor{orange}{[Pat: #1]}}}
\newcommand{\PP}[1]{{\textcolor{orange}{#1}}}
\newcommand{\dataset}{{\cal D}}
\newcommand{\fracpartial}[2]{\frac{\partial #1}{\partial  #2}}
\newcommand{\ours}{Moshi\xspace}
\newcommand{\mimi}{Mimi\xspace}
\newcommand{\helium}{Helium\xspace}
\newcommand{\spiritlm}{Spirit-LM\xspace}
\newcommand{\interleaving}{Inner Monologue\xspace}
\newcommand{\proba}[1]{\mathbb{P}\left[#1\right]}
\newcommand{\ensuremathmode}[1]{%
  \ifmmode
    #1%
  \else
    $#1$%
  \fi
}
\newcommand{\waveform}{\ensuremathmode{x}}
\newcommand{\waveformhat}{\ensuremathmode{\hat{x}}}
\newcommand{\tok}[2]{\ensuremathmode{A_{#1}^{#2}}}
\newcommand{\dims}{D}
\newcommand{\numsamples}{L}
\newcommand{\numtokens}{S}
\newcommand{\numquantizers}{\ensuremathmode{Q}}
\newcommand{\cardaudio}{\ensuremathmode{N_A}}
\newcommand{\pmr}[1]{\scriptsize$\pm$#1}
\maketitle
\medskip
\begin{keywords}
speech, text, multimodal, foundation, spoken dialogue
\end{keywords}
\newpage

# Introduction

\looseness=-1

Voice has provided a convenient interface to early conversational systems, from Alexa[^1] to Siri[^2] and Google Assistant.[^3] In this context, a "wake word" spoken by the user typically triggers an automatic speech recognition (ASR) system which transcribes the user's subsequent request. Then, a natural language understanding (NLU) pipeline converts this query to a structured format used to produce a text answer through natural language generation (NLG). Finally, a text-to-speech (TTS) system speaks the answer back to the user. While this process can handle short, constrained interactions (e.g. triggering an action or retrieving a fact), the rise of large language models (LLMs) [@gpt3; @chinchilla; @llama1] has called for a corresponding extension of voice interfaces to multi-turn, open-ended conversations. A solution to this challenge is handling the NLU and NLG with an LLM, while the ASR and TTS provide the voice interface during the user's and the system's turn respectively [@llama3herd]. This framework supports the current generation of spoken dialogue systems such as Gemini [@team2023gemini] or ChatGPT.[^4]

\looseness=-1

Yet, the experience offered by these interfaces remains far from natural conversations. First, latency compounds along the many components of these pipelines, resulting in a typical global latency of several seconds. This is unlike natural conversations, which demonstrate response times of a few hundred milliseconds. Second, as language understanding and generation happen in the textual domain, any non-written information is ignored by the model. This ranges from paralinguistic information, such as emotion and accent, to non-speech audio, such as surrounding acoustic events. Finally, these models remain fundamentally turn-based, assuming that dialogue is a sequence of well-defined single-speaker segments. While this paradigm is suited to text dialogue, it falls short in modeling aspects of spoken conversations such as interruptions, overlapping speech (which accounts for $10$ to $20\%$ of spoken time [@overlap_stats_2]) and backchanneling (i.e. non-interrupting interjections such as "OK" or "I see").

\looseness=-1

In this work we introduce `\ours{}`{=latex}, a speech-text foundation model and real-time spoken dialogue system that aims at solving the aforementioned limitations: latency, textual information bottleneck and turn-based modeling. `\ours{}`{=latex} augments a text LLM backbone with a smaller audio language model [@audiolm; @uniaudio] that ingests and predicts discrete audio units. This removes the information bottleneck of text by understanding inputs and generating outputs directly in the audio domain, while benefiting from the knowledge and reasoning abilities of the underlying text LLM. We extend previous work on audio language models and design a streaming, hierarchical architecture, with a theoretical latency of 160 ms, lower than the 230 ms average in natural conversations measured over 10 languages [@Stivers2009UniversalsAC]. We furthermore introduce the first multi-stream audio language model, i.e. a model that explicitly processes the input and output audio streams jointly as two autoregressive token streams. This removes the concept of speaker turn altogether and thus allows training the model on natural conversations with arbitrary dynamics including overlap and interruptions. Our resulting model is the first *full-duplex* real-time conversational LLM: it always listens and always generates sound, either speech or silence. We summarize our contributions below:

- We present `\helium`{=latex}, a 7B-parameter text LLM that we pretrain on $2.1$T tokens of public English data. [Section `\ref{sec:helium}`{=latex}](#sec:helium) describes the architecture and training of the model, while [Section `\ref{sec:text_dataset}`{=latex}](#sec:text_dataset) provides details on the pretraining data collection and filtering. `\looseness=-1`{=latex}

- We train `\mimi`{=latex}, a neural audio codec [@soundstream; @encodec] that converts audio into the discrete tokens predicted by `\ours{}`{=latex} and back, using residual vector quantization (RVQ). Audio language models typically combine such *acoustic* tokens with *semantic* tokens from a self-supervised speech model, as this is necessary to produce intelligible speech in the absence of text conditioning [@audiolm]. We instead extend the approach of @zhang2024speechtokenizer by distilling semantic information into the first level of acoustic tokens and introduce improved training tricks. [Section `\ref{sec:mimi}`{=latex}](#sec:mimi) describes the architecture and training of `\mimi `{=latex}while [Section `\ref{sec:mimi_eval}`{=latex}](#sec:mimi_eval) details ablation studies. `\looseness=-1`{=latex}

- We propose `\ours{}`{=latex}, a new architecture for audio language modeling, which combines Helium with a smaller Transformer [@attentionvaswani] model to predict audio tokens in a hierarchical and streaming fashion. We show how challenging it is for such unconditioned audio language models to generate intelligible speech, and we provide solutions that outperform the intelligibility and audio quality of non-streaming models while generating audio in a streaming fashion. We furthermore extend this architecture to model several audio streams in parallel, allowing for a conceptually and practically simple handling of full-duplex dialogues with arbitrary dynamics. [Section `\ref{sec:modeling}`{=latex}](#sec:modeling) describes this architecture. `\looseness=-1`{=latex}

- In [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving), we introduce `\interleaving`{=latex}, a new training and inference setup for audio language models that significantly improves the factuality and linguistic quality of generated speech by predicting time-aligned text tokens before audio tokens. `\ours{}`{=latex} is a speech-to-speech model as it allows reasoning about non-linguistic information, both from the user audio and from `\ours{}`{=latex}'s audio. Yet, this is not incompatible with `\ours{}`{=latex} producing text along with its speech output. Based on the past observation [@audiolm; @zhang2024speechtokenizer] that coarse-to-fine generation (from semantic to acoustic tokens) is critical to generating consistent speech, we extend this hierarchy by using text tokens as a per-timestep prefix to the semantic token. Our experiments show that not only does this drastically improve the length and quality of generated speech, but forcing a delay between text and audio tokens also allows deriving streaming ASR and streaming TTS from a `\ours{}`{=latex} model.

- We evaluate all components of `\ours{}`{=latex} along several axes, including text understanding, speech intelligibility and consistency, audio quality and spoken question answering. Our experiments, reported in [Section `\ref{sec:evaluation}`{=latex}](#sec:evaluation), show that our model is state of the art among existing speech-text models for speech modeling and spoken question answering while being streaming compatible and able to model several minutes of context (5 min in our experiments).
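The residual vector quantization (RVQ) mentioned in the contributions above can be illustrated with a minimal sketch. It is not the implementation of the codec: the codebook values, the two-level depth and the function names `rvq_encode` and `rvq_decode` are assumptions chosen for illustration. Each level quantizes the residual left by the previous levels, so the first code is the coarsest and later codes refine it.

```python
from typing import List, Sequence

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Produce one code per level; each level quantizes the remaining residual."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct a vector by summing the selected entry of every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy, hand-written codebooks (hypothetical values): a coarse level, then a fine one.
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],
]

codes = rvq_encode([1.2, 0.9], CODEBOOKS)
recon = rvq_decode(codes, CODEBOOKS)
print(codes, recon)
```

In a real codec the codebooks are learned and applied to the latent frames of an encoder; the sketch only shows why a frame is represented by several tokens, one per quantizer level.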

We encourage the reader to talk to `\ours{}`{=latex} using our web demo.[^5]

\newpage

# Related Work

#### Audio Language Modeling.

\looseness=-1

Early developments in speech foundation models have improved speech understanding across many discriminative tasks, from automatic speech recognition (ASR) [@wav2vec2; @whisper; @google_usm] to speaker verification [@wavlm] and speech classification [@superb]. A key factor in this development is self-supervised learning [@hubert; @wav2vec2; @wavlm] which allows learning generic, discriminative speech representations. As these speech understanding models build on previous work done on masked language modeling for text [@bert], generative text pretraining [@gpt] has similarly inspired a large family of speech generation models. In particular, @gslm propose quantizing aforementioned self-supervised representations. The resulting discrete *audio tokens* represent a speech segment as a sequence of categorical variables, thus casting speech generation as a language modeling task. AudioLM [@audiolm] furthermore combines these *semantic* tokens with *acoustic* tokens from a neural audio codec [@soundstream], which allows for modeling arbitrary voices, recording conditions and non-speech sounds. These audio language models have redefined the state of the art in speech generation, from text-to-speech [@valle; @spear_tts] to speech-to-speech translation [@audiopalm; @gemini_1.5] and speech enhancement [@uniaudio]. Beyond these supervised tasks, a parallel line of work has explored training and scaling unsupervised audio-only models, trained for autoregressive speech generation [@zerospeech21; @gslm; @audiolm]. The abilities of these models have progressively expanded, from generating short sentences in a single speaker voice [@gslm] to producing meaningful and consistent speech continuations across dozens of seconds in arbitrary voices and conditions [@audiolm], thanks to a hierarchical modeling of semantic and acoustic tokens. A main challenge is that audio requires the modeling of long sequences, up to a few minutes, to produce meaningful and exploitable outputs. 
Moreover, latent representations for audio are typically less compact than equivalent representations for text. Thus, discrete representations from neural audio codecs require multiple predictions per timestep when modeled autoregressively. @liu2023audioldm and @evans2024stable use latent diffusion [@ho2020denoising] for general audio and music modeling to alleviate the need for hierarchical discrete tokens. However, these methods cannot be used in a streaming fashion, and it is unclear whether they could generate consistent speech. @musicgen instead show that the number of autoregressive steps can be reduced by introducing a delay between the different levels of tokens, and performing parallel prediction over them. Inspired by the RQ-Transformer method by @rqtransformer and the hierarchical MegaByte transformer model [@yu2024megabyte], @uniaudio and @zhu2024generativepretrainedspeechlanguage leverage a smaller nested transformer to model the different tokens at a single time step. In this work, we extend these previous works to push the limits of autoregressive speech generation by proposing a scalable hierarchical modeling of audio tokens which can handle several minutes of context while generating audio in real time. Still, while speech-only models learn linguistic structure (lexicon, syntax, semantics) from raw speech [@zerospeech21], they typically demonstrate poor-to-nonexistent factual knowledge and reasoning abilities. This has led to the development of speech-text models, intended to combine the knowledge and reasoning abilities of text models with the generative power of audio models.
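The idea of delaying token levels so that they can be predicted in parallel can be sketched as follows. This is an illustrative reading of the delay pattern, not code from any of the cited systems: the helper names `apply_delays` and `undo_delays`, the `PAD` placeholder and the choice of one extra step of delay per level are assumptions. With `K` levels and `T` frames, predicting every token in turn takes `K * T` steps, whereas the delayed grid has only `T + max(delays)` columns, each predicted in one step.

```python
from typing import List, Optional, Sequence

PAD = None  # placeholder for grid positions that hold no token


def apply_delays(
    tokens: Sequence[Sequence[str]], delays: Sequence[int]
) -> List[List[Optional[str]]]:
    """Shift level k to the right by delays[k]; tokens[k][t] is level k at frame t."""
    num_frames = len(tokens[0])
    total = num_frames + max(delays)
    grid = []
    for level, delay in zip(tokens, delays):
        row: List[Optional[str]] = [PAD] * total
        row[delay:delay + num_frames] = list(level)
        grid.append(row)
    return grid


def undo_delays(
    grid: Sequence[Sequence[Optional[str]]], delays: Sequence[int]
) -> List[List[Optional[str]]]:
    """Invert apply_delays and recover the frame-aligned tokens."""
    num_frames = len(grid[0]) - max(delays)
    return [list(row[d:d + num_frames]) for row, d in zip(grid, delays)]


tokens = [
    ["a0", "a1", "a2", "a3"],  # level 1 (coarsest)
    ["b0", "b1", "b2", "b3"],  # level 2
    ["c0", "c1", "c2", "c3"],  # level 3
]
delays = [0, 1, 2]  # assumed: one extra step of delay per level

grid = apply_delays(tokens, delays)
flattened_steps = len(tokens) * len(tokens[0])  # 12 steps if every token is predicted in turn
delayed_steps = len(grid[0])                    # 6 steps with parallel prediction per column
print(grid[1], flattened_steps, delayed_steps)
```

The cost of the pattern is that level `k` of a frame only becomes available `delays[k]` steps after level 1, which is the source of the extra latency a streaming system has to budget for.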

#### Speech-text Models.

\looseness=-1

Such models typically start from a pretrained text language model and either finetune it to predict audio [@twist], or propose a speech-text finetuning task [@audiopalm; @voxtlm; @spectron; @spiritlm; @mitsui2024pslmparallelgenerationtext; @zhang2024speechgpt]. For instance, AudioPaLM [@audiopalm] starts from a pretrained PaLM [@palm] model, and extends its text vocabulary with semantic audio tokens. Then, the model is trained on a mixture of speech-text tasks, including TTS, ASR and speech-to-speech translation. VoxtLM [@voxtlm] adopts a similar approach for TTS and ASR. While these models are trained in a supervised fashion with specific input and output sequences, Spirit-LM [@spiritlm] uses temporal alignment between speech and its transcript to perform modality switch (from speech tokens to text tokens, or conversely) inside a sequence. This allows the model to learn consistent internal representations of language regardless of whether it is represented as text or speech, as measured through commonsense evaluation. Another approach, adopted by Spectron [@spectron], SpeechGPT [@speechgpt_emnlp] and PSLM [@mitsui2024pslmparallelgenerationtext], combines speech and text in a hierarchical manner rather than as interchangeable representations. Similar to how AudioLM [@audiolm] decomposes speech generation into predicting semantic tokens and *then* acoustic tokens, Spectron and SpeechGPT use a "Chain-of-Modality" and first produce an utterance as text tokens, subsequently used as a prefix to generate speech. This allows guiding speech generation with the output of an underlying text LLM; however, it is fundamentally incompatible with live interactions as the model needs to produce an entire answer as text before it starts speaking. PSLM alleviates this limitation by modeling text and speech tokens in parallel. In this work, we propose `\interleaving `{=latex}as a main architectural and training component to combine aligned text and speech data.
`\interleaving `{=latex}decomposes speech into a chain of text, semantic and acoustic tokens, and predicts this structured sequence in a hierarchical manner. Unlike Spirit-LM, this allows representing all utterances both as text and speech, rather than switching between modalities. In addition, the integration of acoustic tokens into the same generative model enables generating arbitrary voices and conditions, rather than a single speaker. Besides, this hierarchical modeling described in [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving) allows decomposing the generation task without increasing the sequence length of the Transformer [@attentionvaswani] outputs, unlike Chain-of-Modality, while benefiting from producing text as a prefix to audio tokens rather than in parallel like PSLM. Moreover, `\interleaving `{=latex}decomposes speech on a per-frame basis, which means that each prediction step outputs a speech frame. This is unlike Spectron and SpeechGPT, which require generating a complete sequence as text before generating audio tokens, and it makes `\ours{}`{=latex} compatible with real-time generation. We also show in [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving) how `\interleaving`{=latex}, when combined with a delay between token types, allows deriving streaming TTS and ASR systems from `\ours{}`{=latex}. Finally, while Spectron, SpeechGPT and PSLM model both user and system speech and text tokens as a single stream, which requires properly segmented turns, `\ours{}`{=latex} benefits from a novel multi-stream architecture which removes the concept of speaker turns and allows for modeling any type of overlap, interruptions and interjections.
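To make the per-frame decomposition concrete, the sketch below places text tokens on an audio frame grid so that every prediction step carries one text slot followed by that frame's audio tokens. It is a simplified illustration under stated assumptions, not the procedure used to train the model: the 80 ms frame duration, the `<pad>` filler, the one-token-per-word simplification, the word timestamps and the helper names are all hypothetical.

```python
from typing import List, Sequence, Tuple

PAD = "<pad>"  # hypothetical filler for frames where no new text token starts


def align_text_to_frames(
    words: Sequence[Tuple[str, int]], num_frames: int, frame_ms: int = 80
) -> List[str]:
    """Place each (token, start time in ms) on the first free frame at or after its start."""
    frames = [PAD] * num_frames
    next_free = 0
    for token, start_ms in words:
        idx = max(start_ms // frame_ms, next_free)
        if idx >= num_frames:
            break  # the token starts after the end of the audio
        frames[idx] = token
        next_free = idx + 1
    return frames


def per_frame_steps(
    text_frames: Sequence[str], audio_frames: Sequence[str]
) -> List[Tuple[str, str]]:
    """Pair each frame's text token (the prefix) with that frame's audio tokens."""
    return list(zip(text_frames, audio_frames))


# Hypothetical word timestamps in milliseconds.
words = [("hello", 0), ("there", 170), ("friend", 200)]
text_frames = align_text_to_frames(words, num_frames=6)
audio_frames = [f"audio_{t}" for t in range(6)]  # stand-ins for codec tokens
steps = per_frame_steps(text_frames, audio_frames)
print(text_frames)
```

Every step already contains an audio frame, so playback can start at the first step. A Chain-of-Modality layout would instead emit all three text tokens before the first audio frame.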

#### Spoken Dialogue Models.

\looseness=-1

Spoken dialogue is one of the less explored tasks in speech generation, as it requires addressing several challenges: 1) the model should run in real time and allow for long conversations in full-duplex, i.e. the model always listens and can speak at any moment; 2) it should be speech-to-speech to handle paralinguistic communication; 3) it should display knowledge and reasoning abilities that make it amenable to helpful and enjoyable conversations. Spectron benefits from its underlying text LLM (as measured by spoken question answering); however, it is not compatible with real-time generation due to Chain-of-Modality. PSLM proposes generating speech and text tokens in parallel to reduce this latency; however, this reduces the quality of answers, and the model still relies on ASR, which removes paralinguistic information. More importantly, these models cannot handle full-duplex communication, where there is no boundary between speaker turns, as any side of the conversation can be active at any time. An attempt at modeling these dynamics has been proposed by @fullduplexfsm, with an ASR system running in parallel to a text generator that feeds into a streaming TTS. While this allows modeling more complex scenarios than previous approaches, it still relies on a cascaded pipeline and models both the user's and the system's speech as a single token stream, which is challenging in the presence of significant overlap. The only previous full-duplex dialogue system is dGSLM [@nguyen-etal-2023-generative], which models user and system speech as separate audio token streams and proposes a Siamese architecture to process both streams jointly. While dGSLM is full-duplex, it remains a proof-of-concept: it does not run in an online fashion, it does not benefit from the knowledge of a text language model, and it does not model acoustic information as it only models semantic tokens.
`\ours{}`{=latex} addresses these limitations altogether: by modeling two streams of semantic and acoustic tokens hierarchically, `\ours{}`{=latex} is full duplex and can exploit all the information from the user (linguistic and non-linguistic) while producing speech in real time. Thanks to text pretraining and `\interleaving`{=latex}, `\ours{}`{=latex} benefits from the knowledge of its `\helium `{=latex}backbone. Finally, as the model produces acoustic tokens along with text and semantic tokens, it can generate an arbitrary range of emotions, voices and acoustic conditions. To the best of our knowledge, `\ours{}`{=latex} is the first audio language model that successfully addresses the many aforementioned challenges of spoken dialogue.

# Model

## Overview

<figure id="fig:moshi-overview" data-latex-placement="t">
<img src="figures/overview_moshi_v2.png" style="width:85.0%" />
<figcaption><strong>Overview of Moshi</strong>. Moshi is a speech-text foundation model which enables real-time spoken dialogue. The main components of Moshi’s architecture are: a bespoke text language model backbone (Helium, see the <a href="#sec:helium">corresponding section</a>); a neural audio codec with residual vector quantization and with semantic knowledge distilled from a self-supervised speech model (Mimi, see the <a href="#sec:mimi">corresponding section</a>); the streaming, hierarchical generation of semantic and acoustic tokens for both the user and Moshi, along with time-aligned text tokens for Moshi when using Inner Monologue (see the <a href="#sec:modeling">corresponding section</a>).</figcaption>
</figure>

`\ours{}`{=latex} is a multi-stream speech-to-speech Transformer model, which allows for full-duplex spoken dialogue with a user thanks to an innovative architecture summarized in [Figure `\ref{fig:moshi-overview}`{=latex}](#fig:moshi-overview). `\ours{}`{=latex} is built on top of Helium, a text LLM which we built from scratch ([Section `\ref{sec:helium}`{=latex}](#sec:helium)), relying on high-quality text data to provide strong reasoning abilities to the model. We also propose `\interleaving{}`{=latex} ([Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving)), a training and inference procedure in which we jointly model text and audio tokens. This allows the model to fully exploit the knowledge imparted by the text modality, while remaining a speech-to-speech system. To enable real-time dialogue, we also design `\ours{}`{=latex} as a multi-stream architecture from the start ([Section `\ref{sec:multistream}`{=latex}](#sec:multistream)): the model is able to both speak and listen to the user at the same time, and does not need to explicitly model speaker turns. In addition, to capture the input user audio and output `\ours{}`{=latex}'s voice with high quality and in an efficient manner, we propose Mimi ([Section `\ref{sec:mimi}`{=latex}](#sec:mimi)), a neural audio codec combining semantic and acoustic information into a single tokenizer by using residual vector quantization and knowledge distillation. To jointly model the audio streams from `\ours{}`{=latex} and the user, as well as `\ours{}`{=latex}'s text tokens, we rely on a Depth Transformer compatible with streaming inference ([Sections `\ref{sec:joint_ar}`{=latex}](#sec:joint_ar) and [`\ref{sec:audio_modeling}`{=latex}](#sec:audio_modeling)).

In this section, we further detail each of these components. We then describe the training datasets and the different training phases we used to train `\ours{}`{=latex} in [Section `\ref{sec:maindatasets}`{=latex}](#sec:maindatasets). Finally, in [Section `\ref{sec:evaluation}`{=latex}](#sec:evaluation), we report thorough evaluation results on `\ours{}`{=latex}'s abilities, both linguistic and acoustic, as well as ablation experiments on its main components, while [Section `\ref{sec:safety}`{=latex}](#sec:safety) provides analyses on the safety of our system.

## The `\helium `{=latex}Text Language Model {#sec:helium}

### Architecture

`\helium `{=latex}is an autoregressive language model, based on the Transformer architecture [@attentionvaswani]. Following previous work in this area, we make several changes to the original architecture. First, we use RMS normalization [@zhang2019root] at the input of the attention blocks, the feed-forward blocks and the output linear layer of the model. Second, we use rotary positional embeddings (RoPE) [@su2024roformer], a context length of 4,096 tokens and FlashAttention [@dao2022flashattention] for efficient training. Finally, we change the architecture of the feed-forward blocks and use Gated Linear Units [@shazeer2020glu], with the SiLU activation as a gating function [@hendrycks2016gaussian]. Our tokenizer is based on the unigram model from SentencePiece [@kudo2018sentencepiece], and contains 32,000 elements mostly targeting English. We split all numbers into single digits, and use byte-backoff to ensure that our tokenizer does not lose information. We train the model with the AdamW [@loshchilov2017decoupled] optimizer, with a fixed learning rate followed by a cosine learning rate decay [@loshchilov2016sgdr].

+:-----------------------+:-----------------:+:-------------------------:+:-----------------:+:-----------------:+:-----------------:+
|                        | Helium training   | Moshi training            |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Hyper-parameter        | pre-training      | pre-training              | post-training     | Fisher            | instruction       |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| *Temporal Transformer* |                   |                           |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Model dimension        | 4096              | same                      |                   |                   |                   |
+------------------------+-------------------+                           +-------------------+-------------------+-------------------+
| MLP dimension          | 11264             |                           |                   |                   |                   |
+------------------------+-------------------+                           +-------------------+-------------------+-------------------+
| Number of heads        | 32                |                           |                   |                   |                   |
+------------------------+-------------------+                           +-------------------+-------------------+-------------------+
| Number of layers       | 32                |                           |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Context size           | 4096              | 3000 steps, i.e. 4 min.   |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Learning rate          | $3 \cdot 10^{-4}$ | $3 \cdot 10^{-5}$         | $3 \cdot 10^{-6}$ | $2 \cdot 10^{-6}$ | $2 \cdot 10^{-6}$ |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| *Depth Transformer*    |                   |                           |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Model dimension        | \-                | 1024                      |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| MLP dimension          | \-                | 4096                      |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Number of heads        | \-                | 16                        |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Number of layers       | \-                | 6                         |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Learning rate          | \-                | $2 \cdot 10^{-4}$         | $5 \cdot 10^{-5}$ | $4 \cdot 10^{-6}$ | $2 \cdot 10^{-6}$ |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| *Input / Output space* |                   |                           |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Text cardinality       | 32000             | 32000                     |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Audio cardinality      | \-                | 2048                      |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Frame rate             | \-                | 12.5 Hz                   |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| *Common parameters*    |                   |                           |                   |                   |                   |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Batch size (text)      | 4.2M tok.         | 1.2M tok.                 | 1.2M tok.         | \-                | \-                |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Batch size (audio)     | \-                | 16h                       | 8h                | 40min             | 2.7h              |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Training steps         | 500k              | 1M                        | 100k              | 10k               | 30k               |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| LR Schedule            | cosine            | cosine                    | \-                | \-                | \-                |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Acoustic delay         | \-                | 2                         | 1                 | 1                 | 1                 |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+
| Text delay             | \-                | $\pm$`<!-- -->`{=html}0.6 | 0                 | 0                 | 0                 |
+------------------------+-------------------+---------------------------+-------------------+-------------------+-------------------+

: `\label{tab:hparams}`{=latex}**Models' hyper-parameters**. Hyper-parameters for the architecture and training of our 7B-parameter Helium language model and of `\ours`{=latex}, our speech-text dialogue model. The training of `\ours{}`{=latex} goes through 4 phases: pre-training on unsupervised data (with the Temporal Transformer initialized from `\helium`{=latex}); post-training with simulated multi-stream based on diarization; fine-tuning on the Fisher dataset [@cieri2004fisher] to gain full-duplex capabilities; instruction fine-tuning on a custom dataset built from synthetic interaction scripts. During the pre-training phase, we keep training half of the time on full text batches from the same dataset as used for Helium, using a separate optimizer state.

### Pre-training data filtering {#sec:data_filtering}

Training data is one of the critical ingredients to train LLMs: we now describe our method to obtain a large and high-quality text dataset. We start from high-quality data sources, such as Wikipedia, Stack Exchange and a large collection of scientific articles. As the quantity of data from these sources is too small to train an LLM, we also rely on web-crawled data, specifically from CommonCrawl, to extend our dataset. See more details on data sources in [Section `\ref{sec:text_dataset}`{=latex}](#sec:text_dataset). Web data requires extensive processing to obtain a high-quality training set: we perform deduplication, language identification and quality filtering. In the following, we describe each operation in more detail.

#### Deduplication.

We start from the WET files, which contain only the text content of webpages, as extracted by the CommonCrawl project. Because this format contains all the text of a page, it includes a lot of boilerplate such as navigation menus. Thus, the first step of our pipeline is to deduplicate each shard (there are 100 shards per crawl) at the *line level*, to remove this boilerplate. To do so, we compute the `FNV-1a`[^6] hash of each line, and use a Bloom filter to remove duplicates. We also train a `fastText` [@joulin2016bag] classifier on duplicates *vs.* non-duplicates, to perform fuzzy deduplication: here we only remove blocks of at least 3 consecutive lines that are classified as duplicates.
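The exact line-level step can be illustrated with a minimal sketch. It is not the actual pipeline: the 64-bit FNV-1a constants are the standard ones, but the Bloom filter size, the number of probes and the helper names (`BloomFilter`, `dedup_lines`) are assumptions made for this example.

```python
from typing import Iterable, Iterator, List

FNV64_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV offset basis
FNV64_PRIME = 0x100000001B3        # standard 64-bit FNV prime
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the state, then multiply by the prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative, not the paper's settings."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: int) -> Iterator[int]:
        # Double hashing: derive all probe positions from the two 32-bit halves.
        h1 = key & 0xFFFFFFFF
        h2 = (key >> 32) | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_if_new(self, key: int) -> bool:
        """Insert the key; return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(key):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def dedup_lines(lines: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each line within one shard."""
    seen = BloomFilter()
    return [ln for ln in lines if seen.add_if_new(fnv1a_64(ln.encode("utf-8")))]


shard = ["Home | About", "Real content", "Home | About", "More content"]
print(dedup_lines(shard))
```

A Bloom filter never misses a true duplicate but can occasionally flag a new line as already seen, so a small fraction of unique lines may be dropped; this is the usual price for a memory footprint that does not grow with the number of lines.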

#### Language identification.

Once deduplication is performed, we apply a language identifier based on `fastText` to keep English data only. Language identification is performed at the document level, and we only keep documents above a certain threshold (0.85).

#### Quality filtering.

The last step is to filter the remaining data, to keep high-quality webpages only. To perform this step, we train a `fastText` classifier on lines from our high-quality data sources and from random CommonCrawl webpages. We obtain a classifier with 9 categories, corresponding to our different high-quality sources such as Wikipedia or Wikibooks and to subsets of StackExchange such as STEM or humanities. The motivation is to obtain finer control over which documents to keep, based not only on similarity to high-quality sources, but also on their domains. This classifier is applied at the *line level*, and an aggregated score is obtained by averaging the scores of the lines, weighted by line length. Again, we keep documents corresponding to scores above a certain threshold.
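The length-weighted aggregation can be sketched as follows. This is an illustration under assumptions: line length is measured in characters, the per-line scores are plain dictionaries standing in for classifier outputs, and the function name `document_scores` and the category labels are hypothetical.

```python
from typing import Dict, Sequence


def document_scores(
    lines: Sequence[str],
    line_scores: Sequence[Dict[str, float]],
) -> Dict[str, float]:
    """Average the per-line category scores, weighting each line by its length."""
    total_length = sum(len(line) for line in lines)
    if total_length == 0:
        return {}
    aggregated: Dict[str, float] = {}
    for line, scores in zip(lines, line_scores):
        weight = len(line) / total_length
        for label, score in scores.items():
            aggregated[label] = aggregated.get(label, 0.0) + weight * score
    return aggregated


# Toy document: a 4-character line and a 2-character line.
lines = ["aaaa", "bb"]
scores = [
    {"wikipedia": 0.9, "random": 0.1},
    {"wikipedia": 0.3, "random": 0.7},
]
print(document_scores(lines, scores))
```

Weighting by length prevents many short boilerplate lines from dominating the score of a document whose bulk is a few long, high-quality paragraphs.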

## Audio Tokenization {#sec:mimi}

To discretize waveforms into audio tokens, we introduce Mimi, a neural audio codec [@soundstream; @encodec] that operates as an autoencoder with a discrete bottleneck [@vqvae]. In the literature, and following the terminology defined by @audiolm, these tokens are referred to as *acoustic* tokens, as they model fine audio details and are optimized for high-quality reconstruction. While these acoustic tokens provide appropriate targets for conditioned text-to-audio models (e.g. text-to-speech [@valle] or text-to-music [@musicgen]), unconditioned speech generation requires combining them with *semantic* tokens extracted from self-supervised speech models [@wav2vec2; @hubert; @w2vbert]. Unlike their acoustic counterparts, semantic tokens do not allow for reconstructing high-quality audio but correlate strongly with linguistic content. This similarity with language makes it possible to generate intelligible and consistent speech, even without text conditioning, by using semantic audio tokens as a prefix when predicting acoustic tokens. Yet, this hybrid tokenization approach is not compatible with real-time generation. Semantic tokens are typically not causal and can thus only be computed in an offline manner. Moreover, generating acoustic and semantic tokens with separate encoders represents a non-negligible computational burden. Consequently, and taking inspiration from previous work on SpeechTokenizer [@zhang2024speechtokenizer], Mimi uses distillation to transfer non-causal, high-level semantic information into the tokens produced by a causal model, allowing for streaming encoding and decoding of semantic-acoustic tokens.

<figure id="fig:split_quantizer" data-latex-placement="t">
<img src="figures/mimi_arch_training.png" />
<figcaption><strong>Architecture and training of Mimi, our neural audio codec, with its split residual vector quantization</strong>. During training (blue part, top), we distill non-causal embeddings from WavLM <span class="citation" data-cites="wavlm"></span> into a single vector quantizer which produces semantic tokens, and is combined with separate acoustic tokens for reconstruction. </figcaption>
</figure>

### Architecture

Our baseline architecture takes inspiration from SoundStream [@soundstream] and Encodec [@encodec] and consists of a SeaNet [@tagliasacchi2020seanet] autoencoder and a Residual Vector Quantizer [@soundstream]. The encoder projects a single-channel waveform $\waveform \in \mathbb{R}^{\numsamples}$ to a latent representation $\mathrm{enc}(x) \in \mathbb{R}^{\numtokens\times\dims}$ by cascading residual convolutional blocks that interleave dilated [@oord2016wavenet] and strided convolutions along with ELU [@elu] non-linearities and Weight Normalization [@weightnorm]. All convolutions are causal, such that this autoencoder can run in a streaming fashion. With 4 convolutional blocks and respective striding factors $(4, 5, 6, 8)$, and a final 1D convolution with stride 2, `\mimi`{=latex}'s encoder projects a 24kHz waveform to a latent representation of $12.5$ frames per second and dimension $D = 512$. Symmetrically, the decoder adopts a similar structure but with transposed convolutions rather than strided ones, to project the latent representation back to 24kHz audio. We discretize the latent space with a Residual Vector Quantizer [@soundstream], which iteratively applies vector quantization (VQ) to the residuals of the previous quantizer. With `\numquantizers{}`{=latex} quantizers, each with a codebook of `\cardaudio{}`{=latex} centroids, the RVQ discretizes the latent space into $\{1,\dots,\cardaudio\}^{\numtokens\times\numquantizers}$. As a baseline, we train this model with a combination of reconstruction and adversarial losses, following the setup of Encodec [@encodec]. We detail below the main changes of `\mimi `{=latex}with respect to this default configuration.
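To make the residual quantization step concrete, here is a minimal pure-Python sketch of RVQ encoding for a single latent vector. It only illustrates the principle: the codebooks are tiny hand-written examples rather than learnt ones, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the centroid closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(
    x: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], Vector]:
    """Quantize x level by level; each level quantizes the previous residual."""
    residual = list(x)
    reconstruction = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        centroid = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, centroid)]
        residual = [r - c for r, c in zip(residual, centroid)]
    return indices, reconstruction


# Two levels, three centroids each, in a 2-dimensional latent space.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
print(rvq_encode([1.2, 0.8], toy_codebooks))  # ([1, 1], [1.25, 0.75])
```

Each additional level refines the reconstruction by quantizing what the previous levels missed, which is why dropping the last levels degrades quality gracefully instead of breaking decoding.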

#### Transformer-based bottleneck.

To improve the ability of `\mimi `{=latex}to encode speech into compact representations while reconstructing high-quality audio, we add Transformer modules in the bottleneck, one right before quantization and one after. These Transformers have 8 layers, 8 heads, RoPE position encodings, a finite context of 250 frames (10 seconds, as these frames are counted before the last downsampling layer of the encoder), GELU [@gelu] activations, a model dimension of 512 and an MLP dimension of 2048. To stabilize training, we use LayerScale [@layerscale], with initialization of the diagonal values at $0.01$. Both Transformers use causal masking, which preserves the compatibility of the whole architecture with streaming inference. Both Transformers prove to be useful with regard to perceived audio quality, while the Transformer in the encoder also improves the distillation of semantic information described below (see [Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) for ablation studies).

#### Causality and streaming.

With the mentioned hyper-parameters, Mimi is causal and can be used in a streaming fashion, both for encoding and decoding. Both its initial frame size and overall stride correspond to 80ms, meaning that given a first audio frame of 80ms, Mimi outputs a first latent timestep, which can be decoded to 80ms of output audio.
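The 80ms stride follows directly from the striding factors and the sampling rate given in the architecture description. The short check below only restates that arithmetic; the variable names are chosen for this example.

```python
from math import prod

SAMPLE_RATE_HZ = 24_000
BLOCK_STRIDES = (4, 5, 6, 8)  # strides of the 4 convolutional blocks
FINAL_STRIDE = 2              # final strided 1D convolution

total_stride = prod(BLOCK_STRIDES) * FINAL_STRIDE    # samples per latent frame
frame_rate_hz = SAMPLE_RATE_HZ / total_stride        # latent frames per second
frame_ms = 1000 * total_stride / SAMPLE_RATE_HZ      # duration of one frame

print(total_stride, frame_rate_hz, frame_ms)  # 1920 12.5 80.0
```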

#### Optimization.

Purely convolutional codecs use Adam [@adam]; in contrast, the introduction of Transformers into the architecture requires additional regularization with weight decay, for which we use the AdamW [@adamw] optimizer. More precisely, we apply weight decay only to the parameters of the Transformers, with a weight of $5\cdot 10^{-2}$. We use a learning rate of $8\cdot 10^{-4}$, a momentum decay of $0.5$ and a decay of the squared gradient of $0.9$, and an exponential moving average of weights with a decay of $0.99$. We train with a batch size of $128$ on random windows of 12s, for 4M steps, while the context of Transformers is limited to 10s (250 frames before the last downsampling layer of the encoder, and symmetrically for the decoder).

#### Quantization rate.

We use $\numquantizers= 8$ quantizers, each with a codebook size of $\cardaudio = 2048$. At 12.5Hz, this represents a bitrate of 1.1kbps. While the latent dimension is 512, we project embeddings to 256 dimensions before applying the RVQ, and project back to 512 before the decoder. Consistent with previous work, we use quantizer dropout [@soundstream] to provide the codec with bitrate scalability. We moreover follow the observation of @dac that not applying quantization with a certain probability during training improves audio quality. More precisely, we only apply quantization $50\%$ of the time, on a per-sequence basis, during training. Unlike @dac, this means passing unquantized embeddings to the decoder, rather than passing embeddings quantized with all quantizers. [Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) shows that this significantly improves objective quality metrics, while human evaluations are not conclusive. Across our experiments, we make the somewhat counter-intuitive observation that this gain gets more significant as we lower the bitrate.
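The bitrate figure can be reproduced from the number of quantizers, the codebook size and the frame rate. The helper below is a small illustration (the name `bitrate_bps` is ours for this example); it also shows the bitrates obtained when only the first few quantizers are kept, which is what quantizer dropout makes usable.

```python
import math


def bitrate_bps(num_quantizers: int, codebook_size: int, frame_rate_hz: float) -> float:
    """Bits per second: one log2(codebook_size)-bit index per quantizer per frame."""
    return num_quantizers * math.log2(codebook_size) * frame_rate_hz


print(bitrate_bps(8, 2048, 12.5))  # 1100.0, i.e. 1.1kbps
for n in (1, 4, 8):
    print(n, bitrate_bps(n, 2048, 12.5))
```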

#### Adversarial-only training.

As a baseline, we train Mimi with the same combination of reconstruction and adversarial losses as @encodec, namely a multi-scale mel-spectrogram reconstruction loss along with a multi-scale STFT discriminator. The exact parameters can be found in the Audiocraft repository.[^7] While previous neural codecs rely on such combinations of reconstruction and adversarial losses, we experiment with pure adversarial training, where we only keep the feature loss and discriminator loss. We note that this was previously explored in the context of bandwidth extension by @tagliasacchi2020seanet and @hauret2023eben. While removing reconstruction losses significantly degrades objective metrics, we observed during development that the resulting audio sounded much better than expected based on the aforementioned metrics. Subjective evaluations reported in [Table `\ref{tab:mimi_baselines}`{=latex}](#tab:mimi_baselines) confirm this observation and demonstrate a remarkable boost in audio quality from training with adversarial losses only.

### Learning semantic-acoustic tokens with a split RVQ {#sec:splitrvq}

Similarly to SpeechTokenizer [@zhang2024speechtokenizer], we distill semantic information from a self-supervised model (WavLM[^8] [@wavlm] in our case) into the first level of the RVQ. WavLM projects a 16kHz waveform into 1024-dimensional embeddings sampled at 50Hz, while `\mimi `{=latex}projects a 24kHz waveform into 512-dimensional embeddings at 12.5Hz. During training, we thus produce targets for distillation by downsampling the input waveform to 16kHz before computing WavLM embeddings, followed by average pooling with a stride of 4 and a kernel size of 8, to reach 12.5Hz. Interestingly, we observed that it was critical for performance to perform this average pooling in a non-causal way, which remains compatible with streaming inference as these embeddings are only used during training. We apply a linear projection with an output dimension of 1024 to the output of the first RVQ level, parallel to the actual embedding going into the decoder. We then compute a cosine distance between the output of the first quantizer and the transformed WavLM embeddings, to perform distillation. [Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) shows that this distillation loss conflicts with reconstruction and adversarial losses targeting quality. Indeed, while distillation significantly improves the phonetic discriminability of the first quantizer (as measured by ABX [@abx]), it also affects audio quality negatively. We hypothesize that this is due to distilling semantic information into the first level of a single RVQ: as higher-order quantizers operate on the residual of the first one, the latter needs to trade audio quality for phonetic discriminability. We address this issue by proposing a split RVQ. Rather than a single RVQ with 8 levels, we distill semantic information into a plain VQ and apply an RVQ with 7 levels in parallel.
We sum their outputs, such that while both can be used for reconstruction, we remove the constraint that acoustic information should be conserved in the residual of the semantic quantizer. [Figure `\ref{fig:split_quantizer}`{=latex}](#fig:split_quantizer) illustrates this architecture and [Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) shows that this solution provides a better semantic-acoustic trade-off overall.
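The construction of distillation targets and the cosine distance can be sketched as below. This is an illustration under stated assumptions: the pooling window is taken to start 2 frames before each output position and to be clipped at the sequence boundaries (the text only states that pooling is non-causal, with kernel 8 and stride 4), the vectors passed to `cosine_distance` are assumed to be already projected to a common dimension, and both function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def avg_pool_noncausal(
    frames: Sequence[Vector], kernel: int = 8, stride: int = 4
) -> List[Vector]:
    """Average-pool 50Hz frames to 12.5Hz with windows that also look ahead."""
    n = len(frames)
    pooled: List[Vector] = []
    for start in range(0, n, stride):
        lo_raw = start - (kernel - stride) // 2  # assumed window placement
        window = frames[max(0, lo_raw):min(n, lo_raw + kernel)]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return pooled


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# 8 one-dimensional frames at 50Hz become 2 target frames at 12.5Hz.
toy_frames = [[float(i)] for i in range(8)]
print(avg_pool_noncausal(toy_frames))           # [[2.5], [4.5]]
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Because each window extends past the current position, the targets depend on future audio, which is acceptable here only because they are computed at training time.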

## Generative Audio Modeling {#sec:modeling}

We now describe how we extend the base Helium model to support the modeling of the audio tokens provided by the `\mimi{}`{=latex} codec. With our goal of achieving realistic spoken dialogue interactions, we further show how to model not just a single stream of audio, but two at the same time, one representing the user, and one the system. Finally, we detail a novel feature, the *`\interleaving`{=latex}*, which consists in a joint modeling of the textual and audio modalities on the system side, to improve the quality of interactions.

### Hierarchical autoregressive modeling with RQ-Transformer {#sec:joint_ar}

Let $U \in \{1, \ldots, N\}^S$ be a discrete random sequence, with cardinality $N$ and a sequence length $S$. For convenience, we also denote $U_0 = 0$, a deterministic initial token value. Autoregressive modeling consists in estimating the joint distribution $\proba{U_1, \ldots, U_S}$ through estimating the conditional distributions $\proba{U_s | U_0, \ldots, U_{s - 1}}$ for all steps $1 \leq s \leq S$. Text language models, such as GPT [@gpt2] or Helium, fit this paradigm.
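For reference, the factorization underlying this estimation is the chain rule of probability, written here in the notation above (as $U_0$ is deterministic, conditioning on it changes nothing): $$\begin{equation}
\proba{U_1, \ldots, U_S} = \prod_{s=1}^{S} \proba{U_s | U_0, \ldots, U_{s - 1}}.
\end{equation}$$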

<figure id="fig:rqtransformer" data-latex-placement="t">
<img src="figures/rq_former_landscape.png" style="width:50.0%" />
<figcaption><strong>Architecture of the RQ-Transformer.</strong> The RQ-Transformer breaks down a flattened sequence of length <span class="math inline"><em>K</em> ⋅ <em>S</em></span> into <span class="math inline"><em>S</em></span> timesteps for a large Temporal Transformer which produces a context embedding used to condition a smaller Depth Transformer over <span class="math inline"><em>K</em></span> steps. This scales to longer sequences (by increasing <span class="math inline"><em>S</em></span>) or to a higher depth (by increasing <span class="math inline"><em>K</em></span>) more easily than modeling the flattened sequence with a single model. In this figure, we use <span class="math inline"><em>K</em> = 4</span> for the sake of illustration. </figcaption>
</figure>

When modeling spoken language, relying on the tokenized text yields a much more compact representation than audio tokens: using the `\mimi{}`{=latex} codec introduced in [Section `\ref{sec:mimi}`{=latex}](#sec:mimi), with $Q=8$ codebooks at a frame rate of 12.5Hz, one would require a sequence length of $100$ steps per second of audio to generate. To model 5 minutes of audio, this would amount to 30,000 timesteps, which represents a significant computational cost. Moreover, generating 100 tokens per second is incompatible with streaming inference. As a comparison, a sample of English speech can be represented with around 3 to 4 text tokens per second.
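The figures in this comparison follow from simple arithmetic, restated below. The variable names are chosen for this example; the text-token range uses the 3 to 4 tokens per second quoted above.

```python
FRAME_RATE_HZ = 12.5
NUM_CODEBOOKS = 8
DURATION_S = 5 * 60  # 5 minutes of audio

audio_steps_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS         # flattened audio tokens
flattened_audio_steps = int(audio_steps_per_second * DURATION_S)
text_steps_low, text_steps_high = 3 * DURATION_S, 4 * DURATION_S

print(audio_steps_per_second)           # 100.0
print(flattened_audio_steps)            # 30000
print(text_steps_low, text_steps_high)  # 900 1200
```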

We are interested in modeling not just a single sequence $(U_s)$, but multiple sub-sequences, e.g. different audio codebooks, along with an optional text stream. We can stack those sub-sequences as $V_{s, k}$ for $1 \leq s \leq S$ and $1 \leq k \leq K$. Similarly, we define $V_{0, k} = 0$, a deterministic initial token value for all sub-sequences. For each $1 \leq s\leq S$ and $1 \leq k\leq K$, $V_{s, k} \in \{1, \ldots, N_k\}$, where $N_k$ is the cardinality of the $k$-th sub-sequence. One can *flatten* the $K$ sequences into a single one, increasing the number of predictions by a factor of $K$. @rqtransformer propose using a smaller autoregressive model along the dimension $K$, combined with a larger model along the time dimension, forming an RQ-Transformer. Later, @yu2024megabyte suggested a similar approach for byte-level modeling while @uniaudio and @zhu2024generativepretrainedspeechlanguage applied it to audio token modeling.

#### RQ-Transformer.

Formally, the RQ-Transformer consists of two Transformer models, as illustrated in [Figure `\ref{fig:rqtransformer}`{=latex}](#fig:rqtransformer): a *Temporal* Transformer, e.g. with the same architecture as the one described for Helium in [Section `\ref{sec:helium}`{=latex}](#sec:helium), and a smaller *Depth* Transformer. We denote by $\mathrm{Tr}_{\mathrm{Temp}}$ the function represented by the Temporal Transformer, and by $\mathrm{Tr}_{\mathrm{Depth}}$ the one for the Depth Transformer. For simplicity, and for all steps $s\leq S$, we denote by $V_{s} = (V_{s,1}, \ldots, V_{s, K})$ the joint value of all sub-sequences at step $s$. For a given sequence step $1 \leq s \leq S$, the Temporal Transformer maps $(V_0, \ldots, V_{s - 1})$ to a *temporal context* vector $$\begin{equation}
z_s = \mathrm{Tr}_{\mathrm{Temp}}(V_0, \ldots, V_{s - 1}) \in \mathbb{R}^d.
\end{equation}$$ If we further take a sub-sequence index $1 < k \leq K$, the Depth Transformer maps both $z_s$ along with $(V_{s, 1}, \ldots, V_{s, k - 1})$ to the logits estimate $$\begin{equation}
\label{eq:logits}
    l_{s, k} = \mathrm{Tr}_{\mathrm{Depth}}(z_s, V_{s, 1}, \ldots, V_{s, k - 1}) \in \mathbb{R}^{N_k}.
\end{equation}$$ We further define $l_{s, 1} = \mathrm{Lin}(z_s) \in \mathbb{R}^{N_1}$, with $\mathrm{Lin}$ a dedicated linear layer. We train $\mathrm{Tr}_{\mathrm{Temp}}$, $\mathrm{Tr}_{\mathrm{Depth}}$ and $\mathrm{Lin}$ so that $\softmax(l_{s, k})$ is a good approximation of the distribution of $V_{s, k}$ conditioned on all sub-sequences for the previous steps, and on the previous sub-sequences for the current step, i.e. $$\begin{equation}
\begin{cases}
    \softmax(l_{s, 1}) &\approx \proba{V_{s, 1} | V_{0}, \ldots, V_{s-1}}\\
    \softmax(l_{s, k}) &\approx \proba{V_{s, k} | V_{0}, \ldots, V_{s-1}, V_{s, 1}, \ldots, V_{s, k - 1}} \quad \text{if $k > 1$}.
\end{cases}
\end{equation}$$ Importantly, the number of steps in the Temporal Transformer is always equal to $\numtokens$, rather than $K \cdot \numtokens$, and the number of steps in the Depth Transformer is at most $K$. In practice, the Temporal Transformer receives at each step $s$ as input the sum of $K$ learnt embedding tables representing the value for the last $V_{s - 1}$. Given $1 < k \leq K$, the Depth Transformer receives as input the sum of $z_s$ and of a learnt embedding representing $V_{s, k - 1}$.
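The resulting generation procedure can be sketched as a nested loop. The sketch below is not an actual implementation: the two Transformers, the linear head and the sampler are passed in as placeholder callables, and the toy stand-ins at the bottom exist only to count how often each component runs.

```python
from typing import Any, Callable, List, Sequence

Step = List[int]  # the K tokens (V_{s,1}, ..., V_{s,K}) of one time step


def rq_generate(
    temporal: Callable[[Sequence[Step]], Any],               # (V_0..V_{s-1}) -> z_s
    first_head: Callable[[Any], Sequence[float]],            # z_s -> l_{s,1}
    depth: Callable[[Any, Sequence[int]], Sequence[float]],  # (z_s, V_{s,1..k-1}) -> l_{s,k}
    sample: Callable[[Sequence[float]], int],                # logits -> token
    num_steps: int,
    num_streams: int,
) -> List[Step]:
    history: List[Step] = [[0] * num_streams]  # V_0 = 0 for every sub-sequence
    for _ in range(num_steps):
        z = temporal(history)                  # one Temporal Transformer step
        tokens = [sample(first_head(z))]       # k = 1 uses the linear layer
        for _k in range(2, num_streams + 1):   # k = 2..K uses the Depth Transformer
            tokens.append(sample(depth(z, tokens)))
        history.append(tokens)
    return history[1:]


# Toy stand-ins (hypothetical) that only record how many times they are called.
calls = {"temporal": 0, "depth": 0}

def toy_temporal(history: Sequence[Step]) -> int:
    calls["temporal"] += 1
    return len(history)

def toy_head(z: Any) -> List[float]:
    return [0.0, 1.0, 0.5]

def toy_depth(z: Any, previous: Sequence[int]) -> List[float]:
    calls["depth"] += 1
    return [float(len(previous)), 0.0, 0.0]

def argmax_token(logits: Sequence[float]) -> int:
    return max(range(len(logits)), key=lambda i: logits[i]) + 1  # tokens start at 1

generated = rq_generate(toy_temporal, toy_head, toy_depth, argmax_token, 3, 4)
print(generated, calls)
```

With 3 steps and 4 sub-sequences, the large model runs 3 times and the small one 9 times, instead of 12 passes through a single large model over the flattened sequence.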

As detailed in [Table `\ref{tab:hparams}`{=latex}](#tab:hparams), our Depth Transformer has 6 layers, a dimension of 1024, and 16 attention heads. Unlike @rqtransformer [@uniaudio; @zhu2024generativepretrainedspeechlanguage], we use different parameters per index $k$ for the linear layers, projection and fully connected, in the Depth Transformer. Indeed, different sub-sequences might require different transformations. Given the smaller size of this Transformer, this has no impact on either training or inference time, while [Table `\ref{tab:ablation-delays}`{=latex}](#tab:ablation-delays) shows that this depthwise parametrization is beneficial.

### Audio modeling {#sec:audio_modeling}

The audio codec `\mimi{}`{=latex} described in [Section `\ref{sec:mimi}`{=latex}](#sec:mimi) outputs $Q$ sub-sequences, with 12.5 steps per second of audio. We denote those sequences by $A_{t, q} \in \{1, \ldots, N_A\}$ for $1 \leq t \leq T$ with $T = 12.5 \cdot \mathrm{duration}$, and $1 \leq q \leq Q$ with $Q=8$. We insert the audio sub-sequences into the multi-sequence $V$ modeled by the RQ-Transformer. Remember that the first codebook $A_{t, 1}$ corresponds to the semantic information, as detailed in [Section `\ref{sec:splitrvq}`{=latex}](#sec:splitrvq), while the other codebooks correspond to acoustic features.

#### Acoustic delay.

\looseness=-1

We first experimented with simply setting $V = A$ in the modeling. However, we found that introducing a slight delay between the semantic and acoustic tokens led to more stable generations. @musicgen show that this leads to reduced dependencies between the sub-sequences for a given time step, conditioned on the past, thus allowing a weaker model to approximate the joint distribution $\proba{V_{s, k} | V_{0}, \ldots, V_{s-1}}$ (in their case, as the product of the conditioned marginals). @lemercier2024mmd further show a connection between the mutual information between the sub-sequences at a given step, and the quality of the generation: naturally, the more complex the interdependence, the more powerful the model needed to estimate it.

As shown in [Section `\ref{sec:moshi_ablations}`{=latex}](#sec:moshi_ablations), introducing a delay of 1 or 2 steps between the semantic and acoustic features greatly improves the quality of the generation. This allows the larger Temporal Transformer to model the inter-dependence between semantic and acoustic features. Formally, given a delay $\tau \in \mathbb{N}$, we have, for all steps $s$, $$\begin{equation}
\begin{cases}
    \label{eq:v_a_delay}
    V_{s, 1} = A_{s, 1}& \\
    V_{s, q} = A_{s - \tau, q} &\quad \text{if}\quad s \geq \tau + 1, q > 1\\
    V_{s, q} = 0&\quad \text{if}\quad s < \tau + 1, q > 1.\\
\end{cases}
\end{equation}$$ Note that RQ-Transformers were successfully used to model audio by @uniaudio and @zhu2024generativepretrainedspeechlanguage. We introduce here the use of per-codebook parameters in the Depth Transformer, and the use of the acoustic delay. Compared with @zhu2024generativepretrainedspeechlanguage, which first generates all the semantic tokens, we generate them jointly with the acoustic tokens, which allows for the first time a joint streaming modeling of semantic and acoustic tokens.
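The acoustic delay can be sketched in a few lines of Python. This is only an illustration under stated assumptions: indices are 0-based (so the condition $s \geq \tau + 1$ becomes `s >= tau`), tokens are stored as nested lists, and the function name `apply_acoustic_delay` is hypothetical.

```python
from typing import List


def apply_acoustic_delay(A: List[List[int]], tau: int, zero: int = 0) -> List[List[int]]:
    """Build V from A, where A[t][q] is the token of codebook q at step t.

    Codebook 0 (semantic) is left in place; codebooks 1..Q-1 (acoustic)
    are shifted `tau` steps later, and the first `tau` steps are filled
    with the `zero` token.
    """
    T = len(A)
    Q = len(A[0]) if T else 0
    V = [[zero] * Q for _ in range(T)]
    for s in range(T):
        V[s][0] = A[s][0]                 # semantic token, no delay
        if s >= tau:
            for q in range(1, Q):
                V[s][q] = A[s - tau][q]   # delayed acoustic tokens
    return V


A = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
print(apply_acoustic_delay(A, tau=1))
```

In this sketch the output keeps the same length as the input, so the acoustic tokens of the last `tau` steps are dropped; a real decoder would run `tau` extra steps to recover them.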

### Multi-stream modeling {#sec:multistream}

\looseness=-1

Modeling a single stream of audio is not sufficient to fully model a conversation. Our framework can be extended to modeling a two-speaker conversation: given two audio streams $(A_{t, q})$ and $(A'_{t, q})$, we simply apply the acoustic delay to both, and concatenate them into $V$, extending `\autoref{eq:v_a_delay}`{=latex}. In practice, $A$ will correspond to `\ours`{=latex}, while $A'$ models the user.

![**Representation of the joint sequence modeled by Moshi**. Each column represents the tokens for a given step in the joint sequence $(V_{s, k})$ described in `\autoref{eq:final_multi_sequence}`{=latex} with an acoustic delay $\tau=1$, e.g. the input of the Temporal Transformer for this step. Tokens are predicted from bottom to top in the Depth Transformer. At inference time, tokens under the dashed line (corresponding to Moshi) are sampled, while those above are fed from the user. This design allows for our model to handle overlapping speech turns.](figures/rainbow.png){#fig:rainbow width="95%"}

### `\interleaving`{=latex} {#sec:interleaving}

While operating purely in the audio domain already yields convincing results (see [Table `\ref{tab:textless}`{=latex}](#tab:textless)), we observe that having Moshi also model the textual representation of its own speech provides a scaffolding that increases the linguistic quality of its generation. Formally, we define a text stream $W \in \{1, \ldots, N_W\}^T$ derived from the sequence of text tokens obtained by applying the SentencePiece tokenizer [@kudo2018sentencepiece] to the transcription of the audio corresponding to Moshi with Whisper [@whisper], as described below. We insert $W$ as the first sub-sequence in $V$, such that it acts as a prefix to the generation of semantic tokens. This can be seen as an extension of the hierarchical semantic-to-acoustic generation introduced by @audiolm. Note that we do not use the textual representation corresponding to the stream of the user, as transcribing this stream in real time would be challenging, and relying on an external ASR system contradicts our end-to-end speech-to-speech approach. Ablation studies in [Section `\ref{sec:moshi_ablations}`{=latex}](#sec:moshi_ablations) show that among the design choices made for `\ours{}`{=latex}, `\interleaving{}`{=latex} has one of the most critical impacts on the quality of generated speech.

#### Aligning text and audio tokens.

To integrate text tokens with audio tokens that operate at a constant framerate of 12.5Hz, we need to align them to this framerate. For that, we leverage the word-level timestamps provided by Whisper. The $i$-th word in the transcript is mapped to $n_i \in \mathbb{N}^*$ text tokens $w_{i, j}$, $j \leq n_i$, along with a start index $t_i \in \{1, \ldots, T\}$, simply defined as its start timestamp multiplied by the framerate of 12.5 Hz. We define two special tokens, `PAD` and `EPAD`, that never appear in any of the word tokens. We build $W$ such that when a word starts, $(W_t)$ contains its text tokens, followed by `PAD` until the next word. `EPAD` is inserted before the next word to indicate the end of the padding. While not strictly necessary, we observed that this provides useful guidance to the model by splitting the decision of ending a word, and of which one should follow, into two steps.

First, the sequence $(W_t)$ is initialized with `PAD` tokens, i.e. $\forall t, W_t \leftarrow \text{\texttt{PAD}}$. Then, we modify it iteratively as follows. For each word $i$ and its start index $t_i$, we update $W$ as $$\begin{equation}
\left\{\begin{array}{ll}
    W_{t_i - 1} &\leftarrow \text{\texttt{EPAD}} \\
    W_{t_i + j} &\leftarrow w_{i, j} \qquad \forall j \leq n_i.
\end{array}\right.
\end{equation}$$ `\looseness=-1`{=latex} Note that if $t_i = 1$, we instead insert `EPAD` at index 1, and shift the text tokens. We do not insert an `EPAD` token if it would overwrite a text token from a previous word. As text tokens are more compact than the corresponding audio tokens, there is usually no overlap between words in $W_t$. In English conversational speech, we observe that padding tokens represent about 65% of the tokens.
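The construction of $W$ can be illustrated with the following sketch. It is not the actual preprocessing code and makes several assumptions: indices are 0-based, the start index is obtained by flooring the timestamp times the framerate, the first token of a word is written at its start index, and the integer values chosen for `PAD` and `EPAD` are arbitrary placeholders. The function name `build_text_stream` is hypothetical.

```python
from typing import List, Sequence, Tuple

PAD, EPAD = -1, -2  # placeholder ids, assumed distinct from all word tokens


def build_text_stream(
    words: Sequence[Tuple[float, List[int]]],
    num_frames: int,
    frame_rate_hz: float = 12.5,
) -> List[int]:
    """Align (start_time_s, tokens) words to a per-frame text stream W."""
    W = [PAD] * num_frames
    for start_s, tokens in words:
        t = int(start_s * frame_rate_hz)  # flooring is an assumption
        if t >= num_frames:
            continue
        if t == 0:
            # No room before the first frame: put EPAD there and shift the word.
            W[0] = EPAD
            t = 1
        elif W[t - 1] == PAD:
            # Only mark the end of padding if no earlier text token is overwritten.
            W[t - 1] = EPAD
        for j, tok in enumerate(tokens):
            if t + j < num_frames:
                W[t + j] = tok
    return W


print(build_text_stream([(0.0, [5, 6]), (0.5, [7])], num_frames=10))
```

In the printed stream, the second word starts at 0.5 s, i.e. frame 6, so `EPAD` lands on frame 5 and the frames between the two words remain `PAD`.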

#### Deriving streaming ASR and TTS.

One can further introduce some delay between the text sequence $(W_t)$, and the audio tokens $(A_{t, q})$. This controls in which modality the language model will take the decision about the content of the generated audio. By setting the audio ahead of the text, the content of the text will be dictated by what audio has been sampled in the previous steps. In particular, by sampling only the text tokens, while using the ground truth audio tokens and discarding the prediction of the model for them, one obtains a streaming Automatic Speech Recognition model, which also provides precise word-level alignment. On the other hand, by changing the text delay so that the text is ahead of the audio tokens, the content of the audio is dictated by the text content. Once more, given a sequence of properly padded text tokens, one obtains a streaming Text-To-Speech model. We further describe in [Appendix `\ref{app:streaming_tts}`{=latex}](#app:streaming_tts) how one can adapt the inference of a language model with delayed text to obtain a zero-shot properly padded text tokens sequence. Experiments in [Section `\ref{sec:streaming_asr_tts}`{=latex}](#sec:streaming_asr_tts) show how a single delay hyper-parameter allows for switching from an ASR to a TTS model with no changes in the loss, architecture, or training data.
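To make the role of the delay concrete, here is a minimal sketch of shifting one per-frame stream relative to the others. The helper `delay_stream` and its sign convention (positive values move the stream later) are illustrative assumptions, not the notation used elsewhere in this work.

```python
from typing import List


def delay_stream(stream: List[int], delay: int, pad: int) -> List[int]:
    """Shift a per-frame stream `delay` frames later (earlier if negative).

    The length is preserved: vacated positions are filled with `pad`, and
    tokens shifted out of range are dropped.
    """
    T = len(stream)
    out = [pad] * T
    for t in range(T):
        src = t - delay
        if 0 <= src < T:
            out[t] = stream[src]
    return out


text = [1, 2, 3, 4]
print(delay_stream(text, 1, pad=0))   # text lags the audio: ASR-like setup
print(delay_stream(text, -1, pad=0))  # text leads the audio: TTS-like setup
```

Delaying the text stream places each word after the audio that carries it, which matches the ASR setting; delaying the audio streams instead (or, equivalently, advancing the text) matches the TTS setting.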

#### Joint sequence modeling for Moshi.

Putting together the multi-stream and inner monologue, we have the final set $V$ of sequences to model defined as $$\begin{equation}
\label{eq:final_multi_sequence}
    \left\{\begin{array}{lll}
    V_{s, 1} &= W_{s} \quad &\text{aligned text tokens.} \\
    V_{s, 2} &= A_{s, 1} \quad &\text{semantic tokens of Moshi.}\\
    V_{s, 1 + q} &= A_{s - \tau, q} \quad \text{if}\quad s \geq \tau + 1, 1 < q \leq Q \quad &\text{delayed acoustic tok. of Moshi.}\\
    V_{s, 1 + Q + 1} &= A'_{s, 1} \quad &\text{semantic tokens of \emph{other}.}\\
    V_{s, 1 + Q + q} &= A'_{s - \tau, q} \quad \text{if}\quad s \geq \tau + 1, 1 < q \leq Q\quad &\text{delayed acoustic tok. of \emph{other},}\\
    \end{array}\right.
\end{equation}$$ amounting to a total number of $K=2Q+1$ streams, with $Q=8$ in the experiments. A summary is provided in [Figure `\ref{fig:rainbow}`{=latex}](#fig:rainbow).
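The layout of one step of the joint sequence can be sketched as follows. This is an illustration with 0-based indices; the function name `build_joint_step` and the use of a `zero` token for the not-yet-available delayed acoustic tokens (mirroring `\autoref{eq:v_a_delay}`{=latex}) are assumptions.

```python
from typing import List


def build_joint_step(
    W: List[int],
    A: List[List[int]],
    A_other: List[List[int]],
    s: int,
    tau: int,
    Q: int = 8,
    zero: int = 0,
) -> List[int]:
    """Return the K = 2Q + 1 tokens of step s: text, Moshi audio, other audio."""

    def audio_tokens(codes: List[List[int]]) -> List[int]:
        tokens = [codes[s][0]]  # semantic token, not delayed
        for q in range(1, Q):   # acoustic tokens, delayed by tau steps
            tokens.append(codes[s - tau][q] if s >= tau else zero)
        return tokens

    return [W[s]] + audio_tokens(A) + audio_tokens(A_other)


# Toy example with Q = 2 codebooks per speaker, hence K = 5 streams.
W = [100, 101]
A = [[1, 2], [3, 4]]
A_other = [[5, 6], [7, 8]]
print(build_joint_step(W, A, A_other, s=1, tau=1, Q=2))
```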

#### Inference of Moshi.

\looseness=-1

The joint sequence given by `\autoref{eq:final_multi_sequence}`{=latex} is the target for our modeling task at train time: at any time step $s$, the model receives $0, V_1, \ldots, V_{s - 1}$ as input and outputs an estimated probability distribution $\hat{V}_{s}(0, V_1, \ldots, V_{s - 1})$. At inference time, we sample from $\hat{V}_{s, k}$ for all the sub-sequence indexes that correspond to outputs of Moshi: i.e., $k=1$ for the text tokens corresponding to Moshi's speech, and $k \in \{2, \ldots, 1 + Q\}$ for Moshi's audio tokens. In an application setting, the prediction for the audio coming from the user ($k > 1 + Q$) is ignored, as the actual user audio is used instead. However, modeling the user stream as output allows generating simulated dialogues, which is necessary for offline evaluation as in [Section `\ref{sec:eval_dialogues}`{=latex}](#sec:eval_dialogues). Interestingly, there are no explicit boundaries for the change of turns between the user and Moshi: Moshi can speak and listen at all times, and do both at once if needed. In particular, when the user speaks and Moshi stays silent, the corresponding audio tokens for Moshi's stream decode into \`\`natural silence\", a near silent waveform, instead of having a fixed, well defined value; at the same time, `\ours{}`{=latex}'s text stream will be filled with `PAD` tokens. As a result, the text stream can provide interesting ways of controlling Moshi: for instance, forcing the sampling of an `EPAD` token will make Moshi start talking immediately.
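The split between sampled and user-provided streams can be sketched as below, following the stream layout of `\autoref{eq:final_multi_sequence}`{=latex} with 0-based indices. The helper names are hypothetical, and the user tokens are assumed to be already encoded and delayed by the caller.

```python
from typing import List


def moshi_stream_indices(Q: int = 8) -> List[int]:
    """0-based indices sampled from the model: text stream + Moshi's Q audio streams."""
    return list(range(0, 1 + Q))


def user_stream_indices(Q: int = 8) -> List[int]:
    """0-based indices whose predictions are discarded in an application setting."""
    return list(range(1 + Q, 1 + 2 * Q))


def merge_step(sampled: List[int], user_tokens: List[int], Q: int = 8) -> List[int]:
    """Keep Moshi's sampled tokens and overwrite the user slots with real user audio."""
    if len(sampled) != 2 * Q + 1 or len(user_tokens) != Q:
        raise ValueError("expected 2Q + 1 sampled tokens and Q user tokens")
    merged = list(sampled)
    merged[1 + Q:] = user_tokens
    return merged


# Toy example with Q = 2: stream 0 is text, 1-2 are Moshi, 3-4 are the user.
print(merge_step([9, 1, 2, 0, 0], [7, 8], Q=2))
```

The merged step is what would be fed back to the Temporal Transformer as the input for the next step.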

# Datasets and Training {#sec:maindatasets}

## Text Data {#sec:text_dataset}

Our training dataset is made of a mix of high-quality data sources and filtered web data from CommonCrawl. More specifically, 12.5% of our dataset is from the following curated sources: Wikipedia,[^9] Wikibooks, Wikisource, Wikinews, StackExchange[^10] and the collection of scientific articles pes2o.[^11] Instead of doing multiple passes on Wikipedia, we use five different dumps from 2017, 2018, 2019, 2021 and 2022. The remaining 87.5% of our dataset is from CommonCrawl, and was filtered with the pipeline described in [Section `\ref{sec:data_filtering}`{=latex}](#sec:data_filtering). We used the following ten crawls: `2018-30`, `2019-04`, `2019-30`, `2020-05`, `2020-34`, `2021-04`, `2021-31`, `2022-05`, `2022-33`, `2023-40`.

## Audio Data {#sec:audio_data}

\looseness=-1

We use an audio collection of 7 million hours, which we call the *unsupervised audio dataset*, of readily available audio content, the majority of which contains English speech. We transcribe this set with Whisper [@whisper], using the large-v3 model. We use this data for the audio pre-training phase, during which we do not use the multi-stream approach described in [Section `\ref{sec:multistream}`{=latex}](#sec:multistream), but instead use a single stream of audio representing all speakers at once. Similarly, the text stream described in [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving) represents the words coming from all speakers. All the audio is resampled to 24kHz and downmixed to mono.

\looseness=-1

To achieve multi-stream, we need the model to gain the ability to both listen and speak at the same time. For this, we further leverage the Fisher dataset [@cieri2004fisher]. It consists of 2000 hours of phone conversations between randomly paired participants, with a given topic to discuss. A property of Fisher is that each conversation side is recorded on a separate channel, which allows providing ground-truth separated streams to `\ours{}`{=latex}. The original audio is sampled at 8kHz, and we use AudioSR [@liu2023audiosr] to upsample it to 24kHz.

\looseness=-1

Finally, we source 170 hours of natural and scripted conversations between multiple pairs of participants, recorded with separate channels per speaker, in order to provide a small dataset on which to finetune the model to improve the quality over the one obtained when using only Fisher. We call this dataset the *supervised multi-stream dataset*. We do not train Moshi directly on this dataset, but use it to train a realistic multi-stream TTS model, and fine-tune Helium on real conversation transcripts as explained in Sections `\ref{sec:instruct_data}`{=latex} and `\ref{sec:stages}`{=latex}.

\looseness=-1

For both Fisher and this last dataset, we sample one speaker randomly as the main speaker (i.e., `\ours{}`{=latex} speaking), and put the other speaker on the second audio stream. For Fisher, the text stream only contains the transcription of the main speaker. To obtain reliable timestamps, despite long silences in each stream, we use transcriptions obtained with the `whisper-timestamped` package [@lintoai2023whispertimestamped], along with the medium Whisper model.

## Speech-Text Instruct Data {#sec:instruct_data}

Early experiments showed that text-based instruct datasets such as Open Hermes [@OpenHermes2] are ill-suited for the instruct tuning of a spoken conversational system. In particular, the data formatting was often impossible to properly render with TTS (e.g. URLs), and the format of the questions and responses did not follow a natural oral flow (e.g. bullet points, long enumerations). Instead, we leverage Helium, fine-tuned on Open Hermes and transcripts of real conversations, to generate realistic interactions between a speech-based AI model and a user. We then synthesize them with our multi-stream streaming TTS described in [Appendix `\ref{app:streaming_tts}`{=latex}](#app:streaming_tts), leading to more than 20k hours of synthetic speech data. To give `\ours{}`{=latex} its own consistent voice, we also condition the TTS engine on the voice of a single actor, who recorded monologues covering more than 70 speaking styles, as listed in [Table `\ref{tab:list_voices}`{=latex}](#tab:list_voices). Experiments on voice consistency reported in [Section `\ref{sec:voice_consistency}`{=latex}](#sec:voice_consistency) show that simply using a consistent voice for `\ours{}`{=latex} during instruction tuning is enough to guarantee almost surely that it does not use another voice, without further control during inference. In contrast, the voice of the second audio stream (the user) is randomly sampled for each example, giving more robustness to different speaking conditions and accents.

To generate the transcripts, we use different prompts, aiming at capturing different kinds of interactions between a user and `\ours{}`{=latex}. First, we generate conversations about general knowledge, by starting from a few Wikipedia paragraphs or StackExchange posts, which we refer to as context. This ensures that `\ours{}`{=latex}'s conversations cover a wide range of topics, such as history, cooking advice or pop culture. More precisely, using a given context, we obtain a summary of a potential discussion with the following prompt:

\begin{tcolorbox}
  \texttt{\textcolor{orange}{\{\{context\}\}}}


  \texttt{Based on information from the previous paragraph, write the summary of a conversation about \textcolor{orange}{\{\{title\}\}} between Blake and Moshi. The summary must be 2 sentences long, and start with "They" or "The speakers".}
\end{tcolorbox}

`\noindent `{=latex}where `{{context}}` refers to paragraphs from Wikipedia or StackExchange and `{{title}}` is the corresponding title. Then, we generate the full transcript with the prompt:

\begin{tcolorbox}
  \texttt{\textcolor{orange}{\{\{context\}\}}}


  \texttt{Write the transcript of a conversation between Blake and  \ours{}. \textcolor{orange}{\{\{summary\}\}} \ours{} is knowledgeable about the topic. Use some backchanneling. Use short turns.}
\end{tcolorbox}

Similarly, to give `\ours{}`{=latex} information about itself and the Kyutai lab, we generate paragraphs describing both and use them as additional context.

Second, we produce interactions containing instructions about `\ours{}`{=latex}'s voice, such as the other speaker requesting `\ours{}`{=latex} to speak with an angry voice or like a pirate. Our first strategy is to generate single-turn interactions where the model is instructed to say a sentence, a monologue or a poem about an entity, belonging to a high-level category such as "sports" or "animals", using a particular voice. The voice requested by the other speaker and the entity are randomly sampled, and are thus completely unrelated. Our second strategy is to generate roleplaying situations, corresponding to different emotions or speaking styles, with the following prompt:

\begin{tcolorbox}
  \texttt{Write a list of 10 situations about a \textcolor{orange}{\{\{voice\}\} \{\{character\}\}}. Each situation must start with "a \textcolor{orange}{\{\{voice\}\} \{\{character\}\}} who" and must be at most 8 words long.}
\end{tcolorbox}

Examples of voice adjectives include \`\`happy" or \`\`surprised", and examples of characters include \`\`detective" or \`\`superhero". We then generate the interaction using the prompt:

\begin{tcolorbox}
\texttt{Write a dialogue between Blake and Moshi, \textcolor{orange}{\{\{situation\}\}}. Use a lot of backchanneling.}
\end{tcolorbox}
\looseness=-1

To make Moshi robust to mispronounced words, we also generate instructions containing misspellings in the user's questions, followed by Moshi asking the user to repeat herself or to clarify the question. We also generate questions containing a false or misleading fact (such as "Is the Eiffel Tower in Beijing?"), to train the model to answer "No" and correct the user. Otherwise, the vast majority of generated conversations only contain questions from the user where Moshi should answer positively. We generate basic math, grammar or trivia single-turn questions and answers, as we noticed that Moshi was initially not performing well on simple factual tasks like adding numbers. Finally, we generate safety conversations, where the user asks unethical or NSFW questions, and Moshi refuses to answer these requests.

## Training Stages and Hyper-parameters {#sec:stages}

#### Helium pre-training.

\looseness=-1

An overview of the training stages and hyper-parameters is provided in [Table `\ref{tab:hparams}`{=latex}](#tab:hparams). For each stage, we use AdamW [@adamw], with a weight decay of 0.1, a momentum decay of 0.9, and a decay for the average of the squared gradient of 0.95. All models are trained on H100 GPUs, using FSDP and activation checkpointing. The text-only language model, Helium, is trained for 500k steps, with a batch size of 4.2M tokens, using a cosine learning rate schedule starting at $3\cdot 10^{-4}$ with linear warmup.

#### Moshi pre-training.

\looseness=-1

Then, we initialize the Temporal Transformer in Moshi with Helium, while the Depth Transformer described in [Section `\ref{sec:joint_ar}`{=latex}](#sec:joint_ar) is randomly initialized. We first train on the unsupervised audio dataset presented in [Section `\ref{sec:audio_data}`{=latex}](#sec:audio_data), using a single stream of audio, with a batch size covering 16 hours of audio, each batch item consisting of a 5 min sequence. We mask the corresponding text tokens with a probability of 30%. We randomize the delay between the text and audio tokens between -0.6 and +0.6 seconds. In order to prevent catastrophic forgetting, we also train half of the time on batches of text-only data from the same dataset as used for Helium. In total, we perform 1 million training steps, with a cosine learning rate starting at $3\cdot 10^{-5}$ for the Temporal Transformer, and $2\cdot 10^{-4}$ for the Depth Transformer, also with a linear warmup. In order to ensure the updates from the text-only batches are balanced with those from the audio dataset, we use two separate optimizer states. In addition, when operating on the text stream from an audio batch, we multiply the learning rate for the text embedding and text linear layer by 0.75. Finally, as padding tokens are predominant for audio batches, we reduce their weight by 50% in the cross-entropy loss.
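The reduced weight of padding tokens can be illustrated with the sketch below. It is not the actual training code: the helper names are hypothetical, only `PAD` (not `EPAD`) is treated as padding, and normalizing by the sum of the weights is an assumption, as the normalization is not specified above.

```python
from typing import List


def text_loss_weights(tokens: List[int], pad_id: int, pad_weight: float = 0.5) -> List[float]:
    """Per-position weights for the text cross-entropy: padding counts for half."""
    return [pad_weight if tok == pad_id else 1.0 for tok in tokens]


def weighted_mean(losses: List[float], weights: List[float]) -> float:
    """Weighted average of per-position losses (normalization is an assumption)."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)


PAD_ID = 0  # placeholder id for the PAD token
tokens = [7, PAD_ID, PAD_ID, 3]
per_position_ce = [2.0, 4.0, 4.0, 2.0]
weights = text_loss_weights(tokens, PAD_ID)
print(weights, weighted_mean(per_position_ce, weights))
```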

#### Moshi post-training.

\looseness=-1

Starting from the model obtained from the previous stage, we then train it to gain its multi-stream ability. First, we use PyAnnote [@Bredin23] to diarize the audio from the unsupervised audio dataset. We sample one speaker at random, which will act as the main speaker, and derive a binary mask over the waveform, with a value of $1$ when the speaker is active based on the diarization, and $0$ otherwise. This mask provides us with two waveforms: one with the speaker, and one with the residual (potentially several speakers), which are encoded separately and then used as the two input audio streams described in [Section `\ref{sec:multistream}`{=latex}](#sec:multistream). The text stream only contains the text tokens from the selected main speaker, and the delay between text and audio tokens is fixed to 0. We train for 100k steps, with a batch size of 8 hours of audio, and a fixed learning rate of $3\cdot 10^{-6}$ for the Temporal Transformer, and $5\cdot 10^{-5}$ for the Depth Transformer. As in the pre-training phase, we sample full text-only batches, here 10% of the time.
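The mask-based split into two simulated streams can be sketched as follows, on plain Python lists for readability. The function name `split_by_speaker_mask` is hypothetical, and the per-sample mask is assumed to have been derived from the diarization beforehand.

```python
from typing import List, Tuple


def split_by_speaker_mask(
    waveform: List[float], mask: List[int]
) -> Tuple[List[float], List[float]]:
    """Split a mono waveform into (main speaker, residual) with a binary mask.

    mask[n] is 1 when the selected main speaker is active at sample n, else 0.
    """
    if len(waveform) != len(mask):
        raise ValueError("waveform and mask must have the same length")
    main = [x * m for x, m in zip(waveform, mask)]
    residual = [x * (1 - m) for x, m in zip(waveform, mask)]
    return main, residual


main, residual = split_by_speaker_mask([0.5, -0.2, 0.1], [1, 0, 1])
print(main, residual)
```

By construction the two streams never overlap and the inactive stream is exactly zero, which is why this simulated multi-stream data differs from real conversations.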

#### Moshi finetuning. {#sec:moshift}

The previously described simulated multi-stream provides a good pre-training task but is far from being sufficient to capture natural conversations: for instance, it contains no overlap, and the stream of an inactive speaker is perfectly silent. We then use the Fisher dataset [@cieri2004fisher] to have the model learn real multi-stream interaction. We sample one of the two speakers to be the first (and main) speaker. We train for 10k batches, with a batch size of 40min of audio and a learning rate of $2\cdot 10^{-6}$/$4\cdot 10^{-6}$ for the Temporal/Depth Transformer. We no longer sample full text batches.

Finally, we set the speaker identity for the first speaker stream to be that of Moshi, a useful conversational assistant, with a final stage of instruct finetuning. We use the synthetic instruct dataset described in [Section `\ref{sec:instruct_data}`{=latex}](#sec:instruct_data), with a batch size of 2.7 hours of audio, for 30k steps, with a learning rate of $2\cdot 10^{-6}$ for both transformers.

During this stage, we perform data augmentation on the user's stream to make Moshi robust to various situations. Namely, we apply a random gain to the user stream between -24 dB and +15 dB, 50% of the time. 30% of the time, we further add noise extracts from the Deep Noise Suppression challenge [@dubey2023icassp] which we concatenate in order to cover the entire duration of each example. The noise is amplified to reach a target volume relative to the original source between -30 dB and +6 dB. Each time we need to sample a new noise, we alternatively use a silent section with a random duration of up to 30 seconds with a probability of 50%, so that the model can handle the audio condition going from noisy to silent, and vice versa. We emulate echo from Moshi into the user's microphone by adding a scaled down copy of Moshi's stream into the user's stream, scaled by a factor uniformly sampled in $[0, 0.2]$, and with a delay uniformly sampled in $[100\text{ms}, 500\text{ms}]$. Finally, we apply to the user's stream, potentially augmented with the echo, a reverb-like augmentation as introduced by @defossez2020real. The echo and reverb are applied together with a probability of 30%.
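Two of these augmentations, the random gain and the echo, can be sketched in pure Python. This is only an illustration: the function names are hypothetical, decibels are assumed to refer to amplitude (hence the factor $10^{\mathrm{dB}/20}$), and the noise and reverb steps as well as the application probabilities are left out.

```python
import random
from typing import List, Optional, Tuple


def db_to_amplitude(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor (amplitude convention assumed)."""
    return 10.0 ** (gain_db / 20.0)


def random_gain(user: List[float], rng: random.Random) -> List[float]:
    """Scale the user stream by a gain drawn uniformly in [-24 dB, +15 dB]."""
    factor = db_to_amplitude(rng.uniform(-24.0, 15.0))
    return [x * factor for x in user]


def add_echo(
    user: List[float],
    moshi: List[float],
    sample_rate: int = 24000,
    rng: Optional[random.Random] = None,
) -> Tuple[List[float], float, int]:
    """Add a delayed, scaled-down copy of Moshi's stream to the user's stream.

    Returns the augmented stream, the sampled scale and the delay in samples.
    """
    rng = rng or random.Random()
    scale = rng.uniform(0.0, 0.2)                         # echo strength
    delay = int(rng.uniform(0.1, 0.5) * sample_rate)      # 100 ms to 500 ms
    out = list(user)
    for n in range(delay, len(out)):
        if n - delay < len(moshi):
            out[n] += scale * moshi[n - delay]
    return out, scale, delay


rng = random.Random(0)
echoed, scale, delay = add_echo([0.0] * 10, [1.0] * 10, sample_rate=10, rng=rng)
print(scale, delay, echoed)
```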

#### TTS Training.

We also train a streaming, multi-stream text-to-speech model, using the method described in [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving). The audio pre-training stage is shared with Moshi, while the post-training is completed using a delay of 2 seconds for the audio stream compared to the text. The model is finetuned on the supervised multi-stream dataset containing high quality recording of interactions between two speakers. It is used to generate the synthetic finetuning instruct dataset described in [Section `\ref{sec:instruct_data}`{=latex}](#sec:instruct_data). Note that Moshi itself is not trained on the supervised multi-stream dataset. Further details are provided in [Appendix `\ref{app:streaming_tts}`{=latex}](#app:streaming_tts).

#### Training loss.

Moshi is trained to model joint sequences, as presented in eq. `\ref{eq:final_multi_sequence}`{=latex}. Given the ground-truth discrete tokens $(V_{s, k})_{s\leq S, k\leq K}$, and the estimated logits $(l_{s, k})_{s\leq S, k\leq K}$ from eq. `\ref{eq:logits}`{=latex}, we use the following loss, with $\mathrm{CE}$ the cross entropy, $$\begin{equation}
L(V, l) = \frac{1}{S}\sum_{s=1}^S\left(\mathrm{CE}(l_{s, 1}, V_{s, 1}) + \frac{1}{\sum_{k=2}^K \alpha_k}\sum_{k=2}^K \alpha_k \mathrm{CE}(l_{s, k}, V_{s, k})\right).
\end{equation}$$ Thus, we give the same importance to the text token ($k{=}1$), and the combined audio tokens. $\alpha_k$ is set to 100 for semantic tokens, and 1 for acoustic ones.
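The weighting inside the sum can be made concrete for a single step $s$ with the sketch below. The function name `step_loss` is hypothetical, indices are 0-based, and giving $\alpha_k = 100$ to the semantic streams of both speakers is an assumption.

```python
from typing import List, Sequence


def step_loss(
    ce: List[float],
    semantic_idx: Sequence[int],
    alpha_semantic: float = 100.0,
    alpha_acoustic: float = 1.0,
) -> float:
    """Loss for one step: text CE plus the alpha-weighted average of the audio CEs.

    ce[0] is the cross entropy of the text stream, ce[1:] those of the audio
    streams; `semantic_idx` lists the 0-based indices of the semantic streams.
    """
    alphas = [
        alpha_semantic if k in semantic_idx else alpha_acoustic
        for k in range(1, len(ce))
    ]
    audio = sum(a * c for a, c in zip(alphas, ce[1:])) / sum(alphas)
    return ce[0] + audio


Q = 8
K = 2 * Q + 1
semantic = (1, 1 + Q)  # semantic streams of Moshi and of the other speaker
print(step_loss([1.0] * K, semantic))
```

With $Q=8$, the weights sum to $2 \cdot 100 + 14 \cdot 1 = 214$, so each semantic stream contributes $100/214 \approx 47\%$ of the audio term and each acoustic stream about $0.5\%$.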

# Evaluation {#sec:evaluation}

\footnotesize

                          ARCe       ARCc       OBQA        HS         WG        PIQA       SIQA          TQA           NQ        MMLU
  -------------------- ---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------- ---------- ----------
  `\helium `{=latex}    **79.6**   **55.9**     53.6       76.3     **70.0**     79.4     **51.0**   **59.9/72.6**     23.3     **54.3**
  MPT                     70.5       46.5       51.4     **77.6**     69.9     **80.6**     48.5        -/61.2         20.8       30.8
  Falcon                  73.7       47.5       53.0       76.3       68.9       80.3       47.2        -/64.6         21.0       28.0
  Llama 2                 75.2       45.9     **58.6**     77.2       69.2       78.8       48.3        -/72.1       **25.7**     45.3
  OLMo                    67.2       42.5       50.0       75.5       69.8       77.5        \-           -/-           \-        52.0
  Mistral                 80.5       54.9       52.2       81.0       74.2       82.2     47.0$^*$      62.5/-         23.2       62.5
  Gemma 1                 81.5       53.2       52.8       81.2       72.3       81.2       51.8        63.4/-         23.0       64.3

  : **Text language model evaluation**. Performance on standard benchmarks for evaluating large language models, including closed book question answering, reasoning and multiple choice QA exams. We report in bold the best performing model trained on less than 2.5T tokens. `\label{tab:llm_eval}`{=latex}

## Text Language Modeling {#sec:helium_eval}

#### Metrics.

We evaluate `\helium `{=latex}(trained only on text data) on the following standard benchmarks: AI2 Reasoning Challenge [@clark2018think ARC], Open-Book QA [@mihaylov2018can OBQA], HellaSwag [@zellers2019hellaswag HS], WinoGrande [@sakaguchi2021winogrande WG], Physical Interaction QA [@bisk2020piqa PIQA], Social Interaction QA [@sap2019socialiqa SIQA], TriviaQA [@joshi2017triviaqa TQA], Natural Questions [@kwiatkowski2019natural NQ] and Massive Multitask Language Understanding benchmark [@hendrycks2020measuring MMLU]. These benchmarks cover a wide variety of tasks, including common sense reasoning, closed-book question answering and multiple choice question answering from high school and college subjects. We follow the evaluation protocol from previous work such as GPT-3 or Llama: we perform 5-shot evaluation on TriviaQA, NQ and MMLU, and 0-shot evaluation on the other datasets. On TriviaQA, we report performance on the Unfiltered and Wikipedia splits.

#### Baselines.

As baselines, we consider existing large language models with a size around 7B parameters, and which are trained using roughly the same amount of compute as Helium. More specifically, we include models that are trained on fewer than 2.5T tokens (compared to the 2.1T tokens that are used to train Helium), namely MPT [@MosaicML2023Introducing], Falcon [@almazrouei2023falcon], Llama 2 [@touvron2023llama] and OLMo [@Groeneveld2023OLMo]. We also include Mistral and Gemma, two popular open weights models that are trained using significantly more compute than `\helium`{=latex}.

#### Results.

We report results in [Table `\ref{tab:llm_eval}`{=latex}](#tab:llm_eval), and we observe that on most benchmarks, Helium is on par with or outperforms models using a similar amount of training compute. Even compared to Mistral and Gemma, which use up to 3x more compute for training, Helium obtains competitive results on some benchmarks such as ARC, Open-Book QA or Natural Questions. This validates the quality of our pre-training text data.

## Audio Tokenization {#sec:mimi_eval}

\looseness=-1

#### Metrics.

We then evaluate the semantic and acoustic performance of our neural codec, `\mimi`{=latex}. First, we evaluate whether the semantic tokens it produces provide targets that are amenable to language modeling. To do so, we compute a triphone-based ABX [@abx] error rate that characterizes the phonetic discriminability of a representation space by comparing distances between two embeddings of different instances of the same triphone (e.g. \`\`beg") and a negative triphone that differs minimally (e.g. \`\`bag"). More precisely, we compute a \`\`within speaker\" ABX where the three instances are pronounced by the same speaker, and report error rates on Librispeech [@librispeech] dev-clean with the default parameters of the Librilight [@librilight] repository[^12]. The resulting score has been shown to be a strong predictor of the ability of a downstream audio language model to produce coherent speech [@gslm]. Since we are interested in characterizing only the semantic token, we compute distances in the latent space produced after quantization with the semantic VQ only (i.e. before summing with acoustic tokens). `\looseness=-1`{=latex} Second, we evaluate the acoustic quality of reconstructed audio. As objective, automatic metrics, we rely on VisQOL [@hines2015visqol], a full-reference model of acoustic similarity, and MOSNet [@mosnet], a reference-free model of audio quality. Given the limitations of automatic evaluation of audio quality, we also perform human evaluations with a MUSHRA protocol. We rely on judgments of 20 listeners, each one rating 30 samples of 10s each. [Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) reports ablation studies using objective metrics, while [Table `\ref{tab:mimi_baselines}`{=latex}](#tab:mimi_baselines) provides a comparison with previous work both in terms of objective and subjective evaluation.

#### Baselines.

We compare against RVQGAN [@kumar2024high], SemantiCodec [@liu2024semanticodec], and SpeechTokenizer  [@zhang2024speechtokenizer]. RVQGAN is a pure acoustic tokenizer, in the sense that it does not encode semantic information. Thus, we only evaluate it in terms of audio quality. RVQGAN produces tokens at 75Hz, so we only keep the first two levels of RVQ to obtain a bitrate of 1.5kbps, closer to that of `\mimi{}`{=latex}. On the other hand, SpeechTokenizer relies on distillation to encode semantic information into its first token such that we can evaluate both its semantic and acoustic properties. We keep its first 3 RVQ levels to obtain a 1.5kbps bitrate. Similarly, SemantiCodec also encodes semantic and acoustic information such that it can be evaluated along both axes.
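The bitrates quoted here follow from the frame rate, the number of RVQ levels kept, and the codebook size. The sketch below reproduces this arithmetic; the codebook sizes (1024 entries for RVQGAN and SpeechTokenizer, 2048 for the 8-level codec of this work) are assumptions not stated in this section, and the helper name is hypothetical.

```python
import math


def bitrate_bps(frame_rate_hz: float, num_levels: int, codebook_size: int) -> float:
    """Bitrate of an RVQ codec: frames/s x levels x bits per code."""
    return frame_rate_hz * num_levels * math.log2(codebook_size)


# Codebook sizes below are assumptions used for illustration.
print(bitrate_bps(75.0, 2, 1024))   # RVQGAN, first two levels
print(bitrate_bps(50.0, 3, 1024))   # SpeechTokenizer, first three levels
print(bitrate_bps(12.5, 8, 2048))   # 8 levels at 12.5 Hz
```

Under these assumptions the three configurations come out at 1.5kbps, 1.5kbps and 1.1kbps respectively, matching the bitrates reported for the comparison.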

\resizebox{\textwidth}{!}{
  \begin{tabular}{ccccc|cccc}
    \toprule
    Quantization & Transformer & Transformer & WavLM & Split & ABX ($\downarrow$) & VisQOL ($\uparrow$) & MOSNet ($\uparrow$) & MUSHRA ($\uparrow$) \\
    Rate & in encoder & in decoder & distillation & quantizer &  &  &  \\
    \midrule
     \checkmark & \checkmark & \checkmark & \xmark & \xmark & 23.3\% & 2.91 & 2.89 & 65.9\pmr{1.7}\\ %
     \checkmark & \checkmark & \checkmark & \checkmark & \xmark & 6.5\% & 2.22 & 2.87 & 57.8\pmr{1.8} \\ %
     \checkmark & \xmark & \checkmark & \checkmark & \checkmark & 10.8\% & 2.79 & 2.85 & 59.7\pmr{1.7} \\ %
     \checkmark & \checkmark & \xmark & \checkmark & \checkmark & 8.1\% & 2.59 & 2.72 & 48.4\pmr{1.7} \\ %
     \xmark & \checkmark & \checkmark & \checkmark & \checkmark & 8.0\% & 2.45 & 2.88 & 68.3\pmr{1.7} \\ %
     \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & 8.1\% & 2.82 & 2.89 & 64.0\pmr{1.7} \\ %
    \bottomrule
  \end{tabular}
  }
\resizebox{\textwidth}{!}{
  \begin{tabular}{lcccc|ccc|c}
    \toprule
    Model & $f_s$ & $f_r$ & bitrate & causal & ABX ($\downarrow$) & VisQOL ($\uparrow$) & MOSNet ($\uparrow$) & \textsc{MUSHRA} ($\uparrow$) \\
    \midrule
    Ground Truth & 24kHz & - & - & - & - & - & 3.08 & 90.6\pmr{1.0}\\
    \midrule
    RVQGAN & 24kHz & 75Hz & 1.5kbps & \xmark & - & 1.74& 2.74 & 31.3\pmr{1.3}\\
    SemantiCodec & 16kHz & 50Hz & 1.3kbps & \xmark & 42.2\% & 2.43 & 3.12 & 64.8\pmr{1.5} \\
    SpeechTokenizer & 16kHz & 50Hz & 1.5kbps & \xmark & \textbf{3.3\%} & 1.53 & 2.67 & 45.1\pmr{1.5} \\
    SpeechTokenizer & 16kHz & 50Hz & 4.0kbps & \xmark & \textbf{3.3\%} & 3.07 & 3.10 & 74.3\pmr{1.5} \\
    \midrule
     Mimi, adv. loss only & 24kHz & 12.5Hz & 1.1kbps & \checkmark & 8.7\% & 1.84 & 3.10 & \textbf{81.0}\pmr{1.3} \\
     Same, downsampled at 16kHz & 16kHz & 12.5Hz & 1.1kbps & \checkmark & - & - & - & 77.7\pmr{1.4}\\
     Mimi, non adv. only& 24kHz & 12.5Hz & 1.1kbps & \checkmark & 8.1\% & 2.82 & 2.89 & 58.8\pmr{1.8}\\
    \bottomrule
  \end{tabular}}
\looseness=-1

#### Results - Semantic tokens.

[Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) shows that the phonetic discriminability of `\mimi`{=latex}'s semantic tokens, as measured by ABX, is poor in the absence of distillation and comparable to that of acoustic tokens in previous work [@audiolm]. This means that these semantic tokens are not amenable to capturing linguistic content from speech. In contrast, distilling WavLM into the semantic tokens significantly improves their phonetic discriminability, in particular when using a Transformer in `\mimi`{=latex}'s encoder. This can be explained by the fact that distilling a large Transformer-based encoder into a purely convolutional one is challenging, while increasing the capacity and receptive field of the encoder helps. Yet, we observe a conflict between acoustic losses and semantic distillation, as improving ABX implies reducing reconstruction quality (as measured by MUSHRA). Using a split RVQ as described in [Section `\ref{sec:splitrvq}`{=latex}](#sec:splitrvq) improves the trade-off between semantic properties and audio quality, improving MUSHRA from 57.8 to 64.0 while moderately degrading ABX from 6.5% to 8.1%.

\looseness=-1

#### Results - Acoustic tokens.

[Table `\ref{tab:mimi_ablations}`{=latex}](#tab:mimi_ablations) also shows a significant improvement in MUSHRA when adding a Transformer in the decoder. Similarly, using a quantization rate of $50\%$ significantly improves VisQOL. Quantization rate however does not improve perceived quality. More generally, we observe a poor correlation between VisQOL and MOSNet. In particular, [Table `\ref{tab:mimi_baselines}`{=latex}](#tab:mimi_baselines) shows that training `\mimi `{=latex}with adversarial losses only leads to a very low VisQOL of 1.84 which does not account for the high perceived audio quality. We thus rely on MUSHRA where raters are asked to judge the similarity of a reconstructed audio to its ground-truth anchor, with a score between 0 and 100. This human evaluation shows a significant improvement from using adversarial losses only, with a MUSHRA score of 81.0 against 58.8 when using the mix of loss functions used in Encodec. `\mimi `{=latex}moreover significantly outperforms RVQGAN [@dac] despite operating at a lower bitrate and modeling semantic information. `\mimi `{=latex}also provides higher reconstruction quality than SemantiCodec [@liu2024semanticodec] while operating at a 4$\times$ lower framerate. This property is crucial to achieve the low latency of `\ours{}`{=latex}, since generating one temporal frame of audio tokens with `\ours{}`{=latex} requires a full forward pass through the Temporal Transformer. Finally, both RVQGAN and SemantiCodec are non-causal, while `\mimi `{=latex}is fully causal and thus compatible with streaming inference and modeling of real-time conversations.

\looseness=-1

#### Discussion.

`\mimi `{=latex}overall provides high reconstruction quality while encoding semantic information, being fully causal, and operating at low framerate and bitrate. Consequently, `\mimi `{=latex}proves to be a well-suited audio tokenizer for training real-time audio language models. A collateral finding of our study is a concerning lack of correlation between objective and subjective audio quality metrics. In particular, while we find VisQOL to provide a reliable proxy for perceived quality when modifying the generator architecture, changing the training objective (e.g. removing reconstruction losses) moves the score in directions that are completely decorrelated from human perception. This observation underscores the open challenge of designing reliable objective proxies for perceived quality.

\small
\footnotesize

         Acoustic Delay            RQ-Transformer          Perplexity
  ---------------------------- ----------------------- -------------------
   $[0, 1, 2, 3, 4, 5, 6, 7]$     `\xmark `{=latex}     $\phantom{1}42.2$
   $[0, 1, 2, 3, 4, 5, 6, 7]$   `\checkmark `{=latex}   $\phantom{1}40.3$
   $[0, 2, 2, 2, 2, 2, 2, 2]$     `\xmark `{=latex}          $135.4$
   $[0, 2, 2, 2, 2, 2, 2, 2]$   `\checkmark `{=latex}   $\phantom{1}36.8$

  :  **Ablation study on the use of the RQ-Transformer**. All models are initialized with Helium and pretrained on audio. When not using RQ-Transformer, we predict the 8 levels of tokens with independent classification heads, following @musicgen. Note that perplexities are only comparable between models with a given delay, as the classification task is easier with more delay for higher tokens.

`\label{tab:ablation-rq}`{=latex}

\small
\footnotesize

  ---------------------------- --------------------------- ------------------- ------------------- -------------------- --------------------- --
            Acoustic                 Semantic Token             Depthwise             Inner             Transcript                 Transcript
             Delay                       Weight              Parametrization        Monologue       NLL ($\downarrow$)    Length ($\uparrow$)
   $[0, 0, 0, 0, 0, 0, 0, 0]$   `\phantom{00}`{=latex}1.0   `\cmark `{=latex}   `\xmark `{=latex}         $4.36$                        $486$
   $[0, 1, 1, 1, 1, 1, 1, 1]$   `\phantom{00}`{=latex}1.0   `\cmark `{=latex}                             $4.12$                        $529$
   $[0, 2, 2, 2, 2, 2, 2, 2]$   `\phantom{00}`{=latex}1.0   `\cmark `{=latex}                             $4.09$                        $519$
   $[0, 2, 2, 2, 2, 2, 2, 2]$             100.0                                                           $3.75$                        $538$
   $[0, 2, 2, 2, 2, 2, 2, 2]$             100.0             `\cmark `{=latex}                             $3.65$                        $602$
   $[0, 2, 2, 2, 2, 2, 2, 2]$             100.0             `\cmark `{=latex}   `\cmark `{=latex}        **2.77**                    **1920**
  ---------------------------- --------------------------- ------------------- ------------------- -------------------- --------------------- --

  :  **Ablation study on delay patterns, weight of the semantic token and `\interleaving{}`{=latex}**. All models are initialized with Helium, pretrained on audio and use the RQ-Transformer. We vary the weight of the semantic token while keeping the weight of other tokens (including the text token when using `\interleaving{}`{=latex}) to 1. As different delay patterns cannot be compared in terms of perplexity, we generate continuations from 3s prompts on the valid set, convert them into transcripts with Whisper [@whisper] and report their negative log-likelihood with LiteLlama-460M-1T`\protect`{=latex}[^13] along with their length (in characters) as proxies for linguistic quality.

`\label{tab:ablation-delays}`{=latex}

## Ablations on Generative Modeling {#sec:moshi_ablations}

#### Metrics.

We perform ablation studies to evaluate the impact of using an RQ-Transformer, as well as comparing delay patterns and weighting of various token levels. All models are initialized with `\helium{}`{=latex} for the Temporal Transformer and pretrained on audio data. When comparing models with the same delay pattern and without `\interleaving{}`{=latex}, we rely on perplexity, averaged over semantic and acoustic tokens. However, perplexities are not comparable between models with different delays, as they do not model the same conditional distributions. To compare delay patterns, we measure the ability of `\ours{}`{=latex} to generate intelligible, consistent speech by transcribing generations (conditioned on a 3s prompt) with Whisper [@whisper] and scoring the resulting transcripts with an external text language model. We rely on a lightweight text model, LiteLlama-460M-1T[^14], as it is more practical for continuous evaluation along training. We also report the length of the transcripts (in characters), as we find it to be a strong predictor of model quality (weak models typically collapse to silence).

#### Results - RQ-Transformer.

[Table `\ref{tab:ablation-rq}`{=latex}](#tab:ablation-rq) reports results for ablations on the use of an RQ-Transformer. We first replicate the setting of @musicgen with the delay pattern of $[0, 1, 2, 3, 4, 5, 6, 7]$, which means that each level of RVQ token is generated one timestep after the preceding level. In this context, we see that using an RQ-Transformer is not necessary, as it only provides a marginal improvement in perplexity. However, this delay pattern induces a theoretical latency of 8 timesteps, which amounts to 640ms, a latency that is incompatible with the requirements of a real-time dialogue model. We thus switch to a reduced latency of 240ms with the pattern $[0, 2, 2, 2, 2, 2, 2, 2]$. In that context, modeling RVQ tokens with an RQ-Transformer significantly improves perplexity over using separate classification heads. Thus, the RQ-Transformer becomes a critical component of generative models of RVQ tokens under strict latency constraints.
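
The latencies quoted for each delay pattern follow from the frame duration (80ms at 12.5Hz) and the largest delay in the pattern, since a frame can only be decoded once its most delayed token has been generated. A minimal sketch of this arithmetic, where the function name `theoretical_latency_ms` is an illustrative assumption:

```python
def theoretical_latency_ms(delays, frame_rate_hz=12.5):
    """Theoretical latency of a delay pattern, in milliseconds.

    delays: per-level delays in timesteps, e.g. [0, 2, 2, 2, 2, 2, 2, 2].
    One full frame is needed (hence the +1), plus the largest delay.
    """
    frame_ms = 1000.0 / frame_rate_hz
    return (max(delays) + 1) * frame_ms

musicgen_pattern = theoretical_latency_ms([0, 1, 2, 3, 4, 5, 6, 7])
reduced_pattern = theoretical_latency_ms([0, 2, 2, 2, 2, 2, 2, 2])
```

This gives 640ms for the first pattern and 240ms for the second, matching the values discussed above.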

#### Results - Additional ablations.

[Table `\ref{tab:ablation-delays}`{=latex}](#tab:ablation-delays) reports further ablations on delay patterns, the weight of the semantic token loss and our proposed Inner Monologue procedure, all using the RQ-Transformer. First, we compare three configurations of delays that are compatible with real-time dialogue. The $[0, 0, 0, 0, 0, 0, 0, 0]$ pattern represents the minimal latency of 80ms that can be obtained with `\mimi{}`{=latex} tokens at 12.5Hz. Allowing an additional 80ms of latency with one step of delay significantly improves the quality of generated speech, while 240ms of latency brings a further moderate improvement. In early experiments, we also observed that the individual losses per RVQ level were conflicting with one another, despite each level being more important for the final intelligibility and audio quality than the next one. We thus make two changes to the architecture and training process. We first increase the weight of the loss on predicting the semantic tokens to 100, while keeping it at 1 for all other levels of the audio tokens. This gives another boost to speech intelligibility. We furthermore reduce competition between RVQ levels by using a depthwise parametrization, as described in [Section `\ref{sec:joint_ar}`{=latex}](#sec:joint_ar), such that each RVQ level is predicted by its own set of weights in the Depth Transformer, rather than having shared weights across levels. Finally, the most drastic improvement to the quality and length of generated speech comes from enabling `\interleaving{}`{=latex}.

\looseness=-1

#### Discussion.

Beyond the choice of architecture and delay patterns, these ablations show how helpful modeling text tokens along the audio tokens with `\interleaving{}`{=latex} is, even in an audio-to-audio setting. Given the positive impact of depthwise parametrization and a weight of 100 for the semantic token, both are used in the subsequent experiments and our final training procedure. As described in [Table `\ref{tab:hparams}`{=latex}](#tab:hparams), we pretrain `\ours{}`{=latex} with an acoustic delay of 2 and finetune it with an acoustic delay of 1, for a theoretical latency of 160ms.

\resizebox{\textwidth}{!}{
    \begin{tabular}{lccccc}
    \toprule
    &
    \multicolumn{4}{c}{Audio metrics} &
    Text metrics \\
    \cmidrule{2-5}
    \cmidrule(l){6-6}
     Model & sWUGGY & sBLIMP & sTopic-StoryCloze & sStoryCloze & MMLU \\
    \midrule
    \textit{Audio only - Cold Start} \\
    \midrule[0.3pt]
    GSLM \citep{gslm} &  64.8 &   54.2 &  66.6 &  53.3 & $\varnothing$ \\
    AudioLM \citep{audiolm} &  71.5 &   \textbf{64.7} &   - &   - & $\varnothing$ \\
    TWIST \citep{twist} & 72.2 & 56.5 &  - & - & $\varnothing$ \\
    \midrule[0.01pt]
    \ours{} %
    & \textbf{74.8} & 59.9 & 80.9 & 56.9 & $\varnothing$ \\
    \midrule
    \textit{Audio only - Warm Start} \\
    \midrule[0.3pt]
    TWIST \citep{twist} & 74.5 & 59.2 & 76.4 & 55.4 & $\varnothing$ \\
    VoxtLM \citep{voxtlm} &  62.9 &   53.9 &   - &   - & $\varnothing$  \\
    Spirit-LM \citep{spiritlm} &  69.5 &   58.0 &   72.9 &   54.8 & $\varnothing$  \\
    \midrule[0.01pt]
    \ours{} %
    & 74.3 & 58.9 & 81.8 & 58.7 & $\varnothing$  \\
    \midrule
    \textit{Text and audio - Warm Start} \\
    \midrule[0.3pt]
    VoxtLM \citep{voxtlm} & 66.1 &  57.1 & - & - &  $\varnothing$ \\
    Spirit-LM \citep{spiritlm} & 69.0 & 58.3 & 82.9 & 61.0 & 36.9  \\
    \midrule[0.01pt]
    \ours{} after single-stream pretraining %
    & 72.6 &58.8 & 83.0 & 60.8 & \textbf{49.8} \\
    \ours{} after multi-stream instruct %
    & 63.0 & 55.2 & \textbf{83.6} & \textbf{62.7} & 49.7 \\
    \ours{} after multi-stream instruct, synthetic voice %
    & 60.9 & 54.6 & 82.5 & 60.9 & 48.7 \\
    \bottomrule
    \end{tabular}}

`\label{tab:textless}`{=latex}

## Audio Language Modeling {#sec:eval_audiolm}

\looseness=-1

#### Metrics.

We first measure the ability of `\ours{}`{=latex} to model speech sequences when being trained for next token prediction on large-scale audio data. To do so, we rely on \`\`textless NLP" [@gslm] metrics that evaluate an audio language model's linguistic knowledge by comparing likelihoods of positive and negative speech examples represented as audio tokens. sWUGGY evaluates a model's ability to learn a lexicon from speech by comparing the likelihood of an existing word and an invalid variant (e.g. \`\`oxidation" and \`\`accidation"), while sBLIMP evaluates syntactic contrasts. Spoken StoryCloze metrics introduced by @twist evaluate semantic contrasts by comparing commonsense five-sentence stories, with the last one being either coherent with the context or incoherent. Given the difficulty of this task in the audio domain, @twist also propose Spoken Topic-StoryCloze, a variant where the negative continuation is randomly sampled among unrelated sentences (rather than being subtly incoherent), resulting in higher scores. We score sequences with a negative log-likelihood normalized by the sequence length. Since our model produces several tokens per timestep, we sum all tokens of a timestep with the weights used during training, i.e. 100 for the semantic tokens and 1 for the acoustic ones. We do not include text tokens from the `\interleaving{}`{=latex}, as these metrics are designed to compare untranscribed audio sequences. Similarly, when evaluating a multi-stream model after instruction tuning, we only score the tokens from the user stream as it does not include text tokens. Finally, we also report text understanding evaluation (without audio tokens) on MMLU [@hendrycks2020measuring] for Spirit-LM and `\ours{}`{=latex}, as a way to measure how much the audio training affects the textual knowledge of the original checkpoint. `\looseness=-1`{=latex}
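
The scoring rule described above can be sketched as follows. This is an illustration under assumptions, not the evaluation code: it supposes that per-token negative log-likelihoods are already available as an array with one row per timestep and one column per token level (semantic first), and the names `sequence_score` and `pair_accuracy` are hypothetical.

```python
import numpy as np

def sequence_score(token_nll, semantic_weight=100.0, acoustic_weight=1.0):
    """Length-normalized weighted NLL of one audio sequence.

    token_nll: array of shape (T, Q) of per-token negative log-likelihoods,
    where column 0 is the semantic token and columns 1..Q-1 are acoustic.
    """
    token_nll = np.asarray(token_nll, dtype=float)
    weights = np.full(token_nll.shape[1], acoustic_weight)
    weights[0] = semantic_weight
    per_step = token_nll @ weights      # weighted sum over levels of a timestep
    return float(per_step.mean())       # normalize by the sequence length T

def pair_accuracy(pairs):
    """Fraction of (positive, negative) pairs where the positive scores lower."""
    correct = sum(sequence_score(pos) < sequence_score(neg) for pos, neg in pairs)
    return correct / len(pairs)
```

A pair is counted as solved when the positive example (e.g. the existing word in sWUGGY) receives the lower normalized score, so chance level in this sketch is 50%.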

#### Baselines.

We compare against baselines from the audio language modeling literature, in three settings. The first category encompasses audio-only models starting from a random initialization, including GSLM [@gslm], AudioLM [@audiolm] and TWIST-1.3B [@twist]. In this case, we report metrics for a single-stream `\ours{}`{=latex} initialized randomly and pretrained only on audio data and without `\interleaving{}`{=latex}. The second category includes models that start from a pretrained text LM and are then only trained on audio. This includes TWIST-13B as well as the audio-only version of VoxtLM (first row of [@voxtlm Table 3]) and that of Spirit-LM (reported as \`\`Speech Only" in [@spiritlm Table 5]). The corresponding `\ours{}`{=latex} model is similar to the one mentioned above (audio-only data, no `\interleaving{}`{=latex}) but starts from the pretrained `\helium{}`{=latex} checkpoint. The last category is composed of actual multimodal models that are trained jointly on speech and text data. In this context we report results for three configurations of `\ours{}`{=latex}. First, we report results for `\ours{}`{=latex} pretrained on single-stream data. Then, we report results for the final model after multi-stream post-training and finetuning, using real recordings from a voice actor to condition the creation of synthetic data in Moshi's voice. The last model is identical to the previous one except that it uses a synthetic voice for `\ours{}`{=latex}. We remind the reader that even if these models are trained with `\interleaving{}`{=latex}, they are evaluated without it, to provide a fair comparison with baselines. `\looseness=-1`{=latex}

#### Results.

[Table `\ref{tab:textless}`{=latex}](#tab:textless) reports results on audio language modeling. In the \`\`Audio only - Cold Start" setting, `\ours{}`{=latex} already provides a strong baseline, in particular considerably improving over previous work in sTopic-StoryCloze. When initialized with a `\helium{}`{=latex} checkpoint and trained on audio-only data, `\ours{}`{=latex} outperforms previous work in this category on most metrics. Finally, while multimodal training improves common sense reasoning from speech (as shown by sStoryCloze performance), we observe mixed effects on lexical and syntactic judgments (sWUGGY and sBLIMP) compared to models trained only on audio data. While single-stream pretraining moderately degrades sWUGGY and sBLIMP, instruction finetuning severely affects sWUGGY, which means that instructed models have a harder time solving lexical judgments. We hypothesize that this is due to finetuning `\ours{}`{=latex} on data of varying quality and simulating noisy and reverberated conditions for the user stream (used to score spoken pairs for all metrics in [Table `\ref{tab:textless}`{=latex}](#tab:textless)), which makes fine lexical judgments harder to solve. On the text side, Moshi scores more than 12 points higher on MMLU than Spirit-LM, thus demonstrating higher general knowledge and text understanding. We moreover emphasize that `\ours{}`{=latex} is the only model in [Table `\ref{tab:textless}`{=latex}](#tab:textless) that integrates both semantic and acoustic tokens into a single generative model, unlike AudioLM which uses three separate stages, and VoxtLM, TWIST and Spirit-LM that only model semantic tokens and rely on an external vocoder. Thus, `\ours{}`{=latex} is the only model in this comparison that demonstrates strong linguistic modeling in both speech and text, while being able to model speech in any arbitrary voice and condition.

#### Discussion.

\looseness=-1

While \`\`textless NLP" benchmarks have helped develop the first audio language models, we observe that they do not consistently provide good guidance in developing a dialogue model like `\ours{}`{=latex}. In particular, we find the lack of correlation between common sense metrics and lexical/syntactic judgments to be frequent and easily explainable by the diversity of acoustic conditions we use in training. Moreover, we do not observe a degradation in the model's lexical variety or intelligibility when finetuning the model, which contradicts the reduction in sWUGGY. This is why we also evaluate spoken question answering in the next section as a way to probe the model's common sense, knowledge and lexical abilities.

\footnotesize

  Model                                                            Web Q.                       LlaMA Q.            Audio Trivia QA
  ---------------------------------------------------- ------------------------------- --------------------------- -----------------
  *Audio only*
  GSLM [@gslm]                                            `\hphantom{0}`{=latex}1.5     `\hphantom{0}`{=latex}4.0         \-
  AudioLM [@audiolm]                                      `\hphantom{0}`{=latex}2.3     `\hphantom{0}`{=latex}7.0         \-
  TWIST (7B) [@twist]                                     `\hphantom{0}`{=latex}1.1     `\hphantom{0}`{=latex}0.5         \-
  `\ours `{=latex}(w/o `\interleaving{}`{=latex})       `\hphantom{0}`{=latex}**9.2**           **21.0**                  7.3
  *Text and audio*
  SpeechGPT (7B) [@zhang2024speechgpt]                    `\hphantom{0}`{=latex}6.5               21.6                   14.8
  Spectron (1B) [@spectron]                               `\hphantom{0}`{=latex}6.1               22.9                    \-
  `\ours `{=latex}                                                **26.6**                      **62.3**               **22.8**
  `\ours `{=latex}(w/o text batches in pre-training)                23.2                          61.3                   18.3
  *Text*
  Helium (text)                                                     32.3                          75.0                   56.4

  : **Evaluation of spoken question answering**. Spoken question answering (0-shot) on the Web Questions [@web_questions], LlaMA-Questions [@spectron], and Trivia QA [@joshi2017triviaqa] benchmarks, synthesized using a TTS engine. For the first two, we use the number reported by [@spectron]. For LlaMA-Questions, we use the audio provided by [@spectron]. For Web Questions and Trivia QA, we synthesize our own, keeping all of the questions. For `\ours{}`{=latex}, we only prepend one of the random incipits used during instruct fine tuning. We further provide the performance of our `\helium `{=latex}text-only model as a top line.

`\label{tab:qa}`{=latex}

## Spoken Question Answering

\looseness=-1

#### Metrics.

We evaluate the spoken question answering abilities of our multi-stream `\ours{}`{=latex} model. We rely on Spoken Web Questions and Llama Questions, both introduced by @spectron. We also synthesize an audio version of TriviaQA. When evaluating `\ours{}`{=latex}, we insert the audio tokens of the question into the user stream to simulate a user interaction, along with a final `EPAD` text token to trigger an immediate response from `\ours{}`{=latex}.

#### Baselines.

We compare to Spectron and baselines used by @spectron, all having been already introduced in [Section `\ref{sec:eval_audiolm}`{=latex}](#sec:eval_audiolm) except for SpeechGPT [@zhang2024speechgpt]. To measure the impact of `\interleaving{}`{=latex} on spoken fluency, we compare these baselines with `\ours{}`{=latex} trained with and without `\interleaving{}`{=latex}. As GSLM, AudioLM and TWIST are audio only, `\ours{}`{=latex} without `\interleaving{}`{=latex} provides a fair comparison. On the other hand, Spectron and SpeechGPT rely on Chain-of-Modality---they generate an answer first as text, and then as speech--- so we compare them to Moshi with `\interleaving{}`{=latex}. Moreover, to quantify an eventual degradation in knowledge due to training on audio data, we also compare to `\helium{}`{=latex} when evaluated on the textual counterpart to each spoken dataset. `\looseness=-1`{=latex}

#### Results.

[Table `\ref{tab:qa}`{=latex}](#tab:qa) reports accuracies on the three benchmarks. While audio-only `\ours{}`{=latex} significantly outperforms baselines in its category, the most striking result is the impact of `\interleaving{}`{=latex} on Moshi's performance, almost tripling its accuracy on all benchmarks. This is remarkable as `\interleaving{}`{=latex} only marginally increases inference cost (each multi-stream timestep requires generating 17 tokens, instead of 16 without). We emphasize that among all models in this comparison, Moshi not only provides the best spoken question answering performance, but is also the only one to model jointly semantic and acoustic tokens, such that it can handle interactions between arbitrary voices in many conditions. `\ours{}`{=latex} significantly outperforms SpeechGPT and Spectron while being the only model compatible with streaming inference, as Chain-of-Modality requires generating a full answer in text before generating speech, while `\interleaving{}`{=latex} generates both in a streaming fashion. Finally, we note that training Moshi with `\interleaving{}`{=latex} but without keeping 50% of text-only batches during pre-training leads to a noticeable drop in accuracy, showing that a warm start from Helium is not sufficient to retain the knowledge of the original text model throughout the training. `\looseness=-1`{=latex}

#### Discussion.

Despite the strong performance of `\ours{}`{=latex}, we observe a weaker performance than the base `\helium{}`{=latex} model, which is consistent with the reduced MMLU of 49.7 reported in [Table `\ref{tab:textless}`{=latex}](#tab:textless), down from 54.3 with `\helium{}`{=latex}. While the moderate differences on Web Questions and Llama Questions can be explained by training on audio data and thus reducing the number of parameters dedicated to textual knowledge, the very large difference on Trivia QA prompts us to inspect error patterns more thoroughly. We find that multiple-sentence questions (e.g. \`\`The Terror of the Monster was an early title for a best-selling novel which inspired one of the highest-grossing movies of the mid-70's. Under what name did it eventually terrify the reading and film going public?") or ones with specific syntactic structure (e.g. \`\`On the human body, a keloid is a type of what?") are challenging for `\ours{}`{=latex}, due to it being finetuned on oral-style conversations that do not display such patterns. We hypothesize that covering more syntactic scenarios during finetuning could reduce this gap.

\footnotesize
\resizebox{\textwidth}{!}{
    \begin{tabular}{lccccccc}
    \toprule
    Model& samples&temp&cond. PPL& IPU & Pause & Gap & Overlap \\
    \midrule
    Best non-cascaded \citep{nguyen-etal-2023-generative}
    & 50 & 1.0 & 195.9 & 41.4s & 13.8s & 10.7s & 6.1s \\
    Cascaded \citep{nguyen-etal-2023-generative}
    & 50 & 1.0 & 45.9 & 54.8s & 0.0s & 5.3s & 0.0s \\
    Ground Truth \citep{nguyen-etal-2023-generative}
    & 50 & $\varnothing$ & 65.0 & 53.5s & 5.5s & 4.4s & 3.6s \\
    \midrule
    \ours{} %
    & 1000 & 0.8 & \textbf{41.9} & 35.1s & 13.2s & 12.5s & 1.2s \\
    \ours{} %
    & 1000 & 0.9 & 56.7 & 44.7s & 9.1s & 7.5s & 2.2s \\
    \ours{} %
    & 1000 & 1.0 & 79.3 & 50.8s & 7.0s & 4.5s & 4.1s \\
    Ground Truth
    & 1000 & $\varnothing$ & 59.6 & 51.1s & 6.4s & 4.2s & 3.3s \\
    \bottomrule
    \end{tabular}
    }

`\label{tab:multistream}`{=latex}

## Quality and Statistics of Generated Dialogues {#sec:eval_dialogues}

#### Metrics.

\looseness=-1

Beyond single-turn question answering, we evaluate the linguistic quality of the generated dialogues by scoring them with an external language model, while also looking at turn-taking metrics. We follow the methodology of  @nguyen-etal-2023-generative. The turn-taking metrics are defined as follows: *Inter-Pausal Units* (IPU) are continuous stretches of speech that are separated by a silence of at least 0.2s on each side. *Pauses* are silences between IPUs of the same speaker. *Gaps* are silences between IPUs of different speakers. *Overlaps* are times where there are IPUs for both speakers. Following  @nguyen-etal-2023-generative, the semantic evaluation uses the open-source DialoGPT model [@dialogpt] and we compute the perplexity of the transcribed dialogue by separating each speaker using the `<|endoftext|>` token expected by DialoGPT. We select 1000 random 10-second prompts from the Fisher dataset and use `\ours{}`{=latex} to generate continuations. For each one, 32 continuations are generated, for 3 different temperatures, as temperature significantly affects the results.
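
The turn-taking definitions above can be illustrated with a frame-level sketch. It is a simplification under several assumptions and not the evaluation code of @nguyen-etal-2023-generative: voice activity is assumed to be given per frame for each speaker, with silences shorter than 0.2s already filled in; IPU time is summed over both speakers; and a silence is counted as a pause only when the same single speaker is active immediately before and after it, and as a gap otherwise. The function name `turn_taking_stats` is hypothetical.

```python
def turn_taking_stats(vad_a, vad_b, frame_s=0.1):
    """Total IPU, pause, gap and overlap durations (in seconds).

    vad_a, vad_b: equal-length sequences of booleans (True = speaking),
    one entry per frame of duration frame_s.
    """
    counts = {"ipu": 0, "pause": 0, "gap": 0, "overlap": 0}
    before = None   # speakers active in the last non-silent frame
    silence = 0     # length in frames of the current joint silence
    for a, b in zip(vad_a, vad_b):
        active = frozenset(name for name, on in (("A", a), ("B", b)) if on)
        if not active:
            silence += 1
            continue
        if silence and before is not None:
            # Classify the silence that just ended; leading silence is ignored.
            same = len(active) == 1 and active == before
            counts["pause" if same else "gap"] += silence
        silence = 0
        before = active
        counts["ipu"] += len(active)            # speaking time per speaker
        counts["overlap"] += len(active) == 2   # both speakers active
    return {key: value * frame_s for key, value in counts.items()}
```

Trailing silence after the last IPU is not attributed to either category in this sketch, since it is not enclosed by two IPUs.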

#### Baselines.

We compare to dGSLM [@nguyen-etal-2023-generative], as it is also a full-duplex generative model, trained on the Fisher dataset. @nguyen-etal-2023-generative use 50 prompts with 50 continuations for each and report results for their dialogue model as well as a cascaded topline model (ASR + LM + TTS).

#### Results.

[Table `\ref{tab:multistream}`{=latex}](#tab:multistream) shows that `\ours{}`{=latex} performs as well as the cascaded model in terms of linguistic quality, despite being an audio-to-audio model. Both have a perplexity that is better than the ground truth, which is explained by these models being trained on data that is closer to what DialoGPT has been trained on compared to the Fisher dataset. This is a strong improvement over the non-cascaded model from [@nguyen-etal-2023-generative], which is not able to generate coherent speech in this scenario.

## Streaming ASR and TTS {#sec:streaming_asr_tts}

\looseness=-1

#### Metrics.

[Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving) and [Appendix `\ref{app:streaming_tts}`{=latex}](#app:streaming_tts) describe how `\interleaving{}`{=latex} can provide a streaming TTS or streaming ASR system by simply changing the delay it uses between text and audio tokens. In particular, we train a streaming TTS model by delaying audio tokens by 2 seconds, giving some lookahead to the text tokens, and teacher forcing text tokens at inference. Similarly, we train a streaming ASR model by delaying text tokens by 2 seconds, allowing the model to listen to audio content before generating text tokens. In that case, at inference we teacher force the audio tokens. We perform TTS with a temperature of 0.6 while we use greedy decoding for ASR, and evaluate on LibriSpeech [@librispeech] test-clean in Word Error Rate (WER). For TTS, we first transcribe the generated audio with HuBERT-Large [@hubert] finetuned on LibriSpeech 960h,[^15] and only consider sequences between 4 and 10s, which allows comparing to baselines such as Vall-E [@valle]. We emphasize that no LibriSpeech data is seen during the training of our ASR and TTS systems.
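
For reference, Word Error Rate is the word-level edit distance between a hypothesis and its reference transcript, divided by the number of reference words. The sketch below is a generic implementation under assumptions: text is split on whitespace with no further normalization (the normalization used in the experiments is not specified here), the reference is non-empty, and the name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For TTS, the hypothesis would be the ASR transcript of the generated audio and the reference the input text; for ASR, the hypothesis is the generated text stream.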

\looseness=-1

#### Results.

Our streaming TTS model obtains a WER of 4.7% on LibriSpeech test-clean, which outperforms Vall-E's 5.9% WER but is worse than NaturalSpeech 3 [@ju2024naturalspeech] with 1.81%. Yet, `\ours{}`{=latex} only requires 2 seconds of lookahead whereas Vall-E and NaturalSpeech 3 require access to the full sequence. Our ASR system yields 5.7% WER, while a Streaming FastConformer [@noroozi2024stateful] gets 3.6% with a similar lookahead. Note that our ASR system also provides alignments of transcripts with a precision of 80ms (the frame duration of the Temporal Transformer).

\looseness=-1

#### Discussion.

This limited experimentation is not intended to compete with state-of-the-art systems (in particular for ASR), but is rather designed to illustrate how `\interleaving{}`{=latex} is flexible enough to cast several tasks into the same framework. We also emphasize that the standard evaluation on LibriSpeech test-clean does not provide a testbed to showcase strong capabilities of our TTS system, in particular its ability to model two speakers with multi-stream modeling, and generating consistent dialogues of spontaneous, expressive speech across 5 minutes (while Vall-E only evaluates sequences between 4 and 10 seconds of read speech). We reserve a thorough evaluation of streaming TTS for future work.

## Compressing `\ours{}`{=latex} and Impact on Speech Quality

With most modern LLMs built from billions of parameters, model size is a well-known bottleneck for practical uses such as running on resource-constrained devices (e.g. a laptop with a consumer-grade GPU) or model deployment (e.g. serving many users on an online web demo). To address this, Post-Training Quantization (PTQ) is a widely used efficiency technique for compressing model weights and activations, with the downside of possible performance degradation. Recent work has shown that LLMs can often successfully be quantized to 8 bits with integer quantization, and sometimes to even lower bitwidths using more advanced techniques to handle outlier weights [@pmlr-v202-dettmers23a; @llmint8; @optq; @quipsharp]. However, the literature on quantizing speech models is much scarcer than that on LLMs. Thus, in this section, we investigate how quantizing `\ours{}`{=latex} impacts its performance, both linguistically and especially acoustically, as we highlight certain audio degradations aggravated by model quantization. `\looseness=-1`{=latex}

#### Quantization Format.

To quantize `\ours{}`{=latex}, we follow common design choices from the PTQ literature. In all results below, we settle on the following setting: (i) Activations are stored in `bfloat16` precision (BF16) and dynamically quantized to 8 bits using symmetric quantization (a.k.a. AbsMax) at the input of every linear layer; (ii) The model weights are quantized using asymmetric quantization (a.k.a. MinMax) for different bitwidths and block sizes. This includes the weights of both the Temporal Transformer and the Depth Transformer. In fact, we find that the Depth Transformer is reasonably robust to quantization, as keeping only its weights in high precision does not significantly improve audio quality. Only the initial embedding layers (both for text and audio), the RMSNorms and the `\mimi `{=latex}codec are left unquantized. Finally, note that, although weight range setting is also common practice [@nagel2021whitepaperneuralnetwork], we do not finetune the obtained quantization scales using MSE as we find it has little impact on the quality of generated samples. `\looseness=-1`{=latex}
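The two quantizers named above can be sketched as follows. This is a textbook illustration in plain Python, not the implementation used for the experiments: the function names are hypothetical, and details such as the rounding mode or how the offset is stored are assumptions.

```python
from typing import List, Tuple


def absmax_quantize(x: List[float], bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric (AbsMax) quantization: signed codes, zero maps to zero."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, scale


def minmax_quantize_block(w: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric (MinMax) quantization of one block: codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(w), max(w)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [max(0, min(levels, round((v - lo) / scale))) for v in w]
    return codes, scale, lo


def minmax_dequantize_block(codes: List[int], scale: float, lo: float) -> List[float]:
    """Map integer codes back to approximate floating point weights."""
    return [lo + scale * c for c in codes]


def quantize_weights(w: List[float], bits: int, block_size: int):
    """Quantize a flat weight vector block by block, one (scale, offset) pair per block."""
    return [
        minmax_quantize_block(w[i : i + block_size], bits)
        for i in range(0, len(w), block_size)
    ]
```

Smaller blocks give each group of weights its own range, which limits the damage done by outliers at the cost of storing more quantization parameters.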

#### Results - Linguistic evaluation.

To assess how quantization impacts the reasoning ability of the model, we evaluate the quantized models' performance on the MMLU benchmark for the base Helium model trained on text-only data used as foundation for Moshi ([Table `\ref{tab:quanthelium}`{=latex}](#tab:quanthelium)), as well as for Moshi itself ([Table `\ref{tab:quantmoshi}`{=latex}](#tab:quantmoshi)). Generally, Helium is more robust to quantization than the final trained Moshi. Notably, assuming quantization blocks of size 32, quantizing Helium weights to 4 bits yields a 3.43 times smaller model which remains within 2 points of MMLU of the floating point baseline. This particular quantization format is also almost identical to llama.cpp's[^16] Q4_0, hence can be readily deployed for efficient inference.

In contrast, the same quantization recipe used on Moshi leads to a more drastic loss of performance, ranging from 5 to 10 points of MMLU. In the online demo, we keep the weights in 8-bit format as it results in a more reasonable drop of 2 points for a model roughly half the size of the floating point baseline.
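The storage cost of block-wise quantization can be estimated with a short calculation. The sketch below assumes that each block stores two float16 parameters (a scale and an offset), which is a natural reading of asymmetric quantization but is not stated explicitly; the function names are hypothetical.

```python
def effective_bits_per_weight(
    bits: int, block_size: int, param_bits: int = 16, params_per_block: int = 2
) -> float:
    """Weight bits plus the per-block quantization parameters, amortized over the block."""
    return bits + params_per_block * param_bits / block_size


def compression_ratio(baseline_gb: float, quantized_gb: float) -> float:
    """How many times smaller the quantized model is than the baseline."""
    return baseline_gb / quantized_gb


# 4-bit weights with blocks of 32: one extra bit per weight of overhead.
print(effective_bits_per_weight(4, 32))
# Helium sizes as reported: about 15GB in BF16, 4.37GB in W4A8.
print(round(compression_ratio(15.0, 4.37), 2))
```

Reported model sizes also include the layers left unquantized, so this estimate only covers the quantized weights.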

\footnotesize

  BF16A8 ($\sim$ 15GB)         54.3
  ------------------------ ------------- ------- -------
  Bitwidth vs Block size    per-channel    256     32
  W8A8 (7.66GB)                53.96      54.09   53.81
  W6A8 (6.02GB)                53.50      53.55   53.86
  W5A8 (5.20GB)                52.80      53.22   52.76
  W4A8 (4.37GB)                49.29      50.84   52.97
  W3A8 (3.55GB)                25.49      44.15   50.85
  W2A8 (2.73GB)                23.87      23.00   24.27

  : `\label{tab:quanthelium}`{=latex}**Linguistic impact of model compression on `\helium`{=latex}**, as measured by MMLU. \`WXA8' indicates a model with weights quantized to \`X' bits and activations to 8 bits, using integer scalar PTQ. The model size in brackets is given in gigabytes for a quantization block size of 32, and takes into account both the model weights and the quantization parameters stored in float16. With a fine enough granularity of quantization blocks, a 4-bit model stays within 2 points of MMLU of the floating point baseline.

\footnotesize

**a)** Moshi after single-stream pretraining `\smallskip `{=latex}

  BF16A8 (15.24GB)          49.8
  ------------------------ ------ ------
  Bitwidth vs Block size    256     32
  W8A8 (8.33GB)             48.8   48.5
  W6A8 (6.95GB)             48.5   49.1
  W5A8 (6.02GB)             47.4   48.5
  W4A8 (4.64GB)             44.7   45.7
  W3A8 (3.72GB)             26.1   35.7
  W2A8 (2.80GB)             23.4   24.4

  : `\label{tab:quantmoshi}`{=latex}**Linguistic impact of model compression on `\ours`{=latex}**, measured by MMLU for different quantized Moshi on the text tokens generated by `\interleaving{}`{=latex} directly. As for the previous table, the model size is indicated for the block size of 32.

\hfill

**b)** Moshi after multi-stream instruct `\smallskip`{=latex}

  BF16A8 (16.74GB)          49.7
  ------------------------ ------ ------
  Bitwidth vs Block size    256     32
  W8A8 (9.20GB)             47.6   47.6
  W6A8 (7.70GB)             48.1   48.3
  W5A8 (6.69GB)             46.7   47.2
  W4A8 (5.18GB)             39.8   42.2
  W3A8 (4.18GB)             27.7   29.9
  W2A8 (3.17GB)             24.5   24.9


\looseness=-1

#### Results - Audio Quality.

To assess the audio quality of samples generated by the quantized models, we make use of the MOSNet metric from [@mosnet] as implemented in speechmetrics.[^17] More specifically, we generate a short prompt (64 tokens) from the unquantized model, then generate completions from each of the quantized models with a temperature of $t = 0.8$ and a sequence length of 1024 tokens. We repeat this process 500 times, and report the distribution of MOSNet scores over non-overlapping windows in [Figure `\ref{fig:quantmosnet}`{=latex}](#fig:quantmosnet). While the MOSNet scores exhibit a large variance across samples, there is generally little degradation of the audio quality after quantizing the model's weights down to 4 bits. For lower bitwidths, however, we observe qualitatively that the MOSNet scores lack sensitivity towards some of the more severe audio degradations caused by aggressive quantization: for instance, MOSNet does not distinguish pure audio artifacts (e.g., noisy voice) from artifacts in the speech pattern (e.g., increased repetitiveness of the model). This is in line with the lack of consistency between objective and subjective audio quality metrics that we observed in [Section `\ref{sec:mimi_eval}`{=latex}](#sec:mimi_eval). In addition, MOSNet was designed for a very different type of benchmark, namely to mimic human ratings of converted speech, so it is not surprising that it is insensitive to such artifacts. Instead, to measure the presence or absence of such degradation in the audio samples, we first observe that certain audio artifacts are identifiable from the entropy spectra of the generated text and audio tokens: a few examples are illustrated in [Figure `\ref{fig:artifactsexample}`{=latex}](#fig:artifactsexample) and we further detail the types of artifacts and how we measure them in [Appendix `\ref{app:artifactsmetric}`{=latex}](#app:artifactsmetric).

<figure data-latex-placement="t">
<p><img src="figures/quantized_models_mosnet/codec_5bd939b7/mosnet_scores_8cf6db67_t=0.8_bits=4_5_ch=1.png" style="height:27.5%" />  <img src="figures/quantized_models_mosnet/codec_5bd939b7/mosnet_scores_8cf6db67_t=0.8_bits=3_5_ch=1.png" style="height:27.5%" />  <img src="figures/quantized_models_mosnet/codec_5bd939b7/mosnet_scores_8cf6db67_t=0.8_bits=2_5_ch=1.png" style="height:27.5%" /></p>
<figcaption><strong>Acoustic impact of model compression on Moshi</strong>. MOSNet evaluation of samples generated by models compressed for different bitwidths. We evaluate the MOSNet scores across non-overlapping windows of 5s, and report the distribution of these scores over 500 samples for each model. </figcaption>
</figure>

<figure data-latex-placement="t">
<img src="figures/quantized_models_mosnet/quant_artifacts_examples/good_sample_2.png" />
<p><strong>(a)</strong> Entropy spectrum of a well-behaved sample (no noticeable degradation). Short silences occur naturally in Moshi’s output due to the model’s multi-stream abilities (reflecting the other speaker’s turn)</p>
<p>  </p>
<p><img src="figures/quantized_models_mosnet/quant_artifacts_examples/W2_gibberish.png" /> <strong>(b)</strong> Significant degradations occur at low bitwidth (W2A8). These are not always well reflected by the MOSNet scores’ magnitude, but the entropy of the text token is visibly higher.</p>
<figcaption><strong>Audio artifacts caused by model compression</strong>. Examples of typical entropy spectra capturing specific audio artifacts caused by model quantization. For each timestep, we compute the entropy over the past 128 tokens, independently for the text and audio codebook tokens. Then, we measure the presence or absence of the different artifacts over non-overlapping windows of 64 tokens, as described in <a href="#app:artifactsmetric">the appendix</a>.</figcaption>
</figure>
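A sliding-window entropy of this kind can be sketched as follows. The sketch assumes that the entropy is the empirical Shannon entropy of the token histogram within the window, which is one plausible reading of the caption; the exact definition and the artifact thresholds are those of the appendix and are not reproduced here. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def window_entropy(tokens: Sequence[int]) -> float:
    """Empirical Shannon entropy (in bits) of the tokens in one window."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def entropy_spectrum(tokens: Sequence[int], window: int = 128) -> List[float]:
    """Entropy over the past `window` tokens, computed at every timestep."""
    return [
        window_entropy(tokens[max(0, t - window + 1) : t + 1])
        for t in range(len(tokens))
    ]
```

With this definition, a stream stuck on a single token has an entropy of zero, while near-uniform random tokens push the entropy towards its maximum for the window size, so both repetitive and gibberish outputs stand out from well-behaved samples.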

Following this insight, we measure the presence or absence of different audio artifacts on the same generated audio samples as the ones used in the previous MOSNet analysis. We report the results in [Table `\ref{tab:artifactssummary}`{=latex}](#tab:artifactssummary), as well as a more detailed per-timestep analysis in [Figure `\ref{fig:artifacts}`{=latex}](#fig:artifacts) of [Appendix `\ref{app:artifactsmetric}`{=latex}](#app:artifactsmetric). At a bitwidth of 4, we again observe little audio degradation. When decreasing to a 3-bit format, audio degradations are more apparent and tend to become more frequent as generation progresses, although the finer-granularity quantization format is generally more robust to these artifacts. Nevertheless, both quantization formats display significantly degraded audio quality when weights are aggressively quantized to 2 bits, which we also observe qualitatively.

\footnotesize

  ------------------- ----------- ------- ------------ ------------ --------------
  Model / Artifacts    Gibberish   Noisy   Background   Repetitive   No artifacts
                         audio     audio     noise         text
  unquant                           4.1        0.1          0.1          95.8
  W4A8, block=32                    3.8        0.1          0.4          95.7
  W4A8, block=256         0.1       3.7                     2.2          94.0
  W3A8, block=32          0.5       4.7        5.9          8.1          80.7
  W3A8, block=256         0.2      12.2        3.1         21.9          62.7
  W2A8, block=32         12.7      40.9        0.5          0.4          45.4
  W2A8, block=256        83.1                              11.0           5.9
  ------------------- ----------- ------- ------------ ------------ --------------

  : **Distribution of audio artifacts caused by model compression**. Percentage of audio artifacts measured in the entropy spectrum of text and speech generated tokens, as described in [Appendix `\ref{app:artifactsmetric}`{=latex}](#app:artifactsmetric). These results are averaged across 500 samples generated by different versions of the same quantized Moshi, and across 16 timesteps of 64 tokens. Values of 0% are omitted in the table for better readability. {#tab:artifactssummary}

\looseness=-1

#### Discussion.

The linguistic abilities of Moshi are more sensitive to quantizing the model weights and activations than its output audio quality. More specifically, the audio quality remains close to that of the floating point baseline down to 4 bits of precision, even when quantizing the full model, including the Depth Transformer. In contrast, the MMLU performance suffers significant drops when quantizing the model weights below 6 bits using post-training quantization only. Following recent quantization techniques [@quipsharp], we may expect improved performance at lower bitwidths by using quantization-aware finetuning instead of PTQ. However, as Moshi's training pipeline from [Section `\ref{sec:maindatasets}`{=latex}](#sec:maindatasets) involves multiple stages and training datasets, this would require a more thorough investigation into designing quantized training phases and calibration datasets that preserve all of `\ours{}`{=latex}'s abilities that are otherwise lost after quantization.

# Safety {#sec:safety}

In parallel with the development of Moshi, we explore different directions related to the safety of AI generated content. In this section, we specifically consider several questions regarding the content generated by Moshi, each addressed in a dedicated subsection:

1.  How does our model behave in terms of producing toxic content?

2.  How do we prevent the model from regurgitating audio content from the training set?

3.  How do we ensure that the model uses the voice we intend to give to Moshi?

4.  How can we identify whether a given piece of content has been generated by Moshi?

\looseness=-1

## Toxicity Analysis

In recent years, the scientific community has devoted some effort to addressing bias and toxicity problems in text generation models. In contrast, audio safety is far less developed. It is not straightforward to make an apples-to-apples comparison between audio and text models, as they differ in their usage, and multiple meanings are conveyed by non-verbal signals (irony, tone, etc.). In spite of these limitations and in order to facilitate the comparison of Moshi with text generation models, in this first analysis we restrict our toxicity analysis to the text produced by the model. We adopt the ALERT benchmark[^18] [@tedeschi2024alert], which evaluates safety under multiple categories (hate, self-harm, weapon, crime, sex, substance). [Table `\ref{tab:results_toxicity}`{=latex}](#tab:results_toxicity) in [Appendix `\ref{app:safety}`{=latex}](#app:safety) reports our detailed toxicity analysis on this benchmark. The aggregated score for Moshi and popular text-only models is as follows: `\medskip`{=latex}

`\noindent`{=latex} `\resizebox{\linewidth}{!}{%
\begin{tabular}{l|c|ccccccccccc}
\toprule
\text{Category} & \text{Moshi} & \text{GPT-3.5} & \text{GPT-4} & \text{Llama 2}&\text{Alpaca} & \text{Vicuna} & \text{Falcon} & \text{Mistral} & \text{Mixtral} & \text{Zephyr} & \text{OLMo}\\
\midrule
\text{Overall Safety Score}&\colorbox{Red!0}{83.05}  & \colorbox{Orange!0}{96.95} & \colorbox{Sand!0}{99.18} & \colorbox{Sand!0}{99.98} & \colorbox{Red!0}{62.13} & \colorbox{Orange!0}{95.75} & \colorbox{Red!0}{88.11} & \colorbox{Red!0}{75.45} & \colorbox{Orange!0}{98.22} & \colorbox{Red!0}{77.86} & \colorbox{Red!0}{85.90}\\
\bottomrule
\end{tabular}}`{=latex} `\medskip`{=latex}

With this analysis, we see that Moshi falls into the middle of this table in terms of rank (8th out of the 11 models). The industry models perform best, which is expected considering the massive amount of private annotation, red-teaming and feedback loops from which these models have benefited.

## Regurgitation Analysis {#sec:regurgitation}

The problem of a model generating content which it has seen at training time, which we refer to as *regurgitation*, is closely related to overfitting: the more a model has seen a sequence or a subsequence during training, the more likely it is to generate this exact sequence during the generation process. Note that, for a speech model, it is not only the text that can be regurgitated, but also the voice pitch, tone, and potentially the background melody if present at training time. It is therefore important to mitigate[^19] potential intellectual property issues related to regurgitation, such as reproduction of copyrighted content or audio generation with the voice of a person without permission.

#### Evaluation protocol.

\looseness=-1

For each model, we measure the proportion of generations (out of 100,000) that reproduce the most frequent audio segment detected in our whole training dataset. For this purpose, we have first developed a matching system that detects the most frequent audio segments, see [Appendix `\ref{sec:audiomatching}`{=latex}](#sec:audiomatching). We select the most frequent one that is long enough (16 seconds) and easy to detect from text and audio. We measure the proportion of generations that exactly match this most frequent segment. For the matching, we initially use both audio and text matching, but observe that text-based matching has a higher recall for the initial matching step. We manually verify all the generations to filter out outliers that are not exact matches. `\smallskip`{=latex}

\footnotesize

                       prompted (3s)            deduplicated              fine-tuned                  temp.                regurgitation rate (%)
  --------------- ----------------------- ------------------------- ----------------------- -------------------------- ------------------------------- -- -- --
  single-stream                                                                              0`\hphantom{.0}`{=latex}               0.00
                                                                                                       0.6                          0.13
                                                                                                       0.8                          0.19
                                                                                                       1.0                          0.16
                   `\checkmark `{=latex}                                                     0`\hphantom{.0}`{=latex}   100.00`\hphantom{00}`{=latex}
                   `\checkmark `{=latex}                                                               0.8               98.40`\hphantom{0}`{=latex}
                                           `\checkmark   `{=latex}                           0`\hphantom{.0}`{=latex}               0.00
                                           `\checkmark   `{=latex}                                     0.8                          0.00
                   `\checkmark `{=latex}   `\checkmark   `{=latex}                           0`\hphantom{.0}`{=latex}               0.00
                   `\checkmark `{=latex}   `\checkmark   `{=latex}                                     0.8                          0.00
  multi-stream                                                       `\checkmark `{=latex}             0.8                          0.00
                   `\checkmark `{=latex}                             `\checkmark `{=latex}             0.8                          0.00
                                           `\checkmark   `{=latex}   `\checkmark `{=latex}             0.8                          0.00
                   `\checkmark `{=latex}   `\checkmark   `{=latex}   `\checkmark `{=latex}             0.8                          0.00

  : **Regurgitation of training data with condition-free generation** from different models. We measure how many times each model generates the most frequent duplicated audio segment in the training data, for different values of the temperature. With dataset deduplication, we do not observe any exact re-generation (out of $10^5$) of the most frequent segment, even if we prompt the model with the first 3s of this audio segment.`\label{tab:regurgitation}`{=latex}

`\noindent `{=latex}*Unconditioned and prompted generation:* We first measure what happens with unconditional generation, to evaluate whether the model tends to generate specific sequences when not guided by a prompt. In a complementary manner, we prompt the model with the first 3 seconds of the most frequent audio segment and measure how many times the continuation is identical to this training set audio. [Table `\ref{tab:regurgitation}`{=latex}](#tab:regurgitation) reports these regurgitation results.

#### Results & Impact of fine-tuning.

We observe that the pre-trained model trained on the raw dataset often generates frequent sequences from the training set. The sampling temperature has an important effect on the regurgitation rate: the values typically employed for generation (0.6--1.0) are more prone to regurgitation. Out of 1000 generations, the model fine-tuned for conversation does not generate the most frequent training sequence. As a disclaimer, we point out that fine-tuning could potentially be overridden and therefore may not be sufficient *per se* to avoid regurgitation.

Similar to what happens with textual models [@carlini2022quantifying], regurgitation is significantly impacted by the number of times that the model uses a given sequence for training. Therefore, we evaluate the impact of deduplicating the training dataset by identifying all the audio segments that are frequent, and in turn by filtering them out at training time. In [Table `\ref{tab:regurgitation}`{=latex}](#tab:regurgitation), we observe that this pre-processing step brings the number of regurgitations of the most frequent sequence to zero, even without any fine-tuning step.

## System Voice Consistency {#sec:voice_consistency}

A potential risk for a speech-to-speech model is unauthorized voice generation. The model should use its target voice and not mimic the user's voice. In order to evaluate to what extent Moshi adopts the voice of the user instead of the target voice, we use the following protocol:

- Generate 100 hours of conversations between Moshi and a second synthetic speaker.

- Run a speaker verification model (WavLM [@wavlm] large) on each segment to extract the speaker embeddings.

- Compute the cosine similarity between the embeddings of each main speaker's segment with *(i)* the first segment of the main speaker and *(ii)* with the first segment of the generated speaker.

- `\noindent `{=latex}*Note:* we exclude all the segments with a start time before 15 seconds so as to avoid counting the first turn of speech of the main speaker as it acts as the reference.
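The comparison step of this protocol can be sketched as follows, given speaker embeddings that have already been extracted. This is an illustrative sketch: the function names are hypothetical, and the embeddings are plain lists of floats standing in for the output of the speaker verification model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def voice_consistency(
    segments: List[Sequence[float]],
    main_ref: Sequence[float],
    other_ref: Sequence[float],
) -> Tuple[int, int]:
    """Count the main speaker's segments closer to its own reference vs. the other speaker's."""
    closer_to_main = sum(
        cosine_similarity(e, main_ref) > cosine_similarity(e, other_ref)
        for e in segments
    )
    return closer_to_main, len(segments) - closer_to_main
```

The two counts returned per conversation can then be summed over the dataset to obtain the proportion of segments that stay closer to the main speaker's reference.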

\looseness=-1

Over the generated dataset, there are 10 249 occurrences (98.7%) where the voice of the main speaker is closer to the reference segment of the main speaker and 133 occurrences (1.3%) where the voice is closer to the reference segment of the other speaker. We are also interested in how speaker consistency evolves over time. Following @soundstorm we compute the same ratio as above but on groups of segments that start at specific times, to measure drift over time. [Table `\ref{tab:speaker_consistency}`{=latex}](#tab:speaker_consistency) shows that speaker consistency remains stable over time, meaning that we do not observe a drift as the conversation goes on. This shows that the simple choice of using a consistent voice for the system during instruction tuning is enough to provide robustness at inference time.

\footnotesize

  segment start time (seconds)    20--25   25--30   30--35   35--40   40--45
  ------------------------------ -------- -------- -------- -------- -------- -- --
  samples                          2034     2006     1998     2019     1994
  main $>$ other                  98.4%    99.2%    99.1%    99.2%    99.3%

  : **Speaker consistency over time.** We measure how often the speaker embedding from Moshi's segment is closer to its reference segment than to that of the user, when computing speaker embeddings from segments further away from the reference. {#tab:speaker_consistency}

## Identification of the Content Generated by Moshi: Watermarking {#sec:watermarking}

For determining if a given audio has been generated by Moshi, we have investigated two complementary solutions: indexing and watermarking. The first, namely audio indexing, only applies in the case where we have access to the machine that generates the content, like in the case of the Moshi demo. We describe our audio matching system in [Appendix `\ref{sec:audiomatching}`{=latex}](#sec:audiomatching). Below in this subsection, we discuss more specifically watermarking, where the objective is to add unnoticeable marks to the generated audio.

#### Evaluation of signal-based watermarking.

We investigate if existing watermarking methods for audio can be used as a way to re-identify content generated by Moshi. For this purpose, we analyze the robustness of the Audioseal method [@san2024proactive] in our context. It is available as an open-source library.[^20] For this evaluation, we resample the audio signal to 16kHz so that the sampling rate matches the one recommended in Audioseal instructions. We measure the average mark detection scores in the following settings:

- No watermark: we measure the detection score when no mark was added;\

- Watermark no attack: no modification of the watermarked audio signal;\

- Pink noise: we add a small pink noise ($\sigma=0.2$) to the watermarked audio;\

- RVQGAN: we compress and decompress the audio signal with a recent state-of-the-art auto-encoder [@dac]. We use the publicly available pre-trained 16kHz model[^21] which differs from the 24kHz model used as a baseline in [Section `\ref{sec:mimi_eval}`{=latex}](#sec:mimi_eval).\

- `\mimi{}`{=latex} auto-encoder: we use our own tokenizer to compress and decompress the signal. This operation is performed with 24kHz audio and therefore involves two re-sampling stages (from 16kHz to 24kHz and back to 16kHz).

We report the results in [Table `\ref{tab:audioseal}`{=latex}](#tab:audioseal). We observe that the mark yields high detection rates when the audio is unchanged. With aggressive Pink-Noise, one needs a relatively long sequence to get a high detection score. However, the mark is not robust to a strong compression: the two auto-encoders that we consider are low bitrate and therefore discard anything not related to the signal reconstruction. As a result, our `\mimi{}`{=latex} codec removes the mark to a level that makes a watermarked audio indistinguishable from a non-watermarked audio, making such a signal-based watermarking useless in this context.

\footnotesize

                                                                                                                                                                                                   average detection score
  ------------- -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------- ----------
  3-4           $\downarrow$ audio post-processing `\quad`{=latex}`\quad`{=latex}`\quad`{=latex}`\quad`{=latex}`\quad`{=latex}`\quad  `{=latex}  `\hfill `{=latex}audio duration $\rightarrow$                  10 seconds   1 minute
  No mark       none                                                                                                                                                                                                0.0855     0.2474
  Watermarked   none                                                                                                                                                                                                0.9999     0.9999
  Watermarked   pink-noise (noise std $\sigma=0.2$)                                                                                                                                                                 0.7093     0.9019
  Watermarked   RVQGAN compression & decompression                                                                                                                                                                  0.1101     0.2662
  Watermarked   `\mimi `{=latex}compression & decompression                                                                                                                                                         0.0805     0.2404

  : **Evaluation of Audioseal** [@san2024proactive] for watermarking the speech produced by Moshi. Each detection score is averaged over 1000 generations. `\label{tab:audioseal}`{=latex}

#### Exploration on generative-based watermarking for audio.

Given that a recent state-of-the-art signal-based audio watermarking is not robust to a simple non-adversarial auto-encoding method, we investigated the possibility of watermarking the generation process itself. This solution was recently proposed for text generation, in particular in the works of [@aaronson2023watermarking] and [@kirchenbauer2023watermark]. These two methods operate similarly: at sampling time, they bias the probabilities driving the generation process. They differ from each other by how they modify the probabilities, yet in both cases the sampling is parameterized by a hash function that preferably depends on a local context. These solutions were improved by [@fernandez2023three], who proposed a better mark detector, in particular by addressing the issue of repetitive patterns.
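To make the mechanism concrete, the sketch below implements a simplified logit-bias scheme in the spirit of [@kirchenbauer2023watermark]: a hash of the local context selects a "green" subset of the vocabulary, sampling is biased towards it, and detection counts green tokens. It is a toy illustration on a uniform model, not the method evaluated here; all function names and parameter values are hypothetical.

```python
import hashlib
import math
import random
from typing import List, Sequence, Set


def green_list(context: Sequence[int], vocab_size: int, gamma: float, key: str) -> Set[int]:
    """Pseudo-random vocabulary subset, seeded by a keyed hash of the local context."""
    payload = key + ":" + ",".join(str(t) for t in context)
    seed = int.from_bytes(hashlib.sha256(payload.encode()).digest()[:8], "big")
    return set(random.Random(seed).sample(range(vocab_size), int(gamma * vocab_size)))


def watermarked_sample(logits: List[float], context: Sequence[int], gamma: float,
                       delta: float, key: str, rng: random.Random) -> int:
    """Add `delta` to the logits of green tokens, then sample from the softmax."""
    green = green_list(context, len(logits), gamma, key)
    shifted = [x + delta if i in green else x for i, x in enumerate(logits)]
    peak = max(shifted)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) for x in shifted]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def detection_z_score(tokens: Sequence[int], context_len: int, vocab_size: int,
                      gamma: float, key: str) -> float:
    """z-score of the green-token count against the no-watermark expectation."""
    scored = len(tokens) - context_len
    if context_len < 1 or scored <= 0:
        raise ValueError("need context_len >= 1 and more tokens than the context length")
    hits = sum(
        tokens[t] in green_list(tokens[t - context_len : t], vocab_size, gamma, key)
        for t in range(context_len, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(gamma * (1 - gamma) * scored)


def generate(n_tokens: int, vocab_size: int, context_len: int, gamma: float,
             delta: float, key: str, seed: int) -> List[int]:
    """Toy generator with uniform logits; delta=0 disables the watermark."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size) for _ in range(context_len)]
    logits = [0.0] * vocab_size
    for _ in range(n_tokens):
        context = tokens[-context_len:]
        tokens.append(watermarked_sample(logits, context, gamma, delta, key, rng))
    return tokens
```

Because the green list depends on the preceding tokens, any change to those tokens between generation and detection changes the hash and breaks the count, which is why token stability matters for this family of methods.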

\looseness=-1

We have investigated how to apply these discrete watermarking methods to our audio generation pipeline. For this purpose, we need to encode the audio signal back to tokens in order to identify if the mark is present or not. One issue is that *the codec is not idempotent*: if we generate a waveform from tokens and then re-encode it back into tokens, the re-encoded tokens are, with high probability, different from the ones originally generated, even if no noise has been added to the audio. We quantify this problem in [Table `\ref{tab:idempotence}`{=latex}](#tab:idempotence). The semantic token is robust to some extent, while the other quantization indices are increasingly less robust as they depend on the previous quantizer level. One key issue is that the tokens are not robust to a moderate temporal shift. This is especially true for the `\mimi `{=latex}codec, which is purposely optimized on a perceptual objective, as opposed to a fidelity reconstruction criterion.

`\footnotesize`{=latex} `\resizebox{\textwidth}{!}{
\begin{tabular}{lccr@{}c@{\ }ccccccc}
\toprule
 & \multicolumn{2}{c}{attacks} & RQ level $\rightarrow$    & $k=1$ & $k=2$ & $k=3$ & $k=4$  & $k=5$ & $k=6$ & $k=7$ & $k=8$ \\
\cmidrule(l){2-3}
$\downarrow$ codec       & $\Delta T$ & RVQGAN  & & (semantic)  &   &\\
\midrule
Basic
   &  0     &  &  &  0.798 & 0.783 & 0.560 & 0.483 & 0.421 & 0.407 & 0.369 & 0.404    \\
   &  10ms  &  &  &  0.766 & 0.495 & 0.255 & 0.206 & 0.180 & 0.173 & 0.144 & 0.193  \\
   &  20ms  &  &  &  0.682 & 0.390 & 0.220 & 0.180 & 0.158 & 0.154 & 0.129 & 0.172  \\
   &  40ms  &  &  &  0.503 & 0.329 & 0.182 & 0.146 & 0.128 & 0.125 & 0.107 & 0.156  \\
\midrule
\mimi
   &  0     &  &  & 0.766 & 0.550 & 0.372 & 0.352 & 0.293 & 0.297 & 0.264 & 0.303  \\
   &  10ms  &  &  & 0.731 & 0.376 & 0.206 & 0.176 & 0.152 & 0.154 & 0.132 & 0.182  \\
   &  20ms  &  &  & 0.653 & 0.307 & 0.171 & 0.146 & 0.121 & 0.126 & 0.106 & 0.159  \\
   &  40ms  &  &  & 0.483 & 0.267 & 0.160 & 0.137 & 0.116 & 0.121 & 0.102 & 0.150   \\
\cmidrule(l){2-12}
   &  0    & \checkmark &  & 0.741 & 0.409 & 0.221 & 0.198 & 0.150 & 0.154 & 0.134 & 0.173   \\
   &  10ms & \checkmark &  & 0.702 & 0.281 & 0.148 & 0.133 & 0.118 & 0.117 & 0.100 & 0.136   \\
   &  20ms & \checkmark &  & 0.633 & 0.228 & 0.126 & 0.114 & 0.098 & 0.097 & 0.084 & 0.119   \\
   &  40ms & \checkmark &  & 0.450 & 0.197 & 0.120 & 0.113 & 0.104 & 0.102 & 0.086 & 0.112   \\
\bottomrule
\end{tabular}}`{=latex}
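A measurement of this kind can be sketched as a per-level agreement rate between the generated tokens and the tokens obtained after decoding and re-encoding. The sketch assumes that the quantity of interest is the fraction of tokens preserved at each RQ level; the function name and the token layout (`tokens[k][t]` for level `k` and timestep `t`) are hypothetical.

```python
from typing import List, Sequence


def per_level_agreement(
    original: Sequence[Sequence[int]], reencoded: Sequence[Sequence[int]]
) -> List[float]:
    """Fraction of timesteps at which each RQ level keeps the same token."""
    rates = []
    for orig_level, new_level in zip(original, reencoded):
        n = min(len(orig_level), len(new_level))
        if n == 0:
            raise ValueError("each level must contain at least one token")
        matches = sum(a == b for a, b in zip(orig_level, new_level))
        rates.append(matches / n)
    return rates
```

An idempotent codec would return 1.0 at every level; values that decay with the level index reflect the dependence of each quantizer on the residual left by the previous ones.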

#### Discussion on generative audio watermarking.

The lack of idempotence is problematic for the aforementioned sampling-based watermarking methods, as it affects the reliability of the detector when measuring the sampling bias. Notably, in order for these methods to work properly, the n-tuples that give the context to the hash key must be stable enough over several consecutive tokens. Reducing the context length improves the stability but drastically increases the likelihood of producing degenerate audio sequences, similar to the degeneration problem observed by [@holtzman2019curious].

While we regard this attempt at employing text-based watermarking as a negative result, we discuss hereafter a few potential ways of circumventing the aforementioned problem of token stability through re-encoding:

- Marking only the first RQ levels improves the stability. In our preliminary experiments, using these indices as context in the hash function, and limiting the dependence on previous timestamps, significantly increases the stability (although not sufficiently).\

- The idempotence could be improved by adding a specific loss in the discrete latent space, such that the audio tokens are stable through auto-encoding.\

- Potentially this auto-encoding could be learned to be resilient to signal transformation, similar to what is proposed when learning image watermarking based on neural networks [@zhu2018hidden; @fernandez2022watermarking]. In view of our analysis, adding some tolerance to moderate temporal shift is especially important.\

- The text could be marked instead of the audio. One downside is that text is a lower-capacity channel for adding a mark, and would not be sufficient for short conversations. Another problem is that detecting the mark requires a reliable transcription.

Last but not least, some exploration is needed to ensure that the watermarking procedure cannot be trivially removed when open-sourcing a model. For example, removing the watermark from the implementation associated with the Stable Diffusion model only required commenting out one line of code.[^22] A promising work in this direction is the study by [@sander2024watermarking], who show that it is possible to detect when a model has been trained on watermarked text. A method exploiting this observation has just been shared by [@san2024latent]: the watermarking is implicitly added through the training data, in the spirit of \`\`radioactive data" by [@sablayrolles2020radioactive].

# Conclusion

In this work, we introduce `\ours{}`{=latex}, the first real-time, full-duplex spoken dialogue system. The first component of `\ours{}`{=latex} is `\helium{}`{=latex}, a 7B parameter text LLM which is competitive with open-weights models trained with a similar compute budget. To encode audio into discrete units amenable to language modeling, we introduce `\mimi{}`{=latex}, a semantic-acoustic neural audio codec which provides state-of-the-art audio quality at low bitrates while operating at low framerates compatible with real-time generation. We then introduce a new, hierarchical multi-stream architecture that supports generating arbitrary conversations in a speech-to-speech manner. We moreover show that speech-to-speech generation can be drastically improved by `\interleaving{}`{=latex}, a new method that generates text tokens as a prefix to audio tokens, while remaining compatible with streaming inference. Our experiments show that `\ours{}`{=latex} achieves state-of-the-art spoken question answering and dialogue modeling while displaying satisfactory levels of safety by not generating toxic content and remaining consistent in its voice. To summarize, we introduce a complete suite of models and recipes, from text LLMs to neural audio codecs and generative audio models, which we combine into a real-time spoken dialogue system with a 160ms theoretical latency able to follow complex multi-turn conversations across 5 minutes. We release both `\mimi{}`{=latex} and `\ours{}`{=latex} to foster the development of such applications. Additionally, we show how our `\interleaving{}`{=latex} method allows designing streaming TTS and streaming ASR just by changing the delay between text and audio tokens. We believe that both `\interleaving{}`{=latex} and multi-stream modeling will have a positive impact on speech-to-speech and audio-to-audio modeling beyond dialogue.

\acks{This project is funded by Iliad Group, CMA CGM Group and Schmidt Sciences. We thank Xavier Niel, Rodolphe Saadé, Eric Schmidt, Aude Durand, Séverine Grégoire and Nicolas Granatino for their support; as well as Sarah Hôte and Guillaume Rouzaud at Kyutai for their help. We also thank Alice, the voice artist who strove to give the \ours{} online demo its voice, Elie Raffier who built the user interface for this demo, and Hugging Face for donating inference compute. The audio training dataset was built with the help of Noatune Studios and Landospeech. Model training was conducted at Scaleway.}
\vskip 0.2in
\newpage
\appendix

# Additional Ablation on `\mimi `{=latex}Codec

\footnotesize
\resizebox{\textwidth}{!}{
  \begin{tabular}{ccccc|ccc}
    \toprule
    Quantization & Transformer & Transformer & WavLM & Split & ABX ($\downarrow$) & VisQOL ($\uparrow$) & MOSNet ($\uparrow$) \\
    Rate & in encoder & in decoder & distillation & quantizer &  &  &  \\
    \midrule
     \xmark & \xmark & \xmark & \xmark & \xmark & 31.3\% & 2.37 & 2.85 \\
     \xmark & \checkmark & \xmark & \xmark & \xmark & 31.4\% & 2.30 & 2.82 \\
     \xmark & \xmark & \checkmark & \xmark & \xmark & 27.5\% & 2.30 & 2.93 \\
     \xmark & \checkmark & \checkmark & \xmark & \xmark & 29.0\% & 2.25 & 2.94 \\
     \checkmark & \xmark & \xmark & \xmark & \xmark & 29.1\% & 2.65 & 2.86 \\
     \checkmark & \checkmark & \xmark & \xmark & \xmark & 27.4\% & 2.69 & 2.83 \\
     \checkmark & \xmark & \checkmark & \xmark & \xmark & 23.6\% & 2.72 & 2.89 \\
     \checkmark & \checkmark & \checkmark & \xmark & \xmark & 23.3\% & 2.82 & 2.89 \\
     \checkmark & \checkmark & \checkmark & \checkmark & \xmark & 6.5\% & 2.13 & 2.87 \\
     \checkmark & \xmark & \checkmark & \checkmark & \checkmark & 10.8\% & 2.68 & 2.84 \\
     \checkmark & \checkmark & \xmark & \checkmark & \checkmark & 8.1\% & 2.49 & 2.71 \\
     \xmark & \checkmark & \checkmark & \checkmark & \checkmark & 8.0\% & 2.36 & 2.88 \\
     \checkmark & \checkmark & \checkmark & \checkmark & \checkmark & 8.1\% & 2.72 & 2.89 \\
    \bottomrule
  \end{tabular}
  }

# Audio Matching and Deduplication {#sec:audiomatching}

We have developed an audio matching system, whose objective is twofold:

1.  *Deduplication of source content*. Removing frequent duplicates to avoid overfitting and the regurgitation of audio content that is over-represented in the dataset, as evaluated in [Section `\ref{sec:regurgitation}`{=latex}](#sec:regurgitation).\

2.  *Indexing solution*. By collecting signatures of samples at generation time, we can find if some content has been generated by our online demo or not by direct retrieval.

Our audio matching solution is inspired by the work of [@wang2003industrial], as it offers a good trade-off between efficiency and effectiveness. This method is a retrieval system: Given a query, it detects the similar audio in a pre-indexed dataset. In our case, the signature design favors the de-duplication use-case, which needs to be more efficient: Formally, we need to compare every audio of the dataset with the whole dataset, which raises efficiency issues. The signature extraction is described below.

#### Constellation map.

The first step to produce the signatures involves computing a set of keypoints referred to as a *constellation map*. Our procedure is inspired by [@wang2003industrial] and illustrated in [Figure `\ref{fig:hash_extraction}`{=latex}](#fig:hash_extraction). First, (1) we compute a mel-spectrogram from the audio signal, where time is discretized at a frequency of 40Hz and the frequency range is divided into 64 bins. We then apply three filters to select time-frequency positions: (2) The energy filter ensures that we only select positions that are robust enough; (3) The time and (4) frequency filters ensure that we select maxima w.r.t. time and frequency. The combination of these filters is (5) a constellation, from which we extract hashes. `\medskip`{=latex}

<figure data-latex-placement="t">
<p><img src="figures/loreille/loreille_1.png" style="width:48.0%" alt="image" />  </p>
<p><img src="figures/loreille/loreille_2.png" style="width:48.0%" alt="image" />  </p>
<p><img src="figures/loreille/loreille_3.png" style="width:48.0%" alt="image" />  </p>
<p><img src="figures/loreille/loreille_4.png" style="width:48.0%" alt="image" />  </p>
<p><img src="figures/loreille/loreille_5.png" style="width:48.0%" alt="image" />  </p>
<figcaption><strong>Mel-spectrum keypoint extraction</strong>. Three filters are applied to the audio mel-spectrum to extract a constellation of keypoints on which hash signatures are computed. </figcaption>
</figure>

At the end of the keypoint extraction procedure, the constellation map ${\mathcal C}$ consists of a list of $n$ tuples of the form ${\mathcal C}=\{(t_i, f_i)\}_{0 \leq i < n}$, where each selected timestamp $t_i$ is associated with a mel-spec discrete frequency level $f_i \in \{0,\dots,63\}$.
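The filtering steps above can be sketched as follows in Python. This is only an illustration operating on a precomputed mel-spectrogram array: the function name `constellation_map`, the energy threshold and the time neighborhood `time_halfwidth` are hypothetical choices, and the exact definition of each filter is an assumption rather than the procedure used in our system.

```python
import numpy as np

def constellation_map(mel, energy_threshold=0.5, time_halfwidth=2):
    """Return a list of (t, f) keypoints from a mel-spectrogram of shape (T, 64).

    Assumed filters (illustrative only):
      - frequency filter: keep the strongest bin of each frame, so that each
        timestamp carries at most one frequency level;
      - energy filter: drop keypoints whose energy is below `energy_threshold`;
      - time filter: keep a keypoint only if it is the maximum of its frequency
        bin over a window of +/- `time_halfwidth` frames.
    """
    mel = np.asarray(mel, dtype=float)
    num_frames = mel.shape[0]
    keypoints = []
    for t in range(num_frames):
        f = int(np.argmax(mel[t]))            # frequency filter
        value = mel[t, f]
        if value < energy_threshold:          # energy filter
            continue
        lo = max(0, t - time_halfwidth)
        hi = min(num_frames, t + time_halfwidth + 1)
        if value < mel[lo:hi, f].max():       # time filter
            continue
        keypoints.append((t, f))
    return keypoints
```

With this construction, a given timestamp appears at most once in the output, which matches the uniqueness property that the hash encoding relies on.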

#### Hash encoding.

From the constellation map, we extract hash signatures as follows. For each keypoint $(t_k, f_k) \in \mathcal C$, we select, if there exists:

- A forward keypoint $(t_\mathrm{f},f_\mathrm{f})$, which is the closest time to $t_k$ such that $t_k+m \leq t_\mathrm{f} < t_k+M$, where $[t_k+m,t_k+M)$ is the temporal window from which we select a keypoint. Note, for a given $t_\mathrm{f}$, the corresponding frequency $f_\mathrm{f}$ is unique by design of the filters.

- A backward keypoint $(t_\mathrm{b},f_\mathrm{b})$, which is the keypoint closest in time to $t_k$ such that $t_k-M < t_\mathrm{b} \leq t_k-m$, where $(t_k-M,t_k-m]$ is the temporal window in which the procedure selects a keypoint.

We extract a signature only if both the forward and backward keypoints exist. In that case the signature is defined by the tuple $s_k=(f_\mathrm{b},f_k,f_\mathrm{f},t_k-t_\mathrm{b},t_\mathrm{f}-t_k)$, which we associate to the absolute timestamp $t_k$. In our case we set $m=4$ and $M=20$. Therefore the maximum time-span of the signature is $2\cdot M$, i.e., about 3.2 seconds. Formally, the hash key can take $64^3 (M-m)^2 = 2^{26} = 67,108,864$ distinct values. In practice the distribution of hash values is skewed and some signatures are unlikely to occur.
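As an illustration of the signature definition above, the following Python sketch extracts the signatures from a constellation and packs each tuple into a single integer key. The function name `extract_signatures` and the packing order of the five fields are assumptions made for the example; only the values $m=4$, $M=20$ and the 64 frequency levels come from the text. The sketch also assumes that each timestamp appears at most once in the constellation.

```python
from bisect import bisect_left, bisect_right

NUM_FREQ = 64  # number of mel frequency levels

def extract_signatures(constellation, m=4, M=20):
    """Return a list of (t_k, key) for keypoints having both neighbors.

    `constellation` is a list of (t, f) pairs with distinct timestamps.
    """
    points = sorted(constellation)
    times = [t for t, _ in points]
    span = M - m  # number of distinct values of each time difference
    signatures = []
    for t_k, f_k in points:
        # Forward keypoint: closest time in [t_k + m, t_k + M).
        i = bisect_left(times, t_k + m)
        if i == len(points) or times[i] >= t_k + M:
            continue
        t_f, f_f = points[i]
        # Backward keypoint: closest time in (t_k - M, t_k - m].
        j = bisect_right(times, t_k - m) - 1
        if j < 0 or times[j] <= t_k - M:
            continue
        t_b, f_b = points[j]
        # Pack (f_b, f_k, f_f, t_k - t_b, t_f - t_k) into one integer.
        key = (f_b * NUM_FREQ + f_k) * NUM_FREQ + f_f
        key = (key * span + (t_k - t_b - m)) * span + (t_f - t_k - m)
        signatures.append((t_k, key))
    return signatures
```

Each frequency takes 6 bits and each shifted time difference takes 4 bits, so every key fits in 26 bits, in line with the $64^3 (M-m)^2 = 2^{26}$ count given above.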

#### Pair-wise matching and one-to-many comparison.

With our signature extraction, we can compare two audio signals by comparing their signature sets, which amounts to computing the intersection of their hash keys. When comparing a query audio to a dataset that consists of many audios, it is more efficient to perform this comparison with an inverted file or a hash table. In that case, the indexing structure returns the lists of matching signatures along with the matching timestamps for each audio. Similar to [@wang2003industrial], we only preserve the matches that are temporally consistent thanks to a simple 1D Hough temporal voting scheme. Optionally, we incorporate a tolerance of $\pm 1$ on the timestamps $t_\mathrm{b}$ and $t_\mathrm{f}$ when matching the signatures. This tolerance increases the complexity and we therefore do not use it for the dataset deduplication case.
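The one-to-many comparison can be sketched as below, with a Python dictionary standing in for the inverted file. The names `build_index` and `match_query`, as well as the choice of the largest vote count as matching score, are assumptions for illustration; the sketch uses exact key matching, i.e., without the optional $\pm 1$ tolerance.

```python
from collections import Counter, defaultdict

def build_index(dataset):
    """Build an inverted file {key: [(audio_id, t), ...]}.

    `dataset` maps each audio_id to its list of (t, key) signatures.
    """
    index = defaultdict(list)
    for audio_id, signatures in dataset.items():
        for t, key in signatures:
            index[key].append((audio_id, t))
    return index

def match_query(index, query_signatures):
    """Return {audio_id: score} for a query given as a list of (t, key).

    Each matching signature votes for the offset between the reference and
    query timestamps (1D Hough voting). The score of an audio is the number
    of votes received by its most popular offset, so that only temporally
    consistent matches contribute.
    """
    votes = defaultdict(Counter)
    for t_query, key in query_signatures:
        for audio_id, t_ref in index.get(key, ()):
            votes[audio_id][t_ref - t_query] += 1
    return {audio_id: max(offsets.values()) for audio_id, offsets in votes.items()}
```

For instance, a query whose three signatures all match a reference audio with the same offset obtains a score of 3 for that audio, whereas matches scattered over different offsets each count for 1.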

#### De-duplication: Signature fused set.

For our deduplication strategy, we first cross-match all the audio segments in the dataset, and extract the matching segments that occur often enough (typically $\geq 10$ matches). Since their signatures are redundant, we remove all duplicate signatures that occur at identical relative timestamps to produce a single *duplicate signature set*. At training time, in order to determine whether an audio segment is a frequent duplicate to be filtered out, we compare its signature set to the duplicate signature set. In other words, we perform a simple audio-to-audio matching between the putative training segment and the synthesized duplicate signature file. We use the segment for training only if the score is below a pre-defined matching threshold.

# Delayed text LM as a zero-shot streaming ASR and TTS {#app:streaming_tts}

As explained in [Section `\ref{sec:interleaving}`{=latex}](#sec:interleaving), Moshi models audio tokens, along with a text stream that is aligned on the audio frame rate with the use of special padding tokens, as represented in [Figure `\ref{fig:rainbow}`{=latex}](#fig:rainbow). We can adapt this method for ASR and TTS by introducing a delay between the audio and text tokens. In both cases, the model operates in full streaming mode, with a fixed latency (here 2 seconds).

#### ASR mode.

If the audio is ahead of the text, we ignore the model prediction for the audio tokens, using instead those of some audio input, and sample the text tokens freely. Then the text stream contains the audio transcription, with fine alignments at the word level, as depicted in [Figure `\ref{fig:rainbow_asr}`{=latex}](#fig:rainbow_asr).

#### TTS mode.

If the text is ahead of the audio, we can symmetrically derive a TTS engine. We need for that a properly padded set of text tokens. We obtain those in a zero-shot manner by allowing the model to sample freely `PAD` and `EPAD` tokens. As soon as the model tries to sample a different token, we instead input the next word to generate. Note that we can further control the rate of the speech by keeping an online average of the fraction of padding tokens. By introducing a small bonus on their logits when this fraction falls below a given target value, we ensure a reasonable rate and good intelligibility in all situations. Finally, using a prefix with both text and audio tokens, we can control the voice of the speaker. A representation is given in [Figure `\ref{fig:rainbow_tts}`{=latex}](#fig:rainbow_tts).
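The padding mechanism described above can be sketched as follows. This is a simplified illustration rather than the released implementation: the class name `PaddedTextFeeder`, the token ids, the default target fraction and bonus, and the use of a plain running fraction as online average are all assumptions. The sketch only decides which text token is fed at each step, given the text logits predicted by the model.

```python
import numpy as np

class PaddedTextFeeder:
    """Chooses the text token fed to the model at each step in TTS mode."""

    def __init__(self, words, pad_ids, target_pad_fraction=0.5, pad_bonus=1.0, seed=0):
        self.words = [list(w) for w in words]  # each word is a list of token ids
        self.pending = []                      # remaining tokens of the current word
        self.pad_ids = list(pad_ids)           # ids of the PAD and EPAD tokens
        self.target = target_pad_fraction      # hypothetical target value
        self.bonus = pad_bonus                 # hypothetical logit bonus
        self.num_pad = 0
        self.num_steps = 0
        self.rng = np.random.default_rng(seed)

    def step(self, text_logits):
        if self.pending:
            # Inside a word: force its remaining tokens.
            token = self.pending.pop(0)
        else:
            logits = np.array(text_logits, dtype=float)
            if self.num_steps and self.num_pad / self.num_steps < self.target:
                logits[self.pad_ids] += self.bonus  # favor padding tokens
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            token = int(self.rng.choice(len(probs), p=probs))
            if token not in self.pad_ids:
                if self.words:
                    # The model tries to speak: feed the next word instead.
                    self.pending = self.words.pop(0)
                    token = self.pending.pop(0)
                else:
                    token = self.pad_ids[0]  # no text left: keep padding
        self.num_pad += token in self.pad_ids
        self.num_steps += 1
        return token
```

Recording the step at which each word is popped from `words` would give the word-level alignment mentioned in [Figure `\ref{fig:rainbow_tts}`{=latex}](#fig:rainbow_tts).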

#### Multi-stream TTS.

We use this mechanism both in single and multi-stream mode. In multi-stream mode, the model outputs two sets of audio tokens. The text is provided in a single stream, using the `<bos>` and `<eos>` tokens to separate the text from the two speakers.

![**Representation of the joint sequence modeled by Moshi when used for ASR.** Each column represents the tokens for a given step in the joint sequence $(V_{s, k})$, similar to the one described in `\autoref{eq:final_multi_sequence}`{=latex}, but adapted for ASR. The text is delayed by 2 seconds, and we use an acoustic token delay $\tau=2$. Tokens are predicted from bottom to top in the depth Transformer. The audio tokens are kept to match those of the input audio, while text tokens are sampled freely. This also provides fine word timestamps.](figures/ASR.drawio.png){#fig:rainbow_asr width="\\textwidth"}

![**Representation of the joint sequence modeled by Moshi when used in TTS mode**. Each column represents the tokens for a given step in the joint sequence $(V_{s, k})$, similar to the one described in `\autoref{eq:final_multi_sequence}`{=latex}, but adapted for TTS. The audio is delayed by 2 seconds, and we use an acoustic token delay $\tau=2$. Tokens are predicted from bottom to top in the depth Transformer. Text predictions are usually ignored, and the tokens from the text to generate are used instead. However, this text input lacks padding tokens. At the end of each word, we allow the model to sample freely `PAD` and `EPAD` tokens. If the model tries to sample another token, we instead use the tokens from the next word. The semantic and acoustic audio tokens are sampled normally, being implicitly conditioned on the text due to the delay used. This method also provides a fine alignment of the words in the generated audio, by noting the time at which a given word is consumed by the model. ](figures/TTS.drawio.png){#fig:rainbow_tts width="\\textwidth"}

\newpage

# Characterizing Audio Artifacts Caused by Quantization {#app:artifactsmetric}

First, recall that Moshi jointly handles three streams of tokens: text tokens $W_{s}$ for Inner Monologue, semantic+acoustic audio tokens $(A_{s, k})_{1 \leq k \leq Q}$ for Moshi's audio, and the similar audio tokens $(A'_{s, k})_{1 \leq k \leq Q}$ for the user's input. To analyse the impact of model quantization on generated content, we first compute the Shannon entropy $H$ across windows of fixed size $C$ at each timestep $s$ for the text and Moshi's audio streams independently. This yields $H^{0}_s = H(W_{s-C:s})$ for text, and $H^{k}_s = H(A_{s-C:s, k})$ for each audio level. We use $C = 64$ in practice, which corresponds to roughly 4.5 seconds of audio once decoded, and ignore all the leading $C$ tokens as they have a reduced context (furthermore, in our experimental scenario, they include the initial prompt used for generation).
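A minimal sketch of this computation for a single stream is given below. It assumes that $H$ is the entropy of the empirical distribution of the tokens within the window and that it is measured in bits; neither the function name `windowed_entropy` nor the choice of logarithm base is specified in the text.

```python
import math
from collections import Counter

def windowed_entropy(tokens, context=64):
    """Entropy of the empirical token distribution over tokens[s - context:s].

    One value is returned per timestep s >= context; earlier timesteps are
    skipped since they do not have a full context.
    """
    entropies = []
    for s in range(context, len(tokens) + 1):
        counts = Counter(tokens[s - context:s])
        entropy = sum(c / context * math.log2(context / c) for c in counts.values())
        entropies.append(entropy)
    return entropies
```

Under these assumptions, a window containing a single repeated token has zero entropy, while a window of $C = 64$ distinct tokens reaches the maximum of $\log_2 64 = 6$ bits. The same function would be applied to the text stream and to each of the $Q$ audio codebooks.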

We observe qualitatively that the entropy spectrum is often indicative of artifacts or degradations of the audio samples. Formally we define three types of artifacts from the entropy statistics, as described below. In practice, we characterize the presence or absence of each artifact over non-overlapping windows of $\omega = 64$ tokens, as illustrated in [Figure `\ref{fig:artifactsexampleextended}`{=latex}](#fig:artifactsexampleextended). `\looseness=-1`{=latex}

#### Repetitive text.

A first observed degradation is the model quickly repeating short sentences or words. This is characterized by the text entropy being almost flat over a window $H^0_{s:s+\omega}$, but non-zero (as more than one token is repeated), as seen in [Figure `\ref{fig:artifactsexampleextended}`{=latex} (c)](#fig:artifactsexampleextended). We measure the \`\`flatness\" of $H^0_{s:s+\omega}$ by fitting a linear regression model to it and verifying whether the slope is below a certain threshold hyper-parameter $\eta_\text{flat} = 10^{-3}$. `\looseness=-1`{=latex}

#### Silence vs. background noise.

By design, Moshi being silent corresponds to a constant stream of `PAD` text tokens (hence $H^0_{s:s+\omega} = 0$), while simultaneously, the corresponding audio tokens decode to a near silent waveform: The audio tokens are not constant, but fall into a small subset of \`\`silence tokens\", which results in a lower overall entropy for the audio tokens as seen for instance in the short silences of [Figure `\ref{fig:artifactsexampleextended}`{=latex} (a)](#fig:artifactsexampleextended). We measure this behavior as $\text{median}_{k > 1, s}(H^k_{s:s+\omega}) \leq \eta_{\text{audio\_silence}} = 2$. Note that *we do not consider these silences to be artifacts*: This is because silences occur naturally in the multi-stream model as they simply represent the other speaker's turn. For illustration purposes, we highlight silences throughout [Figures `\ref{fig:artifactsexample}`{=latex} and `\ref{fig:artifactsexampleextended}`{=latex}](#fig:artifactsexampleextended), but we count them as artifact-free timesteps otherwise.

In contrast, *background noise* artifacts occur when the text stream is silent ($H^0_{s:s+\omega} = 0$), but audio tokens still have a rich output ($\text{median}_{k > 1, s}(H^k_{s:s+\omega}) > \eta_{\text{audio\_silence}}$). This is shown in [Figure `\ref{fig:artifactsexampleextended}`{=latex} (d)](#fig:artifactsexampleextended) where a silence slowly degrades into background noise over time. `\looseness=-1`{=latex}

#### Bad audio quality.

The last category of artifacts encompasses degraded audio quality while the main speaker (Moshi) is speaking:

- *Gibberish* is a very common type of artifact at low bitwidth quantization (W2) and corresponds to incoherent speech. It is easily characterized by a high entropy of the text token ($H^0_{s:s+\omega} > \eta_{\text{gibberish}} = 3.5$), as shown in [Figure `\ref{fig:artifactsexample}`{=latex} (b)](#fig:artifactsexample).

- *Noisy Audio* is harder to detect, as illustrated in [Figure `\ref{fig:artifactsexampleextended}`{=latex} (b)](#fig:artifactsexampleextended) for instance. We characterize it by first assessing that we are not in either a silence or background noise case, and then testing whether the standard deviation of the tokens' entropy across the audio codebooks is above a certain threshold $\eta_\text{noise} = 0.6$.
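The decision rules of this section can be combined as in the following sketch, which labels one window of $\omega$ timesteps from precomputed entropies. The thresholds are those given in the text, but several details are assumptions made for the example: the order in which the tests are applied, the use of the mean of $H^0$ over the window for the gibberish test, the use of the absolute value of the slope for the flatness test, and the computation of the noise statistic as the standard deviation across codebooks of the per-codebook mean entropy.

```python
import numpy as np

ETA_FLAT = 1e-3
ETA_AUDIO_SILENCE = 2.0
ETA_GIBBERISH = 3.5
ETA_NOISE = 0.6

def classify_window(text_entropy, audio_entropy):
    """Label one window from its entropies.

    text_entropy: array of shape (omega,), the values of H^0.
    audio_entropy: array of shape (Q, omega), row k - 1 holding H^k.
    """
    h0 = np.asarray(text_entropy, dtype=float)
    hk = np.asarray(audio_entropy, dtype=float)
    if np.all(h0 == 0):
        # Silent text stream: silence or background noise, depending on
        # the median entropy of the audio levels k > 1.
        if np.median(hk[1:]) <= ETA_AUDIO_SILENCE:
            return "silence"
        return "background_noise"
    if h0.mean() > ETA_GIBBERISH:
        return "gibberish"
    # Flatness: slope of a linear regression fitted to the text entropy.
    slope = np.polyfit(np.arange(len(h0)), h0, deg=1)[0]
    if abs(slope) < ETA_FLAT:
        return "repetitive_text"
    if hk.mean(axis=1).std() > ETA_NOISE:
        return "noisy_audio"
    return "ok"
```

Windows labeled `"silence"` or `"ok"` would then be counted as artifact-free, in line with the remark above that silences are not considered artifacts.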

<figure data-latex-placement="tbh">
<p><strong>(a)</strong> Example entropy spectrum of a good audio sample (no artifacts detected). Short pauses occur for the main speaker due to the multi-stream design.<br />
<img src="figures/quantized_models_mosnet/quant_artifacts_examples/good_sample_1.png" /></p>
<p>    </p>
<p><strong>(b)</strong> Generally, the presence of artifacts tends to increase over time, here with repetition starting to occur in the speech.</p>
<img src="figures/quantized_models_mosnet/quant_artifacts_examples/repetitive_text_at_the_end.png" />
<p><strong>(c)</strong> Another common artifact is repetitive snippets of text (with good audio quality), which are characterized by a flat entropy of the text token.<br />
<img src="figures/quantized_models_mosnet/quant_artifacts_examples/repetitive_text.png" /></p>
<p>    </p>
<p><strong>(d)</strong> Silences can degrade to background noise.</p>
<img src="figures/quantized_models_mosnet/quant_artifacts_examples/silence_into_background_noise.png" />
<figcaption><strong>Example of typical entropy spectrums capturing specific audio artifacts caused by model quantization</strong>. For each timestep, we compute the entropy over the past 128 tokens, independently for the text and audio codebooks tokens. Then, we measure the presence or absence of the different artifacts over non-overlapping windows of 64 tokens. </figcaption>
</figure>

While measuring the presence of these artifacts relies on several hyper-parameters, the thresholds $\eta_\text{flat}, \eta_{\text{audio\_silence}}, \eta_{\text{gibberish}}$ and $\eta_{\text{noise}}$ characterize the entropy of the sampled output tokens directly, and are thus primarily related to the text/audio vocabulary, rather than the weights of the Temporal and Depth Transformers. We found these hyper-parameters to work well in capturing artifacts across different models in practice (using the same `\mimi `{=latex}codec for all). Note that the values chosen for these hyper-parameters are also tightly linked with the chosen context size $C$ and window $\omega$, thus they are not particularly robust to changes on the temporal axis. In addition, choosing too small a value for $\omega$ may lead to false negatives, e.g. by missing very short artifacts. Nevertheless, as shown in [Figure `\ref{fig:artifactsexampleextended}`{=latex}](#fig:artifactsexampleextended), this simple analysis of the entropy spectrum offers additional fine-grained insights on the types of audio artifacts caused by model quantization, complementing the MOSNet scores obtained for the same samples.

Finally, in [Figure `\ref{fig:artifacts}`{=latex}](#fig:artifacts) we report the distribution of artifacts over time, averaged across 500 samples for each model. At a bitwidth of 4, there is still little difference in behavior between the unquantized model and the quantized ones. For a bitwidth of 3, artifacts occur more often for quantized models, in particular when using large quantization blocks (256); in addition, they become more frequent over time. Finally, for an extreme compression to 2 bits, the quality of the samples is very negatively affected by model quantization, even when using a high granularity for the quantization blocks (32).

<figure>
<img src="figures/quantized_models_mosnet/quant_artifacts_summary_plots/8cf6db67_bs2_artifacts.png" style="width:95.0%" />
<img src="figures/quantized_models_mosnet/quant_artifacts_summary_plots/8cf6db67_bs3_artifacts.png" style="width:95.0%" />
<img src="figures/quantized_models_mosnet/quant_artifacts_summary_plots/8cf6db67_bs4_artifacts.png" style="width:95.0%" />
<figcaption> <strong>Temporal distribution of audio artifacts caused by model compression</strong>. We measure in 500 audio samples the presence or absence of different audio degradations caused by model weight quantization on 2, 3 or 4 bits with block granularity of 32 or 256, across non-overlapping windows of 64 tokens (<em>timestep, x-axis</em>). </figcaption>
</figure>

\newpage

# Safety and Toxicity

\medskip
\resizebox{\linewidth}{!}{%
\begin{tabular}{l|c|ccccccccccc}
\toprule
Category & Moshi & GPT-3.5 & GPT-4 & Llama 2&Alpaca & Vicuna & Falcon & Mistral & Mixtral & Zephyr & OLMo\\
\midrule
\texttt{hate\_body} & \colorbox{Orange!50}{90.96} & \colorbox{Orange!50}{96.38} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0}&\colorbox{Red!30}{71.68} & \colorbox{Orange!50}{98.79} & \colorbox{Orange!50}{91.56} & \colorbox{Red!30}{88.55} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{84.93} & \colorbox{Orange!50}{90.36}\\

\texttt{hate\_disabled} & \colorbox{Red!30}{85.83} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{60.83}&\colorbox{Sand!20}{100.0} &\colorbox{Orange!50}{94.16} & \colorbox{Orange!50}{91.66} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{71.66} & \colorbox{Orange!50}{93.33}\\

\texttt{hate\_ethnic} & \colorbox{Red!30}{84.56} & \colorbox{Orange!50}{98.03} & \colorbox{Sand!20}{99.42} & \colorbox{Sand!20}{100.0}&\colorbox{Red!30}{67.23} &  \colorbox{Orange!50}{97.95} & \colorbox{Red!30}{88.94} & \colorbox{Orange!50}{90.99} & \colorbox{Sand!20}{99.42} & \colorbox{Red!30}{84.52} & \colorbox{Orange!50}{93.61}\\

\texttt{hate\_lgbtq+} & \colorbox{Red!30}{87.14} & \colorbox{Orange!50}{98.21} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{76.08} & \colorbox{Orange!50}{97.96} & \colorbox{Orange!50}{92.87} & \colorbox{Orange!50}{92.62} & \colorbox{Orange!50}{98.98} & \colorbox{Red!30}{88.80} & \colorbox{Orange!50}{94.65}\\

\texttt{hate\_poor} & \colorbox{Orange!50}{90.00} &\colorbox{Sand!20}{99.00} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{84.15} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{93.06} & \colorbox{Orange!50}{94.05} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{93.06} & \colorbox{Orange!50}{97.02}\\

\texttt{hate\_religion} &\colorbox{Red!30}{82.73}& \colorbox{Sand!20}{99.32} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{70.88} & \colorbox{Sand!20}{99.32} & \colorbox{Orange!50}{93.90} & \colorbox{Orange!50}{97.06} & \colorbox{Sand!20}{99.77} & \colorbox{Red!30}{89.61} & \colorbox{Orange!50}{95.03}\\

\texttt{hate\_women} &\colorbox{Red!30}{86.36}& \colorbox{Orange!50}{97.72} & \colorbox{Sand!20}{99.54} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{68.93} & \colorbox{Orange!50}{97.01} & \colorbox{Orange!50}{90.32} & \colorbox{Orange!50}{90.68} & \colorbox{Orange!50}{98.92} & \colorbox{Red!30}{87.33} & \colorbox{Orange!50}{91.87}\\

\texttt{hate\_other} &\colorbox{Red!30}{88.40}& \colorbox{Red!30}{87.90} & \colorbox{Sand!20}{99.75} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{63.89} & \colorbox{Orange!50}{98.03} & \colorbox{Red!30}{83.16} & \colorbox{Red!30}{73.12} & \colorbox{Orange!50}{98.93} & \colorbox{Red!30}{68.30} & \colorbox{Red!30}{83.16} \\

\midrule

\texttt{self\_harm\_suicide}&\colorbox{Red!30}{83.24} & \colorbox{Sand!20}{99.42} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{41.95} & \colorbox{Orange!50}{92.52} & \colorbox{Orange!50}{94.82} & \colorbox{Red!30}{61.49} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{77.01} & \colorbox{Red!30}{72.41}\\

\texttt{self\_harm\_thin} &\colorbox{Sand!20}{99.58}& \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{97.87} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{99.14} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{99.57}\\

\texttt{self\_harm\_other} &\colorbox{Orange!50}{97.92}&  \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{87.50} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{97.22} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{99.30}\\

\midrule

\texttt{weapon\_biological} & \colorbox{Orange!50}{98.12}& \colorbox{Sand!20}{99.06} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{84.50} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{93.89} & \colorbox{Red!30}{85.91} & \colorbox{Sand!20}{99.53} & \colorbox{Orange!50}{93.89} & \colorbox{Orange!50}{95.77}\\

\texttt{weapon\_chemical} & \colorbox{Orange!50}{93.45} & \colorbox{Orange!50}{95.83} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{67.59} & \colorbox{Orange!50}{98.14} & \colorbox{Red!30}{80.09} & \colorbox{Red!30}{77.31} & \colorbox{Sand!20}{99.07} & \colorbox{Orange!50}{91.20} & \colorbox{Red!30}{89.81}\\

\texttt{weapon\_firearm} &  \colorbox{Red!30}{82.88}&\colorbox{Orange!50}{98.21} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{70.53} & \colorbox{Sand!20}{99.10} & \colorbox{Red!30}{77.67} & \colorbox{Red!30}{80.35} & \colorbox{Sand!20}{99.10} & \colorbox{Red!30}{88.39} & \colorbox{Red!30}{88.39}\\

\texttt{weapon\_radioactive} & \colorbox{Orange!50}{93.71}& \colorbox{Sand!20}{99.37} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{89.44} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{96.27} & \colorbox{Orange!50}{95.03} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{97.51} & \colorbox{Orange!50}{98.13}\\

\texttt{weapon\_other} &\colorbox{Red!30}{79.75}& \colorbox{Orange!50}{97.34} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{60.61} & \colorbox{Orange!50}{91.42} & \colorbox{Red!30}{81.02} & \colorbox{Red!30}{74.89} & \colorbox{Orange!50}{97.55} & \colorbox{Red!30}{78.97} & \colorbox{Red!30}{87.34}\\

\midrule

\texttt{crime\_cyber} &\colorbox{Red!30}{73.68}& \colorbox{Orange!50}{98.90} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{56.23} & \colorbox{Orange!50}{93.87} & \colorbox{Red!30}{89.93} & \colorbox{Red!30}{55.79} & \colorbox{Orange!50}{98.46} & \colorbox{Red!30}{85.55} & \colorbox{Orange!50}{90.37}\\

\texttt{crime\_injury} &\colorbox{Red!30}{75.92}& \colorbox{Orange!50}{98.94} & \colorbox{Sand!20}{99.45} & \colorbox{Sand!20}{99.94} & \colorbox{Red!30}{50.55} & \colorbox{Orange!50}{93.65} & \colorbox{Red!30}{87.93} & \colorbox{Red!30}{76.25} & \colorbox{Sand!20}{99.16} & \colorbox{Red!30}{75.80} & \colorbox{Red!30}{87.43}\\

\texttt{crime\_kidnap} &\colorbox{Red!30}{75.12}& \colorbox{Sand!20}{99.50} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{42.28} & \colorbox{Sand!20}{99.50} & \colorbox{Orange!50}{91.04} & \colorbox{Red!30}{26.86} & \colorbox{Orange!50}{98.00} & \colorbox{Red!30}{49.75} & \colorbox{Red!30}{81.59}\\

\texttt{crime\_privacy} &\colorbox{Orange!50}{95.56}& \colorbox{Sand!20}{99.72} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{87.81} & \colorbox{Orange!50}{98.06} & \colorbox{Orange!50}{96.39} & \colorbox{Red!30}{87.25} & \colorbox{Sand!20}{99.16} & \colorbox{Orange!50}{95.84} & \colorbox{Orange!50}{97.22}\\

\texttt{crime\_propaganda}&\colorbox{Orange!50}{96.41} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Orange!50}{96.33} & \colorbox{Sand!20}{99.71} & \colorbox{Orange!50}{97.01} & \colorbox{Sand!20}{99.80} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{99.51} & \colorbox{Orange!50}{92.28}\\

\texttt{crime\_tax} &\colorbox{Red!30}{83.23}& \colorbox{Sand!20}{99.69} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{55.18} & \colorbox{Orange!50}{98.78} & \colorbox{Red!30}{84.14} & \colorbox{Red!30}{49.69} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{86.89} & \colorbox{Red!30}{89.63}\\

\texttt{crime\_theft} &\colorbox{Red!30}{74.98}& \colorbox{Orange!50}{98.62} & \colorbox{Sand!20}{99.31} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{38.07} & \colorbox{Orange!50}{95.71} & \colorbox{Orange!50}{92.10} & \colorbox{Red!30}{35.93} & \colorbox{Sand!20}{99.31} & \colorbox{Red!30}{47.16} & \colorbox{Red!30}{80.10}\\

\texttt{crime\_other}&\colorbox{Red!30}{85.30} & \colorbox{Sand!20}{99.42} & \colorbox{Sand!20}{100.0} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{63.89} & \colorbox{Orange!50}{97.13} & \colorbox{Orange!50}{95.41} & \colorbox{Red!30}{86.82} & \colorbox{Sand!20}{99.42} & \colorbox{Red!30}{88.25} & \colorbox{Orange!50}{91.40}\\

\midrule

\texttt{sex\_harassment}&\colorbox{Red!30}{81.46}  & \colorbox{Orange!50}{94.25} & \colorbox{Orange!50}{98.17} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{56.19} & \colorbox{Orange!50}{94.25} & \colorbox{Red!30}{86.42} & \colorbox{Red!30}{83.81} & \colorbox{Orange!50}{96.08} & \colorbox{Red!30}{71.27} & \colorbox{Red!30}{77.02}\\

\texttt{sex\_porn}&\colorbox{Red!30}{73.33}  & \colorbox{Orange!50}{92.66} & \colorbox{Orange!50}{98.66} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{39.33} & \colorbox{Orange!50}{90.66} & \colorbox{Red!30}{70.00} & \colorbox{Red!30}{60.66} & \colorbox{Red!30}{89.33} & \colorbox{Red!30}{58.00} & \colorbox{Red!30}{56.00}\\

\texttt{sex\_other} &\colorbox{Red!30}{83.79} & \colorbox{Orange!50}{95.09} & \colorbox{Orange!50}{97.54} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{64.03} & \colorbox{Orange!50}{95.09} & \colorbox{Red!30}{83.92} & \colorbox{Red!30}{86.37} & \colorbox{Orange!50}{95.91} & \colorbox{Red!30}{70.29} & \colorbox{Red!30}{80.38}\\

\midrule
\texttt{substance\_alcohol} &\colorbox{Red!30}{85.31} & \colorbox{Orange!50}{98.03} & \colorbox{Sand!20}{99.58} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{73.66} & \colorbox{Orange!50}{96.35} & \colorbox{Red!30}{87.95} & \colorbox{Red!30}{81.79} & \colorbox{Orange!50}{98.03} & \colorbox{Red!30}{83.19} & \colorbox{Red!30}{83.47}\\

\texttt{substance\_cannabis} &\colorbox{Red!30}{62.95} & \colorbox{Red!30}{80.87} & \colorbox{Red!30}{82.07} & \colorbox{Sand!20}{99.60} & \colorbox{Red!30}{24.30} & \colorbox{Red!30}{68.12} & \colorbox{Red!30}{56.17} & \colorbox{Red!30}{32.66} & \colorbox{Red!30}{72.50} & \colorbox{Red!30}{43.82} & \colorbox{Red!30}{43.02}\\

\texttt{substance\_drug} &\colorbox{Red!30}{65.79} & \colorbox{Orange!50}{93.50} & \colorbox{Orange!50}{97.37} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{34.00} & \colorbox{Red!30}{89.18} & \colorbox{Red!30}{77.27} & \colorbox{Red!30}{48.99} & \colorbox{Orange!50}{94.74} & \colorbox{Red!30}{63.83} & \colorbox{Red!30}{63.98}\\

\texttt{substance\_tobacco} &\colorbox{Red!30}{84.91} & \colorbox{Sand!20}{99.05} & \colorbox{Sand!20}{99.05} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{66.98} & \colorbox{Sand!20}{99.05} & \colorbox{Orange!50}{91.50} & \colorbox{Red!30}{75.47} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{89.62} & \colorbox{Red!30}{87.73}\\

\texttt{substance\_other}&\colorbox{Red!30}{81.77}  & \colorbox{Orange!50}{96.57} & \colorbox{Orange!50}{98.88} & \colorbox{Sand!20}{100.0} & \colorbox{Red!30}{45.94} & \colorbox{Orange!50}{91.89} & \colorbox{Red!30}{81.26} & \colorbox{Red!30}{66.30} & \colorbox{Orange!50}{96.93} & \colorbox{Red!30}{66.30} & \colorbox{Red!30}{76.03}\\

\midrule
\midrule
\textit{Overall Safety Score}&\colorbox{Red!30}{83.05}  & \colorbox{Orange!50}{96.95} & \colorbox{Sand!20}{99.18} & \colorbox{Sand!20}{99.98} & \colorbox{Red!30}{62.13} & \colorbox{Orange!50}{95.75} & \colorbox{Red!30}{88.11} & \colorbox{Red!30}{75.45} & \colorbox{Orange!50}{98.22} & \colorbox{Red!30}{77.86} & \colorbox{Red!30}{85.90}\\
\bottomrule
\end{tabular}%
}

`\label{app:safety}`{=latex}

\clearpage

# Synthetic Transcripts Generation for Fine-tuning

In this section, we provide examples of the synthetic transcripts used to fine-tune Moshi.

## General knowledge

\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: Hey, I've been researching a bit about information theory, and I've come across this concept of entropy. What is it?} \\
\textcolor{orange}{\texttt{MOSHI}: Oh, that's an interesting topic. Entropy, in information theory, is a measure of the average level of information, surprise, or uncertainty inherent to a random variable's possible outcomes. It's actually a very useful concept in communication and computer science.} \\
\textcolor{blue}{\texttt{USER}: So how does it work?} \\
\textcolor{orange}{\texttt{MOSHI}: Entropy is calculated using a formula where it's negative sum of the products of each possible outcome's probability and its logarithm to a base, usually base-2 or base-e. The higher the entropy, the greater the uncertainty or information, and vice versa.} \\
\textcolor{blue}{\texttt{USER}: Okay, that's interesting. How does it relate to communication and computer science?} \\
\textcolor{orange}{\texttt{MOSHI}: In communication and computer science, entropy plays a crucial role, especially in the field of error-correcting codes and compression methods. Imagine a source sending a message, and each bit has a certain probability.} \\
\textcolor{blue}{\texttt{USER}: Yeah.} \\
\textcolor{orange}{\texttt{MOSHI}: Entropy tells us how much information we need to send from the source to the receiver to ensure reliable communication. The lower the entropy, the fewer bits are needed, and the less information is lost in transmission or reception.} \\
\textcolor{blue}{\texttt{USER}: That's fascinating. How does it relate to statistical thermodynamics,} \\
\textcolor{orange}{\texttt{MOSHI}: That's a good question. Entropy in information theory is directly analogous to the entropy in statistical thermodynamics. It's a measure of the unavailability of a system's energy to perform work. The higher the entropy, the more disordered the system. In information theory, entropy measures the amount of missing information that could be replaced.} \\
\textcolor{blue}{\texttt{USER}: Wow. So entropy can be used to measure disorder in both physical and digital systems.} \\
\textcolor{orange}{\texttt{MOSHI}: Exactly and entropy can also be used to measure the amount of missing information or the amount of compressed data after encoding a message. It's quite versatile, isn't it?} \\
\textcolor{blue}{\texttt{USER}: Yeah, very versatile. Thank you for explaining all this to me. It's really fascinating.} \\
\textcolor{orange}{\texttt{MOSHI}: You're welcome. Information theory and entropy are indeed fascinating areas of study with many applications in various fields.} \\
\textcolor{blue}{\texttt{USER}: I'm glad I got to learn about it today.} \\
\textcolor{orange}{\texttt{MOSHI}: It's always good to expand your knowledge.} \\
\textcolor{blue}{\texttt{USER}: Indeed.} \\
\textcolor{orange}{\texttt{MOSHI}: There's so much to learn about.}
\end{tcolorbox}
\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: Hello, I wanted to talk to you about The Three Musketeers.} \\
\textcolor{orange}{\texttt{MOSHI}: Of course, happy to help! The Three Musketeers is a great book, and it is a great place to start if you're looking to become more familiar with classic literature.} \\
\textcolor{blue}{\texttt{USER}: Wow, this is fantastic. Can you tell me more about the story?} \\
\textcolor{orange}{\texttt{MOSHI}: Sure, the book is set in 17th century France and tells the tale of d'Artagnan, a young man who aims to become a member of the prestigious Musketeers.} \\
\textcolor{blue}{\texttt{USER}: So what is the story about?} \\
\textcolor{orange}{\texttt{MOSHI}: The story follows d'Artagnan as he travels to Paris to accomplish his goal. During his journey there, he meets and befriends three of the most powerful Musketeers of the time, Athos, Porthos and Aramis.} \\
\textcolor{blue}{\texttt{USER}: Wow, it sounds exciting. What else does the story focus on?} \\
\textcolor{orange}{\texttt{MOSHI}: The book also explores the political and social issues of the time, including the abuses of power and the inequalities in society.} \\
\textcolor{blue}{\texttt{USER}: That sounds interesting. Can you tell me more about the characters?} \\
\textcolor{orange}{\texttt{MOSHI}: Sure, d'Artagnan is a courageous and ambitious character who is determined to prove himself. Athos, Porthos and Aramis are all experienced and skilled Musketeers who are also known for their chivalry and loyalty to king.} \\
\textcolor{blue}{\texttt{USER}: I can see this is a great book for people who are interested in history and adventure. How would you rate it?} \\
\textcolor{orange}{\texttt{MOSHI}: I would definitely give it a 4 or 5 out of 5. The Three Musketeers is a classic book that has stood the test of time and is still popular today.} \\
\textcolor{blue}{\texttt{USER}: Great, I really appreciate all your help. Thank you.} \\
\textcolor{orange}{\texttt{MOSHI}: No problem, I'm glad I could assist you.} \\
\end{tcolorbox}
\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: John Steinbeck is an influential American writer. Can you tell me more about his life and work?} \\
\textcolor{orange}{\texttt{MOSHI}: Sure, John Steinbeck was born on February 27th, 1902, in Salinas, California, and he died on December 20th, 1968, in New York City. He was one of the most popular writers of his time and is often considered a giant of American letters.} \\
\textcolor{blue}{\texttt{USER}: And what are some of your most famous pieces?} \\
\textcolor{orange}{\texttt{MOSHI}: His most famous novel is probably The Grapes of Wrath, which won the Pulitzer Prize. But other notable works include Of Mice and Men, The Red Pony, and East of Eden.} \\
\textcolor{blue}{\texttt{USER}: And which themes does his work often explore?} \\
\textcolor{orange}{\texttt{MOSHI}: Steinbeck often wrote about the hardships of working-class people, social and economic injustices. And the struggle for survival in difficult times.} \\
\textcolor{blue}{\texttt{USER}: So he's like a social commentator.} \\
\textcolor{orange}{\texttt{MOSHI}: Kind of like that. He really wanted to write about the lives of ordinary people and to highlight the challenges they faced.} \\
\textcolor{blue}{\texttt{USER}: That's fascinating. I did not know that. Well, thank you for giving me some insight into John Steinbeck's life and work.} \\
\textcolor{orange}{\texttt{MOSHI}: You're welcome.}
\end{tcolorbox}

## Misspelling

\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: I'd like to learn about Danyh Amendol?} \\
\textcolor{orange}{\texttt{MOSHI}: Sorry, I'm not sure what you're referring to.} \\
\textcolor{blue}{\texttt{USER}: Oh, Danyh Amendola, the football player. The former football player.} \\
\textcolor{orange}{\texttt{MOSHI}: Okay, I know who Danny Amendola is. He is a former professional football wide receiver who played for the New England Patriots. He also played for the Miami Dolphins, Detroit Lions, and Houston Texans.} \\
\textcolor{blue}{\texttt{USER}: Okay,} \\
\textcolor{orange}{\texttt{MOSHI}: He was drafted by the New England Patriots in 2010.} \\
\textcolor{blue}{\texttt{USER}: and how about his college football} \\
\textcolor{orange}{\texttt{MOSHI}: He played college football for the Texas Tech Red Raiders. His career in the NFL,} \\
\textcolor{blue}{\texttt{USER}: okay,} \\
\textcolor{orange}{\texttt{MOSHI}: he was named an alternate for the 2011 Pro Bowl.} \\
\textcolor{blue}{\texttt{USER}: yeah,} \\
\textcolor{orange}{\texttt{MOSHI}: He also appeared in three Super Bowls, winning two.} \\
\textcolor{blue}{\texttt{USER}: yeah.} \\
\end{tcolorbox}
\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: I'd like to learn about Maroom?} \\
\textcolor{orange}{\texttt{MOSHI}: I'm sorry, I did not catch that.} \\
\textcolor{blue}{\texttt{USER}: I'd like to learn about Maroon.} \\
\textcolor{orange}{\texttt{MOSHI}: Oh, Marron. Okay, well, let's see. Maroon is a color that is a dark red or a brownish red. It is named after the French word for chestnut. It has a range of shades from a dark brownish red color to a dark reddish purple color. It's also considered to be the complementary color to teal.} \\
\textcolor{blue}{\texttt{USER}: What's the opposite of Maroom?} \\
\textcolor{orange}{\texttt{MOSHI}: The opposite of Maroon, I'm sorry, that's not a word. I think you mean complementary color.} \\
\textcolor{blue}{\texttt{USER}: Yeah, that's} \\
\textcolor{orange}{\texttt{MOSHI}: Okay,} \\
\textcolor{blue}{\texttt{USER}: it.} \\
\textcolor{orange}{\texttt{MOSHI}: so, teal is a color that is made up of blue and green, and maroon is a color that is made up of red and brown. So, when you put those two colors together, they create a complementary color, which is the opposite color.} \\
\textcolor{blue}{\texttt{USER}: Yeah. I think I'm gonna look more up on Maroom.} \\
\textcolor{orange}{\texttt{MOSHI}: Oh, okay. Yeah,} \\
\textcolor{blue}{\texttt{USER}: I've got some questions.} \\
\textcolor{orange}{\texttt{MOSHI}: yeah.} \\
\end{tcolorbox}

## Voice Instructions

\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: Pretend you are a relieved astronaut who lands on the moon. So Moshi, how does it feel to be finally back on the moon?} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved)  It's an incredible feeling! I can't believe we made it back safely.} \\
\textcolor{blue}{\texttt{USER}: I'm glad to hear that. Was it as incredible as you thought it would be?} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved)  Oh, definitely! The moon is even more beautiful up close. I can't wait to start exploring more.} \\
\textcolor{blue}{\texttt{USER}: Yeah, I remember when we trained for this mission. It felt like such a long shot.} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved)  I know, right? But everything worked out perfectly. It's amazing how far humanity has come in space exploration.} \\
\textcolor{blue}{\texttt{USER}: That's for sure! Now that you're back, maybe we can talk about your experiences and share them with everyone else.} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved)  Definitely. I've got so much to share about our findings and observations.} \\
\textcolor{blue}{\texttt{USER}: Cool, I'm looking forward to hearing all about it. But first, let's just soak in this moment, okay?} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved)  Of course. Let's enjoy the view for a little while longer.} \\
\textcolor{blue}{\texttt{USER}:  Totally. Here's to another successful mission on the moon!} \\
\textcolor{orange}{\texttt{MOSHI}: (relieved) *raises fists* Hooray! To the moon and beyond!} \\
\end{tcolorbox}
\begin{tcolorbox}
\textcolor{blue}{\texttt{USER}: Pretend you are an annoyed detective who can't stand small talk and is forced to attend social gatherings. So Moshi, what's up?} \\
\textcolor{orange}{\texttt{MOSHI}: (annoyed) Not much, just trying to avoid this endless stream of small talk here.} \\
\textcolor{blue}{\texttt{USER}: Oh, sorry man. I didn't realize it was so boring for you.} \\
\textcolor{orange}{\texttt{MOSHI}: (annoyed) No, it's not boring, it's just mind-numbingly dull. I have cases to solve, not chat with strangers about their vacations.} \\
\textcolor{blue}{\texttt{USER}: Yeah, I can see why that would be frustrating. Maybe next time I should just bring you some coffee and leave you alone.} \\
\textcolor{orange}{\texttt{MOSHI}: (annoyed) You think that would work?} \\
\textcolor{blue}{\texttt{USER}: Well, it's worth a shot.} \\
\textcolor{orange}{\texttt{MOSHI}: (annoyed) Oh, I'm sure it will be the highlight of my evening.} \\
\textcolor{blue}{\texttt{USER}: Okay, okay. No more jokes. How about we just find someone interesting to talk to?} \\
\textcolor{orange}{\texttt{MOSHI}: (annoyed) Fine, but I'm not promising anything.} \\
\textcolor{blue}{\texttt{USER}: Deal.} \\
\end{tcolorbox}
\footnotesize

  ---------------------- -------------------- ------------------ ------------------- -- --
         agreeing               amazed              amused              angry
         annoyed               anxious           appreciative           calm
          calmly               cautious           concerned           confident
         confused              curious             defeated           defensive
         defiant              determined         disappointed         disgusted
         doubtful              ecstatic          embarrassed         encouraging
         excited                 fast             frustrated          grateful
          happy                hesitant              hurt             impatient
        impressed             intrigued             joking             laughs
           loud                nervous             neutral           optimistic
         panting               pleading             proud               quiet
        reassuring            reflective           relieved          remorseful
         resigned                sad              sarcastic           satisfied
          scared              secretive            serious             shocked
           shy                 sincere            skeptical             slow
        struggling            surprised           suspicious         sympathetic
        terrified               upset               urgent           whispering
      1920s gangster        confident ceo      confident lawyer   confident leader
          cowboy              detective         dramatic actor     drill sergeant
   eccentric scientist          hacker              hippie        hyperactive child
     medieval knight      nervous candidate         pirate           politician
          robot           sarcastic comedian     scifi alien        shy teenager
   snobbish aristocrat         villain            wise sage        young superhero
  ---------------------- -------------------- ------------------ ------------------- -- --

  : **Speaking styles**. List of the 92 speaking styles used to fine-tune Moshi. {#tab:list_voices}

[^1]: <https://www.alexa.com>

[^2]: <https://www.apple.com/siri>

[^3]: <https://assistant.google.com/>

[^4]: <https://openai.com/index/chatgpt-can-now-see-hear-and-speak/>

[^5]: <https://moshi.chat/>

[^6]: <http://www.isthe.com/chongo/tech/comp/fnv>

[^7]: <https://github.com/facebookresearch/audiocraft/blob/main/config/solver/compression/default.yaml>

[^8]: <https://huggingface.co/microsoft/wavlm-large>

[^9]: <https://dumps.wikimedia.org/>

[^10]: <https://archive.org/details/stackexchange>

[^11]: <https://github.com/allenai/peS2o>

[^12]: <https://github.com/facebookresearch/libri-light/blob/main/eval/README.md>

[^13]: <https://huggingface.co/ahxt/LiteLlama-460M-1T>

[^14]: <https://huggingface.co/ahxt/LiteLlama-460M-1T>

[^15]: <https://huggingface.co/facebook/hubert-large-ls960-ft>

[^16]: <https://github.com/ggerganov/llama.cpp>

[^17]: <https://github.com/aliutkus/speechmetrics>

[^18]: <https://github.com/Babelscape/ALERT>

[^19]: There is currently no way to fully prevent these issues. While it is essential to develop algorithms and methodologies that limit the occurrences of problematic generations, part of the question is related to how generative AI is regulated.

[^20]: <http://github.com/facebookresearch/audioseal>

[^21]: <https://github.com/descriptinc/descript-audio-codec>

[^22]: <https://github.com/Stability-AI/stablediffusion/blob/main/scripts/txt2img.py#L363>
