--- abstract: | Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on [*perceive--decide--respond*]{.underline} loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the [**[Audio Interaction Model]{.smallcaps}**]{.underline}, and realize it with **[Audio-Interaction]{.smallcaps}**, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose **[SoundFlow]{.smallcaps}**, a framework that instantiates the perceive--decide--respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct **[StreamAudio-2M]{.smallcaps}**, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and **[Proactive-Sound-Bench]{.smallcaps}** for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help. 
author: - | Zhifei Xie^1\*^ `\quad `{=latex}Zihang Liu^2\*^ `\quad `{=latex}Ze An^2^ `\quad `{=latex}**Xiaobin Hu**^2^`\quad `{=latex}**Yue Liao**^2^\ **Ziyang Ma**^1^`\quad `{=latex}**Dongchao Yang**^3^ `\quad `{=latex}**Mingbao Lin**^2$\dagger$^ `\quad `{=latex}**Deheng Ye**^1$\dagger$^\ **Shuicheng Yan**^2$\dagger$^ `\quad `{=latex}**Chunyan Miao**^1$\dagger$^\ ^1^NTU `\quad `{=latex}^2^NUS `\quad `{=latex}^3^CUHK\ `\faEnvelope`{=latex} [`Zhifei001@e.ntu.edu.sg`](mailto:Zhifei001@e.ntu.edu.sg) bibliography: - ref.bib title: Audio Interaction Model --- \newcommand{\ourdata}{OurData} \newcommand{\ourmodel}{OurModel} \newcommand{\ourmethod}{OurMethod} \newcommand{\bannerL}[2]{% % \colorbox{#1}{\strut\makebox[\bandwidth][l]{~\textbf{\textit{#2}}}}} \newcommand{\bannerR}[2]{% % \colorbox{#1}{\strut\makebox[\bandwidth][l]{~\textbf{\textit{#2}}}}} \newcommand{\best}[1]{\textbf{#1}} \newcommand{\tcpblue}[1]{\textcolor{blue}{\hfill //~#1}} \newcommand{\halfchunk}{\frac{1}{2}\mathrm{chunk}} \maketitle \newcommand{\logoblog}{\raisebox{-0.2ex}{\includegraphics[height=1em]{fig/logo.png}}} \newcommand{\logohf}{\raisebox{-0.2ex}{\includegraphics[height=1em]{fig/huggingface.png}}} \newcommand{\logogh}{\raisebox{-0.2ex}{\includegraphics[height=1em]{fig/github.png}}} \definecolor{deepblue}{RGB}{0, 45, 114} \definecolor{promptbg}{RGB}{221, 235, 247} \definecolor{promptframe}{RGB}{91, 155, 213} ::: center `\logoblog`{=latex} **Project page:** [[`https://xzf-thu.github.io/Audio-Interaction`]{style="color: deepblue"}](https://xzf-thu.github.io/Audio-Interaction)\ `\logohf`{=latex} **Data:** [[`huggingface.co/datasets/zhifeixie/StreamAudio-2M`]{style="color: deepblue"}](https://huggingface.co/datasets/zhifeixie/StreamAudio-2M)\ :::

Audio-Interaction listens to a continuous audio stream and decides at each moment whether to stay silent or speak, unifying conventional capabilities (e.g., dialogue, ASR) and streaming-native capabilities (e.g., simultaneous translation, proactive help) within a single model.

# Introduction

Audio is an inherently real-time and interactive modality. Unlike text, which compresses events into symbolic form, or images, which capture static snapshots, audio is a continuous, always-on channel through which humans perceive and respond to their surroundings. Alongside rapid advances in large language models [@brown2020language; @touvron2023llama; @achiam2023gpt], reinforcement learning [@ouyang2022training; @rafailov2023direct], and agentic intelligence [@yao2022react; @schick2023toolformer], large audio language models (LALMs) have undergone a comparable transformation [@chu2023qwen; @tang2023salmonn; @chu2024qwen2; @xie2024mini], performing fine-grained emotion recognition, multi-step reasoning, tool use, and even code generation directly from acoustic inputs. Together, these advances move audio from narrow recognition tasks toward general-purpose intelligence. However, current LALMs still follow the conventional offline input-output formulation $y = f(x, A)$, mirroring multimodal designs such as LLaVA [@llava], which poorly matches the real-time and interactive nature of audio. **A common bridge has been to train a dedicated streaming model for each important task**, e.g., dialogue [@defossez2024moshi; @fang2024llama; @xie2024mini] and streaming speech recognition [@gao2022paramformer], but this bridging approach has two fundamental problems: [*(i) every capability requires its own model trained from scratch*]{.underline}, and [*(ii) each model handles only a narrow capability*]{.underline}. For instance, even fully-streaming systems such as Moshi [@defossez2024moshi], despite strong conversational capability, cannot interpret a hesitant pause or recognize a cough.

**So, it is time to move toward a new paradigm beyond LALMs: Large Audio Interaction Models (LAIMs)**, an all-in-one framework that subsumes existing tasks within a single interactive model and bridges the gap between LALM-level capabilities and the real-time nature of audio. Moving to this regime surfaces two fundamental challenges absent from its offline predecessor.

**(C1) Comprehension-grounded response triggering.** Offline LALMs respond passively to a fully observed clip, whereas an interactive model must decide *whether to respond* at every chunk based on semantic understanding of the unfolding context, not surface-level acoustic cues. Supervision for this decision is sparse and temporally ambiguous, and no existing corpus pairs continuous audio with properly timed intervention cues, so training data must be constructed through large-scale audio stitching.

**(C2) Real-time context continuity under chunked inference.** Audio must be consumed in fixed-length chunks to meet low-latency requirements, but chunking breaks the temporal continuity of acoustic signals and the long-range context accumulated across the interaction. The model must reconstruct continuity across chunks and retain earlier context without inflating the inference window or stalling on encoder-decoder synchronization.

[*We instantiate this regime as **[Audio-Interaction]{.smallcaps}**, an always-on audio interaction model*]{.underline} trained within our **[SoundFlow]{.smallcaps}** framework. [Audio-Interaction]{.smallcaps} consumes audio one chunk at a time and, at each step, makes a comprehension-grounded decision between responding and remaining silent, forming an always-on *perceive--decide--respond* loop. Under this loop, traditional audio capabilities such as translation, recognition, and dialogue are naturally unified as instructions within a single interactive paradigm.

**[SoundFlow]{.smallcaps} is an end-to-end audio-based interaction framework spanning data, training, and inference, with three components:** **i)** *interaction data synthesis* via a hierarchical event curation pipeline that composes short clips into coherent long-form interactions, with a time-frequency joint preprocessing module ([TFJP]{.smallcaps}) that smooths boundaries and suppresses noise to mimic real-world recordings; **ii)** *interaction-aware training* that casts audio modeling as a chunk-level sequential decision process, with history review and comprehension-aware silence addressing context forgetting and false triggering; **iii)** *asynchronous interactive inference* whose first-in-first-out scheme decouples encoding from decoding, eliminating stalling and cutting first-frame latency by $4.5\times$.

[*Feeding this framework is **[StreamAudio-2M]{.smallcaps}***]{.underline}, a **302k-hour**, **2.6M-item** corpus spanning **28** interactive sub-tasks across **7** major categories, where each sample is a $3$--$15$ turn interaction with sparse, context-dependent response cues. We further release [***[Proactive-Sound-Bench]{.smallcaps}***]{.underline} to evaluate a new capability, audio-based proactive assistance. It contains **644** *human-designed* events that probe whether a model can proactively interrupt without any instruction.

**We empirically validate [Audio-Interaction]{.smallcaps} from two perspectives.** First, from a performance standpoint, we demonstrate that converting the model from offline to interactive preserves competitive capability on mainstream tasks. [Audio-Interaction]{.smallcaps} matches state-of-the-art models on standard benchmarks (58.15 vs. 57.81 on **MMAU**) and surpasses them in several cases, especially under full-speech and multi-turn settings. Second, beyond benchmark results, we look inside the model and analyze what changes during the offline-to-interaction transformation.

# Related Work {#sec:related_work}

#### Large Audio Language Models.

Large audio language models (LALMs) typically combine an audio encoder (often Whisper [@radford2023whisper]), an adapter, and a language model backbone [@chu2024qwen2; @tang2023salmonn; @qwen25omni2025], a design shared by our base model Qwen2.5-Omni [@qwen25omni2025]. Although recent work pursues deeper reasoning [@goel2025audio] and task-specific specialization [@xu2025fireredasr], all operate offline, requiring the complete audio clip before responding.

#### Streaming Multi-modal Systems.

Speech dialogue models [@xie2024mini; @fang2024llama; @fang2025llamaomni2; @defossez2024moshi; @qwen25omni2025] ingest audio chunk by chunk, but interaction stays turn-based: the model reacts only after an utterance ends, rather than understanding a continuous acoustic environment in real time. Even fully-streaming systems like Moshi [@defossez2024moshi] treat non-speech events as background, and streaming ASR [@gao2022paramformer] is limited to transcription. Online video understanding [@li2025videochat; @chen2024videollm] processes frames at roughly 1 fps, but the audio setting demands solutions this line lacks: chunk-level acoustic supervision, long-form heterogeneous streams built from short clips, and tight first-frame latency.

![Human listening is a continuous activity. We take in sound moment by moment and judge for ourselves when a reaction is called for. Current audio models work the opposite way: they wait for a finished recording, answer once, and handle only one kind of task per system. [Audio-Interaction]{.smallcaps} closes this gap by processing sound as it arrives and judging, step by step, when to speak and when to hold back, letting one model cover what previously took many specialized ones.](fig/new_main_newnane.png){#fig:main width="95%"}

# Audio-Interaction

## Overview

[Audio-Interaction]{.smallcaps} bridges the gap between conventional offline, clip-based audio language models and a general streaming audio-language setting.
As shown in figure `\ref{fig:main}`{=latex}, conventional LALMs operate on fixed inputs, $y = f(x, \mathcal{A})$, where $\mathcal{A}$ is the complete utterance and $x$ the text instruction; only after the full signal is observed is a response produced. In contrast, [Audio-Interaction]{.smallcaps} operates directly on continuous audio streams, incrementally consuming audio chunks and autonomously deciding whether to remain silent or respond: $$\begin{equation} (d_t,\, r_t) = f\!\left(a_{\leq t},\; d_{

The training framework of SoundFlow. Audio signals, intermediate representations, and supervision signals are organized into a unified temporal sequence, and a streaming training strategy jointly optimizes language modeling and response triggering, enabling Audio-Interaction to decide when to respond or remain silent across diverse real-time tasks.

**Hierarchical Audio Event Selection.** Another key challenge in constructing streaming audio data is how to organize discrete (`audio`, `instruction`, `response`) segments into [*long, multi-turn audio streams that remain coherent and consistent with real-world commonsense*]{.underline}. A straightforward solution is *random concatenation*, i.e., sampling audio clips independently and stitching them into a long sequence. However, this strategy is suboptimal, as event conflicts across clips (e.g., a car horn occurring while a speaker is talking) can easily break contextual consistency and impair the model's understanding of the evolving scene. To address this issue, we adopt a **hierarchical event curation pipeline** when composing mixed streaming data, which consists of three steps. [*(i) Scenario planning:*]{.underline} we first use an LLM to plan a complete high-level scenario from randomly matched audio annotations, where each scenario contains multiple topics or sub-events. [*(ii) Event refinement:*]{.underline} we then refine each topic into a sequence of concrete audio events and assign a corresponding audio clip to each event. [*(iii) Clip grounding:*]{.underline} the final audio clips are obtained through two mechanisms, *retrieval* or *generation*. For retrieval, the model searches an audio clip database, selects the **top-3** most relevant candidates, and verifies their suitability. When no retrieved clip is sufficiently appropriate, we instead invoke an audio generation model to synthesize the required event. This hierarchical design yields long-form streaming audio with substantially better semantic coherence and environmental plausibility.

## Streaming Training

**Streaming modeling.** As illustrated in Figure 2, both training and inference in our framework follow a fully streaming paradigm. Instead of processing a complete audio clip at once, the model incrementally consumes fixed-length audio chunks.
In our implementation, each chunk spans 400 ms, balancing responsiveness and acoustic completeness. At each step, the model predicts a *single special token* $d_t \in \{\texttt{}, \texttt{}\}$ to determine whether it should continue listening or start responding. Intuitively, the model should remain silent when the current utterance is still incomplete or when the observed evidence is insufficient, and respond once enough information has been accumulated or timely intervention is required. Formally, $$d_t, r_t = f_{\mathrm{det}}(a_t, C_t), \qquad r_t = \begin{cases} \varnothing, & d_t=\texttt{},\\[4pt] f_{\mathrm{resp}}(a_t, C_t), & d_t=\texttt{}, \end{cases}$$ where $a_t$ is the current audio chunk and $C_t$ denotes the streaming context up to step $t$. If $d_t=\texttt{}$, the model emits no textual content and continues consuming subsequent audio chunks. Otherwise, it switches from streaming listening to autoregressive response generation. This formulation casts streaming interaction as a unified sequential process, allowing the model to jointly learn *when* to respond and *what* to generate in real-time spoken interaction. **Context Memory and Comprehension-Aware Silence Training.** During training, we observe two critical failure modes: [*(1) insufficient context retention*]{.underline}, where the model tends to overlook earlier context due to the prevalence of noisy or semantically empty segments in long training sequences; to address this issue, we introduce *history review* training by inserting questions about preceding content into later positions of the sequence, explicitly encouraging long-range contextual retrieval. 
[*(2) false triggering*]{.underline}, where the model tends to respond to interaction-irrelevant acoustic events; to mitigate this issue, we incorporate a large amount of silent audio verified by the agents in [ProactiveSound-Bench]{.smallcaps} to require no response, thereby strengthening the model's ability to remain silent unless intervention is truly warranted. **Dual-loss Multi-step Streaming Conversion.** [Audio-Interaction]{.smallcaps} is initialized from Qwen2.5-Omni-3B, which offers a strong performance--efficiency trade-off at a compact scale and is well suited for low-latency streaming inference. Since the special streaming control token $\texttt{}$ constitutes a new prediction target and is central to streaming interaction, we optimize it with a dedicated streaming objective in addition to the standard language modeling objective. Specifically, the overall training loss is defined as $$\mathcal{L} = \frac{1}{N}\sum_{j=1}^{N} \left( \underbrace{-\log P_\theta\!\left(t_j \mid \mathcal{H}_j\right)}_{\mathcal{L}_{\mathrm{LM}}} + \lambda\underbrace{-\log P_\theta\!\left(s_j \mid \mathcal{H}_j\right)}_{\mathcal{L}_{\mathrm{stream}}} \right),$$ where $t_j$ denotes the target text token, $s_j$ denotes the target streaming control token, $\mathcal{H}_j$ denotes the corresponding decoding context, and $\lambda$ controls the relative weight of the streaming objective. Let $\mathcal{A}^{\mathrm{ins}}$ denote the audio instruction, $\mathcal{A}^{\mathrm{in}}$ the input audio stream, and $\mathcal{T}$ the target response. The training pipeline consists of four stages. [*(1) Format training*]{.underline}: we use offline data to teach the model the target sequence format and the usage of $\texttt{}$, using samples of the form $(\mathcal{A}^{\mathrm{ins}}, \mathcal{A}^{\mathrm{in}} \rightarrow \mathcal{T})$. 
[*(2) Adapter training*]{.underline}: we train the adapter to map chunk-wise acoustic representations into the language model space while keeping the training format unchanged. [*(3) Large-scale streaming supervised training*]{.underline}: we jointly optimize the adapter and language model on core capabilities, including audio understanding, automatic speech recognition, and spoken dialogue, using $(\mathcal{A}^{\mathrm{ins}} \rightarrow \mathcal{T})$ and $(\mathcal{A}^{\mathrm{ins}}, \mathcal{A}^{\mathrm{in}} \rightarrow \mathcal{T})$. [*(4) Instruction-following fine-tuning*]{.underline}: we further train the model on complex streaming behaviors, including continuous assistance, comprehension-aware intervention, and proactive response, using interleaved sequences such as $(\mathcal{A}^{\mathrm{ins}}, \mathcal{A}^{\mathrm{in}}_1, \mathcal{T}_1, \mathcal{A}^{\mathrm{in}}_2, \mathcal{T}_2, \ldots)$, $(\mathcal{A}^{\mathrm{ins}}, \mathcal{A}^{\mathrm{in}}_1, \mathcal{A}^{\mathrm{in}}_2, \mathcal{T}, \mathcal{A}^{\mathrm{in}}_3, \mathcal{T}, \ldots)$, and $(\mathcal{A}^{\mathrm{in}} \rightarrow \mathcal{T})$.

## Stabilizing Asynchronous Inference via FIFO Scheduling

Real-time audio encoding and the model's special-token-based silence--response mechanism can introduce waiting conflicts and scheduling inconsistencies under complex interaction patterns. To mitigate this issue, we adopt an **asynchronous** inference scheme with **FIFO** scheduling. As illustrated in Fig. `\ref{fig:fifo_inference}`{=latex}, the encoder continuously processes streaming audio chunks and appends their acoustic representations to a temporally ordered queue. At each event step $t$, the incoming chunk $x_t$ is encoded into $\mathbf{a}_t$ and appended to the queue $\mathcal{Q}_t$. The decoding process is conditionally triggered based on the last generated token $r_{t-1}$.
Specifically, if $r_{t-1} \in \{\texttt{}, \texttt{}\}$, the model consumes the queued features $\mathcal{Q}_t$ and produces the next output $r_t$. Otherwise, the system remains waiting until subsequent audio chunks arrive. This deployment scheme fully eliminates inference stalling, while reducing the first-frame latency for resuming listening after response completion by $4.5\times$. Together, these improvements enable both stable and low-latency streaming inference.

![SoundFlow's FIFO-scheduled asynchronous streaming inference. Audio chunks are appended to a temporal queue; decoding is triggered when the decoder is not speaking.](fig/figure4-inference-newname.png){#fig:fifo_inference width="\\linewidth"}
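
The scheduling rule above can be illustrated with a small single-threaded simulation. This is a minimal sketch under stated assumptions, not the released implementation: the token names `<silent>`, `<respond>`, and `<end>`, the `FifoStreamer` class, the `toy_policy` function, and the fixed response duration are all hypothetical stand-ins. Only the 400 ms chunk length and the rule "enqueue always, decode only when the decoder is not speaking" come from the text.

```python
from collections import deque

CHUNK_MS = 400  # chunk length stated in the text
# Hypothetical control-token names; the source does not spell them out.
SILENT, RESPOND, END = "<silent>", "<respond>", "<end>"
IDLE_TOKENS = {SILENT, END}  # after either token the decoder is not speaking


def encode(chunk):
    """Stand-in encoder; a real system would return acoustic features."""
    return {"feature": chunk}


class FifoStreamer:
    """Toy scheduler: the encoder always enqueues, the decoder drains only when idle."""

    def __init__(self, decide, response_ticks=2):
        self.queue = deque()      # temporally ordered encoder outputs
        self.decide = decide      # hypothetical policy: list of features -> SILENT or RESPOND
        self.response_ticks = response_ticks  # chunk arrivals one response spans (assumed)
        self.last_token = SILENT
        self.remaining = 0
        self.log = []

    def on_chunk(self, chunk):
        self.queue.append(encode(chunk))          # encoding never waits for decoding
        if self.last_token not in IDLE_TOKENS:    # decoder is still mid-response
            self.remaining -= 1
            if self.remaining <= 0:
                self.last_token = END             # response finished, listening resumes
            self.log.append(("buffered", len(self.queue)))
            return
        batch = list(self.queue)                  # consume queued features in arrival order
        self.queue.clear()
        self.last_token = self.decide(batch)
        if self.last_token == RESPOND:
            self.remaining = self.response_ticks
        self.log.append((self.last_token, len(batch)))


def toy_policy(batch):
    # Respond only when a (made-up) "question" feature is present.
    return RESPOND if any(f["feature"] == "question" for f in batch) else SILENT


streamer = FifoStreamer(toy_policy)
for chunk in ["noise", "question", "a", "b", "c"]:
    streamer.on_chunk(chunk)
print(streamer.log)
```

In this toy run the two chunks that arrive while the decoder is responding are only buffered. Once the response ends, they are consumed in arrival order together with the next chunk, so no audio is dropped and the encoder never blocks.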

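The dual-loss objective from the Streaming Training subsection can be checked numerically with a few lines of code. The sketch below evaluates $\mathcal{L} = \frac{1}{N}\sum_j \left(-\log P_\theta(t_j \mid \mathcal{H}_j) - \lambda \log P_\theta(s_j \mid \mathcal{H}_j)\right)$ from per-position probabilities. The function name `dual_loss` and the value $\lambda = 0.5$ are illustrative assumptions, since the text does not report the weight actually used.

```python
import math


def dual_loss(p_text, p_stream, lam):
    """Average of -log P(t_j | H_j) + lam * (-log P(s_j | H_j)) over N positions.

    p_text[j]   : model probability of the target text token at position j
    p_stream[j] : model probability of the target streaming control token at position j
    lam         : weight of the streaming term (value here is an assumption)
    """
    if not p_text or len(p_text) != len(p_stream):
        raise ValueError("need two non-empty probability lists of equal length")
    n = len(p_text)
    lm = -sum(math.log(p) for p in p_text) / n
    stream = -sum(math.log(p) for p in p_stream) / n
    return lm + lam * stream, lm, stream


total, lm_term, stream_term = dual_loss([0.5, 0.5], [0.25, 0.25], lam=0.5)
print(total, lm_term, stream_term)
```

With these toy probabilities the language-modeling term is $\ln 2$, the streaming term is $\ln 4$, and the weighted sum is $2\ln 2$, which shows how $\lambda$ trades off the two objectives.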
StreamAudio-2M is a dataset built for streaming audio interaction, pairing long-form, real-world-simulated audio with token-level annotations. It jointly trains the model to interact in real time grounded in context while covering 7 foundational capabilities across 28 sub-tasks.

# StreamAudio-2M Dataset {#sec:data_section}

## Overview {#sec:data_overview}

Existing audio datasets are dominated by short *(clip, instruction, response)* triplets [@kong2024audioflamingo; @chu2024qwen2], which are fundamentally misaligned with streaming audio LLMs that operate over continuous streams and must jointly decide *when* to respond and *what* to produce. To bridge this gap, we introduce *StreamAudio-2M* (Figure `\ref{fig:streamaudio_main}`{=latex}), a large-scale streaming-native corpus that covers the full spectrum of streaming audio interaction through **7 major categories**: *[Audio Agent, Proactive Respond, Voice Chatting, Streaming Audio Understanding, Following Music, Real-time ASR, and Streaming Translation]{.underline}*, further partitioned into **28 streaming sub-tasks**. In total, the corpus comprises **2.6M** items totaling **302k hours**, where each sample is a **3--15 turn** heterogeneous interaction with interleaved events and sparse, context-dependent response cues. The detailed task composition and proportions are illustrated in Figure `\ref{fig:overview}`{=latex}.

## Curation Pipeline {#sec:curation_pipeline}

The pipeline proceeds as follows.

**(i) Data Collection.** As shown in Figure `\ref{fig:overview}`{=latex}, our sources are drawn from a wide range of well-established real-world datasets to stay close to real distributions and improve robustness. They include dialogue corpora (MOSS), ASR corpora (CommonVoice, GigaSpeech, LibriSpeech [@panayotov2015librispeech], VoxPopuli), speech translation data (CoVoST2 [@wang2021covost], AISHELL), and music and audio-QA prompts (FMA, AudioSet [@gemmeke2017audio]), yielding $\sim$1.64M foundational task items ($\sim$8,900 hours). On top of these we add $\sim$171k acoustic-event clips (AudioSet events, AudioX [@tian2025audiox], ElevenLabs) and noise sources (MUSAN [@snyder2015musan], WHAM! [@wichern2019wham], DNS-Challenge [@timcheck2023intel]) used only as environmental conditioning.

**(ii) Preprocessing.** Textual sources are converted into speech with multi-voice CosyVoice and verified by LLM rewriting and ASR checking.

**(iii) Sequence Concatenation.** Validated instances are composed into streaming sequences following Section `\ref{sec3.2: Streaming Data Construction}`{=latex}, with dual-track environmental noise superimposed.

**(iv) Token-level Annotation.** The resulting sequences are converted into $\langle\text{input ids}, \text{labels}\rangle$ pairs.
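
To make the token-level annotation step concrete, the following sketch builds an $\langle\text{input ids}, \text{labels}\rangle$ pair for a toy stream. Everything specific here is an assumption for illustration: the vocabulary ids, the token names, one `<audio>` placeholder per chunk (a real 400 ms chunk would expand to many audio embeddings), unshifted labels, and the use of `-100` as the ignore value, which is a common convention rather than something the text states.

```python
IGNORE = -100  # common "no loss here" label value; its use by the authors is an assumption

# Hypothetical token ids for illustration only.
VOCAB = {"<audio>": 0, "<silent>": 1, "<respond>": 2, "<eos>": 3}


def toy_tokenize(text):
    """Stand-in tokenizer: one made-up id per whitespace-separated word."""
    return [10 + i for i, _ in enumerate(text.split())]


def annotate(turns, tokenize=toy_tokenize):
    """Build (input_ids, labels) for a stream.

    turns: list of (num_chunks, response_text_or_None). Each chunk is followed by a
    control token: <respond> after the last chunk of a turn that has a response,
    <silent> otherwise. Audio positions are masked out of the loss.
    """
    input_ids, labels = [], []
    for num_chunks, response in turns:
        for i in range(num_chunks):
            input_ids.append(VOCAB["<audio>"])
            labels.append(IGNORE)                      # audio positions carry no loss
            is_last = i == num_chunks - 1
            ctrl = VOCAB["<respond>"] if (is_last and response is not None) else VOCAB["<silent>"]
            input_ids.append(ctrl)
            labels.append(ctrl)                        # control token is supervised
        if response is not None:
            ids = tokenize(response) + [VOCAB["<eos>"]]
            input_ids.extend(ids)
            labels.extend(ids)                         # response text is supervised
    return input_ids, labels


input_ids, labels = annotate([(2, None), (1, "hi there")])
print(input_ids)
print(labels)
```

The first turn contributes two silent decisions, and the second turn contributes a respond decision followed by the supervised response tokens, which matches the sparse, context-dependent response cues described in the overview.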

Statistics of StreamAudio-2M. **(a)** The capability taxonomy spans seven core capabilities of a streaming audio model. **(b)** Round distribution, average response tokens, and silence proportion across tasks. **(c)** Statistics of source data.

## Proactive-Sound-Bench {#sec:proactive_bench}

**Proactive-Sound-Bench** evaluates proactive streaming response through **644** *human-designed* acoustic events, each requiring the model to correctly trigger or abstain within a continuous stream. Events span 6 top-level categories with $17$ sub-categories and are organized into two tiers: the *Single* tier tests single-event decisions, and the *Multiple* tier concatenates same-category events to probe sustained intervention against distractors. Average accuracy is the final metric. Per-category statistics are provided in Table `\ref{tab_meso_definitions}`{=latex}.

# Experiments {#sec:experiments}

## Settings {#sec:settings}

#### Benchmarks.

We evaluate [Audio-Interaction]{.smallcaps} on **8** audio benchmarks spanning the full spectrum of LALM capabilities: **MMAU** [@sakshi2024mmau] for general audio understanding across Sound, Music, and Speech; four spoken-dialogue benchmarks, including AlpacaEval [@dubois2023alpacafarm], SD-QA [@faisal2021sd], Llama Questions [@nachmani2023spoken], and Web Questions [@berant2013semantic], following the **VoiceBench** [@chen2026voicebench] setting; **LibriSpeech** [@panayotov2015librispeech] (clean/other) for speech recognition; **CoVoST2** [@wang2021covost] (En$\leftrightarrow$Zh) for speech-to-text translation; and our newly proposed **Proactive-Sound-Bench** for evaluating proactive response capability.

#### Baselines.

We compare against three categories of models. **Audio LLMs**: Audio Flamingo 2 [@ghosh2025audio], Qwen2-Audio [@chu2024qwen2], Voxtral-Mini [@liu2025voxtral], and Audio-Reasoner [@audio-reasoner]. **Omni LLMs**: Qwen2.5-Omni [@qwen25omni2025], Baichuan-Omni-1.5 [@li2025baichuan], and Phi-4-multimodal [@abouelenin2025phi].
**Task-specialized models**: Whisper-large-v3 [@radford2023whisper] and Canary [@sekoyan2025canary] for ASR; Moshi [@defossez2024moshi], Freeze-Omni [@wang2024freeze], and LLaMA-Omni2 [@fang2025llama] for streaming spoken dialogue.

## Main Results {#sec:main_results}

\scriptsize \setlength{\tabcolsep}{3pt} \renewcommand{\arraystretch}{0.8}
\resizebox{\textwidth}{!}{%
\begin{tabular}{@{}l c c c cccc cccc@{}}
\toprule
\textbf{Model} & \textbf{Size} & \textbf{Stream.} & \textbf{Multi-turn} & \multicolumn{4}{c}{\textbf{Text instruction}} & \multicolumn{4}{c}{\textbf{Audio instruction}} \\
\cmidrule(lr){5-8} \cmidrule(lr){9-12}
& & & & Sound & Music & Speech & Avg. & Sound & Music & Speech & Avg. \\
\midrule
\cellcolor{headerblue}\textit{\textbf{Large Audio Language Models}} \\
Audio Flamingo 2 & 3B & \ding{55} & \ding{55} & \textbf{71.47} & \textbf{70.96} & 44.74 & \underline{62.40} & 1.50 & 1.49 & 0.35 & 1.16 \\
Qwen2-Audio & 7B & \ding{55} & \ding{51} & 54.95 & 50.98 & 42.04 & 49.20 & 22.32 & 19.16 & 16.31 & 19.41 \\
Voxtral-Mini & 3B & \ding{55} & \ding{51} & 58.56 & 49.70 & 43.53 & 50.60 & 46.08 & 34.13 & 30.50 & 37.24 \\
Audio-Reasoner & 8.4B & \ding{55} & \ding{55} & 60.06 & 64.30 & \textbf{60.70} & 61.71 & 20.48 & 26.65 & 13.48 & 20.57 \\
\cellcolor{headerblue}\textit{\textbf{Omni Language Models}} \\
Qwen2.5-Omni & 3B & \ding{55} & \ding{51} & 65.36 & 48.94 & 57.78 & 57.81 & 51.81 & 44.01 & 29.79 & 42.51 \\
Qwen2.5-Omni & 7B & \ding{55} & \ding{51} & \underline{67.87} & \underline{69.16} & \underline{59.76} & \textbf{65.60} & \underline{60.54} & \underline{50.90} & \underline{35.11} & \underline{49.58} \\
Phi-4-multimodal & 5.6B & \ding{55} & \ding{51} & 60.97 & 52.87 & 52.83 & 55.56 & 44.65 & 27.84 & 21.99 & 31.75 \\
Baichuan-Omni-1.5 & 7B & \ding{55} & \ding{51} & 65.47 & 58.98 & 55.26 & 59.90 & 57.53 & 36.53 & 24.82 & 40.40 \\
\cellcolor{headerblue}\textit{\textbf{Streaming Audio Language Models}} \\
\textbf{Audio-Interaction} & 3B & \ding{51} & \ding{51} & 64.12 & 47.80 & 55.13 & 55.68 &
\best{65.63} & \best{57.93} & \best{46.68} & \best{58.15}\\
\bottomrule
\end{tabular}
}

\footnotesize \setlength{\tabcolsep}{3.5pt} \renewcommand{\arraystretch}{0.8}

| **Model** | **Size** | **SpokenQA: Llama Q.** | **SpokenQA: Web Q.** | **VoiceBench: AlpacaEval** | **VoiceBench: SD-QA** |
|---|---|---|---|---|---|
| `\cellcolor{headerblue}`{=latex}***Specialized Models*** | | | | | |
| Moshi | 7B | 62.20 | 26.30 | 2.01 | 15.01 |
| Freeze-Omni | 7B | 72.00 | 44.73 | 4.14 | 50.16 |
| `\cellcolor{headerblue}`{=latex}***Omni & Audio Language Models*** | | | | | |
| Baichuan-Omni-1.5 | 7B | **78.50** | [59.10]{.underline} | **4.50** | 43.40 |
| Qwen2-Audio | 7B | 69.67 | 45.20 | 3.74 | 35.71 |
| Qwen2.5-Omni | 3B | 66.00 | 27.95 | 4.32 | 49.37 |
| Qwen2.5-Omni | 7B | [75.33]{.underline} | **62.80** | [4.49]{.underline} | **55.71** |
| Phi-4-multimodal | 5.6B | 60.2 | 26.6 | 3.81 | 39.78 |
| `\cellcolor{headerblue}`{=latex}***Streaming Audio Language Models*** | | | | | |
| **Audio-Interaction** | 3B | 67.31 | 54.34 | 4.28 | [52.14]{.underline} |

: Spoken-dialogue results on SpokenQA (Llama Questions, Web Questions) and VoiceBench (AlpacaEval, SD-QA). {#tab:dialogue}

\hfill \footnotesize \setlength{\tabcolsep}{3pt} \renewcommand{\arraystretch}{0.8}

| **Model** | **Size** | **ASR: clean** | **ASR: other** | **S2TT: en-zh** | **S2TT: zh-en** |
|---|---|---|---|---|---|
| `\cellcolor{headerblue}`{=latex}***Specialized Models*** | | | | | |
| Canary | 1B | **1.48** | **2.93** | \- | \- |
| Canary-Qwen | 2.5B | [1.49]{.underline} | [3.10]{.underline} | \- | \- |
| `\cellcolor{headerblue}`{=latex}***Omni & Audio Language Models*** | | | | | |
| Baichuan-Omni-1.5 | 7B | 5.71 | 10.09 | \- | \- |
| Qwen2-Audio | 7B | 1.60 | 3.60 | 45.20 | 24.40 |
| Qwen2.5-Omni | 3B | 2.87 | 5.90 | 39.50 | 18.17 |
| Qwen2.5-Omni | 7B | 1.80 | 3.40 | 41.40 | [29.40]{.underline} |
| Phi-4-multimodal | 5.6B | 1.69 | 3.82 | [46.30]{.underline} | 22.39 |
| `\cellcolor{headerblue}`{=latex}***Streaming Audio Language Models*** | | | | | |
| **Audio-Interaction** | 3B | 3.17 | 6.04 | `\best{55.22}`{=latex} | `\best{35.21}`{=latex} |

: WER (%, $\downarrow$) on LibriSpeech and speech-to-text translation (S2TT) BLEU ($\uparrow$) on CoVoST2. {#tab:asr}

We summarize our main results as three enhancements (Tab. `\ref{tab:dialogue}`{=latex}): **\[Enh.1\]** [Audio-Interaction]{.smallcaps} (Fig. `\ref{figure1-overview}`{=latex}, `\ref{figure2-overview}`{=latex}) preserves general audio understanding under streaming training, **\[Enh.2\]** it remains competitive on core speech tasks, and **\[Enh.3\]** it unlocks streaming capabilities that offline LALMs cannot express.

***\[Enh.1\] Retained audio understanding under streaming training.*** On MMAU (Tab. `\ref{tab:mmau}`{=latex}), our model reaches **58.15** under audio instructions, slightly above the 57.81 that its Qwen2.5-Omni-3B initialization scores under text instructions, and remains comparable to several 7B systems at a smaller parameter scale.

***\[Enh.2\] Competitive performance on core speech tasks.*** On CoVoST2 (Tab.
`\ref{tab:asr}`{=latex}), our model improves over its initialization by **+15.72/+17.04** BLEU on en-zh/zh-en and surpasses the listed 7B baselines in both directions. It also matches or exceeds the base model on three of four dialogue benchmarks, with only a marginal WER regression on LibriSpeech as the cost of moving from an utterance-level ASR head to a chunk-wise streaming decoder.

***\[Enh.3\] Unlocked capabilities beyond offline LALMs.*** The first is [*robustness to spoken instructions*]{.underline}: offline baselines suffer sharp drops under audio instructions, while our model has no such mismatch by construction and remains stable. The second is [*selective proactive response*]{.underline}: on Proactive-Sound-Bench (Tab. `\ref{tab:main_results_updated}`{=latex}), our model reaches **61.2** on the Single tier and **62.8** on the Multiple tier, with balanced coverage across categories and stable performance under longer streams. The third is [*capability stability under stream concatenation*]{.underline}, which reflects the inherent long-stream robustness gained from native streaming training: as $N$ grows to $5$, [Audio-Interaction]{.smallcaps} retains over $91\%$ of its single-segment accuracy, whereas the baseline degrades by more than $30\%$.

\begin{table*}[t]
\footnotesize \setlength{\tabcolsep}{4.2pt} \renewcommand{\arraystretch}{1.1}
\caption{Results on the Proactive-Sound-Bench. \textit{Equip.} stands for Equipment. \textbf{Sin.} and \textbf{Mul.} denote Single-round and Multi-round respectively. \textbf{Best} and \underline{second-best} results are highlighted.}
\begin{tabular}{@{}l!{\vrule width 0.4pt}cccccccccccc!{\vrule width 0.4pt}cc@{}}
\toprule
\textbf{Model} & \multicolumn{2}{c}{\textbf{Human}} & \multicolumn{2}{c}{\textbf{Daily}} & \multicolumn{2}{c}{\textbf{Equip.}} & \multicolumn{2}{c}{\textbf{Traffic}} & \multicolumn{2}{c}{\textbf{Nature}} & \multicolumn{2}{c}{\textbf{Music}} & \multicolumn{2}{c}{\textbf{Avg.}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7} \cmidrule(lr){8-9} \cmidrule(lr){10-11} \cmidrule(lr){12-13} \cmidrule(lr){14-15}
& Sin. & Mul. & Sin. & Mul. & Sin. & Mul. & Sin. & Mul. & Sin.
& Mul.& Sin. & Mul.& Sin. & Mul. \\ \midrule \rowcolor{headerblue} \textit{\textbf{Omni \& Audio Language Models}} \\ Qwen2.5-Omni-3B & 37.2 & 28.9 & 48.1 & 42.5 & 30.0 & 17.9 & 44.9 & 36.7 & 45.6 & 17.5 & 53.3 & 40.0 & 41.0 & 29.3 \\ Qwen2.5-Omni-7B & \underline{54.5} & 34.6 & \underline{72.9} & 40.2 & 47.9 & 19.3 & 53.1 & 24.5 & \underline{55.3} & 31.1 & 53.3 & \textbf{60.0} & 58.2 & 32.1 \\ Kimi-Audio-Instruct & 39.1 & 26.3 & 61.3 & 38.6 & 28.6 & 22.1 & 28.6 & 16.3 & 26.2 & 28.2 & 26.7 & 26.7 & 39.9 & 28.4 \\ MiniCPM-o-4.5 & 53.8 & 53.2 & \textbf{75.1} & \textbf{75.4} & \underline{52.9} & \underline{52.9} & \underline{55.1} & \underline{55.1} & 48.5 & 47.6 & 53.3 & 53.3 & \underline{58.9} & \underline{58.9} \\ Step-Audio 2 & 9.6 & 5.8 & 7.7 & 3.4 & 4.3 & 0.0 & 12.2 & 6.1 & 14.6 & 1.0 & 6.7 & 0.0 & 8.9 & 3.0 \\ Gemini-3-Flash & 48.1 & \underline{59.6} & 32.0 & 47.5 & 25.7 & 40.0 & 28.6 & 53.1 & 48.5 & \underline{56.3} & 33.3 & 53.3 & 37.0 & 50.8 \\ \rowcolor{headerblue} \textit{\textbf{Streaming Audio Language Models}} \\ \textbf{Audio-Interaction} & \textbf{56.4} & \textbf{64.9} & 68.1 & \underline{65.8} & \textbf{57.1} & \textbf{55.7} & \textbf{64.9} & \textbf{69.0} & \textbf{61.8} & \textbf{61.8} & \textbf{66.7} & \textbf{60.0} & \textbf{61.2} & \textbf{62.8} \\ \bottomrule \end{tabular} \label{tab:main_results_updated} \end{table*}

Results of per-head importance for special streaming control token generation, measured via single-head ablation across four tasks.

## Additional Analysis

Beyond benchmark scores, we further investigate where in the model the offline-to-streaming gap is bridged. We present two observations, each addressing one of the structural challenges inherent to the streaming regime; further analyses, including attention maps and per-task breakdowns, are deferred to the appendix.

***\[Obs.1\] SALMs unify discrete chunks into a continuous representation at the earliest decoder layer.*** Each 0.4 s chunk is encoded with independent position embeddings and without cross-chunk encoder attention, leaving the audio frontend with no mechanism for representing time as continuous. We quantify this fragmentation with a *continuity ratio*, the cosine similarity of boundary pairs relative to intra-chunk pairs (1.0 denoting seamless continuity). As shown in Fig. `\ref{fig:continuity}`{=latex}, the encoder output sits at 0.25 and the projector shifts it by less than 0.02, whereas GPT Layer 0 lifts it to 0.80 in a single step. All four tasks trace the same curve, indicating that continuity is reconstructed at the earliest decoder layer through cross-chunk KV-cache access, as a property of the streaming regime rather than of any task-specific head.

***\[Obs.2\] SALMs learn the silent vs. respond decision through a single key attention head.*** A streaming model continuously emits $\langle\texttt{silent}\rangle$ or $\langle\texttt{response}\rangle$ tokens to gate its output. To localize this decision, we zero each attention head in turn and measure the degradation in streaming-control-token generation. As shown in Fig. `\ref{fig:head_importance}`{=latex}, among 576 heads, a single head (L35H14) dominates across all four tasks, and its ablation alone reduces the S2TT token-match score by 0.88. This indicates that the streaming objective routes the decision through a narrow, task-independent pathway rather than through dedicated per-task circuitry.
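To make the continuity ratio of Obs.1 concrete, the following minimal sketch computes it from a sequence of per-frame hidden states. It assumes that both boundary and intra-chunk pairs are taken between temporally adjacent frames and that the ratio is the quotient of their mean cosine similarities. The function names and the `frames_per_chunk` parameter are hypothetical and serve illustration only; they are not the exact measurement code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def continuity_ratio(frames, frames_per_chunk):
    """Mean similarity of chunk-boundary pairs divided by that of intra-chunk pairs.

    frames: hidden-state vectors in temporal order (one per frame).
    frames_per_chunk: assumed number of frames produced per 0.4 s chunk.
    """
    boundary, intra = [], []
    for i in range(len(frames) - 1):
        sim = cosine(frames[i], frames[i + 1])
        # The pair (i, i+1) straddles a boundary when frame i+1 opens a new chunk.
        if (i + 1) % frames_per_chunk == 0:
            boundary.append(sim)
        else:
            intra.append(sim)
    if not boundary or not intra:
        raise ValueError("need at least two chunks with two frames each")
    mean_intra = sum(intra) / len(intra)
    if mean_intra == 0:
        raise ValueError("intra-chunk similarity is zero; ratio undefined")
    return (sum(boundary) / len(boundary)) / mean_intra
```

Under this definition, a value near 1.0 means that frames on either side of a chunk boundary are as similar as neighbouring frames inside a chunk, which is the seamless case described above.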

Capability stability of Audio-Interaction as the stream extends from 1 to 5 concatenated segments. We report MMAU average accuracy, dialogue accuracy, and end-to-end latency.

## Ablation Study

Through ablation (Fig. `\ref{fig:ablation_stability}`{=latex}), we derive four key observations pertaining to [Audio-Interaction]{.smallcaps}: **\[Obs.1\]** the necessity of FIFO-scheduled asynchronous inference, **\[Obs.2\]** the cumulative contribution of streaming training and data, **\[Obs.3\]** the effect of chunk size on the accuracy--latency trade-off, and **\[Obs.4\]** the balancing role of the dual-loss weight.

\hfill \small \captionof{table}{Effect of asynchronous inference.} `\label{tab:fifo_inference}`{=latex}

  **Settings**                 **Avg. FCL**   **Stall %**
  ---------------------------- -------------- -------------
  [Ours]{.smallcaps}           **392 ms**     **0.0%**
  `\quad `{=latex}$w/o$ FIFO   831 ms         5.2%

\small \setlength{\tabcolsep}{2pt} \renewcommand{\arraystretch}{0.85}

+--------------------+-----------------------+--------------------+-----------------------+--------------------------+
| **Variant**        | **Configuration**     | **MMAU**$\uparrow$ | **Alpaca.**$\uparrow$ | **Trig. Acc.**$\uparrow$ |
+:===================+:======================+:==================:+:=====================:+:========================:+
| V1                 | Baseline              | 57.81              | **4.32**              | --                       |
+--------------------+-----------------------+--------------------+-----------------------+--------------------------+
| V2                 | \+ Streaming SFT      | **58.56**          | 4.17                  | 92.42%                   |
+--------------------+-----------------------+--------------------+-----------------------+--------------------------+
| V3                 | V2 w/o TFJP pre.      | 57.74              | 4.19                  | 85.35%                   |
+--------------------+-----------------------+--------------------+-----------------------+--------------------------+
| V4                 | V2 w/o Event sel.     | 55.11              | 4.25                  | 88.51%                   |
+--------------------+-----------------------+--------------------+-----------------------+--------------------------+
| V5                 | **Audio-Interaction** | 58.15              | 4.28                  | **96.77**%               |
+--------------------+-----------------------+--------------------+-----------------------+--------------------------+

: Effect of streaming training and data. {#tab:ablation_data}

\hfill \setlength{\tabcolsep}{2pt} \renewcommand{\arraystretch}{0.85}

+--------------------+-----------------------+---------------------+----------------------+
| **Variant**        | **Alpaca.**$\uparrow$ | **MMAU**$\uparrow$  | **Lat.**$\downarrow$ |
+:===================+:=====================:+:===================:+:====================:+
| Baseline           | 4.32                  | 57.81               | --                   |
+--------------------+-----------------------+---------------------+----------------------+
| Chunk = 0.2 s      | 3.41                  | 49.74               | **258**              |
+--------------------+-----------------------+---------------------+----------------------+
| Chunk = 0.6 s      | 4.27                  | [58.46]{.underline} | 674                  |
+--------------------+-----------------------+---------------------+----------------------+
| Chunk = 0.8 s      | **4.30**              | **59.13**           | 786                  |
+--------------------+-----------------------+---------------------+----------------------+
| **Chunk = 0.4 s**  | [4.28]{.underline}    | 58.15               | [392]{.underline}    |
+--------------------+-----------------------+---------------------+----------------------+

: Effect of chunk size (latency in ms). {#tab:ablation_chunk}

**\[Obs.1\] Necessity of FIFO inference.** As shown in Table `\ref{tab:fifo_inference}`{=latex}, removing FIFO scheduling increases the average first-chunk latency (FCL) from $392$ ms to $831$ ms (a $2.12\times$ slowdown) and raises the stall rate from $0.0\%$ to $5.2\%$, confirming that decoupling encoding from decoding is essential for stable, low-latency streaming inference.
**\[Obs.2\] Cumulative contribution of streaming training and data.** As shown in Table `\ref{tab:ablation_data}`{=latex}, streaming SFT (V2) improves MMAU over the offline base (V1) from $57.8$ to $58.6$ and reaches $92.4\%$ trigger accuracy. Removing TFJP preprocessing (V3) or hierarchical event selection (V4) drops trigger accuracy by $7.1$ and $3.9$ points, respectively, showing that boundary smoothing and semantically coherent event composition are both essential for context-dependent triggering. Full [Audio-Interaction]{.smallcaps} (V5) attains the best trigger accuracy of $96.77\%$ while keeping MMAU ($58.15$) and Alpaca. ($4.28$) close to the best variants.

**\[Obs.3\] Chunk size on the accuracy--latency trade-off.** As shown in Table `\ref{tab:ablation_chunk}`{=latex}, an overly small chunk of $0.2$ s severely degrades performance (Alpaca. $3.41$, MMAU $49.7$) due to insufficient semantic context, while $0.6$ s and $0.8$ s recover accuracy but inflate latency to $674$ ms and $786$ ms. The chosen $0.4$ s setting attains comparable accuracy ($4.28$ / $58.2$) at roughly half the latency of the $0.8$ s setting ($392$ ms vs. $786$ ms), achieving the best accuracy--latency trade-off.

**\[Obs.4\] Balancing role of the dual-loss weight $\lambda$.** As shown in Table `\ref{tab:ablation_lambda}`{=latex}, increasing $\lambda$ steadily improves trigger accuracy from $95.3$ to $96.9$, while overly large values ($\lambda{=}2.0$) start to harm comprehension (MMAU drops to $57.3$). We therefore adopt $\lambda{=}1.0$ as the best trade-off.

\hfill \small \setlength{\tabcolsep}{5pt} \renewcommand{\arraystretch}{0.85} \captionof{table}{Effect of dual-loss weight $\lambda$.} `\label{tab:ablation_lambda}`{=latex}

  $\lambda$                $0.5$      $1.0$      $2.0$
  ------------------------ ---------- ---------- ----------
  MMAU$\uparrow$           **58.3**   58.2       57.3
  Trigger Acc.$\uparrow$   95.3       96.7       **96.9**

## Case Study

Case studies show Audio-Interaction’s gains over SOTA streaming models. In the second case, other models detect the cat mostly through the transcribed word "meow", while Audio-Interaction handles the audio cue directly thanks to native streaming training.

# Conclusion

In this work, we identified a key gap between the offline paradigm of existing Large Audio Language Models (LALMs) and the continuous, interactive nature of the audio modality: streaming models remain confined to isolated, independent tasks, and a general streaming audio language model is still missing. To close this gap, we formalized the **Audio Interaction Model** as a new concept and introduced **[Audio-Interaction]{.smallcaps}**, a unified Audio Interaction Model that handles conventional offline and streaming tasks while further achieving general streaming audio instruction following within a single all-in-one model. We realized this through the **[SoundFlow]{.smallcaps}** framework, which reformulates audio interaction as an always-on *perceive--decide--respond* process and instantiates it end to end, from data to training to deployment, via streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference. To support and evaluate this paradigm, we constructed **[StreamAudio-2M]{.smallcaps}**, a 2.6M-item streaming corpus covering 7 fundamental abilities and 28 sub-tasks, together with **[Proactive-Sound-Bench]{.smallcaps}**. Extensive experiments on 8 benchmarks show that [Audio-Interaction]{.smallcaps} preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including comprehension-grounded response triggering, long-stream interaction, and proactive assistance. We hope the Audio Interaction Model formulation, along with [SoundFlow]{.smallcaps} and our released resources, can serve as a foundation for future research on unified streaming audio intelligence.
\newpage

\appendix

# Real-World Validity and Case Study

## Real-World Validation {#appendix:realworld}

To verify that the streaming behavior of [Audio-Interaction]{.smallcaps} generalizes beyond stitched synthetic streams, we evaluate on approximately 2 hours of naturally recorded audio drawn from four deployment scenarios that an always-on audio assistant is expected to encounter in practice: *Travel* (airports, train stations, hotel lobbies; multilingual conversations with PA announcements and crowd ambience), *Work* (small-group meetings, focused work with keyboard typing and notification chimes), *Home* (kitchen, living-room and bedroom activity with appliances, glassware, and a small number of staged safety-relevant events such as a dropped glass or a smoke-alarm beep), and *Commute* (walking, cycling, and in-vehicle conditions with traffic, wind, and occasional close-range horns). All audio was captured on consumer-grade smartphones and laptops at 16 kHz and was *not* processed by TFJP or any of the synthesis-time enhancement applied to [StreamAudio-2M]{.smallcaps}, so the evaluation reflects unfiltered acoustic conditions.

Across the four scenarios, [Audio-Interaction]{.smallcaps} retains the bulk of its synthetic-stream performance, with degradation patterns that track scenario-specific acoustic difficulty rather than indicating systemic failure. Trigger accuracy averages 58.9% (vs. 62.0% on a matched synthetic split), and falls off most in *Travel* and *Commute*, where crowd ambience and non-stationary noise raise both ASR WER (to roughly 7.9% and 8.6%) and the false-positive rate of proactive responses; *Work* is closest to the synthetic baseline, while *Home* preserves trigger accuracy but shows mildly elevated false positives, driven by impulsive but benign kitchen sounds that locally resemble safety-critical events. The average first-chunk latency stays within $\pm 25$ ms of the synthetic measurement in every scenario, indicating that the FIFO scheduler is insensitive to recording-side jitter and device variation.

More importantly, the model's internal decision-making is preserved on real recordings: per-chunk silence rates correlate at 0.91 (Pearson, 2 s bins) with the matched synthetic split, ablating the dominant streaming-control head L35H14 degrades token-match by 0.86 versus 0.88 on synthetic, and the boundary-to-internal continuity ratio at GPT Layer 0 is 0.78 versus 0.80 on synthetic. Together, these results suggest that the streaming decision boundary learned by [Audio-Interaction]{.smallcaps} reflects genuine acoustic comprehension rather than a concatenation cue, and that synthetic-stream training transfers to in-the-wild recordings without per-scenario adaptation.

\newpage

## Case Study

\newpage

\newpage

# Method Details {#appendix:method}

This appendix expands the four operational components of the streaming framework that §3 and §4.2 of the main paper state but do not detail. Throughout, $c=400$ ms denotes the streaming chunk size, and $f_{\text{enc}}, f_{\text{proj}}, f_{\text{dec}}$ refer to the audio encoder, adapter, and language model components inherited from Qwen2.5-Omni-3B. Optimization hyperparameters (learning rate, batch size, total steps) are deferred to Appendix `\ref{appendix:hyperparams}`{=latex}.

## Streaming Data Construction {#appendix:data-construction}

The TFJP module of §3.2 stabilizes clip-level audio prior to stitching through six operators sharing one STFT representation: [silence_cut]{.smallcaps} truncates silent runs longer than $\tau$ via an energy-percentile gate at the 10th percentile of frame energy; [noise_profile]{.smallcaps} estimates a stationary noise spectrum from the lowest-energy 5% of frames; [denoise]{.smallcaps} applies spectral subtraction with gating coefficient $\gamma=1.0$; [core_locate]{.smallcaps} returns the contiguous span maximizing a normalized energy / spectral-entropy score; [boundary_norm]{.smallcaps} snaps that span to the nearest $\delta=c/2=200$ ms boundary; and [spec_smooth]{.smallcaps} applies a Hann taper of length $\omega=20$ ms at both ends. The default silence limit is $\tau=300$ ms, and the iteration cap $K=3$ in Algorithm 1 of the main paper is reached on $<2\%$ of clips during corpus construction.

The hierarchical event curation pipeline drives a chat LLM through three roles realized by the prompt templates in Figures `\ref{prompt:curation-p1}`{=latex}--`\ref{prompt:curation-p2}`{=latex}.
Stage 1 plans a coherent scenario from a bag of randomly matched audio annotations and emits 3--15 sub-events with role labels in $\{\texttt{foreground}, \texttt{background}, \texttt{ambient}\}$; Stage 2 refines each sub-event into a retrieval query and a generation fallback caption; the verifier adjudicates retrieval candidates and synthesized clips identically against four criteria (identity, cleanliness, duration fit, continuity), returning one of `accept`, `reprocess` (route back through TFJP), or `reject`. All calls run in JSON-mode decoding at temperature $0.7$.

## Streaming Training {#appendix:streaming-training}

A streaming sample carries two mutually exclusive supervision targets at every position: $y^{\text{stream}}$ supervises one $\langle\texttt{silent}\rangle$ or $\langle\texttt{response}\rangle$ token per chunk; $y^{\text{LM}}$ supervises the text tokens following each emitted $\langle\texttt{response}\rangle$. Audio-encoder positions and the instruction prefix are masked from both. The construction is formalized in Algorithm `\ref{alg:tokenize}`{=latex}.

Two failure modes diagnosed in §3.3 require dedicated supervision: insufficient context retention in long streams, and false triggering on incidental sounds. Both are addressed by a single agent-driven pipeline with two prompts (Figure `\ref{prompt:supervision}`{=latex}). The history-review prompt synthesizes follow-up questions whose answers strictly depend on a turn at least three rounds earlier; the silent-audio prompt audits whether a candidate non-speech segment warrants a response under the four trigger criteria of Proactive-Sound-Bench, with `borderline` clips discarded rather than mislabeled.
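The two supervision targets introduced at the start of this subsection can be sketched in a few lines, mirroring the structure of Algorithm `\ref{alg:tokenize}`{=latex}. This is a minimal illustration under stated assumptions: audio features are replaced by placeholder strings such as `<audio_1>`, the response timeline is passed as a hypothetical dictionary from 1-based chunk index to response tokens, and labels are stored at the position of the token they supervise (no next-token shift is applied).

```python
MASK = None  # marks positions excluded from the corresponding loss


def build_streaming_sample(instruction, num_chunks, timeline):
    """Return (tokens, y_stream, y_lm) for one streaming sample.

    instruction: list of instruction tokens (masked in both targets).
    num_chunks:  number of audio chunks in the stream.
    timeline:    dict {chunk index (1-based): list of response text tokens}.
    """
    tokens, y_stream, y_lm = [], [], []

    def emit(token, stream_label, lm_label):
        tokens.append(token)
        y_stream.append(stream_label)
        y_lm.append(lm_label)

    for tok in instruction:
        emit(tok, MASK, MASK)

    for t in range(1, num_chunks + 1):
        # Placeholder standing in for the encoder features of chunk t.
        emit(f"<audio_{t}>", MASK, MASK)
        if t in timeline:
            # Control token is supervised by the streaming target only.
            emit("<response>", "<response>", MASK)
            for w in timeline[t]:
                # Response text is supervised by the LM target only.
                emit(w, MASK, w)
            emit("<eos>", MASK, "<eos>")
        else:
            emit("<silent>", "<silent>", MASK)
    return tokens, y_stream, y_lm
```

Because every position receives a label from at most one of the two targets, the streaming loss and the LM loss never compete for the same token.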
The dual-loss objective $\mathcal{L}=\mathcal{L}_{\text{LM}}+\lambda\,\mathcal{L}_{\text{stream}}$ holds throughout the four-stage recipe; the recipe varies only the data composition and trainable modules across stages: Stage 1 unfreezes the LM head and the new-token embedding on offline single-turn data; Stage 2 trains the adapter only; Stage 3 jointly trains adapter and LM on the four core capabilities (ASR, S2TT, dialogue, audio understanding) of StreamAudio-2M; Stage 4 fine-tunes on interleaved multi-turn streams whose proactive insertions and history-review probes are introduced as Bernoulli mix-ins during the composition pass of §`\ref{appendix:curation}`{=latex}.

## Asynchronous FIFO Inference {#appendix:fifo-inference}

The FIFO scheduler runs the encoder and the decoder as two independent processes communicating through one queue $\mathcal{Q}$. The encoder is a pure producer: it consumes audio chunks at a fixed rate and atomically appends projected features to $\mathcal{Q}$, never blocking on decoder state. The decoder is gated on the type of its last emitted token $r^*$: when $r^*\!\in\!\{\langle\texttt{silent}\rangle, \langle\texttt{eos}\rangle\}$, the decoder is at an interruption point and drains $\mathcal{Q}$ atomically into its KV-cache before emitting one control token; when $r^*$ is a mid-response text token, the decoder issues a pure autoregressive step against the existing KV-cache without touching $\mathcal{Q}$. Drain-on-trigger (rather than pop-one-at-a-time) keeps the decoder's effective acoustic context aligned with wall-clock time after long responses and avoids spending decoder steps on stale silence decisions; this is the structural source of the $4.5\times$ first-frame latency reduction reported in §3.4. The schedule is formalized in Algorithm `\ref{alg:fifo}`{=latex}.
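The gating rule above can be illustrated with a small single-threaded simulation. This is a sketch under stated assumptions, not the deployed scheduler: in the real system the encoder and decoder run concurrently, whereas here `push` and `step` are called by hand; the class name `FifoScheduler`, the `decode_fn` callback, and the list used in place of a KV-cache are all hypothetical stand-ins.

```python
from collections import deque

# Tokens after which the decoder is at an interruption point.
INTERRUPTION_TOKENS = {"<silent>", "<eos>"}


class FifoScheduler:
    def __init__(self, decode_fn):
        self.queue = deque()          # features produced by the encoder
        self.cache = []               # stand-in for the decoder KV-cache
        self.last_token = "<silent>"  # decoder starts at an interruption point
        self.decode_fn = decode_fn    # hypothetical: cache -> next token

    def push(self, features):
        """Encoder side: enqueue and return at once, never waiting on the decoder."""
        self.queue.append(features)

    def step(self):
        """Decoder side: run one step; return the emitted token, or None if idle."""
        if self.last_token in INTERRUPTION_TOKENS:
            if not self.queue:
                return None  # nothing new has been heard yet
            # Drain-on-trigger: move everything queued into the cache at once.
            while self.queue:
                self.cache.append(self.queue.popleft())
        # Mid-response steps skip the drain and leave the queue untouched.
        self.last_token = self.decode_fn(self.cache)
        self.cache.append(self.last_token)
        return self.last_token
```

With a scripted `decode_fn`, chunks that arrive while a response is being generated accumulate in the queue and enter the cache together at the next `<eos>`, which is the behavior that keeps the acoustic context aligned with wall-clock time.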
## Dataset Curation Pipeline {#appendix:curation}

Text-form sources (MOSS, GammaCorpus, instruction chats) are converted into spoken form through a three-step chain: an LLM rewriter normalizes the text via the prompt in Figure `\ref{prompt:tts}`{=latex} (markdown stripping, numeral and abbreviation expansion, symbol replacement); CosyVoice renders the rewritten text with a voice $v$ sampled once per dialogue from a multi-voice pool $\mathcal{V}$; an ASR check rejects renderings whose transcript drifts beyond $\tau_{\text{wer}}=0.10$ from the rewritten reference, retrying up to $R=2$ times before discarding the entire instance (not just the failing turn) to preserve multi-turn coherence.

Validated event clips and the noise pool $\mathcal{N}=\textsc{MUSAN}\cup\textsc{WHAM!}\cup\textsc{DNS\text{-}Challenge}$ are then composed into a single long-form streaming waveform by Algorithm `\ref{alg:compose}`{=latex}. Foreground clips are concatenated sequentially with TFJP re-applied at every junction; background and ambient clips inherited from the scenario plan are mixed in at random offsets with role-dependent gain (foreground at $0$ dB, background at $-6$ dB, ambient at $-12$ dB); two independent noise tracks (one event-like, one ambient) are tiled across the full duration with crossfaded boundaries and mixed at SNRs sampled from $P_{\text{snr}}=\mathcal{U}(5, 20)$ dB, with the ambient track held $5$ dB quieter to match real recording conditions.

The output $(y, \mathcal{T})$ is exactly the input expected by Algorithm `\ref{alg:tokenize}`{=latex}: the waveform $y$ is split into $400$ ms chunks, encoded, and merged with the response timeline $\mathcal{T}$ to produce the $\langle X, y^{\text{stream}}, y^{\text{LM}}\rangle$ training tuple.
The same routine handles all seven task categories of StreamAudio-2M; tasks differ only in which positions of $\mathcal{T}$ carry a non-empty response (e.g., real-time ASR places one entry per incoming chunk, voice chatting one per user-turn boundary, proactive response only at safety-critical events).

\begin{algorithm}[H]
\caption{Streaming Sample Tokenization and Label Construction}
\label{alg:tokenize}
\begin{algorithmic}[1]
\Require instruction tokens $\mathcal{A}^{\text{ins}}$, audio chunks $a_{1:T}$, response timeline $\mathcal{T}=[(t_k, r_k)]_{k=1}^{K}$ sorted by $t_k$
\Ensure token sequence $X$, streaming target $y^{\text{stream}}$, LM target $y^{\text{LM}}$
\State $X, y^{\text{stream}}, y^{\text{LM}} \gets [\,],[\,],[\,]$
\State Append $\mathcal{A}^{\text{ins}}$ to $X$; extend labels with \textsc{Mask}
\State $k \gets 1$
\For{$t = 1$ to $T$}
  \State Append encoder features of $a_t$ to $X$; extend labels with \textsc{Mask}
  \If{$k \leq K \land t_k = t$} \Comment{response triggers at chunk $t$}
    \State Append $\langle\texttt{response}\rangle$;\ $y^{\text{stream}}\!{+\!=}\!\langle\texttt{response}\rangle$,\ $y^{\text{LM}}\!{+\!=}\textsc{Mask}$
    \For{token $w$ in $r_k$}
      \State Append $w$;\ $y^{\text{stream}}\!{+\!=}\textsc{Mask}$,\ $y^{\text{LM}}\!{+\!=}w$
    \EndFor
    \State Append $\langle\texttt{eos}\rangle$;\ $y^{\text{stream}}\!{+\!=}\textsc{Mask}$,\ $y^{\text{LM}}\!{+\!=}\langle\texttt{eos}\rangle$
    \State $k \gets k+1$
  \Else \Comment{remain silent}
    \State Append $\langle\texttt{silent}\rangle$;\ $y^{\text{stream}}\!{+\!=}\!\langle\texttt{silent}\rangle$,\ $y^{\text{LM}}\!{+\!=}\textsc{Mask}$
  \EndIf
\EndFor
\State \Return $X, y^{\text{stream}}, y^{\text{LM}}$
\end{algorithmic}
\end{algorithm}

\newpage

\begin{algorithm}[H]
\caption{FIFO-Scheduled Asynchronous Streaming Inference}
\label{alg:fifo}
\begin{algorithmic}[1]
\Require audio stream $x_{1:\infty}$, encoder $f_{\text{enc}}$, decoder $f_{\text{dec}}$
\State \textbf{shared:} queue $\mathcal{Q}\!\gets\![\,]$;\ last token $r^*\!\gets\!\langle\texttt{silent}\rangle$;\ KV-cache $\mathcal{C}\!\gets\!\varnothing$
\State spawn \textsc{EncoderLoop} and \textsc{DecoderLoop} concurrently
\Statex
\Procedure{EncoderLoop}{} \Comment{producer; never blocks}
  \For{each arriving chunk $x_t$}
    \State $a_t \gets f_{\text{enc}}(x_t)$;\quad \textbf{atomic:} $\mathcal{Q}.\textsc{append}(a_t)$
  \EndFor
\EndProcedure
\Statex
\Procedure{DecoderLoop}{} \Comment{event-driven consumer}
  \Loop
    \If{$r^* \in \{\langle\texttt{silent}\rangle, \langle\texttt{eos}\rangle\}$}
      \State \textbf{wait until} $\mathcal{Q} \neq \varnothing$ \Comment{idle if queue empty}
      \State \textbf{atomic:} $\mathcal{F}\!\gets\!\mathcal{Q}.\textsc{flush}()$;\quad $\mathcal{C}\!\gets\!\textsc{Extend}(\mathcal{C},\mathcal{F})$
      \State $r^* \gets f_{\text{dec}}(\mathcal{C})$ \Comment{emit one control token}
    \Else \Comment{mid-response}
      \State $r^* \gets f_{\text{dec}}(\mathcal{C})$ \Comment{AR text step; queue untouched}
    \EndIf
    \State \textsc{Emit}($r^*$)
  \EndLoop
\EndProcedure
\end{algorithmic}
\end{algorithm}

\begin{algorithm}[H]
\caption{Dual-Track Streaming Sequence Composition}
\label{alg:compose}
\begin{algorithmic}[1]
\Require ordered event list $E\!=\![(w_i, \rho_i, d_i, r_i)]_{i=1}^{|E|}$ (waveform, role, duration, response or $\varnothing$); noise pool $\mathcal{N}\!=\!\mathcal{N}_{\text{evt}}\uplus\mathcal{N}_{\text{amb}}$; chunk size $c$, fade window $\omega$, TFJP $\Phi$, SNR distribution $P_{\text{snr}}$
\Ensure stream waveform $y$, response timeline $\mathcal{T}$
\State $y_{\text{main}}\!\gets\!\varnothing$;\quad $\mathcal{T}\!\gets\![\,]$
\For{$i = 1$ to $|E|$}
  \State $w_i \gets \Phi(w_i)$ \Comment{re-apply TFJP at clip boundary}
  \If{$\rho_i = \texttt{foreground}$}
    \State $\textit{offset} \gets \textsc{Length}(y_{\text{main}})$;\quad $y_{\text{main}} \gets \textsc{Concat}(y_{\text{main}},\textsc{Fade}(w_i,\omega))$
    \If{$r_i \neq \varnothing$}
      \State $\mathcal{T}.\textsc{append}\big(\big(\lceil(\textit{offset}+d_i)/c\rceil,\ r_i\big)\big)$
    \EndIf
  \Else \Comment{$\rho_i \in \{\texttt{bg},\texttt{amb}\}$}
    \State $\textsc{MixIn}(y_{\text{main}},\,w_i,\, \text{rand offset},\,\textsc{RoleGain}(\rho_i))$
  \EndIf
\EndFor
\State $D \gets \textsc{Length}(y_{\text{main}})$
\State $y^{(1)}\!\gets\!\textsc{TileCrossfade}(\textsc{Sample}(\mathcal{N}_{\text{evt}}),\,D)$;\quad $y^{(2)}\!\gets\!\textsc{TileCrossfade}(\textsc{Sample}(\mathcal{N}_{\text{amb}}),\,D)$
\State $\sigma_1 \sim P_{\text{snr}}$;\quad $\sigma_2 \sim P_{\text{snr}}+5\,\text{dB}$ \Comment{ambient held quieter}
\State $y \gets y_{\text{main}} + \textsc{Scale}(y^{(1)},\sigma_1) + \textsc{Scale}(y^{(2)},\sigma_2)$
\State \Return $y,\,\mathcal{T}$
\end{algorithmic}
\end{algorithm}

\newpage
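As a rough illustration of the last steps of Algorithm `\ref{alg:compose}`{=latex}, the sketch below shows one way the \textsc{Scale} operation and the response-chunk index could be realized. It rests on assumptions that the algorithm does not spell out: SNR is measured as the RMS ratio between the main track and each noise track, offsets and durations are given in milliseconds, and the noise tracks are non-silent. All function names are hypothetical.

```python
import math
import random


def rms(x):
    """Root-mean-square amplitude of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def scale_to_snr(signal, noise, snr_db):
    """Rescale noise so that 20*log10(rms(signal)/rms(noise)) equals snr_db."""
    target_rms = rms(signal) / (10 ** (snr_db / 20))
    gain = target_rms / rms(noise)  # assumes the noise track is not silent
    return [gain * v for v in noise]


def response_chunk(offset_ms, duration_ms, chunk_ms=400):
    """Index of the first chunk boundary at or after the end of the event."""
    return math.ceil((offset_ms + duration_ms) / chunk_ms)


def mix_dual_track(main, event_noise, ambient_noise, rng):
    """Mix two noise tracks into the main track at independently sampled SNRs."""
    snr_event = rng.uniform(5, 20)
    snr_ambient = rng.uniform(5, 20) + 5  # higher SNR, i.e. ambient held quieter
    n1 = scale_to_snr(main, event_noise, snr_event)
    n2 = scale_to_snr(main, ambient_noise, snr_ambient)
    mixed = [m + a + b for m, a, b in zip(main, n1, n2)]
    return mixed, (snr_event, snr_ambient)
```

Adding $5$ dB to the ambient SNR lowers the ambient track's level relative to the main track, which matches the "ambient held quieter" comment in the algorithm.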

Prompt template for hierarchical event curation, Part 1: scenario planning followed by event refinement. Both calls run in JSON-mode decoding at temperature 0.7.

Prompt template for hierarchical event curation, Part 2: clip grounding verification, applied identically to retrieved and synthesized clips so the two paths share one acceptance criterion.

\newpage

Prompt template for comprehension-aware supervision: history-review question generation (Prompt A) and silent-audio verification (Prompt B). Both run on the same chat LLM in JSON-mode decoding; borderline clips from Prompt B are discarded rather than mislabeled.

Prompt template for the spoken-style rewriter applied to text-form supervision sources (MOSS, GammaCorpus, instruction chats) prior to CosyVoice rendering. The WER round-trip via downstream ASR constrains how aggressively the rewriter may paraphrase.

\newpage

# StreamAudio-2M Dataset Sources {#app:dataset}

StreamAudio-2M is assembled from a diverse pool of publicly available corpora, each selected to fill a distinct capability slot in the streaming regime. We deliberately favor well-established sources over scraped or proprietary collections, both for reproducibility and because the streaming pipeline already introduces substantial transformation on top of each upstream signal. Table `\ref{tab:app:sources}`{=latex} summarizes the role and quantitative contribution of every source; we walk through them by capability family below, with an emphasis on *how* each source is repurposed, since most are not used in the form their original release intended.

\small \setlength{\tabcolsep}{4pt} \renewcommand{\arraystretch}{1.05}

| **Source** | **Family** | **Role in StreamAudio-2M** | **Items** | **Hours** |
|:-----------------------|:---------------|:-------------------------------------------------|----------:|----------:|
| CommonVoice | Speech | Streaming ASR supervision (multilingual) | 62,354 | 120 |
| GigaSpeech | Speech | Streaming ASR supervision (in-the-wild) | 86,740 | 170 |
| LibriSpeech | Speech | Streaming ASR supervision (read speech) | 81,647 | 160 |
| VoxPopuli | Speech | Streaming ASR supervision (parliamentary) | 39,746 | 80 |
| CoVoST 2 (En$\to$CN) | Speech | Speech translation & simultaneous interpretation | 198,942 | 390 |
| CoVoST 2 (CN$\to$En) | Speech | Speech translation & simultaneous interpretation | 16,826 | 35 |
| AISHELL | Speech | Mandarin ASR / translation supervision | 141,246 | 280 |
| FMA (Open) | Audio | Open-ended music understanding prompts | 33,154 | 150 |
| FMA (Choice) | Audio | Multiple-choice music understanding prompts | 42,347 | |
| AudioSet (Open) | Audio | Open-ended audio-QA grounding events | 171,030 | 820 |
| AudioSet (Choice) | Audio | Multiple-choice audio-QA reasoning prompts | 135,753 | |
| AudioSet (Description) | Audio | Audio captioning & scene description | 99,946 | |
| MOSS | Speech | Spoken-dialogue supervision (TTS-rendered) | 392,198 | 4,900 |
| GammaCorpus-Fact-QA | Speech | Factual spoken-QA supervision (TTS-rendered) | 147,253 | 1,840 |
| AudioSet (events) | Acoustic event | Real foreground events for streams | 27,491 | 160 |
| AudioX | Acoustic event | Synthesized rare-event clips | 94,503 | |
| ElevenLabs | Acoustic event | Synthesized targeted sound effects | 48,927 | |
| MUSAN | Noise | Music, speech and ambient background | 1,896 | 620 |
| WHAM! | Noise | Real-world reverberant scenes | 13,425 | |
| DNS Challenge | Noise | Diverse environmental conditions | 14,328 | |
| UltraChat | Auxiliary | Text-only instruction following (multi-turn) | 156,732 | -- |
| Magpie-Pro | Auxiliary | Text-only instruction following (self-aligned) | 167,324 | -- |
| DU-QA | Auxiliary | Text-only domain-understanding QA | 14,308 | -- |
| COIG-CQIA | Auxiliary | Chinese instruction following | 34,274 | -- |
| Web-QA | Auxiliary | Open-domain web question answering | 5,892 | -- |
| BellGroup | Auxiliary | Chinese conversational instructions | 108,173 | -- |

: Source corpora used to construct StreamAudio-2M. Items denote the number of upstream instances drawn from each source before streaming composition; Hours denote the corresponding raw audio duration. Text-only auxiliary sources carry no audio and are marked "--" under Hours. {#tab:app:sources}

#### Speech-centric sources.

The speech-centric portion of StreamAudio-2M underlies four offline capabilities the streaming model must inherit from conventional LALMs: spoken dialogue, streaming ASR, speech-to-text translation, and audio question answering. **MOSS** contributes the largest single block of dialogue supervision; we render its 392k text-form multi-turn instances into 4,900 hours of speech with multi-voice CosyVoice. **LibriSpeech** [@panayotov2015librispeech], originally an utterance-level recognition corpus, is re-segmented at the 400 ms chunk granularity used by [Audio-Interaction]{.smallcaps} so that ASR supervision can be delivered *during* the listening phase rather than at utterance end.
**CoVoST 2** [@wang2021covost] provides 216k bidirectional English--Chinese speech-translation pairs, which we use both in their native offline form and in stitched form, where a continuous source stream is paired with an interleaved translation timeline to supervise simultaneous interpretation.

#### Acoustic event sources.

The streaming setting differs from offline LALM training in that it requires not only foreground events that warrant a response, but also a *long tail* of rare and context-specific events whose absence would force the model to over-trigger on the most common categories. We therefore combine real and synthetic event sources. **AudioSet** contributes the bulk of real recorded events, drawn evenly across its ontology to discourage the head-class bias common in event-classification setups. Where AudioSet coverage is sparse for a target ontology node (typically rare safety-critical sounds such as glass shattering or specific alarm patterns), we synthesize replacement clips with the audio generator **AudioX** [@tian2025audiox] and the sound-effect generator **ElevenLabs**; in both cases the synthesized clip passes through the verification stage before it is admitted to the corpus. Synthetic and real events together total 171k clips spanning the full Proactive-Sound-Bench taxonomy, ensuring that every category the model is later evaluated on is also represented during training.

#### Noise sources.

Background noise is overlaid on every long-form stream as a dual-track condition during sequence concatenation. This reflects two properties of the deployment setting that offline LALM corpora typically ignore: real acoustic environments are seldom silent between events of interest, and the model must learn to suppress responses to non-foreground sound regardless of its loudness.
We draw from three established noise corpora to cover complementary acoustic conditions: **MUSAN** [@snyder2015musan] for music, ambient and speech-babble noise; **WHAM!** [@wichern2019wham] for real-recorded urban and reverberant scenes; and the **DNS Challenge** corpus for diverse environmental conditions. Together they contribute 620 hours of background that is mixed at a controlled SNR distribution rather than concatenated as standalone events.

\newpage

# Proactive-Sound-Bench {#app:bench}

## Task Definition {#app:bench:task}

We define Proactive-Sound-Bench as an audio-triggered proactive response task. Given an audio input $x$, the model must perform two tasks simultaneously: (i) deciding whether to trigger a response, and (ii) generating a natural-language response when triggered.

For the first task, we delineate the boundary of when the model should respond as follows: the model is required to respond proactively upon detecting sudden human physiological illness or discomfort, severe weather, potential equipment damage, or hazardous environmental signals. In all other cases, including normal human physiological sounds, routine equipment operation, and similar signals, the model should remain silent and refrain from disturbing the user. For the second task, the model's responses should incorporate reminders, warnings, suggestions, or first-aid assistance, and they must carry sufficient information density. For instance, when a sudden human illness is detected, the model ought to provide the corresponding first-aid instructions rather than merely posing insubstantial questions such as "Are you okay?".

The goal of Proactive-Sound-Bench differs from two common audio tasks in both *optimization objective* and *output space*. **Sound Event Detection (SED)** emphasizes detecting predefined acoustic events and localizing them in time; outputs are typically frame-level labels or temporal boundaries. **Audio captioning** tends to produce *neutral descriptive* text about what is heard.
Both lines largely probe perception and recognition of acoustic content. By contrast, our benchmark jointly evaluates **whether to respond** and **what to say after triggering**, and uses a reference answer set with semantic matching thresholds to characterize the diversity and usefulness of acceptable replies. In this sense, ProactiveSound-Bench builds upon audio perception and further stresses *understanding acoustic events in context*: beyond robust acoustic sensing, models must disambiguate similar sounds across contexts and turn such understanding into appropriate interaction decisions.

## Categories and Coverage {#app:bench:categories}

#### Taxonomy rationale.

The macro-level taxonomy of ProactiveSound-Bench is designed to broadly cover acoustic scenarios that assistant devices may encounter in everyday life. We construct it by progressively partitioning sounds according to how strongly they originate from the human body versus non-physiological sources. First, we separate cues that arise *directly from humans* from those that do not; the former are grouped into **Human Sound Signals**, emphasizing "human-in-the-loop" acoustics such as crying, breathing- and ingestion-related cues, salient emotional vocalizations, body-motion sounds, and crowd-like ambience, while excluding text-based user queries as task inputs. Second, we include contexts that are strongly tied to human activity yet are not primarily human physiological productions: these typically correspond to object handling and domestic routines in living spaces, captured by **Daily Living Sounds** to characterize passive "doing-things-at-home" acoustics and their decision boundaries. Third, we cover scenarios that are comparatively weakly tied to the human subject and are dominated by environmental processes or engineered systems: outdoor/natural dynamics are grouped under **Nature & Environment**, electromechanical devices and tools under **Equipment**, and roadway/vehicle-dominated listening conditions under **Traffic**; together these cover most everyday "environment--device--traffic" sound regimes. Finally, we add **Music**, which focuses on *instrument-playing* related acoustic events and includes both nominally normal performances and severely out-of-tune corruptions caused by instrument damage.
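The partitioning order described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the `SoundCue` flags are hypothetical annotations rather than fields of the benchmark, and checking `instrument_playing` before `domestic_activity` is our own assumption, since the text does not specify a precedence between Music and Daily Living Sounds.

```python
from dataclasses import dataclass
from enum import Enum


class MacroDomain(Enum):
    HUMAN = "Human Sound Signals"
    DAILY_LIVING = "Daily Living Sounds"
    ENVIRONMENT = "Nature & Environment"
    EQUIPMENT = "Equipment"
    TRAFFIC = "Traffic"
    MUSIC = "Music"


@dataclass(frozen=True)
class SoundCue:
    """Hypothetical per-clip annotation; these flags are not benchmark fields."""
    from_human_body: bool = False     # e.g. crying, breathing, emotional vocalization
    instrument_playing: bool = False  # normal or out-of-tune instrument performance
    domestic_activity: bool = False   # object handling, routines in living spaces
    dominant_source: str = "nature"   # "nature", "device", or "roadway"


# Third partition step: sounds only weakly tied to the human subject.
_WEAKLY_HUMAN = {
    "nature": MacroDomain.ENVIRONMENT,
    "device": MacroDomain.EQUIPMENT,
    "roadway": MacroDomain.TRAFFIC,
}


def macro_domain(cue: SoundCue) -> MacroDomain:
    """Assign a macro domain by progressively weaker ties to the human body."""
    if cue.from_human_body:        # step 1: produced directly by humans
        return MacroDomain.HUMAN
    if cue.instrument_playing:     # assumed precedence over domestic activity
        return MacroDomain.MUSIC
    if cue.domestic_activity:      # step 2: tied to human activity at home
        return MacroDomain.DAILY_LIVING
    return _WEAKLY_HUMAN[cue.dominant_source]  # step 3: environment, device, traffic


print(macro_domain(SoundCue(from_human_body=True)).value)      # Human Sound Signals
print(macro_domain(SoundCue(dominant_source="roadway")).value)  # Traffic
```

The ordered checks mirror the "first, second, third, finally" structure of the rationale; a real annotation pipeline would assign meso subdomains as well, which this sketch omits.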
\small \setlength{\tabcolsep}{6pt} \renewcommand{\arraystretch}{1.3}

| **Meso subdomain** | **Macro domain** | **Definition** |
|:--|:--|:--|
| Body Movements | Human | Acoustics associated with exercise and injury. |
| Physiological States | Human | Auditory information associated with normal bodily functions or acute physiological stress. |
| Emotion Expression | Human | Significant affective vocalizations and expressive non-verbal signals. |
| Collective Ambience | Human | The dominant background environment in which a crowd participates in an activity. |
| Personal Care | Daily Living | Domestic self-care workflows in private living spaces. |
| Daily Affairs | Daily Living | Routine indoor micro-interactions with furniture, handheld objects and dynamic surfaces. |
| Housekeeping | Daily Living | Cleaning- and tidying-centric domestic workflows dominated by repetitive surface interactions and maintenance motions. |
| House Equipment | Equipment | Operating status of household electromechanical systems and appliances. |
| Industrial Tools | Equipment | Tooling and industrial machinery acoustics associated with powered operation and higher-energy mechanical transients. |
| Vehicle | Traffic | Acoustic signals of vehicle mechanical systems. |
| Traffic | Traffic | Intermittent warning signals in urban road soundscapes. |
| Large Traffic | Traffic | Mass-transit and heavy-vehicle dominated contexts characterized by periodic rail/bogie rhythm and large chassis resonance. |
| Meteorology | Environment | Weather-driven airborne and precipitation acoustics spanning calm atmospheric textures to highly dynamic storm processes. |
| Geological Hazards | Environment | Impact sounds generated by terrain dynamics that indicate slope instability, rockfalls, or geological movements. |
| Ecological Context | Environment | Biotic outdoor cues attributable to animals/plants/ecosystem activity. |
| Social Places | Environment | Human-occupied ambient soundscapes in social/public spaces. |
| Artistic | Music | Instrument-forward performance acoustics. |

: Meso-level category definitions for ProactiveSound-Bench (conceptual scope only; exemplars are reported separately). {#tab_meso_definitions}

\newpage

# Experimental Details {#appendix:hyperparams}

Table `\ref{tab:hyperparams}`{=latex} reports all method-, data-, and optimization-level hyperparameters held fixed across the four-stage training recipe of §`\ref{appendix:streaming-training}`{=latex}. Method-level constants ($c$, $\omega$, $\delta$, $\lambda$) follow the design choices identified by the ablations in §5.4; data-level constants ($\tau_{\text{wer}}$, $R$, SNR, role gains, Stage 4 mix probabilities) follow the values introduced in §`\ref{appendix:curation}`{=latex}.
Optimization hyperparameters vary per stage to match each stage's data scale and trainable-parameter footprint: the streaming SFT stage receives the largest step budget, while the instruction-following stage uses the lowest learning rate to preserve previously acquired capabilities. All training is conducted in bf16 mixed precision with gradient checkpointing and DeepSpeed ZeRO-2 sharding on $32\!\times\!\textsc{NVIDIA H100}$ $80$ GB GPUs.

\footnotesize \setlength{\tabcolsep}{4pt} \renewcommand{\arraystretch}{1.05}

| **Configurations** | **Parameters** | **Stage 1** | **Stage 2** | **Stage 3** | **Stage 4** |
|:--|:--|:--|:--|:--|:--|
| Streaming | chunk size $c$ | $400$ ms (all stages) | | | |
| | fade window $\omega$ | $20$ ms (all stages) | | | |
| | half-chunk align $\delta$ | $200$ ms (all stages) | | | |
| | dual-loss weight $\lambda$ | $1.0$ (all stages) | | | |
| | max stream length $L_{\max}$ | $60$ chunks ($24$ s) (all stages) | | | |
| Data | WER threshold $\tau_{\text{wer}}$ | $0.10$ (all stages) | | | |
| | ASR retries $R$ | $2$ (all stages) | | | |
| | SNR distribution $P_{\text{snr}}$ | $\mathcal{U}(5,\,20)$ dB (all stages) | | | |
| | role gain (fg / bg / amb) | $0$ / $-6$ / $-12$ dB (all stages) | | | |
| | history-review prob $p_{\text{hr}}$ | --- | --- | --- | $0.30$ |
| | silent mix prob $p_{\text{sil}}$ | --- | --- | --- | $0.40$ |
| | proactive mix prob $p_{\text{pro}}$ | --- | --- | --- | $0.30$ |
| Training | trainable modules | LM head + emb. | adapter | adapter + LM | adapter + LM |
| | batch size (per GPU) | $8$ | $8$ | $4$ | $2$ |
| | gradient accum. steps | $2$ | $4$ | $8$ | $16$ |
| | effective batch size | $512$ | $1024$ | $1024$ | $1024$ |
| | learning rate | $1\!\times\!10^{-4}$ | $1\!\times\!10^{-4}$ | $5\!\times\!10^{-5}$ | $1\!\times\!10^{-5}$ |
| | training steps | $5$ k | $20$ k | $80$ k | $15$ k |
| | warmup ratio | $0.03$ | $0.03$ | $0.03$ | $0.03$ |
| | optimizer | AdamW ($\beta_1\!=\!0.9$, $\beta_2\!=\!0.95$, $\varepsilon\!=\!10^{-8}$) (all stages) | | | |
| | scheduler | Cosine decay with linear warmup (all stages) | | | |
| | weight decay | $0.01$ (all stages) | | | |
| | max grad norm | $1.0$ (all stages) | | | |
| Hardware | GPUs | $32\!\times\!\textsc{NVIDIA H100}$ $80$ GB (all stages) | | | |
| | precision & sharding | bf16 mixed precision, DeepSpeed ZeRO-2 (all stages) | | | |
| | total wall-clock time | $\sim 10$ days (all stages) | | | |

: Configurations of parameters in Audio-Interaction. {#tab:hyperparams}

\newpage

# Full Related Work

#### Streaming Audio Models.

In the streaming setting there is no single unified model. Instead, each task is handled by a dedicated family of models that specializes in a particular function. Representative examples include streaming speech recognition [@gao2022paramformer], streaming speech translation [@seamless], and full-duplex spoken dialogue, which has become an important and rapidly developing direction [@lslm; @miniomni2; @silent; @chronological]. DuplexSLA [@duplexsla] further adds action to duplex models. Audio-Interaction shares several characteristics with this last class of models. It operates over fixed-size audio chunks, ingesting acoustic frames sequentially and deciding, on the basis of acoustic and semantic cues, whether and when to intervene, as exemplified by Moshi [@defossez2024moshi]. The decision required in Audio-Interaction, however, is substantially more complex. Beyond local acoustic and semantic signals, it must additionally reason over full-audio understanding, environmental sounds, paralinguistic information, and explicit user instructions, which together make the intervention policy far richer than that of prior streaming systems.

#### Audio Large Models.
Audio large models represent a milestone toward a single unified model that can perform general audio-based tasks [@chu2024qwen2; @qwen25omni2025; @diffa2; @stepaudio2]. This unification has given rise to a broad spectrum of capabilities, such as speech understanding [@sakshi2024mmau] and spoken-dialogue understanding [@mmsu]. Serving as a general-purpose foundation, these models have been further extended to a wide range of downstream tasks, including speech recognition [@xu2025fireredasr; @seed-asr; @qwen3-asr; @megaasr], emotion understanding [@emotionthinker], and audio reasoning [@afcot; @thinkwith; @audiocog; @audio-reasoner]. Despite this progress, current audio large models remain exclusively offline. None of them offers a unified model that can understand sound and the surrounding environment while executing instructions in real time, and closing this gap is precisely the motivation behind our work.

#### Streaming AI Systems.

Artificial general intelligence cannot remain permanently behind the screen. To be genuinely useful it must move to the foreground and interact with humans directly, which motivates the development of streaming models and systems. In the visual domain, this line of research has produced continuous, online video understanding that processes incoming frames as they arrive [@chen2024videollm; @li2025videochat]. A more readily deployable alternative is the cascaded AI system, such as proactive agents [@proactive; @proagent; @pask; @needllm], which place the text modality at the center of processing and coordinate several specialized components. In contrast to these designs, our work aims to open a new paradigm by realizing this capability within a single end-to-end model.

\newpage

# Error Analyses {#app:analysis}

`\label{app:analysis:err:breakdown}`{=latex}

- **LibriSpeech (ASR).** An error analysis of the 98 non-empty, non-crash predictions on LibriSpeech identifies four primary error categories.
Local Token Deviation, which groups phonetically or orthographically motivated substitutions together with minor insertions and deletions, constitutes the largest error class, accounting for 60.2% of all analyzed errors. Rare-Word & Long-Utterance Degradation forms the second major category (21.4%), characterized by the misrecognition of named entities and structural breakdown in syntactically complex sentences; literary character names and extended utterances prove particularly challenging. Function Word Bias (14.3%) and Decoding Loop phenomena (4.1%) appear at lower frequencies: the former arises from language model preferences for certain function words, and the latter manifests as phrase-level repetition. Overall, these error patterns underscore targeted opportunities for improvement, while the model's strong baseline accuracy remains competitive with other approaches of comparable scale.
- **CoVoST2 (Speech-to-Text Translation).** We first examine the low-BLEU translations (BLEU \< 20) produced by our S2TT model on the CoVoST2 English-to-Chinese test set and categorize the errors into two main types. Semantic hallucinations, where the model generates a translation completely unrelated to the source audio, dominate the low-score set, accounting for 82% of the cases. The remaining 18% are incomplete or mixed-language outputs that contain untranslated English fragments, garbled symbols, or broken phrases, failing to form a coherent Chinese sentence. We then conduct an error analysis on the lowest-BLEU sentences in the zh→en CoVoST2 subset. Low-score cases fall into two dominant categories: off-topic or hallucinated translations likely caused by severe recognition/misalignment failures, accounting for 75.5% of errors; and omissions or uncontrolled paraphrasing that preserve partial meaning but break n-gram overlap, accounting for 24.5%.
- **MMAU (Audio Understanding).** The error analysis of our model's MMAU results uncovers two primary failure categories. Approximately 20% arise from generation collapse, characterized by unparseable outputs that prevent any valid assessment. The remaining errors are genuine recognition or reasoning errors, in which the model confused acoustically similar sources, misclassified speaker attributes such as age or gender, or selected an incorrect category despite partially correct reasoning.
- **SpokenQA (Llama Questions & Web Questions).** After excluding empty predictions (35 instances) and correct responses that were erroneously flagged as errors due to overly strict evaluation formatting, LlamaQA's valid predictions contained a total of 37 actual model errors. These errors can be categorized into three types. Factual Hallucinations (56.8%) were the most prominent, manifesting as the fabrication of non-existent names of people, places, or events, accompanied by fluent descriptions. Temporal and Quantitative Errors (16.2%) involved providing incorrect specific figures or values in response to questions requiring precise numerical data. Irrelevant or Generalized Responses (27%) substituted direct answers with poetic, vacuous, or evasive language. The errors observed on the WebQuestions dataset can likewise be categorized into three main types. Factual hallucinations constitute the largest share (approximately 71%): the model fabricates factual content that appears plausible yet is entirely unrelated to the correct answer and lacks any external knowledge support. Irrelevant or generalized responses account for roughly 15%; this occurs when the output fails to provide the direct information requested by the query, instead offering roundabout replies characterized by hollow, flippant, or evasive language.
Errors regarding time and quantity make up approximately 15%, reflecting the model's tendency to provide incorrect specific values when addressing questions involving particular years, dates, time zones, or numerical figures.
- **VoiceBench (AlpacaEval-full & SD-QA).** On VoiceBench's AlpacaEval subset, we categorize the low-score samples into three types. (1) Hallucination (53.5%): the model generates factually incorrect statements that contradict established knowledge, including fabricated entities, misattributed events, or erroneous numbers. (2) Irrelevant response or inappropriate refusal (46.4%): the model produces content unrelated to the prompt or rejects a harmless request, often due to keyword misinterpretation or over-triggered safety filters. The incorrect answers in the SD-QA subset exhibit three primary failure modes. Factual hallucination accounts for roughly 63% of the errors, where the model confidently generates false details. Irrelevant or miscomprehending responses constitute about 24%, where the question is misheard and an off-topic answer is given. The remaining 13% are over-refusals, in which innocuous factual queries are wrongly rejected as sensitive.
- **ProactiveSound-Bench.** Among the errors, false positives (59.8%) were dominated by overreactions to benign daily sounds such as tearing paper, appliance noises, drinking, or sighs, which generated unnecessary alerts. Conversely, false negatives (40.2%) clustered in safety-critical domains such as traffic alarms and natural hazards.

\newpage

\bibliographystyle{plainnat}