Audio Interaction Model

Source

Raw Markdown: paper_audio-interaction-model-2026.md
PDF: paper_audio-interaction-model-2026.pdf
Preprint: arXiv 2606.05121
Official project page: Audio Interaction Model
Official code: xzf-thu/Audio-Interaction
Official Hugging Face model: zhifeixie/AudioInteraction
Official Hugging Face dataset: zhifeixie/StreamAudio-2M

Credibility

This is a fresh arXiv preprint submitted on 2026-06-03 by authors from NTU, NUS, and CUHK. The arXiv page labels it as “work in progress.” Credibility is strengthened by public code, model weights, dataset artifacts, a project page, and a detailed appendix, but the work is not yet peer-reviewed and should not be treated as settled SOTA.

After review, this source is demoted to context-level evidence for the TSFM agenda. It is useful as a cautionary streaming-audio example, but much of its gain comes from heavy data construction, silence/noise preprocessing, synthetic stream composition, and scheduling heuristics rather than a generally reusable long-context state mechanism.

Core Claim

The paper formalizes an Audio Interaction Model regime and implements it as Audio-Interaction: a unified streaming audio-language model that consumes audio chunks, decides whether to remain silent or respond, and generates text when a response is warranted.

The paper’s central interface is:

$$(d_{t}, r_{t}) = f(a_{\leq t}, d_{<t}, r_{<t})$$

where $a_{t}$ is the current audio chunk, $d_{t}$ is the streaming intervention decision, and $r_{t}$ is the generated response.
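The interface above can be read as a per-chunk step function over accumulated history. The following is a minimal sketch of that reading, not the paper's implementation: `StreamState`, `step`, and `toy_policy` are hypothetical names, and the string labels stand in for real audio features.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

SILENT, RESPONSE = "<silent>", "<response>"


@dataclass
class StreamState:
    audio: List[str] = field(default_factory=list)                # a_{<=t}
    decisions: List[str] = field(default_factory=list)            # d_{<t}
    responses: List[Optional[str]] = field(default_factory=list)  # r_{<t}


Policy = Callable[[StreamState], Tuple[str, Optional[str]]]


def step(state: StreamState, chunk: str, f: Policy) -> Tuple[str, Optional[str]]:
    """Compute (d_t, r_t) from a_{<=t}, d_{<t}, r_{<t}, then record them."""
    state.audio.append(chunk)      # the history now includes the current chunk
    d_t, r_t = f(state)            # f sees only past decisions and responses
    state.decisions.append(d_t)
    state.responses.append(r_t)
    return d_t, r_t


def toy_policy(state: StreamState) -> Tuple[str, Optional[str]]:
    # Hypothetical rule: respond only when the newest chunk is labeled "alarm".
    if state.audio[-1] == "alarm":
        return RESPONSE, f"Alarm heard at chunk {len(state.audio) - 1}."
    return SILENT, None


state = StreamState()
outputs = [step(state, c, toy_policy) for c in ["hum", "hum", "alarm", "hum"]]
```

In this toy run only the third chunk produces a response; every other step records a silent decision with no text.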

Key Contributions

Frames audio interaction as an always-on perceive-decide-respond loop rather than offline clip answering or task-specific streaming ASR/dialogue.
Introduces Audio-Interaction, initialized from Qwen2.5-Omni-3B, with chunk-level silent/response control tokens.
Proposes SoundFlow, covering streaming-native data construction, comprehension-aware training, and FIFO asynchronous inference.
Constructs StreamAudio-2M, reported in the paper as a 2.6M-item, 302k-hour corpus across 7 capability categories and 28 sub-tasks.
Introduces Proactive-Sound-Bench, a 644-event benchmark for deciding when to proactively intervene or remain silent in audio streams.

Method Notes

Audio-Interaction uses 400 ms audio chunks. During streaming modeling, each chunk contributes a decision target such as <silent> or <response>; if the model emits <response>, it switches into autoregressive text generation.
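To make the chunking concrete, here is a small sketch of fixed 400 ms chunking and the decision rate it implies. The 16 kHz sample rate and the `chunk_samples` helper are assumptions for illustration; the source does not state the sample rate here.

```python
CHUNK_MS = 400            # chunk length stated in the text
SAMPLE_RATE_HZ = 16_000   # assumption: not given in this note


def chunk_samples(samples, sample_rate_hz=SAMPLE_RATE_HZ, chunk_ms=CHUNK_MS):
    """Split a sample list into full chunks; a trailing partial chunk is dropped."""
    size = sample_rate_hz * chunk_ms // 1000
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


one_second = [0.0] * SAMPLE_RATE_HZ
chunks = chunk_samples(one_second)

# One decision target per chunk means this many decisions per hour of audio.
decisions_per_hour = 3600 * 1000 // CHUNK_MS
```

Under these assumptions one second of audio yields two full 6,400-sample chunks, and an hour of streaming yields 9,000 silent-or-respond decisions.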

SoundFlow addresses three linked problems, though its answers to the first two are data-engineering-heavy rather than clean model-side solutions:

streaming data construction by composing long, coherent audio interactions from shorter clips, using time-frequency joint preprocessing, silence trimming, denoising, boundary smoothing, and hierarchical event curation;
training with history review and comprehension-aware silence examples to reduce context loss and false triggering;
asynchronous FIFO inference, where the encoder appends chunk features to a queue and the decoder drains queued features at interruption points.

The FIFO scheduler is the most transferable systems idea, but it should be read narrowly. It separates input ingestion from response generation so the system can keep wall-clock audio aligned after a long response; it does not solve unbounded audio-history growth, context eviction, or learned state compression.
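The decoupling described above can be sketched as a single-threaded simulation. This is an illustrative toy, not the released scheduler: `FifoScheduler`, `busy_ticks`, and the tick-based timing are assumptions, and the growing `context` list stands in for the decoder's history to show that nothing here bounds memory.

```python
from collections import deque


class FifoScheduler:
    def __init__(self):
        self.queue = deque()   # encoder output waiting for the decoder
        self.context = []      # stand-in for decoder history; grows without bound
        self.busy_ticks = 0    # chunk intervals the decoder still spends generating

    def ingest(self, feature):
        """Encoder side: always append, never wait for the decoder."""
        self.queue.append(feature)

    def decoder_tick(self, start_response_ticks=0):
        """Decoder side: drain the queue only at an interruption point."""
        if self.busy_ticks > 0:          # still generating: leave features queued
            self.busy_ticks -= 1
            return 0
        drained = len(self.queue)
        self.context.extend(self.queue)  # FIFO order is preserved
        self.queue.clear()
        self.busy_ticks = start_response_ticks
        return drained


sched = FifoScheduler()
drained = []
for t in range(5):
    sched.ingest(f"c{t}")
    # Hypothetical: a response that occupies the decoder for 2 ticks starts at t == 1.
    drained.append(sched.decoder_tick(start_response_ticks=2 if t == 1 else 0))
```

While the decoder is busy, chunks accumulate in the queue instead of being dropped, and they are flushed in order once generation ends. The context list ends up holding every chunk, which matches the narrow reading: ingestion stays aligned, but history still grows.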

Evidence And Results

The paper reports that Audio-Interaction preserves mainstream audio capability while adding streaming behavior:

MMAU audio-instruction average: 58.15 for Audio-Interaction-3B versus 42.51 for Qwen2.5-Omni-3B and 49.58 for Qwen2.5-Omni-7B in the reported table.
CoVoST2 speech-to-text translation BLEU: 55.22 en-zh and 35.21 zh-en, above the reported Qwen2.5-Omni baselines.
Proactive-Sound-Bench: 61.2 single-tier average and 62.8 multi-tier average, above all listed baselines in the paper’s proactive-response table.
FIFO ablation: average first-chunk latency is 392 ms with FIFO versus 831 ms without FIFO, and stall rate is 0.0% versus 5.2%.
Chunk-size ablation: 0.4 s chunks achieve lower latency at comparable accuracy; 0.2 s chunks hurt semantic context, while 0.6 s and 0.8 s chunks increase latency.
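A quick arithmetic check puts the reported numbers in proportion. The figures are the ones listed above; the helper name `relative_change` and the conversion to chunk units are illustrative.

```python
def relative_change(new, old):
    """Signed fractional change from old to new."""
    return (new - old) / old


# FIFO ablation: first-chunk latency in milliseconds.
fifo_latency_cut = -relative_change(392, 831)   # about 0.528, i.e. roughly 52.8% lower

# The same latencies expressed in 400 ms chunk units.
with_fifo_chunks = 392 / 400        # just under one chunk
without_fifo_chunks = 831 / 400     # slightly more than two chunks

# MMAU audio-instruction average: absolute gaps to the two baselines.
gain_vs_3b = round(58.15 - 42.51, 2)   # 15.64 points
gain_vs_7b = round(58.15 - 49.58, 2)   # 8.57 points
```

Read this way, the FIFO variant responds within about one chunk interval, while the non-FIFO variant lags by roughly two.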

The paper also reports that early decoder layers reconstruct cross-chunk continuity: the continuity ratio rises from 0.25 at encoder output to 0.80 at GPT Layer 0.

The appendix adds a small natural-recording sanity check rather than a broad deployment proof: about 2 hours of real recordings across travel, work, home, and commute scenarios show 58.9% trigger accuracy versus 62.0% on a matched synthetic split, first-chunk latency within about 25 ms of the synthetic measurement, and 0.91 silence-rate correlation in 2 s bins.
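One plausible way to compute a silence-rate correlation in 2 s bins is sketched below. The note does not say which two series the paper correlates or how it bins them, so the pairing of a reference and a predicted decision sequence, and the helper names, are assumptions.

```python
from math import sqrt

CHUNK_S = 0.4
BIN_S = 2.0
CHUNKS_PER_BIN = round(BIN_S / CHUNK_S)   # 5 chunks per 2 s bin


def silence_rate_per_bin(decisions, chunks_per_bin=CHUNKS_PER_BIN):
    """Fraction of <silent> decisions in each full bin; partial bins are dropped."""
    rates = []
    for i in range(0, len(decisions) - chunks_per_bin + 1, chunks_per_bin):
        window = decisions[i:i + chunks_per_bin]
        rates.append(sum(d == "<silent>" for d in window) / chunks_per_bin)
    return rates


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


S, R = "<silent>", "<response>"
reference = [S] * 5 + [R] * 2 + [S] * 3 + [R] * 5   # toy data, three bins
predicted = [S] * 4 + [R] * 4 + [S] * 2 + [R] * 5

ref_rates = silence_rate_per_bin(reference)
pred_rates = silence_rate_per_bin(predicted)
r = pearson(ref_rates, pred_rates)
```

A binned correlation like this measures whether silence and activity rise and fall together over time; it does not show that individual trigger decisions are correct, which is why the trigger-accuracy figure is reported separately.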

Demoted TSFM Read

This is a context-level real-time audio example for the wiki because it makes the serving contract explicit. The model is not merely “streaming” because it accepts partial input; it has a decision loop, a silence action, a response trigger, and latency/stall ablations. However, it should not be treated as a strong path for foundation time-series streaming state.

For foundation time-series work, the closest analogy is:

incoming observations/events -> retained latent state/context
  -> decide whether to stay silent, alert, ask, forecast, or act
  -> update state without replaying the whole history

The caveat is stronger than modality. Audio-Interaction works over audio, speech, and acoustic event streams, not numeric multivariate time series, telemetry graphs, or typed operational control inputs. More importantly, its long-stream behavior relies on curated synthetic stream construction, explicit silence/response labels, history-review supervision, and FIFO scheduling. Those are useful diagnostics and engineering baselines, but they are mostly workarounds for the target problem rather than reusable answers to retained latent state, memory pressure, or context eviction.
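For contrast, the time-series loop in the analogy above can be sketched with a constant-size state. This is purely illustrative and is not something the paper provides: `RunningState`, the exponentially weighted statistics, and the z-score threshold are hypothetical stand-ins for a learned retained state.

```python
from dataclasses import dataclass


@dataclass
class RunningState:
    mean: float = 0.0
    var: float = 1.0
    alpha: float = 0.1   # hypothetical smoothing factor

    def update(self, x: float) -> None:
        """O(1) exponentially weighted update; no past observations are stored."""
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


def decide(state: RunningState, x: float, z_alert: float = 4.0) -> str:
    """Choose an action from the retained state and the new observation only."""
    z = abs(x - state.mean) / (state.var ** 0.5)
    return "alert" if z > z_alert else "silent"


def run(observations):
    state = RunningState()
    actions = []
    for x in observations:
        actions.append(decide(state, x))   # decide before absorbing the new point
        state.update(x)                    # update state without replaying history
    return actions


actions = run([0.1, -0.2, 0.0, 0.1, 9.0, 0.0])
```

The state here is three floats regardless of stream length, which is the property the audio system does not demonstrate; a real TSFM would need a far richer state, but the contract is the same.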

Limitations

The arXiv version is explicitly “work in progress” and not peer-reviewed.
StreamAudio-2M is heavily constructed: the time-frequency joint preprocessing (TFJP) stage trims silence, estimates noise profiles, denoises clips, locates informative spans, aligns boundaries, and smooths joins, after which streams are composed from controlled foreground, background, ambient, and added-noise tracks.
History retention is trained through synthetic history-review probes; the paper does not introduce a general memory, summarization, eviction, or learned compression mechanism for growing audio history.
FIFO scheduling fixes encoder/decoder coordination and stale silence decisions, but it appends the flushed features to the KV cache; it is not a bounded-memory solution.
The released model produces text, not speech; a full voice assistant would need a separate TTS component.
The Hugging Face model card states that 0.4-second chunking creates a floor on response-onset latency.
Proactive-Sound-Bench error analysis reports false positives dominated by benign daily sounds and false negatives in safety-critical domains.
Spoken QA and translation errors still include factual hallucinations, irrelevant responses, and recognition or alignment failures.
StreamAudio-2M artifact metadata is not perfectly aligned across sources: the paper/project page report 2.6M items and 302k hours, the GitHub README also says about 2.6M items but 66.7k hours, and the live Hugging Face dataset card shows 381,177 rows and 757 GB. Treat dataset-size claims as release-dependent until a pinned manifest resolves the counts.
The model’s “intervention decision” is a response-trigger decision, not an evaluated physical or operational action. It should not be treated as an action-conditioned world model.
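The dataset-size mismatch noted in the list above is easier to judge as implied per-item durations. The sketch below uses only the counts quoted in this note; treating "about 2.6M items" as exactly 2,600,000 is an assumption.

```python
ITEMS = 2_600_000            # "2.6M items" (paper, project page, README), taken as exact
PAPER_HOURS = 302_000        # paper / project page
README_HOURS = 66_700        # GitHub README
HF_ROWS = 381_177            # live Hugging Face dataset card


def mean_seconds_per_item(hours, items):
    """Average item duration implied by a total-hours claim."""
    return hours * 3600 / items


paper_sec = mean_seconds_per_item(PAPER_HOURS, ITEMS)     # about 418 s per item
readme_sec = mean_seconds_per_item(README_HOURS, ITEMS)   # about 92 s per item
hours_ratio = PAPER_HOURS / README_HOURS                  # about 4.5x
hf_row_fraction = HF_ROWS / ITEMS                         # about 0.15 of the item count
```

The two hour totals imply average item lengths that differ by a factor of about 4.5 for the same item count, so they cannot describe the same release or the same counting convention.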

Foundation TSFM Relevance

Agenda slot	Verdict	Evidence	Missing pieces
Streaming state and long context	warning	Implements an always-on audio loop with chunked input and reports FIFO latency/stall ablations.	No bounded-memory state, no explicit eviction audit, no learned compression, and no numeric multivariate time-series evidence.
Context interface	weak analogy	Conditions decisions on accumulated audio context and explicit instructions; history-review training probes earlier turns.	Relies on synthetic probes and curated audio streams; no channel metadata, topology, exogenous variable schema, or operational context contract.
Event streams and proactive intervention	adjacent	Proactive-Sound-Bench evaluates whether the model should respond or remain silent under acoustic events.	Response triggering is not a typed action/control-input/intervention log with downstream outcome modeling.
Dynamic compute and serving contracts	limited	FIFO asynchronous inference and chunk-size ablations make wall-clock latency and stall rate part of the model contract.	FIFO is scheduling, not state compression; need matched serving tests for numeric TSFMs under channel count, sample rate, memory pressure, and state-refresh cost.
Benchmark level	warning	Proactive-Sound-Bench creates a streaming trigger benchmark, but error analysis shows false-positive and safety-critical false-negative patterns.	Need calibrated abstention, cost-sensitive intervention metrics, and real deployment traces.
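As one possible shape for the cost-sensitive intervention metric named in the last row, the sketch below scores trigger decisions with asymmetric costs. It is not a metric from the paper: the function name, the event tuple layout, and the cost weights are all hypothetical.

```python
def trigger_cost(events, fp_cost=1.0, fn_cost=1.0, critical_fn_cost=10.0):
    """Mean cost per event.

    Each event is (should_respond, did_respond, is_safety_critical).
    False positives and false negatives are charged separately, and a miss
    on a safety-critical event is charged more than an ordinary miss.
    """
    total = 0.0
    for should_respond, did_respond, critical in events:
        if did_respond and not should_respond:
            total += fp_cost                                   # over-triggering
        elif should_respond and not did_respond:
            total += critical_fn_cost if critical else fn_cost  # missed event
    return total / len(events)


events = [
    (False, True, False),   # benign sound, model responded: false positive
    (True, False, True),    # safety-critical event, model stayed silent
    (True, True, False),    # correct response
    (False, False, False),  # correct silence
]
cost = trigger_cost(events)   # (1 + 10 + 0 + 0) / 4 = 2.75
```

Unlike plain accuracy, which would score this toy run at 50%, the weighted cost makes the single safety-critical miss dominate the result.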

Links Into The Wiki

Open Questions

What is the time-series analogue of the silence/response control token: alert, abstain, ask for more context, forecast, propose action, or execute control input?
Can FIFO-style decoupling between ingestion and decoding help serving latency without pretending to solve long-history state retention?
What benchmark would penalize both over-triggering on benign events and missing rare safety-critical events in operational telemetry?
How should released dataset manifests be pinned so paper-level corpus counts and live artifact counts remain auditable?
