Wan-Streamer: End-to-End Real-Time Audio-Visual Foundation Model

Introduction: Beyond Cascaded Systems

Current multimodal AI systems are typically built as cascaded pipelines: separate VAD, ASR, LLM, TTS, and video modules stitched together. This design creates fundamental problems: accumulated latency, compounding errors, hand-crafted turn-taking rules, cross-modal asynchrony (such as lip-sync mismatch), and no true full-duplex capability.

Wan-Streamer v0.1 (arXiv, June 2026, Alibaba WAN team) introduces a single unified Transformer that models language, audio, and video as both input and output in one natively streaming framework — no external modules required.

Architecture: One Transformer, Block-Causal Attention

Wan-Streamer processes all modalities — text, audio, video — as tokens in a single interleaved causal sequence. Every token flows through the same Transformer layers; modality is conveyed through positional and type embeddings, not routing logic.
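As a rough illustration of what "one interleaved sequence with type embeddings" could look like, here is a minimal sketch. The type IDs, the per-unit ordering (text, then audio, then video), and the `Token` and `interleave` names are assumptions made for this example; the post does not specify the real token layout.

```python
from dataclasses import dataclass

# Hypothetical modality type IDs; the real embedding layout is not given here.
TYPE_IDS = {"text": 0, "audio": 1, "video": 2}


@dataclass(frozen=True)
class Token:
    modality: str    # "text", "audio", or "video"
    type_id: int     # index into a (hypothetical) type-embedding table
    unit: int        # streaming unit the token belongs to
    position: int    # absolute position in the interleaved sequence
    payload: object  # stand-in for a token id or latent vector


def interleave(units):
    """Flatten per-unit modality streams into one sequence.

    `units` is a list of dicts mapping modality name -> list of payloads.
    Units are emitted in order, so the sequence stays causal across units.
    """
    sequence = []
    for unit_idx, unit in enumerate(units):
        for modality in ("text", "audio", "video"):  # assumed ordering
            for payload in unit.get(modality, []):
                sequence.append(
                    Token(modality, TYPE_IDS[modality], unit_idx,
                          len(sequence), payload)
                )
    return sequence


units = [
    {"text": ["hi"], "audio": ["a0", "a1"], "video": ["v0"]},
    {"audio": ["a2", "a3"], "video": ["v1"]},
]
seq = interleave(units)
print([(t.unit, t.modality) for t in seq])
```

Every token carries the same fields regardless of modality, which is the point: a single stack of layers can consume the whole sequence, and the modality is just another embedding input.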

Instead of standard per-token causality, Wan-Streamer uses block-causal attention at the granularity of a 160ms streaming unit:

Unit k attends to all prior units (full history)
Unit k cannot attend to future units
Within a 160ms unit, all tokens see each other freely
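The three rules above reduce to a single comparison on unit indices. The following sketch builds the corresponding boolean mask in plain Python; the function name and the unit sizes are hypothetical and chosen only to make the pattern visible.

```python
def block_causal_mask(unit_ids):
    """Return mask[q][k] == True when query token q may attend to key token k.

    `unit_ids[i]` is the streaming-unit index of token i, in sequence order.
    A token sees every token in its own unit and in all earlier units.
    """
    return [[k_unit <= q_unit for k_unit in unit_ids] for q_unit in unit_ids]


# Three units with 2, 3, and 1 tokens respectively (hypothetical sizes).
unit_ids = [0, 0, 1, 1, 1, 2]
mask = block_causal_mask(unit_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxxx.
# xxxxx.
# xxxxx.
# xxxxxx
```

Compared with a per-token causal mask, the first token of each unit can also see the later tokens of the same unit, which gives the staircase of square blocks along the diagonal.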

Figure 1: Wan-Streamer end-to-end architecture. A single unified Transformer processes text, audio, and video tokens in an interleaved causal streaming sequence (source: arXiv 2606.25041).

A strict streaming contract ensures every component (VAEs, encoders, decoders, attention, flow-matching solver) is causal by design — no component depends on future information, enabling incremental KV-cache updates and constant streaming latency.
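To make the idea of incremental KV-cache updates concrete, here is a minimal sketch of an append-only cache that grows by one slice per streaming unit. The `StreamingKVCache` class and `stream` helper are hypothetical names for this example, and strings stand in for real key/value tensors.

```python
class StreamingKVCache:
    """Append-only cache holding one (keys, values) slice per streaming unit."""

    def __init__(self):
        self._slices = []

    def append_unit(self, keys, values):
        if len(keys) != len(values):
            raise ValueError("keys and values must have the same length")
        self._slices.append((list(keys), list(values)))

    def context(self):
        """All cached keys and values, in arrival order."""
        keys = [k for ks, _ in self._slices for k in ks]
        values = [v for _, vs in self._slices for v in vs]
        return keys, values

    def __len__(self):
        return sum(len(ks) for ks, _ in self._slices)


def stream(units, cache):
    """Process units in arrival order.

    Yields (new_tokens, history_tokens) per step: only the new unit's
    keys/values are added, and earlier slices are reused untouched.
    """
    for keys, values in units:
        history = len(cache)
        cache.append_unit(keys, values)
        yield len(keys), history


cache = StreamingKVCache()
units = [(["k0", "k1"], ["v0", "v1"]), (["k2"], ["v2"]), (["k3", "k4"], ["v3", "v4"])]
steps = list(stream(units, cache))
print(steps)
```

Because no component looks ahead, nothing already in the cache ever has to be recomputed when a new unit arrives.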

Training: Three-Stage Curriculum

Stage 1: Initialize from a Qwen2.5/Qwen3 LLM. Train on understanding tasks (QA, ASR, dialogue) and generation tasks (audio/video synthesis) in a unified sequence format. This builds the multimodal foundation.
Stage 2: Train on full-duplex interaction data: normal turns, interruptions, backchanneling, overlapping speech, and proactive initiation. The model learns turn management and cross-modal coordination from data, not rules.
Stage 3: Distill a high-quality teacher (many solver steps plus CFG) into an efficient student (4-8 steps, no CFG) using rolling distillation with self-forcing and distribution matching. The student matches teacher quality at a fraction of the compute.
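The compute gap in Stage 3 comes from the number of network evaluations per generated unit. The sketch below does not implement rolling distillation; it only counts evaluations for a plain Euler flow-matching sampler with and without classifier-free guidance. The toy velocity field, the 50-step teacher, and the guidance scale of 3.0 are assumptions for illustration, since the post only says "many solver steps".

```python
def euler_sample(velocity, x0, num_steps, cfg_scale=None):
    """Integrate dx/dt = velocity(x, t, conditioned) from t=0 to t=1.

    Returns (final sample, number of velocity evaluations). With CFG,
    each step evaluates the field twice (conditional and unconditional).
    """
    x, evals = x0, 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t, True)
        evals += 1
        if cfg_scale is not None:
            v_uncond = velocity(x, t, False)
            evals += 1
            v = v_uncond + cfg_scale * (v - v_uncond)
        x = x + dt * v
    return x, evals


TARGET = 2.0  # toy "data" point


def toy_velocity(x, t, conditioned):
    # Straight-line field toward TARGET; ignores `conditioned` for simplicity.
    return (TARGET - x) / (1.0 - t)


teacher_x, teacher_evals = euler_sample(toy_velocity, 0.0, 50, cfg_scale=3.0)
student_x, student_evals = euler_sample(toy_velocity, 0.0, 4)
print(teacher_evals, student_evals)  # 100 4
```

In this toy setting both samplers land on the same point, so the only difference is cost. In the real model a few-step sampler without guidance would normally lose quality, which is the gap the distillation stage is described as closing.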

Inference: Thinker-Performer Decomposition

The model splits across two GPUs with pipeline overlap:

Thinker (GPU 1): Runs all Transformer computation: encoders, the block-causal attention forward pass, and decoders. Produces KV slices.
Performer (GPU 2): Runs the flow-matching solver to denoise the next unit's audio-visual latents, conditioned on the KV cache from the Thinker.

At any step, both GPUs work concurrently: the Thinker encodes the current input and decodes the previous output, while the Performer generates the next unit's latents. Effective latency = max(Thinker, Performer) ≈ 200ms model-side. CUDA graphs, torch.compile/TensorRT, and custom block-causal attention kernels reduce GPU overhead by 3-5x.
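The effect of overlapping the two stages can be shown with simple arithmetic. The stage timings below are hypothetical values picked for illustration; the post reports only the combined figure of roughly 200ms, not a per-stage breakdown.

```python
def sequential_latency_ms(thinker_ms, performer_ms):
    """Per-unit latency if the two stages ran back to back on one device."""
    return thinker_ms + performer_ms


def overlapped_latency_ms(thinker_ms, performer_ms):
    """Per-unit latency when both stages run concurrently on separate GPUs."""
    return max(thinker_ms, performer_ms)


# Hypothetical stage timings in milliseconds, chosen only for illustration.
thinker_ms, performer_ms = 150, 200

print(sequential_latency_ms(thinker_ms, performer_ms))  # 350
print(overlapped_latency_ms(thinker_ms, performer_ms))  # 200
```

With overlap, the slower stage sets the pace, so speeding up the faster stage alone would not reduce latency further.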

Results & Key Takeaways

Figure 2: Latency and quality comparison of Wan-Streamer and cascaded baselines (source: arXiv 2606.25041).

Wan-Streamer achieves the lowest reported end-to-end latency for audio-visual interactive AI:

Model-side latency: ~200ms signal-to-signal
End-to-end latency: ~550ms (including bidirectional network)
Streaming unit: 160ms at 25 FPS
Evaluation: Outperforms cascaded baselines on MOS (speech quality), FID (visual quality), and sync accuracy metrics
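A few quantities follow directly from the figures above. This short sketch derives them; the variable names are illustrative, and the latency values are the approximate numbers quoted in the list.

```python
UNIT_MS = 160        # streaming unit duration
FPS = 25             # video frame rate
MODEL_SIDE_MS = 200  # approximate model-side latency
END_TO_END_MS = 550  # approximate end-to-end latency

# Video frames carried by one streaming unit.
frames_per_unit = UNIT_MS * FPS // 1000

# Streaming units the model must produce per second of output.
units_per_second = 1000 / UNIT_MS

# Latency outside the model, including the bidirectional network.
non_model_ms = END_TO_END_MS - MODEL_SIDE_MS

print(frames_per_unit, units_per_second, non_model_ms)  # 4 6.25 350
```

So each unit covers 4 video frames, and roughly 350ms of the end-to-end figure is spent outside the model.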

This is the first demonstration of end-to-end full-duplex audio-visual interaction within a single model — no cascaded modules, no hand-crafted rules, no post-hoc alignment. Wan-Streamer shows that unified streaming Transformers are a viable path toward truly natural human-AI interaction.

Paper: arxiv.org/html/2606.25041
