Introduction: Beyond Cascaded Systems
Most current multimodal AI systems are built as cascaded pipelines: separate VAD, ASR, LLM, TTS, and video modules stitched together. This design has structural problems: latency accumulates across stages, errors compound from one module to the next, turn-taking relies on hand-crafted rules, modalities drift out of sync (for example, lip-sync mismatch), and there is no true full-duplex capability.
Wan-Streamer v0.1 (arXiv, June 2026, Alibaba WAN team) introduces a single unified Transformer that models language, audio, and video as both input and output in one natively streaming framework — no external modules required.
Architecture: One Transformer, Block-Causal Attention
Wan-Streamer processes all modalities — text, audio, video — as tokens in a single interleaved causal sequence. Every token flows through the same Transformer layers; modality is conveyed through positional and type embeddings, not routing logic.
Instead of standard per-token causality, it uses block-causal attention at the granularity of a 160ms streaming unit:
- Unit k attends to all prior units (full history)
- Unit k cannot attend to future units
- Within a 160ms unit, all tokens see each other freely
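The three rules above can be expressed as a single boolean attention mask. The following is a minimal sketch using NumPy, not the paper's implementation. The function name `block_causal_mask` and the token layout (three units holding 2, 3, and 2 tokens) are assumptions chosen for illustration.

```python
import numpy as np

def block_causal_mask(unit_ids):
    """Return mask[q, k] == True where query token q may attend to key token k.

    unit_ids[i] is the index of the streaming unit that token i belongs to
    (assumed non-decreasing along the sequence).
    """
    unit_ids = np.asarray(unit_ids)
    # A query sees a key iff the key's unit is the same unit or an earlier one.
    return unit_ids[None, :] <= unit_ids[:, None]

# Hypothetical layout: 3 streaming units with 2, 3, and 2 tokens.
unit_ids = [0, 0, 1, 1, 1, 2, 2]
mask = block_causal_mask(unit_ids)
print(mask.astype(int))
```

Each row is a query token and each column a key token. The printed matrix is block lower-triangular: dense square blocks on the diagonal (tokens within a unit see each other), ones below them (full history), and zeros above (no future units).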

A strict streaming contract ensures every component (VAEs, encoders, decoders, attention, flow-matching solver) is causal by design — no component depends on future information, enabling incremental KV-cache updates and constant streaming latency.
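To see why causality permits incremental KV-cache updates, consider a toy single-head attention layer. The sketch below is an illustration under simplifying assumptions (one layer, random projections, NumPy only) and is not taken from the paper. It checks that processing one unit at a time against an append-only cache gives the same result as running full block-causal attention over the whole sequence. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_attention(q, k, v, unit_ids):
    """Offline reference: attention over the whole sequence with a block-causal mask."""
    unit_ids = np.asarray(unit_ids)
    mask = unit_ids[None, :] <= unit_ids[:, None]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

def streaming_attention(q, k, v, unit_ids):
    """Streaming version: append each unit's K/V to a cache, never revisit old units."""
    unit_ids = np.asarray(unit_ids)
    k_cache = np.empty((0, k.shape[-1]))
    v_cache = np.empty((0, v.shape[-1]))
    outputs = []
    for u in np.unique(unit_ids):
        idx = unit_ids == u
        # Only the new unit's keys and values are added; history is reused as-is.
        k_cache = np.concatenate([k_cache, k[idx]])
        v_cache = np.concatenate([v_cache, v[idx]])
        scores = q[idx] @ k_cache.T / np.sqrt(q.shape[-1])
        outputs.append(softmax(scores) @ v_cache)
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
unit_ids = [0, 0, 1, 1, 1, 2, 2]          # hypothetical token-to-unit layout
q, k, v = rng.standard_normal((3, len(unit_ids), 8))
full = full_attention(q, k, v, unit_ids)
stream = streaming_attention(q, k, v, unit_ids)
print(np.allclose(full, stream))
```

The equivalence holds because no output for a unit depends on keys or values from a later unit. In this toy the per-unit work still grows with the cache length; the sketch only demonstrates that past units never need to be recomputed.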
Training: Three-Stage Curriculum
- Stage 1: Initialize from a Qwen2.5/Qwen3 LLM. Train on understanding tasks (QA, ASR, dialogue) and generation tasks (audio/video synthesis) in a unified sequence format. This builds the multimodal foundation.
- Stage 2: Train on full-duplex interaction data: normal turns, interruptions, backchanneling, overlapping speech, and proactive initiation. The model learns turn management and cross-modal coordination from data, not rules.
- Stage 3: Distill a high-quality teacher (many solver steps + CFG) into an efficient student (4-8 steps, no CFG) using rolling distillation with self-forcing and distribution matching. The student matches teacher quality at a fraction of the compute.
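The compute gap in Stage 3 can be made concrete by counting network calls per generated unit. The sketch below assumes a fixed-step Euler solver and the standard classifier-free guidance formula, which needs one conditional and one unconditional call per step. The source does not name the solver or the teacher's step count, so the 50-step teacher, the guidance scale, and the toy velocity field are hypothetical stand-ins.

```python
import numpy as np

def solve(velocity_fn, x0, num_steps, cfg_scale=None):
    """Fixed-step Euler from t=0 to t=1. Returns (sample, number of network calls)."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    calls = 0
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t, conditional=True)
        calls += 1
        if cfg_scale is not None:
            # CFG needs a second, unconditional call at every solver step.
            v_uncond = velocity_fn(x, t, conditional=False)
            calls += 1
            v = v_uncond + cfg_scale * (v - v_uncond)
        x = x + dt * v
    return x, calls

target = np.array([1.0, -2.0, 0.5])        # hypothetical "clean latent"

def toy_velocity(x, t, conditional):
    # Toy stand-in for the model: a straight-line field toward a fixed goal.
    goal = target if conditional else np.zeros_like(target)
    return (goal - x) / (1.0 - t)

noise = np.random.default_rng(0).standard_normal(3)
_, teacher_calls = solve(toy_velocity, noise, num_steps=50, cfg_scale=2.0)
student_out, student_calls = solve(toy_velocity, noise, num_steps=4)
print(teacher_calls, student_calls)
```

Under these assumed settings the teacher makes 100 network calls per unit and the student makes 4. The toy field says nothing about sample quality; matching the teacher's quality at the lower call count is what the distillation stage is trained to achieve.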
Inference: Thinker-Performer Decomposition
The model splits across two GPUs with pipeline overlap:
- Thinker (GPU 1) — Runs all Transformer computation: encoders, block-causal attention forward pass, decoders. Produces KV slices.
- Performer (GPU 2) — Runs flow-matching solver to denoise next unit’s audio-visual latents, conditioned on KV cache from Thinker.
At any step, both GPUs work concurrently: the Thinker encodes the current input and decodes the previous output, while the Performer generates the next unit’s latents. Effective latency = max(Thinker, Performer) ≈ 200ms model-side. CUDA graphs, torch.compile/TensorRT, and custom block-causal attention kernels reduce GPU overhead by 3-5x.
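The effect of overlapping the two stages can be illustrated with a small timeline simulation. This is a sketch of generic two-stage pipelining, not the paper's scheduler. The stage durations (100ms for the Thinker, 140ms for the Performer) are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
def pipeline_finish_times(thinker_ms, performer_ms, num_units):
    """Completion time of each unit when the two stages run on separate GPUs."""
    thinker_done = 0.0
    performer_done = 0.0
    finishes = []
    for _ in range(num_units):
        # The Thinker starts the next unit as soon as it finishes the previous one.
        thinker_done += thinker_ms
        # The Performer needs the Thinker's KV slice and must itself be free.
        performer_done = max(thinker_done, performer_done) + performer_ms
        finishes.append(performer_done)
    return finishes

THINKER_MS, PERFORMER_MS = 100.0, 140.0    # hypothetical stage timings
finishes = pipeline_finish_times(THINKER_MS, PERFORMER_MS, num_units=5)
intervals = [b - a for a, b in zip(finishes, finishes[1:])]
serial_total = 5 * (THINKER_MS + PERFORMER_MS)

print(finishes)        # completion time of each unit, in ms
print(intervals)       # time between consecutive outputs
print(serial_total)    # total if both stages ran back to back on one GPU
```

In steady state a new unit completes every max(Thinker, Performer) milliseconds rather than every Thinker + Performer milliseconds, which is the sense in which the slower stage sets the effective per-step cost.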
Results & Key Takeaways

Wan-Streamer achieves the lowest reported end-to-end latency for audio-visual interactive AI:
- Model-side latency: ~200ms signal-to-signal
- End-to-end latency: ~550ms (including bidirectional network)
- Streaming unit: 160ms at 25 FPS
- Evaluation: Outperforms cascaded baselines on MOS (speech quality), FID (visual quality), and sync accuracy metrics
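The reported figures imply a few derived quantities, worked out below as plain arithmetic. The variable names are illustrative, and attributing the whole remainder to everything outside the model is an assumption: the source says only that the end-to-end number includes the bidirectional network.

```python
UNIT_MS = 160          # streaming unit duration
FPS = 25               # video frame rate
MODEL_SIDE_MS = 200    # approximate model-side latency
END_TO_END_MS = 550    # approximate end-to-end latency

# Video frames carried by one streaming unit.
frames_per_unit = UNIT_MS * FPS // 1000
# Streaming units the model must produce per second of output.
units_per_second = 1000 / UNIT_MS
# Latency left for the network and anything else outside the model.
outside_model_ms = END_TO_END_MS - MODEL_SIDE_MS

print(frames_per_unit, units_per_second, outside_model_ms)
```

So each 160ms unit spans a whole number of video frames at 25 FPS, and roughly two thirds of the end-to-end figure lies outside the model.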
According to the paper, this is the first demonstration of end-to-end full-duplex audio-visual interaction within a single model, with no cascaded modules, no hand-crafted rules, and no post-hoc alignment. Wan-Streamer shows that unified streaming Transformers are a viable path toward more natural human-AI interaction.
Paper: arxiv.org/html/2606.25041
