Engineering TTS Inference in vLLM-Omni
The vLLM-Omni engineering team describes how it optimized TTS inference for four models: Qwen3-TTS, VoxCPM2, Fish Speech S2 Pro, and Higgs Audio V3. The main challenges were decoupling the streaming chunk size from the decode window to balance TTFP against audio quality, batching per-request Python preprocessing to cut hot-path overhead, applying whole-model torch.compile to reduce the number of kernel launch boundaries, moving multi-codebook decode state into GPU-resident tensors, and implementing model-specific Triton attention kernels for pure-decode shapes. Reported results are a 61.5% audio throughput improvement for Qwen3-TTS, a 172% improvement for VoxCPM2, and a 2.70× speedup for Higgs Audio V3. The post also documents rejected designs, such as staging-overlap under dynamic batching, and explains why PIECEWISE CUDA Graph lost to eager execution plus a local MLP graph for Higgs Audio V3.
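
The first item, decoupling the streaming chunk size from the decode window, can be illustrated with a minimal sketch. This is not vLLM-Omni's code: the function name `plan_stream_decodes` and the parameters `chunk_size` and `left_context` are hypothetical, and the sketch assumes a codec decoder whose output quality improves when it re-decodes some already-emitted frames as left context. Under that assumption, a small `chunk_size` keeps TTFP low while a larger `left_context` gives the decoder more history at chunk boundaries.

```python
from typing import Iterator, Tuple


def plan_stream_decodes(
    num_frames: int, chunk_size: int, left_context: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (window_start, chunk_start, chunk_end) for each streamed chunk.

    Hypothetical helper, not part of vLLM-Omni. Each step decodes the codec
    frames [window_start, chunk_end) but emits audio only for the new frames
    [chunk_start, chunk_end). chunk_size controls how often audio is emitted;
    left_context controls how much earlier material the decoder sees again.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_context < 0:
        raise ValueError("left_context must be non-negative")

    for chunk_start in range(0, num_frames, chunk_size):
        chunk_end = min(chunk_start + chunk_size, num_frames)
        # Clamp at zero: the earliest chunks have little or no history.
        window_start = max(0, chunk_start - left_context)
        yield window_start, chunk_start, chunk_end


if __name__ == "__main__":
    # 10 codec frames, emit every 4 frames, re-decode up to 6 frames of history.
    for window_start, chunk_start, chunk_end in plan_stream_decodes(10, 4, 6):
        print(f"decode [{window_start}, {chunk_end}) -> emit [{chunk_start}, {chunk_end})")
```

Because the two parameters are independent, the emit cadence can be tuned for latency without shrinking the context the decoder uses, at the cost of re-decoding `left_context` frames on every step.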