vLLM · 0 comments · 21 min read · 3 hours ago

Engineering TTS Inference in vLLM-Omni

vLLM-Omni's engineering team details how it optimized TTS inference for four models: Qwen3-TTS, VoxCPM2, Fish Speech S2 Pro, and Higgs Audio V3. The key optimizations are decoupling streaming chunk sizes from decode windows to balance TTFP and audio quality, batching per-request Python preprocessing to reduce hot-path overhead, applying whole-model torch.compile to reduce kernel launch boundaries, moving multi-codebook decode state to GPU-resident tensors, and implementing model-specific Triton attention kernels for pure decode shapes. Reported results include a 61.5% audio throughput improvement for Qwen3-TTS, 172% for VoxCPM2, and a 2.70× speedup for Higgs Audio V3. The post also documents rejected designs, such as staging-overlap under dynamic batching, and explains why PIECEWISE CUDA Graph lost to eager execution plus a local MLP graph for Higgs Audio V3.
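
To make the first optimization more concrete, here is a minimal sketch of one way a streaming chunk size can be decoupled from the decode window. It is an illustration under stated assumptions, not vLLM-Omni's implementation: `toy_decode`, `stream_audio`, `SAMPLES_PER_FRAME`, and the parameter names are all hypothetical, and the decoder is a stand-in that turns each codec frame into a fixed number of samples. The idea shown is that audio is emitted every `chunk_frames` frames (small, for a low time to first audio) while each decode call sees up to `window_frames` frames of left context (larger, for quality), and the samples belonging to the context are trimmed before emission.

```python
from typing import Callable, Iterator, List, Sequence

SAMPLES_PER_FRAME = 4  # hypothetical: audio samples produced per codec frame


def toy_decode(frames: Sequence[int]) -> List[int]:
    """Stand-in for a codec decoder: each frame becomes SAMPLES_PER_FRAME samples."""
    out: List[int] = []
    for frame in frames:
        out.extend([frame] * SAMPLES_PER_FRAME)
    return out


def stream_audio(
    frames: Sequence[int],
    chunk_frames: int,
    window_frames: int,
    decode: Callable[[Sequence[int]], List[int]] = toy_decode,
) -> Iterator[List[int]]:
    """Yield audio for every `chunk_frames` new frames, decoding over a longer window."""
    if chunk_frames <= 0 or window_frames < chunk_frames:
        raise ValueError("need 0 < chunk_frames <= window_frames")
    for start in range(0, len(frames), chunk_frames):
        end = min(start + chunk_frames, len(frames))
        # The decode window ends at the newest frame and reaches back for context.
        ctx_start = max(0, end - window_frames)
        audio = decode(frames[ctx_start:end])
        # Drop samples that belong to context frames already emitted earlier.
        skip = (start - ctx_start) * SAMPLES_PER_FRAME
        yield audio[skip:]


# Usage: emit every 3 frames while decoding with up to 8 frames of context.
frames = list(range(10))
chunks = list(stream_audio(frames, chunk_frames=3, window_frames=8))
```

In this sketch a smaller `chunk_frames` shortens the wait for the first chunk, while `window_frames` can be tuned separately; the cost is that context frames are decoded more than once.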


#text-to-speech #vllm

Source: https://vllm.ai/blog/2026-06-23-vllm-omni-tts. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.

