vLLM-Omni's engineering team details how they optimized TTS inference for four models: Qwen3-TTS, VoxCPM2, Fish Speech S2 Pro, and Higgs Audio V3. Key challenges include decoupling streaming chunk sizes from decode windows to balance TTFP and audio quality, batching per-request Python preprocessing to reduce hot-path overhead, applying whole-model torch.compile to reduce kernel launch boundaries, moving multi-codebook decode state to GPU-resident tensors, and implementing model-specific Triton attention kernels for pure decode shapes. Results include a 61.5% audio throughput improvement for Qwen3-TTS, 172% for VoxCPM2, and 2.70× speedup for Higgs Audio V3. The post also documents rejected designs like staging-overlap under dynamic batching and explains why PIECEWISE CUDA Graph lost to eager plus local MLP graph for Higgs v3.
Nguồn: https://vllm.ai/blog/2026-06-23-vllm-omni-tts. 8sync News chỉ tóm tắt và dẫn link; bản quyền nội dung thuộc tác giả và nguồn gốc.

A practical guide to deploying distributed AI inference using vLLM and llm-d across six traffic-shaped blueprints: high-concurrency chat, long-context RAG, high-throughput batch, distributed AI-grid (Model-as-a-Service), hybrid sovereign-to-cloud-burst, and edge inference on workstation GPUs. Each blueprint covers workload signature, topology, key mechanisms (prefill/decode disaggregation, KV-cache tiering, speculative decoding, model cascading), and cost shape. The post also provides inference troubleshooting recipes for TTFT/TPOT regressions using vLLM Prometheus metrics and NVIDIA Nsight tools, and closes with a four-step scaling roadmap from a single vLLM instance to a full distributed AI grid on Red Hat OpenShift AI.
A step-by-step guide to launching a vLLM inference server on Hugging Face Jobs using a single CLI command. Covers prerequisites, launching with hf jobs run using the official vLLM Docker image, querying via curl or the OpenAI Python client, SSH debugging, scaling to large multi-GPU models (e.g., Qwen3.5-122B on 2×H200 with tensor parallelism), adding a Gradio chat UI, and using the endpoint as a backend for the Pi coding agent. Also explains when to choose HF Jobs vs. Inference Endpoints.
DFlash is an open source block-diffusion speculative decoding method that replaces sequential autoregressive drafting with parallel block-level token prediction. On NVIDIA Blackwell hardware, it delivers up to 15x throughput improvement over autoregressive decoding for gpt-oss-120b at high interactivity targets, and outperforms EAGLE-3 speculative decoding by 1.5x. The technique uses three key mechanisms: block-diffusion drafting, target hidden-state conditioning, and KV injection. Twenty model checkpoints covering Qwen, Llama, Gemma, Kimi K2.6, and gpt-oss families are available on Hugging Face, with integration support for TensorRT-LLM, vLLM (via the Speculators library), and SGLang requiring minimal config changes and no application refactoring.
SkyPilot Endpoints is a production-ready LLM inference system that deploys a full serving stack — inference engine, autoscaler, gateway, TLS, metrics — from a single YAML across multiple Kubernetes clusters under one endpoint URL. It handles cross-cluster placement, autoscaling, and failure recovery automatically. A key feature is unified GPU pool management: training jobs run as preemptible workloads that yield GPUs to latency-sensitive inference when demand spikes, then resume from checkpoints when capacity frees up. The stack builds on vLLM, KServe, llm-d, and KEDA, and includes KV cache-aware routing, prefill/decode disaggregation, scale-to-zero, rolling updates, and a unified observability dashboard across all clusters.

Bài viết hướng dẫn kỹ thuật sâu về ba phương pháp tối ưu hóa inference AI phân tán ở quy mô lớn: tách rời prefill/decode (P/D), chiến lược KV cache, và giải mã dự đoán (speculative decoding). P/D disaggregation đề xuất tỷ lệ worker 1:3 đến 1:5, sử dụng KV-transfer connector (NixlConnector, LMCacheConnector, MooncakeConnector) và routing thông minh (llm-d) giúp cải thiện TTFT lên tới 57 lần. KV cache được phân cấp (HBM/DRAM/NVMe), tối ưu chia sẻ tiền tố (prefix sharing) và tái sử dụng (reuse), cân nhắc lượng tử hóa FP8/FP4, cùng so sánh kiến trúc PagedAttention và RadixAttention. Phần speculative decoding so sánh EAGLE 3.1, self-speculative, Medusa heads, MTP, đồng thời cảnh báo rằng chế độ giải mã hạn chế (JSON mode, tool calls) có thể làm giảm tỷ lệ chấp nhận.
Lập trình viên chuyên phát triển hệ thống AI quy mô lớn cần đọc để tối ưu hóa hiệu suất và chi phí của các ứng dụng phân tán, từ cách phân tán tiền xử lý/giải mã đến lựa chọn cache KV hiệu quả và chiến lược dự đoán để giảm thời gian phản hồi mà không ảnh hưởng đến độ chính xác.