#vllm · 8sync News

SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own

SkyPilot Endpoints is a production-ready LLM inference system that deploys a full serving stack — inference engine, autoscaler, gateway, TLS, metrics — from a single YAML across multiple Kubernetes clusters under one endpoint URL. It handles cross-cluster placement, autoscaling, and failure recovery automatically. A key feature is unified GPU pool management: training jobs run as preemptible workloads that yield GPUs to latency-sensitive inference when demand spikes, then resume from checkpoints when capacity frees up. The stack builds on vLLM, KServe, llm-d, and KEDA, and includes KV cache-aware routing, prefill/decode disaggregation, scale-to-zero, rolling updates, and a unified observability dashboard across all clusters.

Optimizing distributed AI inference: Advanced deployment patterns

This in-depth technical guide covers three approaches to optimizing distributed AI inference at scale: prefill/decode (P/D) disaggregation, KV cache strategies, and speculative decoding. For P/D disaggregation, it recommends a worker ratio of 1:3 to 1:5 and uses KV-transfer connectors (NixlConnector, LMCacheConnector, MooncakeConnector) together with intelligent routing (llm-d), improving TTFT by up to 57x. The KV cache section covers tiering (HBM/DRAM/NVMe), optimizing prefix sharing and reuse, weighing FP8/FP4 quantization, and comparing the PagedAttention and RadixAttention architectures. The speculative decoding section compares EAGLE 3.1, self-speculative decoding, Medusa heads, and MTP, and warns that constrained decoding modes (JSON mode, tool calls) can lower the acceptance rate.

Developers who build large-scale AI systems should read this to optimize the performance and cost of distributed applications, from prefill/decode disaggregation to choosing an efficient KV cache strategy and a speculative decoding strategy that reduces response latency without affecting accuracy.
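
To make the acceptance-rate warning above concrete, here is a minimal sketch of the standard speculative-decoding estimate of tokens produced per verification step. It is not taken from the article: the function name, the draft length k=4, and the sample acceptance rates are illustrative assumptions, and the formula assumes each drafted token is accepted independently with the same probability.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumption (simplified model): each of the k drafted tokens is accepted
    independently with probability alpha, and the target model always
    contributes one extra token (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus the bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates, e.g. free-form text vs. constrained output.
    for alpha in (0.8, 0.6, 0.4):
        tokens = expected_tokens_per_step(alpha, k=4)
        print(f"alpha={alpha:.1f}: {tokens:.2f} tokens per step")
```

Under this simplified model, a drop in acceptance rate from 0.8 to 0.4 roughly halves the tokens gained per step, which is why constrained modes such as JSON output can erode the benefit.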

Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

DFlash is an open-source block-diffusion speculative decoding method that replaces sequential autoregressive drafting with parallel block-level token prediction. On NVIDIA Blackwell hardware, it delivers up to 15x throughput improvement over autoregressive decoding for gpt-oss-120b at high interactivity targets, and outperforms EAGLE-3 speculative decoding by 1.5x. The technique uses three key mechanisms: block-diffusion drafting, target hidden-state conditioning, and KV injection. Twenty model checkpoints covering Qwen, Llama, Gemma, Kimi K2.6, and gpt-oss families are available on Hugging Face, with integration support for TensorRT-LLM, vLLM (via the Speculators library), and SGLang requiring minimal config changes and no application refactoring.
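
For readers new to speculative decoding, the following is a generic sketch of greedy block verification, the step that any drafter (sequential or block-level) feeds into. It is not DFlash's implementation: the drafted block is simply passed in as a list, and `toy_target` is a hypothetical stand-in for the target model. Real engines score all drafted positions in one forward pass; the loop here is only for readability.

```python
from typing import Callable, List

# A "model" here is any function mapping a token sequence to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    context: List[int], draft_block: List[int], target_next: NextToken
) -> List[int]:
    """Greedily verify a drafted block and return the tokens to append.

    Drafted tokens are kept up to the first disagreement with the target;
    the target's own token replaces the first mismatch. If the whole block
    matches, the target contributes one extra (bonus) token.
    """
    accepted: List[int] = []
    for token in draft_block:
        expected = target_next(context + accepted)
        if token != expected:
            accepted.append(expected)  # correction from the target model
            return accepted
        accepted.append(token)
    # Entire block accepted: add the target's bonus token.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(sequence: List[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (sequence[-1] + 1) % 10


if __name__ == "__main__":
    # The third drafted token (9) disagrees with the target, which expects 3.
    print(speculative_step([0], [1, 2, 9, 4], toy_target))
    # A fully correct block yields all four tokens plus one bonus token.
    print(speculative_step([0], [1, 2, 3, 4], toy_target))
```

Because the output always matches what the target would have produced greedily, the drafter affects speed only, not the generated text, in this greedy setting.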