vLLM · 16 min read

Experience and Lessons Learned from Serving Multi-Stage Qwen3-Omni in vLLM-Omni

vLLM-Omni serves Qwen3-Omni as a three-stage pipeline (Thinker → Talker → Code2Wav) and applies a layered set of optimizations to improve throughput and latency for online speech-generation workloads. The post walks through each optimization in order: stage decomposition with per-stage batching as the baseline; CUDA Graph capture per stage (a roughly 4× throughput jump); async chunk handoffs that pipeline inter-stage transfers (the largest reduction in audio TTFP, from 2790 ms to 655 ms); async output for non-blocking payload construction; stage replicas that scale only the bottleneck Talker/Code2Wav stages; and hot-path cleanup targeting per-step Python and allocation overhead. Combined, these raise throughput from 2.2 to 11.7 req/s at concurrency 64, cut audio TTFP from ~5884 ms to ~632 ms, and lower the audio real-time factor (RTF) from 1.15 to 0.47, so audio that was generated slower than real time (RTF above 1) is now generated comfortably faster than real time.
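To make the async chunk handoff idea concrete, here is a minimal sketch using plain `asyncio` queues. It is not vLLM-Omni code: the names `stage` and `run_pipeline`, the queue-based handoff, and the `None` sentinel are all assumptions made for illustration. The point it shows is that each downstream stage starts on chunk 0 while the upstream stage is still producing later chunks, which is why the first audio packet can arrive much earlier than with whole-output handoffs.

```python
import asyncio


async def stage(name, inbox, outbox, log):
    """Hypothetical stage: forwards each chunk as soon as it is processed."""
    while True:
        chunk = await inbox.get()
        if chunk is None:  # sentinel: upstream has finished
            await outbox.put(None)
            return
        log.append((name, chunk))
        await outbox.put(f"{name}({chunk})")


async def run_pipeline(num_chunks=3):
    """Run a toy Thinker -> Talker -> Code2Wav pipeline with chunked handoffs."""
    log = []
    q1, q2, q3 = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()

    async def thinker():
        for i in range(num_chunks):
            log.append(("thinker", i))
            await q1.put(i)
            await asyncio.sleep(0)  # yield so downstream stages can run
        await q1.put(None)

    async def sink():
        out = []
        while (item := await q3.get()) is not None:
            out.append(item)
        return out

    results = await asyncio.gather(
        thinker(),
        stage("talker", q1, q2, log),
        stage("code2wav", q2, q3, log),
        sink(),
    )
    return results[3], log


if __name__ == "__main__":
    outputs, log = asyncio.run(run_pipeline())
    print(outputs)
    print(log)
```

In the recorded log, the `code2wav` entry for the first chunk appears before the `thinker` entry for the last chunk, which is the overlap that chunked handoffs are meant to create. A real serving stack would hand off tensors between processes or devices rather than strings between coroutines.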


#multimodal #cuda #ai-inference #vllm

Source: https://vllm.ai/blog/2026-07-01-qwen3-omni-optimization. 8sync News only summarizes and links to the original; copyright of the content belongs to the author and the original source.
