Red Hat Developer · 7 min read

Inside the vLLM-Omni architecture: Serving Qwen3-Omni

vLLM-Omni extends the vLLM serving engine to models with multimodal output, such as Qwen3-Omni, which produce text, audio, and images rather than text tokens alone. The architecture decomposes inference into a graph of stages, namely the Thinker (~30B MoE), the Talker (~3B MoE), and the Code2Wav vocoder, each with its own GPU memory budget and each scaled independently. Key features include a single OpenAI-compatible endpoint; primitives inherited from vLLM (PagedAttention, continuous batching, and prefix caching extended to hidden-state tensors); shared-memory transport via OmniConnector; and asynchronous chunked pipeline execution, which lets stages overlap so that audio starts streaming before earlier stages finish. A demo on a single NVIDIA B200 walks through an insurance claim triage use case with concurrent adjuster and customer-callback requests. In benchmarks against Hugging Face Transformers, vLLM-Omni achieves a real-time factor below 1.0 for audio generation, versus 2.64 for the baseline. The engine also supports diffusion model stages alongside autoregressive ones on the same abstractions.
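The overlap between stages can be illustrated with a toy pipeline. This is a minimal sketch, not vLLM-Omni code: the functions `thinker`, `talker`, `code2wav`, and `run_pipeline` are hypothetical stand-ins that pass strings through `asyncio` queues, where the real system moves tensors over OmniConnector. The helper `real_time_factor` assumes the usual definition of the metric (generation time divided by the duration of the audio produced), so a value below 1.0 means audio is generated faster than it plays back.

```python
import asyncio

_DONE = object()  # sentinel marking the end of a stream


async def thinker(prompt, out_q, events):
    """Hypothetical Thinker stand-in: emits one fake hidden-state chunk per word."""
    for i, word in enumerate(prompt.split()):
        await asyncio.sleep(0)  # yield control so downstream stages can run
        events.append(("thinker", i))
        await out_q.put(f"h({word})")
    events.append(("thinker", "done"))
    await out_q.put(_DONE)


async def talker(in_q, out_q, events):
    """Hypothetical Talker stand-in: turns each hidden-state chunk into fake codec codes."""
    i = 0
    while (chunk := await in_q.get()) is not _DONE:
        events.append(("talker", i))
        await out_q.put(f"codec[{chunk}]")
        i += 1
    await out_q.put(_DONE)


async def code2wav(in_q, events):
    """Hypothetical vocoder stand-in: turns each codec chunk into a fake audio chunk."""
    audio = []
    i = 0
    while (codes := await in_q.get()) is not _DONE:
        events.append(("code2wav", i))
        audio.append(f"wav<{codes}>")
        i += 1
    return audio


async def run_pipeline(prompt):
    """Run the three stages concurrently, connected by small bounded queues."""
    events = []
    q1 = asyncio.Queue(maxsize=1)  # Thinker -> Talker
    q2 = asyncio.Queue(maxsize=1)  # Talker -> Code2Wav
    _, _, audio = await asyncio.gather(
        thinker(prompt, q1, events),
        talker(q1, q2, events),
        code2wav(q2, events),
    )
    return audio, events


def real_time_factor(generation_seconds, audio_seconds):
    """Assumed definition: time spent generating / duration of audio produced."""
    return generation_seconds / audio_seconds
```

Calling `asyncio.run(run_pipeline("claim approved pending review"))` returns the audio chunks together with an event log. In that log the first `code2wav` event appears before the Thinker's `done` event, which is the chunked overlap described above in miniature. The bounded queues also give simple backpressure: an upstream stage cannot run far ahead of the stage that consumes its output.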


#multimodal #ai-inference #vllm

Source: https://developers.redhat.com/articles/2026/07/01/inside-vllm-omni-architecture-serving-qwen3-omni. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.
