DigitalOcean Community · 0 comments · 26 min read · 3 days ago

The HBM Tax: Why Vision Encoders and Language Decoders Fight Over Your GPU

Vision-language models (VLMs) pay a hidden performance penalty, the HBM tax, when vision encoding and language decoding run on the same GPU. Vision encoding is compute-bound (85%+ GPU utilization, <5% memory bandwidth), while language decoding is memory-bound (80%+ HBM bandwidth, <10% compute), so running both on one GPU leaves each phase underserved. The root cause is that image tokens enter the KV cache at prefill and persist through every decode step, inflating memory traffic per generated token. The fix is modality-level disaggregation: split the pipeline at the boundary between the vision encoder output and the language model input. Only ~4.5 MB of embeddings cross that boundary per request, versus ~350 MB of KV cache, a 78x to 196x reduction depending on model architecture, which is small enough for standard 25 Gbps cloud networking without NVLink. A heterogeneous deployment (a compute-dense GPU such as the L40S for encoding, an HBM-rich GPU such as the H100 or H200 for decoding) delivers ~37–40% cost savings with no latency regression. The approach is validated by a March 2026 paper (arXiv:2603.12707) and is practical on DigitalOcean GPU Droplets using two pools connected over private VPC networking.
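As a rough check on the networking claim, the sketch below estimates how long each payload would spend on a 25 Gbps link. It uses only the per-request sizes quoted in the summary. The helper name `transfer_ms` is hypothetical, and the calculation assumes 1 MB = 10^6 bytes and a link running at full line rate with no protocol overhead, so real transfers will be somewhat slower.

```python
def transfer_ms(size_mb: float, link_gbps: float) -> float:
    """Idealized wire time in milliseconds for a payload of size_mb megabytes.

    Assumes 1 MB = 10**6 bytes and a link at full line rate (no overhead).
    """
    bits = size_mb * 1e6 * 8
    return bits / (link_gbps * 1e9) * 1e3


# Per-request sizes and link speed quoted in the summary above.
EMBEDDING_MB = 4.5   # vision encoder output sent to the decoder pool
KV_CACHE_MB = 350.0  # what KV-cache-level disaggregation would send instead
LINK_GBPS = 25.0     # standard cloud networking, no NVLink

embedding_ms = transfer_ms(EMBEDDING_MB, LINK_GBPS)
kv_cache_ms = transfer_ms(KV_CACHE_MB, LINK_GBPS)
reduction = KV_CACHE_MB / EMBEDDING_MB

print(f"embeddings: {embedding_ms:.2f} ms on the wire")
print(f"KV cache:   {kv_cache_ms:.2f} ms on the wire")
print(f"reduction:  {reduction:.1f}x less data per request")
```

Under these assumptions the embeddings take roughly 1.4 ms to transfer while the KV cache would take over 100 ms, which is why the split at the encoder boundary can tolerate ordinary VPC networking. The 4.5 MB and 350 MB pair gives the lower end (about 78x) of the quoted range; the summary does not state the sizes behind the 196x figure, so that case is not modeled here.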


#vlm

Source: https://www.digitalocean.com/community/tutorials/hbm-tax-gpu-inference. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.
