Modal · 0 · 0 comments · 11 min read · 1 day ago

Achieve state-of-the-art inference latencies with speculative decoding

Modal and Decagon collaborated to achieve state-of-the-art LLM inference latency using speculative decoding. The post outlines a four-part low-latency playbook: minimizing client-server communication, reducing host overhead, using speed-of-light GPU kernels (e.g., Flash Attention 4 on Blackwell GPUs), and applying speculative decoding with high-quality draft models. The key breakthrough was DFlash, a speculative decoding technique from Z Lab that uses KV projections from the target model and generates draft tokens in parallel. Starting from a generic DFlash speculator, the team performed task-specific 'mid-training' on synthetic data to fine-tune the speculator for Decagon's voice AI workload. The custom speculator cut an additional 100 ms off end-to-end latency (roughly 40% of server-side decode latency), making the system 60 ms faster than the best proprietary inference providers. The post also previews an 'autospec' feature for continual speculator improvement.
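To make the draft-and-verify idea concrete, here is a minimal sketch of greedy speculative decoding in Python. It is not DFlash and not the code from the Modal post: the draft here proposes tokens one at a time rather than in parallel, and `target_model` and `draft_model` are hypothetical toy functions standing in for real networks. The point it illustrates is that the output matches what the target model alone would produce, while the number of target verification rounds drops when the draft guesses well.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token predictor


def speculative_decode(
    target: NextTokenFn,
    draft: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    k: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < goal:
        rounds += 1
        # 1. The cheap draft model proposes k tokens.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks every proposed position. A real system does
        #    this in one batched forward pass; the loop here is for clarity.
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        else:
            # All k drafts accepted: the same pass also yields a bonus token.
            accepted.append(target(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], rounds


# Hypothetical toy models (assumptions for illustration only).
def target_model(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 17


def draft_model(ctx: List[Token]) -> Token:
    guess = target_model(ctx)
    # The draft is deliberately wrong whenever the last token is a multiple of 5.
    return guess if ctx[-1] % 5 else (guess + 1) % 17


tokens, rounds = speculative_decode(target_model, draft_model, [1], max_new_tokens=16)
print(tokens, rounds)
```

Each round costs one target pass but can emit up to `k + 1` tokens, so a draft with a higher acceptance rate means fewer target passes per generated token. That is the lever the task-specific speculator training described above is pulling.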


#deep-learning #ai-inference

Source: https://modal.com/blog/achieve-sota-specdec. 8sync News only summarizes and links to articles; copyright of the content belongs to the author and the original source.

Recommended for you

JetBrains · 0 · 22 min · 10 hours ago

Our Research on Membership Inference Attacks and Preventing Privacy Leaks

JetBrains researchers present EZ MIA (Error Zone Membership Inference Attack), a lightweight method for detecting whether specific data was used to train fine-tuned LLMs. Unlike existing approaches that rely on aggregate sequence loss or expensive shadow model training, EZ MIA focuses on token-level error positions, where memorization signals are most concentrated, and requires only two forward passes per sequence. Experiments on GPT-2, GPT-2-XL, and Llama-2 show EZ MIA outperforms baselines like LOSS, Min-K++, and SPV-MIA by up to 9x. The research also confirms that full fine-tuning creates significantly more membership leakage than LoRA-based fine-tuning, though LoRA does not eliminate the risk entirely, especially for larger models.

#llm #deep-learning
