Modal · 14 min read

Multi-token Residual Prediction

Multi-Token Residual Prediction (MRP) is a lightweight, 3-layer transformer module that accelerates diffusion language model (DLM) inference by predicting the logit residual between adjacent denoising steps rather than a full output distribution. Naive multi-token prediction collapses on DLMs beyond one step, but the correction between adjacent denoising steps is small, which makes it a low-complexity target that a tiny module can handle. MRP serves two inference regimes. In static denoising, it enables either speculative decoding (up to 1.56× throughput in SGLang with no loss in quality) or direct decoding (up to 1.9× throughput at a minor quality cost). In dynamic denoising, it uses the residual signal to remask tokens that were revealed too eagerly, recovering up to 22.6 accuracy points on benchmarks such as GSM8K, MATH500, HumanEval, and MBPP across the SDAR 1.7B, 4B, and 8B models. The module attaches to a frozen backbone, requires no backbone retraining, and composes with existing DLM inference methods.
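The residual idea can be illustrated with a toy Python sketch. This is not Modal's implementation: the logits and residuals are made-up numbers standing in for the backbone and the MRP module, every function name is hypothetical, and the remasking rule is only one assumed way a residual signal could flag an over-eagerly revealed token. It shows next-step logits being approximated as current logits plus a predicted residual, a simplified greedy check in the spirit of speculative decoding, and the assumed remask test.

```python
from typing import List

Logits = List[float]  # one score per vocabulary entry


def argmax(xs: Logits) -> int:
    """Return the index of the largest score (first index wins on ties)."""
    return max(range(len(xs)), key=lambda i: xs[i])


def apply_residual(logits_t: Logits, residual: Logits) -> Logits:
    """Approximate next-step logits as current logits plus a predicted residual."""
    return [a + b for a, b in zip(logits_t, residual)]


def draft_tokens(logits_t: List[Logits], residuals: List[Logits]) -> List[int]:
    """Draft one token per position from the residual-corrected logits."""
    return [argmax(apply_residual(a, b)) for a, b in zip(logits_t, residuals)]


def verify_greedy(draft: List[int], backbone_next: List[Logits]) -> List[bool]:
    """Accept a drafted token only if the backbone's own next-step argmax agrees."""
    return [d == argmax(b) for d, b in zip(draft, backbone_next)]


def should_remask(logits_t: Logits, residual: Logits, token: int) -> bool:
    """Hypothetical rule: remask a revealed token if the residual moves the argmax away from it."""
    return argmax(apply_residual(logits_t, residual)) != token


# Toy data: 2 positions, vocabulary of 3 (all numbers are invented for illustration).
logits_t = [[2.0, 1.0, 0.0], [0.5, 0.4, 0.1]]
residuals = [[0.1, -0.2, 0.0], [-0.3, 0.3, 0.0]]    # stand-in for the MRP module's output
backbone_next = [[2.2, 0.7, 0.1], [0.1, 0.9, 0.0]]  # stand-in for a real backbone step

draft = draft_tokens(logits_t, residuals)        # [0, 1]
accepted = verify_greedy(draft, backbone_next)   # [True, True]
```

In this toy run the second position's residual shifts the top choice from token 0 to token 1, so `should_remask(logits_t[1], residuals[1], token=0)` returns `True`: a token 0 revealed there at the earlier step would be flagged for remasking under the assumed rule.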


#ai-inference

Source: https://modal.com/blog/multi-token-residual-prediction. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.
