Modal and Decagon collaborated to achieve state-of-the-art LLM inference latency using speculative decoding. The post outlines a four-part low-latency playbook: minimizing client-server communication, reducing host overhead, using speed-of-light GPU kernels (e.g., Flash Attention 4 on Blackwell GPUs), and applying speculative decoding with high-quality draft models. The key breakthrough was the DFlash speculative decoding technique from Z Lab, which uses KV projections from the target model and generates draft tokens in parallel. On top of a generic DFlash speculator, they performed task-specific 'mid-training' using synthetic data to fine-tune the speculator for Decagon's voice AI workload. This custom speculator cut an additional 100ms off end-to-end latency — roughly 40% of server-side decode latency — resulting in a system 60ms faster than the best proprietary inference providers. The post also previews an 'autospec' feature for continual speculator improvement.
Source: https://modal.com/blog/achieve-sota-specdec. 8sync News only summarizes and links to the original; copyright in the content belongs to the author and the original source.
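The draft-then-verify loop behind speculative decoding can be sketched in a few lines. This is a generic greedy variant built on toy next-token functions, not DFlash itself (which, per the summary, drafts tokens in parallel using the target model's KV projections). The names `toy_target`, `toy_draft`, `speculative_step`, and `generate` are illustrative assumptions, not anything from the Modal post.

```python
from typing import Callable, List

# A "model" here is just a greedy next-token function (toy stand-in).
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The cheap draft model proposes k tokens.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal. A real server scores all k
    #    positions in one batched forward pass; this loop calls the
    #    target per position only to keep the sketch readable.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target adds one more token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


def toy_target(seq: List[int]) -> int:
    return (seq[-1] + 1) % 10          # counts 0..9 and wraps


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except right after a 4, where it guesses wrong.
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

With greedy verification the output is identical to decoding with the target alone; the draft only changes how many target rounds are needed. That is why a better-matched speculator, such as the task-specific one described above, translates directly into lower latency.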
JetBrains researchers present EZ MIA (Error Zone Membership Inference Attack), a lightweight method for detecting whether specific data was used to train fine-tuned LLMs. Unlike existing approaches that rely on aggregate sequence loss or expensive shadow model training, EZ MIA focuses on token-level error positions where memorization signals are most concentrated, requiring only two forward passes per sequence. Experiments on GPT-2, GPT-2-XL, and Llama-2 show EZ MIA outperforms baselines like LOSS, Min-K++, and SPV-MIA by up to 9x. The research also confirms that full fine-tuning creates significantly more membership leakage than LoRA-based fine-tuning, though LoRA does not eliminate the risk entirely — especially for larger models.

An AMD engineer has contributed an ONNX Runtime backend to FFmpeg's DNN (Deep Neural Network) processing filter. The addition enables inference across multiple GPU and NPU platforms, including NVIDIA CUDA, Windows DirectML for all major GPU vendors, and AMD Ryzen AI NPU support via the ONNX Runtime VitisAI execution provider. The contribution is part of AMD's effort to make the Ryzen AI NPU useful within FFmpeg workflows.
Researchers at Ai2 compare token-level prediction differences between their 7B transformer (OLMo 3) and hybrid model (OLMo Hybrid), which combines attention and recurrent layers. The study finds hybrid models outperform transformers on meaning-bearing tokens like nouns, verbs, and adjectives, and on tokens requiring contextual tracking such as pronoun resolution. However, the hybrid's advantage nearly vanishes on verbatim repeated text, where attention's ability to directly look up earlier tokens gives transformers the edge. The work also proposes using filtered token losses — scoring only specific token categories — as a more fine-grained evaluation metric to surface architectural differences during pretraining that aggregate loss metrics would miss.
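A filtered token loss of the kind described above can be sketched as an average over only the tokens in chosen categories. The function name, the tag labels, and the example values are assumptions made for illustration; they are not taken from the Ai2 study.

```python
from typing import Sequence, Set


def filtered_token_loss(losses: Sequence[float], tags: Sequence[str],
                        keep: Set[str]) -> float:
    """Mean per-token loss over tokens whose category tag is in `keep`."""
    if len(losses) != len(tags):
        raise ValueError("losses and tags must have the same length")
    selected = [loss for loss, tag in zip(losses, tags) if tag in keep]
    if not selected:
        raise ValueError("no tokens matched the requested categories")
    return sum(selected) / len(selected)


# Made-up per-token losses and part-of-speech tags for one short sequence.
example_losses = [2.0, 0.5, 4.0, 0.1]
example_tags = ["NOUN", "DET", "VERB", "PUNCT"]

# Score only meaning-bearing tokens and ignore determiners and punctuation.
content_loss = filtered_token_loss(example_losses, example_tags,
                                   {"NOUN", "VERB"})
```

Computing this per category for two models on the same text makes it possible to see where they differ, even when their aggregate losses are nearly equal.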
A reproducible benchmark compares gradient-boosted decision trees (GBDTs) with LLM-based scoring for payment fraud detection across three dimensions: latency, cost, and determinism. On a single CPU core, GBDTs hit a p99 latency of 0.15ms, while LLMs take ~1,200ms, well outside the 100ms ISO 8583 authorization budget. Cost-wise, GBDTs run ~$54/hour at 50K TPS vs. $16,200–$351,000 for LLM tiers. Determinism is the most critical issue for regulated environments: GBDTs return identical scores on identical inputs, while LLMs produce hundreds of distinct outputs even at temperature=0. The recommended architecture keeps deterministic tree ensembles on the synchronous hot path and deploys LLM agents on the asynchronous cold path for SAR drafting, evidence gathering, and agent-as-a-judge validation before human review. All benchmark code is open-source and reproducible on a laptop.
Running three different LLMs simultaneously on a single 8GB GPU fails because llama.cpp pre-allocates the full KV cache upfront, causing OOM errors for the second and third processes. The solution is a C++ daemon called lmxd that implements Connection Admission Control (borrowed from 5G/telecom) as a VRAM ledger: it tracks allocated bytes, enforces a 90% cap, and refuses new agent registrations before any GPU allocation is attempted. The daemon also handles KV-cache swapping to host RAM between agent switches, enabling multiple agents to share one GPU context slot. Additionally, a layer streaming technique using two CUDA streams overlaps compute and weight transfer, achieving ~22-32% wall-clock savings on a GTX 1080. The repo ships the admission control daemon and the streaming primitive as separate, composable components.
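The admission-control idea can be sketched as a small ledger that decides before any GPU memory is touched. The real lmxd daemon is written in C++; the Python below is only a minimal sketch, and the class name `VramLedger`, its methods, and the 3 GiB agent sizes are hypothetical.

```python
GIB = 1024 ** 3


class VramLedger:
    """Tracks promised VRAM and refuses requests that would exceed a cap."""

    def __init__(self, total_bytes: int, cap_percent: int = 90) -> None:
        # Integer arithmetic keeps the limit exact.
        self.limit_bytes = total_bytes * cap_percent // 100
        self.allocated = {}  # agent id -> bytes reserved

    def used_bytes(self) -> int:
        return sum(self.allocated.values())

    def admit(self, agent_id: str, requested_bytes: int) -> bool:
        """Reserve memory for an agent, or refuse without allocating."""
        if requested_bytes <= 0:
            raise ValueError("requested_bytes must be positive")
        if agent_id in self.allocated:
            raise ValueError(f"agent {agent_id!r} is already registered")
        if self.used_bytes() + requested_bytes > self.limit_bytes:
            return False  # rejected up front instead of failing with OOM later
        self.allocated[agent_id] = requested_bytes
        return True

    def release(self, agent_id: str) -> None:
        self.allocated.pop(agent_id, None)


# An 8 GiB card with a 90% cap has room for two 3 GiB agents, not three.
ledger = VramLedger(total_bytes=8 * GIB)
results = [ledger.admit(name, 3 * GIB) for name in ("a", "b", "c")]
```

The point of the design is that the third agent receives a clean refusal at registration time, rather than an out-of-memory error partway through model loading.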
Unconventional AI, led by former Databricks AI chief Naveen Rao, has released Un0, an image-generation model built on a software simulation of a novel oscillator-based computing architecture. The company claims this architecture could reduce AI inference power consumption by up to 1,000x compared to conventional chips. Un0 performs comparably to state-of-the-art diffusion models like Stable Diffusion, serving as a proof-of-concept for the new architecture. The company plans to release actual chip schematics soon and eventually build a full inference stack, positioning itself as a compute provider running at a fraction of current energy costs.
NVIDIA and AWS have announced several joint infrastructure advancements for enterprise AI at scale. New Amazon EC2 G7 instances powered by NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs deliver up to 4.6x AI inference performance and 2.1x graphics performance over G6 instances, with support for up to 8 GPUs and 700 Gbps networking. Amazon OpenSearch Serverless now uses NVIDIA cuVS for GPU-accelerated vector indexing by default, enabling up to 10x faster vector indexing at a quarter of the CPU-only cost, making billion-scale vector databases buildable in under an hour. Additionally, AWS has achieved NVIDIA Exemplar Cloud status for GB300 training workloads, certifying that AWS meets NVIDIA's rigorous performance benchmarks for large-scale AI training.
TokenSpeed-kernel is an open-source, standalone subsystem that provides a clean layered API and registry system for LLM inference kernels across multiple hardware backends. It decouples the high-level runtime from hardware-specific kernel implementations using a decorator-based registration system where kernels declare their platform capabilities, tensor format signatures, and priorities. The selector then dispatches to the best available implementation at runtime. Using GPT-OSS 120B on AMD MI355X (CDNA4) as a validation target, the post demonstrates how Gluon-backed attention and MoE kernels achieve 1.6–3.6x end-to-end throughput improvements over portable Triton baselines, while NVIDIA paths (via FlashInfer/TensorRT-LLM wrappers) use the same public APIs. The AMD-specific kernels are published as a standalone pip package (tokenspeed-kernel-amd) reusable by other inference engines like vLLM.
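A decorator-based registry with priority dispatch, as described above, might look roughly like the following. This is a sketch of the general pattern only: `register_kernel`, `select_kernel`, `KernelEntry`, and the platform and dtype strings are hypothetical names, not the TokenSpeed-kernel API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class KernelEntry:
    op: str          # logical operation, e.g. "attention"
    platform: str    # hardware target, or "any" for a portable fallback
    dtype: str       # tensor format the kernel accepts
    priority: int    # higher wins when several kernels qualify
    fn: Callable


_REGISTRY: List[KernelEntry] = []


def register_kernel(op: str, platform: str, dtype: str, priority: int):
    """Decorator that records a kernel together with its declared capabilities."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.append(KernelEntry(op, platform, dtype, priority, fn))
        return fn
    return decorator


def select_kernel(op: str, platform: str, dtype: str) -> Callable:
    """Pick the highest-priority kernel that supports this platform and dtype."""
    candidates = [
        e for e in _REGISTRY
        if e.op == op and e.dtype == dtype and e.platform in (platform, "any")
    ]
    if not candidates:
        raise LookupError(f"no kernel for {op} on {platform}/{dtype}")
    return max(candidates, key=lambda e: e.priority).fn


@register_kernel("attention", platform="any", dtype="bf16", priority=0)
def attention_portable(x: str) -> str:
    return f"portable({x})"   # placeholder body standing in for a real kernel


@register_kernel("attention", platform="cdna4", dtype="bf16", priority=10)
def attention_cdna4(x: str) -> str:
    return f"cdna4({x})"      # placeholder for a hardware-specific kernel
```

Because the runtime calls only `select_kernel`, adding a faster backend means registering one more function with a higher priority; the calling code does not change.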