#deep-learning · 8sync News

Which tokens does a hybrid model predict better?

Researchers at Ai2 compare token-level prediction differences between their 7B transformer (OLMo 3) and hybrid model (OLMo Hybrid), which combines attention and recurrent layers. The study finds hybrid models outperform transformers on meaning-bearing tokens like nouns, verbs, and adjectives, and on tokens requiring contextual tracking such as pronoun resolution. However, the hybrid's advantage nearly vanishes on verbatim repeated text, where attention's ability to directly look up earlier tokens gives transformers the edge. The work also proposes using filtered token losses — scoring only specific token categories — as a more fine-grained evaluation metric to surface architectural differences during pretraining that aggregate loss metrics would miss.
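The filtered-loss idea can be sketched in a few lines. This is a minimal illustration, not Ai2's evaluation code: it assumes you already have a per-token loss and a category label (for example a part-of-speech tag) for every token, and the function name `filtered_token_losses` and the tag names are hypothetical.

```python
from collections import defaultdict


def filtered_token_losses(token_losses, token_tags, keep=None):
    """Average per-token loss within each category.

    token_losses: per-token negative log-likelihoods (floats).
    token_tags:   one category label per token (assumed to come from
                  an external tagger; not specified in the summary).
    keep:         optional set of categories to score; others are skipped.
    """
    if len(token_losses) != len(token_tags):
        raise ValueError("need exactly one tag per token loss")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for loss, tag in zip(token_losses, token_tags):
        if keep is not None and tag not in keep:
            continue
        sums[tag] += loss
        counts[tag] += 1
    return {tag: sums[tag] / counts[tag] for tag in sums}


# Toy numbers, invented purely to show the call shape.
losses = [2.0, 1.0, 4.0, 3.0]
tags = ["NOUN", "DET", "NOUN", "VERB"]
content_only = filtered_token_losses(losses, tags, keep={"NOUN", "VERB"})
print(content_only)
```

Running the same function on two models' losses for the same text and subtracting the per-category averages gives a per-category gap, which is the kind of difference an aggregate loss would average away.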

Our Research on Membership Inference Attacks and Preventing Privacy Leaks

JetBrains researchers present EZ MIA (Error Zone Membership Inference Attack), a lightweight method for detecting whether specific data was used to train fine-tuned LLMs. Unlike existing approaches that rely on aggregate sequence loss or expensive shadow model training, EZ MIA focuses on token-level error positions where memorization signals are most concentrated, requiring only two forward passes per sequence. Experiments on GPT-2, GPT-2-XL, and Llama-2 show EZ MIA outperforms baselines like LOSS, Min-K++, and SPV-MIA by up to 9x. The research also confirms that full fine-tuning creates significantly more membership leakage than LoRA-based fine-tuning, though LoRA does not eliminate the risk entirely — especially for larger models.
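To make the "error position" idea concrete, here is a rough sketch. The summary does not give the scoring formula, so everything below is an assumption made for illustration: the two forward passes are taken to be the fine-tuned model and a reference model, an error position is taken to be a token the fine-tuned model's top prediction gets wrong, and the score is the share of the probability shift at those positions that favors the true token. The function name `error_zone_score` is hypothetical.

```python
def error_zone_score(target_probs, reference_probs, target_correct):
    """Illustrative score in [0, 1]; NOT the formula from the EZ MIA work.

    target_probs[i]:    probability the fine-tuned model gives true token i.
    reference_probs[i]: probability a reference model gives true token i.
    target_correct[i]:  True if the fine-tuned model's top-1 guess at
                        position i equals the true token.
    Only positions where the fine-tuned model is wrong are scored.
    """
    if not (len(target_probs) == len(reference_probs) == len(target_correct)):
        raise ValueError("all three inputs must have the same length")
    gains = 0.0   # probability mass moved toward the true token
    drops = 0.0   # probability mass moved away from the true token
    for p_target, p_ref, correct in zip(target_probs, reference_probs, target_correct):
        if correct:
            continue  # skip positions the model already gets right
        delta = p_target - p_ref
        if delta > 0:
            gains += delta
        else:
            drops -= delta
    total = gains + drops
    # With no error positions or no shift, return the neutral value 0.5.
    return gains / total if total > 0 else 0.5


score = error_zone_score(
    target_probs=[0.9, 0.3, 0.2],
    reference_probs=[0.5, 0.1, 0.3],
    target_correct=[True, False, False],
)
print(round(score, 3))
```

Under these assumptions, a score well above 0.5 means the fine-tuned model leans toward the true tokens even where it still predicts wrongly, which is the sort of memorization signal the summary describes.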

AMD Contributes ONNX Runtime Backend To FFmpeg DNN Filter

An AMD engineer has contributed an ONNX Runtime backend to FFmpeg's DNN (Deep Neural Network) processing filter. The addition enables inference across multiple GPU and NPU platforms, including NVIDIA CUDA, Windows DirectML for all major GPU vendors, and the AMD Ryzen AI NPU via the ONNX Runtime VitisAI execution provider. The contribution is part of AMD's effort to make the Ryzen AI NPU useful within FFmpeg workflows.

Achieve state-of-the-art inference latencies with speculative decoding

Modal and Decagon collaborated to achieve state-of-the-art LLM inference latency using speculative decoding. The post outlines a four-part low-latency playbook: minimizing client-server communication, reducing host overhead, using speed-of-light GPU kernels (e.g., Flash Attention 4 on Blackwell GPUs), and applying speculative decoding with high-quality draft models. The key breakthrough was the DFlash speculative decoding technique from Z Lab, which uses KV projections from the target model and generates draft tokens in parallel. On top of a generic DFlash speculator, they performed task-specific 'mid-training' using synthetic data to fine-tune the speculator for Decagon's voice AI workload. This custom speculator cut an additional 100ms off end-to-end latency — roughly 40% of server-side decode latency — resulting in a system 60ms faster than the best proprietary inference providers. The post also previews an 'autospec' feature for continual speculator improvement.
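For readers new to the technique, the draft-and-verify loop behind speculative decoding can be sketched with toy models. This is generic greedy speculative decoding, not DFlash: it drafts tokens one at a time rather than in parallel, it uses no KV projections, and the names `speculative_decode`, `target_next`, and `draft_next` are hypothetical.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: one target-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, verification_rounds)."""
    tokens = list(prompt)
    produced = 0
    rounds = 0
    while produced < max_new_tokens:
        # 1. The cheap draft model proposes k tokens.
        draft, ctx = [], list(tokens)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. The target model checks the proposals. A real system scores all
        #    positions in one batched pass; this toy calls it per position
        #    and counts the whole check as one round.
        rounds += 1
        accepted, ctx = [], list(tokens)
        for tok in draft:
            expected = target_next(ctx)
            if expected != tok:
                accepted.append(expected)  # replace the first wrong draft token
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_next(ctx))  # all accepted: one bonus token
        accepted = accepted[: max_new_tokens - produced]
        tokens.extend(accepted)
        produced += len(accepted)
    return tokens, rounds


# Toy "models" over digits: the target counts upward mod 10,
# and the draft agrees except right after the token 4.
def target_next(ctx):
    return (ctx[-1] + 1) % 10


def draft_next(ctx):
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


out, rounds = speculative_decode(target_next, draft_next, [2], max_new_tokens=8)
print(out, rounds)
```

The output always matches plain greedy decoding with the target model; the saving comes from needing fewer verification rounds than tokens. That is why draft quality matters: the more draft tokens the target accepts per round, the fewer rounds are needed.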

Open models, global networks: How AT&T and GSMA are accelerating innovation with Gemma

AT&T and GSMA have collaborated to build OTel, a family of open telecom-specific AI models fine-tuned on Gemma (Google's open-source model family). Trained on a specialized telco dataset curated by GSMA, operators, equipment vendors, and academia, the initiative produced 30 models across various sizes. Gemma-4-E4B-it achieved 91.74% accuracy — the highest among all tested architectures. The models use RAG-based training to reduce hallucinations, critical for regulated telecom environments. OTel has surpassed 18 million downloads and ranks among the top models on Open Telco Benchmarks, demonstrating that smaller domain-specific models can outperform larger general-purpose frontier models in specialized tasks like network configuration and self-healing systems.