TokenSpeed-kernel is an open-source, standalone subsystem that provides a clean layered API and registry system for LLM inference kernels across multiple hardware backends. It decouples the high-level runtime from hardware-specific kernel implementations using a decorator-based registration system where kernels declare their platform capabilities, tensor format signatures, and priorities. The selector then dispatches to the best available implementation at runtime. Using GPT-OSS 120B on AMD MI355X (CDNA4) as a validation target, the post demonstrates how Gluon-backed attention and MoE kernels achieve 1.6–3.6x end-to-end throughput improvements over portable Triton baselines, while NVIDIA paths (via FlashInfer/TensorRT-LLM wrappers) use the same public APIs. The AMD-specific kernels are published as a standalone pip package (tokenspeed-kernel-amd) reusable by other inference engines like vLLM.
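To make the registration-and-dispatch idea concrete, here is a minimal sketch of a decorator-based kernel registry with priority-based selection. It is not the TokenSpeed-kernel API: `register_kernel`, `select_kernel`, the platform strings, and the priority values are all hypothetical names chosen for illustration, and tensor format signatures are left out to keep it short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class KernelEntry:
    """One registered implementation of an operation (hypothetical structure)."""
    fn: Callable
    platforms: FrozenSet[str]
    priority: int


# Maps an operation name to every implementation registered for it.
_REGISTRY: Dict[str, List[KernelEntry]] = {}


def register_kernel(op: str, platforms: Iterable[str], priority: int = 0):
    """Decorator: declare which platforms a kernel supports and how preferred it is."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.setdefault(op, []).append(
            KernelEntry(fn, frozenset(platforms), priority)
        )
        return fn
    return decorator


def select_kernel(op: str, platform: str) -> Callable:
    """Return the highest-priority implementation of `op` that supports `platform`."""
    candidates = [e for e in _REGISTRY.get(op, []) if platform in e.platforms]
    if not candidates:
        raise LookupError(f"no kernel registered for {op!r} on {platform!r}")
    return max(candidates, key=lambda e: e.priority).fn


# Assumed example kernels: a portable baseline and a platform-tuned variant.
@register_kernel("attention", platforms={"cuda", "rocm"}, priority=0)
def attention_portable(x: str) -> str:
    return f"portable({x})"


@register_kernel("attention", platforms={"rocm"}, priority=10)
def attention_rocm_tuned(x: str) -> str:
    return f"rocm_tuned({x})"
```

With these assumed registrations, `select_kernel("attention", "rocm")` picks the tuned variant, while `select_kernel("attention", "cuda")` falls back to the portable one. The calling code stays the same on both platforms.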
Source: https://pytorch.org/blog/lightseek-tokenspeed-kernel. 8sync News only summarizes and links; copyright in the content belongs to the author and the original source.
Google Cloud has introduced TPU Developer Hub, a centralized learning platform for ML developers who use TPUs. It covers hardware architecture, the software stack (XLA, Pallas kernels), the XProf debugging tool, optimization strategies (such as KV cache offloading), and networking and security. The content ranges from interactive Colabs and open-source code to in-depth documentation, with support for AI-assisted development.
ML developers should read it to understand how to optimize model performance and cost on TPUs using XLA, Pallas, and parallelism strategies, saving time and resources when deploying AI products.
Unconventional AI, led by former Databricks AI chief Naveen Rao, has released Un0, an image-generation model built on a software simulation of a novel oscillator-based computing architecture. The company claims this architecture could reduce AI inference power consumption by up to 1,000x compared to conventional chips. Un0 performs comparably to state-of-the-art diffusion models like Stable Diffusion, serving as a proof-of-concept for the new architecture. The company plans to release actual chip schematics soon and eventually build a full inference stack, positioning itself as a compute provider running at a fraction of current energy costs.
A survey of 321 Kubernetes practitioners reveals a sharp trust asymmetry: 82% trust automated code delivery, but only 27% allow automation to change CPU and memory without human review. The core reason is that resource changes alter the invisible contract between workloads and the scheduler, with failure modes that are delayed and hard to diagnose. AI inference workloads are intensifying this problem because GPU compute is expensive, inference jobs are bursty and unfamiliar, and manual optimization breaks down past ~250 changes per day. The post argues that closing the trust gap requires 'adaptive autonomy' — automation designed to work at every stage of the trust curve, from read-only recommendations to guardrailed execution to closed-loop optimization — rather than forcing full delegation upfront.
Cerebras posted strong Q1 results — revenue up 92% to $193.4m and a narrowed net loss — yet its stock fell ~10% after the company warned gross margins would drop sharply from 46.5% to 36–38% in Q2. The culprit is not chip supply but a shortage of data-centre space and power. Cerebras is renting back its own systems and building capacity at speed, costs that will shave 10–15 margin points this year. CEO Andrew Feldman called it a 'grand irony' that buildings, not chips, are now the limiting factor. The company guided full-year revenue of $855–865m, above analyst estimates, and highlighted a $20bn+ OpenAI inference deal and a 178% jump in cloud/services revenue. Still, the stock has fallen ~28% from its post-IPO peak, weighed down by high expectations, a broader chip-sector sell-off, and concentration risk around a few large customers.
DFlash is an open source block-diffusion speculative decoding method that replaces sequential autoregressive drafting with parallel block-level token prediction. On NVIDIA Blackwell hardware, it delivers up to 15x throughput improvement over autoregressive decoding for gpt-oss-120b at high interactivity targets, and outperforms EAGLE-3 speculative decoding by 1.5x. The technique uses three key mechanisms: block-diffusion drafting, target hidden-state conditioning, and KV injection. Twenty model checkpoints covering Qwen, Llama, Gemma, Kimi K2.6, and gpt-oss families are available on Hugging Face, with integration support for TensorRT-LLM, vLLM (via the Speculators library), and SGLang requiring minimal config changes and no application refactoring.
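For readers new to speculative decoding, the sketch below shows the generic greedy draft-and-verify step that such methods build on. It does not implement DFlash's block-diffusion drafter, hidden-state conditioning, or KV injection. The function `verify_block` and the toy target model are hypothetical, and a real engine would score every drafted position in a single target forward pass rather than call the target once per token.

```python
from typing import Callable, List


def verify_block(
    context: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of a drafted block of tokens.

    Accepts drafted tokens while they match the target model's greedy choice.
    At the first mismatch, the target's token is emitted instead and the rest
    of the draft is discarded. If the whole draft is accepted, the target
    contributes one extra token.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft token
            return accepted
        accepted.append(tok)
    # Every drafted token matched, so append the target's next token as a bonus.
    accepted.append(target_next(context + accepted))
    return accepted


def toy_target(tokens: List[int]) -> int:
    """Assumed stand-in for a target model: always predicts (last token + 1) mod 10."""
    return (tokens[-1] + 1) % 10
```

The output always equals what the target would have produced on its own with greedy decoding. The speedup comes from how many drafted tokens are accepted per verification step.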
Upbound has launched Modelplane, an open source control plane built on Crossplane that lets IT teams manage AI inference engines using the same declarative workflows they use for Kubernetes clusters. Modelplane supports deploying inference engines based on available GPU capacity across cluster fleets, autoscaling replicas, caching and distributing model weights, and routing inference requests through a unified gateway. Available under the Apache 2.0 license with no usage caps, it aims to integrate AI inference workload management into existing cloud-native operations without requiring specialized staff.
NVIDIA and AWS have announced several joint infrastructure advancements for enterprise AI at scale. New Amazon EC2 G7 instances powered by NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs deliver up to 4.6x AI inference performance and 2.1x graphics performance over G6 instances, with support for up to 8 GPUs and 700 Gbps networking. Amazon OpenSearch Serverless now uses NVIDIA cuVS for GPU-accelerated vector indexing by default, enabling up to 10x faster vector indexing at a quarter of the CPU-only cost, making billion-scale vector databases buildable in under an hour. Additionally, AWS has achieved NVIDIA Exemplar Cloud status for GB300 training workloads, certifying that AWS meets NVIDIA's rigorous performance benchmarks for large-scale AI training.
Running three different LLMs simultaneously on a single 8GB GPU fails because llama.cpp pre-allocates the full KV cache upfront, causing OOM errors for the second and third processes. The solution is a C++ daemon called lmxd that implements Connection Admission Control (borrowed from 5G/telecom) as a VRAM ledger: it tracks allocated bytes, enforces a 90% cap, and refuses new agent registrations before any GPU allocation is attempted. The daemon also handles KV-cache swapping to host RAM between agent switches, enabling multiple agents to share one GPU context slot. Additionally, a layer streaming technique using two CUDA streams overlaps compute and weight transfer, achieving ~22-32% wall-clock savings on a GTX 1080. The repo ships the admission control daemon and the streaming primitive as separate, composable components.
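The admission-control ledger described above can be sketched in a few lines. This is an illustration in Python, not the C++ code of lmxd: the class `VramLedger`, its method names, and the byte figures in the usage note are assumptions. Only the ideas of tracking allocated bytes, enforcing a 90% cap, and refusing registration before any GPU allocation come from the summary.

```python
from typing import Dict


class VramLedger:
    """Bookkeeping-only admission control: no GPU memory is touched here."""

    def __init__(self, total_bytes: int, cap_percent: int = 90) -> None:
        # Integer arithmetic keeps the budget exact for large byte counts.
        self.budget: int = total_bytes * cap_percent // 100
        self.allocations: Dict[str, int] = {}

    @property
    def used(self) -> int:
        """Total bytes currently promised to registered agents."""
        return sum(self.allocations.values())

    def register(self, agent_id: str, requested_bytes: int) -> bool:
        """Admit the agent only if its request fits under the cap."""
        if agent_id in self.allocations:
            raise ValueError(f"agent {agent_id!r} is already registered")
        if self.used + requested_bytes > self.budget:
            return False  # refused before any allocation is attempted
        self.allocations[agent_id] = requested_bytes
        return True

    def release(self, agent_id: str) -> None:
        """Return an agent's reservation to the pool (no-op if unknown)."""
        self.allocations.pop(agent_id, None)
```

For example, on an assumed 8 GiB card with three agents that each request 3 GiB, the first two registrations succeed and the third is refused up front. The caller gets a clean rejection instead of an out-of-memory failure partway through model loading.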