PyTorch · 0 comments · 18 min read · 6 hours ago

TokenSpeed-Kernel: Portable APIs and High-Performance Kernels for Multi-Silicon LLM Inference – PyTorch

TokenSpeed-kernel is an open-source, standalone subsystem that provides a clean, layered API and a registry system for LLM inference kernels across multiple hardware backends. It decouples the high-level runtime from hardware-specific kernel implementations through a decorator-based registration system in which kernels declare their platform capabilities, tensor format signatures, and priorities; a selector then dispatches to the best available implementation at runtime. Using GPT-OSS 120B on AMD MI355X (CDNA4) as a validation target, the post demonstrates that Gluon-backed attention and MoE kernels achieve 1.6–3.6x end-to-end throughput improvements over portable Triton baselines, while the NVIDIA paths (via FlashInfer/TensorRT-LLM wrappers) use the same public APIs. The AMD-specific kernels are published as a standalone pip package (tokenspeed-kernel-amd) that other inference engines, such as vLLM, can reuse.
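The registration-and-dispatch idea described above can be illustrated with a minimal sketch. This is not the TokenSpeed-kernel API: every name here (`register_kernel`, `select_kernel`, `KernelEntry`, the platform and format strings, and the two placeholder kernels) is hypothetical and chosen only to show how a decorator can record capabilities and a priority, and how a selector might pick the highest-priority match at runtime.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class KernelEntry:
    """One registered implementation (hypothetical structure)."""
    fn: Callable
    platforms: FrozenSet[str]   # hardware targets the kernel supports
    formats: FrozenSet[str]     # tensor formats the kernel accepts
    priority: int               # higher wins when several kernels match


# Maps an operation name to all implementations registered for it.
_REGISTRY: Dict[str, List[KernelEntry]] = {}


def register_kernel(op, *, platforms, formats, priority=0):
    """Decorator that records a kernel and the capabilities it declares."""
    def decorator(fn):
        entry = KernelEntry(fn, frozenset(platforms), frozenset(formats), priority)
        _REGISTRY.setdefault(op, []).append(entry)
        return fn
    return decorator


def select_kernel(op, platform, fmt):
    """Return the highest-priority kernel matching the platform and format."""
    candidates = [
        e for e in _REGISTRY.get(op, [])
        if platform in e.platforms and fmt in e.formats
    ]
    if not candidates:
        raise LookupError(f"no kernel for {op!r} on {platform!r} with {fmt!r}")
    return max(candidates, key=lambda e: e.priority).fn


# Placeholder kernels: real ones would launch device code.
@register_kernel("attention", platforms={"cdna4", "hopper"}, formats={"bf16"}, priority=0)
def attention_portable(q, k, v):
    return "portable"


@register_kernel("attention", platforms={"cdna4"}, formats={"bf16"}, priority=10)
def attention_cdna4_tuned(q, k, v):
    return "cdna4-tuned"
```

In this sketch, `select_kernel("attention", "cdna4", "bf16")` resolves to the tuned kernel because it declares a higher priority, while the same call with `"hopper"` falls back to the portable one. The caller uses one entry point in both cases, which is the decoupling the summary attributes to the registry design.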


#pytorch #ai-inference

Source: https://pytorch.org/blog/lightseek-tokenspeed-kernel. 8sync News only summarizes and links to articles; copyright in the content belongs to the author and the original source.
