Together AI · 8 min read · 2 days ago

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

ParallelKernelBench (PKB) is a new benchmark that tests whether frontier LLMs can write optimized multi-GPU CUDA kernels across 87 real-world workloads. The results show significant gaps: zero-shot, the best model solves only 28 of the 87 problems correctly, and just 22 of its kernels beat the PyTorch+NCCL baseline. Models struggle with rank coordination, data partitioning, and choosing the best GPU-to-GPU transfer mechanism, such as TMA or NVLS. An agentic feedback loop with Gemini 3 Pro improved results modestly (35 correct, 26 faster than baseline) but plateaued after about 20 refinement steps. Despite the poor overall performance, a few generated kernels (for NeMo-RL GRPO, Hyena context parallelism, and SAM 3 mask suppression) outperformed any publicly available implementation. PKB is released as an open benchmark to drive further research into LLM-driven distributed kernel optimization.
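
To put the reported counts in proportion, the short Python sketch below converts them into rates over the 87 workloads. The counts are taken from the summary above; the dictionary layout, the setting labels, and the `rate` helper are illustrative names chosen here, not part of PKB.

```python
# Counts as reported in the summary; the structure and names are illustrative only.
TOTAL_PROBLEMS = 87

results = {
    "best model, zero-shot": {"correct": 28, "faster_than_baseline": 22},
    "Gemini 3 Pro, agentic loop": {"correct": 35, "faster_than_baseline": 26},
}


def rate(count: int, total: int = TOTAL_PROBLEMS) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


for setting, counts in results.items():
    print(
        f"{setting}: {rate(counts['correct'])}% correct, "
        f"{rate(counts['faster_than_baseline'])}% faster than PyTorch+NCCL"
    )
```

Even with the feedback loop, only about 40% of the problems are solved correctly and fewer than 30% beat the baseline.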


#distributed-systems #cuda

Source: https://www.together.ai/blog/parallelkernelbench. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.

Recommended for you

Redpanda · 12 min · 4 hours ago · AI

Kafka's log compaction corrupts data. Here's how we fixed it

Apache Kafka has a flaw in its log compaction mechanism that corrupts data because of a conflict between compaction and replication. It causes four problems: deleted data reappears, aborted transactions show up as committed, committed data is hidden, and read_committed consumers get stuck on a partition. Redpanda Streaming fixes this with a coordinated compaction protocol that uses offset pairs (MCCO/MTRO, MXFO/MXRO) to ensure that tombstones and transaction markers are not removed before all replicas have processed them. The bug can be reproduced on Kafka versions 3.9 through 4.2 using Docker Compose.

Developers should read this article to understand how the race condition in Kafka's log compaction is resolved, which helps avoid data loss and preserve consistency when data is synchronized across multiple brokers.
