ParallelKernelBench (PKB) is a new benchmark testing whether frontier LLMs can write optimized multi-GPU CUDA kernels across 87 real-world workloads. Results show significant gaps: the best model (zero-shot) solves only 28 of 87 problems correctly, with just 22 beating the PyTorch+NCCL baseline. Models struggle with rank coordination, data partitioning, and choosing optimal GPU-to-GPU transfer mechanisms like TMA and NVLS. An agentic feedback loop with Gemini 3 Pro improved results modestly (35 correct, 26 faster-than-baseline) but plateaued after ~20 refinement steps. Despite poor overall performance, a few generated kernels — for NeMo-RL GRPO, Hyena context parallelism, and SAM 3 mask suppression — outperformed any publicly available implementation. PKB is released as an open benchmark to drive further research into LLM-driven distributed kernel optimization.
Source: https://www.together.ai/blog/parallelkernelbench. 8sync News only summarizes and links to the article; copyright in the content belongs to the author and the original source.
Apache Kafka has a flaw in its log compaction mechanism: a conflict between compaction and replication can corrupt data, causing four problems. Deleted data reappears, aborted transactions show up as committed, committed data is hidden, and read_committed consumers see a partition freeze. Redpanda Streaming addresses this with a coordinated compaction protocol that uses offset pairs (MCCO/MTRO, MXFO/MXRO) to ensure tombstones and transaction markers are not removed before all replicas have processed them. The bug can be reproduced on Kafka versions 3.9 through 4.2 using Docker Compose.
Developers should read this article to understand how the race condition in Kafka's log compaction is resolved, which helps avoid data loss and preserve consistency when data is synchronized across multiple brokers.
Tracing helps catch latent problems early when a system changes by following the flow of data and events through a distributed environment. Popular libraries such as OpenTracing, OpenTelemetry, Zipkin, and Jaeger support monitoring, while Digma provides immediate feedback during development.
Developers should read this article to understand how to use tracing to detect and avoid breaking changes in distributed systems, reducing risk when updating or scaling an application.
Cloudflare Workflows now supports saga-style rollbacks, letting developers attach compensation logic directly to each step.do() call. When a multi-step workflow fails, registered rollback handlers execute in reverse step-start order, each running through Workflows' durable step machinery with retries, timeouts, and lifecycle events. The post explains the API design decisions (fluent vs. builder vs. options object), how rollback handlers are stored as callable stubs, how replay rebuilds handlers after engine restarts, and the key behavioral rules around ordering and eligibility for failed steps.
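The ordering rule in the summary above can be illustrated with a minimal saga sketch. This is not the Cloudflare Workflows API (that API is TypeScript and built around step.do()); `Saga`, `do`, and `rollback` are hypothetical names chosen for illustration. The sketch assumes steps run sequentially and that only steps that finished successfully have their compensation run. The post's actual eligibility rules for failed steps, along with retries, timeouts, and replay, are not modeled here.

```python
from typing import Any, Callable, List, Optional, Tuple


class Saga:
    """Hypothetical helper: runs steps and remembers how to undo them."""

    def __init__(self) -> None:
        self._compensations: List[Tuple[str, Callable[[], None]]] = []
        self.log: List[str] = []

    def do(
        self,
        name: str,
        action: Callable[[], Any],
        rollback: Optional[Callable[[], None]] = None,
    ) -> Any:
        # Run the step; if it raises, nothing is registered for it
        # (an assumption of this sketch, not a documented rule).
        result = action()
        if rollback is not None:
            self._compensations.append((name, rollback))
        self.log.append(f"done:{name}")
        return result

    def rollback(self) -> None:
        # Undo in reverse order: the most recently started step first.
        while self._compensations:
            name, undo = self._compensations.pop()
            undo()
            self.log.append(f"undo:{name}")


def _fail() -> None:
    raise RuntimeError("shipping unavailable")


def run_order(saga: Saga) -> bool:
    """Three-step workflow whose last step fails and triggers rollback."""
    try:
        saga.do("reserve", lambda: "r1", rollback=lambda: None)
        saga.do("charge", lambda: "c1", rollback=lambda: None)
        saga.do("ship", _fail)
    except RuntimeError:
        saga.rollback()
        return False
    return True
```

Calling `run_order(Saga())` runs "reserve" and "charge", fails on "ship", and then undoes "charge" before "reserve".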
Running three different LLMs simultaneously on a single 8GB GPU fails because llama.cpp pre-allocates the full KV cache upfront, causing OOM errors for the second and third processes. The solution is a C++ daemon called lmxd that implements Connection Admission Control (borrowed from 5G/telecom) as a VRAM ledger: it tracks allocated bytes, enforces a 90% cap, and refuses new agent registrations before any GPU allocation is attempted. The daemon also handles KV-cache swapping to host RAM between agent switches, enabling multiple agents to share one GPU context slot. Additionally, a layer streaming technique using two CUDA streams overlaps compute and weight transfer, achieving ~22-32% wall-clock savings on a GTX 1080. The repo ships the admission control daemon and the streaming primitive as separate, composable components.
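The admission-control idea can be sketched as a small ledger. The real lmxd daemon is written in C++ and its interface is not shown in this summary, so the class and method names below (`VramLedger`, `register`, `release`) are hypothetical. Only the 90% cap and the "refuse before allocating" behavior come from the description above.

```python
from typing import Dict


class VramLedger:
    """Hypothetical ledger: admits an agent only if its request fits the cap."""

    def __init__(self, total_bytes: int, cap_fraction: float = 0.90) -> None:
        # Budget is a fixed fraction of physical VRAM (90% per the summary).
        self.limit = int(total_bytes * cap_fraction)
        self.allocated: Dict[str, int] = {}

    def used(self) -> int:
        return sum(self.allocated.values())

    def register(self, agent: str, requested_bytes: int) -> bool:
        # Decide on paper, before any GPU allocation would be attempted.
        if agent in self.allocated:
            return False
        if self.used() + requested_bytes > self.limit:
            return False
        self.allocated[agent] = requested_bytes
        return True

    def release(self, agent: str) -> None:
        # Return the agent's bytes to the budget; unknown agents are ignored.
        self.allocated.pop(agent, None)
```

With an 8 GiB card, two 3 GiB agents are admitted and a third 3 GiB request is refused up front rather than failing later with an out-of-memory error; it is admitted once one of the first two is released.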

A step-by-step guide for building a custom RHEL 10 kernel and NVIDIA GPU driver from source to enable compatibility with the NVIDIA DGX Spark (GB10 Grace Blackwell Superchip) platform. Covers cloning the RHEL 10 kernel repo customized for GB10, installing build dependencies, compiling kernel RPMs (kernel-64k variant for ARM64), creating an RPM spec file for NVIDIA open-source kernel modules, building and installing the GPU driver RPM, installing CUDA via the NVIDIA repo and EPEL, and handling kernel updates via git reset. Intended as a Developer Preview workflow, not for production use.
Zalando's engineering team built an in-process client-side load balancer (CSLB) to handle over a million requests per second of internal fan-out traffic for their Product Read API, replacing shared Skipper ingress hops. The implementation replicates Skipper's xxHash64 consistent-hash ring for cache locality, uses a Kubernetes watch-based informer for pod discovery, and adds N-ring fade-in to prevent cold-cache spikes on scale-up. A key innovation is occupancy-based bounded load: each backend's load is measured via Little's Law as seconds of work per second, rather than by in-flight counts or throughput, and combined with a latency multiplier borrowed from Finagle. Results include shrinking the Skipper fleet from 50+ pods to 8, reducing their own pod fleet by 25%, and saving over $1,000/day. AZ-aware routing was prototyped but paused due to edge cases around bounded-load threshold miscalculation during dual fade-in. The post also covers pipeline improvements, retry hardening, FIFO buffering, and how detailed logging revealed mysterious node-level network freezes that had previously been invisible.
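A rough sketch of consistent hashing with an occupancy-based bound, under several assumptions: SHA-256 from the standard library stands in for xxHash64, the overload factor of 1.25 is an arbitrary illustrative value, and the Finagle-style latency multiplier and N-ring fade-in are left out. The names `Ring`, `occupancy`, and `pick` are hypothetical and not taken from Zalando's code. Occupancy follows Little's Law: request rate times mean latency gives seconds of work per second.

```python
import hashlib
from bisect import bisect_right
from typing import Dict, Iterator, List


def _hash(key: str) -> int:
    # Stand-in hash; the post describes xxHash64, which is not in the stdlib.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class Ring:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, backends: List[str], vnodes: int = 64) -> None:
        self.points = sorted(
            (_hash(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.points]

    def walk(self, key: str) -> Iterator[str]:
        """Yield distinct backends in ring order, starting at key's position."""
        start = bisect_right(self.keys, _hash(key))
        seen = set()
        for i in range(len(self.points)):
            backend = self.points[(start + i) % len(self.points)][1]
            if backend not in seen:
                seen.add(backend)
                yield backend


def occupancy(requests_per_s: float, mean_latency_s: float) -> float:
    # Little's Law: L = lambda * W, i.e. seconds of work per second.
    return requests_per_s * mean_latency_s


def pick(ring: Ring, key: str, occ: Dict[str, float], factor: float = 1.25) -> str:
    """Prefer the key's home backend unless it exceeds factor * mean occupancy."""
    threshold = factor * (sum(occ.values()) / len(occ))
    preferred = None
    for backend in ring.walk(key):
        if preferred is None:
            preferred = backend
        if occ[backend] <= threshold:
            return backend
    return preferred  # every backend is over the bound; keep cache locality
```

When occupancy is even, a key always lands on its home backend, which preserves cache locality; only when that backend is well above the fleet average does the request spill to the next backend on the ring.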
Part eleven of an event sourcing series explores how to handle consistency boundaries without relying on DDD aggregates or Dynamic Consistency Boundaries (DCBs). The author argues that the best approach depends on the actual problems at hand. Two alternatives are discussed: replacing concurrent designs with non-concurrent ones (e.g., a draft-registration phase processed by a single-threaded algorithm), and using Azure Service Bus sessions to serialize workday validation, eliminating race conditions within a consistency boundary. The post emphasizes solving real problems holistically rather than applying patterns preemptively, and shows how task-based UIs and small data models reduce the likelihood of concurrency conflicts in the first place.
AMD's GPU market share has collapsed from 36% in 2018 to just 5% by end of 2025. Beyond ray tracing, AMD users face real compromises: Nvidia's DLSS suite (including Multi Frame Generation, Ray Reconstruction, and Dynamic Frame Generation) still leads FSR in image quality, game support (97% vs 72%), and backward compatibility. For AI workloads, CUDA remains the dominant platform with better day-one library support over AMD's ROCm. Creative professionals using Adobe Premiere, After Effects, or Blender also benefit from Nvidia's CUDA acceleration and NVENC encoder quality. AMD has closed gaps significantly, especially with RDNA 4 and FSR 4.1, but Nvidia retains a genuine lead across gaming, AI, and creative software ecosystems.