#cuda · 8sync News

3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal

Running three different LLMs simultaneously on a single 8GB GPU fails because llama.cpp pre-allocates the full KV cache upfront, causing OOM errors for the second and third processes. The solution is a C++ daemon called lmxd that implements Connection Admission Control (borrowed from 5G/telecom) as a VRAM ledger: it tracks allocated bytes, enforces a 90% cap, and refuses new agent registrations before any GPU allocation is attempted. The daemon also handles KV-cache swapping to host RAM between agent switches, enabling multiple agents to share one GPU context slot. Additionally, a layer streaming technique using two CUDA streams overlaps compute and weight transfer, achieving ~22-32% wall-clock savings on a GTX 1080. The repo ships the admission control daemon and the streaming primitive as separate, composable components.
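
The VRAM ledger idea above can be sketched in a few lines of host-side C++. This is only an illustration of admission control as bookkeeping, not lmxd's actual code: the class name `VramLedger`, its methods, and the per-agent byte estimates are all hypothetical, and the 90% cap is the only figure taken from the summary. The point it shows is that the accept/reject decision is made from the ledger alone, before any GPU allocation would be attempted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical ledger (names are illustrative, not from the lmxd repo).
// It performs no GPU calls: it only tracks bytes promised to each agent.
class VramLedger {
public:
    // total_bytes: physical VRAM on the device.
    // cap_fraction: share of VRAM the ledger may hand out (0.90 per the summary).
    explicit VramLedger(std::uint64_t total_bytes, double cap_fraction = 0.90)
        : budget_(static_cast<std::uint64_t>(
              static_cast<double>(total_bytes) * cap_fraction)) {
        assert(total_bytes > 0 && cap_fraction > 0.0 && cap_fraction <= 1.0);
    }

    // Admit an agent only if its whole request fits under the cap.
    // Returns false without changing any state when the request is refused.
    bool admit(const std::string& agent, std::uint64_t bytes) {
        if (reserved_.count(agent) != 0) return false;  // already registered
        if (bytes > budget_ - used_) return false;      // would exceed the cap
        reserved_.emplace(agent, bytes);
        used_ += bytes;
        return true;
    }

    // Return an agent's reservation to the pool; unknown agents are ignored.
    void release(const std::string& agent) {
        auto it = reserved_.find(agent);
        if (it == reserved_.end()) return;
        used_ -= it->second;
        reserved_.erase(it);
    }

    std::uint64_t used() const { return used_; }
    std::uint64_t budget() const { return budget_; }

private:
    std::uint64_t budget_;    // bytes available under the cap
    std::uint64_t used_ = 0;  // bytes currently reserved
    std::unordered_map<std::string, std::uint64_t> reserved_;
};
```

With an 8 GiB device and the 90% cap, two agents that each reserve 3 GiB are admitted and a third 3 GiB request is refused until one of the first two is released. A real daemon would presumably derive each request from model weights plus the pre-allocated KV cache size, which this sketch leaves to the caller.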

Accelerating BEV Pooling on NVIDIA GPUs for Physical AI Applications

BEVPoolV3 is a new CUDA kernel optimization for bird's-eye-view (BEV) pooling used in autonomous vehicles and robotics. The post walks through a practical GPU optimization workflow: classify whether the working set fits in L2 cache, remove redundant scatter traffic via a five-array INT32 scatter map, implement interval-owned scatter-reduce to avoid atomics, and validate with NVIDIA Nsight Compute. On RTX PRO 6000 Blackwell Max-Q (large L2), BEVPoolV3 FP8 achieves up to 42x speedup over the V2 baseline. On RTX A6000 (small L2, DRAM-bound), the adapted FP16 path reaches 19x speedup. The post also explains why FP8 outperforms NVFP4 for L2-resident scatter-reduce workloads, and how the same methodology applies to sparse embeddings, voxelization, and other irregular memory-bound kernels.
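
The interval-owned scatter-reduce step can be illustrated on the CPU. The sketch below is an assumption-laden reading of the general technique, not the BEVPoolV3 kernel: it assumes points have already been sorted by destination BEV cell, uses a single float feature per point, and the names `Interval`, `build_intervals`, and `pool_by_interval` are hypothetical. It does not model the five-array INT32 scatter map or any FP8/FP16 path. What it shows is why atomics become unnecessary: each output cell is written by exactly one owner.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One contiguous run of points that all land in the same BEV cell.
struct Interval {
    std::int32_t cell;   // destination cell index
    std::int32_t begin;  // first point in the run (inclusive)
    std::int32_t end;    // one past the last point in the run
};

// Split a cell-id array, sorted ascending, into one interval per distinct cell.
std::vector<Interval> build_intervals(const std::vector<std::int32_t>& sorted_cells) {
    std::vector<Interval> out;
    const std::int32_t n = static_cast<std::int32_t>(sorted_cells.size());
    std::int32_t begin = 0;
    for (std::int32_t i = 1; i <= n; ++i) {
        // Close the current run at the end of input or when the cell id changes.
        if (i == n || sorted_cells[i] != sorted_cells[begin]) {
            out.push_back({sorted_cells[begin], begin, i});
            begin = i;
        }
    }
    return out;
}

// Sum the features of each interval into its cell. On a GPU each iteration of
// the outer loop would map to one thread or warp; because an interval is the
// only writer of its cell, a plain store replaces an atomic add.
std::vector<float> pool_by_interval(const std::vector<float>& feats,
                                    const std::vector<Interval>& intervals,
                                    std::int32_t num_cells) {
    std::vector<float> bev(static_cast<std::size_t>(num_cells), 0.0f);
    for (const Interval& iv : intervals) {
        float acc = 0.0f;
        for (std::int32_t i = iv.begin; i < iv.end; ++i) {
            acc += feats[static_cast<std::size_t>(i)];
        }
        bev[static_cast<std::size_t>(iv.cell)] = acc;  // single owner, single write
    }
    return bev;
}
```

The trade-off in this formulation is that the sort and interval construction must be paid for up front; it tends to be worthwhile when the same point-to-cell mapping is reused across many frames or channels.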

Buying an AMD GPU instead of Nvidia? — You're sacrificing more than just ray tracing performance

AMD's GPU market share has collapsed from 36% in 2018 to just 5% by the end of 2025. Beyond ray tracing, AMD users face real compromises: Nvidia's DLSS suite (including Multi Frame Generation, Ray Reconstruction, and Dynamic Frame Generation) still leads FSR in image quality, game support (97% vs 72%), and backward compatibility. For AI workloads, CUDA remains the dominant platform, with better day-one library support than AMD's ROCm. Creative professionals using Adobe Premiere, After Effects, or Blender also benefit from Nvidia's CUDA acceleration and NVENC encoder quality. AMD has closed gaps significantly, especially with RDNA 4 and FSR 4.1, but Nvidia retains a genuine lead across gaming, AI, and creative software ecosystems.

ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)

ParallelKernelBench (PKB) is a new benchmark testing whether frontier LLMs can write optimized multi-GPU CUDA kernels across 87 real-world workloads. Results show significant gaps: the best zero-shot model solves only 28 of 87 problems correctly, and just 22 of its solutions beat the PyTorch+NCCL baseline. Models struggle with rank coordination, data partitioning, and choosing optimal GPU-to-GPU transfer mechanisms like TMA and NVLS. An agentic feedback loop with Gemini 3 Pro improved results modestly (35 correct, 26 faster than baseline) but plateaued after ~20 refinement steps. Despite poor overall performance, a few generated kernels (for NeMo-RL GRPO, Hyena context parallelism, and SAM 3 mask suppression) outperformed any publicly available implementation. PKB is released as an open benchmark to drive further research into LLM-driven distributed kernel optimization.

Building a custom Red Hat Enterprise Linux kernel for NVIDIA DGX Spark

A step-by-step guide for building a custom RHEL 10 kernel and NVIDIA GPU driver from source to enable compatibility with the NVIDIA DGX Spark (GB10 Grace Blackwell Superchip) platform. Covers cloning the RHEL 10 kernel repo customized for GB10, installing build dependencies, compiling kernel RPMs (kernel-64k variant for ARM64), creating an RPM spec file for NVIDIA open-source kernel modules, building and installing the GPU driver RPM, installing CUDA via the NVIDIA repo and EPEL, and handling kernel updates via git reset. Intended as a Developer Preview workflow, not for production use.