Modal has launched Auto Endpoints, a self-serve product for deploying production-grade LLM inference with a single CLI command. Unlike traditional managed inference providers, Modal exposes the underlying code, engine-level metrics such as time to first token (TTFT), inter-token latency (ITL), and speculative decoding acceptance length, and full infrastructure controls to users. Auto Endpoints are built on Modal's serverless GPU platform, with ultra-low-latency routing via Modal Servers (5ms overhead). Performance comes from optimized inference recipes using SGLang, FlashAttention-4, and DFlash block-diffusion speculative decoding, with a reported speedup of over 4x versus baseline. The system is designed to evolve toward full automation via internal agentic pipelines (autoinference, autospec, autodistill, autoresearch) that configure, benchmark, and improve inference endpoints without manual engineering effort.
Source: https://modal.com/blog/introducing-auto-endpoints. 8sync News only summarizes and links to the original; copyright in the content belongs to the author and the original source.
A reproducible benchmark compares gradient-boosted decision trees (GBDTs) with LLM-based scoring for payment fraud detection across three dimensions: latency, cost, and determinism. On a single CPU core, GBDTs hit a p99 latency of 0.15ms versus ~1,200ms for LLMs; the latter is well outside the 100ms ISO 8583 authorization budget. On cost, GBDTs run ~$54/hour at 50K TPS vs. $16,200–$351,000 for LLM tiers. Determinism is the most critical issue for regulated environments: GBDTs return identical scores on identical inputs, while LLMs produce hundreds of distinct outputs even at temperature=0. The recommended architecture keeps deterministic tree ensembles on the synchronous hot path and deploys LLM agents on the asynchronous cold path for SAR drafting, evidence gathering, and agent-as-a-judge validation before human review. All benchmark code is open-source and reproducible on a laptop.
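The hot-path/cold-path split described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's code: `Txn`, `score_hot_path`, `authorize`, the feature names, and the 0.7 threshold are all hypothetical, and a hand-written rule function stands in for a trained tree ensemble.

```python
import queue
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    # Hypothetical transaction features, chosen only for illustration.
    txn_id: str
    amount: float
    country_mismatch: bool


def score_hot_path(txn: Txn) -> float:
    """Stand-in for a tree ensemble: a pure function of its inputs,
    so identical transactions always receive identical scores."""
    score = 0.0
    if txn.amount > 1000:
        score += 0.5
    if txn.country_mismatch:
        score += 0.4
    return score


# Cold path: flagged transactions wait here for asynchronous work
# (for example LLM-assisted drafting and human review).
review_queue: "queue.Queue[Txn]" = queue.Queue()


def authorize(txn: Txn, threshold: float = 0.7) -> bool:
    """Synchronous decision; never waits on the cold path."""
    approved = score_hot_path(txn) < threshold
    if not approved:
        review_queue.put(txn)  # hand off without blocking authorization
    return approved


decisions = {
    t.txn_id: authorize(t)
    for t in (Txn("t1", 20.0, False), Txn("t2", 5000.0, True))
}
```

The authorization decision depends only on the deterministic scorer; anything slow or non-deterministic consumes from `review_queue` afterwards.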
Running three different LLMs simultaneously on a single 8GB GPU fails because llama.cpp pre-allocates the full KV cache upfront, causing OOM errors for the second and third processes. The solution is a C++ daemon called lmxd that implements Connection Admission Control (borrowed from 5G/telecom) as a VRAM ledger: it tracks allocated bytes, enforces a 90% cap, and refuses new agent registrations before any GPU allocation is attempted. The daemon also handles KV-cache swapping to host RAM between agent switches, enabling multiple agents to share one GPU context slot. Additionally, a layer streaming technique using two CUDA streams overlaps compute and weight transfer, achieving ~22-32% wall-clock savings on a GTX 1080. The repo ships the admission control daemon and the streaming primitive as separate, composable components.
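The ledger idea can be illustrated with a short sketch. The daemon itself is written in C++; the Python below only mirrors the bookkeeping described in the summary (track allocated bytes, enforce a 90% cap, refuse before any allocation happens). `VramLedger`, `admit`, and `release` are hypothetical names, not lmxd's API.

```python
from dataclasses import dataclass, field


@dataclass
class VramLedger:
    """Toy admission-control ledger (hypothetical; not lmxd's interface)."""
    total_bytes: int
    cap_fraction: float = 0.90  # the 90% cap mentioned in the summary
    allocations: dict[str, int] = field(default_factory=dict)

    @property
    def budget_bytes(self) -> int:
        return int(self.total_bytes * self.cap_fraction)

    @property
    def allocated_bytes(self) -> int:
        return sum(self.allocations.values())

    def admit(self, agent_id: str, requested_bytes: int) -> bool:
        """Record the reservation only if it fits under the cap.
        The caller would attempt the real GPU allocation after True."""
        if agent_id in self.allocations:
            return False  # already registered
        if self.allocated_bytes + requested_bytes > self.budget_bytes:
            return False  # refuse up front instead of hitting OOM later
        self.allocations[agent_id] = requested_bytes
        return True

    def release(self, agent_id: str) -> None:
        self.allocations.pop(agent_id, None)


GIB = 1024 ** 3
ledger = VramLedger(total_bytes=8 * GIB)
results = [ledger.admit(name, 3 * GIB) for name in ("agent-a", "agent-b", "agent-c")]
```

With an 8 GiB card and three 3 GiB requests, the first two fit under the 7.2 GiB budget and the third is refused, so `results` is `[True, True, False]`.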
Unconventional AI, led by former Databricks AI chief Naveen Rao, has released Un0, an image-generation model built on a software simulation of a novel oscillator-based computing architecture. The company claims this architecture could reduce AI inference power consumption by up to 1,000x compared to conventional chips. Un0 performs comparably to state-of-the-art diffusion models like Stable Diffusion, serving as a proof-of-concept for the new architecture. The company plans to release actual chip schematics soon and eventually build a full inference stack, positioning itself as a compute provider running at a fraction of current energy costs.
NVIDIA and AWS have announced several joint infrastructure advancements for enterprise AI at scale. New Amazon EC2 G7 instances powered by NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs deliver up to 4.6x AI inference performance and 2.1x graphics performance over G6 instances, with support for up to 8 GPUs and 700 Gbps networking. Amazon OpenSearch Serverless now uses NVIDIA cuVS for GPU-accelerated vector indexing by default, enabling up to 10x faster vector indexing at a quarter of the CPU-only cost, making billion-scale vector databases buildable in under an hour. Additionally, AWS has achieved NVIDIA Exemplar Cloud status for GB300 training workloads, certifying that AWS meets NVIDIA's rigorous performance benchmarks for large-scale AI training.
Cerebras posted strong Q1 results, with revenue up 92% to $193.4m and a narrowed net loss, yet its stock fell ~10% after the company warned gross margins would drop sharply from 46.5% to 36–38% in Q2. The culprit is not chip supply but a shortage of data-centre space and power. Cerebras is renting back its own systems and building capacity at speed, costs that will shave 10–15 margin points this year. CEO Andrew Feldman called it a 'grand irony' that buildings, not chips, are now the limiting factor. The company guided full-year revenue of $855–865m, above analyst estimates, and highlighted a $20bn+ OpenAI inference deal and a 178% jump in cloud/services revenue. Still, the stock has fallen ~28% from its post-IPO peak, weighed down by high expectations, a broader chip-sector sell-off, and concentration risk around a few large customers.
DFlash is an open-source block-diffusion speculative decoding method that replaces sequential autoregressive drafting with parallel block-level token prediction. On NVIDIA Blackwell hardware, it delivers up to 15x throughput improvement over autoregressive decoding for gpt-oss-120b at high interactivity targets, and outperforms EAGLE-3 speculative decoding by 1.5x. The technique uses three key mechanisms: block-diffusion drafting, target hidden-state conditioning, and KV injection. Twenty model checkpoints covering the Qwen, Llama, Gemma, Kimi K2.6, and gpt-oss families are available on Hugging Face, with integration support for TensorRT-LLM, vLLM (via the Speculators library), and SGLang that requires minimal config changes and no application refactoring.
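For readers unfamiliar with the draft-and-verify loop that speculative decoding relies on, here is a toy sketch. It shows generic greedy verification of a drafted block, not DFlash itself: the block-diffusion drafter, hidden-state conditioning, and KV injection are not modelled, and `toy_target`, `toy_drafter`, `speculative_step`, and `generate` are hypothetical names over integer "tokens".

```python
from typing import Callable, Sequence

Token = int
TargetNext = Callable[[Sequence[Token]], Token]
DraftBlock = Callable[[Sequence[Token], int], list[Token]]


def speculative_step(prefix: Sequence[Token], draft_block: DraftBlock,
                     target_next: TargetNext, block_size: int) -> list[Token]:
    """Draft a block at once, then keep the longest prefix the target agrees with.
    A real engine checks all positions in one target forward pass; this loop
    only emulates that check token by token."""
    proposed = draft_block(prefix, block_size)
    accepted: list[Token] = []
    for tok in proposed:
        expected = target_next(list(prefix) + accepted)
        if tok != expected:
            accepted.append(expected)  # target's correction replaces the bad draft
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(list(prefix) + accepted))
    return accepted


def generate(prefix: Sequence[Token], draft_block: DraftBlock,
             target_next: TargetNext, block_size: int, max_new: int) -> list[Token]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_block, target_next, block_size))
    return out[: len(prefix) + max_new]


def toy_target(seq: Sequence[Token]) -> Token:
    return (seq[-1] + 1) % 10  # toy "model": count upward modulo 10


def toy_drafter(seq: Sequence[Token], k: int) -> list[Token]:
    block = [(seq[-1] + i) % 10 for i in range(1, k + 1)]
    return [0 if t == 5 else t for t in block]  # deliberately wrong whenever 5 is due


spec_out = generate([1], toy_drafter, toy_target, block_size=4, max_new=12)
```

The property worth noticing is that the output matches what the target alone would produce greedily; the drafter only changes how many target steps are needed, which is why acceptance length is the metric to watch.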
Upbound has launched Modelplane, an open-source control plane built on Crossplane that lets IT teams manage AI inference engines using the same declarative workflows they use for Kubernetes clusters. Modelplane supports deploying inference engines based on available GPU capacity across cluster fleets, autoscaling replicas, caching and distributing model weights, and routing inference requests through a unified gateway. Available under the Apache 2.0 license with no usage caps, it aims to integrate AI inference workload management into existing cloud-native operations without requiring specialized staff.
TokenSpeed-kernel is an open-source, standalone subsystem that provides a clean layered API and registry system for LLM inference kernels across multiple hardware backends. It decouples the high-level runtime from hardware-specific kernel implementations using a decorator-based registration system where kernels declare their platform capabilities, tensor format signatures, and priorities. The selector then dispatches to the best available implementation at runtime. Using GPT-OSS 120B on AMD MI355X (CDNA4) as a validation target, the post demonstrates how Gluon-backed attention and MoE kernels achieve 1.6–3.6x end-to-end throughput improvements over portable Triton baselines, while NVIDIA paths (via FlashInfer/TensorRT-LLM wrappers) use the same public APIs. The AMD-specific kernels are published as a standalone pip package (tokenspeed-kernel-amd) reusable by other inference engines like vLLM.
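A decorator-based registry with priority dispatch, as described above, might look roughly like the following. This is a sketch of the general pattern under stated assumptions, not TokenSpeed-kernel's actual API: `register_kernel`, `select_kernel`, `KernelEntry`, and the platform and dtype strings are all hypothetical, and the "kernels" are placeholder functions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KernelEntry:
    op: str
    platform: str   # "any" marks a portable fallback
    dtype: str
    priority: int
    fn: Callable


_REGISTRY: list[KernelEntry] = []


def register_kernel(op: str, *, platform: str, dtype: str, priority: int):
    """Decorator: the kernel declares where it applies and how preferred it is."""
    def decorator(fn: Callable) -> Callable:
        _REGISTRY.append(KernelEntry(op, platform, dtype, priority, fn))
        return fn
    return decorator


def select_kernel(op: str, platform: str, dtype: str) -> Callable:
    """Pick the highest-priority implementation matching the runtime context."""
    candidates = [
        e for e in _REGISTRY
        if e.op == op and e.dtype == dtype and e.platform in (platform, "any")
    ]
    if not candidates:
        raise LookupError(f"no kernel for {op} on {platform} with {dtype}")
    return max(candidates, key=lambda e: e.priority).fn


@register_kernel("attention", platform="any", dtype="bf16", priority=0)
def attention_portable(x: str) -> str:
    return f"portable({x})"


@register_kernel("attention", platform="amd-cdna4", dtype="bf16", priority=10)
def attention_tuned(x: str) -> str:
    return f"tuned({x})"


chosen_amd = select_kernel("attention", "amd-cdna4", "bf16")
chosen_other = select_kernel("attention", "other-gpu", "bf16")
```

The runtime calls `select_kernel` with only the operation and its context, so hardware-specific implementations can be added or shipped in a separate package without touching the caller.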