Modal · 0 comments · 8 min read · 2 days ago

Introducing Modal Auto Endpoints: Optimized inference you actually own

Modal has launched Auto Endpoints, a self-serve product for deploying production-grade LLM inference with a single CLI command. Unlike traditional managed inference providers, Modal exposes the underlying code, engine-level metrics (time to first token, or TTFT; inter-token latency, or ITL; and speculative decoding acceptance length), and full infrastructure controls to users. Auto Endpoints are built on Modal's serverless GPU platform, with low-latency routing via Modal Servers (5 ms overhead). Performance comes from optimized inference recipes that combine SGLang, FlashAttention-4, and DFlash block-diffusion speculative decoding, which together deliver a more than 4x speedup over baseline. The system is designed to evolve toward full automation via internal agentic pipelines (autoinference, autospec, autodistill, autoresearch) that configure, benchmark, and improve inference endpoints without manual engineering effort.
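To make the three engine-level metrics concrete, here is a minimal sketch of how they are commonly computed from raw timings. It is not Modal's implementation and uses no Modal API; the function names, the timestamp inputs, and the per-step token counts are all assumptions for illustration. Engines also differ on whether the "bonus" token from each verification step counts toward acceptance length, so treat that definition as one possible convention.

```python
from statistics import mean


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to first token: delay from sending the request to the first token."""
    if not token_times:
        raise ValueError("no tokens were generated")
    return token_times[0] - request_sent


def mean_itl(token_times: list[float]) -> float:
    """Mean inter-token latency: average gap between consecutive tokens."""
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a gap")
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return float(mean(gaps))


def mean_acceptance_length(tokens_per_step: list[int]) -> float:
    """Average number of tokens emitted per speculative-decoding verification step.

    Assumption: each entry is the token count produced by one verification
    pass of the target model (hypothetical input format).
    """
    if not tokens_per_step:
        raise ValueError("no verification steps recorded")
    return float(mean(tokens_per_step))


if __name__ == "__main__":
    # Hypothetical timings in seconds, measured from the moment the request was sent.
    times = [0.25, 0.5, 0.75, 1.0]
    print("TTFT:", ttft(0.0, times))
    print("mean ITL:", mean_itl(times))
    print("acceptance length:", mean_acceptance_length([4, 2, 3, 3]))
```

A higher acceptance length means each pass of the large model yields more tokens, which is how speculative decoding lowers ITL without changing the target model's outputs.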


#ai-inference

Source: https://modal.com/blog/introducing-auto-endpoints. 8sync News only summarizes and links to the original; copyright in the content belongs to the author and the original source.

Recommended for you

Towards Data Science · 17 min · 3 hours ago

The Hot Path Belongs to GBDTs, Agents Own the Cold Path: A Payment-Fraud Benchmark

A reproducible benchmark comparing gradient-boosted decision trees (GBDTs) with LLM-based scoring for payment fraud detection across three dimensions: latency, cost, and determinism. On a single CPU core, GBDTs hit a p99 latency of 0.15 ms vs. ~1,200 ms for LLMs, which puts LLMs well outside the 100 ms ISO 8583 authorization budget. Cost-wise, GBDTs run ~$54/hour at 50K TPS vs. $16,200–$351,000 for LLM tiers. Determinism is the most critical issue for regulated environments: GBDTs return identical scores on identical inputs, while LLMs produce hundreds of distinct outputs even at temperature=0. The recommended architecture keeps deterministic tree ensembles on the synchronous hot path and deploys LLM agents on the asynchronous cold path for SAR drafting, evidence gathering, and agent-as-a-judge validation before human review. All benchmark code is open-source and reproducible on a laptop.
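The cost gap is easier to read per transaction. The sketch below is back-of-the-envelope arithmetic only, not the benchmark's own code. It assumes the LLM figures are also hourly costs at the same 50K TPS load, which the summary implies but does not state outright; the function and variable names are hypothetical.

```python
TPS = 50_000              # transactions per second, from the summary
SECONDS_PER_HOUR = 3_600


def cost_per_million(hourly_cost_usd: float, tps: int = TPS) -> float:
    """Convert an hourly cost at a sustained TPS into USD per million transactions."""
    transactions_per_hour = tps * SECONDS_PER_HOUR
    return hourly_cost_usd / transactions_per_hour * 1_000_000


gbdt = cost_per_million(54)            # GBDT tier
llm_low = cost_per_million(16_200)     # cheapest LLM tier (assumed hourly)
llm_high = cost_per_million(351_000)   # most expensive LLM tier (assumed hourly)

if __name__ == "__main__":
    print(f"GBDT: ${gbdt:,.2f} per million transactions")
    print(f"LLM:  ${llm_low:,.2f} to ${llm_high:,.2f} per million transactions")
    print(f"LLM/GBDT cost ratio: {llm_low / gbdt:,.0f}x to {llm_high / gbdt:,.0f}x")
```

Under that assumption, the GBDT works out to about $0.30 per million transactions against roughly $90 to $1,950 for the LLM tiers, a ratio of about 300x to 6,500x.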

#machine-learning
