SkyPilot Endpoints: Production-Ready Inference on Every Cluster You Own
SkyPilot Endpoints is a production-ready LLM inference system that deploys a full serving stack — inference engine, autoscaler, gateway, TLS, metrics — from a single YAML across multiple Kubernetes clusters under one endpoint URL. It handles cross-cluster placement, autoscaling, and failure recovery automatically. A key feature is unified GPU pool management: training jobs run as preemptible workloads that yield GPUs to latency-sensitive inference when demand spikes, then resume from checkpoints when capacity frees up. The stack builds on vLLM, KServe, llm-d, and KEDA, and includes KV cache-aware routing, prefill/decode disaggregation, scale-to-zero, rolling updates, and a unified observability dashboard across all clusters.
