How Do Self-Hosted AI Models Change Your Kubernetes Decisions?
Running self-hosted AI models on Kubernetes changes how platform teams manage capacity, security, and operations. This post covers when self-hosting makes sense over API-based AI (cost predictability, data residency, avoiding vendor lock-in), what changes in cluster design (GPU node groups, autoscaling, scheduling patterns, observability), and how to split ownership between platform and ML teams. Key operational concerns include GPU utilization, new failure modes such as growing queue depth and rising token latency, compliance mapping, and FinOps for GPU spend. The post also addresses when to keep Kubernetes management in-house versus using a managed service.