Kubernetes for ML Pipelines: Six Months on EKS
Running batch ML workloads on EKS taught me things the documentation won't. Node pools, spot interruptions, GPU scheduling — the hard lessons.
Kubernetes is not the obvious choice for ML pipelines — managed solutions like SageMaker or Vertex AI take care of a lot of infrastructure complexity. But when you need full control over your stack, want to avoid vendor lock-in, and have workloads that span training, inference, and data prep, EKS with a well-configured node pool setup is hard to beat.
The critical decision is GPU node pool architecture. We ended up with three pools: a small on-demand pool for interactive inference (always warm, expensive), a spot pool for batch training (90% cheaper, but it needs interruption handling), and a CPU pool for preprocessing. Handling spot interruptions gracefully, with node affinity rules plus checkpointing on node drain, was the engineering challenge that took the longest.
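The checkpointing half is mostly plumbing in the training loop: Kubernetes sends the pod a SIGTERM when the spot node is drained, and the job has the grace period to flush state. Here is a minimal sketch, assuming a PyTorch-style loop; the checkpoint path and save interval are placeholders, not values from our setup:

```python
import signal
import torch

# Placeholder path for illustration; in practice this should live on shared
# storage (EFS, S3 sync, etc.) so a rescheduled pod can read it back.
CHECKPOINT_PATH = "/mnt/shared/checkpoints/latest.pt"

interrupted = False

def _handle_sigterm(signum, frame):
    # Kubernetes delivers SIGTERM when the node is being drained,
    # e.g. ahead of a spot reclaim.
    global interrupted
    interrupted = True

signal.signal(signal.SIGTERM, _handle_sigterm)

def train(model, optimizer, data_loader, start_step=0):
    for step, batch in enumerate(data_loader, start=start_step):
        loss = model(batch).mean()  # placeholder forward pass and loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Checkpoint periodically, and immediately on interruption, so a
        # rescheduled pod resumes instead of restarting the run.
        if interrupted or step % 500 == 0:
            torch.save(
                {
                    "step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                },
                CHECKPOINT_PATH,
            )
            if interrupted:
                break  # exit cleanly within the pod's grace period
```

Pair this with a generous terminationGracePeriodSeconds on the job spec so the final save has time to complete before the kubelet forces the pod down.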
Autoscaling takes time to tune. Cluster Autoscaler works well for CPU workloads; for GPU nodes, the scale-up latency (3–5 minutes for a g4dn instance to join) means you need to keep a warm floor of nodes. We settled on Karpenter in the end: its bin-packing and faster node provisioning made a meaningful difference.
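Whatever provides that floor (a minimum size on the GPU node group, or low-priority placeholder pods for the scheduler to pack around), it's worth a cheap check that the floor actually holds. A small sketch using the official Kubernetes Python client, assuming the NVIDIA device plugin is advertising nvidia.com/gpu as an allocatable resource; the floor value here is a placeholder:

```python
from kubernetes import client, config

WARM_FLOOR = 1  # placeholder: GPU nodes to keep online regardless of queue depth

def ready_gpu_nodes() -> int:
    """Count Ready nodes that expose allocatable nvidia.com/gpu."""
    config.load_incluster_config()  # use load_kube_config() outside the cluster
    nodes = client.CoreV1Api().list_node().items
    count = 0
    for node in nodes:
        has_gpu = "nvidia.com/gpu" in (node.status.allocatable or {})
        is_ready = any(
            c.type == "Ready" and c.status == "True"
            for c in (node.status.conditions or [])
        )
        if has_gpu and is_ready:
            count += 1
    return count

if __name__ == "__main__":
    n = ready_gpu_nodes()
    if n < WARM_FLOOR:
        print(f"warm floor breached: {n} ready GPU nodes, want {WARM_FLOOR}")
```

Run it as a CronJob or fold it into existing monitoring; the point is simply to notice when the warm capacity you are paying for has quietly disappeared.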
If I were starting over, I'd containerise more aggressively from day one. Debugging Kubernetes issues where half your config lives in the container and half lives in Helm charts is a nightmare. Single-source-of-truth infrastructure, ideally with Terraform managing the EKS config and the Helm charts from the same repo, is the sane path.