GPU Workloads on Kubernetes: What Your Platform Actually Needs
How we run AI and ML workloads on managed Kubernetes - GPU node pools, model serving with vLLM, scheduling with taints and tolerations, and GPU-aware observability. No separate AI platform required.
Every other month, a customer asks us the same question: "We need to run AI workloads. Should we set up a separate infrastructure for that?"
The answer is almost always no. If you already have a managed Kubernetes platform, you have 90% of what you need. The remaining 10% is GPU node pools, the right scheduling configuration, and observability that understands GPU metrics. That is what this post covers.
We have been running GPU workloads on our managed clusters since 2024, for use cases ranging from LLM inference and RAG pipelines to batch embedding generation and model fine-tuning. The workloads run on the same platform, with the same GitOps workflows, the same observability stack, and the same SLA guarantees as every other workload.
GPU node pools: separate by design
The first thing you need is a dedicated node pool for GPU workloads. You do not want GPU nodes running your ingress controller or Prometheus. GPUs are expensive, and every minute a GPU node spends hosting platform services while its GPU sits idle is money wasted.
We use Kubernetes taints and tolerations to enforce this separation. GPU nodes get a taint (nvidia.com/gpu=present:NoSchedule) that prevents any pod from scheduling on them unless it explicitly tolerates the taint. Only pods that request GPU resources (nvidia.com/gpu: 1) get the toleration.
This means:
- Regular workloads never land on GPU nodes (no accidental cost)
- GPU workloads never compete with CPU workloads for scheduling
- You can scale GPU and CPU node pools independently
- Cluster autoscaler can add/remove GPU nodes based on actual GPU demand
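The separation above can be sketched in a few lines of Python. This is a simplified model of how a NoSchedule taint and a toleration are matched, not the scheduler's actual code. The taint and the nvidia.com/gpu resource name come from the text; the pod names and the image are hypothetical.

```python
# Simplified model of taint/toleration matching for NoSchedule taints.
# Taint and resource name are from the text; pod names and image are hypothetical.

GPU_TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if one toleration matches one taint."""
    # A toleration without an effect matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key combined with Exists matches every taint key.
        return toleration.get("key") in (None, "", taint["key"])
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod: dict, node_taints: list) -> bool:
    """A pod fits a node only if it tolerates every NoSchedule taint on it."""
    tolerations = pod["spec"].get("tolerations", [])
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-example"},  # hypothetical name
    "spec": {
        "tolerations": [
            {
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "present",
                "effect": "NoSchedule",
            }
        ],
        "containers": [
            {
                "name": "model",
                "image": "registry.example.com/model-server:latest",  # hypothetical
                "resources": {"limits": {"nvidia.com/gpu": 1}},
            }
        ],
    },
}

web_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-example"},  # hypothetical name
    "spec": {"containers": [{"name": "web", "image": "registry.example.com/web:latest"}]},
}
```

Here can_schedule(gpu_pod, [GPU_TAINT]) is true and can_schedule(web_pod, [GPU_TAINT]) is false. Note that the toleration only allows the pod onto GPU nodes. It is the nvidia.com/gpu request that keeps it off CPU nodes, because those nodes advertise no such resource.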
On Natron Cloud and Flex Stack, we provision GPU nodes with NVIDIA A100 or H100 cards, depending on the workload requirements. On BYOC with Azure, we configure AKS node pools with NC-series or ND-series VMs. The platform stack on top is identical regardless of where the GPU hardware lives.
What happens when a GPU pod gets scheduled
The scheduling flow for GPU workloads has a few more steps than regular pods, and each one can fail silently if not configured correctly.
The critical component here is the NVIDIA device plugin. It runs as a DaemonSet on every GPU node and exposes the GPUs as schedulable resources to the Kubernetes API. Without it, the scheduler does not know GPUs exist. We deploy and manage this as part of our platform stack via FluxCD, so customers do not have to think about it.
Model weights are typically large: at 16-bit precision (2 bytes per parameter), a 7B-parameter model is about 14 GB and a 70B model about 140 GB. They need to be pulled from object storage (our Ceph S3 or Azure Blob) and loaded into GPU memory before the model can serve requests. This startup time matters. We configure readiness probes that account for model loading, so the ingress does not route traffic to a pod that is still loading weights.
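To make the startup budget concrete, here is a rough sketch of the arithmetic. It assumes 2 bytes per parameter (16-bit weights) and turns an estimated load time into a probe configuration. The throughput figure, the safety factor, the /health path, and port 8000 are illustrative assumptions. We also use a startupProbe here as one common way to give a slow-loading container a time budget; the exact probe setup depends on the serving framework.

```python
import math


def weight_size_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes); 2 bytes = 16-bit weights."""
    return params_billion * bytes_per_param


def startup_probe(
    size_gb: float,
    throughput_gb_per_s: float,  # assumed sustained pull rate from object storage
    period_seconds: int = 10,
    safety_factor: float = 2.0,  # assumed headroom for slow pulls and GPU load
) -> dict:
    """Build a startupProbe whose total budget covers the estimated load time."""
    load_seconds = size_gb / throughput_gb_per_s * safety_factor
    return {
        "httpGet": {"path": "/health", "port": 8000},  # assumed endpoint
        "periodSeconds": period_seconds,
        "failureThreshold": math.ceil(load_seconds / period_seconds),
    }


small = weight_size_gb(7)    # 7B parameters at 16 bit
large = weight_size_gb(70)   # 70B parameters at 16 bit
probe = startup_probe(large, throughput_gb_per_s=1.0)
```

With an assumed 1 GB/s pull rate, the 70B model gets a budget of 28 probe periods of 10 seconds each before the container is considered failed. Measure your real pull rate before relying on numbers like these.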
The AI/ML stack on Kubernetes
You do not need a specialized AI platform. You need the right components deployed as regular Kubernetes workloads:
Model serving is the most common GPU workload we run. vLLM is our default recommendation for LLM inference. It handles batching, paged attention, and multi-GPU tensor parallelism out of the box. For customers who want to run open-source models locally (Llama, Mistral, Qwen), vLLM as a Kubernetes Deployment with GPU requests is the simplest path. Ollama works for lighter workloads and local development patterns.
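As an illustration, a vLLM Deployment with a GPU request could be generated like this. It is a sketch, not our production manifest: the image tag, the deployment name, the example model identifier, and the probe settings are assumptions, and the toleration reuses the taint described earlier.

```python
import json


def vllm_deployment(name: str, model: str, gpus: int = 1, replicas: int = 1) -> dict:
    """Build a Deployment manifest for a vLLM OpenAI-compatible server."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Toleration for the GPU node taint described in the text.
                    "tolerations": [
                        {
                            "key": "nvidia.com/gpu",
                            "operator": "Equal",
                            "value": "present",
                            "effect": "NoSchedule",
                        }
                    ],
                    "containers": [
                        {
                            "name": "vllm",
                            "image": "vllm/vllm-openai:latest",  # pin a version in practice
                            "args": [
                                "--model", model,
                                # Shard the model across all GPUs of this pod.
                                "--tensor-parallel-size", str(gpus),
                            ],
                            "ports": [{"containerPort": 8000}],
                            "resources": {"limits": {"nvidia.com/gpu": gpus}},
                            # Assumed probe: only ready once the server answers.
                            "readinessProbe": {
                                "httpGet": {"path": "/health", "port": 8000},
                                "periodSeconds": 10,
                            },
                        }
                    ],
                },
            },
        },
    }


# Example model identifier chosen for illustration only.
manifest = vllm_deployment("llm-inference", "mistralai/Mistral-7B-Instruct-v0.2")
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON can be committed to Git as is, since Kubernetes accepts JSON manifests as well as YAML.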
Vector databases for RAG pipelines run on regular CPU node pools. pgvector (PostgreSQL extension) is our default because most customers already have PostgreSQL on the platform. For larger-scale vector search, we deploy Qdrant or Milvus.
Model weight storage uses our existing Ceph S3 (on Natron Cloud / Flex Stack) or the customer's cloud object storage (Azure Blob / GCS). Models are pulled at pod startup. For frequently used models, we use PersistentVolumes with ReadWriteMany to share weights across replicas without re-downloading.
GPU observability: DCGM Exporter + our existing stack
This is where most self-managed setups fall short. Standard Prometheus metrics do not include GPU utilization, memory, temperature, or power draw. You need the NVIDIA DCGM Exporter, which exposes GPU metrics as Prometheus-compatible endpoints.
We deploy DCGM Exporter as a DaemonSet on every GPU node, alongside our standard node exporter. The metrics feed into the same Prometheus instance that monitors the rest of the cluster, and we maintain pre-built Grafana dashboards on top of them.
The alerting rules are GPU-aware:
- GPU utilization below 10% for 30 minutes: you are paying for a GPU that is not being used. Consider scaling down or using spot/preemptible instances.
- GPU memory above 95%: the model is close to OOM. The next request batch might fail.
- Inference latency p99 above threshold: the model is becoming a bottleneck. Consider adding replicas or switching to a larger GPU.
- GPU temperature above 85°C: thermal throttling is imminent. Check datacenter cooling or reduce batch sizes.
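The four thresholds above could be expressed as Prometheus alerting rules roughly as follows. The DCGM_FI_DEV_* names are metrics from DCGM Exporter's default counter set. The alert names, the 5-minute hold durations, and the latency histogram name are hypothetical, because the latency metric depends on what the serving application exports.

```python
def gpu_alert_rules(latency_histogram: str, latency_p99_seconds: float) -> list:
    """Build Prometheus alerting rules for the four GPU thresholds in the list above."""
    # p99 over a Prometheus histogram; the metric name is supplied by the caller.
    p99 = (
        "histogram_quantile(0.99, "
        f"sum(rate({latency_histogram}_bucket[5m])) by (le))"
    )
    return [
        {
            "alert": "GPUUnderutilized",
            "expr": "DCGM_FI_DEV_GPU_UTIL < 10",
            "for": "30m",
            "annotations": {"summary": "GPU below 10% utilization for 30 minutes"},
        },
        {
            "alert": "GPUMemoryNearlyFull",
            # Framebuffer used as a fraction of used + free.
            "expr": (
                "DCGM_FI_DEV_FB_USED / "
                "(DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95"
            ),
            "for": "5m",  # assumed hold duration
            "annotations": {"summary": "GPU memory above 95%, OOM risk"},
        },
        {
            "alert": "InferenceLatencyHigh",
            "expr": f"{p99} > {latency_p99_seconds}",
            "for": "5m",  # assumed hold duration
            "annotations": {"summary": "Inference p99 latency above threshold"},
        },
        {
            "alert": "GPUTemperatureHigh",
            "expr": "DCGM_FI_DEV_GPU_TEMP > 85",
            "for": "5m",  # assumed hold duration
            "annotations": {"summary": "GPU above 85°C, throttling likely"},
        },
    ]


# Hypothetical histogram name and a 2-second p99 target.
rules = gpu_alert_rules("inference_request_duration_seconds", 2.0)
```

Each dictionary maps directly onto one entry under rules: in a Prometheus rule group.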
This is the same approach we take with every platform component: install it, configure sane defaults, build dashboards, write alert rules, and manage it as part of the platform. The customer sees GPU metrics in their Grafana dashboard next to their application metrics.
Scaling patterns
GPU workloads scale differently from CPU workloads. You cannot just add replicas like a web service because each replica needs a dedicated GPU, and GPUs are scarce and expensive.
Horizontal scaling works for inference workloads. If you need more throughput, add more replicas (each with a GPU). The cluster autoscaler provisions new GPU nodes as needed. We configure scale-up thresholds based on inference queue depth, not CPU utilization.
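A queue-depth-driven scale-up decision can be sketched with the formula the Horizontal Pod Autoscaler documents: desired replicas = ceil(current replicas × current metric / target metric). Whether the decision is made by HPA, KEDA, or a custom controller is left open here; the target value, the replica bounds, and the 10% tolerance are illustrative assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    queue_depth_per_replica: float,  # observed average queue depth per replica
    target_per_replica: float,       # assumed target queue depth per replica
    min_replicas: int = 1,
    max_replicas: int = 8,           # bounded by how many GPUs you can afford
    tolerance: float = 0.1,
) -> int:
    """Scale on inference queue depth using the HPA-style ratio formula."""
    ratio = queue_depth_per_replica / target_per_replica
    # Within tolerance of the target: do nothing, to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, two replicas with an average of 30 queued requests each against a target of 10 yields six replicas, and each additional replica translates into one more GPU the cluster autoscaler has to provide.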
Vertical scaling works for model size. If you need to run a larger model, you need a bigger GPU (or multiple GPUs per pod with tensor parallelism). This is a configuration change in the pod spec, not a scaling event.
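A back-of-the-envelope way to pick the GPU count per pod is sketched below. It assumes 80 GB cards, reserves part of the memory for KV cache and activations through a hypothetical usable_fraction, and only considers power-of-two GPU counts, which is a common but not universal choice for tensor parallelism. Treat the result as a starting point for testing, not as a sizing guarantee.

```python
def min_tensor_parallel_gpus(
    weights_gb: float,
    gpu_memory_gb: float = 80.0,   # assumed 80 GB card
    usable_fraction: float = 0.7,  # assumed share of memory available for weights
) -> int:
    """Smallest power-of-two GPU count whose usable memory holds the weights."""
    gpus = 1
    while gpus * gpu_memory_gb * usable_fraction < weights_gb:
        gpus *= 2
    return gpus


gpus_7b = min_tensor_parallel_gpus(14)    # weights of a 7B model at 16 bit
gpus_70b = min_tensor_parallel_gpus(140)  # weights of a 70B model at 16 bit
```

Under these assumptions the 7B model fits on a single GPU, while the 70B model needs four. In the pod spec that means raising the nvidia.com/gpu limit to 4 and setting the serving framework's tensor-parallel size to match.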
Batch scheduling works for fine-tuning and embedding generation. These are not latency-sensitive. We use Kubernetes Jobs with GPU requests, scheduled during off-peak hours when GPU nodes have available capacity. This maximizes GPU utilization without competing with real-time inference.
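One way to express off-peak batch work is a CronJob, which creates a Kubernetes Job on a schedule. The sketch below is an assumption about how this could look, not a description of our exact setup: the schedule, job name, image, command, and retry settings are all hypothetical.

```python
def offpeak_gpu_cronjob(
    name: str,
    image: str,
    command: list,
    schedule: str = "0 1 * * *",  # assumed off-peak window: 01:00 every day
    gpus: int = 1,
) -> dict:
    """Build a CronJob that runs a GPU batch Job at the given schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            # Do not start a new run while the previous one still holds a GPU.
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {
                "spec": {
                    "backoffLimit": 2,  # assumed retry budget
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "tolerations": [
                                {
                                    "key": "nvidia.com/gpu",
                                    "operator": "Equal",
                                    "value": "present",
                                    "effect": "NoSchedule",
                                }
                            ],
                            "containers": [
                                {
                                    "name": name,
                                    "image": image,
                                    "command": command,
                                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                                }
                            ],
                        }
                    },
                }
            },
        },
    }


# Hypothetical embedding job.
cronjob = offpeak_gpu_cronjob(
    "embed-documents",
    "registry.example.com/embedder:latest",
    ["python", "embed.py"],
)
```

Because the job requests a GPU like any inference pod, the same taint, toleration, and autoscaling rules apply to it without special handling.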
What stays the same
The entire platform stack we describe in our managed Kubernetes post works unchanged for GPU workloads:
- Cilium handles networking for GPU pods the same as CPU pods. Network policies apply identically.
- cert-manager issues and renews the TLS certificates for your model serving endpoints. Your vLLM API gets HTTPS automatically.
- ArgoCD deploys your model serving configuration from Git. New model version? Update the image tag, push, ArgoCD syncs.
- Kyverno enforces policies on GPU pods. Resource requests required, image registries restricted, privileged containers blocked.
- Velero backs up the configuration (not the model weights, those live in object storage).
- Loki collects logs from GPU pods. When inference fails, you search logs the same way as any other workload.
This is the point: GPU workloads are just workloads. They need a GPU resource, a toleration, and proper observability. Everything else, the networking, the security, the GitOps, the backup, the monitoring, is the same platform your other workloads already use.
Get started
If your team is exploring AI workloads and you are already on our managed Kubernetes platform, GPU node pools are an add-on, not a new platform. We configure the hardware, deploy the NVIDIA toolchain, set up the observability, and your team deploys models through the same ArgoCD workflow they use for everything else.
If you are evaluating options, schedule a call. We will look at your workload requirements (model size, latency targets, throughput needs) and design a GPU node pool configuration that fits.

About the author
Jan Fuhrer
Platform Engineer and Architect at Natron Tech, designing Kubernetes platforms with GPU compute for AI workloads across Switzerland.
“Your AI workloads do not need a separate platform. They need your existing platform to handle GPUs properly.”
