May 8, 2026

GPU Workloads on Kubernetes: What Your Platform Actually Needs

How we run AI and ML workloads on managed Kubernetes: GPU node pools, model serving with vLLM, scheduling with taints and tolerations, and GPU-aware observability. No separate AI platform required.

By Jan Fuhrer

1. GPU node pools: separate by design
2. What happens when a GPU pod gets scheduled
3. The AI/ML stack on Kubernetes
4. GPU observability: DCGM Exporter + our existing stack
5. Scaling patterns
6. What stays the same
7. Get started

Every other month, a customer asks us the same question: "We need to run AI workloads. Should we set up separate infrastructure for that?"

The answer is almost always no. If you already have a managed Kubernetes platform, you have 90% of what you need. The remaining 10% is GPU node pools, the right scheduling configuration, and observability that understands GPU metrics. That is what this post covers.

We have been running GPU workloads on our managed clusters since 2024, for use cases ranging from quantized LLM inference and RAG pipelines to batch embedding generation and computer vision. The workloads run on the same platform, with the same GitOps workflows, the same observability stack, and the same SLA guarantees as every other workload.

GPU node pools: separate by design

The first thing you need is a dedicated node pool for GPU workloads. You do not want GPU nodes running your ingress controller or Prometheus. GPUs are expensive, and every minute a GPU sits idle running platform services is money wasted.

Figure: Managed Kubernetes cluster with three node pools, separated by taints, tolerations, and node affinity.

System pool (3 nodes, 8 vCPU / 32 GB RAM): platform services, observability, ingress, ArgoCD
Application pool (3-10 nodes, 16 vCPU / 64 GB RAM): APIs, web apps, workers, databases
GPU pool (1-4 nodes, NVIDIA A100/H100 + 32 vCPU): model inference, fine-tuning, embedding generation, batch processing

We use Kubernetes taints and tolerations to enforce this separation. GPU nodes get a taint (nvidia.com/gpu=present:NoSchedule) that prevents any pod from scheduling on them unless it explicitly tolerates the taint. Only pods that request GPU resources (nvidia.com/gpu: 1) get the toleration.

This means:

Regular workloads never land on GPU nodes (no accidental cost)
GPU workloads never compete with CPU workloads for scheduling
You can scale GPU and CPU node pools independently
Cluster autoscaler can add/remove GPU nodes based on actual GPU demand
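The separation rule can be sketched in a few lines of Python. This is a simplified model of the toleration check, not the scheduler's actual code. Only the taint `nvidia.com/gpu=present:NoSchedule` and the resource name `nvidia.com/gpu` come from the setup described here; the pod names and image references are hypothetical.

```python
# Simplified model of how a NoSchedule taint keeps non-GPU pods off GPU nodes.
# Pod names and images are hypothetical; the taint is the one described above.

GPU_TAINT = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    return (
        toleration.get("key") == taint["key"]
        and toleration.get("value", "") == taint.get("value", "")
    )


def can_schedule(pod: dict, node_taints: list) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    tolerations = pod["spec"].get("tolerations", [])
    return all(
        any(tolerates(t, taint) for t in tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


gpu_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "inference-example"},
    "spec": {
        "tolerations": [
            {"key": "nvidia.com/gpu", "operator": "Equal",
             "value": "present", "effect": "NoSchedule"}
        ],
        "containers": [{
            "name": "model",
            "image": "registry.example.com/model-server:latest",
            # GPUs are requested through limits; the request defaults to the limit.
            "resources": {"limits": {"nvidia.com/gpu": 1}},
        }],
    },
}

web_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web-example"},
    "spec": {"containers": [
        {"name": "web", "image": "registry.example.com/web:latest"}
    ]},
}

print(can_schedule(gpu_pod, [GPU_TAINT]))  # GPU pod tolerates the taint
print(can_schedule(web_pod, [GPU_TAINT]))  # regular pod is kept off the node
```

Note that a toleration only permits scheduling on GPU nodes; it does not force it. Requesting `nvidia.com/gpu` (or adding node affinity) is what actually steers the pod to a node that has a GPU.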

On Natron Cloud and Flex Stack, our GPU nodes run NVIDIA RTX A2000-class workstation cards (12 GB VRAM). They cover quantized language models in the 7B range, computer vision, embedding generation, and video processing. When a workload needs large-model inference or training on A100 or H100 cards, we configure AKS node pools with NC-series or ND-series VMs on BYOC with Azure. The platform stack on top is identical regardless of where the GPU hardware lives.

What happens when a GPU pod gets scheduled

The scheduling flow for GPU workloads has a few more steps than regular pods, and each one can fail silently if not configured correctly.

1. Request GPU: the pod spec requests nvidia.com/gpu: 1
2. Toleration match: the scheduler finds a GPU node with a matching taint
3. NVIDIA plugin: the device plugin exposes the GPU to the container runtime
4. Model loads: weights are pulled from S3/PVC into GPU memory
5. Serving ready: the health check passes and the ingress routes traffic

The critical component here is the NVIDIA device plugin. It runs as a DaemonSet on every GPU node and advertises the GPUs to the Kubernetes API as schedulable resources. Without it, the scheduler does not know GPUs exist. We deploy and manage this as part of our platform stack via ArgoCD, so customers do not have to think about it.
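One way to catch a missing device plugin early is to check what each node advertises. The sketch below reads Node objects in the shape returned by `kubectl get nodes -o json` (the `items` list) and counts the `nvidia.com/gpu` entry under `status.allocatable`. The node names and values are made up for illustration.

```python
# Count schedulable GPUs per node from Kubernetes Node objects.
# The node data below is invented; real input would come from the API server.

def allocatable_gpus(node: dict) -> int:
    """GPUs the device plugin has advertised on this node (0 if none)."""
    allocatable = node.get("status", {}).get("allocatable", {})
    # Kubernetes reports resource quantities as strings, e.g. "1".
    return int(allocatable.get("nvidia.com/gpu", "0"))


nodes = [
    {"metadata": {"name": "gpu-node-1"},
     "status": {"allocatable": {"cpu": "32", "nvidia.com/gpu": "1"}}},
    # A GPU node without a working device plugin advertises no GPUs at all.
    {"metadata": {"name": "gpu-node-2"},
     "status": {"allocatable": {"cpu": "32"}}},
]

for node in nodes:
    print(node["metadata"]["name"], allocatable_gpus(node))
```

A GPU node that reports zero here will never receive a pod that requests `nvidia.com/gpu`; such pods stay Pending, which is the silent failure described above.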

Model weights are typically large: at 16-bit precision (2 bytes per parameter), a 7B-parameter model is about 14 GB and a 70B model about 140 GB. They need to be pulled from object storage (our Ceph S3 or Azure Blob) into GPU memory before the model can serve requests. This startup time matters. We configure readiness probes that account for model loading, so the ingress does not route traffic to a pod that is still loading weights.
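The sizes and the probe budget follow from simple arithmetic, sketched below. The throughput figure, the safety factor, the `/health` path, and port 8000 are assumptions for illustration, not measured values from our platform. Using a startup probe to hold the load-time budget is also a choice made by this sketch; the readiness probe settings we run in production are not shown here.

```python
import math


def weight_size_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight size in GB: one billion parameters at one byte is 1 GB."""
    return params_billion * bytes_per_param


def startup_probe(size_gb: float, throughput_mb_s: float,
                  period_s: int = 10, safety: float = 2.0) -> dict:
    """Build a probe whose total budget covers the estimated load time.

    throughput_mb_s is the assumed end-to-end rate from storage into the pod.
    """
    load_s = size_gb * 1000 / throughput_mb_s
    return {
        "httpGet": {"path": "/health", "port": 8000},  # assumed endpoint
        "periodSeconds": period_s,
        # Budget = periodSeconds * failureThreshold, padded by the safety factor.
        "failureThreshold": math.ceil(load_s * safety / period_s),
    }


print(weight_size_gb(7, 2))     # 7B at 16-bit
print(weight_size_gb(70, 2))    # 70B at 16-bit
print(weight_size_gb(7, 0.5))   # 7B quantized to 4-bit
print(startup_probe(weight_size_gb(7, 2), throughput_mb_s=200))
```

The 4-bit line shows why a quantized 7B model fits on a 12 GB card while the 16-bit version does not, before counting the memory the serving engine needs for its cache.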

The AI/ML stack on Kubernetes

You do not need a specialized AI platform. You need the right components deployed as regular Kubernetes workloads:

Model Serving

vLLM

Ollama

Triton

TGI

Data & Storage

pgvector

Ceph S3 (model weights)

Redis (cache)

Qdrant / Milvus

Platform (always included)

GPU metrics (DCGM)

Grafana dashboards

Cilium + Ingress

Kyverno policies

Velero backups

Alertmanager

Model serving is the most common GPU workload we run. vLLM is our default recommendation for LLM inference. It handles batching, paged attention, and multi-GPU tensor parallelism out of the box. For customers who want to run open-source models locally (Llama, Mistral, Qwen), vLLM as a Kubernetes Deployment with GPU requests is the simplest path. Ollama works for lighter workloads and local development patterns.
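A minimal sketch of such a Deployment, generated from Python so the GPU count and the tensor-parallel size cannot drift apart. It assumes the public `vllm/vllm-openai` image, its default port 8000, and its `/health` endpoint; the deployment name and the model identifier are placeholders, and a real manifest would pin an image version and add memory and CPU requests.

```python
import json


def vllm_deployment(name: str, model: str, gpus: int = 1, replicas: int = 1) -> dict:
    """Build a Deployment manifest for a vLLM OpenAI-compatible server."""
    container = {
        "name": "vllm",
        "image": "vllm/vllm-openai:latest",  # pin a specific tag in practice
        "args": ["--model", model, "--tensor-parallel-size", str(gpus)],
        "ports": [{"containerPort": 8000}],
        # One limit drives both scheduling and the tensor-parallel size above.
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
        "readinessProbe": {
            "httpGet": {"path": "/health", "port": 8000},
            "periodSeconds": 10,
        },
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Matches the GPU node taint described earlier in this post.
                    "tolerations": [{
                        "key": "nvidia.com/gpu", "operator": "Equal",
                        "value": "present", "effect": "NoSchedule",
                    }],
                    "containers": [container],
                },
            },
        },
    }


manifest = vllm_deployment("llm-example", "mistralai/Mistral-7B-Instruct-v0.2")
print(json.dumps(manifest, indent=2))
```

The JSON output can be committed to Git as-is, since Kubernetes accepts JSON manifests as well as YAML.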

Vector databases for RAG pipelines run on regular CPU node pools. pgvector (PostgreSQL extension) is our default because most customers already have PostgreSQL on the platform. For larger-scale vector search, we deploy Qdrant or Milvus.

Model weight storage uses our existing Ceph S3 (on Natron Cloud / Flex Stack) or the customer's cloud object storage (Azure Blob / GCS). Models are pulled at pod startup. For frequently used models, we use PersistentVolumes with ReadWriteMany to share weights across replicas without re-downloading.

GPU observability: DCGM Exporter + our existing stack

This is where most self-managed setups fall short. The standard node and container metrics that Prometheus scrapes do not include GPU utilization, memory, temperature, or power draw. You need the NVIDIA DCGM Exporter, which exposes GPU metrics on a Prometheus-compatible endpoint.

We deploy DCGM Exporter as a DaemonSet on every GPU node, alongside our standard node exporter. The metrics feed into the same Prometheus instance that monitors the rest of the cluster. We have pre-built Grafana dashboards that show:

GPU Cluster Dashboard (Grafana + DCGM Exporter), example panels:

GPU utilization: 78%
GPU memory: 31.2 / 40 GB
Inference latency (p99): 142 ms
Tokens/sec: 1,247
GPU temperature: 67°C
Pending GPU requests

The alerting rules are GPU-aware:

GPU utilization below 10% for 30 minutes: you are paying for a GPU that is not being used. Consider scaling down or using spot/preemptible instances.
GPU memory above 95%: the model is close to OOM. Next request batch might fail.
Inference latency p99 above threshold: the model is becoming a bottleneck. Consider adding replicas or switching to a larger GPU.
GPU temperature above 85°C: thermal throttling is imminent. Check datacenter cooling or reduce batch sizes.
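The four rules above can be expressed as plain threshold checks. The sketch below is an illustration in Python, not our actual Prometheus rule files. The field comments name the DCGM Exporter metrics that would typically supply each value; the alert names and the latency threshold are hypothetical, and latency comes from the serving layer rather than from DCGM.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    util_pct: float     # DCGM_FI_DEV_GPU_UTIL
    fb_used_mib: float  # DCGM_FI_DEV_FB_USED
    fb_free_mib: float  # DCGM_FI_DEV_FB_FREE
    temp_c: float       # DCGM_FI_DEV_GPU_TEMP


def evaluate(samples: list, p99_latency_ms: float,
             latency_threshold_ms: float) -> list:
    """Return alert names for one GPU.

    samples must cover the last 30 minutes, oldest first, and be non-empty.
    """
    latest = samples[-1]
    alerts = []
    # Idle only if utilization stayed below 10% for the whole window.
    if all(s.util_pct < 10 for s in samples):
        alerts.append("GpuIdle")
    mem_pct = 100 * latest.fb_used_mib / (latest.fb_used_mib + latest.fb_free_mib)
    if mem_pct > 95:
        alerts.append("GpuMemoryNearFull")
    if p99_latency_ms > latency_threshold_ms:
        alerts.append("InferenceLatencyHigh")
    if latest.temp_c > 85:
        alerts.append("GpuTemperatureHigh")
    return alerts


healthy = [GpuSample(78, 31949, 9011, 67)]
idle = [GpuSample(2, 1024, 39936, 40), GpuSample(4, 1024, 39936, 41)]
stressed = [GpuSample(99, 39500, 500, 88)]

print(evaluate(healthy, p99_latency_ms=142, latency_threshold_ms=500))
print(evaluate(idle, p99_latency_ms=0, latency_threshold_ms=500))
print(evaluate(stressed, p99_latency_ms=900, latency_threshold_ms=500))
```

In Prometheus itself, the 30-minute window would usually be handled by the `for:` clause of the alerting rule rather than by passing a list of samples.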

This is the same approach we take with every platform component: install it, configure sane defaults, build dashboards, write alert rules, and manage it as part of the platform. The customer sees GPU metrics in their Grafana dashboard next to their application metrics.

Scaling patterns

GPU workloads scale differently from CPU workloads. You cannot just add replicas like a web service because each replica needs a dedicated GPU, and GPUs are scarce and expensive.

Horizontal scaling works for inference workloads. If you need more throughput, add more replicas (each with a GPU). The cluster autoscaler provisions new GPU nodes as needed. We configure scale-up thresholds based on inference queue depth, not CPU utilization.
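Scaling on queue depth reduces to a small calculation, sketched below. It follows the same proportional rule the Horizontal Pod Autoscaler applies to a per-pod average target. The target of 8 queued requests per replica and the replica bounds are assumptions for illustration; the upper bound stands in for the number of GPUs the pool can provide.

```python
import math


def desired_replicas(queue_depth: int, target_per_replica: int,
                     min_replicas: int = 1, max_replicas: int = 4) -> int:
    """Replicas needed to keep queued requests per replica at or below target."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(queue_depth / target_per_replica)
    # Each replica needs its own GPU, so clamp to what the pool can supply.
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 20, 100):
    print(depth, desired_replicas(depth, target_per_replica=8))
```

The upper clamp is the important part for GPU workloads: when demand exceeds it, requests queue instead of triggering replicas that would sit Pending without a GPU.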

Vertical scaling works for model size. If you need to run a larger model, you need a bigger GPU (or multiple GPUs per pod with tensor parallelism). This is a configuration change in the pod spec, not a scaling event.

Batch scheduling works for fine-tuning and embedding generation. These are not latency-sensitive. We use Kubernetes Jobs with GPU requests, scheduled during off-peak hours when GPU nodes have available capacity. This maximizes GPU utilization without competing with real-time inference.

What stays the same

The entire platform stack we describe in our managed Kubernetes post works unchanged for GPU workloads:

Cilium handles networking for GPU pods the same as CPU pods. Network policies apply identically.
cert-manager terminates TLS for your model serving endpoints. Your vLLM API gets HTTPS automatically.
ArgoCD deploys your model serving configuration from Git. New model version? Update the image tag, push, ArgoCD syncs.
Kyverno enforces policies on GPU pods. Resource requests required, image registries restricted, privileged containers blocked.
Velero backs up the configuration (not the model weights, those live in object storage).
Loki collects logs from GPU pods. When inference fails, you search logs the same way as any other workload.

This is the point: GPU workloads are just workloads. They need a GPU resource, a toleration, and proper observability. Everything else (networking, security, GitOps, backup, monitoring) is the same platform your other workloads already use.

Get started

If your team is exploring AI workloads and you are already on our managed Kubernetes platform, GPU node pools are an add-on, not a new platform. We configure the hardware, deploy the NVIDIA toolchain, set up the observability, and your team deploys models through the same ArgoCD workflow they use for everything else.

If you are evaluating options, schedule a call. We will look at your workload requirements (model size, latency targets, throughput needs) and design a GPU node pool configuration that fits.