Kubernetes Multi-Tenancy Done Right
The technical decisions behind our multi-tenant Kubernetes platform - why we chose Kyverno over OPA, Helm over custom operators, and how a single values.yaml provisions an entire tenant
We have been running multi-tenant Kubernetes clusters for enterprise customers since 2021. In that time, we went from manually provisioning namespaces and writing RBAC YAMLs by hand to a system where onboarding a new team is a single values.yaml file and a merge request.
This post walks through the technical decisions we made along the way, the things that broke, and what the architecture looks like today.
Where most setups go wrong
Almost every organization we work with starts the same way: someone creates a namespace, gives the team cluster-admin (or something close to it), and moves on. It works for two teams. By the time there are ten, the problems are everywhere.
The issue is not that people are lazy. It is that Kubernetes does not have a built-in concept of a "tenant". You get namespaces, RBAC, and resource quotas as separate primitives. Wiring them together consistently across teams, environments, and clusters is the actual engineering challenge.
We learned this the hard way. Early on, a customer's staging workload had no resource quotas. A memory leak in a Java application consumed 48 GB of RAM on a shared node, which triggered OOM kills across three production namespaces belonging to other teams. The fix took 20 minutes. Rebuilding trust took months.
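A namespace-level quota and default limits would have contained that leak. A minimal sketch, assuming illustrative names and sizes (not the customer's actual values):

```yaml
# Caps the total memory a namespace may request, so one leaking
# workload cannot starve co-located tenants on shared nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-staging   # illustrative namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.memory: 16Gi
---
# Per-container defaults, so pods that declare no limits still
# fall under the quota instead of running unbounded.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults
  namespace: team-staging
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi
      default:
        memory: 512Mi
```

With the quota in place, the leaking pod would have been OOM-killed inside its own namespace long before it touched 48 GB of a shared node.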
The architecture: layers, not monoliths
After a few iterations, we settled on a layered architecture. Each layer has one job and clear boundaries.
The key insight: the managed cluster is just the foundation. Everything above it is what makes the difference between "we have Kubernetes" and "we have a platform".
Layer 1 (Managed Cluster) handles the undifferentiated heavy lifting - node provisioning, API server, etcd, CNI. We run this on our own infrastructure (Natron Cloud) or on customer infrastructure via Flex Stack.
Layers 2-5 are where the tenancy model lives. And this is where our Helm-based toolkit comes in.
Why Helm and not a custom operator
We evaluated three approaches for tenant provisioning:
- Manual YAMLs in a Git repo. Works for five tenants. Falls apart at twenty. Every tenant needs 8-12 resources, and copy-paste errors are inevitable.
- Custom Kubernetes operator. A CRD like `Tenant` that reconciles all resources. Elegant in theory. In practice, you are now maintaining a Go codebase, handling upgrade paths, and debugging controller crashes at 3 AM.
- Helm chart + ArgoCD. One chart, one `values.yaml` per tenant, ArgoCD handles reconciliation. No custom code to maintain. The Helm templating language is ugly, but it is battle-tested and every platform engineer already knows it.
We chose option 3. The Helm chart templates out everything a tenant needs from a single values file:
The values.yaml for a tenant looks like this:

```yaml
tenant:
  name: team-data
  namespaces:
    - team-data-dev
    - team-data-staging
    - team-data-prod
  quotas:
    cpu: "8"
    memory: 16Gi
    storage: 100Gi
  rbac:
    clusterRole: namespace-admin
    groups:
      - "oidc:team-data-devs"
  networkPolicy: restricted
  registry:
    project: team-data
    allowList:
      - "registry.natron.io/team-data/**"
      - "docker.io/library/**"
  vault:
    path: kv/team-data/*
  kyverno:
    disallowPrivileged: true
    requireRunAsNonRoot: true
    requireReadOnlyRoot: true
```

That is the entire tenant definition. The Helm chart renders it into 15-20 Kubernetes resources: namespaces, resource quotas, limit ranges, role bindings, network policies, Kyverno policies, ClusterSecretStores, registry pull secrets, and more.
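To make the render step concrete, here is a hedged sketch of one of the chart's templates. The file path and label names are illustrative, not our actual chart; the point is that each entry in `tenant.namespaces` fans out into a Namespace plus its guardrails:

```yaml
# templates/namespaces.yaml (illustrative) - renders one Namespace
# and one ResourceQuota per entry in tenant.namespaces.
{{- range .Values.tenant.namespaces }}
apiVersion: v1
kind: Namespace
metadata:
  name: {{ . }}
  labels:
    tenant: {{ $.Values.tenant.name }}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: {{ . }}
spec:
  hard:
    requests.cpu: {{ $.Values.tenant.quotas.cpu | quote }}
    requests.memory: {{ $.Values.tenant.quotas.memory }}
    requests.storage: {{ $.Values.tenant.quotas.storage }}
---
{{- end }}
```

The `tenant` label is what later lets Kyverno policies and network policies select exactly this tenant's namespaces.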
No tickets. No manual steps. No "I forgot to add the network policy".
Why Kyverno over OPA Gatekeeper
This is probably the decision we get asked about most. Both are CNCF projects. Both do admission control. We went with Kyverno for three concrete reasons.
| OPA Gatekeeper | Kyverno |
| --- | --- |
| Policies in Rego (custom language) | Policies in YAML (native to K8s) |
| Separate ConstraintTemplates + Constraints | Single ClusterPolicy resource |
| Validation only (no mutation, no generation) | Validate + Mutate + Generate |
| Steep learning curve for platform teams | Platform teams already know YAML |
Reason 1: YAML-native policies. Platform teams already think in YAML. Asking them to learn Rego (OPA's policy language) creates a knowledge bottleneck. With Kyverno, a new policy looks like every other Kubernetes manifest they already write.
Reason 2: Mutation and generation. Kyverno does not just validate. It can mutate resources (inject labels, set defaults) and generate new resources (create a NetworkPolicy when a namespace is created). OPA Gatekeeper is validation-only. We use mutation heavily to enforce consistent labeling and inject sidecar configurations.
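As an example of the generate capability, a ClusterPolicy along these lines (a sketch, not our production policy) creates a default-deny NetworkPolicy whenever a new namespace appears:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-deny-on-namespace
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true   # re-create the policy if someone deletes it
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
```

With `synchronize: true`, Kyverno keeps the generated resource in place even if a tenant tries to remove it.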
Reason 3: Per-tenant scoping. Kyverno policies can be scoped to specific namespaces using label selectors. We template these selectors in the Helm chart, so each tenant gets exactly the policies defined in their values.yaml. A tenant that handles PCI data gets stricter image policies than an internal tooling team.
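The scoping itself is just a label selector. A hedged sketch of a validate rule restricted to one tenant's namespaces (the `tenant` label and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: team-data-require-nonroot
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
              # Only namespaces carrying this tenant's label are affected.
              namespaceSelector:
                matchLabels:
                  tenant: team-data
      validate:
        message: "Pods in team-data namespaces must run as non-root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```

In the Helm chart, the `matchLabels` block is templated from the tenant's values.yaml, so a PCI-handling tenant and an internal tooling team can carry entirely different rule sets.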
The trade-off: Kyverno uses more memory than OPA Gatekeeper in large clusters (100+ policies). We mitigate this by running Kyverno in HA mode with dedicated node pools.
How onboarding actually works
When a customer wants to add a new team to their platform, this is the actual workflow:
An engineer opens a merge request from a branch like `feat/onboard-team-data` into `main`, adding the new team's values.yaml. The merge request goes through three automated checks before it can merge:

- helm/template validates that the `values.yaml` renders valid Kubernetes manifests
- kyverno/validate runs the cluster's policy set against the rendered manifests in CI
- argocd/sync performs a dry-run sync to catch any conflicts with existing resources
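All three checks can be wired up with the standard CLIs. A hedged GitLab CI sketch, assuming illustrative job names, image, and repo paths (not our actual pipeline):

```yaml
validate-tenant:
  image: alpine/k8s:1.29.2   # illustrative image providing helm + kubectl
  script:
    # Render the tenant chart with the new values file.
    - helm template tenant ./charts/tenant -f tenants/team-data/values.yaml > rendered.yaml
    # Replay the cluster's Kyverno policy set against the rendered manifests.
    - kyverno apply policies/ --resource rendered.yaml
    # Compare the rendered chart against the live cluster to surface conflicts.
    - argocd app diff tenants-team-data --local ./charts/tenant
```

The key property is that policy violations fail in CI, before anything reaches the cluster.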
After merge, ArgoCD detects the change and syncs within 3 minutes. The new team has their namespaces, RBAC, network policies, secrets, and registry access. They can start deploying immediately.
If someone manually deletes a network policy or modifies a resource quota in the cluster, ArgoCD detects the drift and syncs back to the desired state defined in Git. We have seen this happen exactly once in production (an engineer who "just wanted to test something"). The self-healing kicked in within 90 seconds.
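The self-healing comes from ArgoCD's automated sync policy. A sketch of the per-tenant Application, with an illustrative repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-team-data
  namespace: argocd
spec:
  project: tenants
  source:
    repoURL: https://git.example.com/platform/tenants.git  # illustrative
    path: charts/tenant
    helm:
      valueFiles:
        - ../../tenants/team-data/values.yaml
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # delete resources that are removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

`selfHeal: true` is what reverted that engineer's manual change within 90 seconds.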
What we are still figuring out
This architecture is not perfect. A few things we are actively working on:
Cross-namespace communication. Some teams need to talk to each other. Our default-deny network policies block this. Today, we handle it with explicit exceptions in the Helm chart values. It works, but it does not scale well when you have 30 teams with complex dependency graphs. We are evaluating Cilium's cluster-wide network policies (CiliumClusterwideNetworkPolicy) for a more declarative approach.
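Today's explicit exceptions boil down to NetworkPolicies like this sketch, which lets one team's namespace accept traffic from another's (team names and labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-team-analytics
  namespace: team-data-prod
spec:
  podSelector: {}        # applies to all pods in team-data-prod
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Admit traffic only from namespaces labeled as the peer tenant.
        - namespaceSelector:
            matchLabels:
              tenant: team-analytics
```

Each such exception is one more entry in a values file, which is exactly why the approach strains under a 30-team dependency graph.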
Tenant-scoped observability. Each tenant gets their own Grafana dashboards and alert rules, but the underlying Prometheus is shared. At scale, cardinality becomes a problem. We are looking at tenant-level metric isolation via Thanos or Mimir multi-tenancy features.
Cost attribution. Quotas tell you what a team is allowed to use, not what they actually use. We are building integration with OpenCost to give each tenant visibility into their actual resource consumption.
Explore further
We have documented the full architecture with interactive diagrams on our Platform Design page. It covers the nested isolation model, the Helm render flow, and the GitOps onboarding workflow in detail.
If you are building a multi-tenant Kubernetes platform, or struggling with the one you have, we are happy to talk through the specifics. Schedule a call and bring your architecture diagrams. This is a design engagement, not a sales pitch.

About the author
Jan Lauber
Cloud Engineer and Partner at Natron Tech, building multi-tenant Kubernetes platforms for enterprise organizations across Switzerland.
“A new team should be a pull request, not a support ticket.”