Kubernetes Multi-Tenancy Done Right
The technical decisions behind our multi-tenant Kubernetes platform - why we chose Kyverno over OPA, Helm over custom operators, and how a single values.yaml provisions an entire tenant
We have been running multi-tenant Kubernetes clusters for enterprise customers since 2021. In that time, we went from manually provisioning namespaces and writing RBAC YAMLs by hand to a system where onboarding a new team is a single values.yaml file and a merge request.
This post walks through the technical decisions we made along the way, the things that broke, and what the architecture looks like today.
Where most setups go wrong
Almost every organization we work with starts the same way: someone creates a namespace, gives the team cluster-admin (or something close to it), and moves on. It works for two teams. By the time there are ten, the problems are everywhere.
The issue is not that people are lazy. It is that Kubernetes does not have a built-in concept of a "tenant". You get namespaces, RBAC, and resource quotas as separate primitives. Wiring them together consistently across teams, environments, and clusters is the actual engineering challenge.
We learned this the hard way. Early on, a customer's staging workload had no resource quotas. A memory leak in a Java application consumed 48 GB of RAM on a shared node, which triggered OOM kills across three production namespaces belonging to other teams. The fix took 20 minutes. Rebuilding trust took months.
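A namespace-level quota and default limits would have contained that leak. A minimal sketch, assuming illustrative names and sizes (not the customer's actual values):

```yaml
# Caps the total memory a namespace may request, so one leaking
# workload cannot starve co-located tenants on shared nodes.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-staging   # illustrative namespace
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.memory: 16Gi
---
# Per-container defaults, so pods that declare no limits still
# fall under the quota instead of running unbounded.
apiVersion: v1
kind: LimitRange
metadata:
  name: team-defaults
  namespace: team-staging
spec:
  limits:
    - type: Container
      defaultRequest:
        memory: 256Mi
      default:
        memory: 512Mi
```

With the quota in place, the leaking pod would have been OOM-killed inside its own namespace long before it touched 48 GB of a shared node.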
The architecture: layers, not monoliths
After a few iterations, we settled on a layered architecture. Each layer has one job and clear boundaries.
The key insight: the managed cluster is just the foundation. Everything above it is what makes the difference between "we have Kubernetes" and "we have a platform".
Layer 1 (Managed Cluster) handles the undifferentiated heavy lifting - node provisioning, API server, etcd, CNI. We run this on our own infrastructure (Natron Cloud) or on customer infrastructure via Flex Stack.
Layers 2-5 are where the tenancy model lives. And this is where our Helm-based toolkit comes in.
Why Helm and not a custom operator
We evaluated three approaches for tenant provisioning:
- Manual YAMLs in a Git repo. Works for five tenants. Falls apart at twenty. Every tenant needs 8-12 resources, and copy-paste errors are inevitable.
- Custom Kubernetes operator. A CRD like `Tenant` that reconciles all resources. Elegant in theory. In practice, you are now maintaining a Go codebase, handling upgrade paths, and debugging controller crashes at 3 AM.
- Helm chart + ArgoCD. One chart, one `values.yaml` per tenant, ArgoCD handles reconciliation. No custom code to maintain. The Helm templating language is ugly, but it is battle-tested and every platform engineer already knows it.
We chose option 3. The Helm chart templates out everything a tenant needs from a single values file:
The values.yaml for a tenant looks like this:

```yaml
tenant:
  name: team-data
  namespaces:
    - team-data-dev
    - team-data-staging
    - team-data-prod
  quotas:
    cpu: "8"
    memory: 16Gi
    storage: 100Gi
  rbac:
    clusterRole: namespace-admin
    groups:
      - "oidc:team-data-devs"
  networkPolicy: restricted
  registry:
    project: team-data
    allowList:
      - "registry.natron.io/team-data/**"
      - "docker.io/library/**"
  vault:
    path: kv/team-data/*
  kyverno:
    disallowPrivileged: true
    requireRunAsNonRoot: true
    requireReadOnlyRoot: true
```

That is the entire tenant definition. The Helm chart renders it into 15-20 Kubernetes resources: namespaces, resource quotas, limit ranges, role bindings, network policies, Kyverno policies, ClusterSecretStores, registry pull secrets, and more.
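To make the render step concrete, here is a hedged sketch of one of the chart's templates. The file path and label names are illustrative, not our actual chart; the point is that each entry in `tenant.namespaces` fans out into a Namespace plus its guardrails:

```yaml
# templates/namespaces.yaml (illustrative) - renders one Namespace
# and one ResourceQuota per entry in tenant.namespaces.
{{- range .Values.tenant.namespaces }}
apiVersion: v1
kind: Namespace
metadata:
  name: {{ . }}
  labels:
    tenant: {{ $.Values.tenant.name }}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-quota
  namespace: {{ . }}
spec:
  hard:
    requests.cpu: {{ $.Values.tenant.quotas.cpu | quote }}
    requests.memory: {{ $.Values.tenant.quotas.memory }}
    requests.storage: {{ $.Values.tenant.quotas.storage }}
---
{{- end }}
```

The `tenant` label is what later lets Kyverno policies and network policies select exactly this tenant's namespaces.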
No tickets. No manual steps. No "I forgot to add the network policy".
Why Kyverno over OPA Gatekeeper
This is probably the decision we get asked about most. Both are CNCF projects. Both do admission control. We went with Kyverno for three concrete reasons.
| OPA Gatekeeper | Kyverno |
| --- | --- |
| Policies in Rego (custom language) | Policies in YAML (native to K8s) |
| Separate ConstraintTemplates + Constraints | Single ClusterPolicy resource |
| Validation only (no mutation, no generation) | Validate + Mutate + Generate |
| Steep learning curve for platform teams | Platform teams already know YAML |
Reason 1: YAML-native policies. Platform teams already think in YAML. Asking them to learn Rego (OPA's policy language) creates a knowledge bottleneck. With Kyverno, a new policy looks like every other Kubernetes manifest they already write.
Reason 2: Mutation and generation. Kyverno does not just validate. It can mutate resources (inject labels, set defaults) and generate new resources (create a NetworkPolicy when a namespace is created). OPA Gatekeeper is validation-only. We use mutation heavily to enforce consistent labeling and inject sidecar configurations.
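As an example of the generate capability, a ClusterPolicy along these lines (a sketch, not our production policy) creates a default-deny NetworkPolicy whenever a new namespace appears:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: default-deny-on-namespace
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        synchronize: true   # re-create the policy if someone deletes it
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
```

With `synchronize: true`, Kyverno keeps the generated resource in place even if a tenant tries to remove it.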
Reason 3: Per-tenant scoping. Kyverno policies can be scoped to specific namespaces using label selectors. We template these selectors in the Helm chart, so each tenant gets exactly the policies defined in their values.yaml. A tenant that handles PCI data gets stricter image policies than an internal tooling team.
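The scoping itself is just a label selector. A hedged sketch of a validate rule restricted to one tenant's namespaces (the `tenant` label and message are illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: team-data-require-nonroot
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds:
                - Pod
              # Only namespaces carrying this tenant's label are affected.
              namespaceSelector:
                matchLabels:
                  tenant: team-data
      validate:
        message: "Pods in team-data namespaces must run as non-root."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```

In the Helm chart, the `matchLabels` block is templated from the tenant's values.yaml, so a PCI-handling tenant and an internal tooling team can carry entirely different rule sets.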
The trade-off: Kyverno uses more memory than OPA Gatekeeper in large clusters (100+ policies). We mitigate this by running Kyverno in HA mode with dedicated node pools.
How onboarding actually works
When a customer wants to add a new team to their platform, this is the actual workflow:
An engineer opens a merge request from a branch like `feat/onboard-team-data` into `main`, adding the new team's values.yaml. The merge request goes through three automated checks before it can merge:

- helm/template validates that the `values.yaml` renders valid Kubernetes manifests
- kyverno/validate runs the cluster's policy set against the rendered manifests in CI
- argocd/sync performs a dry-run sync to catch any conflicts with existing resources
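All three checks can be wired up with the standard CLIs. A hedged GitLab CI sketch, assuming illustrative job names, image, and repo paths (not our actual pipeline):

```yaml
validate-tenant:
  image: alpine/k8s:1.29.2   # illustrative image providing helm + kubectl
  script:
    # Render the tenant chart with the new values file.
    - helm template tenant ./charts/tenant -f tenants/team-data/values.yaml > rendered.yaml
    # Replay the cluster's Kyverno policy set against the rendered manifests.
    - kyverno apply policies/ --resource rendered.yaml
    # Compare the rendered chart against the live cluster to surface conflicts.
    - argocd app diff tenants-team-data --local ./charts/tenant
```

The key property is that policy violations fail in CI, before anything reaches the cluster.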
After merge, ArgoCD detects the change and syncs within 3 minutes. The new team has their namespaces, RBAC, network policies, secrets, and registry access. They can start deploying immediately.
If someone manually deletes a network policy or modifies a resource quota in the cluster, ArgoCD detects the drift and syncs back to the desired state defined in Git. We have seen this happen exactly once in production (an engineer who "just wanted to test something"). The self-healing kicked in within 90 seconds.
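The self-healing comes from ArgoCD's automated sync policy. A sketch of the per-tenant Application, with an illustrative repo URL and paths:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-team-data
  namespace: argocd
spec:
  project: tenants
  source:
    repoURL: https://git.example.com/platform/tenants.git  # illustrative
    path: charts/tenant
    helm:
      valueFiles:
        - ../../tenants/team-data/values.yaml
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # delete resources that are removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

`selfHeal: true` is what reverted that engineer's manual change within 90 seconds.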
What we are still figuring out
This architecture is not perfect. A few things we are actively working on:
Cross-namespace communication. Some teams need to talk to each other. Our default-deny network policies block this. Today, we handle it with explicit exceptions in the Helm chart values. It works, but it does not scale well when you have 30 teams with complex dependency graphs. We are evaluating Cilium's cluster-wide network policies (CiliumClusterwideNetworkPolicy) for a more declarative approach.
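Today's explicit exceptions boil down to NetworkPolicies like this sketch, which lets one team's namespace accept traffic from another's (team names and labels are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-team-analytics
  namespace: team-data-prod
spec:
  podSelector: {}        # applies to all pods in team-data-prod
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Admit traffic only from namespaces labeled as the peer tenant.
        - namespaceSelector:
            matchLabels:
              tenant: team-analytics
```

Each such exception is one more entry in a values file, which is exactly why the approach strains under a 30-team dependency graph.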
Tenant-scoped observability. Each tenant gets their own Grafana dashboards and alert rules, but the underlying Prometheus is shared. At scale, cardinality becomes a problem. We are looking at tenant-level metric isolation via Thanos or Mimir multi-tenancy features.
Cost attribution. Quotas tell you what a team is allowed to use, not what they actually use. We are building integration with OpenCost to give each tenant visibility into their actual resource consumption.
Explore further
We have documented the full architecture with interactive diagrams on our Platform Design page. It covers the nested isolation model, the Helm render flow, and the GitOps onboarding workflow in detail.
If you are building a multi-tenant Kubernetes platform, or struggling with the one you have, we are happy to talk through the specifics. Schedule a call and bring your architecture diagrams. This is a design engagement, not a sales pitch.

About the author
Jan Lauber
Cloud Engineer and Partner at Natron Tech, building multi-tenant Kubernetes platforms for enterprise organizations across Switzerland.
“A new team should be a pull request, not a support ticket.”