March 14, 2026 | 9 min read

Kubernetes Multi-Tenancy Done Right

The technical decisions behind our multi-tenant Kubernetes platform - why we chose Kyverno over OPA, Helm over custom operators, and how a single values.yaml provisions an entire tenant

By Jan Lauber

We have been running multi-tenant Kubernetes clusters for enterprise customers since 2021. In that time, we went from manually provisioning namespaces and writing RBAC YAMLs by hand to a system where onboarding a new team is a single values.yaml file and a merge request.

This post walks through the technical decisions we made along the way, the things that broke, and what the architecture looks like today.

Where most setups go wrong

Almost every organization we work with starts the same way: someone creates a namespace, gives the team cluster-admin (or something close to it), and moves on. It works for two teams. By the time there are ten, the problems are everywhere.

Typical setup (shared cluster): team-alpha, team-beta, and team-gamma share one cluster with no resource quotas, no network policies, no image restrictions, and manually maintained RBAC.

With the Tenancy Toolkit (managed cluster + guardrails): each tenant - team-alpha, team-beta, team-gamma - gets its own quotas, network policies, RBAC, and Kyverno policies. Concretely:

  - Enforced quotas and limits
  - Default-deny network policies
  - Image allow-lists per tenant
  - RBAC rendered from the Helm chart

The issue is not that people are lazy. It is that Kubernetes does not have a built-in concept of a "tenant". You get namespaces, RBAC, and resource quotas as separate primitives. Wiring them together consistently across teams, environments, and clusters is the actual engineering challenge.

We learned this the hard way. Early on, a customer's staging workload had no resource quotas. A memory leak in a Java application consumed 48 GB of RAM on a shared node, which triggered OOM kills across three production namespaces belonging to other teams. The fix took 20 minutes. Rebuilding trust took months.
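The guardrail that would have contained that leak is plain Kubernetes: a hard memory ceiling per namespace, plus default limits for containers that declare none. A minimal sketch (namespace name and numbers are illustrative, not the customer's actual values):

```yaml
# Hard memory ceiling for the namespace - a leaking pod hits the quota,
# not the shared node.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: team-alpha-staging   # hypothetical namespace
spec:
  hard:
    requests.memory: 8Gi
    limits.memory: 16Gi
---
# Default limits for containers that do not declare any, so nothing
# in the namespace runs unbounded.
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-alpha-staging
spec:
  limits:
    - type: Container
      default:
        memory: 512Mi        # applied when a container omits limits
      defaultRequest:
        memory: 256Mi
```

With both in place, the Java application would have been OOM-killed inside its own namespace instead of taking down neighbors.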

The architecture: layers, not monoliths

After a few iterations, we settled on a layered architecture. Each layer has one job and clear boundaries.

  - L1 - Managed Cluster: Kubernetes API, node pools, CNI
  - L2 - Tenant Isolation: namespaces, quotas, network policies
  - L3 - RBAC & Authentication: OIDC/SSO, role bindings, service accounts
  - L4 - Smart Guardrails: Kyverno admission, image policies, pod security
  - L5 - Platform Services: secrets, registry, certificates
  - L6 - Observability & Audit: monitoring, logging, alerting across tenants

The key insight: the managed cluster is just the foundation. Everything above it is what makes the difference between "we have Kubernetes" and "we have a platform".

Layer 1 (Managed Cluster) handles the undifferentiated heavy lifting - node provisioning, API server, etcd, CNI. We run this on our own infrastructure (Natron Cloud) or on customer infrastructure via Flex Stack.

Layers 2-5 are where the tenancy model lives. And this is where our Helm-based toolkit comes in.

Why Helm and not a custom operator

We evaluated three approaches for tenant provisioning:

  1. Manual YAMLs in a Git repo. Works for five tenants. Falls apart at twenty. Every tenant needs 8-12 resources, and copy-paste errors are inevitable.
  2. Custom Kubernetes operator. A CRD like Tenant that reconciles all resources. Elegant in theory. In practice, you are now maintaining a Go codebase, handling upgrade paths, and debugging controller crashes at 3 AM.
  3. Helm chart + ArgoCD. One chart, one values.yaml per tenant, ArgoCD handles reconciliation. No custom code to maintain. The Helm templating language is ugly, but it is battle-tested and every platform engineer already knows it.

We chose option 3. The Helm chart templates out everything a tenant needs from a single values file:

values.yaml (tenant definition) → Tenant Helm chart (template rendering) → ArgoCD (sync to cluster), which produces:

  - Namespaces
  - RBAC & RoleBindings
  - Network policies
  - Kyverno policies
  - Vault SecretStores
  - Registry pull secrets

The values.yaml for a tenant looks like this:

tenant:
  name: team-data
  namespaces:
    - team-data-dev
    - team-data-staging
    - team-data-prod
  quotas:
    cpu: "8"
    memory: 16Gi
    storage: 100Gi
  rbac:
    clusterRole: namespace-admin
    groups:
      - "oidc:team-data-devs"
  networkPolicy: restricted
  registry:
    project: team-data
    allowList:
      - "registry.natron.io/team-data/**"
      - "docker.io/library/**"
  vault:
    path: kv/team-data/*
  kyverno:
    disallowPrivileged: true
    requireRunAsNonRoot: true
    requireReadOnlyRoot: true

That is the entire tenant definition. The Helm chart renders it into 15-20 Kubernetes resources: namespaces, resource quotas, limit ranges, role bindings, network policies, Kyverno policies, ClusterSecretStores, registry pull secrets, and more.
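We do not reproduce the chart internals here, but the rendering pattern is ordinary Helm: loop over the tenant's namespaces and stamp out one resource per namespace. A sketch of what the ResourceQuota template could look like (the template path and exact field mapping are assumptions):

```yaml
# templates/resourcequota.yaml (hypothetical path in the tenant chart)
# Renders one ResourceQuota per tenant namespace from .Values.tenant.
{{- range .Values.tenant.namespaces }}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: {{ $.Values.tenant.name }}-quota
  namespace: {{ . }}
spec:
  hard:
    requests.cpu: {{ $.Values.tenant.quotas.cpu | quote }}
    requests.memory: {{ $.Values.tenant.quotas.memory }}
    requests.storage: {{ $.Values.tenant.quotas.storage }}
{{- end }}
```

Inside the `range`, `.` is the current namespace and `$` reaches back to the chart root - the same pattern repeats for role bindings, network policies, and the rest.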

No tickets. No manual steps. No "I forgot to add the network policy".

Why Kyverno over OPA Gatekeeper

This is probably the decision we get asked about most. Both are CNCF projects. Both do admission control. We went with Kyverno for three concrete reasons.

OPA Gatekeeper

Policies in Rego (custom language)

Separate ConstraintTemplates + Constraints

Validation only (no mutation, no generation)

Steep learning curve for platform teams

# Rego policy
violation[{"msg": msg}] {
  input.review.object.spec.containers[_].securityContext.privileged == true
  msg := "privileged not allowed"
}
Kyverno (our choice)

Policies in YAML (native to K8s)

Single ClusterPolicy resource

Validate + Mutate + Generate

Platform teams already know YAML

# Kyverno policy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  rules:
    - name: disallow-privileged
      match:
        resources:
          kinds: [Pod]
      validate:
        deny:
          conditions:
            - key: privileged
              operator: Equals
              value: true

Reason 1: YAML-native policies. Platform teams already think in YAML. Asking them to learn Rego (OPA's policy language) creates a knowledge bottleneck. With Kyverno, a new policy looks like every other Kubernetes manifest they already write.

Reason 2: Mutation and generation. Kyverno does not just validate. It can mutate resources (inject labels, set defaults) and generate new resources (create a NetworkPolicy when a namespace is created). OPA Gatekeeper is validation-only. We use mutation heavily to enforce consistent labeling and inject sidecar configurations.
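As an illustration of generation (a sketch in the style of the Kyverno documentation samples, not our production policy): a ClusterPolicy that creates a default-deny NetworkPolicy whenever a new namespace appears.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-deny        # illustrative name
spec:
  rules:
    - name: generate-default-deny
      match:
        resources:
          kinds: [Namespace]
      generate:
        # Resource to create in the newly matched namespace
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: "{{request.object.metadata.name}}"
        data:
          spec:
            podSelector: {}               # selects all pods
            policyTypes: [Ingress, Egress]  # deny both directions by default
```

With OPA Gatekeeper, this would require a separate controller; with Kyverno it is one more rule in the policy set.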

Reason 3: Per-tenant scoping. Kyverno policies can be scoped to specific namespaces using label selectors. We template these selectors in the Helm chart, so each tenant gets exactly the policies defined in their values.yaml. A tenant that handles PCI data gets stricter image policies than an internal tooling team.
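Scoping works through match selectors. A simplified sketch of a policy that only applies to namespaces carrying a tenant label (the label key is an assumption - in our chart both the label and the policy are templated per tenant):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: team-data-require-nonroot   # rendered per tenant in practice
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-run-as-non-root
      match:
        resources:
          kinds: [Pod]
          namespaceSelector:
            matchLabels:
              tenant: team-data     # label applied by the tenant chart (assumed key)
      validate:
        message: "Containers must run as non-root."
        pattern:
          spec:
            =(securityContext):
              =(runAsNonRoot): "true"
```

Because the selector is templated, tightening a single tenant's posture is a values.yaml change, not a cluster-wide policy rewrite.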

The trade-off: Kyverno uses more memory than OPA Gatekeeper in large clusters (100+ policies). We mitigate this by running Kyverno in HA mode with dedicated node pools.

How onboarding actually works

When a customer wants to add a new team to their platform, this is the actual workflow:

Merge request #287 - feat: onboard team-data with full isolation
Branch: feat/onboard-team-data → main
Changed file: tenants/team-data/values.yaml (+26 lines - the tenant definition shown earlier)

Checks: helm/template, kyverno/validate, argocd/sync, vault/secrets
Status: Merged

The merge request goes through automated checks before it can merge:

  1. helm/template validates that the values.yaml renders valid Kubernetes manifests
  2. kyverno/validate runs the cluster's policy set against the rendered manifests in CI
  3. argocd/sync performs a dry-run sync to catch any conflicts with existing resources
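These checks map to ordinary CLI invocations. A sketch of what the pipeline could look like in GitLab CI (job names, chart paths, and app names are assumptions; `kyverno apply` is the Kyverno CLI command for testing policies against rendered manifests):

```yaml
# .gitlab-ci.yml (sketch - paths and names are illustrative)
stages: [validate]

helm-template:
  stage: validate
  script:
    # Render the tenant chart; helm fails on invalid templates
    - helm template tenant ./charts/tenant -f tenants/team-data/values.yaml > rendered.yaml

kyverno-validate:
  stage: validate
  script:
    - helm template tenant ./charts/tenant -f tenants/team-data/values.yaml > rendered.yaml
    # Run the cluster's policy set against the rendered manifests
    - kyverno apply policies/ --resource rendered.yaml

argocd-dry-run:
  stage: validate
  script:
    # Dry-run sync to surface conflicts with existing cluster state
    - argocd app sync tenant-team-data --dry-run
```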

After merge, ArgoCD detects the change and syncs within 3 minutes. The new team has their namespaces, RBAC, network policies, secrets, and registry access. They can start deploying immediately.

If someone manually deletes a network policy or modifies a resource quota in the cluster, ArgoCD detects the drift and syncs back to the desired state defined in Git. We have seen this happen exactly once in production (an engineer who "just wanted to test something"). The self-healing kicked in within 90 seconds.
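The self-healing comes from ArgoCD's automated sync policy on the per-tenant Application. A sketch of what that Application could look like (repo URL and layout are assumptions; how the per-tenant values file is wired in varies by repo structure):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-team-data
  namespace: argocd
spec:
  project: tenants
  source:
    repoURL: https://git.example.com/platform/tenants.git  # assumed URL
    path: charts/tenant
    helm:
      valueFiles:
        - values.yaml   # illustrative; per-tenant values wiring varies
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # remove resources deleted from Git
      selfHeal: true   # revert manual changes made in the cluster
```

`selfHeal: true` is what reverted that engineer's "test" within 90 seconds; `prune: true` is what removes resources when a tenant is offboarded.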

What we are still figuring out

This architecture is not perfect. A few things we are actively working on:

Cross-namespace communication. Some teams need to talk to each other. Our default-deny network policies block this. Today, we handle it with explicit exceptions in the Helm chart values. It works, but it does not scale well when you have 30 teams with complex dependency graphs. We are evaluating Cilium's cluster-wide network policies (CiliumClusterwideNetworkPolicy) for a more declarative approach.

Tenant-scoped observability. Each tenant gets their own Grafana dashboards and alert rules, but the underlying Prometheus is shared. At scale, cardinality becomes a problem. We are looking at tenant-level metric isolation via Thanos or Mimir multi-tenancy features.

Cost attribution. Quotas tell you what a team is allowed to use, not what they actually use. We are building integration with OpenCost to give each tenant visibility into their actual resource consumption.

Explore further

We have documented the full architecture with interactive diagrams on our Platform Design page. It covers the nested isolation model, the Helm render flow, and the GitOps onboarding workflow in detail.

If you are building a multi-tenant Kubernetes platform, or struggling with the one you have, we are happy to talk through the specifics. Schedule a call and bring your architecture diagrams. This is a design engagement, not a sales pitch.


About the author

Jan Lauber

Cloud Engineer and Partner at Natron Tech, building multi-tenant Kubernetes platforms for enterprise organizations across Switzerland.

A new team should be a pull request, not a support ticket.
