June 11, 2026 | 9 min read

MCP Servers on Kubernetes: AI-Assisted Operations for Platform Teams

How we run MCP servers as production Kubernetes services - giving AI coding assistants access to Prometheus, Loki, Grafana, and cluster state for faster troubleshooting and true self-service on managed platforms.

By Jan Lauber

1. The architecture
2. The MCP servers we deploy
3. What this looks like for a developer
4. GitOps makes this work
5. Self-service layers
6. How we manage MCP servers
7. Get started

The Model Context Protocol (MCP) is changing how developers interact with infrastructure. Instead of switching between Grafana, Loki, kubectl, and GitLab to debug an issue, an AI assistant queries all of them in seconds and gives you a correlated answer.

Most teams run MCP servers locally, as dev tools on individual machines. That works for prototyping. But when you have 10 teams on a shared managed Kubernetes platform, each needing access to metrics, logs, and cluster state, local MCP servers do not scale. They need credentials, network access, and configuration that varies per cluster and per team.

We run MCP servers as production Kubernetes services. They are deployed via FluxCD, managed as part of the platform, and scoped to each team's namespace and permissions. This post explains the architecture, the MCP servers we deploy, and how this fits into the self-service model we have been building.

The architecture

[Architecture diagram: three tiers, top to bottom]

AI clients (developer machines): Claude Code, Cursor, VS Code Copilot, custom agents

MCP protocol (SSE/stdio)

MCP servers on Kubernetes (managed by Natron):
Prometheus MCP: query metrics
Loki MCP: search logs
Grafana MCP: dashboard links
Kubernetes MCP: cluster state
GitLab MCP: repos and pipelines
Alertmanager MCP: active alerts

Platform services (data sources): Prometheus, Loki, Grafana, Kubernetes API, GitLab API, Alertmanager, Vault

MCP servers run as Kubernetes Deployments inside the cluster, next to the platform services they connect to. Each MCP server is a thin API layer that translates MCP protocol requests into queries against the underlying service (PromQL for Prometheus, LogQL for Loki, Kubernetes API calls for cluster state).
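As a rough illustration of that translation layer (a sketch, not the production implementation), the snippet below maps a tool call onto a Prometheus instant-query URL. The tool name `query_metrics`, the argument shape, and the in-cluster address are assumptions made for the example. Only the `/api/v1/query` endpoint and its `query` parameter come from the Prometheus HTTP API.

```python
from urllib.parse import urlencode

# Assumed in-cluster address; replace with the real Prometheus service URL.
PROMETHEUS_URL = "http://prometheus.monitoring.svc:9090"


def handle_tool_call(name: str, arguments: dict) -> str:
    """Translate a hypothetical MCP tool call into a Prometheus request URL."""
    if name != "query_metrics":
        raise ValueError(f"unknown tool: {name}")
    promql = arguments["promql"]
    # Prometheus instant queries use GET /api/v1/query?query=<PromQL>.
    return f"{PROMETHEUS_URL}/api/v1/query?{urlencode({'query': promql})}"


print(handle_tool_call("query_metrics", {"promql": "up"}))
```

A real server would also send the request, handle errors, and shape the response for the client. The point here is only that the MCP layer itself stays thin.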

The AI client (Claude Code, Cursor, or a custom agent) connects to the MCP servers via the MCP protocol. The connection is authenticated and scoped: a developer on team-data can only query metrics and logs from namespaces they have access to. The same RBAC model that governs kubectl access and ArgoCD projects governs MCP access.
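One way to picture the scoping is a check that runs before any query is built. The sketch below assumes a hypothetical `TEAM_NAMESPACES` mapping. In a real deployment this information would come from cluster RBAC rather than a hard-coded dictionary.

```python
# Hypothetical mapping; in practice this would be derived from cluster RBAC.
TEAM_NAMESPACES = {
    "team-data": {"team-data", "team-data-staging"},
}


def authorize_namespace(team: str, namespace: str) -> str:
    """Return the namespace if the team may read it, otherwise raise."""
    allowed = TEAM_NAMESPACES.get(team, set())
    if namespace not in allowed:
        raise PermissionError(f"{team} has no read access to {namespace}")
    return namespace


def scoped_selector(team: str, namespace: str) -> str:
    """Build a label selector pinned to one authorized namespace."""
    ns = authorize_namespace(team, namespace)
    return f'{{namespace="{ns}"}}'


print(scoped_selector("team-data", "team-data"))
```

Because the selector is built on the server side, a client cannot widen its own scope by editing the query text.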

The MCP servers we deploy

Prometheus MCP is the most immediately useful. A developer asks "what is the error rate for my API?" and the AI agent translates that into a PromQL query, runs it against the cluster's Prometheus, and returns the result with context. No more hunting for the right Grafana dashboard or remembering PromQL syntax. This works for any metric we collect: request latency, pod resource usage, GPU utilization, custom application metrics.
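For the error-rate question, the generated PromQL might look like the output of the helper below. The metric name `http_requests_total` and the `namespace`, `service`, and `status` labels are assumptions. Actual names depend on how the application is instrumented.

```python
def error_rate_query(service: str, namespace: str, window: str = "5m") -> str:
    """Build PromQL for the ratio of 5xx requests to all requests."""
    # Label names below are illustrative, not a fixed convention.
    selector = f'namespace="{namespace}", service="{service}"'
    errors = f'sum(rate(http_requests_total{{{selector}, status=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{{selector}}}[{window}]))'
    return f"{errors} / {total}"


print(error_rate_query("api-service", "team-data"))
```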

Loki MCP gives AI assistants access to log search. "Show me errors from api-service in the last 30 minutes" becomes a LogQL query executed against the cluster's Loki instance. The AI can correlate log patterns with metric anomalies without the developer opening a single browser tab.
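The "last 30 minutes" request could be encoded for Loki's `query_range` endpoint roughly as follows. The `namespace` and `app` labels and the plain `"error"` line filter are assumptions for the sketch. The `query`, `start`, `end`, and `limit` parameters are part of the Loki HTTP API.

```python
from urllib.parse import urlencode


def loki_query_params(namespace: str, app: str, now_s: int, minutes: int = 30) -> str:
    """Encode a LogQL error search as query_range parameters."""
    # Stream selector plus a line filter; the label names are assumptions.
    logql = f'{{namespace="{namespace}", app="{app}"}} |= "error"'
    params = {
        "query": logql,
        # Loki accepts start/end as Unix timestamps in nanoseconds.
        "start": (now_s - minutes * 60) * 1_000_000_000,
        "end": now_s * 1_000_000_000,
        "limit": 100,
    }
    return urlencode(params)


print(loki_query_params("team-data", "api-service", now_s=1_700_000_000))
```

Passing `now_s` explicitly keeps the function deterministic. A caller would normally supply `int(time.time())`.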

Kubernetes MCP exposes cluster state: pods, events, deployments, resource quotas. When a developer asks "why is my pod pending?", the AI checks the pod status, node resources, events, and scheduling constraints in one query. The multi-step kubectl investigation that would take a human 5 minutes takes the AI about 5 seconds.
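Part of the "why is my pod pending?" check can be sketched against the pod object the Kubernetes API returns. The function below is a simplified illustration that only reads the `PodScheduled` condition. A fuller investigation would also look at events, quotas, and node capacity.

```python
def explain_pending(pod: dict) -> str:
    """Summarize why a pod is Pending from its status conditions."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return "pod is not pending"
    for cond in status.get("conditions", []):
        # The scheduler reports failures as PodScheduled=False with a reason.
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return f'{cond.get("reason", "Unknown")}: {cond.get("message", "")}'
    return "pending, but no scheduling failure recorded"


# Sample data shaped like a Kubernetes pod object; the message is made up.
sample_pod = {
    "status": {
        "phase": "Pending",
        "conditions": [
            {
                "type": "PodScheduled",
                "status": "False",
                "reason": "Unschedulable",
                "message": "0/3 nodes are available: 3 Insufficient memory.",
            }
        ],
    }
}
print(explain_pending(sample_pod))
```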

Grafana MCP provides dashboard links and annotation context. When the AI finds an anomaly in metrics, it can link to the relevant Grafana dashboard with the correct time range and namespace filter pre-applied. The developer gets a clickable link, not a wall of JSON.
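Such a link is mostly URL construction. The sketch below assumes a dashboard that defines a template variable named `namespace`, and the base URL, UID, and slug are placeholders. Grafana reads the time range from `from` and `to` (epoch milliseconds) and template variables from `var-<name>` parameters.

```python
from urllib.parse import urlencode


def dashboard_link(base_url: str, uid: str, slug: str,
                   namespace: str, start_ms: int, end_ms: int) -> str:
    """Build a Grafana dashboard URL with time range and namespace applied."""
    params = {"from": start_ms, "to": end_ms, "var-namespace": namespace}
    return f"{base_url}/d/{uid}/{slug}?{urlencode(params)}"


# Placeholder host, UID, and slug for illustration only.
print(dashboard_link("https://grafana.example.com", "abc123", "api-overview",
                     "team-data", 1_700_000_000_000, 1_700_001_800_000))
```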

GitLab MCP connects to the customer's Git repositories and CI pipelines. "What changed in the last deploy?" is answered by looking at the most recent merge request and its pipeline status. This closes the loop between "something is broken" and "what caused it".

Alertmanager MCP exposes active alerts and silences. The AI can check if there is already an alert firing for the issue the developer is investigating, or if the issue was recently silenced (meaning someone is already working on it).
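The check for existing or silenced alerts can be sketched over the list that Alertmanager's `/api/v2/alerts` endpoint returns. The sample data and the assumption that alerts carry a `namespace` label are illustrative. The `status.state` and `status.silencedBy` fields are part of the Alertmanager v2 API.

```python
def summarize_alerts(alerts: list, namespace: str) -> dict:
    """Split a namespace's alerts into firing and silenced alert names."""
    summary = {"firing": [], "silenced": []}
    for alert in alerts:
        labels = alert.get("labels", {})
        if labels.get("namespace") != namespace:
            continue
        status = alert.get("status", {})
        name = labels.get("alertname", "unknown")
        if status.get("silencedBy"):
            # A non-empty silencedBy list means an active silence matches.
            summary["silenced"].append(name)
        elif status.get("state") == "active":
            summary["firing"].append(name)
    return summary


# Made-up alerts shaped like Alertmanager v2 API objects.
sample_alerts = [
    {"labels": {"alertname": "HighErrorRate", "namespace": "team-data"},
     "status": {"state": "active", "silencedBy": []}},
    {"labels": {"alertname": "PodRestarting", "namespace": "team-data"},
     "status": {"state": "suppressed", "silencedBy": ["silence-id-1"]}},
    {"labels": {"alertname": "DiskFull", "namespace": "team-web"},
     "status": {"state": "active", "silencedBy": []}},
]
print(summarize_alerts(sample_alerts, "team-data"))
```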

What this looks like for a developer

Here is a real troubleshooting flow:

Developer: "My API is returning 500s since the last deploy"

AI Agent: Queries Prometheus MCP for the error rate of api-service

AI Agent: Queries Loki MCP for logs from api-service in the last 30 minutes

AI Agent: Queries Kubernetes MCP for recent pod restarts and events

AI Agent: Found OOMKilled after the deploy; the memory limit is too low for the new feature

Developer: Fixes the memory limit in values.yaml, pushes to Git, ArgoCD syncs

The developer types one sentence. The AI agent queries three platform services, correlates the data, identifies the root cause, and suggests a fix. The developer updates one line in their Helm values, pushes to Git, and ArgoCD syncs the change.
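The root-cause signal in this flow is visible in the pod's container statuses. As a minimal sketch, an agent (or a human) could detect it as shown below. The sample pod is made up, while the `lastState.terminated.reason` field and the `OOMKilled` value are what Kubernetes reports.

```python
def find_oom_kills(pod: dict) -> list:
    """Return names of containers whose last termination was an OOM kill."""
    killed = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        terminated = cs.get("lastState", {}).get("terminated") or {}
        if terminated.get("reason") == "OOMKilled":
            killed.append(cs["name"])
    return killed


# Made-up pod status for illustration.
restarted_pod = {
    "status": {
        "containerStatuses": [
            {"name": "api", "restartCount": 4,
             "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            {"name": "sidecar", "restartCount": 0, "lastState": {}},
        ]
    }
}
print(find_oom_kills(restarted_pod))
```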

Compare this to the traditional workflow:

Without MCP

1. Open Grafana, find the right dashboard
2. Open Loki, build a LogQL query
3. kubectl get pods, kubectl describe pod
4. kubectl logs pod-name --since=30m
5. Open GitLab, find the last merge request
6. Correlate across 4 browser tabs
7. Write the fix, push, wait for sync

15-30 minutes to root cause

With MCP

1. "My API is throwing 500s since the last deploy"
2. Agent queries metrics, logs, and cluster state
3. Agent correlates: OOMKilled after memory-heavy feature
4. Agent suggests: increase memory limit to 512Mi
5. Developer updates values.yaml, pushes
6. ArgoCD syncs automatically

2-5 minutes to root cause

The difference is not just speed. It is accessibility. A junior developer who does not know PromQL, LogQL, or kubectl debug commands can now troubleshoot production issues with the same effectiveness as a senior platform engineer. The AI agent knows the query languages. The developer knows the context.

GitOps makes this work

MCP servers are powerful for reading platform state. But the fix still needs to go through Git. This is where our GitOps architecture becomes critical.

When the AI agent suggests "increase memory limit to 512Mi", the developer does not SSH into a node or run kubectl edit. They update values.yaml in their Git repository, push, and ArgoCD reconciles. The change is audited, reviewed, and reproducible.

The structured data model of GitOps (Helm values, Kustomize overlays, plain YAML manifests) is what makes AI-assisted operations practical. The AI can read the current state from MCP, suggest a change in the deployment manifest, and the developer applies it through the same Git workflow they use for every other change. No special tooling, no ad-hoc kubectl commands, no configuration drift.
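To make the "suggest a change" step concrete, here is a sketch that turns a suggested memory limit into a reviewable diff instead of applying anything to the cluster. The values layout and the old limit of 256Mi are assumptions, and the textual replacement is deliberately naive. A real tool would parse the YAML and target the exact key.

```python
import difflib


def propose_memory_limit(values_text: str, old: str, new: str) -> str:
    """Return a unified diff that changes a memory value in values.yaml."""
    before = values_text.splitlines(keepends=True)
    # Naive line-level replacement; it changes every line matching the old value.
    after = [line.replace(f"memory: {old}", f"memory: {new}") for line in before]
    return "".join(
        difflib.unified_diff(before, after,
                             fromfile="a/values.yaml", tofile="b/values.yaml")
    )


# Hypothetical Helm values; only the 512Mi target comes from the example above.
values = (
    "resources:\n"
    "  requests:\n"
    "    memory: 128Mi\n"
    "  limits:\n"
    "    memory: 256Mi\n"
)
print(propose_memory_limit(values, "256Mi", "512Mi"))
```

The output is a patch the developer can review and commit, which keeps the change inside the normal Git workflow.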

This is the flywheel: MCP servers read platform state, AI correlates and suggests, GitOps applies the fix, and the platform reconciles. Every step is auditable, reversible, and team-scoped.

Self-service layers

MCP is the newest layer in our self-service model. Each layer serves a different need:

AI-Assisted (MCP): Natural language queries against platform data
Examples: "Show me error rates for team-data namespace", "Why is my pod pending?", "What changed in the last deploy?"

Self-Service (ArgoCD + GitOps): Structured deployments through Git
Examples: deploy new version via PR, scale replicas in values.yaml, add environment variables

Dashboards (Grafana): Pre-built views for every team
Examples: namespace resource usage, application error rates, deployment history

3rd-Level Support (Natron): Expert engineers for complex issues
Examples: network policy debugging, performance tuning, migration planning

The layers build on each other. Grafana dashboards give teams visibility. ArgoCD gives teams deployment autonomy. MCP gives teams troubleshooting autonomy. And when an issue exceeds what self-service can solve (complex networking problems, cross-namespace interactions, platform-level failures), our 3rd-level support team steps in with the experience of running this across our entire fleet.

How we manage MCP servers

MCP servers are platform components. They are deployed, monitored, and upgraded the same way as every other platform service:

Deployed via FluxCD from our internal platform repository. Customers do not manage MCP server lifecycle.
Monitored by Prometheus with dashboards in Grafana. We track query latency, error rates, and connection counts per team.
Secured via RBAC that mirrors the existing team permissions. If a team has read access to a namespace, the MCP server scopes queries to that namespace.
Updated continuously as the MCP ecosystem evolves. New MCP servers, protocol updates, and security patches are rolled out across the fleet.

This is the same long-term maintenance story as every other platform component. We track the ecosystem, evaluate new tools, and operate them so your teams can use them.

Get started

MCP servers are available as an add-on on our managed Kubernetes platform. If your teams are already using AI coding assistants (Claude Code, Cursor, GitHub Copilot), connecting them to your actual platform data through MCP servers is the next step.

Schedule a call to discuss which MCP servers make sense for your setup. We will look at your observability stack, your team structure, and your RBAC model, and deploy the right MCP servers scoped to each team.