
K8s MCP Server

An MCP server for comprehensive access to one or more Kubernetes clusters, with AI-friendly port-forward management that supports headless services.

Implementation Complete - All core features have been implemented.

Features

  • kubectl-Inspired Output Formats: Multiple output formats (table, wide, json, yaml, custom-columns, jsonpath) for efficient token usage
  • Resource Kind Discovery: Keyword-based queries to find Custom Resource Definitions and built-in resource kinds
  • URL-based Configuration: Share port-forward configs via GitHub Gists or local files
  • Headless Service Support: Direct pod targeting via label selectors
  • AI-Optimized: Clean service/target structure for intelligent fuzzy matching
  • Multi-format Support: YAML, JSON, TOML configuration files
  • kubectl-like Behavior: Selects first ready pod, same as kubectl port-forward
  • Session Management: In-memory port-forward tracking with list and teardown capabilities
  • Parameterless Discovery: Easy cluster, alias, and service discovery for AI agents

Quick Start

1. Install Dependencies

pip install -e .

2. Set Environment Variables

# Required: Kubernetes cluster configurations
export KUBECONFIG_URLS="dev=file:///path/to/dev.yaml prod=file:///path/to/prod.yaml"

# Optional: Cluster aliases
export KUBECONFIG_ALIASES="development=dev production=prod"

# Required: Port-forward configuration URL
export PORT_FORWARD_CONFIG_URL="file://$(pwd)/example-port-forward-config.yaml"
# Or use a GitHub Gist:
# export PORT_FORWARD_CONFIG_URL="https://gist.githubusercontent.com/user/abc123/raw/port-forward.yaml"
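
The config loader treats KUBECONFIG_URLS and KUBECONFIG_ALIASES as space-separated name=value pairs. As a rough sketch of how these variables might be parsed (helper names are illustrative, not the server's actual API):

import os

def parse_pairs(raw: str) -> dict[str, str]:
    """Parse space-separated name=value pairs, e.g. 'dev=file:///dev.yaml prod=file:///prod.yaml'."""
    pairs = {}
    for token in raw.split():
        name, sep, value = token.partition("=")
        if not sep or not name or not value:
            raise ValueError(f"Malformed entry: {token!r}")
        pairs[name] = value
    return pairs

kubeconfig_urls = parse_pairs(os.environ["KUBECONFIG_URLS"])
aliases = parse_pairs(os.environ.get("KUBECONFIG_ALIASES", ""))
# Resolve an alias first, then the kubeconfig URL: "development" -> "dev" -> file:///path/to/dev.yaml
cluster = aliases.get("development", "development")
print(kubeconfig_urls[cluster])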

3. Run the Server

python main.py

Usage Examples

AI Agent Interactions

User: "Setup port-forward to the FRET Druid in development"
AI Agent:
1. Calls discover_port_forward_services() to understand available services
2. Maps "FRET Druid in development" → service="druid", target="fret-dev"  
3. Calls connect_to_service_targets("druid", ["fret-dev"])
4. Returns: "✅ Connected to fret-dev Druid: http://localhost:8082 → pod/druid-router-0"

Direct Tool Calls

  • connect_to_service_targets("druid", ["fret-dev"]) - Connect to specific target
  • connect_to_service_targets("druid", []) - Connect to all Druid environments
  • list_resources("prod", "pods", labels={"app": "nginx"}) - List pods with labels
  • get_resource("dev", "deployment", "my-app", "default") - Get specific resource
  • discover_resource_kinds("dev", ["druid", "ingestion"]) - Find resource kinds by keywords
  • list_resources("dev", "pods", output_format="custom-columns", custom_columns="NAME:.metadata.name,IP:.status.podIP") - Custom output
  • list_port_forwards() - Show active port-forward sessions
  • teardown_port_forward("forward-id") - Stop a port-forward session
  • Supports kubectl naming variants: "pods", "pod", "po" all work
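
These tool calls are made by whichever MCP client hosts the server. As a hedged illustration of driving them programmatically with the MCP Python SDK (the launch command follows the Quick Start above; the argument names are assumptions, since the exact tool schemas are defined by the server):

import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main() -> None:
    # Launch the server over stdio, as an MCP host such as Claude Desktop would.
    server = StdioServerParameters(command="python", args=["main.py"])
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            # List pods labelled app=nginx in the "prod" cluster.
            result = await session.call_tool(
                "list_resources",
                arguments={"cluster": "prod", "kind": "pods", "labels": {"app": "nginx"}},
            )
            print(result.content)

asyncio.run(main())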

Output Format Examples

Token-Efficient Resource Listing

# Default table format (kubectl-like)
list_resources(cluster="dev", kind="pods", namespace="fret-dev")
# Returns clean table: NAME | READY | STATUS | RESTARTS | AGE

Custom Columns for Specific Data

# Extract specific fields with JSONPath
list_resources(
    cluster="dev", 
    kind="services",
    output_format="custom-columns", 
    custom_columns="NAME:.metadata.name,TYPE:.spec.type,PORTS:.spec.ports[*].port"
)
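
A custom_columns spec is a comma-separated list of HEADER:JSONPATH pairs, as in kubectl. The sketch below shows one way such a spec could be split and evaluated; the jsonpath-ng library is used here only for illustration, while the server ships its own JSONPath helper in utils/jsonpath.py.

from jsonpath_ng.ext import parse  # pip install jsonpath-ng

def split_custom_columns(spec: str) -> list[tuple[str, str]]:
    """Split 'NAME:.metadata.name,TYPE:.spec.type' into (header, jsonpath) pairs."""
    columns = []
    for part in spec.split(","):
        header, _, path = part.partition(":")
        columns.append((header.strip(), "$" + path.strip()))
    return columns

def render_row(obj: dict, columns: list[tuple[str, str]]) -> list[str]:
    """Evaluate each column's JSONPath against one resource object."""
    row = []
    for _, path in columns:
        values = [match.value for match in parse(path).find(obj)]
        row.append(",".join(str(v) for v in values) or "<none>")
    return row

columns = split_custom_columns("NAME:.metadata.name,TYPE:.spec.type")
service = {"metadata": {"name": "web"}, "spec": {"type": "ClusterIP"}}
print(render_row(service, columns))  # ['web', 'ClusterIP']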

Resource Kind Discovery

# Find resource kinds by keywords
discover_resource_kinds(
    cluster="dev",
    keywords=["druid", "ingestion"],
    match_mode="OR"
)
# Returns: DruidIngestion resource kind info with all naming variants

Resource Naming Conventions

The K8s MCP server follows kubectl naming conventions:

Resource Kind Support

  • Plural forms: pods, services, deployments
  • Singular forms: pod, service, deployment
  • Abbreviations: po, svc, deploy, sts (for StatefulSet)
  • Case-insensitive: Pod, POD, and pod all work (see the normalization sketch below)
  • discover_* tools find available resource kinds/types
  • search_* and list_* tools find specific instances
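
One plausible way to implement this normalization is to index every accepted spelling of a kind under a single canonical name, roughly as sketched below (the table and helper are illustrative, not the server's internals):

# Map every accepted spelling to a canonical kind, mirroring kubectl's behavior.
KIND_VARIANTS = {
    "Pod": ["pods", "pod", "po"],
    "Service": ["services", "service", "svc"],
    "Deployment": ["deployments", "deployment", "deploy"],
    "StatefulSet": ["statefulsets", "statefulset", "sts"],
}

_LOOKUP = {
    variant.lower(): canonical
    for canonical, variants in KIND_VARIANTS.items()
    for variant in variants + [canonical]
}

def normalize_kind(name: str) -> str:
    """Resolve any kubectl-style variant ('Pod', 'POD', 'po', ...) to its canonical kind."""
    try:
        return _LOOKUP[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unknown resource kind: {name!r}") from None

assert normalize_kind("POD") == normalize_kind("po") == "Pod"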

Example Workflow

# 1. Discover what kinds are available
discover_resource_kinds(cluster="dev", keywords=["druid"])
# Returns: DruidIngestion kind with variants ["druidingestions", "druidingestion"]

# 2. List instances using any variant
list_resources(cluster="dev", kind="druidingestions")
# or
list_resources(cluster="dev", kind="DruidIngestion")

Configuration

Port-Forward Config Format

druid:
  fret-dev:
    k8s_cluster: "dev-us-east"
    namespace: "druid"
    labels:
      app.kubernetes.io/component: "router"
    local_port: 8082
    remote_port: 8888

grafana:
  dev:
    k8s_cluster: "dev-us-east"
    namespace: "monitoring"
    labels:
      app.kubernetes.io/name: "grafana"
    local_port: 3000
    remote_port: 3000
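
Each service maps named targets to a cluster, namespace, label selector, and a local/remote port pair. A minimal sketch of loading such a file and resolving one target (field names follow the example above; the loader shown is illustrative, not the server's implementation):

import yaml  # pip install pyyaml

def load_port_forward_config(path: str) -> dict:
    with open(path) as fh:
        return yaml.safe_load(fh)

def resolve_target(config: dict, service: str, target: str) -> dict:
    """Look up e.g. service='druid', target='fret-dev' and return its connection settings."""
    try:
        entry = config[service][target]
    except KeyError:
        raise ValueError(f"No target {target!r} configured for service {service!r}") from None
    selector = ",".join(f"{k}={v}" for k, v in entry["labels"].items())
    return {
        "cluster": entry["k8s_cluster"],
        "namespace": entry["namespace"],
        "selector": selector,  # used to pick the first ready pod, kubectl-style
        "local_port": entry["local_port"],
        "remote_port": entry["remote_port"],
    }

config = load_port_forward_config("example-port-forward-config.yaml")
print(resolve_target(config, "druid", "fret-dev"))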

Common Workflows

These workflow examples demonstrate how AI agents can combine multiple tools to accomplish complex Kubernetes tasks:

1. Cluster Overview Workflow

User: "Show me the status of all my clusters"
AI Agent Flow:
1. discover_clusters() → Get available clusters ["dev", "staging", "prod"]
2. For each cluster:
   - cluster_health() → Check API server connectivity
   - list_resources("cluster", "nodes") → Get node count and status
   - list_resources("cluster", "namespaces") → Get namespace overview
3. Summarize: "3 clusters online, 12 total nodes, 8 namespaces"

2. Application Debugging Workflow

User: "My app isn't working in staging"
AI Agent Flow:
1. list_resources("staging", "pods", labels={"app": "user-app"}) → Find app pods
2. For each pod:
   - describe_resource("staging", "pod", pod_name) → Get detailed status
   - get_logs("staging", pod_name, lines=50) → Check recent logs
   - list_resources("staging", "events", field_selector=f"involvedObject.name={pod_name}") → Get events
3. Analyze patterns: "Found 2 pods crash-looping due to missing config map 'app-config'"
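
Expressed as client-side code, step 2 of this flow might look roughly like the sketch below (tool names follow the flow above; the argument names are assumptions, and the pod names would come from the initial list_resources() call):

from mcp import ClientSession

async def debug_app(session: ClientSession, cluster: str, pod_names: list[str]) -> list[tuple]:
    """Describe each pod, pull recent logs, and collect related events (workflow step 2)."""
    findings = []
    for pod in pod_names:  # in practice, parsed from the initial list_resources() output
        desc = await session.call_tool(
            "describe_resource", arguments={"cluster": cluster, "kind": "pod", "name": pod})
        logs = await session.call_tool(
            "get_logs", arguments={"cluster": cluster, "pod": pod, "lines": 50})
        events = await session.call_tool(
            "list_resources",
            arguments={"cluster": cluster, "kind": "events",
                       "field_selector": f"involvedObject.name={pod}"})
        findings.append((pod, desc.content, logs.content, events.content))
    return findings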

3. Safe Deployment Update Workflow

User: "Update nginx deployment to version 1.21"
AI Agent Flow:
1. get_resource("prod", "deployment", "nginx") → Get current deployment
2. Modify image version in manifest
3. Present proposed changes: "Will update image from nginx:1.20 to nginx:1.21"
4. User confirms: "Apply it"
5. update_resource("prod", "deployment", "nginx", new_manifest) → Apply update
6. Monitor rollout status with deployment_status()
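
The step worth spelling out is editing the manifest before asking for approval. A minimal sketch, operating on the deployment as a plain dict such as get_resource() might return (the helper and the single-container assumption are illustrative):

import copy

def bump_image(deployment: dict, new_image: str) -> tuple[dict, str]:
    """Return an updated manifest plus a one-line change summary to show the user."""
    updated = copy.deepcopy(deployment)
    container = updated["spec"]["template"]["spec"]["containers"][0]
    summary = f"Will update image from {container['image']} to {new_image}"
    container["image"] = new_image
    return updated, summary

# The manifest would normally come from get_resource("prod", "deployment", "nginx").
manifest = {"spec": {"template": {"spec": {"containers": [{"name": "nginx", "image": "nginx:1.20"}]}}}}
new_manifest, summary = bump_image(manifest, "nginx:1.21")
print(summary)  # -> Will update image from nginx:1.20 to nginx:1.21
# After the user confirms, pass new_manifest to update_resource("prod", "deployment", "nginx", ...).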

4. Resource Cleanup Workflow

User: "Clean up all test resources from last week"
AI Agent Flow:
1. list_resources("dev", "all", labels={"purpose": "test", "created": "2024-01-15"}) → Find test resources
2. Group by type: "Found 5 pods, 3 services, 2 configmaps"
3. For each resource:
   - Show resource details and age
4. Present deletion plan with resource details
5. User confirms: "Yes, delete them"
6. Execute deletions with proper cascade handling

5. Enhanced Port-Forward Workflow

User: "Setup port-forward to the FRET Druid in development"
AI Agent Flow:
1. discover_port_forward_services() → Get available services/targets
2. AI reasoning: "FRET Druid in development" → service="druid", target="fret-dev"
3. connect_to_service_targets("druid", ["fret-dev"]) → Setup to specific target
4. Return: "✅ Connected to fret-dev Druid: http://localhost:8082 → pod/druid-router-0"

6. Troubleshooting Performance Workflow

User: "Why is my app slow in production?"
AI Agent Flow:
1. list_resources("prod", "pods", labels={"app": "slow-app"}) → Find app pods
2. For each pod:
   - describe_resource() → Check resource limits/requests
   - get_logs(since="1h") → Check for performance-related logs
   - execute_in_pod(["top", "-b", "-n", "1"]) → Get resource usage
3. list_resources("prod", "nodes") → Check node capacity
4. query_by_labels("prod", "hpa", "app=slow-app") → Check auto-scaling
5. Analyze: "App is CPU-throttled, increase CPU requests from 100m to 250m"

7. Security Audit Workflow

User: "Audit RBAC permissions for the finance team"
AI Agent Flow:
1. list_resources("prod", "rolebindings") → Get all role bindings
2. list_resources("prod", "clusterrolebindings") → Get cluster-wide bindings
3. Filter by subjects containing "finance"
4. For each binding:
   - get_resource("prod", "role", role_name) → Get role permissions
5. Summarize: "Finance team has read access to secrets in finance namespace"

8. Disaster Recovery Workflow

User: "Restore the database from backup"
AI Agent Flow:
1. list_resources("prod", "pods", labels={"app": "database"}) → Find DB pods
2. scale_resource("prod", "deployment", "database", replicas=0) → Scale down
3. Wait for pods to terminate
4. apply_manifest("prod", restore_job_manifest) → Apply restore job
5. Monitor job status until completion
6. scale_resource("prod", "deployment", "database", replicas=1) → Scale back up
7. Verify database connectivity
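
Sketched as client-side calls, the scale-down / restore / scale-up sequence might look like this (tool and argument names follow the flow above and are assumptions; a real agent would poll pod termination and job completion between the calls):

from mcp import ClientSession

async def restore_database(session: ClientSession, restore_job_manifest: str) -> None:
    """Scale the database down, run the restore job, then scale back up (workflow 8)."""
    await session.call_tool("scale_resource", arguments={
        "cluster": "prod", "kind": "deployment", "name": "database", "replicas": 0})
    # ... wait for the pods to terminate ...
    await session.call_tool("apply_manifest", arguments={
        "cluster": "prod", "manifest": restore_job_manifest})
    # ... monitor the restore job until completion ...
    await session.call_tool("scale_resource", arguments={
        "cluster": "prod", "kind": "deployment", "name": "database", "replicas": 1})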

Best Practices for AI Agents

  1. Always Preview Changes: Use diff/dry-run before modifications
  2. Provide Context: Include relevant information when reporting issues
  3. Batch Operations: Group related operations for efficiency
  4. Handle Pagination: Be aware of large result sets
  5. Respect Cluster Boundaries: Never assume an operation spans clusters; target each cluster explicitly

Tool Safety Classifications

All MCP tools are classified by their safety level for AI agent operations:

[READ-ONLY] Tools - Safe for autonomous use

These tools only read data and never modify cluster or local state:

Resource Operations:

  • get_resource() - Get specific resource details
  • get_resource_yaml() - Get resource as YAML
  • list_resources() - List resources with formatting options
  • list_resources_yaml() - List resources as YAML
  • query_by_labels() - Advanced label-based queries
  • query_by_labels_yaml() - Label queries as YAML

Monitoring & Analysis:

  • troubleshoot_pod() - Comprehensive pod debugging
  • troubleshoot_pod_yaml() - Pod debugging as YAML
  • cluster_health() - Overall cluster health assessment
  • deployment_status() - Deployment rollout analysis
  • deployment_status_yaml() - Deployment analysis as YAML
  • resource_usage() - Resource consumption analysis
  • get_logs() - Retrieve pod logs
  • list_port_forwards() - List active port-forward sessions

Discovery:

  • discover_resource_kinds() - Find resource types by keywords
  • discover_clusters() - List available clusters
  • discover_cluster_aliases() - List cluster aliases
  • discover_port_forward_services() - List port-forward configurations

[MODIFIES STATE] Tools - Require user approval

These tools modify cluster state or local network configuration:

Resource Management:

  • create_resource() - Create new Kubernetes resources
  • update_resource() - Update existing resources
  • patch_resource() - Partially update resources
  • scale_resource() - Change replica counts
  • apply_manifest() - Apply YAML manifests

Network Operations:

  • connect_to_service_targets() - Connect to configured service environments
  • setup_port_forward() - Generic port-forward setup
  • teardown_port_forward() - Close port-forward sessions

[DANGEROUS] Tools - Require explicit user confirmation

These tools can cause data loss or security risks:

  • delete_resource() - Permanently delete resources (cannot be undone)
  • execute_in_pod() - Run arbitrary commands in pods (potential security risk)

AI Agent Safety Guidelines

  1. Autonomous Operations: AI agents can freely use [READ-ONLY] tools
  2. State Changes: [MODIFIES STATE] tools should require user approval
  3. High-Risk Operations: [DANGEROUS] tools must have explicit user confirmation
  4. Error Handling: All tools return consistent error formats for safe parsing
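
One way a client or agent framework could enforce these guidelines is a simple gate keyed on the classifications above. A hedged sketch (the classification table is abridged from the lists above; the prompt mechanism is illustrative):

from enum import Enum

class Safety(Enum):
    READ_ONLY = "read-only"
    MODIFIES_STATE = "modifies state"
    DANGEROUS = "dangerous"

# Abridged table derived from the classification lists above.
TOOL_SAFETY = {
    "list_resources": Safety.READ_ONLY,
    "get_logs": Safety.READ_ONLY,
    "scale_resource": Safety.MODIFIES_STATE,
    "apply_manifest": Safety.MODIFIES_STATE,
    "delete_resource": Safety.DANGEROUS,
    "execute_in_pod": Safety.DANGEROUS,
}

def approved(tool: str, ask_user) -> bool:
    """Gate a tool call: read-only runs freely, everything else needs the user's consent."""
    level = TOOL_SAFETY.get(tool, Safety.DANGEROUS)  # unknown tools get the strictest gate
    if level is Safety.READ_ONLY:
        return True
    prompt = f"Tool {tool!r} is classified as [{level.value.upper()}]. Proceed?"
    return ask_user(prompt)

print(approved("list_resources", ask_user=lambda prompt: False))   # True: read-only, no prompt
print(approved("delete_resource", ask_user=lambda prompt: False))  # False: user declined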

Architecture

k8s-mcp/
├── main.py                 # MCP server entry point with tool definitions
├── config/                 # Configuration management  
│   ├── loader.py          # URL-based config loading
│   ├── manager.py         # Cluster and port-forward session management
│   └── config.py          # Environment validation and configuration
├── tools/                  # Modular tool implementations
│   ├── resources.py       # Common resource operation patterns
│   ├── formatting.py      # All resource formatting (table, YAML, JSON)
│   ├── columns.py         # Default column definitions for resources
│   ├── discovery.py       # Resource kind discovery and CRD enumeration
│   └── port_forward.py    # Port-forward specific operations
└── utils/                  # Shared utilities
    ├── helpers.py         # Common helper functions
    └── jsonpath.py        # JSONPath processing for custom columns

Code Organization Principles

  • Single Responsibility: Each module has a clear, focused purpose
  • DRY Compliance: No duplicate implementations across files
  • Separation of Concerns: Client management, formatting, and business logic are separated
  • Modular Design: Tools can be imported and reused independently

Server Config

{
  "mcpServers": {
    "k8s": {
      "command": "uv",
      "args": [
        "--directory",
        "/path/to/k8s-mcp",
        "run",
        "k8s-mcp"
      ],
      "env": {
        "KUBECONFIG_ALIASES": "development=dev production=prod",
        "KUBECONFIG_URLS": "dev=file:///path/to/development/k3s/kubeconfig.yaml prod=file:///path/to/production/k3s/kubeconfig.yaml",
        "PORT_FORWARD_CONFIG_URL": "file:///path/to/k8s-mcp/port-forward-config.yaml"
      }
    }
  }
}