MCP Monitoring

Created by reemshai10, 7 months ago

A sophisticated Model Context Protocol (MCP) server that provides intelligent monitoring and observability integration. This server enables natural language interactions with Prometheus, AlertManager, and Grafana through chat-style commands, advanced query processing, and comprehensive monitoring automation.

## 🌟 Overview

This MCP server transforms how you interact with monitoring infrastructure by providing:

- **Natural Language Processing**: Ask monitoring questions in plain English
- **Intelligent Query Translation**: Automatically converts questions to PromQL queries
- **Historical Alert Analysis**: Count failures, outages, and incidents over time
- **Multi-Source Integration**: Seamlessly works with Prometheus, AlertManager, and Grafana
- **Automated Incident Detection**: Smart pattern recognition for service failures

## ✨ Key Features

### 🧠 **Natural Language Query Engine**

- **Smart Intent Recognition**: Understands monitoring questions like "How many times did service X fail?"
- **Automatic Time Range Parsing**: Handles phrases like "last 2 weeks", "yesterday", "past month"
- **Service Name Detection**: Recognizes services like opengrok, jenkins, grafana, prometheus
- **Alert Pattern Matching**: Identifies automation failures, service outages, and critical incidents
- **Context-Aware Responses**: Provides detailed breakdowns with incident counts and durations

### 🔍 **Prometheus Integration**

- **Advanced PromQL Generation**: Automatically creates complex queries based on natural language
- **Historical Data Analysis**: Analyzes alert trends and service availability over time
- **Metric Discovery**: Browse and search available metrics with intelligent filtering
- **Range Query Optimization**: Smart step sizing for different time ranges
- **Alert History Tracking**: Tracks firing periods and incident detection

### 🚨 **AlertManager Integration**

- **Real-time Alert Monitoring**: Query active, pending, and resolved alerts
- **Smart Alert Filtering**: Filter by service, severity, alertname, or custom labels
- **Alert Fingerprinting**: Track unique alert instances and their lifecycle
- **Incident Correlation**: Group related alerts and calculate total impact

### 📊 **Grafana Integration** (Optional)

- **Dashboard Discovery**: Find dashboards related to specific services
- **Dynamic Dashboard Links**: Generate direct links to relevant monitoring views
- **Service Context Mapping**: Connect services to their monitoring dashboards

## 💬 Natural Language Examples

### Service Failure Analysis

Q: "How many times did prevent-opengrok automation fail in the last 2 weeks?"
A: 46 failures over 2 days and 3 hours total downtime

Q: "Show me jenkins outages yesterday"
A: Detailed breakdown of jenkins service interruptions

Q: "Count critical alerts for grafana service this month"
A: Historical analysis with incident timeline
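
One way an answer like the first one above could be derived is by counting contiguous firing periods in a Prometheus range query over the built-in `ALERTS` series. The sketch below is illustrative only: the function name `summarizeIncidents`, the sample shape, and the rule that a gap larger than one step starts a new incident are all assumptions, not the server's confirmed implementation.

```typescript
// A sample from a Prometheus range query: [unix timestamp in seconds, value as string].
type Sample = [number, string];

interface IncidentSummary {
  incidents: number;    // number of separate firing periods
  totalSeconds: number; // summed duration of all firing periods
}

// Count contiguous firing periods. The ALERTS series only has samples while
// an alert is firing, so a gap larger than one step is treated as the start
// of a new incident. Samples are assumed to be sorted by timestamp.
function summarizeIncidents(samples: Sample[], stepSeconds: number): IncidentSummary {
  let incidents = 0;
  let totalSeconds = 0;
  let previous: number | undefined;

  for (const [timestamp] of samples) {
    if (previous === undefined || timestamp - previous > stepSeconds) {
      incidents += 1;              // gap detected: a new firing period begins
      totalSeconds += stepSeconds; // the first sample counts as one step of downtime
    } else {
      totalSeconds += timestamp - previous;
    }
    previous = timestamp;
  }
  return { incidents, totalSeconds };
}
```

The reported duration is only as precise as the query step, so a coarse step over a long range will round short incidents up to one full step.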

### Service Availability Queries

Q: "How many times was prometheus down last week?"
A: Service downtime incidents with duration analysis

Q: "Show cleanup-zuultmp disk usage alerts"  
A: Disk space warnings and critical alerts breakdown

Q: "What automation failures happened in the past 7 days?"
A: Comprehensive automation failure report

## 🔧 Integration Examples

### VS Code MCP Configuration

{
  "servers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_KEY": "${input:grafana_api_key}"
      }
    }
  }
}

For the Grafana token, ask your Grafana admin to create a service user (service account) and provide its token.

## 🎯 Use Cases

### DevOps Teams

- **Incident Response**: Quickly assess service health and failure patterns
- **Postmortem Analysis**: Historical incident data for root cause analysis
- **Capacity Planning**: Trend analysis and resource utilization monitoring
- **Alert Fatigue Management**: Identify noisy alerts and optimization opportunities

### SRE Teams

- **SLI/SLO Monitoring**: Service availability and performance tracking
- **Error Budget Analysis**: Calculate error rates and availability metrics
- **Automated Reporting**: Generate incident reports and availability summaries
- **Proactive Monitoring**: Identify patterns before they become critical issues

### Development Teams

- **Deployment Monitoring**: Track deployment success/failure rates
- **Performance Regression Detection**: Compare metrics across releases
- **Integration Testing**: Monitor test environment stability
- **Feature Flag Impact**: Assess performance impact of feature rollouts

## 🧩 Architecture

### Smart Query Processing Pipeline

1. **Intent Recognition**: Parse natural language to understand query type
2. **Service Detection**: Identify target services and components
3. **Time Range Extraction**: Parse temporal expressions into date ranges
4. **PromQL Generation**: Create optimized queries based on intent
5. **Data Analysis**: Process results and calculate meaningful metrics
6. **Response Formatting**: Present data in human-readable format
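
Two of the steps above, time range extraction and PromQL generation, might look roughly like the following. This is a minimal sketch under stated assumptions: the function names are hypothetical, a "month" is approximated as 30 days, "yesterday" is approximated as a 24-hour lookback, and the query assumes that alert names contain the service name. Only the `ALERTS` series and its `alertstate` and `alertname` labels come from Prometheus itself.

```typescript
// Seconds per unit for the temporal phrases this sketch understands.
const UNIT_SECONDS: Record<string, number> = {
  hour: 3600,
  day: 86400,
  week: 7 * 86400,
  month: 30 * 86400, // approximation: a "month" is treated as 30 days
};

// Parse phrases such as "last 2 weeks", "past month", or "yesterday" into a
// lookback window in seconds. Returns undefined when no phrase is found.
function extractLookbackSeconds(question: string): number | undefined {
  const text = question.toLowerCase();
  if (text.includes("yesterday")) return UNIT_SECONDS.day; // approximated as 24 hours
  const match = text.match(/(?:last|past)\s+(\d+)?\s*(hour|day|week|month)s?/);
  if (!match) return undefined;
  const count = match[1] ? parseInt(match[1], 10) : 1; // "last week" means one week
  return count * UNIT_SECONDS[match[2]];
}

// Build a PromQL selector over Prometheus' built-in ALERTS series.
// Assumes alert names contain the service name, and that the service name
// uses only letters, digits, and hyphens (no regex escaping is done here).
function buildAlertQuery(service: string): string {
  return `ALERTS{alertstate="firing", alertname=~".*${service}.*"}`;
}
```

For the question "How many times did prevent-opengrok automation fail in the last 2 weeks?", this yields a lookback of 1,209,600 seconds and a selector matching any firing alert whose name contains `opengrok`.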

### Supported Query Types

- `current_alerts`: Active/firing alerts right now
- `historical_alerts`: Past incidents and failure counts
- `service_availability`: Uptime/downtime analysis
- `dashboard_discovery`: Find relevant monitoring dashboards
- `metrics`: General metric queries and analysis
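
To make the query types above concrete, here is a small keyword-based classifier. It is a sketch, not the server's actual intent recognizer: the rule list, its ordering, and the name `classifyIntent` are assumptions chosen so that the example questions in this document land in plausible categories.

```typescript
type QueryType =
  | "current_alerts"
  | "historical_alerts"
  | "service_availability"
  | "dashboard_discovery"
  | "metrics";

// Ordered keyword rules: the first rule whose pattern matches wins.
const INTENT_RULES: Array<[RegExp, QueryType]> = [
  [/dashboard/, "dashboard_discovery"],
  [/\b(down|uptime|downtime|availab\w*)\b/, "service_availability"],
  [/\b(how many|count|last|past|yesterday|history)\b/, "historical_alerts"],
  [/\b(alert\w*|firing|active)\b/, "current_alerts"],
];

function classifyIntent(question: string): QueryType {
  const text = question.toLowerCase();
  for (const [pattern, queryType] of INTENT_RULES) {
    if (pattern.test(text)) return queryType;
  }
  return "metrics"; // fallback for general metric questions
}
```

Rule order matters here: availability keywords are checked before historical ones, so "How many times was prometheus down last week?" is classified as `service_availability` rather than `historical_alerts`.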

## 📈 Performance Features

- **Intelligent Query Optimization**: Automatic step sizing for different time ranges
- **Result Caching**: Avoid redundant API calls for recent queries
- **Timeout Handling**: Graceful handling of slow monitoring APIs
- **Batch Processing**: Efficient handling of multi-service queries
- **Memory Management**: Optimized for long-running server deployment
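
Step sizing and result caching, two of the features listed above, can be sketched as follows. The target of roughly 500 points per query, the 15-second minimum step, and the `TtlCache` class are illustrative assumptions; the server's real thresholds and cache design are not documented here.

```typescript
// Pick a range-query step so that a query returns roughly `targetPoints` samples.
function chooseStepSeconds(rangeSeconds: number, targetPoints: number = 500): number {
  const MIN_STEP_SECONDS = 15; // assumed scrape interval; adjust to your setup
  return Math.max(MIN_STEP_SECONDS, Math.ceil(rangeSeconds / targetPoints));
}

// A tiny time-to-live cache keyed by query string.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // `now` is injectable so that expiry can be tested without waiting.
  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```

With these defaults, a one-hour range uses the 15-second minimum, while a two-week range uses a step of 2,420 seconds, which keeps the number of returned samples bounded regardless of range length.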

## 🔒 Security & Best Practices

### Authentication

- Secure API token storage for Grafana integration
- Support for basic auth with Prometheus/AlertManager
- Environment variable configuration for sensitive data
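
As an illustration of the points above, request headers can be built from environment-style settings instead of hard-coded secrets. In this sketch `GRAFANA_API_KEY` is the variable from the configuration in this document, while `PROM_USER` and `PROM_PASSWORD` are hypothetical names invented for the example; check the server's own documentation for the variables it actually reads.

```typescript
type Env = Record<string, string | undefined>;

// Build request headers for Grafana using a bearer token.
function grafanaHeaders(env: Env): Record<string, string> {
  const token = env.GRAFANA_API_KEY;
  if (!token) throw new Error("GRAFANA_API_KEY is not set");
  return { Authorization: `Bearer ${token}` };
}

// Build a basic-auth header for Prometheus or AlertManager.
// PROM_USER and PROM_PASSWORD are hypothetical variable names.
function basicAuthHeaders(env: Env): Record<string, string> {
  const user = env.PROM_USER;
  const password = env.PROM_PASSWORD;
  if (!user || !password) return {}; // no credentials configured: send no auth header
  // btoa handles Latin-1 input only; credentials with other characters need a different encoder.
  return { Authorization: `Basic ${btoa(`${user}:${password}`)}` };
}
```

In a Node.js process these functions would typically be called with `process.env`, which keeps tokens out of source files and configuration committed to version control.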

### Network Security

- HTTPS-only connections to monitoring services
- Configurable timeout and retry policies
- Certificate validation for secure connections
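
A configurable timeout and retry policy could take a shape like the one below. This is a generic sketch using exponential backoff, which is an assumption: the document does not say which retry strategy the server uses, and `RetryPolicy`, `backoffDelays`, `withTimeout`, and `withRetry` are hypothetical names.

```typescript
interface RetryPolicy {
  retries: number;     // how many times to retry after the first attempt
  baseDelayMs: number; // delay before the first retry
  maxDelayMs: number;  // upper bound for any single delay
}

// Exponential backoff: the delay doubles after each failed attempt, capped at maxDelayMs.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < policy.retries; attempt++) {
    delays.push(Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** attempt));
  }
  return delays;
}

// Reject if `promise` does not settle within `timeoutMs`.
// Note: this does not cancel the underlying request, it only stops waiting for it.
function withTimeout<T>(promise: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timeout of ${timeoutMs}ms exceeded`)),
      timeoutMs,
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Run `operation`, retrying on failure according to the policy.
async function withRetry<T>(operation: () => Promise<T>, policy: RetryPolicy): Promise<T> {
  const delays = backoffDelays(policy);
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (attempt >= delays.length) throw error; // out of retries
      await new Promise<void>((resolve) => setTimeout(resolve, delays[attempt]));
    }
  }
}
```

A caller might combine the two as `withRetry(() => withTimeout(doRequest(), 30000), policy)`, where `doRequest` stands in for whatever function issues the HTTP call.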

### Access Control

- Read-only operations by design
- No data modification capabilities
- Audit logging for all monitoring queries

## 🐛 Troubleshooting

### Common Issues

# Connection errors
Error: connect ECONNREFUSED
Solution: Check PROMETHEUS_URL and network connectivity

# Authentication failures  
Error: 401 Unauthorized
Solution: Verify API tokens and authentication credentials

# Query timeouts
Error: timeout of 30000ms exceeded
Solution: Reduce query complexity or time range

# No data returned
Warning: No matching metrics found
Solution: Check service names and time range validity

### Debug Mode

# Enable verbose logging
DEBUG=monitoring-mcp node dist/index.js

# Check configuration
node -e "console.log(process.env.PROMETHEUS_URL)"

## 🚀 Advanced Usage

### Custom Service Detection

The server automatically recognizes these services:

cleanup-zuultmp, opengrok, jenkins
grafana, prometheus, alertmanager
gerrit, nginx, mysql, redis, elasticsearch
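
Detection against the list above can be as simple as a case-insensitive substring match. The sketch below assumes exactly that; the `extraServices` parameter is a hypothetical extension point for custom names and is not a documented option of the server.

```typescript
// The service names listed above.
const KNOWN_SERVICES: string[] = [
  "cleanup-zuultmp", "opengrok", "jenkins",
  "grafana", "prometheus", "alertmanager",
  "gerrit", "nginx", "mysql", "redis", "elasticsearch",
];

// Return every known service mentioned in the question, in list order.
// Extra names can be passed to extend the built-in list.
function detectServices(question: string, extraServices: string[] = []): string[] {
  const text = question.toLowerCase();
  return [...KNOWN_SERVICES, ...extraServices].filter((service) =>
    text.includes(service.toLowerCase()),
  );
}
```

Substring matching means that "prevent-opengrok" is detected as `opengrok`, which matches the example questions in this document, but it can also produce false positives for short names embedded in longer words.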

### Advanced Natural Language Patterns

"How many times did [service] fail in the last [time period]?"
"Show me [severity] alerts for [service] [time range]"
"Count [alert name] incidents in [time period]"
"When was [service] down last [time period]?"

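The first pattern in the list above can be matched with a regular expression that captures the two placeholders. This is only a sketch of the idea: the expression and the name `parseFailureQuestion` are assumptions, and the server may recognize these patterns in a different way.

```typescript
interface FailureQuestion {
  service: string; // text matched by [service]
  period: string;  // text matched by [time period]
}

// Matches "How many times did [service] fail in the last [time period]?"
// Also accepts "past" in place of "last"; the trailing question mark is optional.
const FAILURE_PATTERN = /how many times did (.+?) fail in the (?:last|past) (.+?)\??$/i;

function parseFailureQuestion(question: string): FailureQuestion | undefined {
  const match = question.trim().match(FAILURE_PATTERN);
  if (!match) return undefined;
  return { service: match[1], period: match[2] };
}
```

The captured `service` is the raw phrase from the question, for example "prevent-opengrok automation", so it would still need to be mapped to a known service name before a query is built.
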
## 🤝 Contributing

Contributions welcome! Please ensure:

- TypeScript compilation passes (`npm run build`)
- Natural language query tests pass
- Documentation is updated for new features
- Error handling is comprehensive

Built with ❤️ for DevOps and SRE teams who want smarter monitoring interactions

Server Config

{
  "mcpServers": {
    "monitoring-mcp": {
      "command": "node",
      "args": [
        "/Users/MCP/mcp-monitoring/dist/index.js"
      ],
      "env": {
        "PROMETHEUS_URL": "${input:prometheus_base_url}",
        "ALERTMANAGER_URL": "${input:alertmanager_base_url}",
        "GRAFANA_URL": "${input:grafana_base_url}",
        "GRAFANA_API_KEY": "${input:grafana_api_key}"
      }
    }
  }
}
