Dataproc MCP Server

Created by dipseth, 8 months ago
Private MCP Dataproc server repository

A production-ready Model Context Protocol (MCP) server for Google Cloud Dataproc operations with intelligent parameter injection, enterprise-grade security, and comprehensive tooling. Designed for seamless integration with Roo (VS Code).

🚀 Quick Start

Add this to your Roo MCP settings:

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

With Custom Config File

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server@latest"],
      "env": {
        "LOG_LEVEL": "info",
        "DATAPROC_CONFIG_PATH": "/path/to/your/config.json"
      }
    }
  }
}

Alternative: Global Installation

# Install globally
npm install -g @dipseth/dataproc-mcp-server

# Start the server
dataproc-mcp-server

# Or run directly
npx @dipseth/dataproc-mcp-server@latest

5-Minute Setup

  1. Install the package:

    npm install -g @dipseth/dataproc-mcp-server@latest
    
  2. Run the setup:

    dataproc-mcp --setup
    
  3. Configure authentication:

    # Edit the generated config file
    nano config/server.json
    
  4. Start the server:

    dataproc-mcp
    

✨ Features

🎯 Core Capabilities

  • 21 Production-Ready MCP Tools - Complete Dataproc management suite
  • 🧠 Knowledge Base Semantic Search - Natural language queries with optional Qdrant integration
  • 🚀 Response Optimization - 60-96% token reduction with Qdrant storage
  • 60-80% Parameter Reduction - Intelligent default injection
  • Multi-Environment Support - Dev/staging/production configurations
  • Service Account Impersonation - Enterprise authentication
  • Real-time Job Monitoring - Comprehensive status tracking
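
The parameter reduction listed above relies on default injection: values such as project and region come from configuration when a tool call omits them. The following is a minimal sketch of that idea, not the server's actual code. The `injectDefaults` helper and the parameter names (`projectId`, `region`, `clusterName`) are assumptions made for illustration.

```typescript
// Hypothetical parameter bag; the real tool schemas may differ.
type Params = Record<string, unknown>;

// Merge caller-supplied params over configured defaults.
// Explicit values win; undefined values fall back to the default.
function injectDefaults(defaults: Params, params: Params): Params {
  const merged: Params = { ...defaults };
  for (const key of Object.keys(params)) {
    const value = params[key];
    if (value !== undefined) {
      merged[key] = value;
    }
  }
  return merged;
}

// The caller supplies only the cluster name; project and region are injected.
const configuredDefaults: Params = { projectId: "my-project", region: "us-central1" };
const resolvedParams = injectDefaults(configuredDefaults, { clusterName: "etl-cluster" });
console.log(resolvedParams);
```

Explicit values always override defaults in this sketch, so a caller can still target another region without editing the configuration.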

🚀 Response Optimization

  • 96.2% Token Reduction - list_clusters: 7,651 → 292 tokens
  • Automatic Qdrant Storage - Full data preserved and searchable
  • Resource URI Access - dataproc://responses/clusters/list/abc123
  • Graceful Fallback - Works without Qdrant, falls back to full responses
  • 9.95ms Processing - Lightning-fast optimization with <1MB memory usage
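
The optimization flow described above (store the full response, return a compact summary plus a resource URI, and fall back to the full response when no store is available) might look roughly like this. It is a sketch under stated assumptions: the `ResponseStore` interface, the `Cluster` shape, and `optimizeListClusters` are hypothetical, and any object with a `save` method stands in for Qdrant. Only the URI pattern is taken from the list above.

```typescript
// Hypothetical store interface; in the real server this role is played by Qdrant.
interface ResponseStore {
  save(id: string, data: unknown): void;
}

// Hypothetical cluster shape; `config` represents the verbose detail
// that dominates the token count of a full response.
interface Cluster {
  clusterName: string;
  status: string;
  config: Record<string, unknown>;
}

type OptimizedResult =
  | { optimized: true; summary: string[]; resourceUri: string }
  | { optimized: false; clusters: Cluster[] };

function optimizeListClusters(
  clusters: Cluster[],
  id: string,
  store?: ResponseStore
): OptimizedResult {
  if (!store) {
    // Graceful fallback: no store configured, so return the full response.
    return { optimized: false, clusters };
  }
  try {
    // Preserve the full data so it can be fetched later via the resource URI.
    store.save(id, clusters);
  } catch {
    // Store unreachable: also fall back to the full response.
    return { optimized: false, clusters };
  }
  return {
    optimized: true,
    summary: clusters.map((c) => `${c.clusterName}: ${c.status}`),
    resourceUri: `dataproc://responses/clusters/list/${id}`,
  };
}
```

The discriminated `optimized` flag lets a client tell at a glance whether it received a summary or the full payload.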

Enterprise Security

  • Input Validation - Zod schemas for all 21 tools
  • Rate Limiting - Configurable abuse prevention
  • Credential Management - Secure handling and rotation
  • Audit Logging - Comprehensive security event tracking
  • Threat Detection - Injection attack prevention
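
The rate limiting mentioned above can be illustrated with a small sliding-window limiter. This is a generic sketch and not the server's implementation: the `RateLimiter` class, its constructor arguments, and the per-key policy are all assumptions.

```typescript
// Sliding-window limiter: allow at most `limit` calls per `windowMs` for each key.
class RateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly calls = new Map<string, number[]>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the behaviour can be checked without real timers.
  allow(key: string, now: number = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    // Keep only the timestamps that still fall inside the window.
    const recent = (this.calls.get(key) ?? []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.calls.set(key, recent);
      return false;
    }
    recent.push(now);
    this.calls.set(key, recent);
    return true;
  }
}
```

A caller would typically key the limiter by client or tool name and reject the request when `allow` returns `false`.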

📊 Quality Assurance

  • 90%+ Test Coverage - Comprehensive test suite
  • Performance Monitoring - Configurable thresholds
  • Multi-Environment Testing - Cross-platform validation
  • Automated Quality Gates - CI/CD integration
  • Security Scanning - Vulnerability management

🚀 Developer Experience

  • 5-Minute Setup - Quick start guide
  • Interactive Documentation - HTML docs with examples
  • Comprehensive Examples - Multi-environment configs
  • Troubleshooting Guides - Common issues and solutions
  • IDE Integration - TypeScript support

🛠️ Complete MCP Tools Suite (21 Tools)

🚀 Cluster Management (8 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| start_dataproc_cluster | Create and start new clusters | ✅ 80% fewer params | Profile-based, auto-config |
| create_cluster_from_yaml | Create from YAML configuration | ✅ Project/region injection | Template-driven setup |
| create_cluster_from_profile | Create using predefined profiles | ✅ 85% fewer params | 8 built-in profiles |
| list_clusters | List all clusters with filtering | ✅ No params needed | Semantic queries, pagination |
| list_tracked_clusters | List MCP-created clusters | ✅ Profile filtering | Creation tracking |
| get_cluster | Get detailed cluster information | ✅ 75% fewer params | Semantic data extraction |
| delete_cluster | Delete existing clusters | ✅ Project/region defaults | Safe deletion |
| get_zeppelin_url | Get Zeppelin notebook URL | ✅ Auto-discovery | Web interface access |

💼 Job Management (6 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| submit_hive_query | Submit Hive queries to clusters | ✅ 70% fewer params | Async support, timeouts |
| submit_dataproc_job | Submit Spark/PySpark/Presto jobs | ✅ 75% fewer params | Multi-engine support |
| get_job_status | Get job execution status | ✅ JobID only needed | Real-time monitoring |
| get_job_results | Get job outputs and results | ✅ Auto-pagination | Result formatting |
| get_query_status | Get Hive query status | ✅ Minimal params | Query tracking |
| get_query_results | Get Hive query results | ✅ Smart pagination | Enhanced async support |

📋 Configuration & Profiles (3 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| list_profiles | List available cluster profiles | ✅ Category filtering | 8 production profiles |
| get_profile | Get detailed profile configuration | ✅ Profile ID only | Template access |
| query_cluster_data | Query stored cluster data | ✅ Natural language | Semantic search |

📊 Analytics & Insights (4 Tools)

| Tool | Description | Smart Defaults | Key Features |
|------|-------------|----------------|--------------|
| check_active_jobs | Quick status of all active jobs | ✅ No params needed | Multi-project view |
| get_cluster_insights | Comprehensive cluster analytics | ✅ Auto-discovery | Machine types, components |
| get_job_analytics | Job performance analytics | ✅ Success rates | Error patterns, metrics |
| query_knowledge | Query comprehensive knowledge base | ✅ Natural language | Clusters, jobs, errors |

🎯 Key Capabilities

  • 🧠 Semantic Search: Natural language queries with Qdrant integration
  • ⚡ Smart Defaults: 60-80% parameter reduction through intelligent injection
  • 📊 Response Optimization: 96% token reduction with full data preservation
  • 🔄 Async Support: Non-blocking job submission and monitoring
  • 🏷️ Profile System: 8 production-ready cluster templates
  • 📈 Analytics: Comprehensive insights and performance tracking

📋 Configuration

Project-Based Configuration

The server supports a project-based configuration format:

# profiles/@analytics-workloads.yaml
my-company-analytics-prod-1234:
  region: us-central1
  tags:
    - DataProc
    - analytics
    - production
  labels:
    service: analytics-service
    owner: data-team
    environment: production
  cluster_config:
    # ... cluster configuration
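
Once parsed, a file like the one above is a map from project ID to profile. The sketch below shows one way such a map could be typed and queried. The interface mirrors only the fields visible in the YAML example, the `getProjectProfile` helper is hypothetical, and the object literal stands in for the output of a YAML parser.

```typescript
// Shape mirrored from the YAML example; no fields beyond those shown are assumed.
interface ProjectProfile {
  region: string;
  tags?: string[];
  labels?: Record<string, string>;
  cluster_config?: Record<string, unknown>;
}

// Keyed by project ID, as in the YAML file.
type ProjectProfiles = Record<string, ProjectProfile>;

// Hypothetical helper: look up the profile for a project, failing loudly if absent.
function getProjectProfile(profiles: ProjectProfiles, projectId: string): ProjectProfile {
  const profile = profiles[projectId];
  if (!profile) {
    throw new Error(`No profile configured for project "${projectId}"`);
  }
  return profile;
}

// The object a YAML parser would produce for the example file.
const profiles: ProjectProfiles = {
  "my-company-analytics-prod-1234": {
    region: "us-central1",
    tags: ["DataProc", "analytics", "production"],
    labels: {
      service: "analytics-service",
      owner: "data-team",
      environment: "production",
    },
  },
};

// prints us-central1
console.log(getProjectProfile(profiles, "my-company-analytics-prod-1234").region);
```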

Authentication Methods

  1. Service Account Impersonation (Recommended)
  2. Direct Service Account Key
  3. Application Default Credentials
  4. Hybrid Authentication with fallbacks
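
Hybrid authentication with fallbacks can be pictured as an ordered chain in which each method is tried until one yields a credential. The sketch below is a simplified, synchronous illustration only. The `Credential` and `AuthStrategy` types and the `authenticate` function are hypothetical, and real credential resolution through Google's auth libraries is asynchronous.

```typescript
// Hypothetical credential descriptor, for illustration only.
interface Credential {
  method: string;
  token: string;
}

// A strategy returns a credential, or undefined when it is not configured.
interface AuthStrategy {
  name: string;
  resolve(): Credential | undefined;
}

// Try each strategy in priority order and return the first credential obtained.
function authenticate(strategies: AuthStrategy[]): Credential {
  const failures: string[] = [];
  for (const strategy of strategies) {
    try {
      const credential = strategy.resolve();
      if (credential) {
        return credential;
      }
      failures.push(`${strategy.name}: not configured`);
    } catch (err) {
      // A failing method is recorded and the chain moves on to the next one.
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(`${strategy.name}: ${reason}`);
    }
  }
  throw new Error(`All authentication methods failed (${failures.join("; ")})`);
}
```

Collecting the failure reasons keeps the final error actionable when every method in the chain fails.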

📚 Documentation

🔧 MCP Client Integration

Claude Desktop

{
  "mcpServers": {
    "dataproc": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server"],
      "env": {
        "LOG_LEVEL": "info"
      }
    }
  }
}

Roo (VS Code)

{
  "mcpServers": {
    "dataproc-server": {
      "command": "npx",
      "args": ["@dipseth/dataproc-mcp-server"],
      "disabled": false,
      "alwaysAllow": [
        "list_clusters",
        "get_cluster",
        "list_profiles"
      ]
    }
  }
}

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   MCP Client    │────│  Dataproc MCP    │────│  Google Cloud   │
│  (Claude/Roo)   │    │     Server       │    │    Dataproc     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                       ┌──────┴──────┐
                       │   Features  │
                       ├─────────────┤
                       │ • Security  │
                       │ • Profiles  │
                       │ • Validation│
                       │ • Monitoring│
                       └─────────────┘

🚦 Performance

Response Time Achievements

  • Schema Validation: ~2ms (target: <5ms) ✅
  • Parameter Injection: ~1ms (target: <2ms) ✅
  • Credential Validation: ~25ms (target: <50ms) ✅
  • MCP Tool Call: ~50ms (target: <100ms) ✅
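
The latency figures above can be checked against their targets with a few lines of code. This is an illustrative sketch: the `Benchmark` type and `findRegressions` function are hypothetical, and the numbers are copied from the list above rather than measured here.

```typescript
interface Benchmark {
  name: string;
  measuredMs: number; // approximate figure reported above
  targetMs: number;   // exclusive upper bound from the same list
}

const benchmarks: Benchmark[] = [
  { name: "Schema Validation", measuredMs: 2, targetMs: 5 },
  { name: "Parameter Injection", measuredMs: 1, targetMs: 2 },
  { name: "Credential Validation", measuredMs: 25, targetMs: 50 },
  { name: "MCP Tool Call", measuredMs: 50, targetMs: 100 },
];

// Return the names of benchmarks whose measured latency misses the "<target" bound.
function findRegressions(results: Benchmark[]): string[] {
  return results
    .filter((b) => b.measuredMs >= b.targetMs)
    .map((b) => b.name);
}

// With the figures above, no benchmark misses its target.
console.log(findRegressions(benchmarks)); // prints []
```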

Throughput Achievements

  • Schema Validation: ~2000 ops/sec ✅
  • Parameter Injection: ~5000 ops/sec ✅
  • Credential Validation: ~200 ops/sec ✅
  • MCP Tool Call: ~100 ops/sec ✅

🧪 Testing

# Run all tests
npm test

# Run specific test suites
npm run test:unit
npm run test:integration
npm run test:performance

# Run with coverage
npm run test:coverage

🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.

Development Setup

# Clone the repository
git clone https://github.com/dipseth/dataproc-mcp.git
cd dataproc-mcp

# Install dependencies
npm install

# Build the project
npm run build

# Run tests
npm test

# Start development server
npm run dev

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🆘 Support

🏆 Acknowledgments


Made with ❤️ for the MCP and Google Cloud communities
