Claude Desktop Real-time Audio MCP Server (Python Implementation)

Created by joelfuller2016 · 7 months ago
Python-based Model Context Protocol (MCP) server for real-time microphone input to Claude Desktop on Windows. FastMCP + sounddevice + multiple STT engines for sub-500ms latency voice conversations.

Claude Desktop Real-time Audio MCP Server (Python Implementation)

License: MIT · Python 3.8+ · Platform: Windows

A Python-based Model Context Protocol (MCP) server that enables real-time microphone input for Claude Desktop on Windows. This implementation leverages Python's superior audio processing ecosystem to provide robust voice-driven conversations with Claude through WASAPI audio capture and multiple speech recognition engines.

🚀 Key Advantages of the Python Implementation

  • 🐍 Mature Audio Ecosystem: Leverages sounddevice, webrtcvad, and specialized Windows audio libraries
  • 🧠 Multiple STT Engines: OpenAI Whisper (local/API), Azure Speech, Google Speech-to-Text
  • ⚡ FastMCP Framework: High-level Pythonic interface for rapid MCP development
  • 🔧 Easy Configuration: JSON/YAML configuration with environment variable support
  • 📊 Better Debugging: Comprehensive logging and performance monitoring
  • 🔄 Async Architecture: Non-blocking operations with asyncio

🏗️ Architecture

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Claude        │    │   FastMCP Server │    │  Audio Capture  │
│   Desktop       │◄──►│   (Python)       │◄──►│  (sounddevice)  │
│                 │    │                  │    │  + WASAPI       │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │                        │
                                ▼                        ▼
                       ┌──────────────────┐    ┌─────────────────┐
                       │  STT Engines     │    │  Voice Activity │
                       │  • Whisper       │    │  Detection      │
                       │  • Azure Speech  │    │  (webrtcvad)    │
                       │  • Google Speech │    │                 │
                       └──────────────────┘    └─────────────────┘
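
For orientation, a minimal FastMCP server exposing one audio tool and one resource could look roughly like the sketch below (illustrative only; this is not the project's actual main.py):

# Illustrative FastMCP sketch, not the project's actual source
import json
import sounddevice as sd
from fastmcp import FastMCP   # the official SDK's mcp.server.fastmcp also provides FastMCP

mcp = FastMCP("claude-audio")

@mcp.tool()
def list_audio_devices() -> str:
    """Return all devices that have at least one input channel, as JSON."""
    inputs = [d for d in sd.query_devices() if d["max_input_channels"] > 0]
    return json.dumps(inputs, default=str)

@mcp.resource("audio://devices")
def audio_devices() -> str:
    """Expose the full device list as an MCP resource."""
    return json.dumps(list(sd.query_devices()), default=str)

if __name__ == "__main__":
    mcp.run()   # stdio transport, which is what Claude Desktop expects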

📋 Prerequisites

  • Windows 10/11 (any Windows 7+ system with WASAPI support also works)
  • Python 3.8+
  • Claude Desktop (latest version)

🚦 Quick Start

1. Installation

# Clone the repository
git clone https://github.com/joelfuller2016/claude-desktop-realtime-audio-mcp-python.git
cd claude-desktop-realtime-audio-mcp-python

# Create virtual environment
python -m venv venv
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Install with GPU support (optional; assumes the project defines a "gpu" extra)
pip install ".[gpu]"

2. Configuration

Create a configuration file or use environment variables:

# Set OpenAI API key (for Whisper API)
set OPENAI_API_KEY=your_api_key_here

# Set Azure Speech key (optional)
set AZURE_SPEECH_KEY=your_azure_key
set AZURE_SPEECH_REGION=eastus

# Set Google credentials (optional)
set GOOGLE_CREDENTIALS_PATH=path/to/credentials.json
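
For reference, a configuration loader might pick these variables up roughly like this (illustrative sketch; the project's actual loader may differ):

# Illustrative sketch of reading the variables above; the real config loader may differ
import os

openai_key = os.getenv("OPENAI_API_KEY")                  # Whisper API
azure_key = os.getenv("AZURE_SPEECH_KEY")                 # Azure Speech (optional)
azure_region = os.getenv("AZURE_SPEECH_REGION", "eastus")
google_creds = os.getenv("GOOGLE_CREDENTIALS_PATH")       # Google STT (optional)

if not any([openai_key, azure_key, google_creds]):
    print("No cloud credentials set; local Whisper models will be used.")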

3. Test Audio Setup

# Test your microphone and audio devices
python -m audio.test_setup
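
If the helper module is unavailable, the same check can be done directly with sounddevice (standalone snippet, not part of the project):

# Standalone microphone check with sounddevice (not part of the project)
import numpy as np
import sounddevice as sd

duration, sample_rate = 3, 16000
print(sd.query_devices())                    # all audio devices
print("Default input device:", sd.default.device)

recording = sd.rec(int(duration * sample_rate),
                   samplerate=sample_rate, channels=1, dtype="float32")
sd.wait()                                    # block until recording is done
print("Peak amplitude:", float(np.abs(recording).max()))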

4. Run the MCP Server

# Start the server
python main.py

# Or with debug logging
python main.py --debug

5. Configure Claude Desktop

Add to your Claude Desktop configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "claude-audio": {
      "command": "python",
      "args": [
        "C:\\full\\path\\to\\main.py"
      ],
      "env": {
        "OPENAI_API_KEY": "your_api_key_here"
      }
    }
  }
}

🛠️ MCP Tools Available

Audio Control

  • start_recording() - Start real-time audio capture
  • stop_recording() - Stop audio capture
  • get_recording_status() - Get current status and config
  • test_audio_capture(duration=3.0) - Test microphone

Device Management

  • list_audio_devices() - List all audio input devices
  • set_audio_device(device_id) - Set audio input device
  • configure_audio_settings() - Adjust sample rate, channels, etc.

Speech Recognition

  • set_stt_engine(engine) - Switch between whisper/azure/google
  • View available engines and status

Resources

  • audio://devices - Available audio devices
  • audio://config - Current audio configuration
  • stt://engines - STT engines status

⚙️ Configuration

Audio Settings

{
  "audio": {
    "sample_rate": 16000,
    "channels": 1,
    "chunk_size": 1024,
    "device_id": null,
    "use_wasapi_exclusive": false,
    "low_latency": true
  }
}
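
These fields map naturally onto a sounddevice input stream; a hedged sketch of how the capture layer might apply them (illustrative, not the project's code):

# Illustrative mapping of the settings above onto a sounddevice input stream
import queue
import sounddevice as sd

audio_queue = queue.Queue()

def callback(indata, frames, time_info, status):
    if status:
        print(status)                  # overruns / dropouts show up here
    audio_queue.put(bytes(indata))     # hand raw PCM to the processing pipeline

wasapi = sd.WasapiSettings(exclusive=False)   # use_wasapi_exclusive
stream = sd.InputStream(
    samplerate=16000,      # sample_rate
    channels=1,            # channels
    blocksize=1024,        # chunk_size
    dtype="int16",
    device=None,           # device_id (None = system default input)
    latency="low",         # low_latency
    extra_settings=wasapi, # Windows / WASAPI only
    callback=callback,
)
stream.start()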

Voice Activity Detection

{
  "vad": {
    "mode": "hybrid",
    "webrtc_aggressiveness": 2,
    "energy_threshold": 0.01,
    "min_speech_duration": 0.1,
    "min_silence_duration": 0.3
  }
}
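
In "hybrid" mode the idea is to combine the WebRTC VAD decision with a simple energy gate; a hedged sketch for 16 kHz, 16-bit mono frames (illustrative, not the project's vad module):

# Illustrative hybrid check: WebRTC VAD decision AND an RMS energy gate
import numpy as np
import webrtcvad

vad = webrtcvad.Vad(2)        # webrtc_aggressiveness: 0 (lenient) to 3 (strict)
SAMPLE_RATE = 16000
FRAME_MS = 30                 # webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit mono PCM

def is_speech(frame: bytes, energy_threshold: float = 0.01) -> bool:
    samples = np.frombuffer(frame, dtype=np.int16).astype(np.float32) / 32768.0
    rms = float(np.sqrt(np.mean(samples ** 2)))
    return rms > energy_threshold and vad.is_speech(frame, SAMPLE_RATE)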

Speech-to-Text Engines

{
  "stt": {
    "default_engine": "whisper",
    "whisper": {
      "model_size": "base",
      "use_api": false,
      "language": null
    },
    "azure": {
      "enabled": false,
      "api_key": null,
      "region": "eastus"
    },
    "google": {
      "enabled": false,
      "credentials_path": null
    }
  }
}

🔧 Advanced Usage

Using Local Whisper Models

# Different model sizes (tiny, base, small, medium, large)
config.stt.whisper.model_size = "small"  # Faster
config.stt.whisper.model_size = "large"  # More accurate

Optimizing for Real-time Performance

# Low-latency settings
config.audio.chunk_size = 512
config.audio.sample_rate = 16000
config.vad.min_speech_duration = 0.1

Using Cloud STT Services

# Azure Speech (Windows cmd syntax, matching the Quick Start section)
set AZURE_SPEECH_KEY=your_key
set AZURE_SPEECH_REGION=eastus

# Google Speech
set GOOGLE_CREDENTIALS_PATH=C:\path\to\credentials.json
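
For reference, the standard Azure Speech SDK call that an Azure engine would presumably build on (requires the azure-cognitiveservices-speech package):

# Standard Azure Speech SDK usage, shown for reference
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ.get("AZURE_SPEECH_REGION", "eastus"),
)
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)  # default microphone
result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)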

🧪 Testing

Test Audio Devices

python -c "from audio.capture import list_audio_devices; print(list_audio_devices())"

Test Voice Activity Detection

python -c "from audio.vad import create_vad; vad = create_vad('hybrid')"

Benchmark STT Engines

python -m stt.benchmark --audio test_audio.wav

📊 Performance Monitoring

The server includes comprehensive logging and performance monitoring:

  • Audio Processing: Chunk processing times, dropouts, queue status
  • VAD Performance: Speech detection accuracy, false positives
  • STT Metrics: Transcription latency, confidence scores, accuracy
  • System Resources: Memory usage, CPU utilization

Enable debug logging for detailed metrics:

python main.py --debug
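
A timing helper of the following shape is one simple way to collect such metrics (hypothetical example, not part of the project):

# Hypothetical timing helper for per-stage latency metrics
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("claude-audio.metrics")

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    yield
    logger.debug("%s took %.1f ms", stage, (time.perf_counter() - start) * 1000)

# Usage (assuming some STT engine object):
# with timed("stt.transcribe"):
#     text = engine.transcribe(audio_chunk)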

🔍 Troubleshooting

Common Issues

1. No audio devices detected

# Check if sounddevice can see your devices
python -c "import sounddevice as sd; print(sd.query_devices())"

2. High latency

# Reduce chunk size and enable low latency
config.audio.chunk_size = 512
config.audio.low_latency = True

3. Whisper model loading errors

# Reinstall the package, then clear the downloaded model cache so models are re-downloaded
pip uninstall openai-whisper
pip install openai-whisper
rmdir /s /q "%USERPROFILE%\.cache\whisper"

4. WASAPI permissions on Windows

  • Check microphone privacy settings
  • Run as administrator if needed
  • Ensure Claude Desktop has microphone permissions

Debug Mode

# Enable comprehensive logging
python main.py --debug

# Or set environment variable
set LOG_LEVEL=DEBUG
python main.py

🔐 Security Considerations

  • API keys are stored in environment variables or secure config files
  • Audio data is processed locally by default (Whisper local models)
  • Cloud STT services can be disabled for maximum privacy
  • No audio data is permanently stored

🚀 Performance Optimization

For Low Latency (<200ms)

{
  "audio": {
    "chunk_size": 512,
    "sample_rate": 16000
  },
  "stt": {
    "whisper": {
      "model_size": "tiny",
      "fp16": true
    }
  }
}

For High Accuracy

{
  "audio": {
    "chunk_size": 2048
  },
  "stt": {
    "whisper": {
      "model_size": "large",
      "beam_size": 10
    }
  }
}

🤝 Contributing

We welcome contributions! Areas where help is needed:

  • 🎯 Additional STT engines (AssemblyAI, Rev.ai, etc.)
  • 🔧 Audio preprocessing (noise reduction, normalization)
  • 📱 Cross-platform support (macOS, Linux)
  • 🧪 Testing frameworks (automated audio testing)
  • 📚 Documentation (tutorials, examples)

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Anthropic for Claude and the Model Context Protocol
  • OpenAI for Whisper speech recognition
  • Python Audio Community for excellent libraries
  • FastMCP for the high-level MCP framework


⭐ Star this repository if you find it useful!

This Python implementation provides a more maintainable and feature-rich alternative to the original TypeScript version, with better audio processing capabilities and easier extensibility.
