Web Scraper MCP

Created by navin4078, 6 months ago

Scrape websites and let them talk to your LLM

MCP Web Scraper

A lightweight, efficient web scraping MCP server that communicates over the direct STDIO protocol

🚀 Quick Start

Option 1: Automated Setup

# Clone and setup
git clone https://github.com/navin4078/mcp-web-scraper
cd mcp-web-scraper
chmod +x setup.sh && ./setup.sh

Option 2: Manual Setup

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install minimal dependencies
pip install -r requirements.txt

⚙️ Claude Desktop Configuration

Step 1: Find Your Paths

# Get absolute paths (run this in your project directory)
echo "Python path: $(pwd)/venv/bin/python"
echo "Script path: $(pwd)/app_mcp.py"

Step 2: Configure Claude Desktop

Open your Claude Desktop config file:

macOS:

~/Library/Application Support/Claude/claude_desktop_config.json

Windows:

%APPDATA%\Claude\claude_desktop_config.json

Linux:

~/.config/Claude/claude_desktop_config.json

Step 3: Add Configuration

Add this to your config file:

{
  "mcpServers": {
    "web-scraper": {
      "command": "/full/path/to/your/venv/bin/python",
      "args": ["/full/path/to/your/app_mcp.py"]
    }
  }
}

Example:

{
  "mcpServers": {
    "web-scraper": {
      "command": "/Users/username/Desktop/scrapper/venv/bin/python",
      "args": ["/Users/username/Desktop/scrapper/app_mcp.py"]
    }
  }
}

Step 4: Restart Claude Desktop

  1. Completely close Claude Desktop (Cmd+Q on Mac)
  2. Restart the application
  3. Look for the hammer icon (🔨)
  4. You should see "web-scraper" in your MCP servers

🛠 Available Tools

scrape_website

Extract data from websites with flexible options:

  • extract_type: text, links, images, table
  • selector: CSS selector for targeting specific elements
  • max_results: Limit number of results (1-50)
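The options above can be sketched in a few lines with BeautifulSoup (one of the project's listed dependencies). This is an illustrative sketch, not the server's actual code: the function name and the `html.parser` backend are assumptions for the example.

```python
# Illustrative sketch of selector-based extraction (not the server's actual code).
from bs4 import BeautifulSoup

def scrape_html(html, extract_type="text", selector=None, max_results=10):
    """Extract text or links from raw HTML, mirroring scrape_website's options."""
    soup = BeautifulSoup(html, "html.parser")
    # Narrow the search to a CSS selector when one is given
    scope = soup.select(selector) if selector else [soup]
    results = []
    for node in scope:
        if extract_type == "text":
            text = node.get_text(" ", strip=True)
            if text:
                results.append(text)
        elif extract_type == "links":
            for a in node.find_all("a", href=True):
                results.append({"text": a.get_text(strip=True), "href": a["href"]})
    return results[:max_results]

html = '<div class="post"><p>Hello</p><a href="/x">More</a></div>'
print(scrape_html(html, "links"))
```

Passing `selector=".post p"` would restrict extraction to paragraphs inside the `post` container, and `max_results` caps the list exactly as described above.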

extract_headlines

Get all headlines (h1, h2, h3) from a webpage with hierarchy and attributes.
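Headline extraction with hierarchy can be sketched like this (again an illustrative example, not the server's internals):

```python
# Illustrative sketch of headline extraction with hierarchy preserved.
from bs4 import BeautifulSoup

def extract_headlines(html):
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns headings in document order; record each with its level
    return [
        {"level": int(tag.name[1]), "text": tag.get_text(strip=True)}
        for tag in soup.find_all(["h1", "h2", "h3"])
    ]

html = "<h1>Top</h1><h2>Sub</h2><h3>Detail</h3>"
print(extract_headlines(html))
```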

extract_metadata

Extract comprehensive metadata:

  • Basic: title, description, keywords, author
  • Open Graph: og:title, og:description, og:image
  • Twitter Cards: twitter:title, twitter:description
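All three metadata families above can be collected with one pass over the `<meta>` tags, since standard metadata uses the `name` attribute (including `twitter:*`) while Open Graph uses `property`. A minimal sketch, assuming BeautifulSoup as in the project's dependencies:

```python
# Illustrative sketch of metadata extraction (not the server's actual code).
from bs4 import BeautifulSoup

def extract_metadata(html):
    soup = BeautifulSoup(html, "html.parser")
    meta = {"title": soup.title.get_text(strip=True) if soup.title else None}
    for tag in soup.find_all("meta"):
        # <meta name=...> covers description/keywords/author and twitter:* cards;
        # <meta property=...> covers Open Graph (og:*) tags
        key = tag.get("name") or tag.get("property")
        if key and tag.get("content"):
            meta[key] = tag["content"]
    return meta

html = ('<head><title>Demo</title>'
        '<meta name="description" content="A page">'
        '<meta property="og:title" content="Demo OG"></head>')
print(extract_metadata(html))
```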

get_page_info

Get page structure overview:

  • Element counts (paragraphs, headings, links, images, tables)
  • Basic metadata
  • Page statistics
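The element counts in the overview amount to a handful of `find_all` calls. A sketch of the idea (the dictionary keys are illustrative, not the tool's exact output schema):

```python
# Illustrative sketch of a page-structure overview via element counts.
from bs4 import BeautifulSoup

def page_info(html):
    soup = BeautifulSoup(html, "html.parser")
    return {
        "paragraphs": len(soup.find_all("p")),
        "headings": len(soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])),
        "links": len(soup.find_all("a")),
        "images": len(soup.find_all("img")),
        "tables": len(soup.find_all("table")),
    }

html = "<h1>T</h1><p>a</p><p>b</p><a href='#'>x</a><img src='i.png'>"
print(page_info(html))
```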

💡 Usage Examples

Basic Scraping

Scrape the text content from https://example.com

Extract all links from https://news.ycombinator.com

Get headlines from https://www.bbc.com/news

Advanced Examples

Extract all images from https://example.com with their alt text

Scrape text from https://example.com using the CSS selector ".article-content p"

Get metadata and Open Graph tags from https://github.com

What's the page structure of https://stackoverflow.com?

Specific Selectors

Extract text from https://news.ycombinator.com using selector ".titleline a"

Get all table data from https://example.com/data-page

Scrape only paragraph text from articles using selector "article p"

📁 Project Structure

scrapper/
├── app_mcp.py             # Main MCP server (STDIO)
├── requirements.txt       # Minimal dependencies
├── setup.sh              # Simple setup script
├── .gitignore            # Git ignore rules
└── README.md             # This file

🔧 Features

Web Scraping Capabilities

  • ✅ Text extraction with CSS selectors
  • ✅ Link extraction with full attributes
  • ✅ Image extraction with metadata
  • ✅ Table data extraction and formatting
  • ✅ Comprehensive metadata extraction
  • ✅ Headline extraction with hierarchy
  • ✅ Custom CSS selector support
  • ✅ Configurable result limits
  • ✅ Error handling and validation

MCP Integration

  • ✅ Direct STDIO protocol (no HTTP needed)
  • ✅ Native Claude Desktop integration
  • ✅ Automatic server lifecycle management
  • ✅ Schema validation and documentation
  • ✅ Comprehensive error handling
  • ✅ Minimal dependencies

🛡 Security & Best Practices

  1. Respect robots.txt: Always check robots.txt before scraping
  2. Timeouts: A built-in 10-second request timeout keeps slow sites from hanging the server
  3. User-Agent: Uses modern browser headers
  4. Input validation: URL and parameter validation
  5. Error handling: Graceful error handling and reporting
  6. Resource limits: Configurable result limits prevent overload
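The timeout and User-Agent practices above look roughly like the following helper, built on requests. The header values and the function name are illustrative assumptions, not the server's actual configuration:

```python
# Illustrative fetch helper: 10-second timeout and browser-like headers.
import requests

DEFAULT_HEADERS = {
    # A modern-browser User-Agent string; the exact value here is an assumption
    "User-Agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml",
}

def fetch_html(url, timeout=10):
    """Fetch a page, failing fast instead of hanging on slow servers."""
    resp = requests.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    resp.raise_for_status()  # surface HTTP 4xx/5xx errors to the caller
    return resp.text
```

The `timeout` applies per request, so a misbehaving site can delay the server by at most ten seconds; checking robots.txt and throttling request frequency remain the caller's responsibility.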

🐛 Troubleshooting

MCP Server Not Appearing

Check your paths:

# Verify files exist
ls -la /path/to/your/venv/bin/python
ls -la /path/to/your/app_mcp.py

# Test the script manually
/path/to/your/venv/bin/python /path/to/your/app_mcp.py

Validate JSON configuration:

  • Use a JSON validator to check syntax
  • Ensure no trailing commas
  • Use absolute paths (not relative)
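The checks above can also be run with a short standard-library script. This is a sketch under stated assumptions: the sample config and the helper name are illustrative, and `json.load` rejects trailing commas just as Claude Desktop does.

```python
# Validate a Claude Desktop config file using only the standard library.
import json, os, tempfile

def validate_config(path):
    """Return the parsed config, or raise ValueError with a readable message."""
    with open(path) as f:
        try:
            config = json.load(f)  # fails on trailing commas and other syntax errors
        except json.JSONDecodeError as e:
            raise ValueError(f"{path}: invalid JSON at line {e.lineno}: {e.msg}")
    for server in config.get("mcpServers", {}).values():
        # Absolute paths are required; relative paths break when Claude launches the server
        if not os.path.isabs(server.get("command", "")):
            raise ValueError(f"command is not an absolute path: {server.get('command')}")
    return config

# Demo with a temporary sample file (paths here are illustrative)
sample = ('{"mcpServers": {"web-scraper": '
          '{"command": "/usr/bin/python3", "args": ["/tmp/app_mcp.py"]}}}')
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write(sample)
    path = f.name
cfg = validate_config(path)
os.remove(path)
print(cfg["mcpServers"]["web-scraper"]["command"])
```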

Permission Issues

# Make script executable
chmod +x app_mcp.py

# Check virtual environment
source venv/bin/activate
python --version

Import Errors

# Reinstall dependencies
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Testing the MCP Server

You can test if the server works by running it manually:

source venv/bin/activate
python app_mcp.py

The server should start and wait for STDIO input from Claude Desktop.

📚 Dependencies

  • requests: HTTP library for web requests
  • beautifulsoup4: HTML/XML parsing
  • lxml: Fast XML and HTML processor
  • mcp: Model Context Protocol library

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Test thoroughly with Claude Desktop
  4. Submit a pull request

📄 License

This project is open source and available under the MIT License.

Simple, efficient web scraping for Claude Desktop! 🕷️✨
