MCP Web Scraper
A lightweight, efficient web-scraping MCP server that talks to the client directly over the STDIO protocol
🚀 Quick Start
Option 1: Automated Setup
# Clone and setup
git clone https://github.com/navin4078/mcp-web-scraper
cd mcp-web-scraper
chmod +x setup.sh && ./setup.sh
Option 2: Manual Setup
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install minimal dependencies
pip install -r requirements.txt
⚙️ Claude Desktop Configuration
Step 1: Find Your Paths
# Get absolute paths (run this in your project directory)
echo "Python path: $(pwd)/venv/bin/python"
echo "Script path: $(pwd)/app_mcp.py"
Step 2: Configure Claude Desktop
Open your Claude Desktop config file:
macOS:
~/Library/Application Support/Claude/claude_desktop_config.json
Windows:
%APPDATA%\Claude\claude_desktop_config.json
Linux:
~/.config/Claude/claude_desktop_config.json
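If you would rather locate the file from a script, the three locations above can be sketched as a small lookup. This is only an illustration: the function name `claude_config_path` is hypothetical, and any platform that is not macOS or Windows is assumed to use the Linux path.

```python
import os
import sys
from pathlib import Path

CONFIG_NAME = "claude_desktop_config.json"


def claude_config_path(platform: str = sys.platform) -> Path:
    """Return the Claude Desktop config location listed above for a platform."""
    home = Path.home()
    if platform == "darwin":  # macOS
        return home / "Library" / "Application Support" / "Claude" / CONFIG_NAME
    if platform.startswith("win"):  # Windows
        # %APPDATA% normally points at ...\AppData\Roaming
        appdata = os.environ.get("APPDATA", str(home / "AppData" / "Roaming"))
        return Path(appdata) / "Claude" / CONFIG_NAME
    # Assumption: everything else is treated as Linux
    return home / ".config" / "Claude" / CONFIG_NAME


if __name__ == "__main__":
    path = claude_config_path()
    print(f"{path} (exists: {path.exists()})")
```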
Step 3: Add Configuration
Add this to your config file:
{
  "mcpServers": {
    "web-scraper": {
      "command": "/full/path/to/your/venv/bin/python",
      "args": ["/full/path/to/your/app_mcp.py"]
    }
  }
}
Example:
{
  "mcpServers": {
    "web-scraper": {
      "command": "/Users/username/Desktop/scrapper/venv/bin/python",
      "args": ["/Users/username/Desktop/scrapper/app_mcp.py"]
    }
  }
}
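To avoid typing the absolute paths by hand, the entry can also be generated. The sketch below is not part of the project: the helper names are hypothetical, and it assumes the macOS/Linux virtual environment layout (`venv/bin/python`) used throughout this README.

```python
import json
from pathlib import Path


def build_server_entry(project_dir: str) -> dict:
    """Build the "web-scraper" entry from a project directory."""
    root = Path(project_dir).resolve()  # the config needs absolute paths
    return {
        "command": str(root / "venv" / "bin" / "python"),
        "args": [str(root / "app_mcp.py")],
    }


def merge_into_config(config: dict, project_dir: str) -> dict:
    """Return a copy of config with the web-scraper server added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep other servers intact
    servers["web-scraper"] = build_server_entry(project_dir)
    merged["mcpServers"] = servers
    return merged


if __name__ == "__main__":
    # Run from the project directory, then paste the output into the config file.
    print(json.dumps(merge_into_config({}, "."), indent=2))
```

Merging into a copy rather than overwriting keeps any other servers you have already configured.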
Step 4: Restart Claude Desktop
- Completely close Claude Desktop (Cmd+Q on Mac)
- Restart the application
- Look for the hammer icon (🔨)
- You should see "web-scraper" in your MCP servers
🛠 Available Tools
scrape_website
Extract data from websites with flexible options:
- extract_type: text, links, images, table
- selector: CSS selector for targeting specific elements
- max_results: Limit number of results (1-50)
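The options above imply some input checks. The following is a minimal sketch of what such validation might look like, not the code in `app_mcp.py`: the function name, the `url` parameter name, and the defaults (`"text"` and `10`) are assumptions made for illustration.

```python
from urllib.parse import urlparse

EXTRACT_TYPES = {"text", "links", "images", "table"}


def validate_scrape_args(url, extract_type="text", selector=None, max_results=10):
    """Check arguments against the documented options and return them as a dict."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    if extract_type not in EXTRACT_TYPES:
        raise ValueError(f"extract_type must be one of {sorted(EXTRACT_TYPES)}")
    if not 1 <= max_results <= 50:  # documented range
        raise ValueError("max_results must be between 1 and 50")
    return {
        "url": url,
        "extract_type": extract_type,
        "selector": selector,
        "max_results": max_results,
    }
```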
extract_headlines
Get all headlines (h1, h2, h3) from a webpage with hierarchy and attributes.
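To make "hierarchy and attributes" concrete, here is a rough sketch of headline extraction from an HTML string. It deliberately uses the standard library's `html.parser` instead of the project's BeautifulSoup dependency so it runs without installing anything; the output shape (`level`, `text`, `attrs`) is an assumption, and fetching the page is left out.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect h1-h3 elements with their level, text, and attributes."""

    LEVELS = {"h1": 1, "h2": 2, "h3": 3}

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._current = None  # headline currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.LEVELS:
            self._current = {"level": self.LEVELS[tag], "text": "", "attrs": dict(attrs)}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag in self.LEVELS and self._current is not None:
            # Collapse runs of whitespace left over from nested markup
            self._current["text"] = " ".join(self._current["text"].split())
            self.headlines.append(self._current)
            self._current = None


def extract_headlines(html: str) -> list:
    parser = HeadlineParser()
    parser.feed(html)
    parser.close()
    return parser.headlines
```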
extract_metadata
Extract comprehensive metadata:
- Basic: title, description, keywords, author
- Open Graph: og:title, og:description, og:image
- Twitter Cards: twitter:title, twitter:description
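The three groups above map onto `<title>` and `<meta>` tags in the page head. As an illustration only, the sketch below sorts them into those groups using the standard library's `html.parser` (again swapped in for BeautifulSoup to keep the example dependency-free); the group keys `basic`, `open_graph`, and `twitter` are hypothetical names.

```python
from html.parser import HTMLParser

BASIC_NAMES = {"description", "keywords", "author"}


class MetadataParser(HTMLParser):
    """Sort <title> and <meta> tags into basic, Open Graph, and Twitter groups."""

    def __init__(self):
        super().__init__()
        self.meta = {"basic": {}, "open_graph": {}, "twitter": {}}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
            self.meta["basic"]["title"] = ""
        elif tag == "meta" and attrs.get("content") is not None:
            # Open Graph tags use "property"; most others use "name"
            key = attrs.get("property") or attrs.get("name") or ""
            if key.startswith("og:"):
                self.meta["open_graph"][key] = attrs["content"]
            elif key.startswith("twitter:"):
                self.meta["twitter"][key] = attrs["content"]
            elif key in BASIC_NAMES:
                self.meta["basic"][key] = attrs["content"]

    def handle_data(self, data):
        if self._in_title:
            self.meta["basic"]["title"] += data

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self.meta["basic"]["title"] = self.meta["basic"]["title"].strip()


def extract_metadata(html: str) -> dict:
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return parser.meta
```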
get_page_info
Get page structure overview:
- Element counts (paragraphs, headings, links, images, tables)
- Basic metadata
- Page statistics
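The element counts can be pictured as a single pass over the page's opening tags. This sketch assumes that "headings" means h1 through h6 and that links are `<a>` elements; it uses the standard library parser rather than the project's own code, and it takes an HTML string rather than a URL.

```python
from collections import Counter
from html.parser import HTMLParser

# Tag name -> label used in the list above
GROUPS = {"p": "paragraphs", "a": "links", "img": "images", "table": "tables"}
GROUPS.update({f"h{n}": "headings" for n in range(1, 7)})

LABELS = ("paragraphs", "headings", "links", "images", "tables")


class CountingParser(HTMLParser):
    """Count opening tags that belong to one of the reported groups."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in GROUPS:
            self.counts[GROUPS[tag]] += 1


def count_elements(html: str) -> dict:
    parser = CountingParser()
    parser.feed(html)
    parser.close()
    return {label: parser.counts[label] for label in LABELS}
```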
💡 Usage Examples
Basic Scraping
Scrape the text content from https://example.com
Extract all links from https://news.ycombinator.com
Get headlines from https://www.bbc.com/news
Advanced Examples
Extract all images from https://example.com with their alt text
Scrape text from https://example.com using the CSS selector ".article-content p"
Get metadata and Open Graph tags from https://github.com
What's the page structure of https://stackoverflow.com?
Specific Selectors
Extract text from https://news.ycombinator.com using selector ".titleline a"
Get all table data from https://example.com/data-page
Scrape only paragraph text from articles using selector "article p"
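Behind the scenes, a prompt like the ones above is turned by the client into an MCP `tools/call` request. The sketch below shows roughly what those requests could look like. The JSON-RPC envelope follows the MCP protocol, but the argument name `url` and the exact argument sets are assumptions; check the tool schemas the server reports for the real names.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> dict:
    """Wrap a tool name and its arguments in a JSON-RPC "tools/call" request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


calls = [
    # "Extract text from https://news.ycombinator.com using selector '.titleline a'"
    tool_call(1, "scrape_website", {
        "url": "https://news.ycombinator.com",  # "url" is an assumed argument name
        "extract_type": "text",
        "selector": ".titleline a",
    }),
    # "Get all table data from https://example.com/data-page"
    tool_call(2, "scrape_website", {
        "url": "https://example.com/data-page",
        "extract_type": "table",
    }),
    # "Get headlines from https://www.bbc.com/news"
    tool_call(3, "extract_headlines", {"url": "https://www.bbc.com/news"}),
]

if __name__ == "__main__":
    for call in calls:
        print(json.dumps(call))
```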
📁 Project Structure
scrapper/
├── app_mcp.py # Main MCP server (STDIO)
├── requirements.txt # Minimal dependencies
├── setup.sh # Simple setup script
├── .gitignore # Git ignore rules
└── README.md # This file
🔧 Features
Web Scraping Capabilities
- ✅ Text extraction with CSS selectors
- ✅ Link extraction with full attributes
- ✅ Image extraction with metadata
- ✅ Table data extraction and formatting
- ✅ Comprehensive metadata extraction
- ✅ Headline extraction with hierarchy
- ✅ Custom CSS selector support
- ✅ Configurable result limits
- ✅ Error handling and validation
MCP Integration
- ✅ Direct STDIO protocol (no HTTP needed)
- ✅ Native Claude Desktop integration
- ✅ Automatic server lifecycle management
- ✅ Schema validation and documentation
- ✅ Comprehensive error handling
- ✅ Minimal dependencies
🛡 Security & Best Practices
- Respect robots.txt: Always check robots.txt before scraping
- Request timeout: Built-in 10-second timeout on every request
- User-Agent: Uses modern browser headers
- Input validation: URL and parameter validation
- Error handling: Graceful error handling and reporting
- Resource limits: Configurable result limits prevent overload
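The robots.txt check in the first bullet can be done with Python's standard `urllib.robotparser`. This is a small sketch of how you might run that check yourself before asking for a scrape; nothing here claims the server performs it for you, and the helper names are made up for the example.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that hosts page_url."""
    parts = urlparse(page_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def is_allowed(robots_lines, page_url: str, user_agent: str = "*") -> bool:
    """Check page_url against robots.txt content that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, page_url)


def check_live(page_url: str, user_agent: str = "*") -> bool:
    """Download the site's robots.txt (needs network access) and check page_url."""
    parser = RobotFileParser(robots_url(page_url))
    parser.read()
    return parser.can_fetch(user_agent, page_url)
```

For example, `check_live("https://example.com/some/page")` returns whether a generic crawler may fetch that page according to the site's current rules.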
🐛 Troubleshooting
MCP Server Not Appearing
Check your paths:
# Verify files exist
ls -la /path/to/your/venv/bin/python
ls -la /path/to/your/app_mcp.py
# Test the script manually
/path/to/your/venv/bin/python /path/to/your/app_mcp.py
Validate JSON configuration:
- Use a JSON validator to check syntax
- Ensure no trailing commas
- Use absolute paths (not relative)
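The three checks above can be automated with a few lines of Python. The sketch below is an illustration with a hypothetical function name; it treats every item in `args` as a file path, which holds for the configuration in this README but not necessarily for other servers.

```python
import json
from pathlib import Path


def check_config(config_text: str, server_name: str = "web-scraper") -> list:
    """Return a list of problems found in a Claude Desktop config string."""
    try:
        config = json.loads(config_text)  # rejects trailing commas
    except json.JSONDecodeError as exc:
        return [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(config, dict):
        return ["top level of the config must be a JSON object"]

    server = config.get("mcpServers", {}).get(server_name)
    if server is None:
        return [f'no "{server_name}" entry under "mcpServers"']

    # Assumption: "command" and every item in "args" are file paths
    candidates = [("command", server.get("command", ""))]
    candidates += [("args", arg) for arg in server.get("args", [])]

    problems = []
    for label, value in candidates:
        path = Path(value)
        if not path.is_absolute():
            problems.append(f"{label}: {value!r} is not absolute")
        elif not path.exists():
            problems.append(f"{label}: {value!r} does not exist")
    return problems
```

An empty list means the entry parsed cleanly and both paths exist on disk.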
Permission Issues
# Make script executable
chmod +x app_mcp.py
# Check virtual environment
source venv/bin/activate
python --version
Import Errors
# Reinstall dependencies
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
Testing the MCP Server
You can test if the server works by running it manually:
source venv/bin/activate
python app_mcp.py
The server should start and wait for STDIO input from Claude Desktop.
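A waiting process alone does not tell you much, so you may want to send it the first message a client would send. The sketch below builds an MCP `initialize` request and, in `probe_server`, writes it to the server's stdin and reads one line back. It assumes the server speaks newline-delimited JSON-RPC over STDIO as the MCP specification describes; the protocol version string and the client name are placeholders, and `probe_server` will block if the server never replies.

```python
import json
import subprocess


def initialize_request(request_id: int = 1) -> str:
    """Return a newline-terminated JSON-RPC "initialize" message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumed version; use one your installed mcp library supports
            "protocolVersion": "2024-11-05",
            "capabilities": {},
            # Hypothetical client name for this manual test
            "clientInfo": {"name": "manual-probe", "version": "0.0.0"},
        },
    }
    return json.dumps(message) + "\n"


def probe_server(python_path: str, script_path: str) -> dict:
    """Start the server, send "initialize", and return its first reply."""
    proc = subprocess.Popen(
        [python_path, script_path],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    try:
        proc.stdin.write(initialize_request())
        proc.stdin.flush()
        return json.loads(proc.stdout.readline())  # blocks until a line arrives
    finally:
        proc.kill()
        proc.wait()
```

Calling `probe_server("venv/bin/python", "app_mcp.py")` from the project directory should return a dictionary with a `result` key if the handshake succeeds.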
📚 Dependencies
- requests: HTTP library for web requests
- beautifulsoup4: HTML/XML parsing
- lxml: Fast XML and HTML processor
- mcp: Model Context Protocol library
🤝 Contributing
- Fork the repository
- Create a feature branch
- Test thoroughly with Claude Desktop
- Submit a pull request
📄 License
This project is open source and available under the MIT License.
Simple, efficient web scraping for Claude Desktop! 🕷️✨