🦊 MCPBench: A Benchmark for Evaluating MCP Servers
MCPBench is an evaluation framework for MCP Servers. It supports three types of evaluation tasks: Web Search, Database Query, and GAIA, and is compatible with both local and remote MCP Servers. The framework compares different MCP Servers (such as Brave Search, DuckDuckGo, etc.) on task completion accuracy, latency, and token consumption under the same LLM and Agent configurations. See the evaluation report for details.
The implementation builds on LangProBe: a Language Programs Benchmark.
Big thanks to Qingxu Fu for the initial implementation!
🔥 News
Apr. 29, 2025 🌟 Update the code for evaluating the MCP Server Package within GAIA.
Apr. 14, 2025 🌟 We are proud to announce that MCPBench is now open-sourced.
🛠️ Installation
The framework requires Python >= 3.11, Node.js, and jq.
conda create -n mcpbench python=3.11 -y
conda activate mcpbench
pip install -r requirements.txt
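pip installs only the Python dependencies. If Node.js and jq are not already available, one option (a sketch assuming a conda-based setup; both packages are published on conda-forge) is:
# optional: install Node.js and jq from conda-forge into the same environment
conda install -c conda-forge nodejs jq -y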
🚀 Quick Start
Launch MCP Server
Launch stdio MCP as SSE
If the MCP Server only supports stdio (i.e., it does not support SSE), write the config like this:
{
    "mcp_pool": [
        {
            "name": "FireCrawl",
            "description": "A Model Context Protocol (MCP) server implementation that integrates with Firecrawl for web scraping capabilities.",
            "tools": [
                {
                    "tool_name": "firecrawl_search",
                    "tool_description": "Search the web and optionally extract content from search results.",
                    "inputs": [
                        {
                            "name": "query",
                            "type": "string",
                            "required": true,
                            "description": "your search query"
                        }
                    ]
                }
            ],
            "run_config": [
                {
                    "command": "npx -y firecrawl-mcp",
                    "args": "FIRECRAWL_API_KEY=xxx",
                    "port": 8005
                }
            ]
        }
    ]
}
Save this config file in the configs folder and launch it using:
sh launch_mcps_as_sse.sh YOUR_CONFIG_FILE
For example, if the config file is mcp_config_websearch.json, then run:
sh launch_mcps_as_sse.sh mcp_config_websearch.json
Launch SSE MCP
If your server supports SSE, you can use it directly. The URL will look like http://localhost:8001/sse
For an SSE-supported MCP Server, write the config like this:
{
    "mcp_pool": [
        {
            "name": "browser_use",
            "description": "AI-driven browser automation server implementing the Model Context Protocol (MCP) for natural language browser control and web research.",
            "tools": [
                {
                    "tool_name": "browser_use",
                    "tool_description": "Executes a browser automation task based on natural language instructions and waits for it to complete.",
                    "inputs": [
                        {
                            "name": "query",
                            "type": "string",
                            "required": true,
                            "description": "Your query"
                        }
                    ]
                }
            ],
            "url": "http://0.0.0.0:8001/sse"
        }
    ]
}
where the url can be generated from the MCP market on ModelScope.
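Before running an evaluation, it can help to confirm the endpoint is reachable. A minimal sanity check, assuming the server from the config above is listening locally on port 8001, is to open the SSE stream with curl; a healthy server keeps the connection open and streams events:
# connect to the SSE endpoint without buffering (press Ctrl+C to stop)
curl -N http://localhost:8001/sse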
Launch Evaluation
To evaluate the MCP Server's performance on Web Search tasks:
sh evaluation_websearch.sh YOUR_CONFIG_FILE
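For example, assuming the Web Search servers were launched from mcp_config_websearch.json as above and the same config file drives the evaluation:
sh evaluation_websearch.sh mcp_config_websearch.json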
To evaluate the MCP Server's performance on Database Query tasks:
sh evaluation_db.sh YOUR_CONFIG_FILE
To evaluate the MCP Server's performance on GAIA tasks:
sh evaluation_gaia.sh YOUR_CONFIG_FILE
Datasets and Experimental Results
Our framework provides two datasets for evaluation. For the Web Search task, the dataset is located at MCPBench/langProBe/WebSearch/data/websearch_600.jsonl and contains 200 QA pairs each from the Frames, news, and technology domains. Our framework for automatically constructing evaluation datasets will be open-sourced later.
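To get a feel for the data, you can inspect the first record with jq (already listed as an installation requirement); the path is the one given above:
# pretty-print the first Web Search QA pair
head -n 1 MCPBench/langProBe/WebSearch/data/websearch_600.jsonl | jq .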
For the Database Query task, the dataset is located at MCPBench/langProBe/DB/data/car_bi.jsonl. You can add your own dataset in the following format:
{
    "unique_id": "",
    "Prompt": "",
    "Answer": ""
}
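Since the file is JSONL, each record is a single JSON object on its own line. A hypothetical entry (the values below are made up purely for illustration) might look like:
{"unique_id": "car_bi_0001", "Prompt": "How many cars were sold in 2024?", "Answer": "1245"}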
We have evaluated mainstream MCP Servers on both tasks. For detailed experimental results, please refer to the Documentation.
🚰 Cite
If you find this work useful, please consider citing our project:
@misc{mcpbench,
    title={MCPBench: A Benchmark for Evaluating MCP Servers},
    author={Zhiling Luo and Xiaorong Shi and Xuanrui Lin and Jinyang Gao},
    howpublished={\url{https://github.com/modelscope/MCPBench}},
    year={2025}
}
Alternatively, you may reference our report.
@article{mcpbench_report,
    title={Evaluation Report on MCP Servers},
    author={Zhiling Luo and Xiaorong Shi and Xuanrui Lin and Jinyang Gao},
    year={2025},
    journal={arXiv preprint arXiv:2504.11094},
    url={https://arxiv.org/abs/2504.11094},
    primaryClass={cs.AI}
}