
parquet_mcp_server


A powerful MCP (Model Context Protocol) server that provides tools for manipulating and analyzing Parquet files. This server is designed to work with Claude Desktop and offers five main functionalities:

  1. Text Embedding Generation: Convert text columns in Parquet files into vector embeddings using Ollama models
  2. Parquet File Analysis: Extract detailed information about Parquet files including schema, row count, and file size
  3. DuckDB Integration: Convert Parquet files to DuckDB databases for efficient querying and analysis
  4. PostgreSQL Integration: Convert Parquet files to PostgreSQL tables with pgvector support for vector similarity search
  5. Markdown Processing: Convert markdown files into chunked text with metadata, preserving document structure and links

This server is particularly useful for:

  • Data scientists working with large Parquet datasets
  • Applications requiring vector embeddings for text data
  • Projects needing to analyze or convert Parquet files
  • Workflows that benefit from DuckDB's fast querying capabilities
  • Applications requiring vector similarity search with PostgreSQL and pgvector

Installation

Installing via Smithery

To install Parquet MCP Server for Claude Desktop automatically via Smithery:

npx -y @smithery/cli install @DeepSpringAI/parquet_mcp_server --client claude

Clone this repository

git clone ...
cd parquet_mcp_server

Create and activate virtual environment

uv venv
.venv\Scripts\activate  # On Windows
source .venv/bin/activate  # On macOS/Linux

Install the package

uv pip install -e .

Environment

Create a .env file with the following variables:

EMBEDDING_URL=  # URL for the embedding service
OLLAMA_URL=    # URL for Ollama server
EMBEDDING_MODEL=nomic-embed-text  # Model to use for generating embeddings

# PostgreSQL Configuration
POSTGRES_DB=your_database_name
POSTGRES_USER=your_username
POSTGRES_PASSWORD=your_password
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
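The PostgreSQL variables above are typically combined into a single connection string. The sketch below uses a hypothetical minimal parser for the `.env` format (the server itself may rely on a library such as python-dotenv); the sample values are placeholders.

```python
# Hypothetical .env loader illustrating the expected keys; the server's
# actual loading mechanism may differ (e.g. python-dotenv).
SAMPLE_ENV = """\
EMBEDDING_URL=http://localhost:11434/api/embed
OLLAMA_URL=http://localhost:11434
EMBEDDING_MODEL=nomic-embed-text
POSTGRES_DB=mydb
POSTGRES_USER=user
POSTGRES_PASSWORD=secret
POSTGRES_HOST=localhost
POSTGRES_PORT=5432
"""

def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env

env = parse_env(SAMPLE_ENV)
# Build a libpq-style DSN from the individual POSTGRES_* variables.
dsn = (f"postgresql://{env['POSTGRES_USER']}:{env['POSTGRES_PASSWORD']}"
       f"@{env['POSTGRES_HOST']}:{env['POSTGRES_PORT']}/{env['POSTGRES_DB']}")
print(dsn)
```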

Usage with Claude Desktop

Add this to your Claude Desktop configuration file (claude_desktop_config.json):

{
  "mcpServers": {
    "parquet-mcp-server": {
      "command": "uv",
      "args": [
        "--directory",
        "/home/${USER}/workspace/parquet_mcp_server/src/parquet_mcp_server",
        "run",
        "main.py"
      ]
    }
  }
}
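Note that JSON performs no variable expansion, so `${USER}` in the path above will not be substituted by Claude Desktop. One way to generate the entry with the home directory resolved is a small script like this (the workspace path is an assumption matching the example above):

```python
import json
import os

# Build the Claude Desktop entry with the user's home directory expanded,
# since a literal ${USER} inside JSON is not substituted at load time.
config = {
    "mcpServers": {
        "parquet-mcp-server": {
            "command": "uv",
            "args": [
                "--directory",
                os.path.expanduser(
                    "~/workspace/parquet_mcp_server/src/parquet_mcp_server"
                ),
                "run",
                "main.py",
            ],
        }
    }
}
print(json.dumps(config, indent=2))
```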

Available Tools

The server provides five main tools:

  1. Embed Parquet: Adds embeddings to a specific column in a Parquet file

    • Required parameters:
      • input_path: Path to input Parquet file
      • output_path: Path to save the output
      • column_name: Column containing text to embed
      • embedding_column: Name for the new embedding column
      • batch_size: Number of texts to process in each batch (for better performance)
  2. Parquet Information: Get details about a Parquet file

    • Required parameters:
      • file_path: Path to the Parquet file to analyze
  3. Convert to DuckDB: Convert a Parquet file to a DuckDB database

    • Required parameters:
      • parquet_path: Path to the input Parquet file
    • Optional parameters:
      • output_dir: Directory to save the DuckDB database (defaults to same directory as input file)
  4. Convert to PostgreSQL: Convert a Parquet file to a PostgreSQL table with pgvector support

    • Required parameters:
      • parquet_path: Path to the input Parquet file
      • table_name: Name of the PostgreSQL table to create or append to
  5. Process Markdown: Convert markdown files into structured chunks with metadata

    • Required parameters:
      • file_path: Path to the markdown file to process
      • output_path: Path to save the output parquet file
    • Features:
      • Preserves document structure and links
      • Extracts section headers and metadata
      • Memory-optimized for large files
      • Configurable chunk size and overlap
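To make the chunk-size and overlap parameters of the markdown tool concrete, here is a minimal header-aware chunking sketch. This is an illustration of the general technique, not the server's actual splitting logic, and the function name is hypothetical:

```python
# Minimal sketch of header-aware markdown chunking with configurable
# chunk size and overlap; the server's real implementation may differ.
def chunk_markdown(text, chunk_size=200, overlap=50):
    """Split markdown into chunks, tagging each with its nearest header."""
    chunks, header, buf = [], "", ""
    for line in text.splitlines(keepends=True):
        if line.startswith("#"):
            header = line.strip()  # remember the most recent section header
        buf += line
        if len(buf) >= chunk_size:
            chunks.append({"text": buf, "metadata": {"header": header}})
            buf = buf[-overlap:]  # carry the overlap into the next chunk
    if buf.strip():
        chunks.append({"text": buf, "metadata": {"header": header}})
    return chunks

doc = "# Intro\n" + "word " * 60 + "\n## Details\n" + "more " * 60
for chunk in chunk_markdown(doc):
    print(chunk["metadata"]["header"], len(chunk["text"]))
```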

Example Prompts

Here are some example prompts you can use with the agent:

For Embedding:

"Please embed the column 'text' in the parquet file '/path/to/input.parquet' and save the output to '/path/to/output.parquet'. Use 'embeddings' as the final column name and a batch size of 2"

For Parquet Information:

"Please give me some information about the parquet file '/path/to/input.parquet'"

For DuckDB Conversion:

"Please convert the parquet file '/path/to/input.parquet' to DuckDB format and save it in '/path/to/output/directory'"

For PostgreSQL Conversion:

"Please convert the parquet file '/path/to/input.parquet' to a PostgreSQL table named 'my_table'"

For Markdown Processing:

"Please process the markdown file '/path/to/input.md' and save the chunks to '/path/to/output.parquet'"

Testing the MCP Server

The project includes a comprehensive test suite in the src/tests directory. You can run all tests using:

python src/tests/run_tests.py

Or run individual tests:

# Test embedding functionality
python src/tests/test_embedding.py

# Test parquet information tool
python src/tests/test_parquet_info.py

# Test DuckDB conversion
python src/tests/test_duckdb_conversion.py

# Test PostgreSQL conversion
python src/tests/test_postgres_conversion.py

# Test Markdown processing
python src/tests/test_markdown_processing.py

You can also test the server using the client directly:

from parquet_mcp_server.client import (
    convert_to_duckdb, 
    embed_parquet, 
    get_parquet_info, 
    convert_to_postgres,
    process_markdown_file  # New markdown processing function
)

# Test DuckDB conversion
result = convert_to_duckdb(
    parquet_path="input.parquet",
    output_dir="db_output"
)

# Test embedding
result = embed_parquet(
    input_path="input.parquet",
    output_path="output.parquet",
    column_name="text",
    embedding_column="embeddings",
    batch_size=2
)

# Test parquet information
result = get_parquet_info("input.parquet")

# Test PostgreSQL conversion
result = convert_to_postgres(
    parquet_path="input.parquet",
    table_name="my_table"
)

# Test markdown processing
result = process_markdown_file(
    file_path="input.md",
    output_path="output.parquet"
)

Troubleshooting

  1. If you get SSL verification errors, make sure the SSL settings in your .env file are correct
  2. If embeddings are not generated, check:
    • The Ollama server is running and accessible
    • The model specified is available on your Ollama server
    • The text column exists in your input Parquet file
  3. If DuckDB conversion fails, check:
    • The input Parquet file exists and is readable
    • You have write permissions in the output directory
    • The Parquet file is not corrupted
  4. If PostgreSQL conversion fails, check:
    • The PostgreSQL connection settings in your .env file are correct
    • The PostgreSQL server is running and accessible
    • You have the necessary permissions to create/modify tables
    • The pgvector extension is installed in your database

API Response Format

The embeddings are returned in the following format:

{
    "object": "list",
    "data": [{
        "object": "embedding",
        "embedding": [0.123, 0.456, ...],
        "index": 0
    }],
    "model": "llama2",
    "usage": {
        "prompt_tokens": 4,
        "total_tokens": 4
    }
}

Each embedding vector is stored in the Parquet file as a NumPy array in the specified embedding column.

The DuckDB conversion tool returns a success message with the path to the created database file or an error message if the conversion fails.

The PostgreSQL conversion tool returns a success message indicating whether a new table was created or data was appended to an existing table.
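Once the table exists, pgvector's `<=>` cosine-distance operator can rank rows by similarity to a query vector. The helper below only composes the SQL; the table and column names are assumptions matching the examples above, and actual execution (commented out) would need a running PostgreSQL with pgvector installed:

```python
# Hypothetical query builder for a pgvector similarity search against the
# table created by the Convert to PostgreSQL tool.
def knn_query(table, embedding_column="embeddings", k=5):
    """Return SQL that orders rows by cosine distance to a query vector."""
    return (
        f"SELECT *, {embedding_column} <=> %s::vector AS distance "
        f"FROM {table} ORDER BY distance LIMIT {k}"
    )

sql = knn_query("my_table", k=3)
print(sql)

# Execution sketch (requires a live server; parameters bind the query vector):
# import psycopg2
# with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
#     cur.execute(sql, ("[0.1,0.2,0.3]",))
#     rows = cur.fetchall()
```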

The markdown chunking tool processes markdown files into chunks and saves them as a Parquet file with the following columns:

  • text: The text content of each chunk
  • metadata: Additional metadata about the chunk (e.g., headers, section info)

The tool returns a success message with the path to the created Parquet file or an error message if the processing fails.
