CHM to Markdown Converter

A Python utility that converts Compiled HTML Help (CHM) files to Markdown, optimized for Revit API documentation. It extracts the HTML files from a CHM document and converts them to well-formatted Markdown files, making technical documentation more accessible, version-control friendly, and AI-readable.

Features

  • Processes multiple Revit API documentation versions (2022-2026)
  • Creates an organized folder structure for easy reference
  • Generates core index files for AI integration and search functionality
  • Extracts CHM files using 7-Zip
  • Converts HTML content to clean Markdown format
  • Handles code snippets with language-specific syntax highlighting
  • Preserves and fixes tables
  • Updates internal links to maintain document references
  • Processes files asynchronously for better performance
  • Batch processes multiple CHM files with progress reporting
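As an illustration of the link-updating feature listed above, here is a minimal sketch that rewrites relative .htm/.html targets in already-converted Markdown so they point at .md files. The function name update_internal_links and the regular expression are assumptions made for this example; the script's actual implementation may differ.

```python
import re

# Matches the target of a Markdown link ending in .htm or .html, with an
# optional #fragment. Assumption: internal links are relative file names,
# so anything starting with http:// or https:// is left untouched.
_LINK_RE = re.compile(
    r"\]\((?!https?://)([^)#\s]+)\.html?(#[^)\s]*)?\)",
    re.IGNORECASE,
)


def update_internal_links(markdown: str) -> str:
    """Point relative .htm/.html link targets at the matching .md files."""
    return _LINK_RE.sub(
        lambda m: f"]({m.group(1)}.md{m.group(2) or ''})",
        markdown,
    )


if __name__ == "__main__":
    print(update_internal_links("See [Wall](abc123.htm) for details."))
```

External links are skipped on purpose so that references to websites keep working after conversion.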

Output Structure

The converter creates an organized output structure:

output/
├── 2022/
│   ├── core/           # Contains index files for AI and search
│   │   ├── file_index.json
│   │   ├── id_lookup.json
│   │   └── index.md
│   └── data/           # Contains all markdown documentation files
│       ├── file1.md
│       ├── file2.md
│       └── ...
├── 2023/
│   ├── core/
│   └── data/
└── ...
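The layout above could be created with a few lines of pathlib. This is only a sketch: prepare_version_dirs is a hypothetical helper name, and the real script may build its folders differently.

```python
from pathlib import Path
from typing import Tuple


def prepare_version_dirs(output_root: Path, version: str) -> Tuple[Path, Path]:
    """Create <output_root>/<version>/core and .../data and return both paths."""
    core_dir = output_root / version / "core"   # index files for AI and search
    data_dir = output_root / version / "data"   # converted Markdown files
    core_dir.mkdir(parents=True, exist_ok=True)
    data_dir.mkdir(parents=True, exist_ok=True)
    return core_dir, data_dir


if __name__ == "__main__":
    core, data = prepare_version_dirs(Path("output"), "2024")
    print(core, data)
```

Using exist_ok=True means the helper can be called again for a version that was already processed without raising an error.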

Requirements

  • Python 3.7+
  • 7-Zip installed in the default location (C:\Program Files\7-Zip\7z.exe)
  • The following Python packages:
    • beautifulsoup4
    • html2text
    • aiofiles

Installation

  1. Clone or download this repository
  2. Install required Python packages:
pip install -r requirements.txt

Or install them directly:

pip install beautifulsoup4 html2text aiofiles

Usage

  1. Place your Revit API CHM files in the resources folder
  2. Run the script:
python chm_to_markdown.py
  3. Choose from the available options:
    • Process a specific CHM file by entering its number
    • Process all CHM files by entering 'a' or 'all'
    • Use command-line arguments for automation

Command-line Arguments

# Process a single CHM file
python chm_to_markdown.py --single resources/2024.chm

# Process all CHM files in the resources folder
python chm_to_markdown.py --all

# Keep HTML files after conversion (for debugging)
python chm_to_markdown.py --all --keep-html

# Adjust worker threads and batch size for performance
python chm_to_markdown.py --all --workers 4 --batch-size 25
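A command-line interface with these flags could be declared with argparse roughly as follows. This is a sketch, not the script's own parser: the default values are placeholders, and making --single and --all mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags shown in the examples above."""
    parser = argparse.ArgumentParser(prog="chm_to_markdown.py")

    # Assumption: a run either targets one file or all files, never both.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--single", metavar="CHM", help="process one CHM file")
    mode.add_argument("--all", action="store_true",
                      help="process every CHM file in the resources folder")

    parser.add_argument("--keep-html", action="store_true",
                        help="keep extracted HTML files for debugging")
    # Placeholder defaults; the real script's defaults are not documented here.
    parser.add_argument("-w", "--workers", type=int, default=4,
                        help="number of worker threads")
    parser.add_argument("-b", "--batch-size", type=int, default=25,
                        help="number of files per batch")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```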

Performance Tuning

You can adjust the following parameters to optimize performance for your system:

  • --workers or -w: Number of worker threads for CPU-bound operations
  • --batch-size or -b: Number of files to process in each batch
  • --semaphore: Maximum concurrent file I/O operations

Example:

python chm_to_markdown.py --all --workers 4 --batch-size 25 --semaphore 10
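To show how the three parameters can interact, here is a simplified sketch that processes items in batches, hands the work to a thread pool, and uses a semaphore to cap how many conversions are in flight at once. The names convert_all, batched and max_in_flight are assumptions for illustration; in the real script the semaphore is described as limiting file I/O specifically.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def convert_all(paths: List[str], convert: Callable[[str], str],
                      workers: int = 4, batch_size: int = 25,
                      max_in_flight: int = 10) -> List[str]:
    """Run `convert` over all paths, batch by batch, preserving input order."""
    semaphore = asyncio.Semaphore(max_in_flight)
    loop = asyncio.get_running_loop()
    results: List[str] = []

    async def convert_one(path: str, pool: ThreadPoolExecutor) -> str:
        async with semaphore:  # at most max_in_flight conversions at once
            return await loop.run_in_executor(pool, convert, path)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch in batched(paths, batch_size):
            # gather() returns results in the same order as its arguments
            results.extend(await asyncio.gather(
                *(convert_one(p, pool) for p in batch)))
    return results


if __name__ == "__main__":
    print(asyncio.run(convert_all(["a.htm", "b.htm"], str.upper)))
```

Smaller batches and fewer workers keep fewer files in memory at a time, at the cost of throughput.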

AI Integration

This tool is designed to facilitate AI integration with Revit API documentation:

  • The core/file_index.json file maps file IDs to titles and versions
  • The core/id_lookup.json file provides a lookup dictionary with extracted keywords
  • The core/index.md file provides a user-friendly navigation structure
  • All markdown files include version information in headings
  • Internal links are updated to maintain proper references between files
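A downstream tool might consume the index like this. The JSON schema is not documented here, so the sketch assumes file_index.json is an object mapping each file ID to an entry with "title" and "version" keys; check a generated file before relying on that shape. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_file_index(core_dir: Path) -> Dict[str, dict]:
    """Load core/file_index.json (assumed shape: {file_id: {"title", "version"}})."""
    with open(core_dir / "file_index.json", encoding="utf-8") as fh:
        return json.load(fh)


def find_by_title(index: Dict[str, dict], query: str) -> List[str]:
    """Return the sorted file IDs whose title contains `query`, ignoring case."""
    needle = query.lower()
    return sorted(
        file_id for file_id, entry in index.items()
        if needle in entry.get("title", "").lower()
    )


if __name__ == "__main__":
    # Made-up sample data standing in for a real index file.
    sample = {
        "id_001": {"title": "Wall Class", "version": "2024"},
        "id_002": {"title": "Floor Class", "version": "2024"},
    }
    print(find_by_title(sample, "wall"))
```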

Customization

The script provides several customization options for content conversion:

Removing Unwanted Elements

You can customize which HTML elements to remove by editing these lists:

tags_to_remove = ["iframe", "object", "script", "br", "img"]
classes_to_remove = ["collapsibleAreaRegion", "collapsibleRegionTitle", ...]
ids_to_remove = ["PageFooter", "PageHeader", ...]
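The effect of these lists can be sketched with BeautifulSoup as follows. The function strip_unwanted is a hypothetical name, and the lists passed in the usage example contain only the entries visible above; the elided entries are deliberately not guessed.

```python
from typing import List

from bs4 import BeautifulSoup


def _remove_all(soup: BeautifulSoup, **criteria) -> None:
    """Delete every element matching the criteria, one match at a time."""
    node = soup.find(**criteria)
    while node is not None:
        node.decompose()  # removes the element and everything inside it
        node = soup.find(**criteria)


def strip_unwanted(html: str, tags: List[str], classes: List[str],
                   ids: List[str]) -> str:
    """Return html with the listed tags, CSS classes and element ids removed."""
    soup = BeautifulSoup(html, "html.parser")
    if tags:  # an empty name filter would match every tag, so skip it
        _remove_all(soup, name=tags)
    for css_class in classes:
        _remove_all(soup, class_=css_class)
    for element_id in ids:
        _remove_all(soup, id=element_id)
    return str(soup)


if __name__ == "__main__":
    page = '<div id="PageHeader">nav</div><p>Keep me</p><script>x()</script>'
    print(strip_unwanted(page, ["script"], [], ["PageHeader"]))
```

Searching again after each removal avoids touching elements that were already destroyed along with a removed parent.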

Code Snippets

The script handles code snippets with language-specific formatting. You can customize the language mapping:

id_to_lang = {
    "IDAB_code_Div1": "csharp",
    "IDAB_code_Div2": "vb",
    "IDAB_code_Div3": "cpp",
    "IDAB_code_Div4": "fsharp",
}
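One way such a mapping can be applied is to wrap each extracted snippet in a fenced block tagged with the mapped language. The sketch below is an assumption about how the mapping is used; fence_code is a hypothetical helper, and unknown container ids fall back to an untagged fence.

```python
from typing import Dict

id_to_lang = {
    "IDAB_code_Div1": "csharp",
    "IDAB_code_Div2": "vb",
    "IDAB_code_Div3": "cpp",
    "IDAB_code_Div4": "fsharp",
}

FENCE = "`" * 3  # three backticks, built this way to keep this example readable


def fence_code(div_id: str, code: str,
               mapping: Dict[str, str] = id_to_lang) -> str:
    """Wrap code in a Markdown fence tagged with the language for div_id."""
    lang = mapping.get(div_id, "")  # unknown ids get a fence with no language
    return f"{FENCE}{lang}\n{code.strip()}\n{FENCE}"


if __name__ == "__main__":
    print(fence_code("IDAB_code_Div1", "public class Wall { }"))
```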

Troubleshooting

  • Missing modules error: Make sure you've installed all required packages and your Python environment is correctly configured.
  • 7-Zip not found: Check that 7-Zip is installed in the default location or update the path in the script.
  • Permission errors: Run your terminal or command prompt with administrator privileges.
  • Memory issues with large CHM files: Try reducing the batch size (--batch-size) and the number of workers (--workers) so that fewer files are held in memory at once.
  • Encoding issues: The tool uses error-tolerant UTF-8 encoding, but some characters may still display incorrectly. Adjust encoding settings if needed.
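If encoding problems persist, one possible adjustment is to try strict UTF-8 first and fall back to a legacy code page. The sketch below is not the tool's current behavior: decode_html is a hypothetical helper, and cp1252 is only an assumed fallback that suits many Western-language CHM files.

```python
def decode_html(raw: bytes, fallback: str = "cp1252") -> str:
    """Decode raw HTML bytes as UTF-8, falling back to a legacy code page."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # errors="replace" substitutes U+FFFD for bytes the fallback cannot map
        return raw.decode(fallback, errors="replace")


if __name__ == "__main__":
    print(decode_html("café".encode("cp1252")))
```

Trying strict UTF-8 first avoids silently replacing characters in files that only needed a different code page.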

License

This project is open source and available under the MIT License.

Author

Duong Tran Quang - DTDucas (baymax.contact@gmail.com)
