Portuguese Legal Document PDF Metadata Extractor

Created By

geek2geeksa year ago

MCP server for extracting metadata from Portuguese legal documents using advanced PDF processing and database architecture

# mcp-portuguese-legal-extractor

# pdf-metadata

Overview Content Tools Comments

Content

Portuguese Legal Document PDF Metadata Extractor

A robust Python tool for extracting structured metadata from Portuguese legal document PDFs, specifically designed for European Case Law Identifier (ECLI) formatted documents.

🚀 Features

High Accuracy: 100% confidence score with 96.84% exact match rate
Production Ready: Two extractor variants optimized for different use cases
Robust Error Handling: Comprehensive validation and error recovery
Flexible Confidence Scoring: Works with or without ground truth data
User-Friendly Interface: Clear progress reporting and detailed feedback
Field Classification: Distinguishes between missing and legitimately empty fields

📁 Project Structure

├── production_extractor.py    # Production-ready extractor with user-friendly interface
├── robust_extractor.py        # Core robust extraction engine
├── run_test1.py              # Test runner for batch processing
├── ground_truth/             # Ground truth data for validation
│   └── ground_truth.json
├── pdfs/                     # Input PDF documents
│   ├── test1/               # Test subset
│   └── *.pdf                # Legal documents
├── IMPROVEMENTS_SUMMARY.md   # Performance improvements documentation
└── README.md                # This file

🔧 Installation

Prerequisites

Python 3.8+
Required packages:

pip install pdfplumber

Setup

Clone or download the project files
Install dependencies:
```
pip install pdfplumber
```
Ensure your PDF files are in the pdfs/ directory

📖 Usage

Basic Usage with Production Extractor

The PortugueseLegalPDFExtractor class provides a user-friendly, production-ready interface:

from production_extractor import PortugueseLegalPDFExtractor

# Initialize extractor with optional ground truth validation
extractor = PortugueseLegalPDFExtractor(
    ground_truth_path="ground_truth/ground_truth.json",  # Optional
    verbose=True  # Enable progress reporting
)

# Extract metadata from a single PDF
result = extractor.extract_metadata("pdfs/document.pdf")

# Process multiple PDFs in a directory
summary = extractor.extract_batch(
    pdf_directory="pdfs/",
    output_directory="results/"  # Optional - saves individual results
)

Command Line Interface

The production extractor includes a full CLI:

# Process single file
python production_extractor.py "pdfs/document.pdf" -o "results/"

# Process entire directory with ground truth validation
python production_extractor.py "pdfs/" -g "ground_truth/ground_truth.json" -o "results/"

# Quiet mode (suppress progress messages)
python production_extractor.py "pdfs/" -q

# Force single file processing
python production_extractor.py "pdfs/document.pdf" --single

📊 Performance Metrics

Current Performance (After Improvements)

Overall Confidence: 100.0%
Exact Match Rate: 96.84% (153/158 populated fields)
Acceptable Match Rate: 100.0%
Processing Speed: ~2-3 seconds per document

Field Classification

Data Fields: 14 fields with actual content
Empty Fields: 8 fields correctly identified as empty
Match Types: Exact, case differences, punctuation differences, partial matches

🎯 How It Works

Pattern Recognition

The extractor leverages discovered patterns in Portuguese legal documents:

Fixed Relative Positions: Fields appear in predictable locations
Synchronized Pairs: Left-right column field synchronization
Predictable Order: Consistent field sequence across documents
Table Structure: Metadata organized in structured tables

Confidence Calculation

Two confidence calculation modes:

With Ground Truth: Accuracy-based scoring against known correct values
Without Ground Truth: Heuristic-based scoring using extraction quality indicators

🛠️ Development

Key Components

PortugueseLegalPDFExtractor (production_extractor.py)

Main Features:

Dual Confidence Modes: Accuracy-based (with ground truth) and population-based (without)
Multiple Extraction Methods: Table-based (primary) and coordinate-based (fallback)
Comprehensive Validation: Field validation, ECLI extraction, date formatting
Batch Processing: Process entire directories with detailed summaries
Command Line Interface: Full CLI with flexible options
Progress Reporting: User-friendly status messages and error handling

📈 Recent Improvements

Version 2.0 Enhancements

Fixed Confidence Calculation: Improved from 44.4% to 100.0% by properly handling empty fields
Enhanced Field Classification: Clear distinction between missing and legitimately empty fields
Per-Field Confidence: Individual confidence scores for each extracted field

See IMPROVEMENTS_SUMMARY.md for detailed information.