NameetP/pdfmux

v1.1.0 Feature

This release adds three notable features: structured JSON output, value normalization (dates, amounts, rates), and schema-guided extraction.

Published 4mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr

opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

JSON output now includes structured data: tables, key-value pairs, and normalized dates, amounts, and rates. The release also fixes invalid JSON on --stdout and output broken by embedded control characters.

Full changelog

What's New in v1.1.0

Structured Extraction

JSON output with --format json — tables extracted as structured data (headers + rows), key-value pairs auto-detected from Label: Value patterns common in bank statements, invoices, and forms
Date normalization → ISO 8601 output. Handles "28 Feb 2026", "February 28, 2026", DD/MM/YYYY, and other common formats
Amount normalization → parsed floats with currency detection (AED, USD, EUR, etc.) and debit/credit direction
Rate normalization → percentage value + period (monthly/annual)
Schema-guided extraction with --schema — fuzzy-match extracted data to your JSON schema, zero LLM cost
New MCP tool: extract_structured for AI agent integration
convert_pdf MCP tool now supports JSON format
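
To make the normalization features above concrete, here is a minimal Python sketch of Label: Value detection, date normalization to ISO 8601, and amount parsing. It illustrates the general technique only. The function names, regular expressions, and the debit/credit rule are assumptions made for this example and are not pdfmux's actual code.

```python
import re
from datetime import datetime

# Hypothetical helpers for illustration; not pdfmux's own API.
DATE_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")
KV_PATTERN = re.compile(r"^\s*([^:\n]{1,60}?)\s*:\s*(\S.*?)\s*$")
AMOUNT_PATTERN = re.compile(r"^(?:([A-Z]{3})\s*)?(-?\d[\d,]*(?:\.\d+)?)\s*(CR|DR)?$")


def extract_key_values(text):
    """Collect 'Label: Value' pairs, one per line."""
    pairs = {}
    for line in text.splitlines():
        match = KV_PATTERN.match(line)
        if match:
            pairs[match.group(1).strip()] = match.group(2)
    return pairs


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(text):
    """Parse e.g. 'AED 1,250.50 DR' into value, currency, and direction."""
    match = AMOUNT_PATTERN.match(text.strip())
    if not match:
        return None
    currency, number, marker = match.groups()
    value = float(number.replace(",", ""))
    # Assumed rule: a DR marker or a negative sign means debit, else credit.
    direction = "debit" if marker == "DR" or value < 0 else "credit"
    return {"value": abs(value), "currency": currency, "direction": direction}
```

Trying an ordered list of formats keeps the date parser predictable: DD/MM/YYYY is listed explicitly, so an ambiguous string such as 03/04/2026 is always read day-first in this sketch.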

Bug Fixes

Fixed --stdout JSON output — Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
Control character sanitization — PDFs with embedded control characters no longer produce invalid output
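
Both fixes address the same class of problem: stray characters ending up in machine-readable output. A wrapping console inserts raw newlines into long string values, which is invalid inside a JSON string. The sketch below shows one common way to avoid both issues using only the standard library. It is not pdfmux's actual fix, and the helper names are hypothetical.

```python
import json
import re
import sys

# Control characters other than tab, newline, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitize_text(text):
    """Drop control characters that PDFs sometimes embed in extracted text."""
    return CONTROL_CHARS.sub("", text)


def emit_json(payload, stream=None):
    """Write JSON straight to the stream, bypassing any wrapping console."""
    stream = stream if stream is not None else sys.stdout
    stream.write(json.dumps(payload, ensure_ascii=False))
    stream.write("\n")
```

Writing directly to the stream means the only newline in the output is the trailing one, so a downstream consumer can parse each line as one JSON document.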

Developer

New modules: kv_extract, normalize, schema
New types: ExtractedTable, KeyValuePair
JSON schema version bumped to 1.1.0 — includes tables, key_values, and structured fields
225 tests passing, zero new dependencies
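
For downstream consumers, the tables in the JSON output can be turned into row dictionaries with a few lines of Python. This sketch assumes the top-level key is tables (as named in the schema note above) and that each table carries headers and rows fields. Those two field names are inferred from the "headers + rows" description and have not been checked against the actual schema, and the sample document is made up.

```python
import json


def tables_as_records(document):
    """Convert each table into a list of {header: cell} dictionaries."""
    records = []
    for table in document.get("tables", []):
        headers = table.get("headers", [])  # assumed field name
        rows = table.get("rows", [])        # assumed field name
        records.append([dict(zip(headers, row)) for row in rows])
    return records


# Made-up sample shaped like the assumed output.
sample = json.loads(
    '{"tables": [{"headers": ["Date", "Description", "Amount"],'
    ' "rows": [["2026-02-28", "Card payment", "120.00"]]}],'
    ' "key_values": []}'
)
```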

Usage

pip install --upgrade pdfmux

# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json

# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json
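
The idea behind schema-guided extraction without an LLM can be sketched with Python's standard difflib module: normalize the labels found in the document and the property names in the JSON schema, then pick the closest match above a cutoff. This is an illustration of fuzzy matching in general. How pdfmux scores matches is not described in these notes, and match_to_schema, _norm, and the 0.6 cutoff are assumptions.

```python
import re
from difflib import get_close_matches


def _norm(label):
    """Lowercase and collapse punctuation so 'Account No.' ~ 'account_number'."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def match_to_schema(extracted, schema, cutoff=0.6):
    """Fill each schema property with the closest extracted label, or None."""
    by_label = {_norm(key): key for key in extracted}
    result = {}
    for field in schema.get("properties", {}):
        hits = get_close_matches(_norm(field), list(by_label), n=1, cutoff=cutoff)
        result[field] = extracted[by_label[hits[0]]] if hits else None
    return result


extracted = {"Account No.": "12345", "Statement Date": "2026-02-28"}
schema = {"properties": {"account_number": {}, "statement_date": {}, "iban": {}}}
matched = match_to_schema(extracted, schema)
```

Here account_number picks up the value labeled "Account No.", while iban stays None because no extracted label is close enough.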

Full Changelog: https://github.com/NameetP/pdfmux/compare/v1.0.1...v1.1.0

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. It classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).