NameetP/pdfmux

v1.1.0 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched in this version

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

JSON output now includes structured data: tables, key-value pairs, and normalized dates, amounts, and rates. The release also fixes invalid JSON on --stdout and sanitizes control characters embedded in PDFs.

Full changelog

What's New in v1.1.0

Structured Extraction

  • JSON output with --format json — tables extracted as structured data (headers + rows), key-value pairs auto-detected from Label: Value patterns common in bank statements, invoices, and forms
  • Date normalization → ISO 8601 output. Handles "28 Feb 2026", "February 28, 2026", DD/MM/YYYY, and other common formats
  • Amount normalization → parsed floats with currency detection (AED, USD, EUR, etc.) and debit/credit direction
  • Rate normalization → percentage value + period (monthly/annual)
  • Schema-guided extraction with --schema — fuzzy-match extracted data to your JSON schema, zero LLM cost
  • New MCP tool: extract_structured for AI agent integration
  • convert_pdf MCP tool now supports JSON format
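The key-value detection and normalization steps listed above can be illustrated with a small standard-library sketch. This is not pdfmux's implementation: the function names, the regular expressions, the date format list, and the shape of the returned dictionary are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Illustrative format list only; the formats pdfmux accepts may differ.
DATE_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")

# Hypothetical patterns for "Label: Value" lines and currency amounts.
KV_RE = re.compile(r"^\s*(?P<label>[^:\n]{1,60}?)\s*:\s*(?P<value>\S.*?)\s*$")
AMOUNT_RE = re.compile(
    r"(?P<currency>[A-Z]{3})?\s*(?P<sign>-)?"
    r"(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<dir>CR|DR)?"
)


def extract_key_value(line):
    """Split a 'Label: Value' line into (label, value), or return None."""
    m = KV_RE.match(line)
    return (m.group("label"), m.group("value")) if m else None


def normalize_date(text):
    """Return an ISO 8601 date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(text):
    """Parse an amount into a float plus optional currency and direction."""
    m = AMOUNT_RE.search(text)
    if m is None:
        return None
    value = float(m.group("number").replace(",", ""))
    # Assumption: a 'DR' suffix or a leading minus sign means debit.
    if m.group("dir") == "DR" or m.group("sign"):
        direction = "debit"
    elif m.group("dir") == "CR":
        direction = "credit"
    else:
        direction = None
    return {"value": value, "currency": m.group("currency"), "direction": direction}
```

Trying explicit formats in order keeps the date handling predictable: a DD/MM/YYYY string is never silently read as MM/DD/YYYY, and unrecognized input returns None rather than a guess. Month-name parsing with strptime depends on the active locale, so the sketch assumes an English locale.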

Bug Fixes

  • Fixed --stdout JSON output — Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
  • Control character sanitization — PDFs with embedded control characters no longer produce invalid output
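Both fixes address the same class of problem: machine-readable output that a downstream parser rejects. The release notes do not say how pdfmux resolved them, so the following is only a sketch of one common approach, with hypothetical helper names: strip non-printable control characters from extracted text, and write JSON straight to the output stream so that no console layer can insert line breaks.

```python
import json
import re
import sys

# C0 control characters except tab, newline, and carriage return, plus DEL.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text):
    """Remove control characters that commonly leak out of PDF text layers."""
    return _CONTROL_CHARS.sub("", text)


def emit_json(payload, stream=None):
    """Write payload as a single JSON line, bypassing any wrapping console."""
    stream = sys.stdout if stream is None else stream
    stream.write(json.dumps(payload, ensure_ascii=False) + "\n")
```

Writing one line per document means a consumer can pass the output directly to a JSON parser, whatever the terminal width.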

Developer

  • New modules: kv_extract, normalize, schema
  • New types: ExtractedTable, KeyValuePair
  • JSON schema version bumped to 1.1.0 — includes tables, key_values, and structured fields
  • 225 tests passing, zero new dependencies

Usage

pip install --upgrade pdfmux

# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json

# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json
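To give a feel for what fuzzy-matching extracted data to a schema without an LLM can look like, here is a minimal sketch built on the standard library's difflib. It is an assumption-laden illustration, not pdfmux's matching algorithm: match_to_schema, the normalization rule, and the 0.6 cutoff are hypothetical, and the schema is assumed to list its fields under a top-level "properties" key.

```python
import re
from difflib import get_close_matches


def _norm(name):
    """Lowercase and collapse punctuation so 'Account No.' ~ 'account_no'."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()


def match_to_schema(schema, key_values, cutoff=0.6):
    """Map extracted labels onto schema property names by string similarity."""
    labels = {_norm(label): label for label in key_values}
    result = {}
    for field in schema.get("properties", {}):
        # Take the single closest label, if any clears the similarity cutoff.
        hit = get_close_matches(_norm(field), list(labels), n=1, cutoff=cutoff)
        if hit:
            result[field] = key_values[labels[hit[0]]]
    return result
```

Labels with no close schema field are left out of the result, and schema fields with no matching label stay absent instead of being filled with a guess.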

Full Changelog: https://github.com/NameetP/pdfmux/compare/v1.0.1...v1.1.0

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. It classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).
