This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: JSON output now produces structured data, including tables, key-value pairs, and normalized dates, amounts, and rates. This release also fixes JSON validity and control-character issues.
Full changelog
What's New in v1.1.0
Structured Extraction
- JSON output with `--format json`: tables extracted as structured data (headers + rows), key-value pairs auto-detected from `Label: Value` patterns common in bank statements, invoices, and forms
- Date normalization → ISO 8601 output. Handles "28 Feb 2026", "February 28, 2026", DD/MM/YYYY, and other common formats
- Amount normalization → parsed floats with currency detection (AED, USD, EUR, etc.) and debit/credit direction
- Rate normalization → percentage value + period (monthly/annual)
- Schema-guided extraction with `--schema`: fuzzy-match extracted data to your JSON schema, zero LLM cost
- New MCP tool: `extract_structured` for AI agent integration
- `convert_pdf` MCP tool now supports JSON format
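To make the normalization items above concrete, here is a minimal standard-library sketch of what date, amount, and rate normalization could look like. It is not pdfmux's implementation: the function names, the regular expressions, the returned dictionary keys, and the `DR`/`CR` suffix convention are all assumptions chosen for illustration.

```python
import re
from datetime import datetime
from typing import Dict, Optional

# Assumed input formats, taken from the examples in the release notes.
_DATE_FORMATS = ("%d %b %Y", "%B %d, %Y", "%d/%m/%Y")

# Hypothetical patterns; a real extractor would need to cover more variants.
_AMOUNT = re.compile(
    r"(?P<currency>[A-Z]{3})?\s*(?P<sign>-)?"
    r"(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<suffix>CR|DR)?"
)
_RATE = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*(monthly|annual)?", re.IGNORECASE)


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no assumed format matches."""
    for fmt in _DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_amount(raw: str) -> Optional[Dict[str, object]]:
    """Parse an amount into a float, a currency code, and a direction."""
    match = _AMOUNT.fullmatch(raw.strip())
    if match is None:
        return None
    if match["sign"] or match["suffix"] == "DR":
        direction = "debit"
    elif match["suffix"] == "CR":
        direction = "credit"
    else:
        direction = None  # direction not marked in the text
    return {
        "value": float(match["number"].replace(",", "")),
        "currency": match["currency"],
        "direction": direction,
    }


def normalize_rate(raw: str) -> Optional[Dict[str, object]]:
    """Parse a rate such as '2.5% monthly' into a percentage and a period."""
    match = _RATE.fullmatch(raw.strip())
    if match is None:
        return None
    period = match.group(2).lower() if match.group(2) else None
    return {"percent": float(match.group(1)), "period": period}


print(normalize_date("February 28, 2026"))  # 2026-02-28
print(normalize_amount("AED 1,250.50 DR"))
print(normalize_rate("2.5% monthly"))
```

Month-name parsing with `strptime` depends on the process locale, so this sketch assumes an English locale.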
Bug Fixes
- Fixed `--stdout` JSON output: Rich console was word-wrapping long lines, breaking JSON validity for downstream consumers
- Control character sanitization: PDFs with embedded control characters no longer produce invalid output
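Both fixes above concern output validity, and the failure modes are easy to reproduce in isolation. The sketch below is an illustration only: `textwrap.fill` stands in for a console that wraps long lines (it is not Rich, and this is not pdfmux's code), and `strip_control_chars` is a hypothetical helper showing one way to drop control characters while keeping tabs and line breaks.

```python
import json
import re
import textwrap

# C0 control characters and DEL, excluding tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Remove control characters that commonly leak out of PDF text layers."""
    return _CONTROL_CHARS.sub("", text)


def is_valid_json(payload: str) -> bool:
    """Report whether the payload parses as JSON."""
    try:
        json.loads(payload)
    except json.JSONDecodeError:
        return False
    return True


raw_text = "Opening balance\x00 carried forward\x0c from the previous statement period"
line = json.dumps({"text": strip_control_chars(raw_text)})

# Wrapping the serialized line inserts a raw newline inside the string value,
# which JSON does not allow.
wrapped = textwrap.fill(line, width=40)

print(is_valid_json(line))     # True
print(is_valid_json(wrapped))  # False
```

Writing the serialized document straight to `sys.stdout` instead of through a wrapping console is one way to avoid the second problem.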
Developer
- New modules: `kv_extract`, `normalize`, `schema`
- New types: `ExtractedTable`, `KeyValuePair`
- JSON schema version bumped to `1.1.0`: includes `tables`, `key_values`, and `structured` fields
- 225 tests passing, zero new dependencies
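For downstream consumers, the new `tables` and `key_values` fields might be read roughly as follows. The release notes confirm only the top-level field names and that tables carry headers and rows; the exact nesting, the `schema_version`, `headers`, `rows`, `key`, and `value` key names, and the sample data are assumptions made for this sketch, so check real output before relying on it.

```python
import json
from typing import Dict, List

# Hypothetical payload: only "tables" and "key_values" are named in the
# release notes; every other key and the nesting are assumed.
SAMPLE = json.loads("""
{
  "schema_version": "1.1.0",
  "tables": [
    {"headers": ["Date", "Amount"],
     "rows": [["2026-02-28", "120.00"], ["2026-03-01", "75.50"]]}
  ],
  "key_values": [{"key": "Account Holder", "value": "A. Example"}]
}
""")


def table_to_records(table: Dict[str, list]) -> List[Dict[str, str]]:
    """Pair each row with the table headers to get one dict per row."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]


def key_values_to_dict(pairs: List[Dict[str, str]]) -> Dict[str, str]:
    """Collapse a list of key-value pairs into a single lookup dict."""
    return {pair["key"]: pair["value"] for pair in pairs}


records = table_to_records(SAMPLE["tables"][0])
fields = key_values_to_dict(SAMPLE["key_values"])
print(records[0]["Amount"])      # 120.00
print(fields["Account Holder"])  # A. Example
```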
Usage
pip install --upgrade pdfmux
# Structured JSON with tables and key-values
pdfmux convert statement.pdf -f json
# Schema-guided extraction
pdfmux convert statement.pdf --schema bank-statement.schema.json
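Schema-guided extraction is described as fuzzy-matching extracted data to a JSON schema without an LLM. One way such matching could work, sketched here with the standard library's `difflib`, is to compare extracted labels against the schema's property names by string similarity. The function name, the similarity cutoff, and the sample schema are hypothetical, and pdfmux may use a different matching strategy.

```python
import difflib
from typing import Any, Dict


def match_to_schema(
    extracted: Dict[str, Any], schema: Dict[str, Any], cutoff: float = 0.6
) -> Dict[str, Any]:
    """Map extracted labels onto schema property names by string similarity."""
    # Compare on a normalized form: lowercase, underscores as spaces.
    normalized = {
        name.lower().replace("_", " "): name
        for name in schema.get("properties", {})
    }
    result = {}
    for label, value in extracted.items():
        candidates = difflib.get_close_matches(
            label.lower(), list(normalized), n=1, cutoff=cutoff
        )
        if candidates:  # labels with no close match are dropped
            result[normalized[candidates[0]]] = value
    return result


# Hypothetical schema and extracted labels for illustration.
schema = {
    "type": "object",
    "properties": {
        "account_number": {"type": "string"},
        "closing_balance": {"type": "number"},
    },
}
extracted = {"Account No": "1234", "Closing Balance": 950.0, "Branch": "Main"}

print(match_to_schema(extracted, schema))
```

The cutoff trades recall for precision: a lower value maps more labels but raises the chance of a wrong match, so a production matcher would likely also validate value types against the schema.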
Full Changelog: https://github.com/NameetP/pdfmux/compare/v1.0.1...v1.1.0
About NameetP/pdfmux
PDF extraction router with a built-in MCP server. It classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).