
NameetP/pdfmux

v1.6.4 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

New pdfmux audit command for diffing extractor outputs and an OSS‑to‑cloud upsell line after conversion.

Full changelog

Additive release: two new features, no breaking changes, and no changes to defaults. This is the OSS half of the Verified Extraction Manifest (VEM) standard: the audit command produces the comparison artifact that VEM 1.0 standardizes.

New: pdfmux audit

Diff your current extractor's output against pdfmux on the same PDFs:

pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs
  • Reads --against as CSV (filename,text or file,content) or JSON ({filename: text})
  • Computes per-document word-set Jaccard overlap between the two extractors
  • Flags a document when its overlap falls below --overlap-threshold (default 0.70) or its pdfmux confidence falls below --confidence-threshold (default 0.50)
  • Writes a 7-column CSV: filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error
  • Exit codes: 0 clean, 2 usage error, 3 anything flagged
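
The comparison described in the list above can be sketched in a few lines of Python. This is an illustration of word-set Jaccard overlap and the two-threshold flagging rule, not pdfmux's actual implementation. The tokenization (lowercasing and splitting on whitespace) and the function names are assumptions made for the example; only the default thresholds and the exit codes 0 and 3 come from the release notes.

```python
def word_set(text: str) -> set[str]:
    # Assumed tokenization: lowercase, split on whitespace.
    # The real tool may normalize text differently.
    return set(text.lower().split())


def jaccard_overlap(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty outputs agree trivially (an assumption)
    return len(words_a & words_b) / len(words_a | words_b)


def is_flagged(
    overlap: float,
    confidence: float,
    overlap_threshold: float = 0.70,      # default from the release notes
    confidence_threshold: float = 0.50,   # default from the release notes
) -> bool:
    # A document is flagged if EITHER value falls below its threshold.
    return overlap < overlap_threshold or confidence < confidence_threshold


def exit_code(flags: list[bool]) -> int:
    # 0 = clean, 3 = anything flagged (2, usage error, is not modeled here).
    return 3 if any(flags) else 0


if __name__ == "__main__":
    mine = "the quick brown fox"
    theirs = "the quick red fox"
    overlap = jaccard_overlap(mine, theirs)  # 3 shared words out of 5 distinct
    print(overlap, is_flagged(overlap, confidence=0.9))
```

Because the overlap is computed on sets of words, it ignores word order and repeated words, so it measures whether the two extractors found the same vocabulary rather than whether they agree on layout.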

The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."
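
To check the 2% figure from the pitch against a real run, the report CSV can be summarized after the fact. The sketch below assumes the report has a header row using the seven column names listed above. The sample rows are made up for illustration, and the function name and the choice to count rows with a missing overlap value as disagreements are assumptions, not documented behaviour.

```python
import csv
import io


def disagreement_rate(report_csv: str, overlap_threshold: float = 0.70) -> float:
    """Fraction of documents whose jaccard_overlap is below the threshold."""
    rows = list(csv.DictReader(io.StringIO(report_csv)))
    if not rows:
        return 0.0
    disagreements = 0
    for row in rows:
        value = (row.get("jaccard_overlap") or "").strip()
        # Assumption: a row with no overlap value (e.g. an extraction error)
        # is counted as a disagreement so that it is not silently ignored.
        if not value or float(value) < overlap_threshold:
            disagreements += 1
    return disagreements / len(rows)


# Made-up sample report, for illustration only.
SAMPLE = """\
filename,my_extractor_chars,pdfmux_chars,jaccard_overlap,pdfmux_confidence,recommendation,error
a.pdf,1200,1210,0.95,0.91,,
b.pdf,800,1500,0.42,0.88,,
c.pdf,300,310,0.88,0.93,,
d.pdf,950,940,0.97,0.90,,
"""

if __name__ == "__main__":
    rate = disagreement_rate(SAMPLE)
    print(f"{rate:.0%} of documents disagree")
```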

New: OSS → cloud funnel line

A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
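
The suppression rules above amount to three checks. The following is a minimal sketch of that decision, assuming a hypothetical helper named should_print_upsell; it is not taken from the pdfmux source, and only the PDFMUX_NO_UPSELL variable and the --output - behaviour come from the release notes.

```python
import os
import sys
from typing import Mapping, Optional


def should_print_upsell(
    env: Mapping[str, str],
    stdout_is_tty: bool,
    output_target: Optional[str],
) -> bool:
    """Return True only when none of the documented skip conditions apply."""
    if env.get("PDFMUX_NO_UPSELL") == "1":
        return False  # explicitly suppressed by the user
    if not stdout_is_tty:
        return False  # piped or redirected output stays clean
    if output_target == "-":
        return False  # the converted document itself is going to stdout
    return True


if __name__ == "__main__":
    # output_target=None stands for "no --output flag given" (an assumption).
    print(should_print_upsell(os.environ, sys.stdout.isatty(), None))
```

In CI or other scripted use, setting PDFMUX_NO_UPSELL=1 in the environment is the explicit way to opt out, although non-TTY output is already skipped.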

Tests

tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).

Install

pip install --upgrade pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md


About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).
