✓ No known CVEs patched in this version
Summary
AI summary: New pdfmux audit command for diffing extractor outputs, and an OSS-to-cloud upsell line printed after conversion.
Full changelog
Additive release: two new features, no breaking changes, and no changes to defaults. This is the OSS half of the Verified Extraction Manifest (VEM) standard: the audit command produces the comparison artifact that VEM 1.0 standardizes.
New: pdfmux audit
Diff your current extractor's output against pdfmux on the same PDFs:
pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs
- Reads --against as CSV (filename,text or file,content) or JSON ({filename: text})
- Computes per-document word-set Jaccard overlap between the two extractors
- Flags documents below --overlap-threshold (default 0.70) OR --confidence-threshold (default 0.50)
- Writes a 7-column CSV: filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error
- Exit codes: 0 clean, 2 usage error, 3 anything flagged
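The overlap metric and flagging rule described above can be sketched as follows. This is a minimal illustration, not pdfmux's actual implementation: the tokenizer (lowercased `\w+` matches), the handling of two empty texts, and the function names `word_set`, `jaccard_overlap`, and `is_flagged` are all assumptions. Only the thresholds and the OR rule come from the release notes.

```python
import re


def word_set(text: str) -> set[str]:
    """Lowercased set of word tokens. The tokenizer is an assumption."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Word-set Jaccard overlap: |A & B| / |A | B|."""
    wa, wb = word_set(a), word_set(b)
    if not wa and not wb:
        # Assumption: two empty extractions are treated as full agreement.
        return 1.0
    return len(wa & wb) / len(wa | wb)


def is_flagged(
    overlap: float,
    confidence: float,
    overlap_threshold: float = 0.70,     # default from the release notes
    confidence_threshold: float = 0.50,  # default from the release notes
) -> bool:
    """Flag a document when EITHER value falls below its threshold."""
    return overlap < overlap_threshold or confidence < confidence_threshold
```

For example, "the quick brown fox" and "the quick red fox" share three of five distinct words, so the overlap is 0.6 and the document would be flagged at the default threshold even if confidence were high.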
The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."
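To check the 2% figure against your own run, one option is to compute the share of documents whose overlap falls below the threshold from the audit CSV. The sketch below assumes the report has a header row using the seven column names listed above and that rows with an error leave jaccard_overlap empty; neither detail is confirmed by the release notes, and `disagreement_rate` is a hypothetical helper, not part of pdfmux.

```python
import csv
import io


def disagreement_rate(report_csv: str, overlap_threshold: float = 0.70) -> float:
    """Fraction of rows whose jaccard_overlap is below the threshold."""
    rows = list(csv.DictReader(io.StringIO(report_csv)))
    if not rows:
        return 0.0

    def disagrees(row: dict) -> bool:
        try:
            return float(row["jaccard_overlap"]) < overlap_threshold
        except (KeyError, TypeError, ValueError):
            # Assumption: rows without a usable overlap count as disagreements.
            return True

    return sum(disagrees(r) for r in rows) / len(rows)
```

Pass the report contents as a string, for example `disagreement_rate(open(path).read())`, and compare the result against 0.02.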
New: OSS → cloud funnel line
A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
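The three suppression conditions can be expressed as a single predicate. This is a sketch of the described behavior under stated assumptions, not pdfmux's code: the function name `should_print_upsell` and its parameters are hypothetical, and the exact comparison against "1" is an assumption.

```python
import os
import sys


def should_print_upsell(output_target: str, env=None, stdout=None) -> bool:
    """Return True only when none of the documented suppressions apply."""
    env = os.environ if env is None else env
    stdout = sys.stdout if stdout is None else stdout

    if env.get("PDFMUX_NO_UPSELL") == "1":
        return False  # explicit opt-out via environment variable
    if output_target == "-":
        return False  # converted text is being written to stdout
    return stdout.isatty()  # piped or redirected stdout stays clean
```

Taking `env` and `stdout` as parameters keeps the check easy to exercise in tests without touching the real environment or terminal.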
Tests
tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).
Install
pip install --upgrade pdfmux
Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md
About NameetP/pdfmux
PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).