This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Added a strict confidence-threshold exit code and a new Python batch API.
Full changelog
Field-driven patch release
Triggered by a real-world 433-PDF batch run where the first invocation silently dropped 16 documents — the exact failure mode pdfmux's brand promises to prevent. Every change here is additive; no breaking defaults.
Surface signals that already exist
- `pdfmux convert --strict --min-confidence FLOAT`: exits with code 3 if any document confidence falls below the threshold. Use it in CI to fail loud instead of silent.
- stderr WARNING line for every document with confidence < 0.50, regardless of `--strict`. Visible in CI logs.
- `manifest.json` written at the end of every `convert <dir>` run. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
- `pdfmux.batch_extract(paths, **kwargs)`: public Python API. Use this instead of shelling out to the CLI in a loop.
- `pdfmux doctor --check <dir>`: samples PDFs, classifies them, recommends missing extras. Catches "23% of your batch is scanned, install pdfmux[ocr]" before the batch.
- RapidOCR warnings translated to pdfmux-namespaced messages with file + page context.
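As a rough illustration of the strict gate in a CI job, the sketch below builds the `pdfmux convert --strict --min-confidence` command and returns its exit code. The helper names (`build_strict_command`, `run_strict_gate`), the `0.80` default threshold, the 0.0 to 1.0 confidence range, and the placement of the directory argument before the flags are assumptions made for illustration; check `pdfmux convert --help` for the actual argument layout.

```python
import subprocess
from pathlib import Path


def build_strict_command(pdf_dir: Path, min_confidence: float) -> list[str]:
    """Build the argv for a strict conversion run (argument order is assumed)."""
    # Assumption: confidence is reported on a 0.0-1.0 scale.
    if not 0.0 <= min_confidence <= 1.0:
        raise ValueError("min_confidence must be between 0.0 and 1.0")
    return [
        "pdfmux", "convert", str(pdf_dir),
        "--strict", "--min-confidence", f"{min_confidence:.2f}",
    ]


def run_strict_gate(pdf_dir: Path, min_confidence: float = 0.80) -> int:
    """Run pdfmux and return its exit code.

    stdout/stderr are not captured, so any WARNING lines land in the CI log.
    """
    result = subprocess.run(
        build_strict_command(pdf_dir, min_confidence),
        check=False,  # inspect the return code ourselves instead of raising
    )
    return result.returncode
```

A CI step could then call `sys.exit(run_strict_gate(Path("invoices")))` so that a strict-gate failure fails the job; the `invoices` directory is a placeholder.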
Removed
- The ML heading classifier (`models/heading_classifier.pkl` + `ml_headings.py`). It needed sklearn (not a base dep), printed `Failed to load ML heading model` 24+ times per batch, and produced no measurable lift over the heuristic fallback. -250 LOC, no behavior change on real PDFs.
Fixed
- `pdfmux.__version__` was stale at `1.5.1`; now matches `pyproject`.
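One way to catch this kind of drift is to compare a package's `__version__` attribute with the version recorded in its installed distribution metadata. The sketch below uses the standard-library `importlib.metadata`; the helper names are hypothetical, and it assumes the import name and the distribution name are the same string (as they appear to be for `pdfmux`). It is not how pdfmux itself performs the check.

```python
from importlib import import_module
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the version from the installed distribution metadata, or None."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def version_attr_is_stale(package: str) -> bool:
    """True when package.__version__ disagrees with the installed metadata.

    A missing __version__ attribute also counts as a mismatch.
    """
    declared = getattr(import_module(package), "__version__", None)
    return declared != installed_version(package)
```

With pdfmux installed, `version_attr_is_stale("pdfmux")` would be expected to return `False` after this release.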
Docs
- README leads Python users with `batch_extract` for batch use cases.
- `pdfmux[ocr]` promoted from "optional extra" to recommended-default.
- New note: don't wrap pdfmux with your own pypdf fallback. PyMuPDF tolerates malformed PDFs that pypdf rejects.
Exit codes (now documented)
- `0`: success
- `1`: extraction or runtime error
- `2`: usage error (bad arguments, file not found)
- `3`: strict gate failed (at least one document below `--min-confidence`)
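For scripts that wrap the CLI, the documented codes can be mirrored in a small enum so that callers branch on names rather than bare integers. This is a minimal sketch: the class name, member names, and message strings are invented for illustration and are not part of pdfmux; only the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class PdfmuxExit(IntEnum):
    """Exit codes as listed in the release notes (member names are hypothetical)."""
    SUCCESS = 0
    EXTRACTION_ERROR = 1
    USAGE_ERROR = 2
    STRICT_GATE_FAILED = 3


# Short messages suitable for a CI log line.
_MESSAGES = {
    PdfmuxExit.SUCCESS: "success",
    PdfmuxExit.EXTRACTION_ERROR: "extraction or runtime error",
    PdfmuxExit.USAGE_ERROR: "usage error (bad arguments, file not found)",
    PdfmuxExit.STRICT_GATE_FAILED: "strict gate failed (document below --min-confidence)",
}


def describe_exit(code: int) -> str:
    """Translate a pdfmux return code into a readable message."""
    try:
        return _MESSAGES[PdfmuxExit(code)]
    except ValueError:
        # Any code outside the documented set is reported as-is.
        return f"unknown exit code {code}"
```

Keeping the unknown-code branch explicit means a future exit code surfaces in logs instead of being silently treated as a failure of a known kind.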
Tests: 659 passed, 3 skipped.
Install: pip install -U pdfmux
About NameetP/pdfmux
PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback)