NameetP/pdfmux

v1.6.1 Breaking

This release is flagged as breaking. Platform teams should review the changes below before upgrading.

Published 2mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Added a strict confidence‑threshold exit code and new Python batch API.

Full changelog

Field-driven patch release

Triggered by a real-world 433-PDF batch run where the first invocation silently dropped 16 documents — the exact failure mode pdfmux's brand promises to prevent. Every change here is additive; no breaking defaults.

Surface signals that already exist

pdfmux convert --strict --min-confidence FLOAT — exits with code 3 if any document confidence falls below the threshold. Use it in CI to fail loud instead of silent.
stderr WARNING line for every document with confidence < 0.50, regardless of --strict. Visible in CI logs.
manifest.json written at the end of every convert <dir> run. Per-document confidence, extractor used, OCR pages, cost, warnings, plus a summary breakdown.
pdfmux.batch_extract(paths, **kwargs) — public Python API. Use this instead of shelling out to the CLI in a loop.
pdfmux doctor --check <dir> — samples PDFs, classifies them, recommends missing extras. Catches "23% of your batch is scanned, install pdfmux[ocr]" before the batch.
RapidOCR warnings translated to pdfmux-namespaced messages with file + page context.
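
As a rough illustration of how the manifest could be used after a batch run, the sketch below lists documents whose confidence falls under the 0.50 warning threshold mentioned above. The release notes do not give the manifest schema, so the key names here ("documents", "path", "confidence") are assumptions for illustration only; check a real manifest.json before relying on them.

```python
import json
from pathlib import Path


def load_manifest(manifest_path):
    """Read a manifest.json file written by a `pdfmux convert <dir>` run."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))


def low_confidence_docs(manifest, threshold=0.50):
    """Return paths of documents whose confidence is below `threshold`.

    ASSUMPTION: the manifest holds a "documents" list whose entries carry
    "path" and "confidence" keys. These names are hypothetical; the release
    notes only say that per-document confidence is recorded.
    """
    flagged = []
    for doc in manifest.get("documents", []):
        # Treat a missing confidence as 0.0 so the document is flagged, not hidden.
        if doc.get("confidence", 0.0) < threshold:
            flagged.append(doc["path"])
    return flagged
```

Treating a missing confidence value as 0.0 is a deliberate choice in this sketch: a document with no score is reported rather than silently passed.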

Removed

The ML heading classifier (models/heading_classifier.pkl + ml_headings.py). It needed sklearn (not a base dep), printed "Failed to load ML heading model" 24+ times per batch, and produced no measurable lift over the heuristic fallback. -250 LOC, no behavior change on real PDFs.

Fixed

pdfmux.__version__ was stale at 1.5.1; now matches pyproject.

Docs

README leads Python users with batch_extract for batch use cases.
pdfmux[ocr] promoted from "optional extra" to recommended-default.
New note: don't wrap pdfmux with your own pypdf fallback — PyMuPDF tolerates malformed PDFs that pypdf rejects.

Exit codes (now documented)

0 — success
1 — extraction or runtime error
2 — usage error (bad arguments, file not found)
3 — strict gate failed (at least one document below --min-confidence)
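
The exit codes above can be handled in a CI script along these lines. This is a minimal sketch, not pdfmux's own tooling: the flags and code meanings come from the notes above, while the helper names (build_command, describe_exit, run_strict) and the position of the directory argument on the command line are assumptions.

```python
import subprocess

# Exit code meanings as documented in this release.
EXIT_MEANINGS = {
    0: "success",
    1: "extraction or runtime error",
    2: "usage error (bad arguments, file not found)",
    3: "strict gate failed (at least one document below --min-confidence)",
}


def build_command(input_dir, min_confidence):
    """Build the strict-mode CLI invocation.

    ASSUMPTION: the directory is passed right after `convert`; the flags
    themselves are taken from the release notes.
    """
    return [
        "pdfmux", "convert", input_dir,
        "--strict", "--min-confidence", str(min_confidence),
    ]


def describe_exit(code):
    """Map a pdfmux exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_strict(input_dir, min_confidence=0.5, runner=subprocess.run):
    """Run the strict gate and return (exit_code, meaning).

    `runner` is injectable so the logic can be exercised without pdfmux
    installed; by default it shells out via subprocess.run.
    """
    result = runner(build_command(input_dir, min_confidence))
    return result.returncode, describe_exit(result.returncode)
```

A CI step could call run_strict("pdfs", 0.8), print the meaning to stderr, and exit with the returned code so that a strict-gate failure (3) stays distinguishable from a runtime error (1).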

Tests: 659 passed, 3 skipped.

Install: pip install -U pdfmux

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).