NameetP/pdfmux

v1.6.4 Breaking

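This release is flagged as breaking for platform teams planning a safe upgrade, although the changelog below describes it as additive, with no breaking changes and no changed defaults.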

Published 2mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Adds a `pdfmux audit` command for diffing extractor outputs, plus an OSS-to-cloud upsell line printed after conversion.

Full changelog

Additive release. Two new things, no breaking changes, no defaults change. This is the OSS half of the Verified Extraction Manifest (VEM) standard — the audit command produces the comparison artifact that VEM 1.0 standardizes.

New: `pdfmux audit`

Diff your current extractor's output against pdfmux on the same PDFs:

pdfmux audit --against your_extractor_output.csv --on /path/to/pdfs

Reads --against as CSV (filename,text or file,content) or JSON ({filename: text})
Computes per-document word-set Jaccard overlap between the two extractors
Flags documents whose overlap is below --overlap-threshold (default 0.70) or whose pdfmux confidence is below --confidence-threshold (default 0.50)
Writes a 7-column CSV: filename, my_extractor_chars, pdfmux_chars, jaccard_overlap, pdfmux_confidence, recommendation, error
Exit codes: 0 clean, 2 usage error, 3 anything flagged
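
The overlap and flagging rules above can be sketched in a few lines of Python. This is an illustration only, not pdfmux's actual implementation: the tokenization (lowercase, split on whitespace), the helper names `word_set`, `jaccard_overlap`, and `is_flagged`, and the choice to treat two empty extractions as full agreement are all assumptions.

```python
def word_set(text: str) -> set[str]:
    """Assumed tokenization: lowercase, then split on whitespace."""
    return set(text.lower().split())


def jaccard_overlap(a: str, b: str) -> float:
    """Word-set Jaccard overlap: |A & B| / |A | B|."""
    set_a, set_b = word_set(a), word_set(b)
    union = set_a | set_b
    if not union:
        # Assumption: two empty extractions count as full agreement.
        return 1.0
    return len(set_a & set_b) / len(union)


def is_flagged(
    overlap: float,
    confidence: float,
    overlap_threshold: float = 0.70,     # default quoted in the release notes
    confidence_threshold: float = 0.50,  # default quoted in the release notes
) -> bool:
    """A document is flagged if either score falls below its threshold."""
    return overlap < overlap_threshold or confidence < confidence_threshold


mine = "the quick brown fox"
theirs = "the quick red fox"
score = jaccard_overlap(mine, theirs)  # 3 shared words out of 5 distinct words
print(score, is_flagged(score, confidence=0.9))
```

Because the comparison uses word sets, word order and repeated words do not affect the score; only the vocabulary each extractor recovered does.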

The pitch: "diff our output against your current extractor on 100 of your own PDFs — if we agree on every document, you don't need us. If we disagree on more than 2%, those are the silent failures already in your pipeline."
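
To compare an audit report against the 2% figure in the pitch, a short script can compute the share of documents with low overlap. The sketch below assumes the report has a header row using the seven column names listed above, that a non-empty `error` cell means the document could not be compared, and that "disagree" means `jaccard_overlap` below the default 0.70 threshold. The function name `disagreement_rate` and the sample rows are made up for illustration.

```python
import csv
import io


def disagreement_rate(report_text: str, overlap_threshold: float = 0.70) -> float:
    """Fraction of successfully compared documents with overlap below the threshold."""
    rows = list(csv.DictReader(io.StringIO(report_text)))
    # Assumption: a non-empty `error` cell means no comparison was possible.
    compared = [r for r in rows if not (r.get("error") or "").strip()]
    if not compared:
        return 0.0
    low = sum(1 for r in compared if float(r["jaccard_overlap"]) < overlap_threshold)
    return low / len(compared)


# Made-up sample rows; the recommendation column is left blank on purpose.
sample = """filename,my_extractor_chars,pdfmux_chars,jaccard_overlap,pdfmux_confidence,recommendation,error
a.pdf,1200,1210,0.97,0.91,,
b.pdf,300,1400,0.22,0.88,,
c.pdf,0,0,,,,could not open
"""

rate = disagreement_rate(sample)
print(f"{rate:.0%} of compared documents fall below the overlap threshold")
```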

New: OSS → cloud funnel line

A single dim line prints after a successful convert, pointing to the free tier (1,000 pages/mo) and the open VEM spec. Suppress with PDFMUX_NO_UPSELL=1. Skipped when stdout isn't a TTY (so piped output stays clean) and when writing to stdout via --output -.
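
The three suppression conditions described above combine into a single check. The following is a hypothetical sketch of that decision, not pdfmux's code: the function name `should_print_upsell` and its parameters are invented, and it assumes only the exact value `1` of `PDFMUX_NO_UPSELL` disables the line.

```python
import os
import sys


def should_print_upsell(env=None, stdout_is_tty=None, output_target=None) -> bool:
    """Return True only when none of the documented suppression rules apply."""
    env = os.environ if env is None else env
    if stdout_is_tty is None:
        stdout_is_tty = sys.stdout.isatty()

    if env.get("PDFMUX_NO_UPSELL") == "1":
        return False  # explicitly suppressed by the user
    if not stdout_is_tty:
        return False  # piped or redirected output stays clean
    if output_target == "-":
        return False  # converted text is being written to stdout
    return True


print(should_print_upsell(env={}, stdout_is_tty=True, output_target="out.md"))
```

Passing `env` and `stdout_is_tty` explicitly keeps the check easy to test without touching the real environment or terminal.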

Tests

tests/test_audit_cli.py — 8 new tests. Total: 678 passing (up from 670).

Install

pip install --upgrade pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).
