NameetP/pdfmux

v0.2.2 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Graphical PDFs are automatically routed to OCR or LLM extractors and confidence scores accurately reflect extraction quality.

Full changelog

What's New

Graphical PDF Detection

pdfmux now detects image-heavy PDFs (pitch decks, infographics, slides) and routes them to OCR or LLM extractors instead of fast text extraction. Previously, these PDFs would silently produce incomplete output.
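The routing idea can be sketched as follows. This is only an illustration under stated assumptions, not pdfmux's actual detection logic: the `PageStats` fields, the thresholds, and the extractor labels are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page measurements (not pdfmux's real data model)."""
    text_chars: int        # characters recovered by fast text extraction
    image_count: int       # number of images placed on the page
    image_coverage: float  # fraction of page area covered by images, 0.0 to 1.0


def is_graphical_page(page: PageStats,
                      min_coverage: float = 0.5,
                      max_text_chars: int = 200) -> bool:
    """Treat a page as graphical when images dominate and little text was found."""
    return (page.image_count > 0
            and page.image_coverage >= min_coverage
            and page.text_chars <= max_text_chars)


def choose_extractor(pages: List[PageStats], graphical_ratio: float = 0.3) -> str:
    """Return a hypothetical extractor label for the whole document."""
    if not pages:
        return "fast"
    graphical = sum(1 for p in pages if is_graphical_page(p))
    # Route to OCR/LLM once enough pages look image-heavy (assumed threshold).
    return "ocr_or_llm" if graphical / len(pages) >= graphical_ratio else "fast"
```

A slide deck where most pages are full-bleed images with a few dozen extractable characters would be routed to `"ocr_or_llm"`, while a text-only report would stay on `"fast"`.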

Honest Confidence Scoring

Confidence scores now reflect actual extraction quality. A graphical PDF where image content was missed will score ~55% instead of falsely claiming 100%. Output is color-coded: green (≥80%), yellow (50-79%), red (<50%).
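The color bands listed above map onto a simple threshold check. The sketch below assumes the score is given as a percentage from 0 to 100; the function name is hypothetical and not part of pdfmux.

```python
def confidence_color(percent: float) -> str:
    """Map a confidence percentage to the color band from the release notes."""
    if percent >= 80:
        return "green"   # 80% and above
    if percent >= 50:
        return "yellow"  # 50% up to (but not including) 80%
    return "red"         # below 50%
```

Under this mapping, the ~55% score mentioned for a graphical PDF with missed image content would be shown in yellow.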

Actionable Warnings

When extraction is limited, pdfmux tells you exactly what to do:

⚠ 35 of 47 pages contain images with text that could not be extracted.
  Install pdfmux[ocr] or pdfmux[llm] for better results.
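A warning of this shape could be assembled from two page counts. This is a minimal sketch: the function name and parameters are hypothetical, and only the message wording is taken from the example above.

```python
from typing import Optional


def limited_extraction_warning(missed_pages: int, total_pages: int) -> Optional[str]:
    """Build the warning text, or return None when no pages were missed."""
    if missed_pages <= 0:
        return None
    return (
        f"⚠ {missed_pages} of {total_pages} pages contain images with text "
        "that could not be extracted.\n"
        "  Install pdfmux[ocr] or pdfmux[llm] for better results."
    )
```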

MCP Server Quality Metadata

AI agents now receive confidence scores and warnings alongside extracted text, so they can make informed decisions about output quality.
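As an illustration of how an agent might act on that metadata, here is a minimal sketch. The payload shape (`text`, `confidence`, `warnings`), the 0-to-1 confidence scale, and the decision labels are assumptions for the example, not the documented MCP response format.

```python
from typing import Any, Dict


def decide_on_extraction(result: Dict[str, Any], min_confidence: float = 0.8) -> str:
    """Pick a next step from a hypothetical extraction result payload."""
    confidence = result.get("confidence", 0.0)  # assumed to be in the range 0.0 to 1.0
    warnings = result.get("warnings", [])
    if confidence >= min_confidence and not warnings:
        return "use"               # high confidence, nothing flagged
    if confidence >= 0.5:
        return "use_with_caveat"   # usable, but tell the user it may be incomplete
    return "retry_or_escalate"     # e.g. re-run with OCR or ask for another source
```

An agent that receives a 0.55 confidence score together with a warning would, under these assumed thresholds, use the text but flag it as possibly incomplete.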

Spaced-Text Cleanup

Fixes a common PDF artifact in which text is extracted with a space between every letter: W i t h o v e r is now cleaned to With over.
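One way to approach this kind of cleanup is with a regular expression that joins runs of single characters. The sketch below is not pdfmux's implementation; it assumes that letters inside a word are separated by exactly one space and that real word gaps are wider (two or more spaces), since otherwise word boundaries cannot be recovered from the spacing alone.

```python
import re

# Three or more single word characters, each separated by exactly one space.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\w ){2,}\w(?!\S)")


def collapse_spaced_text(text: str) -> str:
    """Join letter-spaced runs, then squeeze the wider gaps between words."""
    joined = _SPACED_RUN.sub(lambda m: m.group(0).replace(" ", ""), text)
    return re.sub(r" {2,}", " ", joined)
```

Requiring at least three single characters in a row keeps ordinary sentences containing a lone "a" or "I" untouched.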

FastExtractor Fallback

When pymupdf4llm returns empty output (which happens with certain PDF encodings), pdfmux now falls back to raw fitz text extraction instead of producing nothing.
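The fallback pattern can be sketched with two injected extractor callables. The function and parameter names are hypothetical; in a real setup the primary might wrap `pymupdf4llm.to_markdown` and the fallback might concatenate `page.get_text()` over a `fitz` document, but how pdfmux wires these together internally is an assumption here.

```python
from typing import Callable


def extract_with_fallback(path: str,
                          primary: Callable[[str], str],
                          fallback: Callable[[str], str]) -> str:
    """Return the primary extractor's text, or the fallback's if it is empty."""
    text = primary(path)
    # Whitespace-only output is treated as empty as well.
    if text and text.strip():
        return text
    return fallback(path)
```

Passing the extractors in as arguments keeps the sketch independent of any installed PDF library and makes the empty-output path easy to exercise with stubs.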

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).