This release includes breaking changes; platform teams should review them before upgrading.
✓ No known CVEs patched in this version
Summary
Graphical PDFs are automatically routed to OCR or LLM extractors, and confidence scores accurately reflect extraction quality.
Full changelog
What's New
Graphical PDF Detection
pdfmux now detects image-heavy PDFs (pitch decks, infographics, slides) and routes them to OCR or LLM extractors instead of fast text extraction. Previously, these PDFs would silently produce incomplete output.
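The release notes do not say how detection works, so the following is only a rough sketch of one possible heuristic: treat a page as graphical when it carries images but little extractable text, and route the whole document to OCR once enough pages look that way. `PageStats`, `is_graphical_page`, `choose_extractor`, and both thresholds are hypothetical names and values, not pdfmux's API.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page summary; not a pdfmux type."""
    text_chars: int   # characters returned by fast text extraction
    image_count: int  # embedded images found on the page


def is_graphical_page(page: PageStats, min_text_chars: int = 200) -> bool:
    # Assumed rule: images present but very little extractable text.
    return page.image_count > 0 and page.text_chars < min_text_chars


def choose_extractor(pages: list[PageStats], graphical_ratio: float = 0.5) -> str:
    """Return "ocr" for image-heavy documents, otherwise "fast".

    An LLM extractor could be returned instead of "ocr"; the choice
    between the two is outside the scope of this sketch.
    """
    if not pages:
        return "fast"
    graphical = sum(1 for p in pages if is_graphical_page(p))
    return "ocr" if graphical / len(pages) >= graphical_ratio else "fast"
```

A pitch deck where most slides hold a single image and a short caption would be routed to OCR under these assumed thresholds, while a text-heavy report with an occasional figure would stay on the fast path.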
Honest Confidence Scoring
Confidence scores now reflect actual extraction quality. A graphical PDF where image content was missed will score ~55% instead of falsely claiming 100%. Output is color-coded: green (≥80%), yellow (50-79%), red (<50%).
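The color bands above can be expressed as a small lookup. This is a minimal sketch of the thresholds as stated in these notes; the function name `confidence_color` is hypothetical and is not taken from pdfmux.

```python
def confidence_color(score: float) -> str:
    """Map a confidence percentage (0-100) to the color band described above.

    Hypothetical helper for illustration; not part of pdfmux's API.
    """
    if score >= 80:
        return "green"
    if score >= 50:
        return "yellow"
    return "red"
```

Under this mapping, the ~55% score mentioned for a graphical PDF with missed image content falls in the yellow band.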
Actionable Warnings
When extraction is limited, pdfmux tells you exactly what to do:
⚠ 35 of 47 pages contain images with text that could not be extracted.
Install pdfmux[ocr] or pdfmux[llm] for better results.
MCP Server Quality Metadata
AI agents now receive confidence scores and warnings alongside extracted text, so they can make informed decisions about output quality.
Spaced-Text Cleanup
Fixes a common PDF artifact where text renders with a space between every letter: W i t h o v e r is now cleaned to With over.
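One way such a cleanup could work is sketched below. It assumes that word boundaries in the spaced-out text survive as runs of two or more spaces (for example `W i t h  o v e r`), which is common in this artifact but is not confirmed by the release notes. The helper name `collapse_spaced_text` and the four-character minimum are illustrative choices, not pdfmux's implementation.

```python
import re

# A run of at least four single characters, each separated by exactly one space.
# The lookarounds ensure every token in the run is a lone character.
_SPACED_RUN = re.compile(r"(?<!\S)(?:\S ){3,}\S(?!\S)")


def collapse_spaced_text(line: str) -> str:
    """Join letter-spaced runs back into words (heuristic sketch)."""
    collapsed, count = _SPACED_RUN.subn(
        lambda m: m.group(0).replace(" ", ""), line
    )
    if count == 0:
        # Nothing looked letter-spaced; leave the line untouched.
        return line
    # Assumed: multi-space gaps marked the original word boundaries.
    return re.sub(r" {2,}", " ", collapsed)
```

If the extractor emits uniform single spacing with no wider gaps between words, this heuristic cannot recover word boundaries, and a dictionary-based or glyph-position approach would be needed instead.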
FastExtractor Fallback
When pymupdf4llm returns empty output (which happens with certain PDF encodings), pdfmux now falls back to raw fitz text extraction instead of producing empty output.
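The fallback pattern can be sketched generically. In this illustration the two extractors are passed in as callables so the logic stays independent of any library. In pdfmux's case the primary would correspond to pymupdf4llm and the fallback to raw fitz text extraction, but the function name and signature here are hypothetical.

```python
from typing import Callable

Extractor = Callable[[str], str]


def extract_with_fallback(path: str, primary: Extractor, fallback: Extractor) -> str:
    """Return the primary extractor's text, or the fallback's if it is empty.

    Hypothetical sketch; whitespace-only output is treated as empty here,
    which is an assumption rather than documented pdfmux behavior.
    """
    text = primary(path)
    if text and text.strip():
        return text
    return fallback(path)
```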
About NameetP/pdfmux
PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).