NameetP/pdfmux

v1.6.2 Maintenance

This release keeps dependencies and maintenance posture current for teams operating this tool.

Published 2mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Tests-only release: adds 11 behavioral-contract tests as regression guards, with no source code or behavior changes.

Full changelog

Regression-guard release

No behavior changes. Adds 11 behavioral-contract tests for the real-world failure modes that prompted the 1.6.1 work — pinning correct behavior so it can't silently regress.

Test categories

Truncated PDF streams: the four "pypdf: Stream has ended unexpectedly" cases from the original batch run. pdfmux must either recover (via PyMuPDF's xref repair) or raise an error; it must never silently return empty output.
Non-ASCII filenames: CJK and full-width punctuation (Coolsoft test reports（原版）.pdf). Both extract_text and batch_extract must accept these without shell-quoting issues.
Arabic-only PDFs: the BiDi post-processor must not crash on RTL text.
0-byte files: must raise a named PdfmuxError, never silently return empty output.
HTML files renamed to .pdf: common when "view as PDF" saves the page source. Must either error cleanly or return text without HTML markup.
Missing files: must raise FileError, not a bare FileNotFoundError.
Batch isolation: a bad file in batch_extract must yield an exception for that file without poisoning the rest of the batch.
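
The file-level contracts above can be illustrated with a minimal, self-contained sketch. It does not import pdfmux: the classes and functions below are local stand-ins that only borrow the names mentioned in the changelog (PdfmuxError, FileError, extract_text, batch_extract). Their signatures, the FileError-inherits-from-PdfmuxError hierarchy, the "%PDF-" header check, and the list-of-pairs return shape are all assumptions made for illustration, not the library's actual API.

```python
import os
import tempfile


class PdfmuxError(Exception):
    """Stand-in base error (hypothetical; not imported from pdfmux)."""


class FileError(PdfmuxError):
    """Stand-in named error for unusable input files (hierarchy is assumed)."""


def extract_text(path):
    """Toy extractor that enforces only the input contracts, not real parsing."""
    if not os.path.isfile(path):
        # Missing file: raise the named error, not a bare FileNotFoundError.
        raise FileError(f"file not found: {path}")
    with open(path, "rb") as fh:
        data = fh.read()
    if not data:
        # 0-byte file: fail loudly instead of returning an empty string.
        raise FileError(f"empty file: {path}")
    if not data.lstrip().startswith(b"%PDF-"):
        # For example, HTML saved with a .pdf extension: error cleanly.
        raise FileError(f"not a PDF: {path}")
    return "<extracted text placeholder>"


def batch_extract(paths):
    """Return (path, text-or-exception) pairs so one bad file never aborts the rest."""
    results = []
    for path in paths:
        try:
            results.append((path, extract_text(path)))
        except PdfmuxError as exc:
            # Record the failure for this file only and keep going.
            results.append((path, exc))
    return results


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "Coolsoft test reports（原版）.pdf")
        empty = os.path.join(tmp, "empty.pdf")
        with open(good, "wb") as fh:
            fh.write(b"%PDF-1.7\n")
        open(empty, "wb").close()
        missing = os.path.join(tmp, "missing.pdf")
        for path, outcome in batch_extract([good, empty, missing]):
            print(os.path.basename(path), "->", type(outcome).__name__)
```

Paths are passed as plain Python strings rather than through a shell, which is one way the non-ASCII filename case can avoid quoting problems. Catching only the stand-in PdfmuxError in the batch loop is a design choice of this sketch; the real tests may assert a different isolation mechanism.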

Numbers

670 passing (was 659 in 1.6.1)
0 behavior changes
0 source code changes — tests-only release

Install: pip install -U pdfmux

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).
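
As a rough illustration of the classify-then-route idea described above, here is a minimal sketch. Everything in it is hypothetical: the PageStats fields, the thresholds, and the mapping from page class to backend name are assumptions chosen for the example and do not reflect how pdfmux actually classifies pages or selects backends.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Hypothetical per-page features; not pdfmux's real data model."""
    text_chars: int          # characters in the embedded text layer
    image_coverage: float    # fraction of page area covered by images, 0..1
    table_like_lines: int    # count of ruling lines that suggest a table


def classify(page: PageStats) -> str:
    """Label a page as 'scanned', 'tables', or 'digital' (thresholds are made up)."""
    if page.text_chars < 20 and page.image_coverage > 0.5:
        return "scanned"
    if page.table_like_lines >= 4:
        return "tables"
    return "digital"


# Assumed class-to-backend mapping, using backend names from the description.
BACKENDS = {"digital": "PyMuPDF", "tables": "Docling", "scanned": "OCR"}


def route(page: PageStats) -> str:
    """Pick a backend name for a single page based on its class."""
    return BACKENDS[classify(page)]
```

Routing per page rather than per document lets a mixed PDF, for instance a digital report with a few scanned appendix pages, use a different backend for each page.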
