
NameetP/pdfmux

v1.6.2 Maintenance

This maintenance release adds regression tests only, so teams operating this tool can upgrade without expecting changed behavior.

✓ No known CVEs patched


Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Tests-only release: 11 new behavioral-contract tests, with no source code or behavior changes.

Full changelog

Regression-guard release

No behavior changes. Adds 11 behavioral-contract tests for the real-world failure modes that prompted the 1.6.1 work — pinning correct behavior so it can't silently regress.

Test categories

  • Truncated PDF streams: the four "pypdf: Stream has ended unexpectedly" cases from the original batch run. pdfmux must either recover (PyMuPDF's xref repair) or raise, and must never silently return empty.
  • Non-ASCII filenames — CJK + full-width punctuation (Coolsoft test reports(原版).pdf). Both extract_text and batch_extract must accept these without shell-quoting issues.
  • Arabic-only PDFs — the BiDi post-processor must not crash on RTL text.
  • 0-byte files — must raise a named PdfmuxError, never silently return empty.
  • HTML files renamed to .pdf — common when 'view as PDF' saves the page source. Must error cleanly OR return text without HTML markup.
  • Missing files — must raise FileError, not a bare FileNotFoundError.
  • Batch isolation — a bad file in batch_extract must yield an exception for that file without poisoning the rest of the batch.
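
To make the contracts above concrete, here is a minimal, self-contained sketch of what such behavioral checks might look like. It does not import pdfmux. The classes PdfmuxError and FileError and the functions extract_text and batch_extract below are toy stand-ins that reuse the names from the changelog; their signatures, the exception hierarchy, and the shape of the batch result are assumptions made for illustration, not pdfmux's actual API. The toy extractor rejects HTML-as-PDF outright, which is only one of the two outcomes the contract allows.

```python
import tempfile
from pathlib import Path


class PdfmuxError(Exception):
    """Stand-in base error (the real class lives in pdfmux)."""


class FileError(PdfmuxError):
    """Stand-in missing-file error; subclassing PdfmuxError is an assumption."""


def extract_text(path):
    """Toy extractor: models only the error contracts, not real PDF parsing."""
    p = Path(path)
    if not p.is_file():
        # Contract: a named FileError, not a bare FileNotFoundError.
        raise FileError(f"no such file: {p}")
    data = p.read_bytes()
    if not data:
        # Contract: 0-byte input raises instead of returning empty text.
        raise PdfmuxError(f"empty file: {p}")
    if not data.startswith(b"%PDF-"):
        # Every PDF begins with the "%PDF-" header; HTML saved as .pdf does not.
        raise PdfmuxError(f"not a PDF: {p}")
    return "stub text"


def batch_extract(paths):
    """Return (path, text-or-exception) pairs so one bad file cannot abort the rest."""
    results = []
    for path in paths:
        try:
            results.append((path, extract_text(path)))
        except PdfmuxError as exc:
            results.append((path, exc))
    return results


def run_contract_checks():
    """Build one good and three bad inputs, then run them through a single batch."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "Coolsoft test reports(原版).pdf"  # non-ASCII filename
        good.write_bytes(b"%PDF-1.7\n")
        empty = Path(tmp) / "empty.pdf"
        empty.write_bytes(b"")
        html = Path(tmp) / "page.pdf"
        html.write_bytes(b"<!DOCTYPE html><html></html>")
        missing = Path(tmp) / "missing.pdf"  # never created

        results = batch_extract([good, empty, html, missing])
        return {path.name: outcome for path, outcome in results}
```

Calling run_contract_checks() returns a dict keyed by filename. The good file maps to extracted text and each bad file maps to its own exception, which is the batch-isolation property: failures are reported per file rather than aborting the run.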

Numbers

  • 670 passing (was 659 in 1.6.1)
  • 0 behavior changes
  • 0 source code changes — tests-only release

Install: pip install -U pdfmux


About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback)
