NameetP/pdfmux

v1.6.0 Feature

This release adds 12 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

Adds new extraction backends, Arabic/RTL support, caching and streaming tools, and developer-experience (DX) improvements.

Full changelog

Highlights

Extraction backends

  • Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
  • Gemma 4 27B IT as a vision provider via the Gemini API (reuses GEMINI_API_KEY) with native Arabic OCR
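
As a rough illustration of the per-page pricing quoted above, the arithmetic can be sketched as follows. Only the $0.002/page figure comes from the release notes; the function name is hypothetical, and this is not how pdfmux estimate computes its numbers.

```python
from decimal import Decimal

# Price quoted in the release notes for the Mistral OCR backend (USD per page).
MISTRAL_OCR_PRICE_PER_PAGE = Decimal("0.002")


def ocr_cost_usd(page_count: int) -> Decimal:
    """Hypothetical helper: flat per-page cost, ignoring any other fees."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    return MISTRAL_OCR_PRICE_PER_PAGE * page_count


print(ocr_cost_usd(500))  # prints 1.000
```

Decimal is used instead of float so that small per-page prices add up without binary rounding noise.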

Arabic / RTL

  • BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
  • Arabic detection in the classifier; arabic page type wired into the routing matrix

Caching & streaming

  • Smart result cache keyed by file hash + (quality, format, schema), with a 30-day TTL and a 1 GB LRU limit.
  • pdfmux stream and extract_streaming MCP tool — NDJSON events as pages complete
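
One way a cache key of the shape described above (file hash plus quality, format, and schema) could be built is sketched here. The hash algorithm, the key layout, the option values, and all function names are assumptions for illustration; the release notes do not describe pdfmux's actual implementation.

```python
import hashlib
import json

# 30-day TTL from the release notes, expressed in seconds.
TTL_SECONDS = 30 * 24 * 60 * 60


def file_sha256(path, chunk_size=1 << 20):
    """Hash the file contents in chunks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_key(path, quality, fmt, schema=None):
    """Hypothetical key: content hash plus a short hash of the options."""
    options = json.dumps(
        {"quality": quality, "format": fmt, "schema": schema},
        sort_keys=True,  # stable ordering so equal options give equal keys
    )
    options_hash = hashlib.sha256(options.encode("utf-8")).hexdigest()[:16]
    return f"{file_sha256(path)}-{options_hash}"


def is_fresh(stored_at, now):
    """True while an entry stored at `stored_at` (epoch seconds) is within the TTL."""
    return (now - stored_at) < TTL_SECONDS
```

Keying on file contents rather than the file path means a renamed or copied PDF still hits the cache, while changing any of the three options produces a separate entry.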

DX

  • pdfmux profiles list/show/save/delete and --profile flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
  • pdfmux watch <dir> — auto-convert on change
  • pdfmux estimate — predict cost before running
  • pdfmux diff a.pdf b.pdf — extraction comparison
  • Better error messages: .user_message, .suggestion, .reproduce_cmd
  • @with_retry (exponential backoff, honors Retry-After) on every LLM provider
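
The retry behaviour named in the last bullet (exponential backoff that honors Retry-After) can be sketched roughly as below. This is a minimal illustration under assumptions, not pdfmux's decorator: the name retry_sketch, the RetryableError class, and every parameter are hypothetical, and jitter is omitted to keep the example deterministic.

```python
import functools
import time


class RetryableError(Exception):
    """Hypothetical error carrying an optional Retry-After value in seconds."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def retry_sketch(max_attempts=4, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry on RetryableError; prefer the server's Retry-After over backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RetryableError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    if exc.retry_after is not None:
                        delay = float(exc.retry_after)
                    else:
                        # Exponential backoff: base, 2x base, 4x base, capped.
                        delay = min(max_delay, base_delay * 2 ** attempt)
                    sleep(delay)

        return wrapper

    return decorator
```

Passing sleep as a parameter lets tests record the delays instead of actually waiting.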

Tests: 659 passing (up from 481).

Install: pip install -U pdfmux

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).