This release adds 12 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: New extraction backends, Arabic/RTL support, caching/streaming tools, and DX improvements added.
Full changelog
Highlights
Extraction backends
- Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
- Gemma 4 27B IT as a vision provider via the Gemini API (reuses GEMINI_API_KEY) with native Arabic OCR
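To put the per-page price in perspective, the arithmetic can be sketched as below. This is only an illustration of the listed $0.002/page rate; the function name `ocr_cost_usd` is hypothetical and is not part of pdfmux.

```python
from decimal import Decimal

# Listed Mistral OCR price from the release notes, in USD per page.
MISTRAL_OCR_PRICE_PER_PAGE = Decimal("0.002")


def ocr_cost_usd(page_count: int,
                 price_per_page: Decimal = MISTRAL_OCR_PRICE_PER_PAGE) -> Decimal:
    """Return the OCR cost in USD for page_count pages (hypothetical helper)."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    # Decimal avoids binary floating-point rounding on small per-page prices.
    return price_per_page * page_count


if __name__ == "__main__":
    for pages in (100, 500, 10_000):
        print(pages, "pages:", ocr_cost_usd(pages), "USD")
```

At this rate a 500-page document costs about one dollar, assuming every page is sent to the OCR backend.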
Arabic / RTL
- BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
- Arabic detection in the classifier; arabic page type wired into the routing matrix
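One common way to detect Arabic pages is to measure the share of letters that fall in the Arabic Unicode blocks. The sketch below shows that idea; it is not pdfmux's classifier, and the names `arabic_ratio` and `looks_arabic` and the 0.3 threshold are assumptions made for illustration.

```python
# Unicode ranges covering Arabic script: the base block, the supplement,
# Extended-A, and the two presentation-form blocks that PDF text layers
# often emit.
ARABIC_RANGES = (
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0x08A0, 0x08FF),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFF),
)


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that belong to Arabic script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(
        1 for ch in letters
        if any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES)
    )
    return arabic / len(letters)


def looks_arabic(text: str, threshold: float = 0.3) -> bool:
    """Hypothetical page-level check; the threshold is an arbitrary choice."""
    return arabic_ratio(text) >= threshold
```

A ratio rather than a simple presence check keeps mostly-Latin pages with a few Arabic names from being routed as Arabic.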
Caching & streaming
- Smart result cache keyed by file hash + (quality, format, schema). 30d TTL, 1GB LRU.
- pdfmux stream and extract_streaming MCP tool: NDJSON events as pages complete
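The two ideas above can be sketched in a few lines. This is a sketch under stated assumptions, not pdfmux's implementation: the function names, the SHA-256 choice, the key layout, and the `page` field in the sample events are all hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Iterator, Optional


def cache_key(pdf_bytes: bytes, quality: str, fmt: str,
              schema: Optional[Dict[str, Any]] = None) -> str:
    """Build a cache key from the file hash plus (quality, format, schema)."""
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    # sort_keys makes the key independent of dict ordering in the schema.
    options = json.dumps(
        {"quality": quality, "format": fmt, "schema": schema},
        sort_keys=True,
        separators=(",", ":"),
    )
    options_hash = hashlib.sha256(options.encode("utf-8")).hexdigest()
    return f"{file_hash}:{options_hash[:16]}"


def iter_ndjson(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line of an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


if __name__ == "__main__":
    print(cache_key(b"%PDF-1.7 example", "standard", "markdown"))
    for event in iter_ndjson(['{"page": 1}\n', '{"page": 2}\n']):
        print(event)
```

Hashing the file contents rather than the path means a renamed or copied PDF still hits the cache, while any change to quality, format, or schema produces a different key.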
DX
- pdfmux profiles list/show/save/delete and --profile flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
- pdfmux watch <dir>: auto-convert on change
- pdfmux estimate: predict cost before running
- pdfmux diff a.pdf b.pdf: extraction comparison
- Better error messages: .user_message, .suggestion, .reproduce_cmd
- @with_retry (exponential backoff, honors Retry-After) on every LLM provider
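The retry behaviour described in the last item, exponential backoff that defers to a server-supplied Retry-After, can be illustrated with a small decorator. This is a minimal sketch and not pdfmux's @with_retry: the names `retry_with_backoff` and `RateLimited`, the default delays, and the 10% jitter are assumptions.

```python
import functools
import random
import time
from typing import Optional


class RateLimited(Exception):
    """Hypothetical error carrying the server's Retry-After value in seconds."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def retry_with_backoff(max_attempts: int = 4, base_delay: float = 0.5,
                       max_delay: float = 30.0, sleep=time.sleep):
    """Retry on RateLimited, doubling the delay unless Retry-After is given."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except RateLimited as exc:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    if exc.retry_after is not None:
                        # The server told us how long to wait; honor it.
                        delay = exc.retry_after
                    else:
                        # 0.5s, 1s, 2s, ... capped, plus up to 10% jitter.
                        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
                        delay += random.uniform(0, delay * 0.1)
                    sleep(delay)
        return wrapper
    return decorator
```

Passing `sleep` as a parameter keeps the decorator testable without real waiting; a provider wrapper would raise `RateLimited` after parsing the Retry-After header from an HTTP 429 response.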
Tests: 659 passing (up from 481).
Install: pip install -U pdfmux
About NameetP/pdfmux
PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback)