NameetP/pdfmux

v1.6.0 Feature

This release adds 12 notable features for engineering teams evaluating rollout.

Published 2mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Adds new extraction backends, Arabic/RTL support, caching and streaming tools, and developer-experience (DX) improvements.

Full changelog

Highlights

Extraction backends

Mistral OCR ($0.002/page) and Marker neural extractor as optional backends
Gemma 4 27B IT as a vision provider via the Gemini API (reuses GEMINI_API_KEY), with native Arabic OCR

Arabic / RTL

BiDi post-processing (markdown-aware: preserves headings, lists, code fences)
Arabic detection in the classifier; arabic page type wired into the routing matrix
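
To illustrate the kind of check an Arabic-aware classifier might run, here is a minimal sketch that labels a page as arabic when enough of its letters fall in Unicode's Arabic blocks. It is not pdfmux's actual classifier: the function names, the 0.3 threshold, and the "other" label are assumptions made for this example.

```python
# Unicode blocks that hold Arabic script: the main block, the supplement,
# Extended-A, and the two presentation-form blocks that PDF text layers
# often contain.
ARABIC_RANGES = (
    (0x0600, 0x06FF),
    (0x0750, 0x077F),
    (0x08A0, 0x08FF),
    (0xFB50, 0xFDFF),
    (0xFE70, 0xFEFF),
)


def is_arabic_char(ch: str) -> bool:
    """Return True if the character's code point lies in an Arabic block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ARABIC_RANGES)


def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that are Arabic (0.0 for no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_arabic_char(ch) for ch in letters) / len(letters)


def classify_page(text: str, threshold: float = 0.3) -> str:
    """Hypothetical page-type decision; the threshold is an assumed value."""
    return "arabic" if arabic_ratio(text) >= threshold else "other"
```

Counting only alphabetic characters keeps digits, punctuation, and whitespace from diluting the ratio on pages that mix Arabic text with tables or numbers.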

Caching & streaming

Smart result cache keyed by file hash + (quality, format, schema). Entries expire after 30 days (TTL); the cache is capped at 1 GB with LRU eviction.
pdfmux stream and extract_streaming MCP tool — NDJSON events as pages complete
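
The cache description above can be sketched as a key built from the file's content hash plus the extraction options, stored in a size-bounded LRU map with a TTL. This is an in-memory illustration under stated assumptions, not pdfmux's implementation: the names cache_key and ResultCache, the use of SHA-256, and the example option values are all hypothetical.

```python
import hashlib
import json
import time
from collections import OrderedDict

TTL_SECONDS = 30 * 24 * 3600  # 30-day TTL, as stated in the changelog
MAX_BYTES = 1024 ** 3         # 1 GB budget, as stated in the changelog


def cache_key(pdf_bytes, quality, fmt, schema=None):
    """Combine the file's content hash with the options that affect output."""
    file_hash = hashlib.sha256(pdf_bytes).hexdigest()
    options = json.dumps(
        {"quality": quality, "format": fmt, "schema": schema}, sort_keys=True
    )
    return hashlib.sha256((file_hash + options).encode("utf-8")).hexdigest()


class ResultCache:
    """Size-bounded LRU cache whose entries also expire after a TTL."""

    def __init__(self, max_bytes=MAX_BYTES, ttl=TTL_SECONDS, clock=time.time):
        self._entries = OrderedDict()  # key -> (stored_at, payload bytes)
        self._size = 0
        self._max_bytes = max_bytes
        self._ttl = ttl
        self._clock = clock

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, payload = entry
        if self._clock() - stored_at > self._ttl:
            self._evict(key)  # expired: drop it and report a miss
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return payload

    def put(self, key, payload):
        if key in self._entries:
            self._evict(key)
        self._entries[key] = (self._clock(), payload)
        self._size += len(payload)
        # Evict least recently used entries until the cache fits the budget.
        while self._size > self._max_bytes and self._entries:
            self._evict(next(iter(self._entries)))

    def _evict(self, key):
        _, payload = self._entries.pop(key)
        self._size -= len(payload)
```

Hashing file contents rather than the path means a renamed or copied PDF still hits the cache, while any change to quality, format, or schema produces a different key.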

DX improvements

pdfmux profiles list/show/save/delete and --profile flag (built-ins: invoices, receipts, papers, contracts, bulk-rag)
pdfmux watch <dir> — auto-convert on change
pdfmux estimate — predict cost before running
pdfmux diff a.pdf b.pdf — extraction comparison
Better error messages: .user_message, .suggestion, .reproduce_cmd
@with_retry (exponential backoff, honors Retry-After) on every LLM provider
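
As an illustration of the retry behavior named above, the following sketch shows a decorator that backs off exponentially and prefers a server-supplied Retry-After delay when one is available. It is not pdfmux's code: the RateLimitError class, the parameter names and defaults, and the decorator-factory form are assumptions for this example.

```python
import functools
import time


class RateLimitError(Exception):
    """Hypothetical provider error carrying an optional Retry-After (seconds)."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def with_retry(max_attempts=4, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Retry the wrapped call on RateLimitError with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RateLimitError as exc:
                    if attempt == max_attempts - 1:
                        raise  # out of attempts: surface the last error
                    if exc.retry_after is not None:
                        # The server said how long to wait; honor it.
                        delay = exc.retry_after
                    else:
                        # Otherwise double the delay each attempt, capped.
                        delay = min(max_delay, base_delay * 2 ** attempt)
                    sleep(delay)

        return wrapper

    return decorator
```

Passing sleep as a parameter keeps the sketch testable without real waiting; production code would usually also add random jitter so that many clients do not retry in lockstep.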

Tests: 659 passing (up from 481).

Install: pip install -U pdfmux

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. It classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).
