This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Overall benchmark score improved from 0.867 to 0.905.
Full changelog
What's New in v1.5.0
Benchmark Results
- 0.905 overall benchmark score on opendataloader-bench (200 docs)
- Up from 0.867 (v1.3.0), a +4.4% relative improvement (+0.038 absolute)
- 100% confidence score across all documents
- 98 docs improved, only 3 regressed
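The percentage quoted above is a relative change, not a difference in score points. A minimal sketch of that arithmetic follows; the helper name `relative_gain` is illustrative and not part of pdfmux.

```python
def relative_gain(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Overall score reported in these notes: 0.867 (v1.3.0) -> 0.905 (v1.5.0).
overall = relative_gain(0.867, 0.905)
print(f"overall: +{overall:.1f}% relative, +{0.905 - 0.867:.3f} absolute")
```

The same formula reproduces the per-metric percentages quoted elsewhere in these notes.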
Key Improvements
Image Table OCR (TEDS: 0.887 → 0.911, +2.7%)
- Integrated RapidOCR for tables embedded as images
- Smart filtering: 50% fill rate + 30% numeric cell thresholds to avoid false positives on charts
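The filtering rule above can be sketched as a simple gate on the OCR'd cell grid. This is only an illustration under stated assumptions: the notes give the two thresholds but not how they are combined, so requiring both to pass, measuring the numeric share against all cells, and the `looks_like_table` name and numeric pattern are all hypothetical rather than pdfmux's actual code.

```python
import re

FILL_RATE_MIN = 0.50     # threshold stated in the release notes
NUMERIC_RATE_MIN = 0.30  # threshold stated in the release notes

# Assumption: a cell counts as numeric if it is a number with optional
# currency sign, sign, thousands separators, or a trailing percent sign.
_NUMERIC = re.compile(r"^[\s$€£(+-]*\d[\d,.\s]*[%)]?\s*$")


def looks_like_table(grid: list[list[str]]) -> bool:
    """Accept an OCR'd grid only if it is dense and number-heavy enough."""
    cells = [cell.strip() for row in grid for cell in row]
    if not cells:
        return False
    filled = [cell for cell in cells if cell]
    fill_rate = len(filled) / len(cells)
    numeric_rate = sum(1 for cell in filled if _NUMERIC.match(cell)) / len(cells)
    # Assumption: both thresholds must be met for the grid to be kept.
    return fill_rate >= FILL_RATE_MIN and numeric_rate >= NUMERIC_RATE_MIN
```

A chart usually yields a sparse grid with a few axis labels, so it fails the fill-rate check, while a financial table passes both.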
ML Heading Classifier (MHS: 0.844 → 0.852, +0.9%)
- ML-based fallback for heading detection when heuristics fail
- Improved heading cleanup for cleaner document structure
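One way to read "fallback when heuristics fail" is a two-stage decision: cheap rules answer when they are decisive, and a classifier is consulted only for the undecided cases. The sketch below illustrates that pattern; the rules, the 1.3x font-size ratio, and the `ml_classifier` callable are all hypothetical stand-ins, since the notes do not describe pdfmux's real features or model.

```python
from typing import Callable, Optional


def heuristic_is_heading(text: str, font_size: float, body_size: float) -> Optional[bool]:
    """Return True or False when a rule is decisive, or None when undecided."""
    stripped = text.strip()
    if not stripped:
        return False
    if font_size >= body_size * 1.3 and len(stripped) <= 120:
        return True  # clearly larger than body text and short
    if font_size <= body_size and (len(stripped) > 200 or stripped.endswith(".")):
        return False  # body-sized and reads like a sentence or paragraph
    return None  # the rules cannot tell


def is_heading(
    text: str,
    font_size: float,
    body_size: float,
    ml_classifier: Callable[[str, float, float], bool],
) -> bool:
    """Use the heuristics first; call the classifier only when they are undecided."""
    verdict = heuristic_is_heading(text, font_size, body_size)
    if verdict is None:
        return ml_classifier(text, font_size, body_size)
    return verdict
```

Keeping the classifier behind the heuristics means documents the rules already handle well behave as before.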
Column-Aware Reading Order (NID: 0.910 → 0.920, +1.1%)
- A/B column reordering: detects multi-column pages, compares both orderings, picks the better one
- Safe by design: the worst case is a no-op (original text preserved)
- Conservative detection (200pt gap threshold) to avoid false positives
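The A/B approach described above can be sketched as: build a column-wise candidate ordering, score both orderings, and keep the original unless the candidate is strictly better. Everything beyond the 200pt figure is an assumption here: the `Block` type, measuring the gap between block left edges, and the caller-supplied `score` function are hypothetical, because the notes do not say how pdfmux measures the gap or compares orderings.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

GAP_THRESHOLD_PT = 200.0  # threshold stated in the release notes


@dataclass(frozen=True)
class Block:
    x0: float  # left edge, in points
    y0: float  # top edge, in points
    text: str


def column_order(blocks: Sequence[Block]) -> Optional[list[Block]]:
    """Return left-column-then-right-column order, or None if no wide gap exists."""
    lefts = sorted({b.x0 for b in blocks})
    for a, b in zip(lefts, lefts[1:]):
        # Assumption: the gap is measured between adjacent block left edges.
        if b - a >= GAP_THRESHOLD_PT:
            left = sorted((blk for blk in blocks if blk.x0 <= a), key=lambda k: k.y0)
            right = sorted((blk for blk in blocks if blk.x0 >= b), key=lambda k: k.y0)
            return left + right
    return None


def reorder(blocks: Sequence[Block], score: Callable[[Sequence[Block]], float]) -> list[Block]:
    """Pick the column-wise ordering only if it scores strictly higher."""
    original = list(blocks)
    candidate = column_order(original)
    if candidate is None:
        return original  # single-column page: no-op
    return candidate if score(candidate) > score(original) else original
```

Because ties and missing gaps both fall back to the original list, the worst outcome of this sketch is an unchanged page.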
Install
pip install pdfmux==1.5.0
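After installing, the pinned version can be confirmed from Python using only the standard library. This sketch assumes the installed distribution is named `pdfmux`, matching the pip command above; the helper name `installed_version` is illustrative.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("pdfmux")
    if found is None:
        print("pdfmux is not installed in this environment")
    else:
        print(f"pdfmux {found} is installed")
```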
Full Changelog: https://github.com/NameetP/pdfmux/compare/v1.3.0...v1.5.0
About NameetP/pdfmux
PDF extraction router with a built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).