NameetP/pdfmux

v1.5.0 Feature

This release adds three notable features: image table OCR, an ML heading classifier, and column-aware reading order.

✓ No known CVEs patched in this version

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

Overall benchmark score improved from 0.867 to 0.905.

Full changelog

What's New in v1.5.0

Benchmark Results

  • 0.905 overall benchmark score on opendataloader-bench (200 docs)
  • Up from 0.867 (v1.3.0), a +4.4% relative improvement
  • 100% confidence score across all documents
  • 98 docs improved, only 3 regressed
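The percentages in these notes are relative changes, not percentage points: the overall score moved by 0.038 in absolute terms, which is 4.4% of the v1.3.0 score. A minimal sketch of that arithmetic follows; the helper name `relative_change` is illustrative and not part of pdfmux, and the "unchanged" count assumes every one of the 200 documents is classified as improved, regressed, or unchanged.

```python
def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100.0


# Scores quoted in the release notes.
overall = relative_change(0.867, 0.905)
print(f"overall: +{overall:.1f}%")

# Assumption: the 200 benchmark documents split into improved / regressed / unchanged.
total_docs, improved, regressed = 200, 98, 3
unchanged = total_docs - improved - regressed
print(f"unchanged docs: {unchanged}")
```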

Key Improvements

Image Table OCR (TEDS: 0.887 → 0.911, +2.7%)

  • Integrated RapidOCR for tables embedded as images
  • Smart filtering: 50% fill rate + 30% numeric cell thresholds to avoid false positives on charts

ML Heading Classifier (MHS: 0.844 → 0.852, +0.9%)

  • ML-based fallback for heading detection when heuristics fail
  • Improved heading cleanup for cleaner document structure
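The heuristic-then-fallback pattern described above might look like the sketch below. Everything here is an assumption made for illustration: the notes do not say which heuristics pdfmux uses, what "fail" means, or how the classifier is invoked. This sketch treats "fail" as "the heuristic cannot decide", uses a font-size ratio as a stand-in heuristic, and passes the ML classifier in as a plain callable.

```python
from typing import Callable, Optional


def heuristic_level(text: str, font_size: float, body_size: float) -> Optional[int]:
    """Return 1 or 2 for a heading, 0 for body text, None when undecided.

    The ratios below are illustrative values, not pdfmux's.
    """
    ratio = font_size / body_size
    if ratio >= 1.5:
        return 1
    if ratio >= 1.2:
        return 2
    if text.rstrip().endswith((".", ",", ";")):
        return 0  # body-sized text ending in sentence punctuation
    return None   # body-sized with no punctuation: the heuristic cannot tell


def heading_level(
    text: str,
    font_size: float,
    body_size: float,
    fallback: Callable[[str], int],
) -> int:
    """Use the heuristic when it decides; otherwise defer to the fallback."""
    level = heuristic_level(text, font_size, body_size)
    return level if level is not None else fallback(text)
```

Keeping the fallback behind a callable means the classifier only runs on the ambiguous lines, so pages the heuristics already handle are unaffected.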

Column-Aware Reading Order (NID: 0.910 → 0.920, +1.1%)

  • A/B column reordering: detects multi-column pages, compares both orderings, picks the better one
  • Safe by design — worst case is no-op (original text preserved)
  • Conservative detection (200pt gap threshold) to avoid false positives
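The A/B approach above can be sketched as follows. This is a hypothetical illustration under several assumptions: the notes do not say what the 200pt gap measures (here it is the widest jump between block left edges), how the two orderings are scored (here the scorer is supplied by the caller, with a toy one for the example), or how blocks are represented. `Block`, `split_columns`, `reorder`, and `toy_score` are invented names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

GAP_THRESHOLD_PT = 200.0  # value from the release notes; what it measures is assumed


@dataclass(frozen=True)
class Block:
    x0: float  # left edge in points
    y0: float  # top edge in points (smaller is higher on the page)
    text: str


def split_columns(blocks: Sequence[Block]) -> Optional[Tuple[List[Block], List[Block]]]:
    """Split into (left, right) if the left edges show a jump >= the threshold."""
    xs = sorted({b.x0 for b in blocks})
    gaps = [(right - left, left) for left, right in zip(xs, xs[1:])]
    if not gaps:
        return None
    widest, left_max = max(gaps)
    if widest < GAP_THRESHOLD_PT:
        return None
    return ([b for b in blocks if b.x0 <= left_max],
            [b for b in blocks if b.x0 > left_max])


def reorder(blocks: Sequence[Block], score: Callable[[List[Block]], float]) -> List[Block]:
    """Return the column-wise ordering only if it scores strictly higher."""
    original = list(blocks)
    columns = split_columns(original)
    if columns is None:
        return original  # not detected as multi-column: no-op
    left, right = columns
    candidate = sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)
    return candidate if score(candidate) > score(original) else original


def toy_score(order: List[Block]) -> float:
    """Toy scorer: count adjacent pairs where sentence boundaries line up."""
    def coherent(prev: Block, nxt: Block) -> bool:
        ends_sentence = prev.text.rstrip().endswith((".", "!", "?"))
        return ends_sentence == nxt.text[:1].isupper()
    return float(sum(coherent(a, b) for a, b in zip(order, order[1:])))


page = [
    Block(50, 100, "The parser reads each"), Block(320, 100, "Tables are handled"),
    Block(50, 120, "page in turn."), Block(320, 120, "separately."),
]
print([b.text for b in reorder(page, toy_score)])
```

Because the candidate must score strictly higher to be chosen, a weak or uninformative scorer leaves the original order in place, which matches the no-op worst case the notes describe.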

Install

pip install pdfmux==1.5.0
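
After installing, the pinned version can be confirmed from Python with the standard library's `importlib.metadata`. This sketch assumes the distribution name is `pdfmux`, as in the pip command above; the helper `installed_version` is illustrative. If the pinned install succeeded, it should report 1.5.0; if the package is missing, it prints `None`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


print(installed_version("pdfmux"))
```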

Full Changelog: https://github.com/NameetP/pdfmux/compare/v1.3.0...v1.5.0

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).
