NameetP/pdfmux

v1.2.0 Feature

This release adds two notable features: font-size-based heading detection and a whitespace-based fallback for borderless tables.

Published 4mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

Font-size-based heading detection raises the overall benchmark score by 7.7% (0.792 → 0.853).

Full changelog

What's New in v1.2.0

Heading Detection (benchmark +7.7%)

Font-size-based heading detection that analyzes PyMuPDF font metadata to identify and inject markdown heading markers.

Compares span font sizes to the body-text size and maps larger sizes to #, ##, or ###
Detects bold-at-same-size headings common in academic PDFs
Promotes short bold-only lines to ### as a fallback
Exits early when pymupdf4llm has already detected headings
Zero new dependencies, ~220 lines of pure Python
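
The heading rules above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in pdfmux's `headings` module: it assumes each line is a list of span dicts shaped like the ones PyMuPDF returns from `page.get_text("dict")` (`text`, `size`, `flags`), and the size ratios, the 80-character limit, and all function names are hypothetical.

```python
from collections import Counter

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def body_font_size(lines):
    """Return the font size covering the most characters (assumed to be body text)."""
    counts = Counter()
    for spans in lines:
        for span in spans:
            counts[round(span["size"], 1)] += len(span["text"])
    return counts.most_common(1)[0][0]


def heading_level(spans, body_size, max_bold_chars=80):
    """Return 1-3 for a heading line, 0 for body text. Thresholds are illustrative."""
    text = "".join(span["text"] for span in spans).strip()
    if not text:
        return 0
    ratio = max(span["size"] for span in spans) / body_size
    if ratio >= 1.6:
        return 1
    if ratio >= 1.3:
        return 2
    if ratio >= 1.1:
        return 3
    # Fallback: a short line set entirely in bold at body size becomes ###.
    visible = [span for span in spans if span["text"].strip()]
    if len(text) <= max_bold_chars and all(s["flags"] & BOLD_FLAG for s in visible):
        return 3
    return 0


def inject_headings(lines):
    """Return one markdown string per line, prefixing detected headings with #."""
    texts = ["".join(span["text"] for span in spans).strip() for spans in lines]
    # Early exit: nothing to analyze, or an upstream step already emitted headings.
    if not any(texts) or any(t.startswith("#") for t in texts):
        return texts
    body = body_font_size(lines)
    out = []
    for spans, text in zip(lines, texts):
        level = heading_level(spans, body)
        out.append(f"{'#' * level} {text}" if level else text)
    return out
```

Weighting the body-size estimate by character count keeps a page with many short headings from skewing the baseline.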

Borderless Table Fallback

Whitespace column detection for tables missed by find_tables():

Detects consistent column positions across 3+ consecutive lines
Validates candidates: requires at least one numeric column and a minimum of 3 rows
Returns ExtractedTable objects matching existing API
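
A rough sketch of whitespace column detection follows. It is an assumption-laden illustration rather than the `table_fallback` module itself: it works on plain text lines, treats character columns that are blank in every line as gutters, and returns a list of rows instead of `ExtractedTable` objects. The function names, the two-space gutter width, and the choice to exempt the first row from the numeric check (so a header row is allowed) are all hypothetical.

```python
import re

# Matches values such as 0.792, 1,234, +0.061, or 12%.
NUMERIC = re.compile(r"^[+-]?[\d,]*\.?\d+%?$")


def gutters(lines, min_width=2):
    """Return (start, end) spans of character columns that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    blank = [all(line[i] == " " for line in padded) for i in range(width)]
    spans, start = [], None
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_width:
                spans.append((start, i))
            start = None
    return spans


def split_row(line, spans):
    """Cut one line at the gutter spans and strip each cell."""
    cells, pos = [], 0
    for start, end in spans:
        cells.append(line[pos:start].strip())
        pos = end
    cells.append(line[pos:].strip())
    return cells


def detect_table(lines, min_rows=3):
    """Return rows of cells if the lines look like a borderless table, else None."""
    rows = [line.rstrip() for line in lines if line.strip()]
    if len(rows) < min_rows:
        return None
    spans = [s for s in gutters(rows) if s[0] > 0]  # ignore leading indentation
    if not spans:
        return None
    table = [split_row(row, spans) for row in rows]
    # Require one column whose cells (apart from a possible header) are all numeric.
    has_numeric = any(
        all(NUMERIC.match(row[c]) for row in table[1:])
        for c in range(len(table[0]))
    )
    return table if has_numeric else None
```

Prose rarely has two or more blank character columns aligned across three consecutive lines, and the numeric-column requirement filters out most of the remaining false positives, such as aligned multi-column text.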

Benchmark Results

Tested on opendataloader-bench (200 real-world PDFs):

| Metric | v1.1.0 | v1.2.0 | Delta |
|--------|--------|--------|-------|
| Overall | 0.792 | 0.853 | +0.061 |
| MHS (headings) | 0.500 | 0.740 | +0.240 |
| NID (reading) | 0.911 | 0.911 | — |
| TEDS (tables) | 0.704 | 0.704 | — |

Leaderboard: moved from #6 to #4, ahead of opendataloader local (0.844) and mineru (0.831).
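
The "+7.7%" in the heading-detection title and the "+0.061" in the table describe the same change: the first is relative to the v1.1.0 overall score, the second is the absolute difference. A short check of that arithmetic (the helper name is illustrative):

```python
def relative_gain(before, after):
    """Relative change of `after` with respect to `before`."""
    return (after - before) / before


overall = relative_gain(0.792, 0.853)
print(f"Overall: {overall:+.1%}")  # Overall: +7.7%
```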

Developer

New modules: headings, table_fallback
246 tests passing (21 new)
Zero new dependencies

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).