This release adds two notable features: font-size-based heading detection and a borderless table fallback.
✓ No known CVEs patched in this version
Summary
Font-size-based heading detection improves the overall benchmark score by 7.7% (0.792 → 0.853).
Full changelog
What's New in v1.2.0
Heading Detection (benchmark +7.7%)
Font-size-based heading detection that analyzes PyMuPDF font metadata to identify and inject markdown heading markers.
- Compares span font sizes to body text and maps them to `#`/`##`/`###`
- Detects bold-at-same-size headings common in academic PDFs
- Promotes short bold-only lines to `###` as a fallback
- Exits early when pymupdf4llm has already detected headings
- Zero new dependencies, ~220 lines of pure Python
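The heading rules above can be sketched roughly as follows. This is a minimal illustration, not pdfmux's actual `headings` module. It assumes each line is a list of span dicts shaped like the spans in PyMuPDF's `page.get_text("dict")` output (`text`, `size`, `flags`). The function names and the size-ratio thresholds (1.6, 1.3, 1.1) are hypothetical, as is the 80-character limit for "short" lines. The sketch also folds the bold-at-same-size rule and the bold-only fallback into a single check.

```python
import re
from collections import Counter

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's "flags" field marks bold text


def already_has_headings(markdown):
    """Early-exit check: True if the markdown already contains ATX headings."""
    return re.search(r"^#{1,6} \S", markdown, flags=re.MULTILINE) is not None


def body_font_size(lines):
    """Return the font size covering the most characters, assumed to be body text."""
    weights = Counter()
    for spans in lines:
        for span in spans:
            weights[round(span["size"], 1)] += len(span["text"])
    return weights.most_common(1)[0][0] if weights else None


def heading_level(spans, body_size):
    """Return 1-3 for a heading line or None for body text (thresholds are guesses)."""
    text = "".join(s["text"] for s in spans).strip()
    if not text:
        return None
    ratio = max(s["size"] for s in spans) / body_size
    if ratio >= 1.6:
        return 1
    if ratio >= 1.3:
        return 2
    if ratio >= 1.1:
        return 3
    # Fallback: a short, fully bold line at body size becomes a level-3 heading.
    all_bold = all(s["flags"] & BOLD_FLAG for s in spans if s["text"].strip())
    return 3 if all_bold and len(text) <= 80 else None


def inject_headings(lines):
    """Render lines as markdown text, prefixing detected headings with # markers."""
    body_size = body_font_size(lines)
    if body_size is None:
        return ""
    out = []
    for spans in lines:
        text = "".join(s["text"] for s in spans).strip()
        level = heading_level(spans, body_size)
        out.append(f"{'#' * level} {text}" if level else text)
    return "\n".join(out)
```

A caller would run `already_has_headings()` on the existing markdown first and skip `inject_headings()` when it returns `True`, mirroring the early exit described above.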
Borderless Table Fallback
Whitespace column detection for tables missed by find_tables():
- Detects consistent column positions across 3+ consecutive lines
- Validates: numeric column required, minimum 3 rows
- Returns `ExtractedTable` objects matching the existing API
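One way to picture the whitespace-column approach is the sketch below. It is an assumption-laden illustration rather than the `table_fallback` module itself. Cells are taken to be separated by two or more spaces, and column start offsets must match within a hypothetical `tolerance` of one character. A column counts as numeric if every cell below the first (possible header) row parses as a number. The sketch returns plain lists of rows instead of `ExtractedTable` objects, whose fields are not described here.

```python
import re

CELL = re.compile(r"\S+(?: \S+)*")  # a cell: words joined by single spaces
NUMBER = re.compile(r"[-+]?\$?[\d,]*\.?\d+%?")


def split_cells(line):
    """Return (start offsets, cell texts) for one line of text."""
    matches = list(CELL.finditer(line))
    return tuple(m.start() for m in matches), [m.group() for m in matches]


def is_numeric(cell):
    return NUMBER.fullmatch(cell) is not None


def looks_like_table(rows, min_rows):
    """Accept a run with enough rows and at least one numeric column."""
    if len(rows) < min_rows:
        return False
    # Skip the first row of each column so a text header does not disqualify it.
    return any(all(is_numeric(c) for c in col[1:]) for col in zip(*rows))


def find_whitespace_tables(lines, min_rows=3, tolerance=1):
    """Group consecutive lines whose column start offsets line up."""
    tables, run, run_starts = [], [], None
    for line in lines:
        starts, cells = split_cells(line)
        aligned = (
            run_starts is not None
            and len(starts) == len(run_starts)
            and all(abs(a - b) <= tolerance for a, b in zip(starts, run_starts))
        )
        if aligned:
            run.append(cells)
            continue
        # Alignment broke: keep the finished run if it qualifies, then restart.
        if looks_like_table(run, min_rows):
            tables.append(run)
        multi_column = len(cells) >= 2
        run = [cells] if multi_column else []
        run_starts = starts if multi_column else None
    if looks_like_table(run, min_rows):
        tables.append(run)
    return tables
```

Comparing every line against the first row's offsets keeps the check simple. A production version would likely need to handle right-aligned numeric columns, whose start offsets shift with the number of digits.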
Benchmark Results
Tested on opendataloader-bench (200 real-world PDFs):
| Metric | v1.1.0 | v1.2.0 | Delta |
|--------|--------|--------|-------|
| Overall | 0.792 | 0.853 | +0.061 |
| MHS (headings) | 0.500 | 0.740 | +0.240 |
| NID (reading) | 0.911 | 0.911 | — |
| TEDS (tables) | 0.704 | 0.704 | — |
Leaderboard: #6 → #4 — ahead of opendataloader local (0.844) and mineru (0.831).
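The table reports absolute deltas, while the +7.7% headline figure is consistent with the relative change in the Overall score. That reading is an inference from the numbers, not something the release notes state. The short check below recomputes both from the table values; the `deltas` helper is purely illustrative.

```python
scores = {
    "Overall": (0.792, 0.853),
    "MHS (headings)": (0.500, 0.740),
    "NID (reading)": (0.911, 0.911),
    "TEDS (tables)": (0.704, 0.704),
}


def deltas(old, new):
    """Return (absolute delta, relative change in percent)."""
    return round(new - old, 3), round((new - old) / old * 100, 1)


for metric, (old, new) in scores.items():
    absolute, relative = deltas(old, new)
    print(f"{metric:<16} {absolute:+.3f}  ({relative:+.1f}%)")
```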
Developer
- New modules: `headings`, `table_fallback`
- 246 tests passing (21 new)
- Zero new dependencies
About NameetP/pdfmux
PDF extraction router with a built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).