
NameetP/pdfmux

v1.6.3 Feature

This release adds 1 notable feature for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr
opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Affected surfaces

auth

Summary

AI summary

Fixed document confidence incorrectly reporting 1.0 for empty extractions.

Full changelog

Correctness patch

No defaults change. Every existing flag and CLI invocation behaves identically. Confidence numbers are now correct on documents where they were previously inflated.

The bug

`audit.compute_document_confidence` returned 1.0 on documents with empty extractions. The function computed a content-weighted average of the per-page `confidence` value that the extractor wrote at yield time. That value was always `1.0`, set alongside the code comment "audit will reassess". The reassessment never happened, so:

  • Blank pages → confidence 1.0
  • HTML files renamed to `.pdf` → confidence 1.0
  • Single-character bodies → confidence 1.0
  • Image-only pages with no OCR → confidence 1.0

This is the bug behind the silent failures in our 433-PDF batch retro. `--strict --min-confidence 0.20`, shipped in 1.6.1, could not catch the eleven silent failures because the audit did not know the pages were empty.

The fix (5 lines in `audit.py`)

  • Re-score every page with `score_page(p.text, p.image_count)` before averaging.
  • Stop flooring per-page weight at `1`. Blank pages now register zero weight in the denominator.
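
The two changes can be illustrated with a minimal sketch. It is not the code in `audit.py`: the `Page` class, the character-count weighting, the body of `score_page`, and the 0.0 result for a document with no text are all assumptions made for illustration. Only the `score_page(p.text, p.image_count)` call shape comes from the notes above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    image_count: int = 0
    confidence: float = 1.0  # placeholder the extractor writes at yield time


def score_page(text: str, image_count: int) -> float:
    """Hypothetical stand-in scorer; pdfmux's real heuristic is not shown here."""
    # image_count is accepted only to mirror the call in the release notes.
    chars = len(text.strip())
    return min(1.0, chars / 200)  # 0.0 for blank text, 1.0 at 200+ characters


def confidence_before(pages: list[Page]) -> float:
    # Old behavior: trust the placeholder and floor each weight at 1.
    weights = [max(1, len(p.text.strip())) for p in pages]
    return sum(p.confidence * w for p, w in zip(pages, weights)) / sum(weights)


def confidence_after(pages: list[Page]) -> float:
    # New behavior: re-score each page and let blank pages weigh zero.
    weights = [len(p.text.strip()) for p in pages]
    total = sum(weights)
    if total == 0:
        return 0.0  # assumption: a document with no text at all scores 0.0
    scores = [score_page(p.text, p.image_count) for p in pages]
    return sum(s * w for s, w in zip(scores, weights)) / total


blank_doc = [Page(text=""), Page(text="   ", image_count=2)]
print(confidence_before(blank_doc))  # 1.0
print(confidence_after(blank_doc))   # 0.0
```

In the old path, the weight floor of 1 kept blank pages in the denominator while the placeholder supplied a perfect score, so an empty document averaged to exactly 1.0.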

Also added: `eval/` calibration harness

The instrument that surfaced the bug. Self-contained, three composable steps:

```bash
python eval/build_fixtures.py # generates 50 labeled PDFs
python eval/run_eval.py # runs pdfmux on each
python eval/calibrate.py # ROC + threshold recommendations
```

The first calibration run produced precision flat at 0.683 at every threshold from 0.0 to 0.95. That was the smoking gun: a score that never varies cannot separate good extractions from bad ones. After the fix, threshold `0.75` produces precision `1.00`, recall `0.71` on the 50-fixture set.
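
Why a constant score gives flat precision can be shown with a small sketch. The labels and scores below are made up, `precision_at` is a hypothetical helper rather than part of `eval/calibrate.py`, and it assumes that a "positive" is a document accepted at the threshold whose extraction is actually good.

```python
def precision_at(threshold, scored_docs):
    """Precision among documents whose confidence meets the threshold.

    scored_docs is a list of (confidence, extraction_is_good) pairs.
    Returns None when no document is accepted.
    """
    accepted = [good for conf, good in scored_docs if conf >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


# Made-up labels: two good extractions and one empty one.
before_fix = [(1.0, True), (1.0, True), (1.0, False)]  # every score is 1.0
after_fix = [(0.9, True), (0.8, True), (0.0, False)]   # empty doc re-scored

thresholds = [0.0, 0.5, 0.95]
print([precision_at(t, before_fix) for t in thresholds])
print([precision_at(t, after_fix) for t in thresholds])
```

With every score pinned at 1.0, each threshold accepts the same set of documents, so precision stays at the share of good extractions in the set. Once empty documents score low, raising the threshold starts to exclude them.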

Calibration headline

| Threshold | Precision | Recall | F1 |
|---:|---:|---:|---:|
| 0.50 | 0.821 | 0.821 | 0.821 |
| 0.75 | 1.000 | 0.714 | 0.833 |
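
The F1 column follows from the other two as the harmonic mean of precision and recall. A quick check, using only the figures in the table (the `f1` helper is written here for illustration and is not taken from the harness):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for threshold, p, r in [(0.50, 0.821, 0.821), (0.75, 1.000, 0.714)]:
    print(f"{threshold:.2f} -> F1 {f1(p, r):.3f}")
```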

`0.75` is the recommended default for `--min-confidence` when 1.7 ships strict mode as a breaking default on `pdfmux convert` (target: 2026-05-08).

Tests

670 passing, 3 skipped. No regressions.

Install: `pip install -U pdfmux`


About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).
