This release adds 1 notable feature for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Fixed document confidence incorrectly reporting 1.0 for empty extractions.
Full changelog
Correctness patch
No defaults change. Every existing flag and CLI invocation behaves identically. Confidence numbers are now correct on documents where they were previously inflated.
The bug
`audit.compute_document_confidence` returned 1.0 for documents with empty extractions. The function took a content-weighted average of the per-page `confidence` value the extractor wrote at yield time, which was always `1.0`, annotated with the comment "audit will reassess". The reassessment never happened, so:
- Blank pages → confidence 1.0
- HTML files renamed to `.pdf` → confidence 1.0
- Single-character bodies → confidence 1.0
- Image-only pages with no OCR → confidence 1.0
This is the bug behind the silent failures in our 433-PDF batch retro. `--strict --min-confidence 0.20` shipped in 1.6.1 could not catch the eleven silent failures because the audit didn't know the pages were empty.
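To see why a missing reassessment inflates the result, here is a minimal sketch of a content-weighted average over per-page confidences. The `Page` class, the `weighted_confidence` helper, and the character-count weighting are assumptions for illustration, not pdfmux's actual code; only the hard-coded `1.0` and the weight floor of `1` come from these notes.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    confidence: float = 1.0  # placeholder written at yield time, per the notes


def weighted_confidence(pages):
    """Content-weighted mean of per-page confidence (hypothetical pre-fix logic)."""
    # Assumed weighting: character count, floored at 1 so blank pages still count.
    weights = [max(len(p.text.strip()), 1) for p in pages]
    total = sum(w * p.confidence for w, p in zip(weights, pages))
    return total / sum(weights)


blank_doc = [Page(text=""), Page(text="   "), Page(text="x")]
print(weighted_confidence(blank_doc))  # every input is 1.0, so the mean is 1.0
```

Because every term in the average is the same placeholder, the weights cancel out and the result cannot drop below 1.0, whatever the pages contain.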
The fix (5 lines in `audit.py`)
- Re-score every page with `score_page(p.text, p.image_count)` before averaging.
- Stop flooring per-page weight at `1`. Blank pages now register zero weight in the denominator.
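The two changes can be sketched roughly as follows. `score_page(text, image_count)` is named in the fix, but its body here is a stand-in heuristic. The `Page` fields, the character-count weight, the `document_confidence_sketch` name, and the `0.0` result for a document with no content at all are likewise assumptions, not the actual patch.

```python
from dataclasses import dataclass


@dataclass
class Page:
    text: str
    image_count: int = 0


def score_page(text, image_count):
    """Stand-in heuristic (assumption): not the real pdfmux scorer."""
    chars = len(text.strip())
    if chars == 0:
        return 0.0  # nothing extracted, with or without images
    if chars < 20:
        return 0.2  # suspiciously short body
    return 1.0


def document_confidence_sketch(pages):
    """Re-score each page, then take a content-weighted mean."""
    scored = [(score_page(p.text, p.image_count), len(p.text.strip())) for p in pages]
    total_weight = sum(w for _, w in scored)  # no floor: blank pages weigh 0
    if total_weight == 0:
        return 0.0  # assumed convention for a fully empty document
    return sum(s * w for s, w in scored) / total_weight


print(document_confidence_sketch([Page(""), Page("", image_count=1)]))  # 0.0
print(document_confidence_sketch([Page("x")]))  # 0.2
```

With the floor removed, a blank page contributes nothing to the denominator, so it can no longer dilute or prop up the average. The all-blank case then needs an explicit guard against dividing by zero.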
Also added: `eval/` calibration harness
The instrument that surfaced the bug. Self-contained, three composable steps:
```bash
python eval/build_fixtures.py # generates 50 labeled PDFs
python eval/run_eval.py # runs pdfmux on each
python eval/calibrate.py # ROC + threshold recommendations
```
The first calibration run produced precision flat at 0.683 across every threshold from 0.0 to 0.95 — the smoking gun. After the fix, threshold `0.75` produces precision `1.00`, recall `0.71` on the 50-fixture set.
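A flat precision curve is what a constant score predicts. If every document scores 1.0, every threshold at or below 1.0 accepts the whole set, so precision equals the fraction of genuinely good documents wherever the cut is placed. The sketch below uses made-up labels, not the fixture set, and assumes "positive" means a document accepted at the threshold.

```python
def precision_at(threshold, scores, labels):
    """Precision of 'score >= threshold' as a predictor of a good extraction."""
    accepted = [good for s, good in zip(scores, labels) if s >= threshold]
    return sum(accepted) / len(accepted) if accepted else float("nan")


# Made-up data: 3 of 4 documents are genuinely good, but all score 1.0.
labels = [True, True, True, False]
scores = [1.0] * len(labels)

for t in (0.0, 0.5, 0.95):
    print(t, precision_at(t, scores, labels))  # 0.75 at every threshold
```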
Calibration headline
| Threshold | Precision | Recall | F1 |
|---:|---:|---:|---:|
| 0.50 | 0.821 | 0.821 | 0.821 |
| 0.75 | 1.000 | 0.714 | 0.833 |
`0.75` is the recommended default for `--min-confidence` when 1.7 ships strict mode as the default on `pdfmux convert`, a breaking change (target: 2026-05-08).
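The F1 column in the table is the harmonic mean of precision and recall, which can be checked directly from the two rows:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for threshold, p, r in [(0.50, 0.821, 0.821), (0.75, 1.000, 0.714)]:
    print(f"{threshold:.2f} -> F1 {f1(p, r):.3f}")
# 0.50 -> F1 0.821
# 0.75 -> F1 0.833
```

The higher threshold gives up some recall but removes all false accepts on this fixture set, which is why its F1 comes out slightly ahead.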
Tests
670 passing, 3 skipped. No regressions.
Install: `pip install -U pdfmux`
About NameetP/pdfmux
PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback)