This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: pdfmux is now a proper Python library with section-aware chunking for LLM pipelines.
Full changelog
What's New
Public Python API
pdfmux is now a proper Python library. Three importable functions — no CLI required:
import pdfmux
text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")
Section-Aware Chunking
New load_llm_context() returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. Designed for RAG pipelines and context windows.
chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
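As a rough illustration of how the per-chunk token estimates might be used, the sketch below packs sections into a fixed token budget. It does not call pdfmux: the sample chunks are made up and only mirror the documented keys, and pack_chunks is a hypothetical helper, not part of the library.

```python
def pack_chunks(chunks, max_tokens):
    """Greedily pick chunks in document order, skipping any that would exceed the budget."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = chunk["tokens"]
        if used + cost > max_tokens:
            continue  # this section does not fit; a later, smaller one still might
        selected.append(chunk)
        used += cost
    return selected, used


# Hypothetical data shaped like the chunks described above.
sample_chunks = [
    {"title": "Introduction", "text": "Intro text.", "page_start": 1, "page_end": 2, "tokens": 300},
    {"title": "Methods", "text": "Methods text.", "page_start": 3, "page_end": 6, "tokens": 900},
    {"title": "Results", "text": "Results text.", "page_start": 7, "page_end": 8, "tokens": 400},
]

selected, used = pack_chunks(sample_chunks, max_tokens=800)
for c in selected:
    print(f"{c['title']} (pages {c['page_start']}-{c['page_end']})")
```

In real use, the list returned by load_llm_context() would take the place of sample_chunks. Skipping oversized sections rather than stopping at the first one is a design choice; a pipeline that must preserve contiguous context may prefer to stop instead.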
LLM Output Format
New --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} per section.
pdfmux report.pdf -f llm
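A downstream script could use the confidence field to flag sections that need review. The sketch below is only illustrative: it assumes the output is a top-level JSON array of section objects and that confidence is a score between 0 and 1, neither of which is stated in these notes. The sample text and the low_confidence_sections helper are hypothetical.

```python
import json

# Hypothetical sample; the top-level array layout and the 0-1 confidence
# scale are assumptions, not confirmed by the release notes.
sample_output = """
[
  {"title": "Overview", "text": "Quarterly summary.", "page_start": 1,
   "page_end": 1, "tokens": 12, "confidence": 0.97},
  {"title": "Scanned appendix", "text": "Qu4rterly t0tals", "page_start": 2,
   "page_end": 3, "tokens": 9, "confidence": 0.58}
]
"""


def low_confidence_sections(json_text, threshold=0.8):
    """Return the titles of sections whose confidence is below the threshold."""
    sections = json.loads(json_text)
    return [s["title"] for s in sections if s["confidence"] < threshold]


flagged = low_confidence_sections(sample_output)
print(flagged)
```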
Locked JSON Schema
JSON output now includes a schema_version field (set to "0.4.0") and an ocr_pages field, so downstream consumers can rely on a stable structure.
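One way a consumer might take advantage of the locked schema is to check the version before processing. This is a minimal sketch: validate_output is a hypothetical helper, and the sample document assumes ocr_pages is a list of page numbers, which these notes do not specify.

```python
EXPECTED_SCHEMA = "0.4.0"


def validate_output(doc, expected=EXPECTED_SCHEMA):
    """Raise ValueError unless doc carries the expected schema_version and an ocr_pages key."""
    version = doc.get("schema_version")
    if version != expected:
        raise ValueError(f"unsupported schema_version: {version!r} (expected {expected!r})")
    if "ocr_pages" not in doc:
        raise ValueError("missing ocr_pages field")
    return doc


# Hypothetical parsed output; the value type of ocr_pages is an assumption.
sample_doc = {"schema_version": "0.4.0", "ocr_pages": [3, 4]}
validate_output(sample_doc)
```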
pdfmux analyze
Per-page extraction breakdown showing page type, quality, char count, confidence, and extractor used.
pdfmux analyze report.pdf
Install
pip install pdfmux
Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md
About NameetP/pdfmux
PDF extraction router with a built-in MCP server. Classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).