NameetP/pdfmux

v0.4.0 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr
opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

pdfmux is now a proper Python library with section‑aware chunking for LLM pipelines.

Full changelog

What's New

Public Python API

pdfmux is now a proper Python library. It exposes three importable functions, so no CLI is required:

import pdfmux

text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")

Section-Aware Chunking

The new load_llm_context() function returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. It is designed for RAG pipelines and for fitting documents into limited context windows.

chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
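Because each chunk carries a token estimate, consecutive sections can be grouped to fit a model's context window. The following is a minimal sketch, not part of pdfmux: `pack_chunks` is a hypothetical helper, the sample data is invented for illustration, and it assumes each chunk is a dict with the `title`, `tokens`, `page_start`, and `page_end` keys used above.

```python
from typing import Dict, List


def pack_chunks(chunks: List[Dict], budget: int) -> List[List[Dict]]:
    """Greedily group consecutive chunks so each group stays within `budget` tokens.

    A single chunk larger than the budget is placed in a group of its own.
    """
    batches: List[List[Dict]] = []
    current: List[Dict] = []
    used = 0
    for chunk in chunks:
        tokens = chunk["tokens"]
        # Start a new batch when adding this chunk would exceed the budget.
        if current and used + tokens > budget:
            batches.append(current)
            current, used = [], 0
        current.append(chunk)
        used += tokens
    if current:
        batches.append(current)
    return batches


# Hypothetical stand-in for the result of pdfmux.load_llm_context("report.pdf").
sample_chunks = [
    {"title": "Introduction", "text": "...", "page_start": 1, "page_end": 2, "tokens": 300},
    {"title": "Methods", "text": "...", "page_start": 3, "page_end": 5, "tokens": 500},
    {"title": "Results", "text": "...", "page_start": 6, "page_end": 8, "tokens": 400},
]

batches = pack_chunks(sample_chunks, budget=800)
for i, batch in enumerate(batches):
    titles = ", ".join(c["title"] for c in batch)
    print(f"batch {i}: {titles} ({sum(c['tokens'] for c in batch)} tokens)")
```

Grouping only consecutive chunks keeps sections in document order, which tends to matter when the batches are sent to a model as context.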

LLM Output Format

The new --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} for each section.

pdfmux report.pdf -f llm
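A downstream script could use the per-section confidence field to flag sections that need review. This is a rough sketch under several assumptions that the release notes do not confirm: that the output is a top-level JSON array of section objects, and that confidence is a number between 0 and 1. The helper name `low_confidence_sections` and the sample payload are hypothetical.

```python
import json
from typing import List

# Keys listed in the release notes for each section.
REQUIRED_KEYS = {"title", "text", "page_start", "page_end", "tokens", "confidence"}


def low_confidence_sections(raw: str, threshold: float = 0.8) -> List[str]:
    """Return titles of sections whose confidence falls below `threshold`.

    Assumes `raw` is a JSON array of section objects (an assumption,
    not a documented guarantee).
    """
    sections = json.loads(raw)
    flagged = []
    for section in sections:
        missing = REQUIRED_KEYS - section.keys()
        if missing:
            raise ValueError(f"section is missing keys: {sorted(missing)}")
        if section["confidence"] < threshold:
            flagged.append(section["title"])
    return flagged


# Hypothetical sample of what `pdfmux report.pdf -f llm` might emit.
sample_output = json.dumps([
    {"title": "Overview", "text": "...", "page_start": 1, "page_end": 1,
     "tokens": 120, "confidence": 0.95},
    {"title": "Appendix", "text": "...", "page_start": 9, "page_end": 12,
     "tokens": 640, "confidence": 0.6},
])

flagged = low_confidence_sections(sample_output)
print(flagged)
```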

Locked JSON Schema

JSON output now includes schema_version: "0.4.0" and an ocr_pages field, so downstream consumers can rely on a stable structure.
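A consumer can use schema_version to fail fast when the JSON layout is not one it was written for. The sketch below is illustrative only: `check_schema` and the sample document are hypothetical, and it assumes ocr_pages is a list of page numbers, which the release notes do not specify.

```python
from typing import Dict, List

# Schema versions this (hypothetical) consumer knows how to read.
SUPPORTED_VERSIONS = {"0.4.0"}


def check_schema(doc: Dict) -> List[int]:
    """Validate schema_version and return the ocr_pages value.

    Raises ValueError for a missing or unsupported version.
    """
    version = doc.get("schema_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported schema_version: {version!r}")
    # Assumption: ocr_pages is a list of page numbers that went through OCR.
    return doc.get("ocr_pages", [])


# Hypothetical stand-in for the result of pdfmux.extract_json("report.pdf").
sample_doc = {"schema_version": "0.4.0", "ocr_pages": [3, 4]}

ocr_pages = check_schema(sample_doc)
print(f"pages processed with OCR: {ocr_pages}")
```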

pdfmux analyze

The analyze command shows a per-page extraction breakdown: page type, quality, character count, confidence, and the extractor used.

pdfmux analyze report.pdf

Install

pip install pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md

About NameetP/pdfmux

PDF extraction router with built-in MCP server. Classifies each page (digital, scanned, tables) and routes to the best backend (PyMuPDF, Docling, OCR, or optional LLM fallback).
