NameetP/pdfmux

v0.4.0 Feature

This release adds a public Python API, section-aware chunking for LLM pipelines, and an LLM output format, along with a versioned JSON schema and a new `pdfmux analyze` command.

Published 4mo Developer Productivity

✓ No known CVEs patched

Topics

ai-agent docling document-parsing llm mcp ocr opendataloader pdf pdf-extraction pdf-to-json pdf-to-markdown python self-healing structured-extraction

Summary

AI summary

pdfmux is now a proper Python library with section‑aware chunking for LLM pipelines.

Full changelog

What's New

Public Python API

pdfmux is now a proper Python library. Three importable functions — no CLI required:

import pdfmux

text = pdfmux.extract_text("report.pdf")
data = pdfmux.extract_json("report.pdf")
chunks = pdfmux.load_llm_context("report.pdf")

Section-Aware Chunking

New load_llm_context() returns LLM-ready chunks split at heading boundaries, with per-chunk page tracking and token estimates. Designed for RAG pipelines and context windows.

chunks = pdfmux.load_llm_context("report.pdf")
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
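
For RAG pipelines with a fixed context window, the per-chunk token estimates can drive a simple budget-based selection. The following is a minimal sketch, not part of pdfmux: `pack_chunks` and the sample data are hypothetical, and it assumes each chunk is a dict with an integer `tokens` value alongside the `title` and page fields used in the loop above.

```python
from typing import Dict, List


def pack_chunks(chunks: List[Dict], max_tokens: int) -> List[Dict]:
    """Greedily keep chunks, in document order, that fit the token budget.

    Hypothetical helper, not part of pdfmux. Assumes every chunk carries
    an integer "tokens" estimate.
    """
    selected = []
    used = 0
    for chunk in chunks:
        cost = chunk["tokens"]
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected


# Made-up sample data standing in for pdfmux.load_llm_context("report.pdf")
sample_chunks = [
    {"title": "Introduction", "text": "...", "page_start": 1, "page_end": 2, "tokens": 300},
    {"title": "Methods", "text": "...", "page_start": 3, "page_end": 7, "tokens": 900},
    {"title": "Results", "text": "...", "page_start": 8, "page_end": 9, "tokens": 400},
]

picked = pack_chunks(sample_chunks, max_tokens=800)
print([c["title"] for c in picked])  # ['Introduction', 'Results']
```

Greedy selection in document order is only one option; a retrieval-based pipeline would more likely rank chunks by relevance first and then apply the same budget check.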

LLM Output Format

New --format llm CLI option outputs chunked JSON with {title, text, page_start, page_end, tokens, confidence} per section.

pdfmux report.pdf -f llm
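
Downstream code can use the `confidence` field to flag sections for manual review. The sketch below is illustrative only: it assumes the output is a JSON array of section objects and that `confidence` is a number between 0 and 1, neither of which is confirmed here. The function name, the threshold, and the sample payload are hypothetical.

```python
import json
from typing import Dict, List


def low_confidence_sections(raw_json: str, threshold: float = 0.8) -> List[Dict]:
    """Return the sections whose confidence falls below the threshold.

    Assumption: raw_json is a JSON array of objects with the keys
    title, text, page_start, page_end, tokens, confidence.
    """
    sections = json.loads(raw_json)
    return [s for s in sections if s["confidence"] < threshold]


# Made-up payload in the assumed shape, standing in for real CLI output
sample_output = json.dumps([
    {"title": "Overview", "text": "...", "page_start": 1, "page_end": 3,
     "tokens": 520, "confidence": 0.97},
    {"title": "Appendix", "text": "...", "page_start": 4, "page_end": 6,
     "tokens": 810, "confidence": 0.55},
])

flagged = low_confidence_sections(sample_output)
for s in flagged:
    print(f"Review '{s['title']}' (pages {s['page_start']}-{s['page_end']})")
```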

Locked JSON Schema

JSON output now includes schema_version: "0.4.0" and an ocr_pages field, so downstream consumers can rely on a stable structure.
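
A consumer can guard against schema drift by checking the version before reading anything else. This is a minimal sketch under stated assumptions: `schema_version` and `ocr_pages` are taken to be top-level keys, and `ocr_pages` is taken to be a list of page numbers. The helper name and the sample document are hypothetical.

```python
EXPECTED_SCHEMA_VERSION = "0.4.0"


def ocr_pages_if_supported(document: dict) -> list:
    """Return ocr_pages after confirming the schema version matches.

    Assumption: both fields sit at the top level of the parsed JSON.
    Raises ValueError on any other version so that a schema change
    fails loudly instead of being read with the wrong expectations.
    """
    version = document.get("schema_version")
    if version != EXPECTED_SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return document.get("ocr_pages", [])


# Made-up document in the assumed shape
sample_document = {"schema_version": "0.4.0", "ocr_pages": [2, 5]}
print(ocr_pages_if_supported(sample_document))  # [2, 5]
```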

`pdfmux analyze`

Per-page extraction breakdown showing page type, quality, char count, confidence, and extractor used.

pdfmux analyze report.pdf

Install

pip install pdfmux

Full changelog: https://github.com/NameetP/pdfmux/blob/main/CHANGELOG.md

About NameetP/pdfmux

PDF extraction router with a built-in MCP server. It classifies each page (digital, scanned, tables) and routes it to the best backend (PyMuPDF, Docling, OCR, or an optional LLM fallback).