docling

RAG & Retrieval

A document‑processing library that parses many file formats (PDF, Office, audio, images, etc.) and integrates with generative‑AI ecosystems.

Python · Latest v2.97.0 · 11h ago

Features

  • Parse multiple document formats including PDF, DOCX, PPTX, XLSX, HTML, audio (WAV/MP3), images, LaTeX, plain‑text and more
  • Advanced PDF understanding – layout, tables, code, formulas, image classification, etc.
  • Unified DoclingDocument representation for AI integrations (LangChain, LlamaIndex, CrewAI, Haystack)
  • Extensive OCR support for scanned documents and images
  • CLI tool and Python API with export options (Markdown, HTML, JSON, WebVTT, etc.)
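
The Python API and export options in the list above could be exercised roughly as follows. This is a minimal sketch, not the project's reference usage: it assumes docling v2's DocumentConverter class and the export_to_markdown, export_to_html and export_to_dict methods on DoclingDocument, which should be checked against the installed version. The helper names export_document and convert_file, and the format keys "markdown", "html" and "json", are hypothetical and exist only for this illustration.

```python
import json
from typing import Any

# Assumed DoclingDocument method names (docling v2); verify against your version.
_EXPORT_METHODS = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "json": "export_to_dict",
}


def export_document(doc: Any, fmt: str = "markdown") -> str:
    """Serialize a converted document to the requested text format.

    Hypothetical helper: `doc` is expected to behave like a DoclingDocument.
    """
    if fmt not in _EXPORT_METHODS:
        raise ValueError(
            f"unsupported format {fmt!r}; expected one of {sorted(_EXPORT_METHODS)}"
        )
    result = getattr(doc, _EXPORT_METHODS[fmt])()
    if fmt == "json":
        # export_to_dict is assumed to return a JSON-serializable dict.
        return json.dumps(result, indent=2)
    return result


def convert_file(source: str, fmt: str = "markdown") -> str:
    """Convert a local path or URL with docling and return the exported text.

    Hypothetical helper; requires `pip install docling`.
    """
    # Imported lazily so this module still loads where docling is not installed.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    conversion = converter.convert(source)
    return export_document(conversion.document, fmt)
```

With docling installed, calling convert_file("report.pdf") would return the document as Markdown, where report.pdf stands in for any supported input file. Keeping the export step separate from the conversion step means the same converted document can be written out in several formats without parsing the source again.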

Recent releases

View all 77 releases →

v2.97.0 · Breaking risk · Parameter rename · No immediate action
v2.96.1 · Bug fix · FFmpeg error + DrawingML text · No immediate action
v2.96.0 · Mixed · PDF backend + JSON fix + docs update · No immediate action
v2.95.0 · Bug fix · Preserve DOCX text on DrawingML images · No immediate action
v2.94.0 · New feature · TikZ rendering + new options + Vision 4.1 + HF · No immediate action

About

Stars
60,894
Forks
4,243
Languages
Python Shell Dockerfile

Install & Platforms

Install via
pip
Platforms
Linux, macOS, Windows, ARM64
