AIMLPM/markcrawl

v0.2.0 Feature

This release adds three notable features: an async httpx fetch engine, process-pool HTML extraction, and a streaming save pipeline.

Published 3mo ago | RAG & Retrieval

✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Crawling throughput rose from 3.4 to 15.7 pages/sec at concurrency=5, using async I/O for fetching and separate processes for HTML extraction.

Full changelog

3x faster crawling: async I/O overlaps network waits, and ProcessPoolExecutor sidesteps the GIL for truly parallel HTML extraction.

Async httpx engine replaces sequential requests — concurrent fetches with asyncio.gather
ProcessPoolExecutor offloads CPU-bound BeautifulSoup + markdownify to separate processes
Streaming pipeline via asyncio.as_completed — pages save as they arrive, no batch-wait
Benchmark: 15.7 pages/sec at concurrency=5 (up from 3.4 pages/sec in v0.1.1)
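
The pattern described in the changelog can be sketched with the standard library alone. This is an illustrative sketch, not markcrawl's actual code: `fetch`, `extract`, and `crawl_sketch` are hypothetical names, the fetch is simulated with a short sleep instead of an httpx request, and the extraction is a trivial string operation standing in for BeautifulSoup + markdownify. It uses asyncio.as_completed throughout, so each page is handled as soon as it finishes.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def extract(html: str) -> str:
    """CPU-bound stand-in for HTML-to-Markdown conversion (hypothetical)."""
    return html.replace("<p>", "").replace("</p>", "").strip()


async def fetch(url: str) -> str:
    """I/O-bound stand-in for an HTTP request (hypothetical, no network)."""
    await asyncio.sleep(0.01)
    return f"<p>content of {url}</p>"


async def crawl_sketch(urls, executor, concurrency=5):
    loop = asyncio.get_running_loop()
    limit = asyncio.Semaphore(concurrency)  # cap on simultaneous fetches
    saved = {}

    async def handle(url):
        async with limit:
            html = await fetch(url)
        # Hand the CPU-bound step to the executor so the event loop stays free.
        markdown = await loop.run_in_executor(executor, extract, html)
        return url, markdown

    tasks = [asyncio.create_task(handle(u)) for u in urls]
    # as_completed yields results in finish order, so there is no batch wait.
    for next_done in asyncio.as_completed(tasks):
        url, markdown = await next_done
        saved[url] = markdown  # a real crawler would write the page to disk here
    return saved


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(8)]
    with ProcessPoolExecutor(max_workers=2) as pool:
        pages = asyncio.run(crawl_sketch(urls, pool))
    print(len(pages), "pages processed")
```

Passing the executor in as a parameter keeps the sketch easy to try with a thread pool as well, although only a process pool gives parallelism for CPU-bound extraction under the GIL.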

pip install --upgrade markcrawl

The async engine activates automatically when httpx is installed:

pip install "markcrawl[http2]"
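
Since the async engine depends on httpx being present, a quick check of the current environment can help before benchmarking. The helper below is a hypothetical convenience using only the standard library; how markcrawl itself detects httpx is not stated here.

```python
import importlib.util


def httpx_available() -> bool:
    """Return True if the httpx package can be imported in this environment."""
    return importlib.util.find_spec("httpx") is not None


if __name__ == "__main__":
    print("httpx installed:", httpx_available())
```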

Or use it directly:

from markcrawl import crawl
result = crawl("https://example.com", out_dir="output", concurrency=5)

https://github.com/AIMLPM/markcrawl/compare/v0.1.1...v0.2.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.