AIMLPM/markcrawl

v0.2.0 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 1 month ago · RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Crawling performance improved to 15.7 pages/sec with async I/O and parallel processing.

Full changelog

What's new

3x faster crawling: async I/O overlaps network fetches, while ProcessPoolExecutor sidesteps the GIL for true parallel HTML extraction.

Performance

  • Async httpx engine replaces sequential requests — concurrent fetches with asyncio.gather
  • ProcessPoolExecutor offloads CPU-bound BeautifulSoup + markdownify to separate processes
  • Streaming pipeline via asyncio.as_completed — pages save as they arrive, no batch-wait
  • Benchmark: 15.7 pages/sec at concurrency=5 (up from 3.4 pages/sec in v0.1.1)
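
The pattern in the bullets above can be sketched with the standard library alone. This is an illustration under stated assumptions, not markcrawl's implementation: fake_fetch, extract, and crawl_stream are hypothetical names, the fetch is simulated with asyncio.sleep instead of httpx, and the "extraction" is a trivial string slice standing in for BeautifulSoup + markdownify.

```python
import asyncio
from concurrent.futures import Executor, ThreadPoolExecutor


async def fake_fetch(url: str) -> str:
    """Hypothetical stand-in for an HTTP GET; a real crawler would await an httpx client."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"<html><body><h1>{url}</h1></body></html>"


def extract(html: str) -> str:
    """Hypothetical stand-in for CPU-bound HTML-to-Markdown conversion."""
    start = html.index("<h1>") + len("<h1>")
    end = html.index("</h1>")
    return "# " + html[start:end]


async def crawl_stream(urls, executor: Executor, concurrency: int = 5):
    """Fetch concurrently, convert in the executor, collect pages in completion order."""
    loop = asyncio.get_running_loop()
    limit = asyncio.Semaphore(concurrency)  # cap the number of in-flight fetches

    async def handle(url):
        async with limit:
            html = await fake_fetch(url)
        # Offload conversion so the event loop keeps servicing other fetches.
        markdown = await loop.run_in_executor(executor, extract, html)
        return url, markdown

    tasks = [asyncio.create_task(handle(u)) for u in urls]
    pages = []
    for finished in asyncio.as_completed(tasks):
        # Each page is handled as soon as it is ready; a real crawler might write it to disk here.
        pages.append(await finished)
    return pages


def run_demo(urls):
    """Run the sketch with a thread pool so it works in any environment."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        return asyncio.run(crawl_stream(urls, pool, concurrency=3))
```

The sketch accepts any Executor. Passing a concurrent.futures.ProcessPoolExecutor instead of the thread pool runs extract in separate processes, which is what lets CPU-bound work escape the GIL; on platforms that start workers with spawn, that call belongs under an if __name__ == "__main__": guard.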

How to upgrade

pip install --upgrade markcrawl

The async engine activates automatically when httpx is installed:

pip install "markcrawl[http2]"

Or use it directly:

from markcrawl import crawl
result = crawl("https://example.com", out_dir="output", concurrency=5)
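
These notes do not show how the automatic activation is detected. One common way for a library to switch on an optional dependency is to check whether it can be imported; the sketch below illustrates that idea only. has_module and pick_engine are hypothetical names, and nothing here is taken from markcrawl's source.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` is importable here."""
    return importlib.util.find_spec(name) is not None


def pick_engine() -> str:
    """Hypothetical selection logic: use the async path only when httpx is present."""
    return "async" if has_module("httpx") else "sync"


print(f"httpx installed: {has_module('httpx')}, engine: {pick_engine()}")
```

Running this before and after installing the extra is a quick way to confirm that httpx is visible to the interpreter you crawl with.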

Full changelog

https://github.com/AIMLPM/markcrawl/compare/v0.1.1...v0.2.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
