Skip to content

AIMLPM/markcrawl

v0.10.2 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo RAG & Retrieval
✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm
+9 more
markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Fixed regression where pathological sitemap-indexes caused excessive pre‑enumeration time by adding a 60 s wallclock budget to sitemap discovery.

Full changelog

tl;dr

Patch release fixing a regression surfaced by llm-crawler-benchmarks against v0.10.1: pathological sitemap-indexes (ikea: 2,113 locale shards) consumed 200+ s in pre-enumeration before any page got crawled, tripping benchmark zero-output watchdogs (120 s).

The sitemap-discovery phase now has a 60 s wallclock budget shared across all top-level sitemaps + their recursive children. When the budget fires, the parser returns whatever URLs it has collected so far and the crawl proceeds normally.

Verified locally on the failing sites

| Site | v0.10.1 | v0.10.2 |
|---------------------------|----------------------------|-----------------------------|
| ikea (max_pages=30) | 0 pages (heartbeat fired) | 30 pages saved in 49.7 s |
| huggingface-transformers | regression on bench CI | 30 pages saved in 36.2 s |

What changed

  • markcrawl.robots.parse_sitemap_xml and parse_sitemap_xml_async: new time_budget_s kwarg (default 60.0), threaded through recursion via the internal _deadline. Async path switches from asyncio.gather to asyncio.as_completed so pending child-sitemap tasks are cancelled rather than awaited once the budget fires.
  • markcrawl.core: both sync and async crawl paths instantiate a shared deadline at the start of sitemap discovery.
  • 2 new tests in tests/test_sitemap_parallel.py covering the short-circuit and the no-op default.
  • 500 tests passing (was 498).

Compatibility

No CLI flag changes. No behavior change for sites with normal sitemaps (which finish in <10 s anyway). Only the pathological-index path is affected.

For benchmark integrators

pip install --upgrade markcrawl==0.10.2 and re-run the previously failing sites. Crawl wallclock for ikea drops from "timeout, 0 pages" to "max_pages saved within budget."

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track AIMLPM/markcrawl

Get notified when new releases ship.

Sign up free

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.

All releases →

Related context

Beta — feedback welcome: [email protected]