This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Fixes a regression in which pathological sitemap indexes caused excessive pre-enumeration time. Sitemap discovery now has a 60 s wallclock budget.
Full changelog
tl;dr
Patch release fixing a regression surfaced by llm-crawler-benchmarks against v0.10.1: pathological sitemap-indexes (ikea: 2,113 locale shards) consumed 200+ s in pre-enumeration before any page got crawled, tripping benchmark zero-output watchdogs (120 s).
The sitemap-discovery phase now has a 60 s wallclock budget shared across all top-level sitemaps + their recursive children. When the budget fires, the parser returns whatever URLs it has collected so far and the crawl proceeds normally.
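The shared-budget idea can be sketched as follows. This is a minimal illustration, not markcrawl's actual code: `collect_urls`, `fetch`, `clock`, and the dict-shaped sitemap documents are hypothetical, and only the names `time_budget_s` and `_deadline` are borrowed from the release notes. The top-level call fixes one absolute deadline, every recursive call reuses it, and the walk returns partial results once the deadline passes.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical parsed sitemap: {"pages": [...page URLs], "children": [...sitemap URLs]}
SitemapDoc = Dict[str, List[str]]


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], SitemapDoc],
    time_budget_s: float = 60.0,
    _deadline: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    """Walk a sitemap tree and return whatever was collected when the budget expires."""
    # Only the outermost call computes the deadline; recursion passes it down unchanged,
    # so all children share a single wallclock budget.
    if _deadline is None:
        _deadline = clock() + time_budget_s

    urls: List[str] = []
    if clock() >= _deadline:
        return urls

    doc = fetch(sitemap_url)
    urls.extend(doc.get("pages", []))

    for child in doc.get("children", []):
        if clock() >= _deadline:
            break  # budget fired: keep partial results and stop descending
        urls.extend(collect_urls(child, fetch, time_budget_s, _deadline, clock))
    return urls
```

Passing an absolute deadline rather than a remaining duration means a slow shard deep in the tree cannot reset the clock for its siblings. The caller always gets a list back, so the crawl can proceed with whatever was enumerated.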
Verified locally on the failing sites
| Site | v0.10.1 | v0.10.2 |
|---------------------------|----------------------------|-----------------------------|
| ikea (max_pages=30) | 0 pages (heartbeat fired) | 30 pages saved in 49.7 s |
| huggingface-transformers | regression on bench CI | 30 pages saved in 36.2 s |
What changed
- `markcrawl.robots.parse_sitemap_xml` and `parse_sitemap_xml_async`: new `time_budget_s` kwarg (default `60.0`), threaded through recursion via the internal `_deadline`. Async path switches from `asyncio.gather` to `asyncio.as_completed` so pending child-sitemap tasks are cancelled rather than awaited once the budget fires.
- `markcrawl.core`: both sync and async crawl paths instantiate a shared deadline at the start of sitemap discovery.
- 2 new tests in `tests/test_sitemap_parallel.py` covering the short-circuit and the no-op default.
- 500 tests passing (was 498).
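The move from `asyncio.gather` to `asyncio.as_completed` can be illustrated with a small sketch. This is an assumption-laden example, not the library's implementation: `gather_until_deadline` and `fetch_child` are hypothetical names, and a relative `timeout` stands in for the absolute `_deadline` described above. Results are consumed as each child finishes, and any task still pending when the budget fires is cancelled instead of awaited.

```python
import asyncio
from typing import Awaitable, Callable, List


async def gather_until_deadline(
    child_urls: List[str],
    fetch_child: Callable[[str], Awaitable[List[str]]],
    time_budget_s: float = 60.0,
) -> List[str]:
    """Collect child-sitemap results until the budget fires, then cancel the rest."""
    tasks = [asyncio.ensure_future(fetch_child(url)) for url in child_urls]
    collected: List[str] = []
    try:
        # as_completed yields awaitables in completion order; once the timeout
        # elapses, awaiting the next one raises asyncio.TimeoutError.
        for next_done in asyncio.as_completed(tasks, timeout=time_budget_s):
            collected.extend(await next_done)
    except asyncio.TimeoutError:
        pass  # budget fired: keep whatever finished in time
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()
        # Wait for cancellations to settle so no task is left pending.
        await asyncio.gather(*tasks, return_exceptions=True)
    return collected
```

With `asyncio.gather`, one slow child would hold back every result until it finished. Consuming results in completion order lets fast children contribute their URLs even when a slow sibling never returns within the budget.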
Compatibility
No CLI flag changes. No behavior change for sites with normal sitemaps (which finish in <10 s anyway). Only the pathological-index path is affected.
For benchmark integrators
Run `pip install --upgrade markcrawl==0.10.2` and re-run the previously failing sites. Crawl wallclock for ikea drops from "timeout, 0 pages" to "max_pages saved within budget."
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]