This release includes no breaking changes; platform teams can plan the upgrade accordingly.
✓ No known CVEs patched in this version
Summary
AI summary: Idle-timeout now resets on any meaningful progress event, fixing premature termination during bursty crawls.
Full changelog
The v0.10.3 idle-timeout clock reset only on save_page, so it mis-fired on bursty crawls where the engine was fetching pages successfully but most were deduplicated or fell under min_words. The public benchmark surfaced this on huggingface-transformers (21/200 pages saved before the timer fired at 120 s).
Fix
The idle-timeout clock now resets on any meaningful progress event:
- save_page (already in v0.10.3)
- successful HTTP 2xx response
- discover_links call that adds at least one new URL to the queue
4xx / 5xx responses do not reset the clock — anti-bot loops still get caught.
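The reset rules above can be sketched as a small watchdog class. This is an illustrative model only, not markcrawl's actual implementation: the class name `IdleWatchdog`, its method names, and the injectable clock are all assumptions made for the example. Treating 3xx responses as non-progress is also an assumption, since the notes only mention 2xx, 4xx, and 5xx.

```python
import time
from typing import Callable


class IdleWatchdog:
    """Hypothetical idle-timeout clock that resets on progress events."""

    def __init__(self, timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_progress = clock()

    def _reset(self) -> None:
        self._last_progress = self._clock()

    def on_save_page(self) -> None:
        # Saving a page always counts as progress (the v0.10.3 behaviour).
        self._reset()

    def on_response(self, status: int) -> None:
        # Only 2xx resets the clock; 4xx/5xx (and, by assumption, 3xx) do not,
        # so an anti-bot loop of error responses still times out.
        if 200 <= status < 300:
            self._reset()

    def on_discover_links(self, new_urls_added: int) -> None:
        # Discovery counts only if it actually grew the queue.
        if new_urls_added >= 1:
            self._reset()

    def is_stalled(self) -> bool:
        return self._clock() - self._last_progress >= self.timeout_s


class FakeClock:
    """Manually advanced clock so the behaviour can be checked without sleeping."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now
```

With the fake clock, a 200 response at t=100 keeps the watchdog alive at t=200, while a later 403 does not extend the deadline, so the watchdog reports a stall once 120 s have passed since the last 2xx.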
Empirical verification
A fresh crawl of huggingface-transformers at max_pages=200:
| version | pages saved | elapsed |
|---|---|---|
| v0.10.3 | 21 | 120 s (timer fired early) |
| v0.10.4 | 174 | 236 s (graceful exit) |
Roughly 8x more pages saved (174 vs. 21) on the bursty-discovery case; the idle timer now functions as a true deadlock detector, not a save-rate guard.
CrawlResult API additions (additive only)
- first_status: Optional[int]. The first observable HTTP status. Lets callers distinguish engine bugs from external WAF/anti-bot blocks without scraping logs.
- stalled: bool. True when the run was terminated by the idle-timeout watchdog rather than running out of work or hitting max_pages.
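As a rough illustration of how a caller might use the two new fields, the sketch below defines a stand-in dataclass carrying only `first_status` and `stalled`. The real CrawlResult has other fields that are not shown here, and the `describe_exit` helper and its labels are hypothetical, not part of the markcrawl API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawlResultSketch:
    """Stand-in with only the two new fields; not the real CrawlResult."""
    first_status: Optional[int] = None
    stalled: bool = False


def describe_exit(result: CrawlResultSketch) -> str:
    """Hypothetical helper: summarise why a run ended, without reading logs."""
    # A 4xx/5xx first response suggests an external block or server error.
    if result.first_status is not None and result.first_status >= 400:
        return "blocked-or-error"
    # Otherwise a stall points at the idle-timeout watchdog.
    if result.stalled:
        return "stalled"
    # Ran out of work or hit max_pages.
    return "finished"
```

Checking `first_status` before `stalled` is a design choice for this sketch: a run that was blocked from its first request may also end in a stall, and the block is the more useful diagnosis.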
Pre-release smoke harness
New bench/local_replica/release_smoke.py runs crawl() against ~4 real sites with per-site baselines. It treats first_status >= 400 with 0 pages saved as BLOCKED (skip, not fail) so transient WAF blocks don't raise false alarms. It catches stall-detection regressions, coverage regressions, and anti-bot diagnostic regressions in 5-10 min.
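The BLOCKED rule could look something like the following. This is a sketch under stated assumptions, not the contents of release_smoke.py: the `Verdict` enum, the `classify_site` function, and the idea that a baseline is a minimum number of saved pages are all invented for illustration.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    BLOCKED = "blocked"  # skipped, not counted as a failure


def classify_site(first_status: Optional[int], pages_saved: int,
                  baseline_pages: int) -> Verdict:
    """Hypothetical per-site verdict; baseline_pages is an assumed minimum."""
    # An error status on the first response plus nothing saved looks like
    # a WAF or anti-bot block, so the site is skipped rather than failed.
    if first_status is not None and first_status >= 400 and pages_saved == 0:
        return Verdict.BLOCKED
    # Otherwise compare coverage against the site's baseline.
    if pages_saved >= baseline_pages:
        return Verdict.PASS
    return Verdict.FAIL
```

Under this rule a site that returns 403 but still saves some pages is judged against its baseline as usual, so only a complete block is skipped.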
Migration
No breaking changes. Users who set MARKCRAWL_IDLE_TIMEOUT_S=300 to work around the v0.10.3 mis-fire can drop the override — 120 s is correct again.
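To find deployments that still carry the workaround, a small check along these lines may help. Only the variable name MARKCRAWL_IDLE_TIMEOUT_S comes from the notes above; the `leftover_override` function is a hypothetical helper and says nothing about how markcrawl itself reads the variable.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "MARKCRAWL_IDLE_TIMEOUT_S"


def leftover_override(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the override value if it is still set and non-empty, else None."""
    value = env.get(ENV_VAR)
    return value if value else None


if __name__ == "__main__":
    found = leftover_override()
    if found is not None:
        print(f"{ENV_VAR}={found} is still set; it can be removed on v0.10.4.")
```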
528 tests passing (was 521 on v0.10.3; +7 covering the new reset paths).
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.