✓ No known CVEs patched in this version
Summary
AI summary: Partial-write recovery, discovery-exhaustion stall detection with a configurable idle timeout, and zero-page diagnostic logging.
Full changelog
Three generalizable resilience fixes surfaced by the public llm-crawler-benchmarks v1.3 cycle. All site-agnostic — none reference the sites or site classes that surfaced them.
Fixes
Partial-write recovery. pages.jsonl is now opened line-buffered (buffering=1) and save_page flushes after every row. SIGKILL or external watchdog termination no longer leaves an empty JSONL on disk, because every row written so far has already been handed to the OS page cache.
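The write pattern described above can be sketched roughly as follows. This is a minimal illustration, not markcrawl's actual code: `open_pages_file`, `demo`, and this `save_page` body are hypothetical, and only the `buffering=1` and flush-per-row behavior come from the release notes.

```python
import json
import os
import tempfile


def open_pages_file(path):
    # Hypothetical helper: buffering=1 selects line buffering in text mode.
    return open(path, "a", buffering=1, encoding="utf-8")


def save_page(fh, row):
    # One JSON object per line, then an explicit flush so the row leaves
    # the process's userspace buffer immediately.
    fh.write(json.dumps(row, ensure_ascii=False) + "\n")
    fh.flush()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "pages.jsonl")
    writer = open_pages_file(path)
    save_page(writer, {"url": "https://example.com/", "words": 120})
    save_page(writer, {"url": "https://example.com/about", "words": 85})
    # Read through a second handle while the writer is still open. This is
    # roughly what would remain readable if the writer were killed here.
    with open(path, encoding="utf-8") as reader:
        rows = [json.loads(line) for line in reader]
    writer.close()
    return rows
```

Note that `flush()` hands data to the operating system but does not force it to physical storage. That is enough to survive the process being killed, though not a power loss; the latter would need `os.fsync`, which the release notes do not mention.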
Discovery-exhaustion stall detection (idle_timeout_s). The engine tracks _last_save_time and terminates gracefully when no new page has been saved for idle_timeout_s seconds (default 120). This catches crawls that keep churning through the link graph after all reachable pages are exhausted, without relying on site-specific heuristics.
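A stall check of this kind might look like the sketch below. The `IdleWatch` class, its method names, and the injectable `clock` are assumptions made for illustration; only `idle_timeout_s`, `_last_save_time`, the 120-second default, and "0 disables" come from the release notes. Whether the real engine compares with `>=` or `>` is not stated.

```python
import time
from typing import Callable


class IdleWatch:
    """Hypothetical stand-in for the engine's idle tracking."""

    def __init__(self, idle_timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock  # injectable so the logic can be tested
        self._last_save_time = clock()

    def mark_saved(self) -> None:
        # Call this whenever a page is written.
        self._last_save_time = self._clock()

    def is_stalled(self) -> bool:
        # A timeout of 0 (or less) disables the check.
        if self.idle_timeout_s <= 0:
            return False
        idle_for = self._clock() - self._last_save_time
        return idle_for >= self.idle_timeout_s
```

A monotonic clock is used here so that wall-clock adjustments cannot trigger or suppress the timeout. A crawl loop would call `is_stalled()` between fetches and exit cleanly when it returns True.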
0-page diagnostic. Engine captures the first observed HTTP status. On crawls that finish with pages_saved == 0, logs a class-aware warning: 4xx/5xx → likely anti-bot block, 200 → likely min_words too high or JS-rendered, no response → seed unreachable / DNS error.
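The classification above could be expressed along these lines. The function name, return strings, and the fallback branch for other status codes are hypothetical; the three cases themselves are taken from the release notes.

```python
from typing import Optional


def diagnose_zero_pages(pages_saved: int,
                        first_status: Optional[int]) -> Optional[str]:
    """Return a hint for an empty crawl, or None if pages were saved."""
    if pages_saved != 0:
        return None
    if first_status is None:
        # No HTTP response was ever observed.
        return "seed unreachable or DNS error"
    if 400 <= first_status < 600:
        return "likely anti-bot block"
    if first_status == 200:
        return "likely min_words too high or JS-rendered page"
    # Assumption: statuses outside the documented classes get a generic hint.
    return f"unclassified first status {first_status}"
```

For example, a crawl that saved nothing and first saw a 403 would map to the anti-bot hint, while one that first saw a 200 points at content filtering or client-side rendering.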
API additions (additive only)
- crawl(..., idle_timeout_s: Optional[float] = None)
- CrawlEngine / AsyncCrawlEngine accept idle_timeout_s kwarg
- MARKCRAWL_IDLE_TIMEOUT_S env var
- DEFAULT_IDLE_TIMEOUT_S = 120.0 module constant
- Set idle_timeout_s=0 (or env to 0) to disable
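One plausible way the keyword argument, environment variable, and module constant could combine is sketched below. The precedence order (explicit argument, then environment, then default) is an assumption, as is the `resolve_idle_timeout` helper; the release notes name the three inputs but do not state how they are resolved.

```python
import os
from typing import Mapping, Optional

DEFAULT_IDLE_TIMEOUT_S = 120.0  # value taken from the release notes


def resolve_idle_timeout(idle_timeout_s: Optional[float] = None,
                         env: Optional[Mapping[str, str]] = None) -> float:
    """Hypothetical resolver: explicit kwarg > env var > default."""
    env = os.environ if env is None else env
    if idle_timeout_s is not None:
        return float(idle_timeout_s)
    raw = env.get("MARKCRAWL_IDLE_TIMEOUT_S")
    if raw is not None and raw.strip():
        return float(raw)
    return DEFAULT_IDLE_TIMEOUT_S
```

Under this assumed ordering, `resolve_idle_timeout(0)` disables the check regardless of the environment, and an unset variable falls back to 120 seconds.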
Verification
- 521 tests passing (was 500 on v0.10.2; +21 in tests/test_v0103_resilience.py)
- Ruff lint clean
- All 4 Python versions (3.10–3.13) green on CI
Migration
No breaking changes. Default idle_timeout_s=120 is generous and only fires on genuine stalls. Users running long-blocked crawls intentionally (e.g. waiting on slow renders) can pass idle_timeout_s=0.
See CHANGELOG.md for full details.
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.