AIMLPM/markcrawl

v0.10.3 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 2mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Partial-write recovery, discovery-exhaustion stall detection with configurable idle timeout, and zero-page diagnostic logging.

Full changelog

Three generalizable resilience fixes surfaced by the public llm-crawler-benchmarks v1.3 cycle. All site-agnostic — none reference the sites or site classes that surfaced them.

Fixes

Partial-write recovery. pages.jsonl is now opened line-buffered (buffering=1) and save_page flushes after every row. SIGKILL or external watchdog termination no longer leaves an empty JSONL on disk: every row written before the kill has already been handed to the OS page cache, which survives the death of the process.
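A minimal sketch of the line-buffered, flush-per-row pattern described above. JsonlPageWriter and read_rows are hypothetical names for illustration only, not markcrawl's actual classes; only buffering=1 and the flush after each row come from the release notes.

```python
import json


class JsonlPageWriter:
    """Hypothetical stand-in for the writer behind save_page."""

    def __init__(self, path):
        # buffering=1 selects line buffering for text-mode files.
        self._fh = open(path, "a", encoding="utf-8", buffering=1)

    def save_page(self, row):
        # One JSON object per line, terminated by a newline.
        self._fh.write(json.dumps(row, ensure_ascii=False) + "\n")
        # Explicit flush pushes the row out of Python's buffer to the OS.
        self._fh.flush()

    def close(self):
        self._fh.close()


def read_rows(path):
    """Read back whatever rows are currently visible in the file."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

With this pattern a second reader can see each row while the writer is still open, which is the property that makes rows recoverable after the writing process is killed. It does not by itself guarantee durability across power loss, since that would need an fsync, which the release notes do not mention.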

Discovery-exhaustion stall detection (idle_timeout_s). Engine tracks _last_save_time and terminates gracefully when no new page has been saved for idle_timeout_s seconds (default 120). This catches crawls that keep churning through the link graph after all reachable pages have been saved, without site-specific heuristics.
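The idle check can be sketched roughly as follows. IdleWatch, mark_saved, is_stalled, and the injectable clock are hypothetical and exist only to illustrate the idea; the release notes confirm only the _last_save_time field, the 120-second default, and that 0 disables the check.

```python
import time
from typing import Callable


class IdleWatch:
    """Hypothetical helper that tracks the time of the last saved page."""

    def __init__(self, idle_timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic):
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._last_save_time = clock()

    def mark_saved(self) -> None:
        # Call whenever a page is written to the output.
        self._last_save_time = self._clock()

    def is_stalled(self) -> bool:
        # A timeout of 0 (or less) disables stall detection.
        if self.idle_timeout_s <= 0:
            return False
        idle_for = self._clock() - self._last_save_time
        return idle_for >= self.idle_timeout_s
```

A crawl loop would call mark_saved() after each saved page and stop fetching once is_stalled() returns True. A monotonic clock is used here so that wall-clock adjustments cannot trigger or suppress the timeout; whether the engine does the same is not stated in the release notes.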

0-page diagnostic. Engine captures the first observed HTTP status. On crawls that finish with pages_saved == 0, logs a class-aware warning: 4xx/5xx → likely anti-bot block, 200 → likely min_words too high or JS-rendered, no response → seed unreachable / DNS error.
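The status-to-hint mapping above might look like this in code. zero_page_hint is a hypothetical function, not part of markcrawl, and the fallback branch for statuses the release notes do not cover (such as 3xx) is an assumption of this sketch.

```python
from typing import Optional


def zero_page_hint(pages_saved: int, first_status: Optional[int]) -> Optional[str]:
    """Return a diagnostic hint for a crawl that saved no pages, else None."""
    if pages_saved != 0:
        return None  # the diagnostic only applies to zero-page crawls
    if first_status is None:
        return "seed unreachable / DNS error"
    if 400 <= first_status < 600:
        return "likely anti-bot block"
    if first_status == 200:
        return "likely min_words too high or JS-rendered"
    # Assumption: statuses outside the documented classes get a generic hint.
    return f"unclassified first status {first_status}"
```

For example, zero_page_hint(0, 403) yields the anti-bot hint, while zero_page_hint(0, None) points at an unreachable seed.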

API additions (additive only)

crawl(..., idle_timeout_s: Optional[float] = None)
CrawlEngine / AsyncCrawlEngine accept idle_timeout_s kwarg
MARKCRAWL_IDLE_TIMEOUT_S env var
DEFAULT_IDLE_TIMEOUT_S = 120.0 module constant
Set idle_timeout_s=0 (or env to 0) to disable
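One plausible way these additions fit together is sketched below. resolve_idle_timeout is a hypothetical helper, and the precedence it uses (explicit kwarg, then MARKCRAWL_IDLE_TIMEOUT_S, then DEFAULT_IDLE_TIMEOUT_S) is an assumption; the release notes list the three sources but do not state how they are ordered.

```python
import os
from typing import Mapping, Optional

DEFAULT_IDLE_TIMEOUT_S = 120.0  # value stated in the release notes


def resolve_idle_timeout(idle_timeout_s: Optional[float] = None,
                         env: Mapping[str, str] = os.environ) -> float:
    """Hypothetical resolver; the kwarg > env > default order is assumed."""
    if idle_timeout_s is not None:
        return float(idle_timeout_s)
    raw = env.get("MARKCRAWL_IDLE_TIMEOUT_S")
    if raw is not None and raw.strip():
        return float(raw)
    return DEFAULT_IDLE_TIMEOUT_S
```

Under this assumed order, an unset kwarg and empty environment resolve to 120.0, and setting the environment variable to "0" resolves to 0.0, which disables the timeout.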

Verification

521 tests passing (was 500 on v0.10.2; +21 in tests/test_v0103_resilience.py)
Ruff lint clean
All 4 Python versions (3.10–3.13) green on CI

Migration

No breaking changes. The default idle_timeout_s=120 is generous and only fires on genuine stalls. Users who intentionally run crawls with long gaps between saved pages (e.g. waiting on slow renders) can pass idle_timeout_s=0 to disable the check.

See CHANGELOG.md for full details.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
