Skip to content

AIMLPM/markcrawl

v0.10.4 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 1mo RAG & Retrieval
✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm
+9 more
markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Idle‑timeout now resets on any meaningful progress event, fixing premature termination during bursty crawls.

Full changelog

The v0.10.3 idle-timeout reset only on save_page, which mis-fired on bursty crawls where the engine was successfully fetching pages but most were getting deduped or were under min_words. The public benchmark surfaced this on huggingface-transformers (21/200 pages saved before the timer fired at 120 s).

Fix

The idle-timeout clock now resets on any meaningful progress event:

  • save_page (already in v0.10.3)
  • successful HTTP 2xx response
  • discover_links call that adds at least one new URL to the queue

4xx / 5xx responses do not reset the clock — anti-bot loops still get caught.

Empirical verification

A fresh crawl of huggingface-transformers at max_pages=200:

| version | pages saved | elapsed |
|---|---|---|
| v0.10.3 | 21 | 120 s (timer fired early) |
| v0.10.4 | 174 | 236 s (graceful exit) |

8x improvement on the bursty-discovery case; idle timer now functions as a true deadlock detector, not a save-rate guard.

CrawlResult API additions (additive only)

  • first_status: Optional[int] — first observable HTTP status. Lets callers distinguish engine bugs from external WAF/anti-bot blocks without scraping logs.
  • stalled: boolTrue when the run was terminated by the idle-timeout watchdog rather than running out of work or hitting max_pages.

Pre-release smoke harness

New bench/local_replica/release_smoke.py runs crawl() against ~4 real sites with per-site baselines. Treats first_status >= 400 + 0 pages as BLOCKED (skip, not fail) so transient WAF blocks don't false-alarm. Catches stall-detection regressions, coverage regressions, and anti-bot diagnostic regressions in 5-10 min.

Migration

No breaking changes. Users who set MARKCRAWL_IDLE_TIMEOUT_S=300 to work around the v0.10.3 mis-fire can drop the override — 120 s is correct again.

528 tests passing (was 521 on v0.10.3; +7 covering the new reset paths).

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track AIMLPM/markcrawl

Get notified when new releases ship.

Sign up free

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.

All releases →

Related context

Beta — feedback welcome: [email protected]