Skip to content

AIMLPM/markcrawl

v0.10.5 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 1mo RAG & Retrieval
✓ No known CVEs patched
Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm
+9 more
markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Engine now auto‑broadens scope one level when budget remains and the site is a docs hub.

Full changelog

When a crawl exhausts its narrow auto-derived scope (e.g. /docs/concepts/* from a kubernetes seed) with budget remaining, the engine now attempts one-level broadening (/docs/concepts/*/docs/*) before terminating. URLs filtered under the previous scope are stashed during link discovery and replayed through the broader scope.

Empirical verification (real network, max_pages=400)

| Site | v0.10.4 | v0.10.5 | Delta |
|---|---|---|---|
| kubernetes-docs | 195/400 | 400/400 | +105% |
| rust-book | 111 | 111 | unchanged (guardrail held) |
| postgres-docs | 80 | 80 | unchanged |
| newegg | 1 | 1 | unchanged (engine handles WAF gracefully) |

Rust-book is deliberately unchanged: its Tier 0 single-segment scope /book/* cannot broaden short of whole-host, which the guardrail blocks. We don't auto-pull /std/, /cargo/, /nomicon/ even though crawl4ai-raw does — those are different publications, and our scope honors the seed's intent.

Guardrails

Broadening fires only when:

  1. Scope was auto-derived. User-explicit include_paths is respected as intent and never mutated.
  2. Current scope's leftmost segment is in _DOCS_HUB_MARKERS (docs, book, learn, tutorial, guide, reference, manual, handbook, api, etc.) or the site classifies as docs / apiref by hostname.
  3. One-level broadening doesn't land at whole-host (/*).
  4. Cap of _DEFAULT_MAX_BROADEN_EVENTS = 2 per crawl.

API additions (additive only)

  • CrawlResult.scope_history: List[List[str]] — sequence of include_paths patterns the crawl traversed. Auditable. Empty if no scope was set.

Migration

No breaking changes. Behavior preserved exactly when the user passes include_paths explicitly. For default crawls on docs sites, expect more pages and the same (or better) signal-to-noise — the broadening guardrail is intentionally tight (docs hub markers only, no whole-host fallback).

549 tests passing (was 528 on v0.10.4; +21 in tests/test_v0105_adaptive_scope.py).

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Track AIMLPM/markcrawl

Get notified when new releases ship.

Sign up free

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.

All releases →

Related context

Beta — feedback welcome: [email protected]