Summary
The engine now auto-broadens scope one level when budget remains and the site is a docs hub.
Full changelog
When a crawl exhausts its narrow auto-derived scope (e.g. /docs/concepts/* from a kubernetes seed) with budget remaining, the engine now attempts one-level broadening (/docs/concepts/* → /docs/*) before terminating. URLs filtered under the previous scope are stashed during link discovery and replayed through the broader scope.
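The broaden-and-replay step described above can be sketched roughly as follows. This is an illustrative sketch, not markcrawl's implementation: the function names `broaden_one_level`, `in_scope`, and `replay_stash` are hypothetical, and it assumes scope patterns always take the simple `/prefix/*` form.

```python
from typing import List, Optional


def broaden_one_level(pattern: str) -> Optional[str]:
    """Drop the last path segment: '/docs/concepts/*' -> '/docs/*'.

    Returns None when broadening would land at whole-host ('/*').
    Hypothetical helper; assumes patterns look like '/a/b/*'.
    """
    segments = [s for s in pattern.strip("/").split("/") if s and s != "*"]
    if len(segments) <= 1:
        return None  # e.g. '/book/*' could only broaden to '/*'
    return "/" + "/".join(segments[:-1]) + "/*"


def in_scope(path: str, pattern: str) -> bool:
    """Prefix match for a pattern that ends in '*' (assumed form)."""
    prefix = pattern[:-1]  # '/docs/*' -> '/docs/'
    return path.startswith(prefix)


def replay_stash(stash: List[str], pattern: str) -> List[str]:
    """Return the stashed paths that the broader pattern now admits."""
    return [path for path in stash if in_scope(path, pattern)]


# Paths filtered out under the narrow scope during link discovery.
stash = ["/docs/tasks/run-app", "/blog/2024/news", "/docs/reference/kubectl"]
broader = broaden_one_level("/docs/concepts/*")
if broader is not None:
    print(broader, replay_stash(stash, broader))
```

Stashing filtered URLs during discovery means the broader scope can be served from links already seen, without re-fetching pages to rediscover them.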
Empirical verification (real network, max_pages=400)
| Site | v0.10.4 (pages crawled) | v0.10.5 (pages crawled) | Delta |
|---|---|---|---|
| kubernetes-docs | 195/400 | 400/400 | +105% |
| rust-book | 111/400 | 111/400 | unchanged (guardrail held) |
| postgres-docs | 80/400 | 80/400 | unchanged |
| newegg | 1/400 | 1/400 | unchanged (engine handles WAF gracefully) |
Rust-book is deliberately unchanged: its Tier 0 single-segment scope /book/* cannot broaden short of whole-host, which the guardrail blocks. We don't auto-pull /std/, /cargo/, /nomicon/ even though crawl4ai-raw does — those are different publications, and our scope honors the seed's intent.
Guardrails
Broadening fires only when:
- Scope was auto-derived. User-explicit `include_paths` is respected as intent and never mutated.
- Current scope's leftmost segment is in `_DOCS_HUB_MARKERS` (`docs`, `book`, `learn`, `tutorial`, `guide`, `reference`, `manual`, `handbook`, `api`, etc.) or the site classifies as `docs`/`apiref` by hostname.
- One-level broadening doesn't land at whole-host (`/*`).
- Cap of `_DEFAULT_MAX_BROADEN_EVENTS = 2` per crawl.
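Taken together, the four conditions above amount to a single predicate. The sketch below is one way to express it, under stated assumptions: `ScopeState` and `may_broaden` are hypothetical names, the marker set contains only the entries listed above (the real `_DOCS_HUB_MARKERS` includes more, per the "etc."), and hostname classification is reduced to a precomputed `site_class` string.

```python
from dataclasses import dataclass
from typing import Optional

# Only the markers named in the release notes; the real set is larger.
DOCS_HUB_MARKERS = {
    "docs", "book", "learn", "tutorial", "guide",
    "reference", "manual", "handbook", "api",
}
MAX_BROADEN_EVENTS = 2  # mirrors _DEFAULT_MAX_BROADEN_EVENTS = 2


@dataclass
class ScopeState:
    """Hypothetical container for the inputs the guardrails need."""
    pattern: str                      # current scope, e.g. '/docs/concepts/*'
    auto_derived: bool                # False when the user passed include_paths
    broaden_events: int = 0           # broadenings already done in this crawl
    site_class: Optional[str] = None  # assumed hostname classification result


def may_broaden(state: ScopeState) -> bool:
    """Return True only if every guardrail allows one more broadening."""
    if not state.auto_derived:
        return False  # explicit include_paths is never mutated
    if state.broaden_events >= MAX_BROADEN_EVENTS:
        return False  # per-crawl cap reached
    segments = [s for s in state.pattern.strip("/").split("/") if s and s != "*"]
    if len(segments) <= 1:
        return False  # one level up would be whole-host '/*'
    # Docs-hub test: leftmost segment marker, or hostname classification.
    return segments[0] in DOCS_HUB_MARKERS or state.site_class in {"docs", "apiref"}


print(may_broaden(ScopeState("/docs/concepts/*", auto_derived=True)))
print(may_broaden(ScopeState("/book/*", auto_derived=True)))
```

Under these assumptions the first call allows broadening (the kubernetes case) and the second does not (the rust-book case, where `/book/*` has nowhere to go short of whole-host).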
API additions (additive only)
- `CrawlResult.scope_history` (`List[List[str]]`): sequence of `include_paths` patterns the crawl traversed. Auditable. Empty if no scope was set.
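As a rough illustration of how `scope_history` might be audited after a crawl, the sketch below uses a stand-in dataclass rather than the real `CrawlResult`. `CrawlResultStub` and `describe_scope_history` are hypothetical, and the sample history (initial scope followed by one broadened scope) is an assumption about how entries are ordered.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CrawlResultStub:
    """Stand-in for CrawlResult; only the new field is modeled."""
    scope_history: List[List[str]] = field(default_factory=list)


def describe_scope_history(history: List[List[str]]) -> str:
    """Render each scope step as its patterns, joined in traversal order."""
    if not history:
        return "no scope set"
    return " -> ".join(", ".join(patterns) for patterns in history)


# Assumed shape: one inner list of patterns per scope the crawl used.
result = CrawlResultStub(scope_history=[["/docs/concepts/*"], ["/docs/*"]])
print(describe_scope_history(result.scope_history))

# A history with more than one entry indicates that broadening fired.
broadened = len(result.scope_history) > 1
print(broadened)
```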
Migration
No breaking changes. Behavior preserved exactly when the user passes include_paths explicitly. For default crawls on docs sites, expect more pages and the same (or better) signal-to-noise — the broadening guardrail is intentionally tight (docs hub markers only, no whole-host fallback).
549 tests passing (was 528 on v0.10.4; +21 in tests/test_v0105_adaptive_scope.py).
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.