This release includes 2 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
Auto-derived include_paths and on-section link prioritization now default ON: crawls are constrained to the seed subtree, and the BFS queue visits on-section links first.
Full changelog
Two complementary, generic crawler-coverage improvements that fix the dominant BFS-escapes-seed-subtree failure mode (mdn /Web/CSS, HF /docs/transformers/ etc.).
Both default ON. Behavior-preserving for root-domain seeds and explicit user include_paths overrides.
Sandbox validation: 39-site coverage pool
| Metric | v0.9.1 (control) | v0.9.2 | Δ |
|---|---|---|---|
| Avg coverage | 0.410 | 0.523 | +0.113 (+27%) |
| Full coverage (≥99%) | 7 | 10 | +3 |
| Zero coverage | 14 | 10 | −4 |
| Wins / Losses / Ties | — | 13 / 4 / 22 | — |
Big wins (≥+20pp): aws-iam +100, mongodb-docs +60, mdn-javascript +60, mdn-webapi +60, postgres-docs +60, python-stdlib +40, wikipedia-ml +40, plus 6 more.
Smaller losses: tailwind-utils −40, django-docs −20, stackoverflow-blog −20, nike-mens −20.
What's new
auto_path_scope=True default
Auto-derives include_paths from the seed URL when it has ≥ 2 path segments:
| Seed URL | Auto-derived scope |
|---|---|
| huggingface.co/docs/transformers/index | /docs/transformers/* |
| developer.mozilla.org/en-US/docs/Web/CSS | /en-US/docs/Web/CSS/* |
| kubernetes.io/docs/concepts/ | /docs/concepts/* |
| example.com/blog/2026/post.html | /blog/2026/* |
| example.com/ | None (root — full-site crawl) |
| en.wikipedia.org/wiki/Computer_science | None (article container) |
Three edge cases handled:
- Content-page filenames like `/stable/user_guide.html` strip to parent `/stable/*` so sibling pages remain reachable.
- `/index*` suffixes stripped before scoping.
- Article-container path prefixes (`wiki`, `wikipedia`) skip scoping entirely: articles are siblings, not children.
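The scoping rules above can be sketched roughly as follows. This is an illustrative reconstruction, not markcrawl's actual implementation: the function name `derive_scope`, the constant names, and the choice to treat only `.html`/`.htm` as content-page filenames are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Assumption: prefixes taken from the release notes' "article container" list.
ARTICLE_CONTAINERS = {"wiki", "wikipedia"}
# Assumption: which extensions count as content-page filenames.
CONTENT_EXTENSIONS = (".html", ".htm")


def derive_scope(seed_url):
    """Return a glob-style path scope for the seed, or None for a full-site crawl."""
    if "://" not in seed_url:
        seed_url = "https://" + seed_url  # let urlparse find the host
    segments = [s for s in urlparse(seed_url).path.split("/") if s]

    # Root-domain and single-segment seeds are not scoped.
    if len(segments) < 2:
        return None
    # Article containers: articles are siblings, so scoping would hide them.
    if segments[0].lower() in ARTICLE_CONTAINERS:
        return None

    last = segments[-1].lower()
    is_index = re.fullmatch(r"index(\.\w+)?", last) is not None
    if is_index or last.endswith(CONTENT_EXTENSIONS):
        segments = segments[:-1]  # scope to the parent directory
    if not segments:
        return None
    return "/" + "/".join(segments) + "/*"
```

Under these assumptions, `derive_scope("kubernetes.io/docs/concepts/")` yields `/docs/concepts/*`, while a root seed such as `example.com/` yields `None`.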
auto_path_priority=True default
Reorders the BFS queue so links sharing ≥ 50% of seed path segments are visited before off-section links. Never blocks any URL — purely an ordering hint. FIFO preserved within priority buckets.
Priority follows scope's "scopeable" check. When scope returns None (article-container or content-page seeds), priority is also a no-op, falling back to plain BFS. This avoids depth-first starvation of sibling sections.
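A minimal sketch of the ordering behavior, assuming "sharing ≥ 50% of seed path segments" means matching leading segments counted against the seed's segment count. The class `TwoBucketFrontier` and the helpers are hypothetical names for illustration, not markcrawl internals, and the URLs are only sample inputs.

```python
from collections import deque
from urllib.parse import urlparse


def path_segments(url):
    """Split a URL path into its non-empty segments."""
    return [s for s in urlparse(url).path.split("/") if s]


def is_on_section(link, seed_segments, threshold=0.5):
    """True when the link's leading segments match >= threshold of the seed's."""
    if not seed_segments:
        return False  # unscopeable seed: nothing is prioritized
    shared = 0
    for link_seg, seed_seg in zip(path_segments(link), seed_segments):
        if link_seg != seed_seg:
            break
        shared += 1
    return shared / len(seed_segments) >= threshold


class TwoBucketFrontier:
    """BFS queue with two FIFO buckets; on-section links are popped first."""

    def __init__(self, seed_segments):
        self.seed_segments = seed_segments
        self.on_section = deque()
        self.off_section = deque()

    def push(self, url):
        # Every URL is kept; the bucket only decides visit order.
        if is_on_section(url, self.seed_segments):
            self.on_section.append(url)
        else:
            self.off_section.append(url)

    def pop(self):
        bucket = self.on_section or self.off_section
        return bucket.popleft()

    def __len__(self):
        return len(self.on_section) + len(self.off_section)


frontier = TwoBucketFrontier(["en-US", "docs", "Web", "CSS"])
for url in [
    "https://developer.mozilla.org/en-US/blog/",
    "https://developer.mozilla.org/en-US/docs/Web/CSS/color",
    "https://developer.mozilla.org/en-US/docs/Web/CSS/grid",
]:
    frontier.push(url)
order = [frontier.pop() for _ in range(len(frontier))]
```

Here `order` lists the two `/Web/CSS/` links first, in insertion order, followed by the blog link. Passing an empty `seed_segments` list puts every URL in the same bucket, which reduces to plain FIFO BFS.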
Migration
Behavior-preserving for callers passing include_paths explicitly or using root-domain seeds. Otherwise:
- `crawl(url, out_dir=...)` now constrains to the seed's subtree. Pass `auto_path_scope=False` to opt out.
- BFS visit order prefers on-section links. Pass `auto_path_priority=False` to opt out.
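As a sketch of opting out of both defaults at once, the two flags named above can be bundled into one set of keyword arguments. The helper names are hypothetical, and the import path `from markcrawl import crawl` is an assumption; check the package documentation for the actual entry point.

```python
def opt_out_kwargs(out_dir):
    """Keyword arguments that switch off both new v0.9.2 defaults."""
    return {
        "out_dir": out_dir,
        "auto_path_scope": False,     # do not constrain to the seed's subtree
        "auto_path_priority": False,  # keep plain BFS visit order
    }


def crawl_without_auto_paths(url, out_dir):
    """Run a crawl with auto scoping and auto prioritization disabled."""
    # Assumption: `crawl` is importable from the top-level `markcrawl` package.
    from markcrawl import crawl

    return crawl(url, **opt_out_kwargs(out_dir))
```

Callers who only want to restore full-site reach can drop the `auto_path_priority` entry and keep the new visit order.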
Tests
316 passing (was 292 in v0.9.1; +24 unit tests covering scope and priority across docs, wiki, ecommerce, and edge-case seed shapes).
Known limitations
- Deep ecommerce category seeds (ikea/us/en/cat/X, nike /w/Y) where targets live at sibling `/p/*` paths. Pass `auto_path_scope=False` for those, or use explicit `include_paths`.
- Single-page-app sites that rely on JS-rendered navigation still need `--render-js`. The `auto_render_js` opt-in flag is available but defaults off (v0.9.0's naive auto-on regressed retrieval and was reverted).
Install
pip install 'markcrawl[js]==0.9.2'
Breaking Changes
- Default `auto_path_scope=True` limits crawl to seed's subtree unless overridden with `auto_path_scope=False`.
- Default `auto_path_priority=True` changes BFS visit order to prioritize links sharing ≥50% of seed path segments, affecting traversal patterns.
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.