AIMLPM/markcrawl

v0.9.2 Breaking

This release includes two breaking changes that platform teams should review before upgrading.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Auto‑derived include_paths now default ON, constraining crawls to the seed subtree and reordering BFS to prioritize on‑section links.

Full changelog

This release adds two complementary, generic crawler-coverage improvements. Both target the dominant failure mode, in which BFS escapes the seed subtree (e.g. MDN /Web/CSS, Hugging Face /docs/transformers/).

Both default ON. Behavior-preserving for root-domain seeds and explicit user include_paths overrides.

Sandbox validation: 39-site coverage pool

| Metric | v0.9.1 (control) | v0.9.2 | Δ |
|---|---|---|---|
| Avg coverage | 0.410 | 0.523 | +0.113 (+27%) |
| Full coverage (≥99%) | 7 | 10 | +3 |
| Zero coverage | 14 | 10 | −4 |
| Wins / Losses / Ties | — | 13 / 4 / 22 | — |

Big wins (≥+20pp): aws-iam +100, mongodb-docs +60, mdn-javascript +60, mdn-webapi +60, postgres-docs +60, python-stdlib +40, wikipedia-ml +40, plus 6 more.

Smaller losses: tailwind-utils −40, django-docs −20, stackoverflow-blog −20, nike-mens −20.

What's new

auto_path_scope=True default

Auto-derives include_paths from the seed URL when it has ≥ 2 path segments:

| Seed URL | Auto-derived scope |
|---|---|
| huggingface.co/docs/transformers/index | /docs/transformers/* |
| developer.mozilla.org/en-US/docs/Web/CSS | /en-US/docs/Web/CSS/* |
| kubernetes.io/docs/concepts/ | /docs/concepts/* |
| example.com/blog/2026/post.html | /blog/2026/* |
| example.com/ | None (root — full-site crawl) |
| en.wikipedia.org/wiki/Computer_science | None (article container) |

Three edge cases handled:

  1. Content-page filenames like /stable/user_guide.html strip to parent /stable/* so sibling pages remain reachable.
  2. /index* suffixes stripped before scoping.
  3. Article-container path prefixes (wiki, wikipedia) skip scoping entirely — articles are siblings, not children.
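The scope table and the three edge cases above can be approximated in a few lines. This is a minimal sketch, not markcrawl's actual implementation: the function name derive_scope, the extension list, and the choice to test only the first path segment against the article-container names are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumptions for this sketch (not taken from markcrawl's source):
ARTICLE_CONTAINERS = {"wiki", "wikipedia"}               # checked against the first segment
PAGE_EXTENSIONS = (".html", ".htm", ".php", ".asp", ".aspx")
INDEX_RE = re.compile(r"^index(\.\w+)?$")                # "index", "index.html", ...


def derive_scope(seed_url):
    """Return an include-path pattern such as '/docs/concepts/*', or None."""
    # Accept scheme-less seeds like "example.com/blog/" by adding a scheme.
    url = seed_url if "://" in seed_url else "https://" + seed_url
    segments = [s for s in urlparse(url).path.split("/") if s]

    # Root or single-segment seeds are not scoped.
    if len(segments) < 2:
        return None

    # Article containers: articles are siblings, so skip scoping entirely.
    if segments[0].lower() in ARTICLE_CONTAINERS:
        return None

    # Strip a trailing /index* segment or a content-page filename,
    # so that sibling pages under the parent stay reachable.
    last = segments[-1].lower()
    if INDEX_RE.match(last) or last.endswith(PAGE_EXTENSIONS):
        segments = segments[:-1]

    if not segments:
        return None
    return "/" + "/".join(segments) + "/*"


if __name__ == "__main__":
    for seed in (
        "huggingface.co/docs/transformers/index",
        "developer.mozilla.org/en-US/docs/Web/CSS",
        "example.com/",
        "en.wikipedia.org/wiki/Computer_science",
    ):
        print(seed, "->", derive_scope(seed))
```

Under these assumptions the sketch reproduces every row of the table above. The real heuristics may differ in how they detect filenames and container prefixes.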

auto_path_priority=True default

Reorders the BFS queue so links sharing ≥ 50% of seed path segments are visited before off-section links. Never blocks any URL — purely an ordering hint. FIFO preserved within priority buckets.

Priority follows scope's "scopeable" check. When scope returns None (article-container or content-page seeds), priority is also a no-op, falling back to plain BFS. This avoids depth-first starvation of sibling sections.
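The ordering behavior can be pictured as a two-bucket queue. The sketch below is illustrative only: the class PriorityFrontier and the helper is_on_section are hypothetical names, and it assumes that "sharing ≥ 50% of seed path segments" means a matching leading prefix, which the changelog does not spell out.

```python
from collections import deque
from urllib.parse import urlparse


def path_segments(url):
    """Split a URL path into its non-empty segments."""
    return [s for s in urlparse(url).path.split("/") if s]


def is_on_section(link, seed_segments, threshold=0.5):
    """True if the link shares at least `threshold` of the seed's leading segments."""
    if not seed_segments:
        return False  # unscopeable seed: nothing counts as on-section
    shared = 0
    for link_seg, seed_seg in zip(path_segments(link), seed_segments):
        if link_seg != seed_seg:
            break
        shared += 1
    return shared / len(seed_segments) >= threshold


class PriorityFrontier:
    """Two FIFO buckets: on-section links are popped before off-section links."""

    def __init__(self, seed_segments):
        self.seed_segments = seed_segments
        self.on_section = deque()
        self.off_section = deque()

    def push(self, url):
        # Never drops a URL; the bucket only affects visit order.
        if is_on_section(url, self.seed_segments):
            self.on_section.append(url)
        else:
            self.off_section.append(url)

    def pop(self):
        if self.on_section:
            return self.on_section.popleft()
        return self.off_section.popleft()

    def __len__(self):
        return len(self.on_section) + len(self.off_section)
```

With an empty seed_segments list every link lands in the same bucket, so the frontier degrades to plain FIFO BFS, mirroring the no-op fallback described above.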

Migration

Behavior-preserving for callers passing include_paths explicitly or using root-domain seeds. Otherwise:

  • crawl(url, out_dir=...) now constrains to the seed's subtree. Pass auto_path_scope=False to opt out.
  • BFS visit order prefers on-section links. Pass auto_path_priority=False to opt out.
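For teams that want the v0.9.1 behavior during a staged upgrade, one option is to centralize the opt-out flags in a small helper. This is a sketch: the helper pre_092_kwargs is hypothetical, and the import path in the commented usage is an assumption, so check the project's documentation for the actual entry point.

```python
def pre_092_kwargs(**overrides):
    """Keyword arguments that opt out of both new v0.9.2 defaults.

    Hypothetical helper for illustration; any extra keyword arguments
    are merged in and take precedence over the opt-out values.
    """
    kwargs = {
        "auto_path_scope": False,     # do not constrain to the seed's subtree
        "auto_path_priority": False,  # keep plain BFS visit order
    }
    kwargs.update(overrides)
    return kwargs


# Usage (the import path below is an assumption, not confirmed by the changelog):
# from markcrawl import crawl
# crawl("https://kubernetes.io/docs/concepts/", out_dir="out", **pre_092_kwargs())
```

Keeping the flags in one place makes it easy to drop the opt-out later, once crawls have been validated against the new defaults.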

Tests

316 passing (was 292 in v0.9.1; +24 unit tests covering scope and priority across docs, wiki, ecommerce, and edge-case seed shapes).

Known limitations

  • Deep ecommerce category seeds (ikea/us/en/cat/X, nike /w/Y) whose targets live at sibling /p/* paths fall outside the auto-derived scope. Pass auto_path_scope=False for those, or use explicit include_paths.
  • Single-page-app sites that rely on JS-rendered navigation still need --render-js. The auto_render_js opt-in flag is available but defaults off (v0.9.0's naive auto-on regressed retrieval and was reverted).

Install

pip install 'markcrawl[js]==0.9.2'

Breaking Changes

  • Default `auto_path_scope=True` limits crawl to seed's subtree unless overridden with `auto_path_scope=False`.
  • Default `auto_path_priority=True` changes BFS visit order to prioritize links sharing ≥50% of seed path segments, affecting traversal patterns.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
