This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
`auto_path_scope` now automatically detects ecommerce category-index markers and adjusts the crawl scope accordingly.
Full changelog
Generic URL-convention fix for ecommerce sites where the seed URL passes through a category-index segment but target items live at sibling paths. Same kind of rule as the v0.9.2 `/wiki/` article-container check.
Validated on the 4 sites used by the public llm-crawler-benchmarks rotation (per-site scores are MRR):
| Site | Public v0.9.1 | v0.9.2 | v0.9.3 | Notes |
|---|---|---|---|---|
| mdn-css | 0.125 | 0.5625 | 0.5625 | unchanged from v0.9.2 |
| kubernetes-docs | 0.542 | 0.9062 | 0.9062 | unchanged from v0.9.2 |
| huggingface-transformers | 0.000 | 0.3438 | 0.3438 | unchanged from v0.9.2 |
| ikea | 0.375 | 0.0000 | 0.1250 | recovered from v0.9.2 over-tight scope |
| AVG MRR | 0.260 | 0.4531 | 0.4844 | +0.224 over public (+86% relative) |
(Ikea remains below the public v0.9.1 score of 0.375 because of long-tail variance: a 200-page random sample drawn from thousands of products hits different specific named items on each run. The scope fix itself works as designed.)
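The AVG MRR row can be re-derived from the per-site rows. The snippet below is only an arithmetic check on the numbers in the table; the variable and function names are illustrative and not part of markcrawl.

```python
# Per-site MRR values copied from the table: (public v0.9.1, v0.9.2, v0.9.3).
scores = {
    "mdn-css": (0.125, 0.5625, 0.5625),
    "kubernetes-docs": (0.542, 0.9062, 0.9062),
    "huggingface-transformers": (0.000, 0.3438, 0.3438),
    "ikea": (0.375, 0.0000, 0.1250),
}


def column_mean(index):
    """Unweighted mean of one version's column across all sites."""
    values = [row[index] for row in scores.values()]
    return sum(values) / len(values)


public_avg, v092_avg, v093_avg = (column_mean(i) for i in range(3))
absolute_gain = v093_avg - public_avg
relative_gain = absolute_gain / public_avg

print(f"{public_avg:.4f} {v092_avg:.4f} {v093_avg:.4f}")
print(f"+{absolute_gain:.3f} absolute, +{relative_gain:.0%} relative")
```

The average is unweighted, so the single ikea row moves the v0.9.3 mean by a quarter of its own change.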
What's new
auto_path_scope now detects ecommerce category-index markers
When the seed URL passes through a /<marker>/ segment that's a known ecommerce-platform URL convention, the segments BEFORE the marker become the scope. This is generic — applies to any site adopting the convention, not domain-specific.
Markers detected: cat, category, categories, products, shop, collections.
These conventions are the defaults on Shopify, WooCommerce, Magento, and Salesforce Commerce Cloud, and are also used by IKEA, Etsy, BigCommerce, and many others.
| Seed URL | New v0.9.3 scope | Why |
|---|---|---|
| ikea.com/us/en/cat/furniture-fu001/ | /us/en/* | Products at /us/en/p/* are siblings |
| myshop.com/store/collections/spring/products/x | /store/collections/spring/* | Deepest marker wins; outer is parent grouping |
| myshop.com/products/single-thing | None | Marker at root → siblings span whole site |
| mywp.com/blog/category/news/post-1 | /blog/* | News posts are at /blog/<slug> |
Multi-marker tiebreak: deepest wins
For URLs with nested markers (e.g. /store/collections/X/products/Y), the deepest marker is the leaf-level category — outer markers are parent groupings. Scope is anchored at the segments before the deepest marker.
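As a rough illustration of the rule, here is a minimal sketch that reproduces the four table rows. It is not markcrawl's implementation: the function name `ecommerce_scope`, the glob-style return value, and the use of `None` for "no scope" are assumptions made for this example.

```python
from urllib.parse import urlparse

# Marker words listed in these release notes.
MARKERS = {"cat", "category", "categories", "products", "shop", "collections"}


def ecommerce_scope(seed_url):
    """Return a scope glob such as '/us/en/*', or None if no scope is derived.

    Hypothetical helper: it only mirrors the rule described above.
    """
    # urlparse needs a scheme to separate the host from the path.
    if "://" not in seed_url:
        seed_url = "https://" + seed_url
    segments = [s for s in urlparse(seed_url).path.split("/") if s]

    # Positions of every path segment that is a marker (case-insensitive).
    hits = [i for i, seg in enumerate(segments) if seg.lower() in MARKERS]
    if not hits:
        return None  # no marker in the path: this rule does not apply

    deepest = hits[-1]  # multi-marker tiebreak: deepest wins
    if deepest == 0:
        return None  # marker at root: siblings span the whole site

    # Scope is anchored at the segments before the deepest marker.
    return "/" + "/".join(segments[:deepest]) + "/*"


for seed in (
    "ikea.com/us/en/cat/furniture-fu001/",
    "myshop.com/store/collections/spring/products/x",
    "myshop.com/products/single-thing",
    "mywp.com/blog/category/news/post-1",
):
    print(seed, "->", ecommerce_scope(seed))
```

In this sketch `None` covers both "no marker found" and "marker at root"; a real implementation would likely distinguish the two so that other scope rules can still apply.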
Tests
324 passing (was 316 in v0.9.2; +8 new ecommerce-marker tests covering ikea-style /us/en/cat, Shopify /products & /collections, deepest-marker tiebreak, case-insensitivity).
Migration
No API changes. Behavior shift only on seeds passing through one of the marker words. If you were relying on a tight scope at the marker level, pass auto_path_scope=False and use include_paths explicitly.
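To see what the behavior shift means in practice, the sketch below compares the wider v0.9.3 scope with a marker-level scope. It assumes scope patterns behave like shell-style globs, which these notes do not confirm; the sample paths and the `/us/en/cat/*` pattern are made up for illustration, and only the standard-library `fnmatch` is used.

```python
from fnmatch import fnmatch

new_scope = "/us/en/*"        # v0.9.3 auto scope for the ikea seed in the table
tight_scope = "/us/en/cat/*"  # hypothetical marker-level pattern for include_paths

# Made-up example paths, not taken from a real crawl.
paths = [
    "/us/en/cat/furniture-fu001/",
    "/us/en/cat/example-category/",
    "/us/en/p/example-item/",
    "/us/en/example-service-page/",
]

for path in paths:
    in_new = fnmatch(path, new_scope)
    in_tight = fnmatch(path, tight_scope)
    print(f"{path:35} new={in_new!s:5} tight={in_tight}")

# Paths that only the wider v0.9.3 scope lets through.
newly_in_scope = [
    p for p in paths if fnmatch(p, new_scope) and not fnmatch(p, tight_scope)
]
```

Under that assumption, product pages and every other sibling under `/us/en/` enter the crawl with the new scope, which is the case where setting `auto_path_scope=False` and listing `include_paths` explicitly keeps the earlier, narrower behavior.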
Known limitations
- Long-tail product queries: with constrained `max_pages`, hitting specific named products on a large catalog (ikea MALM/SLATTUM, etc.) depends on which products end up in BFS order. This variance is irreducible; a larger `max_pages` reduces it.
- SPA sites: pages requiring JavaScript-rendered navigation still need explicit `--render-js`.
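The long-tail limitation follows from how a page cap interacts with breadth-first order. The toy example below is not markcrawl's crawler: `bfs_crawl` and the link graph are hypothetical, and the point is only that a page late in the queue is dropped once the cap is reached.

```python
from collections import deque


def bfs_crawl(links, seed, max_pages):
    """Visit pages breadth-first and stop after max_pages pages (toy model)."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return visited


# Hypothetical link graph: one category page linking to four product pages.
links = {"/cat/": ["/p/a", "/p/b", "/p/c", "/p/d"]}

small = bfs_crawl(links, "/cat/", max_pages=3)
large = bfs_crawl(links, "/cat/", max_pages=5)
print(small)
print(large)
```

With the cap at 3, `/p/c` and `/p/d` are never fetched even though they are in scope; raising the cap to 5 reaches all of them. On a real catalog the link order can differ between runs, which is where the run-to-run variance comes from.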
Install
pip install 'markcrawl[js]==0.9.3'
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]