AIMLPM/markcrawl

v0.9.3 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

auto_path_scope now automatically detects ecommerce category-index markers and adjusts crawling scope accordingly.

Full changelog

Generic URL-convention fix for ecommerce sites where the seed URL passes through a category-index segment but target items live at sibling paths. Same kind of rule as the v0.9.2 `/wiki/` article-container check.

Validated on the 4 sites used by the public llm-crawler-benchmarks rotation (per-site values are MRR):

| Site | Public v0.9.1 | v0.9.2 | v0.9.3 | Notes |
|---|---|---|---|---|
| mdn-css | 0.125 | 0.5625 | 0.5625 | unchanged from v0.9.2 |
| kubernetes-docs | 0.542 | 0.9062 | 0.9062 | unchanged from v0.9.2 |
| huggingface-transformers | 0.000 | 0.3438 | 0.3438 | unchanged from v0.9.2 |
| ikea | 0.375 | 0.0000 | 0.1250 | recovered from v0.9.2 over-tight scope |
| AVG MRR | 0.260 | 0.4531 | 0.4844 | +0.224 over public (+86% relative) |

(IKEA remains below the public v0.9.1 score of 0.375 because of long-tail variance: a 200-page random sample from a catalog of thousands of products hits different specific named items on each run. The scope fix itself works as designed.)
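
The AVG MRR row can be re-derived from the per-site rows. The snippet below is only an arithmetic check on the numbers in the table; the variable and function names are illustrative and not part of markcrawl.

```python
# Per-site MRR values copied from the table: (public v0.9.1, v0.9.2, v0.9.3).
mrr = {
    "mdn-css": (0.125, 0.5625, 0.5625),
    "kubernetes-docs": (0.542, 0.9062, 0.9062),
    "huggingface-transformers": (0.000, 0.3438, 0.3438),
    "ikea": (0.375, 0.0000, 0.1250),
}


def column_mean(index):
    """Unweighted mean of one version's MRR across the four sites."""
    values = [row[index] for row in mrr.values()]
    return sum(values) / len(values)


avg_public, avg_v092, avg_v093 = (column_mean(i) for i in range(3))

# Absolute and relative gain of v0.9.3 over the public v0.9.1 baseline.
absolute_gain = avg_v093 - avg_public
relative_gain = absolute_gain / avg_public

print(f"v0.9.2 avg: {avg_v092:.4f}, v0.9.3 avg: {avg_v093:.4f}")
print(f"gain over public: {absolute_gain:+.3f} ({relative_gain:+.0%} relative)")
```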

What's new

auto_path_scope now detects ecommerce category-index markers

When the seed URL passes through a /<marker>/ segment that is a known ecommerce-platform URL convention, the segments before the marker become the scope. The rule is generic: it applies to any site that follows the convention and is not tied to specific domains.

Markers detected: cat, category, categories, products, shop, collections.

These conventions are used by Shopify, WooCommerce, Magento, and Salesforce Commerce Cloud defaults, plus IKEA, Etsy, BigCommerce, and many more.

| Seed URL | New v0.9.3 scope | Why |
|---|---|---|
| ikea.com/us/en/cat/furniture-fu001/ | /us/en/* | Products at /us/en/p/* are siblings |
| myshop.com/store/collections/spring/products/x | /store/collections/spring/* | Deepest marker wins; outer is parent grouping |
| myshop.com/products/single-thing | None | Marker at root → siblings span whole site |
| mywp.com/blog/category/news/post-1 | /blog/* | News posts are at /blog/<slug> |

Multi-marker tiebreak: deepest wins

For URLs with nested markers (e.g. /store/collections/X/products/Y), the deepest marker is the leaf-level category — outer markers are parent groupings. Scope is anchored at the segments before the deepest marker.
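
The rule and its tiebreak can be sketched in a few lines. This is a minimal illustration of the behavior described in these notes, not markcrawl's actual implementation: `infer_scope` and `ECOMMERCE_MARKERS` are hypothetical names, and the sketch does not model the other auto_path_scope rules (such as the v0.9.2 `/wiki/` check).

```python
from urllib.parse import urlsplit

# Marker words listed in the release notes.
ECOMMERCE_MARKERS = {"cat", "category", "categories", "products", "shop", "collections"}


def infer_scope(seed_url):
    """Return a scope pattern such as '/us/en/*', or None for site-wide scope.

    Hypothetical helper illustrating the documented rule only.
    """
    # urlsplit only fills in .path correctly when a scheme or '//' is present.
    if "//" not in seed_url:
        seed_url = "https://" + seed_url
    segments = [s for s in urlsplit(seed_url).path.split("/") if s]

    # Scan right to left so the deepest marker wins; match case-insensitively.
    for i in range(len(segments) - 1, -1, -1):
        if segments[i].lower() in ECOMMERCE_MARKERS:
            prefix = segments[:i]
            # A marker at the root leaves no prefix: siblings span the whole site.
            return "/" + "/".join(prefix) + "/*" if prefix else None

    # No marker in the path: this sketch leaves the scope undecided.
    return None


for seed in (
    "ikea.com/us/en/cat/furniture-fu001/",
    "myshop.com/store/collections/spring/products/x",
    "myshop.com/products/single-thing",
    "mywp.com/blog/category/news/post-1",
):
    print(seed, "->", infer_scope(seed))
```

Run against the four seed URLs from the table, the sketch yields the same scopes listed there. Scanning from the right is what makes the deepest marker win without a separate tiebreak step.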

Tests

324 passing (was 316 in v0.9.2; +8 new ecommerce-marker tests covering ikea-style /us/en/cat, Shopify /products & /collections, deepest-marker tiebreak, case-insensitivity).

Migration

No API changes. Behavior changes only for seeds whose path passes through one of the marker words. If you were relying on a tight scope at the marker level, pass auto_path_scope=False and use include_paths explicitly.

Known limitations

  • Long-tail product queries: with a constrained max_pages, hitting specific named products on a large catalog (ikea MALM/SLATTUM, etc.) depends on which products end up early in BFS order. This variance cannot be removed at a fixed page budget; a larger max_pages reduces it.
  • SPA sites: pages requiring JavaScript-rendered navigation still need explicit --render-js.
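
To illustrate the first limitation, here is a toy breadth-first crawl with a page budget. It is a simplified model under stated assumptions, not markcrawl's crawler: `bfs_crawl`, the link graph, and the paths are all made up for the example.

```python
from collections import deque


def bfs_crawl(seed, links, max_pages):
    """Fetch pages breadth-first from seed, stopping after max_pages fetches."""
    fetched = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(fetched) < max_pages:
        page = queue.popleft()
        fetched.append(page)
        # Enqueue unseen outgoing links in the order they appear on the page.
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


# Hypothetical category page linking to four product pages.
toy_links = {
    "/cat/beds": ["/p/bed-a", "/p/bed-b", "/p/bed-c", "/p/bed-d"],
}

small_budget = bfs_crawl("/cat/beds", toy_links, max_pages=3)
large_budget = bfs_crawl("/cat/beds", toy_links, max_pages=5)

print("/p/bed-d" in small_budget, "/p/bed-d" in large_budget)
```

With the smaller budget the crawl stops before reaching the last product, so a query about that product cannot be answered. On a real catalog the link order differs from run to run, which is where the variance comes from.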

Install

pip install 'markcrawl[js]==0.9.3'

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
