AIMLPM/markcrawl

v0.10.0 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 2mo RAG & Retrieval

View tool

✓ No known CVEs patched

Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm

+9 more

markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Chunker defaults flipped to higher min_words and link stripping, boosting MRR by ~11.5%.

Full changelog

Highlights

| Metric | v0.9.9-rc1 (baseline) | v0.10.0 |
|-------------------------|------------------------:|-----------------------------------------:|
| Mean MRR (local pool) | 0.3461 | 0.3859 (+11.5%) |
| Cost at 50M pages | $10,152 | $0 with markcrawl[ml] / $5,246 default |
| Chunks per page | 20.3 | 10.49 (-48% smaller index) |
| Tests | 350 | 493 passing |

What's in this release

Chunker defaults flipped (Track D). chunk_markdown now defaults to min_words=250, section_overlap_words=40, strip_markdown_links=True. Multi-trial validated: +14% MRR on st-mini (6 trials, all positive), +15% on OpenAI 3-small (3 trials, all positive). Backward-compatible — pass legacy values explicitly to opt out.
Embedder abstraction (Track B). New markcrawl/embedder.py ships with Embedder ABC + OpenAIEmbedder + LocalSentenceTransformerEmbedder (with model-specific instruction prefixes for asymmetric retrieval). Bake-off winner is mixedbread-ai/mxbai-embed-large-v1 — passes SC-B1 ($0 cost) + SC-B2 (Δ −0.018 MRR within ±0.020 band).
Rerank infrastructure (Track A). New markcrawl/retrieval.py with CrossEncoderReranker. Failed the +0.030 MRR bar on this distribution (regressed −0.013); ships opt-in only.
Ecom diagnosis (Track C). Documented that newegg/ikea failures are crawl-discovery, not extraction. Per-site auto_path_scope / auto_path_priority / use_sitemap overrides plumbed in the local-replica harness; SC-C1 closure deferred to v0.11 (sitemap-first + UA rotation).
Retry resilience. Tenacity-backed retry, binary-parity CI smoke tests, structured exhaustion logs.

Per-category MRR (11-site canonical pool)

| Category | v0.9.9 | v0.10.0 | Δ |
|-----------------|-------:|--------:|---------:|
| framework_docs | 0.3641 | 0.3771 | +0.0130 |
| api_docs | 0.4840 | 0.5414 | +0.0574 |
| reference | 0.3750 | 0.3750 | 0.0000 |
| tutorial | 0.4605 | 0.7000 | +0.2395 |
| ecommerce | 0.0625 | 0.0625 | 0.0000 |
| blog | 0.5000 | 0.5000 | 0.0000 |
| news | 0.1667 | 0.1667 | 0.0000 |

No category regresses. The largest per-site regression
(huggingface-transformers −0.083) is offset within the
framework_docs category by react-dev's gain.

Known v0.11 work

Default flip for mxbai embedder. Extras-aware factory: pick mxbai when markcrawl[ml] is installed, fall back to OpenAI 3-small. Captures the −$10K/yr cost reduction in production callers.
Sitemap-first ecom discovery. Closes SC-C1 (ikea reaches its canonical products; newegg avoids anti-bot via different URL queue).
Newegg anti-bot mitigation. UA rotation pool + retry-with-jitter for ecom-class sites.

Reports

bench/local_replica/v010_release_report.md — authoritative release report with side-by-side comparison.
bench/local_replica/track_d_report.md — chunker sweep methodology (56 configs).
bench/local_replica/track_a_report.md — rerank failure analysis.
bench/local_replica/track_b_report.md — embedder bake-off (4 of 5 candidates run; nomic aborted).
bench/local_replica/track_c_report.md — ecom discovery vs extraction diagnosis.
specs/v010-leaderboard-sweep.md — campaign spec with per-track status markers.

Breaking Changes

Chunker default parameters changed: min_words=250, section_overlap_words=40, strip_markdown_links=True (legacy values must be passed explicitly to retain old behavior).

View diff on GitHub

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Share on X Share on Bluesky

Track AIMLPM/markcrawl

Get notified when new releases ship.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.

All releases →