This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Topics
+9 more
Summary
AI summaryChunker defaults flipped to higher min_words and link stripping, boosting MRR by ~11.5%.
Full changelog
Highlights
| Metric | v0.9.9-rc1 (baseline) | v0.10.0 |
|-------------------------|------------------------:|-----------------------------------------:|
| Mean MRR (local pool) | 0.3461 | 0.3859 (+11.5%) |
| Cost at 50M pages | $10,152 | $0 with markcrawl[ml] / $5,246 default |
| Chunks per page | 20.3 | 10.49 (-48% smaller index) |
| Tests | 350 | 493 passing |
What's in this release
- Chunker defaults flipped (Track D).
chunk_markdownnow defaults tomin_words=250,section_overlap_words=40,strip_markdown_links=True. Multi-trial validated: +14% MRR on st-mini (6 trials, all positive), +15% on OpenAI 3-small (3 trials, all positive). Backward-compatible — pass legacy values explicitly to opt out. - Embedder abstraction (Track B). New
markcrawl/embedder.pyships withEmbedderABC +OpenAIEmbedder+LocalSentenceTransformerEmbedder(with model-specific instruction prefixes for asymmetric retrieval). Bake-off winner ismixedbread-ai/mxbai-embed-large-v1— passes SC-B1 ($0 cost) + SC-B2 (Δ −0.018 MRR within ±0.020 band). - Rerank infrastructure (Track A). New
markcrawl/retrieval.pywithCrossEncoderReranker. Failed the +0.030 MRR bar on this distribution (regressed −0.013); ships opt-in only. - Ecom diagnosis (Track C). Documented that newegg/ikea failures are crawl-discovery, not extraction. Per-site
auto_path_scope/auto_path_priority/use_sitemapoverrides plumbed in the local-replica harness; SC-C1 closure deferred to v0.11 (sitemap-first + UA rotation). - Retry resilience. Tenacity-backed retry, binary-parity CI smoke tests, structured exhaustion logs.
Per-category MRR (11-site canonical pool)
| Category | v0.9.9 | v0.10.0 | Δ |
|-----------------|-------:|--------:|---------:|
| framework_docs | 0.3641 | 0.3771 | +0.0130 |
| api_docs | 0.4840 | 0.5414 | +0.0574 |
| reference | 0.3750 | 0.3750 | 0.0000 |
| tutorial | 0.4605 | 0.7000 | +0.2395 |
| ecommerce | 0.0625 | 0.0625 | 0.0000 |
| blog | 0.5000 | 0.5000 | 0.0000 |
| news | 0.1667 | 0.1667 | 0.0000 |
No category regresses. The largest per-site regression
(huggingface-transformers −0.083) is offset within the
framework_docs category by react-dev's gain.
Known v0.11 work
- Default flip for mxbai embedder. Extras-aware factory: pick mxbai when
markcrawl[ml]is installed, fall back to OpenAI 3-small. Captures the −$10K/yr cost reduction in production callers. - Sitemap-first ecom discovery. Closes SC-C1 (ikea reaches its canonical products; newegg avoids anti-bot via different URL queue).
- Newegg anti-bot mitigation. UA rotation pool + retry-with-jitter for ecom-class sites.
Reports
bench/local_replica/v010_release_report.md— authoritative release report with side-by-side comparison.bench/local_replica/track_d_report.md— chunker sweep methodology (56 configs).bench/local_replica/track_a_report.md— rerank failure analysis.bench/local_replica/track_b_report.md— embedder bake-off (4 of 5 candidates run; nomic aborted).bench/local_replica/track_c_report.md— ecom discovery vs extraction diagnosis.specs/v010-leaderboard-sweep.md— campaign spec with per-track status markers.
Breaking Changes
- Chunker default parameters changed: min_words=250, section_overlap_words=40, strip_markdown_links=True (legacy values must be passed explicitly to retain old behavior).
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]