This release includes 2 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
Default retrieval-optimized pipeline and automatic SPA rendering improve benchmark MRR to 0.803.
Full changelog
Highlights
Benchmark: pool MRR 0.803 on llm-crawler-benchmarks retrieval comparison — up from ~0.70 on v0.8.0's library defaults. Biggest per-site gains: quotes-toscrape +0.125, wikipedia +0.100, blog +0.062.
Two independent changes drive the lift:
1. MRR-optimized recipe is now the library default
v0.6.0 added flags for the winning RAG recipe but left their defaults False. v0.9.0 flips them all True, so a user calling crawl(url, out_dir=...) with zero extra flags now gets the retrieval-optimized pipeline out of the box:
| Flag | Was | Now |
|---|---|---|
| auto_extract_title | False | True |
| prepend_first_paragraph | False | True |
| strip_markdown_links | False | True |
| i18n_filter | False | True |
| title_at_top | False | True |
Migration notes: these are behavioral defaults. If you relied on the old behavior, opt out via the corresponding params or CLI flags. In particular, i18n_filter=True skips URL paths under locale segments (/fr/, /de-DE/, /zh-Hans/); pass i18n_filter=False if you crawl localized content intentionally.
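To make the locale-segment rule concrete, here is a minimal sketch of how such a check could look. It is not markcrawl's implementation: the helper names, the language-code list, and the choice to inspect every path segment are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Illustrative subset of ISO 639-1 codes. Which codes markcrawl actually
# treats as locales is not stated in these notes (assumption).
LANGUAGE_CODES = {"ar", "de", "es", "fr", "it", "ja", "ko", "pt", "ru", "zh"}


def is_locale_segment(segment: str) -> bool:
    """Return True for path segments shaped like 'fr', 'de-DE', or 'zh-Hans'."""
    lang, sep, subtag = segment.partition("-")
    if lang.lower() not in LANGUAGE_CODES:
        return False
    if not sep:
        return True
    # Optional subtag: a 2-letter region (DE) or a 4-letter script (Hans).
    return subtag.isalpha() and len(subtag) in (2, 4)


def has_locale_segment(url: str) -> bool:
    """Return True if any path segment of the URL looks like a locale."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(is_locale_segment(s) for s in segments)
```

A crawler using a check like this would drop `https://example.com/fr/docs` but keep `https://example.com/docs/intro`. That is why sites whose primary content lives under a locale prefix need `i18n_filter=False`.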
2. Smart Playwright integration
Crawls of JS-heavy SPA sites now work without explicit --render-js:
- `markcrawl.js_detect`: conservative SPA detection (framework marker + low visible-text ratio, both required; protects SSR sites that use `id="root"` as convention).
- `markcrawl.dom_cleanup`: overlay stripping (cookie banners, modals, newsletter popups, sticky CTAs) before extraction.
- New default `auto_render_js=True`: probes the base URL via HTTP on start; auto-promotes to Playwright when SPA is detected.
- Playwright wait strategy switched from `domcontentloaded` to `load` + 500ms hydration pause. `networkidle` remains off by default (modern sites never idle). Screenshot capture path unchanged.
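The "both required" rule for SPA detection can be sketched with the standard library alone. This is an illustration of the idea, not the code in `markcrawl.js_detect`: the marker list, the 5% threshold, and the function names are all assumptions.

```python
import re
from html.parser import HTMLParser

# Hypothetical framework markers and threshold; the real values used by
# markcrawl.js_detect are not given in these notes.
FRAMEWORK_MARKERS = re.compile(
    r'id="(?:root|app|__next)"|data-reactroot|ng-version', re.IGNORECASE
)
MIN_TEXT_RATIO = 0.05
_HIDDEN_TAGS = ("script", "style", "noscript")


class _VisibleText(HTMLParser):
    """Collects text that is not inside script, style, or noscript elements."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _HIDDEN_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _HIDDEN_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text_ratio(html: str) -> float:
    """Share of the raw HTML length taken up by visible text."""
    parser = _VisibleText()
    parser.feed(html)
    text = " ".join("".join(parser.chunks).split())
    return len(text) / max(len(html), 1)


def looks_like_spa(html: str) -> bool:
    """Require BOTH a framework marker and a low visible-text ratio."""
    has_marker = bool(FRAMEWORK_MARKERS.search(html))
    return has_marker and visible_text_ratio(html) < MIN_TEXT_RATIO
```

Requiring both signals is what keeps a server-rendered page with `id="root"` and plenty of text on the plain HTTP path. The marker alone would misclassify it.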
Migration notes: if you'd rather the crawler never touch Playwright without explicit opt-in, pass auto_render_js=False.
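For reference, the `load` plus 500ms wait strategy described above maps onto Playwright's Python sync API roughly as follows. `render_html` is a hypothetical helper, not a markcrawl function, and it needs the `[js]` extra and a Chromium install to run.

```python
HYDRATION_PAUSE_MS = 500  # pause length stated in the release notes


def render_html(url: str, pause_ms: int = HYDRATION_PAUSE_MS) -> str:
    """Return a page's HTML after the load event plus a short hydration pause.

    Hypothetical helper for illustration; not part of markcrawl's API.
    """
    # Imported lazily so this module still loads without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            # Wait for the load event instead of domcontentloaded.
            page.goto(url, wait_until="load")
            # Give client-side frameworks a moment to hydrate.
            page.wait_for_timeout(pause_ms)
            return page.content()
        finally:
            browser.close()
```

A fixed pause is a blunt instrument, but it is predictable. Waiting for `networkidle` can hang on pages that keep polling or streaming in the background.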
Benchmark summary
| Metric | v0.9.0 | v0.8.0 library default | Current leaderboard #1 |
|---|---|---|---|
| Pool MRR (embedding) | 0.803 | ~0.70 | crawlee 0.733 |
| Hit@1 | 78% | ~67% | — |
| Speed | ~12 p/s | ~12 p/s | markcrawl |
| Extraction quality | 99% | 99% | markcrawl |
| Cost at scale | $4,505/yr | $4,505/yr | markcrawl |
| Answer quality | 4.52/5 | 4.52/5 | colly+md 4.53 (tied) |
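For readers unfamiliar with the first two rows, the sketch below shows the standard definitions of MRR and Hit@1. How the benchmark pools queries across sites is not described in these notes, so the input format (one first-hit rank per query, `None` when nothing relevant was retrieved) is an assumption.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank of the first relevant result; misses contribute 0."""
    if not first_hit_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_hit_ranks if rank)
    return total / len(first_hit_ranks)


def hit_at_1(first_hit_ranks: Sequence[Optional[int]]) -> float:
    """Fraction of queries whose top-ranked result is relevant."""
    if not first_hit_ranks:
        return 0.0
    return sum(1 for rank in first_hit_ranks if rank == 1) / len(first_hit_ranks)


# Four made-up queries: relevant result at rank 1, rank 2, not found, rank 1.
ranks = [1, 2, None, 1]
mrr = mean_reciprocal_rank(ranks)   # (1 + 0.5 + 0 + 1) / 4 = 0.625
top1 = hit_at_1(ranks)              # 2 / 4 = 0.5
```

MRR is always at least as high as Hit@1 because it also gives partial credit for relevant results below the top slot. This is consistent with 0.803 sitting above 78% in the table.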
Tests
+23 new unit tests (13 js_detect + 10 dom_cleanup). Full suite 292 pass.
Install
pip install 'markcrawl[js]==0.9.0'
playwright install chromium
Breaking Changes
- Default flag values for `auto_extract_title`, `prepend_first_paragraph`, `strip_markdown_links`, `i18n_filter`, and `title_at_top` changed from False to True.
- New `auto_render_js` option defaults to True, so detected SPAs are rendered with Playwright automatically; pass `auto_render_js=False` to require explicit opt-in.
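Teams that want to upgrade without changing crawl output can pin every changed default back in one place. The sketch below only builds the keyword arguments. The parameter names come from these notes, while the `from markcrawl import crawl` import path and the exact `crawl()` signature are assumptions to check against the library.

```python
# Values that reproduce the pre-0.9.0 behavior for each changed default.
LEGACY_DEFAULTS = {
    "auto_extract_title": False,
    "prepend_first_paragraph": False,
    "strip_markdown_links": False,
    "i18n_filter": False,
    "title_at_top": False,
    "auto_render_js": False,  # never promote to Playwright automatically
}


def legacy_kwargs(**overrides: bool) -> dict:
    """Return the legacy flag set, optionally adopting some new defaults."""
    unknown = set(overrides) - set(LEGACY_DEFAULTS)
    if unknown:
        raise TypeError(f"unknown flags: {sorted(unknown)}")
    return {**LEGACY_DEFAULTS, **overrides}


# Assumed usage (import path not confirmed by these notes):
#   from markcrawl import crawl
#   crawl("https://example.com", out_dir="out", **legacy_kwargs())
#
# Adopt one new default at a time while keeping the rest pinned:
#   crawl("https://example.com", out_dir="out",
#         **legacy_kwargs(auto_extract_title=True))
```

Flipping one flag per deploy makes it easier to attribute any change in retrieval quality to a specific default.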
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.