AIMLPM/markcrawl

v0.9.0 Breaking

This release includes 2 breaking changes that platform teams should review before upgrading.

Published 3mo RAG & Retrieval

✓ No known CVEs patched


Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm

markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Default retrieval-optimized pipeline and automatic SPA rendering improve benchmark MRR to 0.803.

Full changelog

Highlights

Benchmark: pool MRR 0.803 on llm-crawler-benchmarks retrieval comparison — up from ~0.70 on v0.8.0's library defaults. Biggest per-site gains: quotes-toscrape +0.125, wikipedia +0.100, blog +0.062.

Two independent changes drive the lift:

1. MRR-optimized recipe is now the library default

v0.6.0 added flags for the winning RAG recipe but left their defaults False. v0.9.0 flips them all True, so a user calling crawl(url, out_dir=...) with zero extra flags now gets the retrieval-optimized pipeline out of the box:

| Flag | Was | Now |
|---|---|---|
| auto_extract_title | False | True |
| prepend_first_paragraph | False | True |
| strip_markdown_links | False | True |
| i18n_filter | False | True |
| title_at_top | False | True |

Migration notes: these are behavioral defaults; opt out via the corresponding parameters or CLI flags if you relied on the old behavior. In particular, i18n_filter=True skips URL paths under locale segments (/fr/, /de-DE/, /zh-Hans/), so pass i18n_filter=False if you crawl localized content intentionally.
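To make the locale-segment rule concrete, here is a minimal sketch of how a path could be tested for segments such as /fr/, /de-DE/, or /zh-Hans/. The regular expression and the function name `has_locale_segment` are assumptions for illustration; the actual rule used by markcrawl is not described in these notes and may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern (not taken from markcrawl): a path segment that looks
# like a locale tag, e.g. "fr", "de-DE", or "zh-Hans".
_LOCALE_SEGMENT = re.compile(r"^[a-z]{2}(?:-(?:[A-Z]{2}|[A-Z][a-z]{3}))?$")


def has_locale_segment(url: str) -> bool:
    """Return True if any path segment of `url` looks like a locale tag."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return any(_LOCALE_SEGMENT.match(s) for s in segments)
```

A pattern this naive also matches unrelated two-letter segments such as /go/ or /js/, which is one reason to check a sample of skipped URLs before relying on the new default.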

2. Smart Playwright integration

Crawls of JS-heavy SPA sites now work without explicit --render-js:

- markcrawl.js_detect: conservative SPA detection (framework marker + low visible-text ratio, both required; protects SSR sites that use id="root" as convention).
- markcrawl.dom_cleanup: overlay stripping (cookie banners, modals, newsletter popups, sticky CTAs) before extraction.
- New default auto_render_js=True: probes the base URL via HTTP on start; auto-promotes to Playwright when SPA is detected.
- Playwright wait strategy switched from domcontentloaded to load + 500ms hydration pause. networkidle remains off by default (modern sites never idle). Screenshot capture path unchanged.
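The "both required" rule for SPA detection can be sketched as follows. This is an illustration of the idea only, not the code in markcrawl.js_detect: the marker list, the 5% threshold, and the function names are all assumptions.

```python
import re

# Hypothetical markers and threshold, chosen for illustration only.
FRAMEWORK_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__", "ng-version")
MAX_TEXT_RATIO = 0.05

_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
_TAG = re.compile(r"<[^>]+>")


def visible_text_ratio(html: str) -> float:
    """Share of the raw HTML that is non-whitespace text outside tags."""
    if not html:
        return 0.0
    # Drop script/style bodies first, then all remaining tags.
    text = _TAG.sub(" ", _SCRIPT_STYLE.sub(" ", html))
    visible = "".join(text.split())
    return len(visible) / len(html)


def looks_like_spa(html: str) -> bool:
    """Require BOTH a framework marker and a low visible-text ratio."""
    has_marker = any(marker in html for marker in FRAMEWORK_MARKERS)
    return has_marker and visible_text_ratio(html) < MAX_TEXT_RATIO
```

Requiring both signals is what keeps a server-rendered page with a `<div id="root">` wrapper on the plain HTTP path: the marker is present, but the page already carries its text.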

Migration notes: if you'd rather the crawler never touch Playwright without explicit opt-in, pass auto_render_js=False.
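For teams that want to pin the old behavior during an upgrade, one option is to collect the opt-outs from both migration notes in a single place. The flag names below come from these release notes; the helper `legacy_kwargs` is hypothetical, and the import path in the usage comment is an assumption.

```python
# Flag names as listed in the release notes; False restores the
# pre-0.9.0 behavior described for each of them.
LEGACY_DEFAULTS = {
    "auto_extract_title": False,
    "prepend_first_paragraph": False,
    "strip_markdown_links": False,
    "i18n_filter": False,
    "title_at_top": False,
    "auto_render_js": False,
}


def legacy_kwargs(**overrides):
    """Return the opt-out flags, with selected ones optionally overridden."""
    unknown = set(overrides) - set(LEGACY_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown flag(s): {sorted(unknown)}")
    return {**LEGACY_DEFAULTS, **overrides}


# Usage sketch (assumes `crawl` is importable from the top-level package):
#   from markcrawl import crawl
#   crawl("https://example.com", out_dir="out", **legacy_kwargs())
```

Adopting the new defaults one flag at a time, for example `legacy_kwargs(title_at_top=True)`, makes it easier to attribute any change in retrieval quality to a specific flag.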

Benchmark summary

| Metric | v0.9.0 | v0.8.0 library default | Current leaderboard #1 |
|---|---|---|---|
| Pool MRR (embedding) | 0.803 | ~0.70 | crawlee 0.733 |
| Hit@1 | 78% | ~67% | — |
| Speed | ~12 p/s | ~12 p/s | markcrawl |
| Extraction quality | 99% | 99% | markcrawl |
| Cost at scale | $4,505/yr | $4,505/yr | markcrawl |
| Answer quality | 4.52/5 | 4.52/5 | colly+md 4.53 (tied) |
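
For readers less familiar with the first two rows, the sketch below computes MRR and Hit@1 using their standard definitions. How the benchmark pools queries across sites is not described in these notes, so the input here is simply one rank per query; the function names and sample ranks are illustrative.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries.

    ranks[i] is the 1-based rank of the first relevant result for query i,
    or None if no relevant result was retrieved (contributes 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum((1.0 / r) if r else 0.0 for r in ranks) / len(ranks)


def hit_at_1(ranks: Sequence[Optional[int]]) -> float:
    """Fraction of queries whose first relevant result is ranked first."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r == 1) / len(ranks)


# Four queries: two hits at rank 1, one at rank 2, one miss.
sample_ranks = [1, 1, 2, None]
print(mean_reciprocal_rank(sample_ranks))  # 0.625
print(hit_at_1(sample_ranks))              # 0.5
```

MRR is always at least as large as Hit@1 because results ranked second or lower still contribute partial credit, which is consistent with the 0.803 and 78% figures in the table.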

Tests

+23 new unit tests (13 js_detect + 10 dom_cleanup). Full suite 292 pass.

Install

pip install 'markcrawl[js]==0.9.0'
playwright install chromium

Breaking Changes

Default flag values for `auto_extract_title`, `prepend_first_paragraph`, `strip_markdown_links`, `i18n_filter`, and `title_at_top` changed from False to True.
`auto_render_js` now defaults to True, enabling automatic Playwright rendering for detected SPAs; pass `auto_render_js=False` to opt out.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
