
AIMLPM/markcrawl

v0.9.1 Breaking

This release includes 2 breaking changes that platform teams should review before upgrading.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Affected surfaces

breaking_upgrade

Summary

AI summary

Reverts six library defaults to False, restoring v0.8.0 behavior after the v0.9.0 defaults caused a regression on public benchmarks.

Full changelog

Hotfix

v0.9.0 flipped six library defaults to True. On the internal 11-site pool this looked good (+0.014 MRR over champion). On the public llm-crawler-benchmarks pool (different sites), it caused a significant regression — both MRR and crawl speed dropped, and at least one site (huggingface-transformers) went to 0.000 MRR.

v0.9.1 reverts all six defaults to False, restoring v0.8.0 behavior.
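
For readers unfamiliar with the metric: MRR (mean reciprocal rank) averages 1/rank of the first relevant result over a set of queries, so a site at 0.000 MRR is one where no query surfaced a relevant result at all. A minimal sketch of the computation, assuming one rank per query and `None` for a miss. The function name and sample ranks are illustrative and are not taken from the markcrawl benchmark harness.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(first_relevant_ranks: Sequence[Optional[int]]) -> float:
    """Average 1/rank over queries; a miss (None) contributes 0."""
    if not first_relevant_ranks:
        raise ValueError("need at least one query")
    total = 0.0
    for rank in first_relevant_ranks:
        if rank is not None:
            if rank < 1:
                raise ValueError("ranks are 1-based")
            total += 1.0 / rank
    return total / len(first_relevant_ranks)


# Illustrative ranks only: hits at positions 1, 2, and 4, plus one miss.
healthy_site = [1, 2, None, 4]
# Every query misses, which is what a 0.000 MRR site looks like.
broken_site = [None, None, None]

print(mean_reciprocal_rank(healthy_site))  # 0.4375
print(mean_reciprocal_rank(broken_site))   # 0.0
```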

What changed

| Default | v0.9.0 | v0.9.1 (this) |
|---|---|---|
| auto_extract_title | True | False (reverted) |
| prepend_first_paragraph | True | False (reverted) |
| strip_markdown_links | True | False (reverted) |
| i18n_filter | True | False (reverted) |
| title_at_top | True | False (reverted) |
| auto_render_js | True | False (reverted) |

The new modules from v0.9.0 — markcrawl.js_detect (SPA detection) and markcrawl.dom_cleanup (overlay stripping) — remain available as opt-in. Pass the corresponding flags or call the modules directly to use them.

Migration

If you upgraded to v0.9.0 and saw degraded behavior, upgrading to v0.9.1 restores the prior behavior with zero changes needed on your side.

If you intentionally relied on the v0.9.0 defaults, opt back in explicitly:

```python
from markcrawl.core import crawl

result = crawl(
    base_url=...,
    out_dir=...,
    i18n_filter=True,
    title_at_top=True,
    auto_render_js=True,
)
# If you call the chunker directly, also pass auto_extract_title=True etc. to it
```
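
To restore the v0.9.0 defaults from one place, the flags can be kept in plain dictionaries and unpacked into the calls. This is a sketch under assumptions: the release notes only confirm that `crawl` accepts `i18n_filter`, `title_at_top`, and `auto_render_js`. Routing the other three flags to the chunker is inferred from the comment in the example above, and the dictionary and helper names are hypothetical.

```python
# Flags the release notes show being passed to markcrawl.core.crawl.
V090_CRAWL_FLAGS = {
    "i18n_filter": True,
    "title_at_top": True,
    "auto_render_js": True,
}

# Assumption: the remaining reverted defaults are chunker options.
V090_CHUNKER_FLAGS = {
    "auto_extract_title": True,
    "prepend_first_paragraph": True,
    "strip_markdown_links": True,
}


def with_v090_defaults(flags: dict, **overrides: bool) -> dict:
    """Return a copy of a flag set with per-call overrides applied."""
    unknown = set(overrides) - set(flags)
    if unknown:
        raise KeyError(f"not a reverted default: {sorted(unknown)}")
    return {**flags, **overrides}


# Example: keep the v0.9.0 crawl behavior but skip JS rendering.
crawl_kwargs = with_v090_defaults(V090_CRAWL_FLAGS, auto_render_js=False)
print(crawl_kwargs)
# Then, with real values in place of the elided arguments:
# result = crawl(base_url=..., out_dir=..., **crawl_kwargs)
```

Keeping the opt-ins explicit also makes the configuration independent of future default changes.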

Root cause

Under investigation. The internal-pool +0.014 lift hid that some defaults misbehave on out-of-distribution sites. Likely culprits:

  • auto_render_js=True falsely flagging SSR sites with heavy inline scripts as SPAs, forcing the slower Playwright path
  • Overlay stripping potentially removing legitimate content on some sites
  • One or more chunker defaults interacting poorly with sites we hadn't tested
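
The pooled +0.014 lift shows why an average alone is a weak release gate: a few per-site wins can mask one site that collapses. One way to catch this is to compare scores per site and flag any large individual drop. The sketch below uses made-up site names and scores, and the 0.05 threshold is an arbitrary assumption, not a value from the markcrawl benchmarks.

```python
from statistics import mean


def per_site_regressions(baseline: dict, candidate: dict, max_drop: float = 0.05) -> dict:
    """Return {site: delta} for sites whose score fell by more than max_drop."""
    shared = baseline.keys() & candidate.keys()
    return {
        site: round(candidate[site] - baseline[site], 3)
        for site in sorted(shared)
        if baseline[site] - candidate[site] > max_drop
    }


# Hypothetical MRR per site, before and after a change to defaults.
baseline = {"site-a": 0.60, "site-b": 0.55, "site-c": 0.50}
candidate = {"site-a": 0.86, "site-b": 0.83, "site-c": 0.00}

# The pooled average still rises slightly even though site-c went to zero.
pooled_delta = mean(candidate.values()) - mean(baseline.values())
print(f"pooled delta: {pooled_delta:+.3f}")
print(per_site_regressions(baseline, candidate))
```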

Will publish a detailed post-mortem once the diagnosis is in. v0.9.1 is the safe state until then.

Install

```bash
pip install 'markcrawl[js]==0.9.1'
```
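
After installing, it can be worth confirming which version is active in the environment, since a leftover 0.9.0 install would still carry the flipped defaults. A small sketch using the standard library's `importlib.metadata`. The helper names are illustrative, and it assumes the distribution is registered as `markcrawl`, matching the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version

AFFECTED_VERSION = "0.9.0"  # the release that shipped the flipped defaults


def installed_version(dist_name: str = "markcrawl"):
    """Return the installed version string, or None if not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def has_flipped_defaults(found) -> bool:
    """True only for the exact release these notes identify as affected."""
    return found == AFFECTED_VERSION


found = installed_version()
if found is None:
    print("markcrawl is not installed in this environment")
elif has_flipped_defaults(found):
    print("found 0.9.0: upgrade to 0.9.1 or set the flags explicitly")
else:
    print(f"found markcrawl {found}")
```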

Breaking Changes

  • Default values for `auto_extract_title`, `prepend_first_paragraph`, `strip_markdown_links`, `i18n_filter`, `title_at_top`, and `auto_render_js` reverted from True to False.
  • Behavioral change: Previous v0.9.0 defaults caused significant MRR and crawl speed regressions on external benchmark pools.


About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
