This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Opt-in RAG recipe improves MRR to 0.8148, adding five new flags and three retrieval bug fixes.
Full changelog
What's new
Opt-in RAG recipe that reaches 0.8148 MRR on the llm-crawler-benchmarks harness (OpenAI text-embedding-3-small, 57 hand-authored queries across 4 sites) — +0.18 over the v0.5.0 default and +0.08 over the next-best crawler (crawlee at 0.733). Every flag defaults to False — existing callers see zero behaviour change.
Recipe ingredients (all five patches generic, content-agnostic — no per-domain config)
| Flag | Layer | What it does |
|---|---|---|
| --i18n-filter | crawl | Skip URL paths under locale segments (/fr/, /de-DE/, /zh-Hans/) via generic ISO-639-1 detection |
| --title-at-top | crawl | Prepend # {title} to the text field of every pages.jsonl row |
| auto_extract_title=True | chunker | Self-extract first # heading as page_title when caller passes None |
| prepend_first_paragraph=True | chunker | Prepend the page's first prose paragraph to every output chunk as a "lead summary" |
| strip_markdown_links=True | chunker | Rewrite [anchor](url) to just anchor before chunking |
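The table describes each flag only by its effect. As a rough illustration of three of those effects (locale-segment detection, first-heading extraction, and link rewriting), here is a minimal sketch. It is not markcrawl's implementation: the helper names `is_locale_path`, `first_h1`, and `strip_links`, the regular expressions, and the small language-code set are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical subset of ISO 639-1 codes; a real filter would use the full list.
_LANGUAGE_CODES = {"de", "es", "fr", "it", "ja", "ko", "pt", "ru", "zh"}

# Two-letter language code with an optional region ("de-DE") or script ("zh-Hans") subtag.
_LOCALE_SEGMENT = re.compile(r"^([a-z]{2})(?:-(?:[A-Z]{2}|[A-Z][a-z]{3}))?$")

# Inline Markdown link [anchor](url); the lookbehind leaves images ![alt](src) alone.
_MD_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")

# First-level ATX heading: a single "#" followed by at least one space.
_H1 = re.compile(r"^# +(.+?)\s*$", re.MULTILINE)


def is_locale_path(path: str) -> bool:
    """Return True if any URL path segment looks like a locale tag."""
    for segment in path.split("/"):
        match = _LOCALE_SEGMENT.match(segment)
        if match and match.group(1) in _LANGUAGE_CODES:
            return True
    return False


def first_h1(markdown: str) -> Optional[str]:
    """Return the text of the first "# heading" line, or None if there is none."""
    match = _H1.search(markdown)
    return match.group(1) if match else None


def strip_links(markdown: str) -> str:
    """Rewrite [anchor](url) to just anchor."""
    return _MD_LINK.sub(r"\1", markdown)
```

Checking the leading two letters against a known code list, rather than accepting any two-letter segment, keeps paths such as `/js/` from being mistaken for a locale. Note that this simplified `first_h1` does not skip headings inside fenced code blocks.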
Measured ablation (deterministic, PYTHONHASHSEED=0, concurrency=1)
| Config | MRR | Δ vs raw | quotes | books | fastapi | python |
|---|---|---|---|---|---|---|
| Raw v0.5.0 baseline | 0.6309 | — | 0.375 | 0.904 | 0.728 | 0.493 |
| --i18n-filter --title-at-top + auto_extract_title | 0.7718 | +0.141 | 0.375 | 1.000 | 0.877 | 0.708 |
| + strip_markdown_links | 0.7868 | +0.156 | 0.375 | 1.000 | 0.902 | 0.727 |
| + prepend_first_paragraph (full v0.6.0 recipe) | 0.8148 | +0.184 | 0.375 | 1.000 | 0.931 | 0.781 |
The prepend_first_paragraph patch alone on the raw baseline regresses MRR by 0.03 — the compound synergy with strip_markdown_links is what unlocks the final +0.028 jump.
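For readers unfamiliar with the metric, the sketch below shows the standard definition of mean reciprocal rank: each query scores 1/rank of its first relevant result (0 if it is missing), and the scores are averaged. It assumes one relevant document per query and no rank cutoff; the benchmark harness may define relevance differently, and the function names here are illustrative only.

```python
from typing import Sequence, Tuple


def reciprocal_rank(ranked_ids: Sequence[str], relevant_id: str) -> float:
    """Return 1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(results: Sequence[Tuple[Sequence[str], str]]) -> float:
    """Average the reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not results:
        return 0.0
    total = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in results)
    return total / len(results)


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
example = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "z"),
]
example_mrr = mean_reciprocal_rank(example)  # (1 + 0.5 + 0) / 3
```

Under this definition, an MRR of 0.8148 means the first relevant chunk typically sits at or very near the top of the ranking.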
Usage
```bash
# CLI: crawl-level flags
markcrawl --base https://docs.example.com --out ./output \
  --i18n-filter --title-at-top --show-progress
```

```python
# Library: chunker-level flags
from markcrawl.chunker import chunk_markdown

chunks = chunk_markdown(
    markdown_text,
    auto_extract_title=True,
    prepend_first_paragraph=True,
    strip_markdown_links=True,
)
```
Also included (from #17)
Three retrieval-focused bug fixes found while building the MRR harness:
- Fenced code blocks no longer trigger spurious heading splits. The chunker now tracks ``` and ~~~ fence state line by line, so a # inside a code block isn't treated as a heading.
- Breadcrumb chunk prefix (Section: Page > H1 > H2) replaces the old [Page: title] tag, carrying full ancestor context into every chunk. Held-out fastapi_tutorial MRR: 0.3203 → 0.5037 (+57% relative).
- Link-heavy sidebars stop leaking into content. MkDocs / Sphinx / Docusaurus nav menus that contained inline <code> labels (/openapi.json, /docs, /redoc) no longer pass the content-substantiality check.
Identifier characters (_, *) are preserved in output Markdown so my_function, *args, **kwargs tokenise correctly for retrieval.
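The first two fixes can be sketched together: track fence state line by line so that `#` lines inside code blocks are ignored, then turn the surviving headings into breadcrumb prefixes of the form `Section: Page > H1 > H2`. This is an illustration under simplifying assumptions, not the chunker's actual code. The function names are hypothetical, and a closing fence is matched on its character only, not on its length or the absence of an info string.

```python
import re
from typing import List, Optional, Tuple

_FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")
_HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def headings_outside_fences(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) for each ATX heading that is not inside a code fence."""
    headings: List[Tuple[int, str]] = []
    open_fence: Optional[str] = None  # "`" or "~" while inside a fenced block
    for line in markdown.splitlines():
        fence = _FENCE.match(line)
        if fence:
            marker = fence.group(1)[0]
            if open_fence is None:
                open_fence = marker   # opening fence
            elif marker == open_fence:
                open_fence = None     # matching closing fence
            continue
        if open_fence is None:
            heading = _HEADING.match(line)
            if heading:
                headings.append((len(heading.group(1)), heading.group(2)))
    return headings


def breadcrumbs(page_title: str, headings: List[Tuple[int, str]]) -> List[str]:
    """Build one 'Section: Page > H1 > H2' prefix per heading."""
    stack: List[Tuple[int, str]] = []
    prefixes: List[str] = []
    for level, title in headings:
        # Drop headings at the same or deeper level; they are not ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, title))
        trail = [page_title] + [name for _, name in stack]
        prefixes.append("Section: " + " > ".join(trail))
    return prefixes
```

Remembering which fence character opened the block means a `~~~` line inside a backtick-fenced block does not close it.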
New research infrastructure
- bench/eval_mrr.py: MRR evaluation harness with local sentence-transformers (no cloud dependency) and optional OpenAI.
- bench/autoresearch.py: MRR-driven autoresearch loop with regression guards.
- bench/fixtures/: 29 training queries across 4 fixtures.
- bench/heldout/: FastAPI tutorial + quotes.toscrape for out-of-loop validation.
Install / upgrade
pip install --upgrade markcrawl
Full changelog: https://github.com/AIMLPM/markcrawl/compare/v0.5.0...v0.6.0
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta (feedback welcome)