AIMLPM/markcrawl

v0.6.0 Feature

This release adds two notable changes for engineering teams evaluating rollout: an opt-in RAG recipe built from five new flags, and a set of retrieval-focused bug fixes.

Published 3mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm

markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Opt-in RAG recipe improves MRR to 0.8148, adding five new flags and three retrieval bug fixes.

Full changelog

What's new

An opt-in RAG recipe reaches 0.8148 MRR on the llm-crawler-benchmarks harness (OpenAI text-embedding-3-small, 57 hand-authored queries across 4 sites): +0.18 over the v0.5.0 default and +0.08 over the next-best crawler (crawlee at 0.733). Every flag defaults to False, so existing callers see no behaviour change.

Recipe ingredients (all five patches generic, content-agnostic — no per-domain config)

| Flag | Layer | What it does |
|---|---|---|
| --i18n-filter | crawl | Skip URL paths under locale segments (/fr/, /de-DE/, /zh-Hans/) via generic ISO-639-1 detection |
| --title-at-top | crawl | Prepend # {title} to the text field of every pages.jsonl row |
| auto_extract_title=True | chunker | Self-extract first # heading as page_title when caller passes None |
| prepend_first_paragraph=True | chunker | Prepend the page's first prose paragraph to every output chunk as a "lead summary" |
| strip_markdown_links=True | chunker | Rewrite [anchor](url) to just anchor before chunking |
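To make the table concrete, here is a minimal sketch of what three of these transformations could look like. It is not markcrawl's implementation: the function names, the regular expressions, and the deliberately small `LANGUAGE_CODES` subset are all assumptions chosen for illustration.

```python
import re

# Illustrative subset of ISO 639-1 codes (assumption: the real filter
# presumably knows the full list and may treat edge cases differently).
LANGUAGE_CODES = {"ar", "de", "es", "fr", "it", "ja", "ko", "pt", "ru", "zh"}
SUBTAG = re.compile(r"[A-Za-z]{2}|[A-Za-z]{4}")  # region (DE) or script (Hans)

def has_locale_segment(path: str) -> bool:
    """Return True if any path segment looks like fr, de-DE or zh-Hans."""
    for segment in path.split("/"):
        parts = re.split(r"[-_]", segment)
        if parts[0].lower() not in LANGUAGE_CODES or len(parts) > 2:
            continue
        if len(parts) == 1 or SUBTAG.fullmatch(parts[1]):
            return True
    return False

# Inline links only; the lookbehind leaves image syntax ![alt](src) alone.
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")

def strip_links(text: str) -> str:
    """Rewrite [anchor](url) to just anchor."""
    return INLINE_LINK.sub(r"\1", text)

def extract_title(text: str):
    """Return the text of the first level-1 heading, or None if absent."""
    match = re.search(r"^#[ \t]+(.+?)[ \t]*$", text, re.MULTILINE)
    return match.group(1) if match else None

print(has_locale_segment("/de-DE/api"))                      # True
print(strip_links("See [the docs](https://example.com)."))   # See the docs.
print(extract_title("intro\n# Hello World\n## Sub"))         # Hello World
```

A segment-based check like this can misfire on paths that merely resemble a locale tag, which is one reason a production filter would likely need more care than the sketch shows.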

Measured ablation (deterministic, PYTHONHASHSEED=0, concurrency=1)

| Config | MRR | Δ vs raw | quotes | books | fastapi | python |
|---|---|---|---|---|---|---|
| Raw v0.5.0 baseline | 0.6309 | — | 0.375 | 0.904 | 0.728 | 0.493 |
| --i18n-filter --title-at-top + auto_extract_title | 0.7718 | +0.141 | 0.375 | 1.000 | 0.877 | 0.708 |
| + strip_markdown_links | 0.7868 | +0.156 | 0.375 | 1.000 | 0.902 | 0.727 |
| + prepend_first_paragraph (full v0.6.0 recipe) | 0.8148 | +0.184 | 0.375 | 1.000 | 0.931 | 0.781 |

Applied alone to the raw baseline, the prepend_first_paragraph patch regresses MRR by 0.03; only in combination with strip_markdown_links does it deliver the final +0.028 gain (0.7868 → 0.8148).
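For readers unfamiliar with the metric, mean reciprocal rank averages 1/rank of the first relevant result over all queries, scoring 0 when nothing relevant is returned. The sketch below uses the standard definition; the function name and the dictionary-based input shape are assumptions, and the benchmark harness may differ in details such as a rank cutoff.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings: query -> ordered list of result ids.
    relevant: query -> set of ids that count as correct."""
    total = 0.0
    for query, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant[query]:
                total += 1.0 / rank  # only the first hit contributes
                break
    return total / len(rankings)

# Hypothetical toy data: hit at rank 1, hit at rank 2, and a miss.
rankings = {
    "q1": ["a", "b", "c"],
    "q2": ["a", "b", "c"],
    "q3": ["a", "b", "c"],
}
relevant = {"q1": {"a"}, "q2": {"b"}, "q3": {"z"}}

print(mean_reciprocal_rank(rankings, relevant))  # 0.5
```

Because each query contributes at most 1.0, a per-site score of 1.000 (as in the books column) means every query's first result was relevant.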

Usage

```bash
# CLI: crawl-level flags
markcrawl --base https://docs.example.com --out ./output \
    --i18n-filter --title-at-top --show-progress
```

```python
# Library: chunker-level flags
from markcrawl.chunker import chunk_markdown

chunks = chunk_markdown(
    markdown_text,  # Markdown string for a single page
    auto_extract_title=True,
    prepend_first_paragraph=True,
    strip_markdown_links=True,
)
```

Also included (from #17)

Three retrieval-focused bug fixes found while building the MRR harness:

- Fenced code blocks no longer trigger spurious heading splits: the chunker now tracks ``` and ~~~ fence state line-by-line, so a # inside a code block isn't treated as a heading.
- Breadcrumb chunk prefix (Section: Page > H1 > H2) replaces the old [Page: title] tag, carrying full ancestor context into every chunk. Held-out fastapi_tutorial MRR: 0.3203 → 0.5037 (+57% relative).
- Link-heavy sidebars stop leaking into content: MkDocs / Sphinx / Docusaurus nav menus that contained inline <code> labels (/openapi.json, /docs, /redoc) no longer pass the content-substantiality check.

Identifier characters (_, *) are preserved in output Markdown so my_function, *args, **kwargs tokenise correctly for retrieval.
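The first two fixes can be pictured together with a small sketch: scan the Markdown line by line, toggle a fence flag on ``` or ~~~ lines, and build a breadcrumb only for headings seen outside a fence. This is an illustration under assumptions, not the chunker's actual code; `heading_breadcrumbs` and both regular expressions are hypothetical.

```python
import re

FENCE = re.compile(r"^\s{0,3}(`{3,}|~{3,})")
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def heading_breadcrumbs(markdown: str, page_title: str):
    """Return one 'Section: Page > H1 > H2' string per real heading."""
    open_fence = None  # fence character while inside a code block
    stack = []         # (level, text) of the current heading and ancestors
    crumbs = []
    for line in markdown.splitlines():
        fence = FENCE.match(line)
        if fence:
            marker = fence.group(1)[0]
            if open_fence is None:
                open_fence = marker      # entering a code block
            elif marker == open_fence:
                open_fence = None        # leaving it
            continue
        if open_fence is not None:
            continue                     # '#' lines in code are not headings
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()              # drop siblings and deeper headings
            stack.append((level, heading.group(2)))
            path = [page_title] + [text for _, text in stack]
            crumbs.append("Section: " + " > ".join(path))
    return crumbs

fence = "`" * 3  # built programmatically to keep this example readable
sample = "\n".join([
    "# Intro", fence + "bash", "# not a heading", fence,
    "## Setup", "~~~", "# also not a heading", "~~~",
    "### Details", "## Usage",
])
for crumb in heading_breadcrumbs(sample, "Guide"):
    print(crumb)
# Section: Guide > Intro
# Section: Guide > Intro > Setup
# Section: Guide > Intro > Setup > Details
# Section: Guide > Intro > Usage
```

The sketch simplifies fence matching (it ignores fence length when closing), which is enough to show why tracking state line by line prevents comment lines in shell snippets from splitting a chunk.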

New research infrastructure

- bench/eval_mrr.py: MRR evaluation harness with local sentence-transformers (no cloud dependency) and optional OpenAI.
- bench/autoresearch.py: MRR-driven autoresearch loop with regression guards.
- bench/fixtures/: 29 training queries across 4 fixtures.
- bench/heldout/: FastAPI tutorial + quotes.toscrape for out-of-loop validation.

Install / upgrade

pip install --upgrade markcrawl

Full changelog: https://github.com/AIMLPM/markcrawl/compare/v0.5.0...v0.6.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
