AIMLPM/markcrawl

v0.6.0 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Opt-in RAG recipe improves MRR to 0.8148, adding five new flags and three retrieval bug fixes.

Full changelog

What's new

Opt-in RAG recipe that reaches 0.8148 MRR on the llm-crawler-benchmarks harness (OpenAI text-embedding-3-small, 57 hand-authored queries across 4 sites) — +0.18 over the v0.5.0 default and +0.08 over the next-best crawler (crawlee at 0.733). Every flag defaults to False — existing callers see zero behaviour change.

Recipe ingredients (all five patches generic, content-agnostic — no per-domain config)

| Flag | Layer | What it does |
|---|---|---|
| --i18n-filter | crawl | Skip URL paths under locale segments (/fr/, /de-DE/, /zh-Hans/) via generic ISO-639-1 detection |
| --title-at-top | crawl | Prepend # {title} to the text field of every pages.jsonl row |
| auto_extract_title=True | chunker | Self-extract first # heading as page_title when caller passes None |
| prepend_first_paragraph=True | chunker | Prepend the page's first prose paragraph to every output chunk as a "lead summary" |
| strip_markdown_links=True | chunker | Rewrite [anchor](url) to just anchor before chunking |
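
To make the `--i18n-filter` row concrete, here is a minimal sketch of locale-segment detection on URL paths. It is not markcrawl's implementation: the function name `has_locale_segment`, the regex, and the short `SAMPLE_LANG_CODES` set are all assumptions made for illustration, and a real filter would need the full ISO-639-1 list and a policy for the site's default language.

```python
import re
from urllib.parse import urlparse

# Hypothetical sample only; the full ISO-639-1 list has far more entries.
SAMPLE_LANG_CODES = {"fr", "de", "es", "it", "pt", "ja", "ko", "ru", "zh"}

# Two-letter language code, optionally followed by a region (-DE)
# or a script (-Hans) subtag. Assumed shape, not markcrawl's regex.
_LOCALE_SEGMENT = re.compile(r"^([a-z]{2})(?:-(?:[A-Z]{2}|[A-Z][a-z]{3}))?$")


def has_locale_segment(url: str) -> bool:
    """Return True if any path segment of `url` looks like a locale tag."""
    for segment in urlparse(url).path.split("/"):
        match = _LOCALE_SEGMENT.match(segment)
        if match and match.group(1) in SAMPLE_LANG_CODES:
            return True
    return False


urls = [
    "https://docs.example.com/fr/guide",
    "https://docs.example.com/de-DE/intro",
    "https://docs.example.com/zh-Hans/start",
    "https://docs.example.com/guide/install",
]
# Keep only the URLs that do not sit under a locale segment.
kept = [u for u in urls if not has_locale_segment(u)]
```

Checking the language code against a known set, rather than accepting any two-letter segment, avoids dropping unrelated short paths such as `/js/` or `/go/`.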

Measured ablation (deterministic, PYTHONHASHSEED=0, concurrency=1)

| Config | MRR | Δ vs raw | quotes | books | fastapi | python |
|---|---|---|---|---|---|---|
| Raw v0.5.0 baseline | 0.6309 | — | 0.375 | 0.904 | 0.728 | 0.493 |
| --i18n-filter --title-at-top + auto_extract_title | 0.7718 | +0.141 | 0.375 | 1.000 | 0.877 | 0.708 |
| + strip_markdown_links | 0.7868 | +0.156 | 0.375 | 1.000 | 0.902 | 0.727 |
| + prepend_first_paragraph (full v0.6.0 recipe) | 0.8148 | +0.184 | 0.375 | 1.000 | 0.931 | 0.781 |

The prepend_first_paragraph patch alone on the raw baseline regresses MRR by 0.03 — the compound synergy with strip_markdown_links is what unlocks the final +0.028 jump.
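
For readers unfamiliar with the metric, the sketch below shows how mean reciprocal rank is computed and how the Δ column follows from the MRR column. It is a generic illustration, not the code in bench/eval_mrr.py; the function names, the toy rankings, and the assumption of one relevant document per query are all hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(results):
    """Average reciprocal rank over (ranked_ids, relevant_id) pairs."""
    if not results:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in results) / len(results)


# Toy queries (made up): hit at rank 1, hit at rank 2, and a miss.
toy_results = [
    (["a", "b", "c"], "a"),
    (["a", "b", "c"], "b"),
    (["a", "b", "c"], "z"),
]
toy_mrr = mean_reciprocal_rank(toy_results)  # (1.0 + 0.5 + 0.0) / 3

# Reproduce the delta column from the MRR values reported in the table.
BASELINE = 0.6309
reported = [0.7718, 0.7868, 0.8148]
deltas = [round(mrr - BASELINE, 3) for mrr in reported]
last_step = round(reported[-1] - reported[-2], 3)  # gain from the final flag
```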

Usage

# CLI: crawl-level flags
markcrawl --base https://docs.example.com --out ./output \
    --i18n-filter --title-at-top --show-progress

# Library: chunker-level flags
from markcrawl.chunker import chunk_markdown

chunks = chunk_markdown(
    markdown_text,
    auto_extract_title=True,
    prepend_first_paragraph=True,
    strip_markdown_links=True,
)
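
As a rough picture of what the three chunker flags in the call above do to the text, the following standalone sketch approximates each transformation with the standard library only. None of it is markcrawl's code: the helper names (`strip_links`, `extract_title`, `first_paragraph`) and the regexes are assumptions, and leaving image syntax untouched is a guess about the intended behaviour.

```python
import re

# [anchor](url) -> anchor. The lookbehind skips images such as ![alt](src);
# whether markcrawl treats images this way is an assumption.
_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)]*)\)")
_H1 = re.compile(r"^#\s+(.+?)\s*$")


def strip_links(md: str) -> str:
    """Replace inline Markdown links with their anchor text."""
    return _LINK.sub(r"\1", md)


def extract_title(md: str):
    """Return the text of the first '# ' heading, or None.

    Simplification: this does not skip fenced code blocks, so a '# comment'
    inside a code block that precedes the real heading would be picked up.
    """
    for line in md.splitlines():
        match = _H1.match(line)
        if match:
            return match.group(1)
    return None


def first_paragraph(md: str):
    """Return the first block that looks like prose, or None."""
    non_prose = ("#", "`" * 3, "~~~", "|", "- ", "* ", ">")
    for block in re.split(r"\n\s*\n", md):
        block = block.strip()
        if block and not block.startswith(non_prose):
            return block
    return None


page = (
    "# Install guide\n\n"
    "See the [quick start](https://docs.example.com/qs) first.\n\n"
    "## Steps\n\n"
    "Run it.\n"
)
title = extract_title(page)
lead = first_paragraph(strip_links(page))
```

Stripping links before picking the lead paragraph keeps URLs out of the text that gets prepended to every chunk, which is one plausible reading of why the two flags work better together than alone.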

Also included (from #17)

Three retrieval-focused bug fixes found while building the MRR harness:

  • Fenced code blocks no longer trigger spurious heading splits — the chunker now tracks ``` and ~~~ fence state line-by-line so # inside a code block isn't treated as a heading.
  • Breadcrumb chunk prefix (Section: Page > H1 > H2) replaces the old [Page: title] tag, carrying full ancestor context into every chunk. Held-out fastapi_tutorial MRR: 0.3203 → 0.5037 (+57% relative).
  • Link-heavy sidebars stop leaking into content — MkDocs / Sphinx / Docusaurus nav menus that contained inline <code> labels (/openapi.json, /docs, /redoc) no longer pass the content-substantiality check.
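
The first two fixes can be illustrated together: a line-by-line scan that tracks fence state so `#` inside code is ignored, while a heading stack produces the `Section: Page > H1 > H2` breadcrumb. This is a simplified sketch under stated assumptions, not the markcrawl chunker: `heading_breadcrumbs` and its regexes are hypothetical, and the fence handling covers only the basic CommonMark rule (same fence character, closing fence at least as long as the opening one, no trailing text).

```python
import re

_FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")
_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def heading_breadcrumbs(md: str, page: str):
    """Return one breadcrumb string per real heading, skipping fenced code."""
    open_fence = None  # marker string of the currently open fence, if any
    stack = []         # (level, text) for each ancestor heading
    crumbs = []
    for line in md.splitlines():
        fence = _FENCE.match(line)
        if open_fence is not None:
            # Inside a code block: only a matching closing fence ends it.
            if (fence and fence.group(1)[0] == open_fence[0]
                    and len(fence.group(1)) >= len(open_fence)
                    and not fence.group(2).strip()):
                open_fence = None
            continue
        if fence:
            open_fence = fence.group(1)
            continue
        heading = _HEADING.match(line)
        if not heading:
            continue
        level = len(heading.group(1))
        # Drop siblings and deeper headings before pushing the new one.
        while stack and stack[-1][0] >= level:
            stack.pop()
        stack.append((level, heading.group(2)))
        crumbs.append("Section: " + " > ".join([page] + [t for _, t in stack]))
    return crumbs


TICKS = "`" * 3  # built at runtime so this example nests cleanly in Markdown
sample = "\n".join([
    "# Guide", "## Setup",
    TICKS + "bash", "# not a heading", TICKS,
    "## Usage",
    "~~~", "# also not a heading", "~~~",
    "### Flags",
])
crumbs = heading_breadcrumbs(sample, page="Docs")
```

With the sample above, the two `#` lines inside fences produce no breadcrumb, and `### Flags` inherits `Guide` and `Usage` as ancestors.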

Identifier characters (_, *) are preserved in output Markdown so my_function, *args, **kwargs tokenise correctly for retrieval.

New research infrastructure

  • bench/eval_mrr.py — MRR evaluation harness with local sentence-transformers (no cloud dependency) and optional OpenAI.
  • bench/autoresearch.py — MRR-driven autoresearch loop with regression guards.
  • bench/fixtures/ — 29 training queries across 4 fixtures.
  • bench/heldout/ — FastAPI tutorial + quotes.toscrape for out-of-loop validation.

Install / upgrade

pip install --upgrade markcrawl

Full changelog: https://github.com/AIMLPM/markcrawl/compare/v0.5.0...v0.6.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
