AIMLPM/markcrawl

v0.11.1 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 22d RAG & Retrieval
✓ No known CVEs patched
Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

ReleasePort's take

Light signal
editorial:auto 13d

Release v0.11.1 automatically rejects mdBook `/print.html` and Hugo `/_print/` aggregator pages during crawl‑time URL filtering, reducing unnecessary fetches.

Why it matters: New default patterns reject */print.html and */_print URLs pre‑fetch, saving crawl budget; test the updated filter composition in dev before upgrade.

Summary

AI summary

Added URL patterns to reject mdBook and Hugo print aggregator pages by default.

Changes in this release

Feature Medium

Reject mdBook `/print.html` and Hugo `/_print/` pages during crawl-time URL filtering.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

New kwarg `include_aggregator_pages: bool = False` on `crawl()` and engine classes.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

CLI flag `--include-aggregators` mirrors new kwarg behavior.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

User-supplied `exclude_paths` and `include_paths` compose with aggregator filter.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

New default URL patterns rejected pre-fetch: `*/print.html`, `*/_print`, etc.

Source: llm_adapter@2026-05-21

Confidence: low

Feature Medium

Add default rejection patterns: `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Performance Medium

Saves crawl budget by rejecting aggregator pages early.

Source: llm_adapter@2026-05-21

Confidence: high

Deprecation Medium

No breaking changes; existing sites unaffected unless they archive print views.

Source: llm_adapter@2026-05-21

Confidence: low

Refactor Medium

Patterns anchored to avoid over-matching substring cases.

Source: llm_adapter@2026-05-21

Confidence: low

Refactor Low

Anchor URL patterns to prevent over‑matching substrings.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Other Medium

647 passing tests now; 36 new tests added for aggregator filter coverage.

Source: llm_adapter@2026-05-21

Confidence: low

Other Low

Add 36 new tests covering aggregator filter behavior and engine parity.

Source: granite4.1:30b@2026-05-23-audit

Confidence: low

Full changelog

Reject mdBook `/print.html` and Hugo `/_print/` pages during crawl-time URL filtering. Each of these pages renders the entire docs tree on a single URL, which gives it artificially high keyword density and pollutes embedding-based retrieval rankings.

Why

Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five other well-functioning competitors returned 0% /_print/ on kubernetes-docs.

These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

What changed

  • New default URL patterns rejected pre-fetch (saves crawl budget):
    `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`
  • New kwarg `include_aggregator_pages: bool = False` on `crawl()` and both engine classes, for offline-archive use cases.
  • CLI flag `--include-aggregators` mirrors the kwarg.
  • User-supplied `exclude_paths` and `include_paths` still apply independently: the aggregator filter composes with both and replaces neither.
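
To make the composition rule concrete, here is a minimal sketch of how a default aggregator filter could sit alongside user filters. It is not markcrawl's implementation: the helper name `should_fetch`, the glob semantics (via `fnmatchcase`), and the rules that excludes win and that a non-empty `include_paths` acts as an allow-list are all assumptions made for illustration. Only the five patterns and the `include_aggregator_pages` flag come from the release notes.

```python
from fnmatch import fnmatchcase
from typing import Sequence

# Default patterns as listed in the release notes.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def _matches_any(path: str, patterns: Sequence[str]) -> bool:
    """True if the URL path matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def should_fetch(
    path: str,
    include_paths: Sequence[str] = (),
    exclude_paths: Sequence[str] = (),
    include_aggregator_pages: bool = False,
) -> bool:
    """Hypothetical pre-fetch check; each filter is applied independently."""
    # 1. Aggregator filter: on by default, disabled only by the opt-out flag.
    if not include_aggregator_pages and _matches_any(path, AGGREGATOR_PATTERNS):
        return False
    # 2. User excludes still apply, even when aggregators are allowed (assumed).
    if _matches_any(path, exclude_paths):
        return False
    # 3. A non-empty include list is treated as an allow-list (assumed).
    if include_paths and not _matches_any(path, include_paths):
        return False
    return True
```

Under these assumptions, `should_fetch("/book/print.html")` is rejected by default, passes with `include_aggregator_pages=True`, and is rejected again if the caller also supplies `exclude_paths=["/book/*"]`, because no filter overrides another.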

Substring-match safety

Patterns are anchored to avoid over-matching:

| URL | Behavior |
|---|---|
| /book/print.html | rejected (mdBook) |
| /blueprint.html | passes (print is mid-word) |
| /preprint.html | passes (academic content) |
| /imprint/ | passes (legal page) |
| /_print/index.html | rejected (Hugo) |
| /_printer-friendly/css.css | passes (asset path) |
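
The behavior in the table can be reproduced with a single anchored regular expression over the URL path. This is a sketch under stated assumptions, not the library's code: the regex, the helper name `is_aggregator_url`, and the choice to match on the path only (ignoring query strings) are illustrative. The anchoring idea is that `print` must start a path segment and the match must run to the end of the path.

```python
import re
from urllib.parse import urlsplit

# Hypothetical anchored equivalent of the five default glob patterns:
#   */print.html, */_print, */_print/, */_print/*, */print/index.html
# The leading "/" forces "print" or "_print" to begin a path segment, and
# "$" forces the match to reach the end of the path.
_AGGREGATOR_PATH = re.compile(
    r"/(?:print\.html|_print(?:/.*)?|print/index\.html)$"
)


def is_aggregator_url(url: str) -> bool:
    """Return True if the URL path looks like an mdBook or Hugo print page."""
    path = urlsplit(url).path  # drop scheme, host, query, and fragment
    return _AGGREGATOR_PATH.search(path) is not None


# The six rows of the table above, as (url, expected_rejected) pairs.
TABLE_CASES = [
    ("/book/print.html", True),
    ("/blueprint.html", False),
    ("/preprint.html", False),
    ("/imprint/", False),
    ("/_print/index.html", True),
    ("/_printer-friendly/css.css", False),
]

for url, expected in TABLE_CASES:
    assert is_aggregator_url(url) is expected, url
```

A plain substring test such as `"print" in path` would reject all six rows, which is the over-matching that anchoring avoids.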

Expected impact

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).
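
For readers less familiar with the metric, MRR (mean reciprocal rank) averages 1/rank of the first relevant result over all queries. The sketch below is a generic illustration, not the benchmark's code: the function name and the toy URLs are made up, and the numbers say nothing about the predicted lift. It only shows why removing an aggregator page that sits above the wanted chapter raises the score.

```python
from typing import Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
) -> float:
    """Average of 1/rank of the first relevant URL per query (0 if none)."""
    if not rankings:
        return 0.0
    total = 0.0
    for results, wanted in zip(rankings, relevant):
        for rank, url in enumerate(results, start=1):
            if url in wanted:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Toy data (hypothetical): one query whose wanted page is a single chapter.
wanted = [{"/ch04.html"}]
with_aggregator = [["/print.html", "/ch04.html", "/ch05.html"]]
without_aggregator = [["/ch04.html", "/ch05.html"]]

assert mean_reciprocal_rank(with_aggregator, wanted) == 0.5
assert mean_reciprocal_rank(without_aggregator, wanted) == 1.0
```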

Tests

647 passing (was 611); 36 new tests in `tests/test_v011_1_aggregator_filter.py` covering default rejection, substring safety, opt-out flag, composition with user filters, and `CrawlEngine` + `AsyncCrawlEngine` parity.

Migration

No breaking changes. Default behavior is unchanged on sites that don't generate aggregator pages. Users archiving offline docs that include print views should pass `include_aggregator_pages=True` or `--include-aggregators`.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
