AIMLPM/markcrawl

v0.11.1 Breaking

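This release is flagged as breaking because crawls now skip mdBook and Hugo print aggregator pages by default; the upstream changelog reports no breaking changes for sites that do not generate such pages.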

Published 2mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

ReleasePort's take

Light signal

editorial:auto 2mo

Release v0.11.1 automatically rejects mdBook `/print.html` and Hugo `/_print/` aggregator pages during crawl‑time URL filtering, reducing unnecessary fetches.

Why it matters: URLs matching */print.html and */_print are now rejected before fetch by default, which saves crawl budget. Test how the new filter composes with your own include and exclude rules in dev before upgrading.

Summary

AI summary

Added URL patterns to reject mdBook and Hugo print aggregator pages by default.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	Reject mdBook `/print.html` and Hugo `/_print/` pages during crawl-time URL filtering.	llm_adapter@2026-05-21	high	—
Feature	Medium	New kwarg `include_aggregator_pages: bool = False` on `crawl()` and engine classes.	llm_adapter@2026-05-21	high	—
Feature	Medium	CLI flag `--include-aggregators` mirrors new kwarg behavior.	llm_adapter@2026-05-21	high	—
Feature	Medium	User-supplied `exclude_paths` and `include_paths` compose with aggregator filter.	llm_adapter@2026-05-21	high	—
Feature	Medium	New default URL patterns rejected pre-fetch: `*/print.html`, `*/_print`, etc.	llm_adapter@2026-05-21	low	—
Feature	Medium	Add default rejection patterns: `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`.	granite4.1:30b@2026-05-23-audit	low	—
Performance	Medium	Saves crawl budget by rejecting aggregator pages early.	llm_adapter@2026-05-21	high	—
Deprecation	Medium	No breaking changes; existing sites unaffected unless they archive print views.	llm_adapter@2026-05-21	low	—
Refactor	Medium	Patterns anchored to avoid over-matching substring cases.	llm_adapter@2026-05-21	low	—
Refactor	Low	Anchor URL patterns to prevent over-matching substrings.	granite4.1:30b@2026-05-23-audit	low	—
Other	Medium	647 passing tests now; 36 new tests added for aggregator filter coverage.	llm_adapter@2026-05-21	low	—
Other	Low	Add 36 new tests covering aggregator filter behavior and engine parity.	granite4.1:30b@2026-05-23-audit	low	—

Full changelog

Reject mdBook /print.html and Hugo /_print/ pages during crawl-time URL filtering. These single-render-of-whole-tree pages have artificially high keyword density and pollute embedding-based retrieval rankings.

Why

Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five other well-functioning competitors returned 0% /_print/ on kubernetes-docs.

These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.

What changed

- New default URL patterns rejected pre-fetch (saves crawl budget): `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`
- New kwarg `include_aggregator_pages: bool = False` on `crawl()` and both engine classes, for offline-archive use cases.
- CLI flag `--include-aggregators` mirrors the kwarg.
- User-supplied `exclude_paths` and `include_paths` still apply independently; the aggregator filter composes with both and replaces neither.
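
One way to read the composition rule above is that a URL is fetched only if every active filter accepts it. The sketch below illustrates that reading and is not markcrawl's actual implementation: `should_fetch` and `_matches_any` are hypothetical helpers, and treating `include_paths` and `exclude_paths` as glob patterns matched against the URL path is an assumption.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Default aggregator patterns as listed in this release.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def _matches_any(path, patterns):
    """Return True if the URL path matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def should_fetch(url, include_paths=None, exclude_paths=None,
                 include_aggregator_pages=False):
    """Hypothetical pre-fetch check: every active filter must accept the URL."""
    path = urlsplit(url).path
    # Aggregator filter: on by default, skipped only on explicit opt-in.
    if not include_aggregator_pages and _matches_any(path, AGGREGATOR_PATTERNS):
        return False
    # User filters are evaluated independently of the aggregator filter
    # (glob semantics are assumed here for illustration).
    if exclude_paths and _matches_any(path, exclude_paths):
        return False
    if include_paths and not _matches_any(path, include_paths):
        return False
    return True
```

Under this reading, opting in with `include_aggregator_pages=True` only disables the default rejection. A user-supplied `exclude_paths` entry such as `*/print.html` would still reject the page.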

Substring-match safety

Patterns are anchored to avoid over-matching:

| URL | Behavior |
|---|---|
| /book/print.html | rejected (mdBook) |
| /blueprint.html | passes (print is mid-word) |
| /preprint.html | passes (academic content) |
| /imprint/ | passes (legal page) |
| /_print/index.html | rejected (Hugo) |
| /_printer-friendly/css.css | passes (asset path) |
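
The table can be reproduced with a short sketch that contrasts an unanchored substring check with the anchored glob patterns from this release. The helper names `naive_match` and `anchored_match` are hypothetical, and using `fnmatch` on the URL path is an assumption about how such patterns could be evaluated, not a description of markcrawl's internals.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def naive_match(url):
    """Unanchored check: flags any path that merely contains 'print'."""
    return "print" in urlsplit(url).path


def anchored_match(url):
    """Glob check: each pattern requires a '/' directly before the segment name."""
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in PATTERNS)


TABLE_PATHS = [
    "/book/print.html",
    "/blueprint.html",
    "/preprint.html",
    "/imprint/",
    "/_print/index.html",
    "/_printer-friendly/css.css",
]

for path in TABLE_PATHS:
    url = "https://example.com" + path
    print(f"{path:30} naive={naive_match(url)!s:5} anchored={anchored_match(url)}")
```

The naive check flags all six paths, while the anchored patterns reject only `/book/print.html` and `/_print/index.html`, because the leading `/` in each pattern forces `print.html` or `_print` to start a path segment.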

Expected impact

Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).
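
For readers unfamiliar with the metric, the sketch below computes mean reciprocal rank (MRR) in its standard form: the average over queries of 1/rank of the first relevant result. The rankings and paths are toy values invented for illustration, not benchmark data, and the benchmark's own scoring code may differ.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for query, results in rankings.items():
        for rank, url in enumerate(results, start=1):
            if url in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Toy example: one query whose wanted page is a dedicated chapter.
relevant = {"ownership": {"/book/chapter-a.html"}}

# With the aggregator page indexed, it takes slot 1 and the chapter drops to slot 2.
with_aggregator = {"ownership": ["/book/print.html", "/book/chapter-a.html"]}

# With the aggregator page filtered out, the chapter moves up to slot 1.
without_aggregator = {"ownership": ["/book/chapter-a.html", "/book/chapter-b.html"]}

print(mean_reciprocal_rank(with_aggregator, relevant))     # 0.5
print(mean_reciprocal_rank(without_aggregator, relevant))  # 1.0
```

This shows the mechanism behind the predicted lift: removing an aggregator page that outranks the wanted chapter raises that query's reciprocal rank.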

Tests

647 passing (was 611); 36 new tests in tests/test_v011_1_aggregator_filter.py covering default rejection, substring safety, opt-out flag, composition with user filters, and CrawlEngine + AsyncCrawlEngine parity.

Migration

No breaking changes. Default behavior unchanged on sites that don't generate aggregator pages. For users archiving offline docs that include print views, pass include_aggregator_pages=True or --include-aggregators.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
