This release changes default crawl-time URL filtering. The changelog reports no breaking changes, but platform teams that archive print views should review it before upgrading.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: Release v0.11.1 automatically rejects mdBook `/print.html` and Hugo `/_print/` aggregator pages during crawl-time URL filtering, reducing unnecessary fetches.
Why it matters: New default patterns reject `*/print.html` and `*/_print` URLs before they are fetched, which saves crawl budget. Test how the new filter composes with your own `include_paths` and `exclude_paths` in a development environment before upgrading.
Summary
AI summary: Added URL patterns to reject mdBook and Hugo print aggregator pages by default.
Changes in this release
| Type | Severity | Summary | CVE | Source | Confidence |
|---|---|---|---|---|---|
| Feature | Medium | Reject mdBook `/print.html` and Hugo `/_print/` pages during crawl-time URL filtering. | None | llm_adapter@2026-05-21 | high |
| Feature | Medium | New kwarg `include_aggregator_pages: bool = False` on `crawl()` and engine classes. | None | llm_adapter@2026-05-21 | high |
| Feature | Medium | CLI flag `--include-aggregators` mirrors new kwarg behavior. | None | llm_adapter@2026-05-21 | high |
| Feature | Medium | User-supplied `exclude_paths` and `include_paths` compose with aggregator filter. | None | llm_adapter@2026-05-21 | high |
| Feature | Medium | New default URL patterns rejected pre-fetch: `*/print.html`, `*/_print`, etc. | None | llm_adapter@2026-05-21 | low |
| Feature | Medium | Add default rejection patterns: `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`. | None | granite4.1:30b@2026-05-23-audit | low |
| Performance | Medium | Saves crawl budget by rejecting aggregator pages early. | None | llm_adapter@2026-05-21 | high |
| Deprecation | Medium | No breaking changes; existing sites unaffected unless they archive print views. | None | llm_adapter@2026-05-21 | low |
| Refactor | Medium | Patterns anchored to avoid over-matching substring cases. | None | llm_adapter@2026-05-21 | low |
| Refactor | Low | Anchor URL patterns to prevent over-matching substrings. | None | granite4.1:30b@2026-05-23-audit | low |
| Other | Medium | 647 passing tests now; 36 new tests added for aggregator filter coverage. | None | llm_adapter@2026-05-21 | low |
| Other | Low | Add 36 new tests covering aggregator filter behavior and engine parity. | None | granite4.1:30b@2026-05-23-audit | low |
Full changelog
Reject mdBook /print.html and Hugo /_print/ pages during crawl-time URL filtering. These single-render-of-whole-tree pages have artificially high keyword density and pollute embedding-based retrieval rankings.
Why
Surfaced by the public llm-crawler-benchmarks v1.4 cycle: markcrawl was returning /print.html in 49% of rust-book top-5 retrieval slots and /_print/ in 39% of kubernetes-docs slots, while four of the five other well-functioning competitors returned 0% /_print/ on kubernetes-docs.
These pages contain the entire docs tree on one URL, so embedding-based retrieval ranks them above the dedicated chapter pages a user actually wants.
What changed
- New default URL patterns rejected pre-fetch (saves crawl budget): `*/print.html`, `*/_print`, `*/_print/`, `*/_print/*`, `*/print/index.html`
- New kwarg `include_aggregator_pages: bool = False` on `crawl()` and both engine classes for offline-archive use cases.
- CLI flag `--include-aggregators` mirrors the kwarg.
- User-supplied `exclude_paths` and `include_paths` still apply independently: the aggregator filter composes with both and doesn't replace either.
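The composition described above can be pictured with a small standalone sketch. It is not markcrawl's implementation: the function name `should_fetch`, the use of `fnmatch` glob matching, and the exact semantics given to `include_paths` and `exclude_paths` are all assumptions made for illustration. Only the five patterns and the `include_aggregator_pages` default come from the release notes.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Default aggregator patterns, copied from the release notes.
AGGREGATOR_PATTERNS = (
    "*/print.html",
    "*/_print",
    "*/_print/",
    "*/_print/*",
    "*/print/index.html",
)


def _matches_any(path, patterns):
    """Return True if the URL path matches at least one glob pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def should_fetch(url, include_paths=None, exclude_paths=None,
                 include_aggregator_pages=False):
    """Hypothetical pre-fetch check (assumed semantics, not the library's code)."""
    path = urlsplit(url).path
    # Aggregator filter: active by default, skipped when the caller opts out.
    if not include_aggregator_pages and _matches_any(path, AGGREGATOR_PATTERNS):
        return False
    # The user's exclude filter applies independently of the aggregator filter.
    if exclude_paths and _matches_any(path, exclude_paths):
        return False
    # The user's include filter, when supplied, must also match.
    if include_paths and not _matches_any(path, include_paths):
        return False
    return True
```

Under these assumptions, opting out with `include_aggregator_pages=True` only disables the first check; a user-supplied `exclude_paths` entry can still reject the same URL.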
Substring-match safety
Patterns are anchored to avoid over-matching:
| URL | Behavior |
|---|---|
| /book/print.html | rejected (mdBook) |
| /blueprint.html | passes (print is mid-word) |
| /preprint.html | passes (academic content) |
| /imprint/ | passes (legal page) |
| /_print/index.html | rejected (Hugo) |
| /_printer-friendly/css.css | passes (asset path) |
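One way to see why the anchoring matters is to express the five default patterns as a single regular expression that requires a path separator before the match and the end of the path after it. The regex and the helper name `is_aggregator_path` below are illustrative assumptions, not the expression markcrawl uses; the test cases are the rows of the table above.

```python
import re

# Assumed regex equivalent of the five default glob patterns:
#   */print.html, */_print, */_print/, */_print/*, */print/index.html
# The leading "/" and trailing "$" provide the anchoring.
AGGREGATOR_RE = re.compile(r"/(?:print\.html|_print(?:/.*)?|print/index\.html)$")


def is_aggregator_path(path):
    """Return True if the URL path looks like a print aggregator page."""
    return AGGREGATOR_RE.search(path) is not None


# Rows from the substring-safety table: (path, expected rejection).
CASES = [
    ("/book/print.html", True),             # mdBook
    ("/blueprint.html", False),             # "print" is mid-word
    ("/preprint.html", False),              # academic content
    ("/imprint/", False),                   # legal page
    ("/_print/index.html", True),           # Hugo
    ("/_printer-friendly/css.css", False),  # asset path
]

for path, expected in CASES:
    assert is_aggregator_path(path) == expected, path
```

An unanchored substring test such as `"print.html" in path` would wrongly reject `/blueprint.html` and `/preprint.html`, which is the over-matching the anchored patterns avoid.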
Expected impact
Predicted MRR lift on the 9-site bench pool: +0.02 to +0.04, concentrated on rust-book and kubernetes-docs. Measurement deferred to bench v1.5's helpful-pages-universe methodology (the current v1.4 anchor-biased methodology would give misleading numbers regardless of the underlying fix).
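For readers unfamiliar with the metric, the sketch below computes mean reciprocal rank (MRR) in its standard form and shows why an aggregator page in the top slot lowers it. The URLs and rankings are made-up toy data, not benchmark results, and the benchmark's own scoring code may differ.

```python
def reciprocal_rank(ranked_urls, relevant_urls):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in relevant_urls:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(queries):
    """Average reciprocal rank over (ranked_urls, relevant_urls) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)


# Toy example (hypothetical URLs): the chapter page is the relevant answer.
relevant = {"/book/ch04-ownership.html"}
with_aggregator = ["/book/print.html", "/book/ch04-ownership.html"]
without_aggregator = ["/book/ch04-ownership.html", "/book/ch05-structs.html"]

rr_before = reciprocal_rank(with_aggregator, relevant)    # chapter at rank 2
rr_after = reciprocal_rank(without_aggregator, relevant)  # chapter at rank 1
```

In this toy case the reciprocal rank rises from 0.5 to 1.0 for the single query once the aggregator page no longer occupies the first slot; the predicted pool-wide lift is much smaller because most queries are unaffected.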
Tests
647 passing (was 611); 36 new tests in `tests/test_v011_1_aggregator_filter.py` covering default rejection, substring safety, opt-out flag, composition with user filters, and `CrawlEngine` + `AsyncCrawlEngine` parity.
Migration
No breaking changes. Default behavior unchanged on sites that don't generate aggregator pages. For users archiving offline docs that include print views, pass `include_aggregator_pages=True` or `--include-aggregators`.
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.