AIMLPM/markcrawl

v0.10.6 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 29d RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Affected surfaces

auth

Summary

AI summary

Added explicit respect_robots flag with audit fields and non‑silenceable warnings for bypass.

Full changelog

New crawl(..., respect_robots: bool = True). The default is unchanged: robots.txt Disallow rules are honored. Setting respect_robots=False bypasses Disallow but still honors Crawl-delay, so politeness is preserved. The caller takes responsibility for legality, ethics, and downstream consequences.

Why

robots.txt is the only widely-deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.

Three guardrails

  1. Loud, non-silenceable warning at engine setup when bypass is active, sent to both the progress callback and Python's logger.warning. There is no env-var or CLI override; the choice must be made deliberately in code.
  2. CrawlResult.robots_respected: bool, which mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
  3. CrawlResult.robots_bypassed_count: int, the count of unique URLs that robots.txt Disallowed but that were fetched anyway. Always 0 when robots_respected is True. It shows the actual impact of the override: a small count means robots.txt was barely constraining the crawl.

When bypass was active, the end-of-crawl summary reports either "had no effect this run" (count=0) or "fetched N URL(s) that robots.txt Disallowed" (count>0).
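The three guardrails can be sketched in plain Python. This is a minimal illustration, not markcrawl's implementation: CrawlAudit, record_fetch, warn_if_bypassing, and summarize are hypothetical names, and the message wording is assumed. Only the field names robots_respected and robots_bypassed_count come from the changelog.

```python
import logging
from dataclasses import dataclass, field
from typing import Callable, Optional, Set

logger = logging.getLogger("robots_audit_sketch")


@dataclass
class CrawlAudit:
    """Hypothetical stand-in for the two audit fields on CrawlResult."""

    robots_respected: bool = True
    # Unique URLs that robots.txt disallowed but that were fetched anyway.
    bypassed_urls: Set[str] = field(default_factory=set)

    @property
    def robots_bypassed_count(self) -> int:
        return len(self.bypassed_urls)

    def record_fetch(self, url: str, disallowed: bool) -> None:
        # Only a bypassing crawl can count a disallowed URL, so the
        # count stays 0 whenever robots_respected is True.
        if disallowed and not self.robots_respected:
            self.bypassed_urls.add(url)


def warn_if_bypassing(
    respect_robots: bool,
    progress: Optional[Callable[[str], None]] = None,
) -> None:
    """Emit the setup warning on both channels; no switch turns it off."""
    if respect_robots:
        return
    message = "robots.txt Disallow rules are being bypassed (assumed wording)"
    logger.warning(message)
    if progress is not None:
        progress(message)


def summarize(audit: CrawlAudit) -> Optional[str]:
    """Build the end-of-crawl summary line, or None if robots was respected."""
    if audit.robots_respected:
        return None
    count = audit.robots_bypassed_count
    if count == 0:
        return "robots.txt bypass had no effect this run"
    return f"robots.txt bypass fetched {count} URL(s) that robots.txt Disallowed"
```

Storing URLs in a set is what makes the count "unique": fetching the same disallowed URL twice still counts once.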

What stays unchanged

  • Default behavior: robots.txt Disallow rules honored.
  • Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.
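The split between Disallow and Crawl-delay can be illustrated with the standard library's urllib.robotparser. This is a sketch of the described semantics only; plan_fetch, the sample robots.txt, and the demo-bot user agent are assumptions for illustration and say nothing about how markcrawl parses robots.txt internally.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def plan_fetch(parser, user_agent, url, respect_robots=True):
    """Return (should_fetch, delay_seconds) for one URL.

    The flag only affects the Disallow decision; the Crawl-delay is
    read and returned regardless of the flag.
    """
    allowed = parser.can_fetch(user_agent, url)
    should_fetch = allowed or not respect_robots
    delay = parser.crawl_delay(user_agent) or 0
    return should_fetch, float(delay)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Default: the disallowed path is skipped.
print(plan_fetch(parser, "demo-bot", "https://example.com/private/x"))
# Bypass: the path is fetched, but the 2-second delay still applies.
print(plan_fetch(parser, "demo-bot", "https://example.com/private/x",
                 respect_robots=False))
```

A caller would sleep for delay_seconds between requests to the same host in both modes, which is the sense in which bypass disregards Disallow but not politeness.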

Migration

No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:

  • Your own site (forgotten or misconfigured robots.txt)
  • Authorized pen-testing engagements
  • Internal / intranet documentation you own
  • RAG ingestion of docs whose owner explicitly wants them ingested but forgot to allowlist your user agent

566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
