AIMLPM/markcrawl

v0.10.6 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 2mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Affected surfaces

auth

Summary

AI summary

Added explicit respect_robots flag with audit fields and non‑silenceable warnings for bypass.

Full changelog

New keyword argument: crawl(..., respect_robots: bool = True). The default is unchanged: robots.txt Disallow rules are honored. Setting respect_robots=False bypasses Disallow but still honors Crawl-delay, so politeness is preserved. The caller takes responsibility for legality, ethics, and downstream consequences.
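
The split between Disallow and Crawl-delay can be illustrated with Python's standard urllib.robotparser. This is only a sketch of the semantics described above, not markcrawl's implementation; the plan_fetch helper, the sample robots.txt, and the user-agent string are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt, invented for this sketch.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def plan_fetch(parser, user_agent, url, respect_robots=True):
    """Return (should_fetch, delay_seconds, bypassed) for one URL.

    Hypothetical helper: it mirrors the described behavior, where the
    flag only affects Disallow and never the crawl delay.
    """
    allowed = parser.can_fetch(user_agent, url)
    # Crawl-delay is read regardless of the flag.
    delay = parser.crawl_delay(user_agent) or 0
    if allowed:
        return True, delay, False
    if respect_robots:
        return False, delay, False
    # Disallowed, but the caller explicitly opted out of Disallow rules.
    return True, delay, True


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in ("https://example.com/docs/", "https://example.com/private/x"):
    for flag in (True, False):
        print(url, flag, plan_fetch(parser, "example-bot", url, flag))
```

In this sketch the delay value is identical whether or not the flag is set; only the Disallow decision changes.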

Why

robots.txt is the only widely deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.

Three guardrails

Loud, non-silenceable warning at engine setup when bypass is active, emitted through both the progress callback and Python's logger.warning. There is no env-var or CLI override; the choice must be made deliberately in code.
CrawlResult.robots_respected: bool. Mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
CrawlResult.robots_bypassed_count: int. Count of unique URLs that robots.txt Disallowed but that were fetched anyway. Always 0 when robots_respected is True. It shows the actual impact of the override: a small number means robots.txt was barely constraining the crawl.

When bypass was active, the end-of-crawl summary reports either "had no effect this run" (count = 0) or "fetched N URL(s) that robots.txt Disallowed" (count > 0).
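
The three guardrails and the end-of-crawl summary might look roughly like the following. This is a minimal sketch under stated assumptions: CrawlResultSketch is a stand-in that carries only the two audit fields named above, and warn_if_bypassing, bypass_summary, the logger name, and the exact message wording are hypothetical rather than taken from markcrawl.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("robots_audit_sketch")  # hypothetical logger name


@dataclass
class CrawlResultSketch:
    # Stand-in for the real result type; only the two audit fields.
    robots_respected: bool = True
    robots_bypassed_count: int = 0


def warn_if_bypassing(respect_robots, progress=None):
    """Emit the bypass warning on both channels; return the message or None."""
    if respect_robots:
        return None
    message = "respect_robots=False: robots.txt Disallow rules will be bypassed"
    logger.warning(message)
    if progress is not None:
        progress(message)  # progress callback, assumed to accept a string
    return message


def bypass_summary(result):
    """Build the end-of-crawl line; None when robots.txt was respected."""
    if result.robots_respected:
        return None
    count = result.robots_bypassed_count
    if count == 0:
        return "robots bypass had no effect this run"
    return f"robots bypass fetched {count} URL(s) that robots.txt Disallowed"


seen = []
warn_if_bypassing(False, progress=seen.append)
print(bypass_summary(CrawlResultSketch(robots_respected=False, robots_bypassed_count=3)))
```

A governance pipeline could store both fields alongside each crawl record and flag any run where robots_bypassed_count is non-zero for review.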

What stays unchanged

Default behavior: robots.txt Disallow rules honored.
Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.

Migration

No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:

Your own site (forgotten or misconfigured robots.txt)
Authorized pen-testing engagements
Internal / intranet documentation you own
RAG ingestion of docs the site owner explicitly wants ingested, where the owner forgot to whitelist your UA
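
Because the bypass can only be chosen in code, one way to keep that choice deliberate is to derive the flag from an explicit allowlist of hosts you own or are authorized to crawl. The sketch below is an assumption about how a caller might do this; OWNED_HOSTS and should_respect_robots are hypothetical names and not part of markcrawl.

```python
from urllib.parse import urlsplit

# Hosts you own or are explicitly authorized to crawl (illustrative values).
OWNED_HOSTS = frozenset({"docs.example.com", "intranet.example.com"})


def should_respect_robots(start_url, owned_hosts=OWNED_HOSTS):
    """Return False only when the start URL's host is on the allowlist."""
    host = (urlsplit(start_url).hostname or "").lower()
    return host not in owned_hosts


print(should_respect_robots("https://docs.example.com/guide"))
print(should_respect_robots("https://example.org/"))
```

The returned value would then be passed as respect_robots to crawl(...), so every bypass traces back to a reviewed allowlist entry.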

566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
