✓ No known CVEs patched in this version
Summary
AI summary: Added explicit respect_robots flag with audit fields and non-silenceable warnings for bypass.
Full changelog
New crawl(..., respect_robots: bool = True) — default unchanged (robots.txt Disallow rules honored). Setting respect_robots=False bypasses Disallow but still honors Crawl-delay (politeness preserved). Caller takes responsibility for legality, ethics, and downstream consequences.
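The split between Disallow and Crawl-delay can be illustrated with the standard library's urllib.robotparser. This is a minimal sketch of the semantics described above, not markcrawl's implementation: the plan_fetch helper, the user-agent string, and the sample robots.txt are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def plan_fetch(parser, user_agent, url, respect_robots=True):
    """Return (should_fetch, delay_seconds) for one URL.

    Disallow is honored only when respect_robots is True;
    Crawl-delay is returned in both modes.
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent) or 0
    should_fetch = allowed or not respect_robots
    return should_fetch, delay


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/private/report.html"
default_plan = plan_fetch(parser, "example-bot", url)
bypass_plan = plan_fetch(parser, "example-bot", url, respect_robots=False)
print(default_plan)  # Disallowed URL is skipped, delay still reported
print(bypass_plan)   # Disallowed URL is fetched, delay unchanged
```

In this sketch the only thing the flag changes is the first element of the tuple; the delay is computed the same way in both modes, which mirrors the "politeness preserved" wording above.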
Why
robots.txt is the only widely-deployed mechanism site owners have to express preferences about automated access. We default to respecting it. But forks and monkey-patches that ignore robots already exist in the wild; an explicit, audited flag is more honest than letting users hack around the constraint silently.
Three guardrails
- Loud, non-silenceable warning at engine setup when bypass is active, emitted through both the progress callback and Python logger.warning. No env-var or CLI override; the choice must be made deliberately in code.
- CrawlResult.robots_respected: bool mirrors the kwarg the caller passed. Surfaced for audit / governance pipelines.
- CrawlResult.robots_bypassed_count: int is the count of unique URLs that robots.txt Disallowed but were fetched anyway. Always 0 when robots_respected is True. Lets you see the actual impact of the override: small numbers mean robots wasn't constraining you.
The end-of-crawl summary, printed when bypass was active, reports either "had no effect this run" (count=0) or "fetched N URL(s) that robots.txt Disallowed" (count>0).
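The guardrails can be pictured with a small stand-in for the result object. This is a hedged sketch, not the library's code: only the field names robots_respected and robots_bypassed_count come from the notes above, while the CrawlResult dataclass shape, the helper functions, the logger name, and the exact message wording are assumptions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("crawl_sketch")  # hypothetical logger name


@dataclass
class CrawlResult:
    # Field names come from the release notes; the rest is illustrative.
    robots_respected: bool = True
    robots_bypassed_count: int = 0


def warn_if_bypassing(respect_robots, progress=None):
    """Emit the bypass warning through both channels at setup time."""
    if respect_robots:
        return
    message = "respect_robots=False: robots.txt Disallow rules will be bypassed"
    logger.warning(message)
    if progress is not None:
        progress(message)


def finish_crawl(respect_robots, disallowed_urls_fetched):
    """Build a result; the count covers unique URLs and is 0 when robots is respected."""
    count = 0 if respect_robots else len(set(disallowed_urls_fetched))
    return CrawlResult(robots_respected=respect_robots, robots_bypassed_count=count)


def summarize(result):
    """Return the end-of-crawl summary line, or None when robots was respected."""
    if result.robots_respected:
        return None
    if result.robots_bypassed_count == 0:
        return "robots bypass had no effect this run"
    return (
        f"robots bypass fetched {result.robots_bypassed_count} URL(s) "
        "that robots.txt Disallowed"
    )


messages = []
warn_if_bypassing(False, progress=messages.append)
result = finish_crawl(
    False,
    ["https://example.com/a", "https://example.com/a", "https://example.com/b"],
)
print(summarize(result))
```

An audit or governance step could then check the two fields directly, for example flagging any stored result where robots_respected is False and robots_bypassed_count is greater than zero.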
What stays unchanged
- Default behavior: robots.txt Disallow rules honored.
- Crawl-delay (politeness): honored unconditionally. We disregard Disallow, not politeness. Bypassing rate limits would be DoS-shaped.
Migration
No breaking changes. Default behavior unchanged. Use the flag for legitimate cases:
- Your own site (forgotten or misconfigured robots.txt)
- Authorized pen-testing engagements
- Internal / intranet documentation you own
- RAG ingestion of docs whose owner explicitly wants them ingested but forgot to whitelist your UA
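Since the choice has to be made in code, one way to keep it deliberate is to derive the flag from an explicit list of hosts you own or are authorized to crawl. The helper below is a hypothetical sketch: respect_robots_for, BYPASS_ALLOWED_HOSTS, and the example hostnames are not part of markcrawl.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: hosts you own or are authorized to crawl past Disallow.
BYPASS_ALLOWED_HOSTS = {"docs.example.com", "intranet.example.com"}


def respect_robots_for(start_url, allowed_hosts=BYPASS_ALLOWED_HOSTS):
    """Return the value to pass as respect_robots for this start URL.

    Bypass (False) is chosen only for hosts on the explicit allowlist;
    every other host keeps the default (True).
    """
    host = (urlsplit(start_url).hostname or "").lower()
    return host not in allowed_hosts


own_site = respect_robots_for("https://docs.example.com/guide")
third_party = respect_robots_for("https://other.example.org/")
print(own_site, third_party)
```

The returned value would then be passed as the respect_robots keyword of crawl(...), so any bypass traces back to a reviewed entry in the allowlist.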
566 tests passing (was 549 on v0.10.5; +17 in tests/test_v0106_respect_robots.py).
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.