AIMLPM/markcrawl

v0.11.0 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 28d RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Affected surfaces

deps

Summary

AI summary

Added streaming binary download support and reusable pre‑fetch filters to markcrawl.

Full changelog

Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":

markcrawl/binaries.py — streaming binary downloads

New crawl(..., download_types=["pdf","docx"], ...) opt-in kwarg:

  • Streaming with size cap: stream=True / aiter_bytes() with per-chunk accumulation. Never buffers the full body. Defaults: 25 MB per-file cap, 200-file count cap.
  • Atomic write via .tmp + os.replace. Partial files unlinked on cap-exceed.
  • Content-type validated BEFORE writing bytes — a .pdf URL serving text/html (login wall, marketing splash) is dropped immediately.
  • JSONL row gains downloads field when a page's binaries were downloaded: [{url, path, size_bytes, content_type}, ...]. Field omitted when empty (backward compat).
  • Sitemap entries route to download queue when they match download_types (symmetry with link discovery).
  • All v0.10.x safety nets (respect_robots, idle_timeout_s, include_subdomains) apply uniformly to downloads.
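
The content-type check, size cap, and atomic write described above can be sketched with the standard library alone. This is an illustration, not markcrawl's implementation: `save_capped`, `content_type_matches`, and the `ALLOWED_TYPES` table are hypothetical names, and the chunk iterable stands in for whatever the HTTP client yields while streaming.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical extension -> media type table (an assumption, not markcrawl's own).
ALLOWED_TYPES = {
    "pdf": {"application/pdf"},
    "docx": {
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
}


def content_type_matches(extension: str, content_type_header: str) -> bool:
    """Validate the Content-Type header before any bytes are written."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES.get(extension, set())


def save_capped(chunks: Iterable[bytes], dest: Path, max_bytes: int) -> Optional[int]:
    """Stream chunks into dest via a .tmp file.

    Returns the number of bytes written, or None if the cap was exceeded.
    The partial .tmp file is removed on any failure.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    ok = False
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                written += len(chunk)  # per-chunk accumulation, no full-body buffer
                if written > max_bytes:
                    return None  # cap exceeded; cleanup happens in finally
                fh.write(chunk)
        os.replace(tmp, dest)  # atomic rename onto the final path
        ok = True
        return written
    finally:
        if not ok:
            tmp.unlink(missing_ok=True)
```

Writing to a sibling `.tmp` path and renaming at the end means a reader of the output directory sees either the complete file or nothing.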

markcrawl/filters.py — reusable pre-fetch filters

from markcrawl import crawl
from markcrawl.filters import is_likely_resume

result = crawl(
    base_url="https://example.com/templates",
    out_dir="./resumes",
    download_types=["pdf", "docx"],
    download_filter=is_likely_resume,
)
print(f"Saved {result.downloads_count} files")

  • DownloadCandidate(url, anchor_text, parent_url, parent_title, extension) — pre-fetch context passed to filters.
  • is_likely_resume / is_likely_paper / exclude_legal_boilerplate — reusable URL+anchor heuristics. Best-effort, not classifiers.
  • Filters run pre-fetch — rejected URLs never get fetched, zero HTTP bytes transferred.
  • Compose via lambda c: positive(c) and exclude_legal_boilerplate(c).
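
A custom filter is just a callable that takes a candidate and returns a bool, so composition needs no special machinery. The sketch below runs without markcrawl installed: the `DownloadCandidate` dataclass is a local stand-in that mirrors the field names listed above, and `mentions_template`, `looks_non_legal`, and `all_of` are hypothetical helpers rather than part of the library.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DownloadCandidate:
    """Local stand-in with the fields named in the changelog (not the real class)."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[DownloadCandidate], bool]


def mentions_template(c: DownloadCandidate) -> bool:
    """Hypothetical positive filter: URL or anchor text mentions 'template'."""
    return "template" in f"{c.url} {c.anchor_text}".lower()


def looks_non_legal(c: DownloadCandidate) -> bool:
    """Hypothetical stand-in for an exclusion filter such as exclude_legal_boilerplate."""
    text = f"{c.url} {c.anchor_text}".lower()
    return not any(word in text for word in ("terms", "privacy"))


def all_of(*filters: Filter) -> Filter:
    """Combine filters so a candidate must pass every one of them."""
    return lambda c: all(f(c) for f in filters)


combined = all_of(mentions_template, looks_non_legal)
```

Assuming the real filters have the same one-argument shape, a callable like `combined` would be what gets passed as `download_filter`.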

New CrawlResult fields

  • downloads_count: int — files saved
  • downloads_bytes: int — total bytes saved
  • downloads_size_skipped: List[str] — URLs that exceeded the size cap
  • downloads_type_skipped: List[str] — URLs whose content-type didn't match
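
As a rough illustration of how these four fields might be reported after a crawl, the helper below formats them into a short summary. `summarize_downloads` is a hypothetical function, and `fake_result` is a `SimpleNamespace` stand-in carrying only the field names listed above, not a real `CrawlResult`.

```python
from types import SimpleNamespace


def summarize_downloads(result) -> str:
    """Build a short report from the download-related result fields."""
    lines = [
        f"saved {result.downloads_count} files "
        f"({result.downloads_bytes / 1_000_000:.1f} MB)",
        f"skipped for size: {len(result.downloads_size_skipped)}",
        f"skipped for content type: {len(result.downloads_type_skipped)}",
    ]
    return "\n".join(lines)


# Stand-in object with made-up values, used only to exercise the helper.
fake_result = SimpleNamespace(
    downloads_count=3,
    downloads_bytes=2_500_000,
    downloads_size_skipped=["https://example.com/huge.pdf"],
    downloads_type_skipped=[],
)
summary = summarize_downloads(fake_result)
```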

Migration

No breaking changes. Default download_types=None preserves v0.10.6 behavior exactly.
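
Because the downloads field is omitted when a page saved no binaries, code that reads the JSONL output should treat it as optional. A minimal sketch, assuming one JSON object per line; `iter_downloads` is a hypothetical helper, and the `url` key on the page rows plus all sample values are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_downloads(jsonl_lines: Iterable[str]) -> Iterator[dict]:
    """Yield every download entry, tolerating rows that lack the field."""
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Rows from pages without saved binaries have no "downloads" key.
        yield from row.get("downloads", [])


# Made-up sample rows: one without the field, one with a single download.
sample_lines = [
    '{"url": "https://example.com/about"}',
    '{"url": "https://example.com/templates", "downloads": '
    '[{"url": "https://example.com/cv.pdf", "path": "cv.pdf", '
    '"size_bytes": 1024, "content_type": "application/pdf"}]}',
]
entries = list(iter_downloads(sample_lines))
```

Reading with `row.get("downloads", [])` keeps the same consumer working on output from v0.10.x and v0.11.0.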

Deferred

  • Live-network smoke harness case for an ATS-template aggregator → v0.11.1.
  • Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with pypdf / python-docx / mammoth / unstructured downstream of saved files.

Tests

611 passing (was 566 on v0.10.6; +45 in tests/test_v011_binary_downloads.py). Spec specs/binary-downloads.md confidence-reviewed; all SC/DS rated ≥ 90% before implementation began.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
