AIMLPM/markcrawl

v0.11.0 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 2mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Affected surfaces

deps

Summary

AI summary

Added streaming binary download support and reusable pre‑fetch filters to markcrawl.

Full changelog

Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":

`markcrawl/binaries.py` — streaming binary downloads

New opt-in kwarg: `crawl(..., download_types=["pdf", "docx"], ...)`

- Streaming with size cap: uses `stream=True` / `aiter_bytes()` with per-chunk accumulation and never buffers the full body. Defaults: 25 MB per-file cap, 200-file count cap.
- Atomic write via `.tmp` + `os.replace`. Partial files are unlinked when a cap is exceeded.
- Content type is validated before any bytes are written: a `.pdf` URL serving `text/html` (login wall, marketing splash) is dropped immediately.
- JSONL rows gain a `downloads` field when a page's binaries were downloaded: `[{url, path, size_bytes, content_type}, ...]`. The field is omitted when empty (backward compatible).
- Sitemap entries are routed to the download queue when they match `download_types` (symmetry with link discovery).
- All v0.10.x safety nets (`respect_robots`, `idle_timeout_s`, `include_subdomains`) apply uniformly to downloads.
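
The content-type check, size cap, and atomic write described above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not markcrawl's implementation: `content_type_allowed`, `save_capped`, and the `ALLOWED_TYPES` table are hypothetical names, and the chunk iterable stands in for whatever a streaming HTTP client yields.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical mapping from file extension to acceptable Content-Type values.
ALLOWED_TYPES = {
    "pdf": {"application/pdf"},
    "docx": {
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
}


def content_type_allowed(extension: str, content_type: str) -> bool:
    """Check the Content-Type header before any bytes are written."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ALLOWED_TYPES.get(extension.lower(), set())


def save_capped(chunks: Iterable[bytes], dest: Path, max_bytes: int) -> Optional[int]:
    """Write chunks to dest atomically.

    Returns the number of bytes written, or None if the cap was exceeded.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    with open(tmp, "wb") as fh:
        for chunk in chunks:
            written += len(chunk)
            if written > max_bytes:
                break  # stop reading; the full body is never buffered
            fh.write(chunk)
    if written > max_bytes:
        tmp.unlink()  # remove the partial file
        return None
    os.replace(tmp, dest)  # atomic rename on the same filesystem
    return written
```

Writing to a sibling `.tmp` file and renaming it only on success means a reader of the output directory never observes a half-written download.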

`markcrawl/filters.py` — reusable pre-fetch filters

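```python
from markcrawl import crawl
from markcrawl.filters import is_likely_resume

# Crawl the site and download only linked PDF/DOCX files that pass the filter.
result = crawl(
    base_url="https://example.com/templates",
    out_dir="./resumes",
    download_types=["pdf", "docx"],    # opt in to binary downloads
    download_filter=is_likely_resume,  # pre-fetch heuristic on URL + anchor text
)
print(f"Saved {result.downloads_count} files")
```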

- `DownloadCandidate(url, anchor_text, parent_url, parent_title, extension)`: pre-fetch context passed to filters.
- `is_likely_resume` / `is_likely_paper` / `exclude_legal_boilerplate`: reusable URL + anchor heuristics. Best-effort, not classifiers.
- Filters run pre-fetch: rejected URLs are never fetched, so zero HTTP bytes are transferred.
- Compose via `lambda c: positive(c) and exclude_legal_boilerplate(c)`.
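
As an illustration of how such filters compose, the sketch below uses a stand-in dataclass with the same five fields as `DownloadCandidate`, so it runs without markcrawl installed. The `Candidate` class and the `looks_like_resume` and `excludes_legal` heuristics are hypothetical; they only mimic the shape of the real helpers and are not markcrawl's implementations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Candidate:
    """Stand-in mirroring the DownloadCandidate fields (assumed all strings)."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[Candidate], bool]


def looks_like_resume(c: Candidate) -> bool:
    # Hypothetical positive heuristic over URL and anchor text.
    haystack = f"{c.url} {c.anchor_text}".lower()
    return any(word in haystack for word in ("resume", "cv-", "curriculum"))


def excludes_legal(c: Candidate) -> bool:
    # Hypothetical negative heuristic: reject obvious legal boilerplate.
    haystack = f"{c.url} {c.anchor_text}".lower()
    return not any(word in haystack for word in ("privacy", "terms", "cookie"))


# Composition: a positive filter AND an exclusion, evaluated before any fetch.
combined: Filter = lambda c: looks_like_resume(c) and excludes_legal(c)
```

Because a filter is just a callable returning a boolean, any combination of `and`, `or`, and `not` over existing filters is itself a valid filter.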

New `CrawlResult` fields

- `downloads_count` (`int`): files saved
- `downloads_bytes` (`int`): total bytes saved
- `downloads_size_skipped` (`List[str]`): URLs that exceeded the size cap
- `downloads_type_skipped` (`List[str]`): URLs whose content type didn't match
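
A short sketch of how these fields might be reported after a crawl. `DownloadStats` is a hypothetical stand-in carrying only the four fields above, and `summarize_downloads` is an illustrative helper, not part of markcrawl.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DownloadStats:
    """Stand-in with the four download-related CrawlResult fields."""
    downloads_count: int = 0
    downloads_bytes: int = 0
    downloads_size_skipped: List[str] = field(default_factory=list)
    downloads_type_skipped: List[str] = field(default_factory=list)


def summarize_downloads(result) -> str:
    """Build a one-line report from any object exposing the four fields."""
    mib = result.downloads_bytes / (1024 * 1024)
    return (
        f"saved {result.downloads_count} files ({mib:.1f} MiB), "
        f"{len(result.downloads_size_skipped)} over the size cap, "
        f"{len(result.downloads_type_skipped)} with a mismatched content type"
    )
```

The helper relies only on attribute names, so it should accept a real `CrawlResult` as well, assuming the fields are exposed as plain attributes.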

Migration

No breaking changes. The default `download_types=None` preserves v0.10.6 behavior exactly.

Deferred

- Live-network smoke harness case for an ATS-template aggregator: deferred to v0.11.1.
- Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with `pypdf` / `python-docx` / `mammoth` / `unstructured` downstream of the saved files.
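
Since extraction is left to downstream tools, one possible first step is collecting the saved paths from the crawl's JSONL output. The sketch below assumes each row is a JSON object whose optional `downloads` field has the shape given in the changelog (`url`, `path`, `size_bytes`, `content_type`). The function names and the location of the JSONL file are assumptions, and the choice of extraction library is left to the caller.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def iter_downloads(jsonl_path: Path) -> Iterator[Dict]:
    """Yield every download record found in a crawl JSONL file."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            row = json.loads(line)
            # The field is omitted when a page had no downloads.
            yield from row.get("downloads", [])


def paths_by_content_type(jsonl_path: Path, content_type: str) -> List[str]:
    """Return saved file paths whose recorded content type matches."""
    return [
        d["path"]
        for d in iter_downloads(jsonl_path)
        if d.get("content_type", "").split(";", 1)[0].strip() == content_type
    ]
```

The resulting paths can then be handed to `pypdf`, `python-docx`, or a similar library for text extraction.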

Tests

611 tests passing (up from 566 on v0.10.6; +45 in `tests/test_v011_binary_downloads.py`). The spec `specs/binary-downloads.md` was confidence-reviewed; all SC/DS were rated ≥ 90% before implementation began.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
