✓ No known CVEs patched in this version
Summary
Added streaming binary download support and reusable pre-fetch filters to markcrawl.
Full changelog
Two new modules expand markcrawl from "HTML to Markdown converter" to "crawl + selectively download referenced files":
markcrawl/binaries.py — streaming binary downloads
New crawl(..., download_types=["pdf","docx"], ...) opt-in kwarg:
- Streaming with size cap: stream=True / aiter_bytes() with per-chunk accumulation. Never buffers the full body. Default 25 MB per-file cap, 200 file count cap.
- Atomic write via .tmp + os.replace. Partial files unlinked on cap-exceed.
- Content-type validated BEFORE writing bytes: a .pdf URL serving text/html (login wall, marketing splash) is dropped immediately.
- JSONL row gains downloads field when a page's binaries were downloaded: [{url, path, size_bytes, content_type}, ...]. Field omitted when empty (backward compat).
- Sitemap entries route to download queue when they match download_types (symmetry with link discovery).
- All v0.10.x safety nets (respect_robots, idle_timeout_s, include_subdomains) apply uniformly to downloads.
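The first three bullets above (size cap, atomic write, content-type check) can be illustrated with a small standalone sketch. This is not markcrawl's actual code: the helper names `content_type_ok` and `save_capped`, the `EXPECTED_TYPES` mapping, and the use of a plain iterable of byte chunks in place of an HTTP client's streaming iterator are all assumptions made for illustration.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

MAX_BYTES = 25 * 1024 * 1024  # mirrors the 25 MB per-file default described above

# Hypothetical mapping from file extension to acceptable MIME types.
EXPECTED_TYPES = {
    "pdf": {"application/pdf"},
    "docx": {
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    },
}


def content_type_ok(extension: str, content_type: str) -> bool:
    """Return True if the response's content type fits the URL's extension."""
    # Drop parameters such as "; charset=..." before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in EXPECTED_TYPES.get(extension.lower(), set())


def save_capped(
    chunks: Iterable[bytes], dest: Path, max_bytes: int = MAX_BYTES
) -> Optional[int]:
    """Write chunks to dest atomically.

    Returns the number of bytes written, or None if the cap was exceeded.
    """
    tmp = dest.with_name(dest.name + ".tmp")
    written = 0
    try:
        with open(tmp, "wb") as fh:
            for chunk in chunks:
                written += len(chunk)
                if written > max_bytes:
                    return None  # cap exceeded; the finally block removes tmp
                fh.write(chunk)
        os.replace(tmp, dest)  # atomic rename: readers never see a partial file
        return written
    finally:
        # After a successful os.replace the temp file no longer exists.
        if tmp.exists():
            tmp.unlink()
```

In a real crawler the content-type check would run on the response headers before `save_capped` is called, so a mismatched response never reaches disk.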
markcrawl/filters.py — reusable pre-fetch filters
from markcrawl import crawl
from markcrawl.filters import is_likely_resume
result = crawl(
base_url="https://example.com/templates",
out_dir="./resumes",
download_types=["pdf", "docx"],
download_filter=is_likely_resume,
)
print(f"Saved {result.downloads_count} files")
- DownloadCandidate(url, anchor_text, parent_url, parent_title, extension): pre-fetch context passed to filters.
- is_likely_resume / is_likely_paper / exclude_legal_boilerplate: reusable URL+anchor heuristics. Best-effort, not classifiers.
- Filters run pre-fetch: rejected URLs never get fetched, zero HTTP bytes transferred.
- Compose via lambda c: positive(c) and exclude_legal_boilerplate(c).
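A custom filter is just a callable from a candidate to a boolean, so composition needs nothing beyond ordinary functions. The sketch below uses a local `Candidate` tuple as a stand-in that mirrors the DownloadCandidate fields listed above; `all_of`, `mentions`, and `not_under` are hypothetical helpers, not part of markcrawl.

```python
from typing import Callable, NamedTuple
from urllib.parse import urlparse


class Candidate(NamedTuple):
    """Local stand-in mirroring the DownloadCandidate fields listed above."""
    url: str
    anchor_text: str
    parent_url: str
    parent_title: str
    extension: str


Filter = Callable[[Candidate], bool]


def all_of(*filters: Filter) -> Filter:
    """Combine filters so a candidate must pass every one of them."""
    return lambda c: all(f(c) for f in filters)


def mentions(*keywords: str) -> Filter:
    """Accept candidates whose URL or anchor text contains any keyword."""
    lowered = [k.lower() for k in keywords]

    def check(c: Candidate) -> bool:
        haystack = f"{c.url} {c.anchor_text}".lower()
        return any(k in haystack for k in lowered)

    return check


def not_under(path_prefix: str) -> Filter:
    """Reject candidates whose URL path starts with path_prefix."""
    return lambda c: not urlparse(c.url).path.startswith(path_prefix)


# A positive keyword heuristic combined with a path-based exclusion.
wants_template = all_of(mentions("resume", "cv"), not_under("/legal/"))
```

Assuming the real DownloadCandidate exposes the same attribute names, a filter built this way could be passed as download_filter=wants_template.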
New CrawlResult fields
- downloads_count (int): files saved
- downloads_bytes (int): total bytes saved
- downloads_size_skipped (List[str]): URLs that exceeded the size cap
- downloads_type_skipped (List[str]): URLs whose content-type didn't match
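As a rough illustration of how these fields might be reported after a crawl, the sketch below formats them into a one-line summary. The `summarize_downloads` helper is hypothetical, and a `SimpleNamespace` stands in for the object returned by crawl() so the example runs without the library.

```python
from types import SimpleNamespace


def summarize_downloads(result) -> str:
    """Build a one-line report from the download-related result fields."""
    mib = result.downloads_bytes / (1024 * 1024)
    return (
        f"{result.downloads_count} files saved ({mib:.1f} MiB); "
        f"{len(result.downloads_size_skipped)} skipped for size, "
        f"{len(result.downloads_type_skipped)} skipped for content-type"
    )


# Stand-in for illustration; a real run would pass the value returned by crawl().
fake_result = SimpleNamespace(
    downloads_count=3,
    downloads_bytes=5 * 1024 * 1024,
    downloads_size_skipped=["https://example.com/huge.pdf"],
    downloads_type_skipped=[],
)
print(summarize_downloads(fake_result))
```

Logging the two skip lists themselves, not just their lengths, makes it easier to decide whether a cap needs raising.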
Migration
No breaking changes. Default download_types=None preserves v0.10.6 behavior exactly.
Deferred
- Live-network smoke harness case for an ATS-template aggregator → v0.11.1.
- Format-specific text extraction (PDF/DOCX → Markdown) remains out of scope; users compose with pypdf / python-docx / mammoth / unstructured downstream of saved files.
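One way to wire up that downstream step is to walk the crawl's JSONL output, collect the downloads records, and hand each saved path to a user-supplied extractor. This is a sketch under assumptions: the location of the JSONL file is not stated in these notes, so it is a parameter here, and `iter_downloads` and `extract_all` are hypothetical helpers. Only the record keys (url, path, size_bytes, content_type) come from the changelog above.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterator


def iter_downloads(jsonl_path: Path) -> Iterator[dict]:
    """Yield every download record found in a crawl's JSONL output."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            # The downloads field is omitted when a page had no downloads.
            yield from row.get("downloads", [])


def extract_all(
    jsonl_path: Path, extractors: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Map each saved file path to text using the extractor for its suffix."""
    texts: Dict[str, str] = {}
    for record in iter_downloads(jsonl_path):
        suffix = Path(record["path"]).suffix.lower().lstrip(".")
        extractor = extractors.get(suffix)
        if extractor is not None:  # files without an extractor are skipped
            texts[record["path"]] = extractor(record["path"])
    return texts
```

Each extractor is any function from a file path to text, so a PDF entry could wrap pypdf and a DOCX entry could wrap python-docx or mammoth, depending on what is installed.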
Tests
611 passing (was 566 on v0.10.6; +45 in tests/test_v011_binary_downloads.py). Spec specs/binary-downloads.md confidence-reviewed; all SC/DS rated ≥ 90% before implementation began.
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.