This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
This release adds multiple extraction backends, page-type heuristics, semantic chunking, cross-crawl deduplication, link prioritization, and auto-resume.
Full changelog
What's new
Multiple extraction backends (--extractor)
Choose the best extraction strategy for your use case:
- default: BS4 + markdownify (fastest, good for most sites)
- trafilatura: higher recall for complex layouts
- ensemble: runs both and picks the best per page
- readerlm: ML-based extraction via ReaderLM-v2 (install with pip install markcrawl[ml])
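The release notes do not say how the ensemble backend decides which result is "best". The following is a minimal sketch of one plausible approach, assuming the extractors are plain callables and that the amount of non-whitespace text is an acceptable quality proxy. The function name `pick_best_extraction` and the scoring rule are hypothetical and are not markcrawl's API.

```python
from typing import Callable, Dict, Tuple


def pick_best_extraction(
    html: str, extractors: Dict[str, Callable[[str], str]]
) -> Tuple[str, str]:
    """Run every extractor on the page and keep the richest output.

    Assumption: "richest" means the most non-whitespace characters.
    A real ensemble may use a very different quality signal.
    """
    results = {name: extract(html) for name, extract in extractors.items()}

    def text_amount(name: str) -> int:
        # Count characters after removing all whitespace.
        return len("".join(results[name].split()))

    best_name = max(results, key=text_amount)
    return best_name, results[best_name]
```

Selecting per page rather than per crawl means a site with mixed layouts can use a different backend for each page, at the cost of running every extractor on every page.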
Page-type extraction and content-region heuristics
The extraction pipeline now classifies pages by type (article, docs, landing, listing, etc.) and uses content-density scoring to identify the main content region. This keeps navigation boilerplate such as menus out of the output and improves quality across diverse page layouts.
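To illustrate what content-density scoring can look like, here is a small sketch under stated assumptions: each candidate region has already been reduced to a count of visible characters and a count of characters inside links. The `Block` type and the scoring formula are illustrative only; the actual heuristic used by markcrawl is not described in these notes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    name: str        # label for the region, e.g. "nav" or "article"
    text_chars: int  # visible characters in the region
    link_chars: int  # characters that sit inside link text


def density_score(block: Block) -> float:
    """Text length, discounted by the share of that text that is link text."""
    if block.text_chars == 0:
        return 0.0
    link_ratio = block.link_chars / block.text_chars
    return block.text_chars * (1.0 - link_ratio)


def main_region(blocks: List[Block]) -> Block:
    """Pick the region with the highest density score."""
    return max(blocks, key=density_score)
```

A navigation bar is almost entirely link text, so it scores near zero even when it is long, while an article body with a few inline links keeps most of its length as score.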
Semantic chunking
chunk_markdown() now supports adaptive, semantically aware chunking that respects heading boundaries and section structure, producing more coherent chunks for RAG embeddings.
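The signature of chunk_markdown() is not shown in these notes, so the sketch below does not call it. It shows the general idea of heading-aware chunking with a hypothetical helper, `chunk_by_headings`, assuming ATX-style headings (`#` through `######`) and a simple character budget. It does not handle `#` lines inside code fences.

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")


def chunk_by_headings(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split at heading lines, then merge neighbouring sections that fit."""
    sections: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        # A heading starts a new section, so flush what we have so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks: List[str] = []
    for section in sections:
        if not section:
            continue
        # The +2 accounts for the blank line used to join sections.
        if chunks and len(chunks[-1]) + len(section) + 2 <= max_chars:
            chunks[-1] = chunks[-1] + "\n\n" + section
        else:
            chunks.append(section)
    return chunks
```

Because a chunk boundary can only fall on a heading, every chunk after the first starts with a heading, which keeps each section's text together with its title. A section longer than `max_chars` is kept whole in this sketch; a fuller version would split it further.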
Cross-crawl deduplication (--cross-dedup)
Skip pages already seen in previous crawls to the same output directory. Useful for incremental crawls of sites that change slowly:
markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress
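The notes do not specify whether pages are matched by URL or by content, or where the record of seen pages is stored. As a rough illustration of the idea, this sketch assumes a content hash kept in a JSON file inside the output directory. The file name `seen_hashes.json` and all function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path
from typing import Set

SEEN_FILE = "seen_hashes.json"  # hypothetical name, not markcrawl's


def load_seen(out_dir: Path) -> Set[str]:
    """Load hashes recorded by earlier crawls, or an empty set."""
    path = out_dir / SEEN_FILE
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_seen(out_dir: Path, seen: Set[str]) -> None:
    """Persist the hashes so the next crawl can skip these pages."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / SEEN_FILE).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def is_new_page(content: str, seen: Set[str]) -> bool:
    """Return True and record the hash if this content was not seen before."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest in seen:
        return False
    seen.add(digest)
    return True
```

Hashing content rather than URLs would also catch the same page served under two addresses, but it treats any edit, however small, as a new page.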
Link prioritization (--prioritize-links)
Score discovered links by predicted content yield and crawl high-value pages first. When combined with --max-pages, this ensures your page budget is spent on the most useful content:
markcrawl --base https://docs.example.com --out ./docs --prioritize-links --max-pages 100
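How markcrawl predicts content yield is not described here. The sketch below shows the general pattern of spending a fixed page budget on the highest-scoring links, using a priority queue and a deliberately simple URL heuristic. The hint lists, `score_link`, and `crawl_order` are all hypothetical.

```python
import heapq
from typing import Iterable, List
from urllib.parse import urlparse

# Hypothetical path segments that hint at high or low content yield.
HIGH_YIELD = {"docs", "guide", "tutorial", "api"}
LOW_YIELD = {"login", "tag", "search", "signup"}


def score_link(url: str) -> int:
    """Higher is better: reward content-like segments, penalise depth."""
    segments = [s for s in urlparse(url).path.lower().split("/") if s]
    score = 0
    score += 2 * sum(1 for s in segments if s in HIGH_YIELD)
    score -= 3 * sum(1 for s in segments if s in LOW_YIELD)
    score -= len(segments)  # mild penalty for deeply nested pages
    return score


def crawl_order(urls: Iterable[str], max_pages: int) -> List[str]:
    """Return up to max_pages URLs, best score first."""
    # heapq is a min-heap, so negate the score; the index keeps ties stable.
    heap = [(-score_link(u), i, u) for i, u in enumerate(urls)]
    heapq.heapify(heap)
    budget = min(max_pages, len(heap))
    return [heapq.heappop(heap)[2] for _ in range(budget)]
```

With a budget smaller than the number of discovered links, low-scoring pages such as login or tag listings are the ones that never get fetched.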
Auto-resume (--auto-resume)
Automatically resume from saved state if it exists, otherwise start a fresh crawl. No more checking for state files manually:
markcrawl --base https://docs.example.com --out ./docs --auto-resume --show-progress
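The resume-or-start-fresh decision can be pictured as below. This is a sketch only: the state file name `crawl_state.json`, its JSON layout, and the function names are assumptions made for illustration, since the notes do not document markcrawl's saved-state format.

```python
import json
from pathlib import Path
from typing import Any, Dict

STATE_FILE = "crawl_state.json"  # hypothetical name, not markcrawl's


def load_or_init_state(out_dir: Path, base_url: str) -> Dict[str, Any]:
    """Resume from saved state if present, otherwise start a fresh crawl."""
    path = out_dir / STATE_FILE
    if path.exists():
        state = json.loads(path.read_text(encoding="utf-8"))
        state["resumed"] = True
        return state
    return {"base": base_url, "queue": [base_url], "visited": [], "resumed": False}


def save_state(out_dir: Path, state: Dict[str, Any]) -> None:
    """Write the current queue and visited list so a later run can resume."""
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / STATE_FILE).write_text(json.dumps(state), encoding="utf-8")
```

Folding the existence check into the loader is what lets the same command be used for both the first run and every later run.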
Stats
- 200 tests passing
- 9 new features across 3 phases
- 926 lines of new code across 9 files
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]