This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
Added --exclude-path glob filtering, --dry-run preview mode, and updated docs.
What's new
--exclude-path: skip junk pages
Glob-style URL path filtering prevents crawling thousands of templated pages (job listings, resume examples, SEO spam).
markcrawl --base https://example.com \
--exclude-path "/job/*" --exclude-path "/careers/*" \
--max-pages 500 --out ./output
The flag can be repeated for multiple patterns. It works in both the CLI and the Python API:
crawl("https://example.com", out_dir="output", exclude_paths=["/job/*", "/careers/*"])
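To get a feel for which URLs a pattern such as "/job/*" would skip, here is a minimal sketch of glob-style path matching using Python's standard fnmatch module. It is an illustration only: markcrawl's actual matching rules are not described in these notes and may differ, and the helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_excluded(url: str, exclude_paths: list[str]) -> bool:
    """Return True if the URL's path matches any glob pattern.

    Hypothetical helper, not part of markcrawl. Only the path is
    compared, so the scheme, host, and query string are ignored.
    """
    path = urlsplit(url).path or "/"
    return any(fnmatchcase(path, pattern) for pattern in exclude_paths)


patterns = ["/job/*", "/careers/*"]
urls = [
    "https://example.com/job/12345",
    "https://example.com/careers/engineering",
    "https://example.com/blog/launch",
    "https://example.com/jobs",
]

# Keep only the URLs whose path matches none of the patterns.
kept = [u for u in urls if not is_excluded(u, patterns)]
print(kept)
```

Note that in fnmatch the `*` wildcard also matches `/`, so under this sketch "/job/*" covers nested paths like /job/1/apply but not /jobs.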
--dry-run: preview before crawling
Discover URLs (via sitemap and robots.txt) and print them without fetching content.
markcrawl --base https://example.com --dry-run
markcrawl --base https://example.com --dry-run | wc -l
markcrawl --base https://example.com --dry-run | grep "/job/"
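The dry-run output can also guide which patterns to exclude. The sketch below groups URLs by their first path segment so that large templated sections stand out. It assumes the dry run prints one URL per line (which the wc -l example suggests but these notes do not state), and count_by_prefix is a hypothetical helper, not part of markcrawl.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_prefix(lines):
    """Count URLs by first path segment, e.g. '/job' for '/job/123'."""
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url.startswith(("http://", "https://")):
            continue  # skip blank lines and any non-URL output
        segments = [s for s in urlsplit(url).path.split("/") if s]
        key = "/" + segments[0] if segments else "/"
        counts[key] += 1
    return counts


# Stand-in for captured dry-run output (assumed one URL per line).
sample = [
    "https://example.com/",
    "https://example.com/job/1",
    "https://example.com/job/2",
    "https://example.com/blog/post",
    "",
]

counts = count_by_prefix(sample)
for prefix, n in counts.most_common():
    print(n, prefix)
```

To use it on real output, the function could be fed sys.stdin in a script that receives the piped dry-run output. A prefix with a very high count is a candidate for --exclude-path.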
Recipes and docs
- New README recipes for safe crawling (dry-run + exclude patterns, job board handling)
- Updated LLM_PROMPT.md with new flags and workflows
- Benchmark text updated: markcrawl is now fastest at 14.0 pages/sec
Bug fix
- The Playwright + concurrency > 1 warning now uses logger.warning(), so it always shows, not just with --show-progress
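As a rough illustration of why routing the message through the logging module makes it independent of a progress flag, consider the sketch below. The logger name, function name, and message text are placeholders, not markcrawl's actual code.

```python
import logging

logger = logging.getLogger("crawler_sketch")  # hypothetical logger name


def warn_on_risky_config(use_playwright: bool, concurrency: int) -> bool:
    """Emit a warning via logging; return True if one was emitted.

    There is no show-progress parameter here: the warning depends only
    on the configuration, not on whether progress output is enabled.
    """
    if use_playwright and concurrency > 1:
        # Placeholder message text, not the wording markcrawl uses.
        logger.warning(
            "Playwright is enabled with concurrency=%d (> 1)", concurrency
        )
        return True
    return False


class ListHandler(logging.Handler):
    """Collects formatted messages so the sketch can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)

warn_on_risky_config(use_playwright=True, concurrency=4)   # warns
warn_on_risky_config(use_playwright=True, concurrency=1)   # silent
print(handler.messages)
```

With Python's default logging configuration, messages at WARNING level and above are emitted even when the application has not set up any handlers, which is what makes this approach more reliable than a conditional print.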
How to upgrade
pip install --upgrade markcrawl
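After upgrading, one way to confirm that the installed version is at least 0.3.0 (the version in the changelog link) is to read the package metadata with the standard library. This is a simplistic sketch: the helper names are hypothetical, and the version parsing only handles plain dotted numbers, so pre-release suffixes are not compared properly.

```python
from importlib.metadata import PackageNotFoundError, version


def release_tuple(version_string: str) -> tuple:
    """Turn '0.3.0' into (0, 3, 0); stop at the first non-numeric part."""
    parts = []
    for piece in version_string.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_new_flags(installed: str) -> bool:
    """True if the version is 0.3.0 or later (hypothetical helper)."""
    return release_tuple(installed) >= (0, 3, 0)


try:
    installed = version("markcrawl")
    status = "ok" if has_new_flags(installed) else "needs upgrade"
    print(installed, status)
except PackageNotFoundError:
    print("markcrawl is not installed in this environment")
```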
Full changelog
https://github.com/AIMLPM/markcrawl/compare/v0.2.0...v0.3.0
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.