AIMLPM/markcrawl

v0.3.0 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 3mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Added --exclude-path glob filtering, --dry-run preview mode, and updated docs.

Full changelog

What's new

--exclude-path: skip junk pages

Glob-style URL path filtering prevents crawling thousands of templated pages (job listings, resume examples, SEO spam).

markcrawl --base https://example.com \
  --exclude-path "/job/*" --exclude-path "/careers/*" \
  --max-pages 500 --out ./output

The flag can be repeated to supply multiple patterns. Filtering is available in both the CLI and the Python API:

crawl("https://example.com", out_dir="output", exclude_paths=["/job/*", "/careers/*"])
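
To make the matching behavior concrete, here is a minimal sketch of glob-style path filtering using Python's standard library. It is not markcrawl's actual implementation: the helper name `is_excluded` is hypothetical, and the sketch assumes patterns are matched against the URL path only (not the host or query string), which the release notes do not spell out.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlparse


def is_excluded(url: str, exclude_paths: list[str]) -> bool:
    """Return True if the URL's path matches any glob pattern (hypothetical helper)."""
    # Assumption: only the path component is compared against the patterns.
    path = urlparse(url).path or "/"
    # fnmatchcase gives case-sensitive matching on every platform.
    return any(fnmatchcase(path, pattern) for pattern in exclude_paths)


urls = [
    "https://example.com/docs/intro",
    "https://example.com/job/12345",
    "https://example.com/careers/engineering",
]
patterns = ["/job/*", "/careers/*"]

# Keep only the URLs whose path matches none of the exclude patterns.
kept = [u for u in urls if not is_excluded(u, patterns)]
print(kept)
```

In this sketch `*` also matches `/`, so `/job/*` covers nested paths such as `/job/a/b`. Whether markcrawl treats `*` the same way is worth confirming with --dry-run before a large crawl.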

--dry-run: preview before crawling

Discover URLs (via sitemap and robots.txt) and print them without fetching content.

markcrawl --base https://example.com --dry-run
markcrawl --base https://example.com --dry-run | wc -l
markcrawl --base https://example.com --dry-run | grep "/job/"
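
For readers curious what URL discovery without fetching page content can look like, the sketch below parses a sitemap document and lists its URLs. It is an illustration under stated assumptions, not markcrawl's code: the function name `urls_from_sitemap` and the inline sample sitemap are hypothetical, and a real dry run would first download the sitemap named in robots.txt.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Extract every <loc> under <url> from a sitemap document (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Inline sample so the sketch runs offline; a crawler would fetch this instead.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/job/42</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>"""

discovered = urls_from_sitemap(sample)
for url in discovered:
    print(url)  # One URL per line, so the output pipes cleanly into wc or grep.
```

Printing one URL per line is what makes the `wc -l` and `grep` pipelines above useful for sizing a crawl and spotting templated sections.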

Recipes and docs

New README recipes for safe crawling (dry-run + exclude patterns, job board handling)
Updated LLM_PROMPT.md with new flags and workflows
Benchmark text updated: markcrawl is now fastest at 14.0 pages/sec

Bug fix

The warning for Playwright with concurrency > 1 is now emitted through logger.warning(), so it always appears instead of only when --show-progress is set.
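
The difference between the two behaviors can be sketched as follows. This is an assumption-laden illustration rather than the project's code: the logger name, function names, and message text are all hypothetical. It contrasts a warning gated on a progress flag with one sent through the standard `logging` module, which emits WARNING-level records by default.

```python
import io
import logging

# Hypothetical logger name; the real project may use a different one.
logger = logging.getLogger("crawler_sketch")
MESSAGE = "Playwright with concurrency > 1 may be unstable"  # illustrative text


def warn_via_progress(show_progress: bool, out: io.StringIO) -> None:
    """Old-style behavior: the warning is printed only when progress output is on."""
    if show_progress:
        print(f"warning: {MESSAGE}", file=out)


def warn_via_logger() -> None:
    """New-style behavior: the warning goes through logging regardless of flags."""
    logger.warning(MESSAGE)


# Capture log output in a buffer so the result can be inspected.
log_buffer = io.StringIO()
logger.addHandler(logging.StreamHandler(log_buffer))
logger.propagate = False

progress_buffer = io.StringIO()
warn_via_progress(show_progress=False, out=progress_buffer)
warn_via_logger()

print(repr(progress_buffer.getvalue()))  # Nothing was written without the flag.
print(log_buffer.getvalue().strip())     # The logger emitted the warning.
```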

How to upgrade

pip install --upgrade markcrawl

Full changelog

https://github.com/AIMLPM/markcrawl/compare/v0.2.0...v0.3.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
