AIMLPM/markcrawl

v0.3.0 Feature

This release adds two CLI options, --exclude-path and --dry-run, along with updated recipes and docs and one logging bug fix.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

Added --exclude-path glob filtering, --dry-run preview mode, and updated docs.

Full changelog

What's new

--exclude-path: skip junk pages

Glob-style URL path filtering prevents crawling thousands of templated pages (job listings, resume examples, SEO spam).

markcrawl --base https://example.com \
  --exclude-path "/job/*" --exclude-path "/careers/*" \
  --max-pages 500 --out ./output

The flag can be repeated to supply multiple patterns. It works in both the CLI and the Python API:

crawl("https://example.com", out_dir="output", exclude_paths=["/job/*", "/careers/*"])
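
To get a feel for which URLs a pattern such as "/job/*" would skip, the matching can be approximated with Python's standard fnmatch module. This is only a sketch: the release notes do not say how markcrawl implements its glob matching, so the helper name is_excluded and the fnmatch-style semantics (where * also matches across "/") are assumptions.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_excluded(url, exclude_paths):
    """Return True if the URL's path matches any glob pattern.

    Hypothetical helper for illustration; not markcrawl's own code.
    """
    path = urlsplit(url).path or "/"
    return any(fnmatchcase(path, pattern) for pattern in exclude_paths)


patterns = ["/job/*", "/careers/*"]
urls = [
    "https://example.com/job/12345",
    "https://example.com/careers/engineering",
    "https://example.com/blog/launch",
    "https://example.com/jobs",
]

# Keep only the URLs whose path matches none of the patterns.
kept = [u for u in urls if not is_excluded(u, patterns)]
print(kept)
```

Under these assumed semantics "/jobs" is kept, because "/job/*" requires the literal "/job/" prefix. If a real crawl behaves differently, the tool's own matching is authoritative.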

--dry-run: preview before crawling

Discover URLs (via sitemap and robots.txt) and print them without fetching content.

markcrawl --base https://example.com --dry-run
markcrawl --base https://example.com --dry-run | wc -l
markcrawl --base https://example.com --dry-run | grep "/job/"
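
The wc and grep pipelines above suggest the dry run prints one URL per line. Assuming that, a short script can group the discovered URLs by their first path segment, which helps spot templated sections worth passing to --exclude-path. The function name count_by_first_segment and the sample URLs are illustrative, not part of markcrawl.

```python
from collections import Counter
from urllib.parse import urlsplit


def count_by_first_segment(lines):
    """Count URLs per first path segment, e.g. "/job" or "/blog".

    Assumes one URL per line; lines that are not http(s) URLs are skipped.
    """
    counts = Counter()
    for line in lines:
        url = line.strip()
        if not url.startswith(("http://", "https://")):
            continue
        segments = [s for s in urlsplit(url).path.split("/") if s]
        key = "/" + segments[0] if segments else "/"
        counts[key] += 1
    return counts


# Sample data standing in for real dry-run output.
sample = [
    "https://example.com/job/1",
    "https://example.com/job/2",
    "https://example.com/job/3",
    "https://example.com/blog/launch",
    "https://example.com/",
]

for key, n in count_by_first_segment(sample).most_common():
    print(f"{n:6d}  {key}")
```

To run it on real output, replace sample with sys.stdin and pipe the dry run into the script.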

Recipes and docs

  • New README recipes for safe crawling (dry-run + exclude patterns, job board handling)
  • Updated LLM_PROMPT.md with new flags and workflows
  • Benchmark text updated: markcrawl is now fastest at 14.0 pages/sec

Bug fix

  • The warning for running Playwright with concurrency > 1 now goes through logger.warning(), so it always appears rather than only when --show-progress is set
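
The difference behind this fix can be sketched with the standard logging module: a message sent through logger.warning() is emitted regardless of any progress flag, while output tied to a flag only shows when the flag is set. The function name, logger name, and message text below are placeholders, not markcrawl's actual code.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("crawler_sketch")


def warn_if_risky(use_playwright, concurrency, show_progress=False):
    """Emit a warning independent of the show_progress flag."""
    if use_playwright and concurrency > 1:
        # Placeholder wording; the real warning text is not in the release notes.
        logger.warning("Playwright is enabled with concurrency=%d", concurrency)
    if show_progress:
        # Progress output stays gated on the flag.
        print("progress output enabled")


warn_if_risky(use_playwright=True, concurrency=4, show_progress=False)
```

With no logging configuration at all, Python still writes WARNING-level messages to stderr, which is why routing the message through the logger makes it visible by default.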

How to upgrade

pip install --upgrade markcrawl
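
After upgrading, one way to confirm which version is installed is to query the package metadata from Python. This is a minimal sketch using the standard importlib.metadata module; it assumes the distribution name is markcrawl, as in the pip command above, and installed_version is an illustrative helper name.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version, or None if markcrawl is not installed.
print(installed_version("markcrawl"))
```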

Full changelog

https://github.com/AIMLPM/markcrawl/compare/v0.2.0...v0.3.0

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
