AIMLPM/markcrawl

v0.4.0 Feature

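This release adds smart sampling: a --smart-sample flag plus two tuning flags (--sample-size and --sample-threshold) that cluster templated URLs and crawl only a sample of each large cluster.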

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Added --smart-sample flag for auto-detecting and sampling large URL pattern clusters.

Full changelog

What's new

--smart-sample: URL pattern clustering

Auto-detect templated URL patterns (e.g. /jobs/*, /products/*) and sample from large clusters instead of crawling every instance. Perfect for sites with thousands of near-identical pages like e-commerce catalogs, job boards, and real estate listings.

# Preview the pattern clusters
markcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress

# Crawl with sampling
markcrawl --base https://bigsite.com --out ./bigsite --smart-sample --show-progress

New flags

  • --smart-sample — enable pattern clustering and sampling
  • --sample-size N — pages to sample per cluster (default: 5)
  • --sample-threshold N — clusters larger than N are sampled (default: 20)
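To make the interaction between --sample-size and --sample-threshold concrete, here is a minimal Python sketch of pattern clustering and sampling. It is not markcrawl's implementation: the function names `url_pattern` and `smart_sample`, the "segment contains a digit" heuristic, and the use of random sampling are all assumptions made for illustration. Only the defaults (5 and 20) and the "clusters larger than N are sampled" rule come from the release notes.

```python
import random
import re
from collections import defaultdict
from urllib.parse import urlparse

# Assumed heuristic: a path segment is "variable" if it contains a digit.
_VARIABLE_SEGMENT = re.compile(r"\d")


def url_pattern(url: str) -> str:
    """Collapse variable-looking path segments to '*', e.g. /jobs/123 -> /jobs/*."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    collapsed = ["*" if _VARIABLE_SEGMENT.search(s) else s for s in segments]
    return "/" + "/".join(collapsed)


def smart_sample(urls, sample_size=5, sample_threshold=20, seed=0):
    """Keep every URL of a small cluster; keep a random sample of a large one."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)

    rng = random.Random(seed)  # seeded so repeated runs pick the same sample
    selected = []
    for members in clusters.values():
        if len(members) > sample_threshold:
            # Large cluster: crawl only sample_size of its pages.
            selected.extend(rng.sample(members, min(sample_size, len(members))))
        else:
            # Small cluster: crawl everything.
            selected.extend(members)
    return selected


if __name__ == "__main__":
    urls = [f"https://bigsite.com/jobs/{i}" for i in range(100)]
    urls += ["https://bigsite.com/about", "https://bigsite.com/contact"]
    picked = smart_sample(urls)
    print(len(picked))  # 5 sampled job pages + 2 standalone pages = 7
```

With the defaults, a cluster of exactly 20 URLs is still crawled in full, because only clusters strictly larger than the threshold are sampled.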

JSONL metadata

When smart sampling is active, each row in pages.jsonl includes:

  • pattern — the URL pattern cluster (e.g. /jobs/*)
  • pattern_cluster_size — total URLs in that cluster
  • is_sample — whether this page was sampled from a templated cluster
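A downstream consumer might use these fields to see how much of each cluster was actually crawled. The sketch below assumes pages.jsonl holds one JSON object per line and that rows written without smart sampling lack the `pattern` key. The helper name `summarize_patterns` is hypothetical, and the file's location (for example inside the --out directory) is also an assumption, so the path is passed in explicitly.

```python
import json
from collections import Counter


def summarize_patterns(jsonl_path):
    """Map each pattern to (rows crawled, reported total cluster size)."""
    crawled = Counter()
    cluster_size = {}
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            pattern = row.get("pattern")
            if pattern is None:
                continue  # assumed: row written without smart sampling
            crawled[pattern] += 1
            cluster_size[pattern] = row.get("pattern_cluster_size")
    return {p: (crawled[p], cluster_size[p]) for p in crawled}


def sampled_rows(jsonl_path):
    """Yield only the rows flagged as samples from a templated cluster."""
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                row = json.loads(line)
                if row.get("is_sample"):
                    yield row
```

Comparing the two numbers in each tuple shows the coverage per cluster, which is useful when deciding whether to raise --sample-size for a particular site.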

Closes #12.


About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
