This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Topics
+9 more
Summary
AI summaryAdded --smart-sample flag for auto-detecting and sampling large URL pattern clusters.
Full changelog
What's new
--smart-sample: URL pattern clustering
Auto-detect templated URL patterns (e.g. /jobs/*, /products/*) and sample from large clusters instead of crawling every instance. Perfect for sites with thousands of near-identical pages like e-commerce catalogs, job boards, and real estate listings.
# Preview the pattern clusters
markcrawl --base https://bigsite.com --dry-run --smart-sample --show-progress
# Crawl with sampling
markcrawl --base https://bigsite.com --out ./bigsite --smart-sample --show-progress
New flags
--smart-sample— enable pattern clustering and sampling--sample-size N— pages to sample per cluster (default: 5)--sample-threshold N— clusters larger than N are sampled (default: 20)
JSONL metadata
When smart sampling is active, each row in pages.jsonl includes:
pattern— the URL pattern cluster (e.g./jobs/*)pattern_cluster_size— total URLs in that clusteris_sample— whether this page was sampled from a templated cluster
Closes #12.
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]