AIMLPM/markcrawl

v0.5.0 Feature

This release adds six notable features, listed in the changelog below, for engineering teams evaluating rollout.

Published 3mo RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

Adds multiple extraction backends, page-type heuristics, semantic chunking, cross-crawl deduplication, link prioritization, and auto-resume.

Full changelog

What's new

Multiple extraction backends (`--extractor`)

Choose the best extraction strategy for your use case:

`default`: BeautifulSoup 4 + markdownify (fastest, good for most sites)
`trafilatura`: higher recall for complex layouts
`ensemble`: runs both `default` and `trafilatura` and picks the best result per page
`readerlm`: ML-based extraction via ReaderLM-v2 (requires `pip install "markcrawl[ml]"`)
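To make the ensemble idea concrete (run several extractors, keep the best result per page), here is a minimal sketch. The release notes do not say how markcrawl chooses the winner, so the scoring rule (word count discounted by the share of link text) and the names `score_markdown` and `pick_best` are assumptions for illustration only.

```python
import re
from typing import Callable, Dict, Tuple

# Matches Markdown links of the form [text](target); group 1 is the anchor text.
LINK_RE = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def score_markdown(md: str) -> float:
    """Hypothetical quality score: word count, discounted by the share of link text."""
    words = len(md.split())
    if words == 0:
        return 0.0
    link_words = sum(len(m.group(1).split()) for m in LINK_RE.finditer(md))
    return words * (1.0 - min(link_words / words, 1.0))


def pick_best(html: str, extractors: Dict[str, Callable[[str], str]]) -> Tuple[str, str]:
    """Run every extractor on the page and return (name, markdown) of the top scorer."""
    results = {name: fn(html) for name, fn in extractors.items()}
    best = max(results, key=lambda name: score_markdown(results[name]))
    return best, results[best]
```

Each extractor is passed in as a plain callable, so the selection logic stays independent of any particular backend.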

Page-type extraction and content-region heuristics

The extraction pipeline now classifies pages by type (article, docs, landing, listing, etc.) and uses content-density scoring to identify main content regions. This reduces navigation clutter in the output and improves quality across diverse page layouts.
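A rough sketch of what content-density scoring can look like, using only the standard library. This is not markcrawl's implementation: the container tag list, the `DensityScorer` class, and the formula (text length times one minus link density) are illustrative assumptions.

```python
from html.parser import HTMLParser

# Tags treated as candidate content regions (an assumption for this sketch).
CONTAINERS = {"main", "article", "section", "div", "nav", "header", "footer", "aside"}


class DensityScorer(HTMLParser):
    """Scores each container by its text length, discounted by link density."""

    def __init__(self):
        super().__init__()
        self.stack = []    # open containers as [tag, text_chars, link_chars]
        self.blocks = []   # closed containers as (tag, score)
        self.in_link = 0   # nesting depth of <a> elements

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append([tag, 0, 0])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag in CONTAINERS and self.stack and self.stack[-1][0] == tag:
            name, text, link = self.stack.pop()
            density = link / text if text else 1.0
            self.blocks.append((name, text * (1.0 - density)))

    def handle_data(self, data):
        n = len(data.strip())
        if n and self.stack:
            self.stack[-1][1] += n       # credit text to the innermost container only
            if self.in_link:
                self.stack[-1][2] += n   # the same characters also count as link text


def best_block(html):
    """Return (tag, score) of the highest-scoring container, or None if there is none."""
    parser = DensityScorer()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=lambda b: b[1], default=None)
```

A navigation bar made entirely of links has a density of 1.0 and scores zero, while an article body with a few inline links keeps most of its score.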

Semantic chunking

`chunk_markdown()` now supports adaptive, semantically aware chunking that respects heading boundaries and section structure, producing more coherent chunks for RAG embeddings.
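The following sketch shows one simple way to respect heading boundaries when chunking. It does not reproduce `chunk_markdown()` or its signature; the function name `chunk_by_headings` and the `max_chars` parameter are hypothetical, and a section longer than `max_chars` is kept whole here instead of being split further.

```python
import re
from typing import List

# ATX headings: one to six '#' characters followed by whitespace.
HEADING_RE = re.compile(r"^#{1,6}\s")


def chunk_by_headings(markdown: str, max_chars: int = 1000) -> List[str]:
    """Split Markdown at headings, then pack whole sections into chunks up to max_chars."""
    sections, current = [], []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence          # ignore '#' lines inside fenced code
        if HEADING_RE.match(line) and not in_fence and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks, buf = [], ""
    for section in filter(None, sections):
        # Start a new chunk when adding this section would exceed the budget.
        if buf and len(buf) + len(section) + 2 > max_chars:
            chunks.append(buf)
            buf = ""
        buf = f"{buf}\n\n{section}" if buf else section
    if buf:
        chunks.append(buf)
    return chunks
```

Because sections are never cut mid-way, every chunk begins at a heading (or at the start of the document), which tends to keep each embedding on a single topic.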

Cross-crawl deduplication (`--cross-dedup`)

Skip pages already seen in previous crawls to the same output directory. Useful for incremental crawls of sites that change slowly:

markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress
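Conceptually, cross-crawl deduplication needs a record of what earlier crawls saw that persists in the output directory. The sketch below keeps a set of content hashes in a JSON file. The file name `seen_hashes.json`, the `SeenIndex` class, and the choice to hash whitespace-normalized content are assumptions; the release notes do not describe how markcrawl stores or compares previous pages.

```python
import hashlib
import json
from pathlib import Path


class SeenIndex:
    """Hypothetical on-disk index of content hashes shared by crawls into one directory."""

    def __init__(self, out_dir: str, filename: str = "seen_hashes.json"):
        self.path = Path(out_dir) / filename
        # Load hashes from a previous crawl if the index file already exists.
        self.hashes = set(json.loads(self.path.read_text())) if self.path.exists() else set()

    def is_new(self, content: str) -> bool:
        """Return True (and record the hash) the first time this content is seen."""
        normalized = " ".join(content.split())   # collapse whitespace differences
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in self.hashes:
            return False
        self.hashes.add(digest)
        return True

    def save(self) -> None:
        """Persist the index so the next crawl into this directory can skip known pages."""
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(sorted(self.hashes)))
```

A crawler would call `is_new()` before writing each page and `save()` once at the end of the run.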

Link prioritization (`--prioritize-links`)

Score discovered links by predicted content yield and crawl high-value pages first. When combined with --max-pages, this ensures your page budget is spent on the most useful content:

markcrawl --base https://docs.example.com --out ./docs --prioritize-links --max-pages 100
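One way to picture link prioritization under a page budget is a priority queue ordered by a yield estimate. In this sketch the scoring heuristics (`GOOD_HINTS`, `BAD_HINTS`, the depth and query-string penalties) and the function names are invented for illustration; markcrawl's actual predictor is not described in these notes.

```python
import heapq
from urllib.parse import urlparse

# Hypothetical path hints; a real crawler would tune or learn these.
GOOD_HINTS = ("docs", "guide", "tutorial", "reference", "api")
BAD_HINTS = ("login", "signup", "tag", "search", "cart")


def score_link(url: str) -> float:
    """Guess content yield from the URL alone."""
    parsed = urlparse(url)
    segments = [s for s in parsed.path.lower().split("/") if s]
    score = 1.0
    score += sum(1.0 for hint in GOOD_HINTS if hint in segments)  # doc-like paths
    score -= sum(2.0 for hint in BAD_HINTS if hint in segments)   # utility pages
    score -= 0.1 * len(segments)   # deeper pages rank slightly lower
    if parsed.query:
        score -= 0.5               # query strings often mean duplicate views
    return score


def crawl_order(links, max_pages):
    """Return up to max_pages URLs, highest predicted yield first."""
    # Negate scores because heapq is a min-heap; the index breaks ties stably.
    heap = [(-score_link(u), i, u) for i, u in enumerate(links)]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(min(max_pages, len(heap)))]
```

With a budget of two pages, a documentation guide outranks a paginated blog index, and a login page is dropped entirely.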

Auto-resume (`--auto-resume`)

Automatically resume from saved state if it exists, otherwise start a fresh crawl. No more checking for state files manually:

markcrawl --base https://docs.example.com --out ./docs --auto-resume --show-progress
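The "resume if state exists, otherwise start fresh" behaviour can be sketched as a small load-or-initialize helper. The state file name `crawl_state.json`, its fields (`base`, `queue`, `visited`), and both function names are assumptions made for this example and do not reflect markcrawl's real state format.

```python
import json
from pathlib import Path


def load_or_init_state(out_dir: str, base_url: str, filename: str = "crawl_state.json") -> dict:
    """Resume from a saved state file if present and valid; otherwise start fresh."""
    path = Path(out_dir) / filename
    if path.exists():
        try:
            state = json.loads(path.read_text())
            if state.get("base") == base_url:
                return state
        except (json.JSONDecodeError, AttributeError):
            pass  # unreadable or unexpected state: fall through to a fresh crawl
    return {"base": base_url, "queue": [base_url], "visited": []}


def save_state(out_dir: str, state: dict, filename: str = "crawl_state.json") -> None:
    """Write state to a temporary file, then rename it over the real one."""
    path = Path(out_dir) / filename
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)  # rename in one step so an interrupted save leaves the old file intact
```

Checking that the saved `base` matches the requested one avoids resuming a crawl of a different site that happened to use the same output directory.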

Stats

200 tests passing
9 new features across 3 phases
926 lines of new code across 9 files

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
