AIMLPM/markcrawl

v0.5.0 Feature

This release adds six user-facing features, listed below, for engineering teams evaluating rollout.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

This release adds multiple extraction backends, page-type heuristics, semantic chunking, cross-crawl deduplication, link prioritization, and auto-resume.

Full changelog

What's new

Multiple extraction backends (--extractor)

Choose the best extraction strategy for your use case:

  • default — BS4 + markdownify (fastest, good for most sites)
  • trafilatura — higher recall for complex layouts
  • ensemble — runs both and picks the best per page
  • readerlm — ML-based extraction via ReaderLM-v2 (pip install markcrawl[ml])
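
As a rough illustration of the ensemble idea, the sketch below runs several extractors on the same page and keeps one result. The `pick_best` helper, the stand-in extractors, and the word-count scoring rule are all assumptions made for this example; the release notes do not say how markcrawl decides which output is "best".

```python
from typing import Callable, Dict, Tuple

Extractor = Callable[[str], str]


def pick_best(html: str, extractors: Dict[str, Extractor]) -> Tuple[str, str]:
    """Run every extractor and return (name, output) for the highest-scoring one.

    Scoring by word count is an assumed placeholder, not markcrawl's rule.
    """
    results = {name: extract(html) for name, extract in extractors.items()}
    best_name = max(results, key=lambda name: len(results[name].split()))
    return best_name, results[best_name]


# Stand-in extractors for the demo; real backends would parse the HTML.
demo_extractors = {
    "fast": lambda html: "Title",
    "thorough": lambda html: "Title and the full article body",
}
best_name, best_text = pick_best("<html>...</html>", demo_extractors)
```

Running every backend costs more time per page, which is presumably why the single-backend options remain available.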

Page-type extraction and content-region heuristics

The extraction pipeline now classifies pages by type (article, docs, landing, listing, etc.) and uses content-density scoring to identify main content regions. This keeps navigation menus and similar boilerplate out of the output and improves quality across diverse page layouts.
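
One common way to approximate content-density scoring is to favor blocks with a lot of text and few links. The sketch below is a generic version of that heuristic, not markcrawl's actual scoring; `density_score` and the sample blocks are hypothetical.

```python
def density_score(text: str, link_text: str) -> float:
    """Score a block: longer text scores higher, link-heavy text scores lower."""
    total = len(text.strip())
    if total == 0:
        return 0.0
    link_ratio = min(len(link_text.strip()) / total, 1.0)
    return total * (1.0 - link_ratio)


# Each candidate block is (all visible text, the part of it inside <a> tags).
blocks = {
    "nav": ("Home Docs Blog", "Home Docs Blog"),
    "article": (
        "Install the package, then run the crawler against your site.",
        "crawler",
    ),
}
scores = {name: density_score(text, links) for name, (text, links) in blocks.items()}
main_region = max(scores, key=scores.get)
```

A navigation bar is almost entirely link text, so it scores near zero, while a paragraph with one inline link keeps most of its score.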

Semantic chunking

chunk_markdown() now supports adaptive, semantically aware chunking that respects heading boundaries and section structure, producing more coherent chunks for RAG embeddings.
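
To make "respects heading boundaries" concrete, here is a minimal standalone sketch of heading-aware chunking. It is not the `chunk_markdown()` implementation and does not use its signature (which these notes do not show); `split_sections`, `chunk_sections`, and `max_chars` are names invented for the example.

```python
import re
from typing import List

# ATX headings only ("# Title"). A "#" line inside a code fence would be
# misread as a heading; a real chunker would need to track fences.
HEADING = re.compile(r"^#{1,6}\s")


def split_sections(markdown: str) -> List[str]:
    """Split a document so that every heading starts a new section."""
    sections, current = [], []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


def chunk_sections(markdown: str, max_chars: int = 1000) -> List[str]:
    """Keep small sections whole; split oversized ones on paragraph breaks."""
    chunks = []
    for section in split_sections(markdown):
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        buffer = ""
        for paragraph in section.split("\n\n"):
            candidate = f"{buffer}\n\n{paragraph}" if buffer else paragraph
            if len(candidate) > max_chars and buffer:
                chunks.append(buffer)
                buffer = paragraph
            else:
                buffer = candidate
        if buffer:
            chunks.append(buffer)
    return chunks


doc = "# Install\n\nRun pip.\n\n## Usage\n\nRun the CLI."
chunks = chunk_sections(doc, max_chars=200)
```

Because a chunk never spans two headings, each embedding covers a single topic, which is the property the release note describes.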

Cross-crawl deduplication (--cross-dedup)

Skip pages already seen in previous crawls that wrote to the same output directory. Useful for incremental crawls of sites that change slowly:

markcrawl --base https://docs.example.com --out ./docs --cross-dedup --show-progress
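
Conceptually, deduplicating across crawls needs a record of what was seen that outlives a single run. The sketch below keeps a set of content hashes in a JSON file inside the output directory. Whether markcrawl keys on content hashes or URLs, and where it stores that record, is not stated here, so the `SeenIndex` class and the `seen_hashes.json` file name are assumptions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


class SeenIndex:
    """Set of content hashes persisted between crawls (hypothetical design)."""

    def __init__(self, out_dir: str, filename: str = "seen_hashes.json"):
        self.path = Path(out_dir) / filename
        if self.path.exists():
            self.hashes = set(json.loads(self.path.read_text(encoding="utf-8")))
        else:
            self.hashes = set()

    def is_new(self, content: str) -> bool:
        """Return True the first time this exact content is seen."""
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest in self.hashes:
            return False
        self.hashes.add(digest)
        return True

    def save(self) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.path.write_text(json.dumps(sorted(self.hashes)), encoding="utf-8")


page = "# Install\n\nRun pip install."
with tempfile.TemporaryDirectory() as out_dir:
    first_crawl = SeenIndex(out_dir)
    first_run_is_new = first_crawl.is_new(page)    # first crawl: page is written
    first_crawl.save()

    second_crawl = SeenIndex(out_dir)              # later crawl, same directory
    second_run_is_new = second_crawl.is_new(page)  # unchanged page is skipped
```

Hashing the content rather than the URL means a page that changed since the last crawl is treated as new.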

Link prioritization (--prioritize-links)

Score discovered links by predicted content yield and crawl high-value pages first. Combined with --max-pages, this spends your page budget on the links predicted to be most useful:

markcrawl --base https://docs.example.com --out ./docs --prioritize-links --max-pages 100
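
The interaction between scoring and a page budget can be sketched with a priority queue. The scoring rule below (`toy_score`) is a made-up path heuristic, not markcrawl's yield predictor, and `crawl_order` is a hypothetical helper name.

```python
import heapq
from typing import Callable, Iterable, List
from urllib.parse import urlparse


def toy_score(url: str) -> float:
    """Assumed heuristic: boost documentation paths, penalize utility pages."""
    path = urlparse(url).path
    score = 0.0
    if path.startswith(("/docs/", "/guide/")):
        score += 2.0
    if path.startswith(("/login", "/tag/")):
        score -= 1.0
    return score


def crawl_order(
    links: Iterable[str], score: Callable[[str], float], max_pages: int
) -> List[str]:
    """Return up to max_pages links, highest score first."""
    # heapq is a min-heap, so negate the score; the index keeps ties in
    # discovery order.
    heap = [(-score(url), index, url) for index, url in enumerate(links)]
    heapq.heapify(heap)
    ordered = []
    while heap and len(ordered) < max_pages:
        ordered.append(heapq.heappop(heap)[2])
    return ordered


discovered = [
    "https://docs.example.com/login",
    "https://docs.example.com/docs/install",
    "https://docs.example.com/about",
    "https://docs.example.com/docs/api",
]
to_crawl = crawl_order(discovered, toy_score, max_pages=2)
```

With a budget of two pages, the login and about pages are never fetched. A real crawler would also push newly discovered links onto the heap as it goes; this sketch ranks a fixed list for brevity.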

Auto-resume (--auto-resume)

Automatically resume from saved state if it exists, otherwise start a fresh crawl. No more checking for state files manually:

markcrawl --base https://docs.example.com --out ./docs --auto-resume --show-progress
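
The "resume if state exists, otherwise start fresh" behavior amounts to a single check at startup. The sketch below shows that pattern with a JSON state file; the file name `crawl_state.json`, the state layout, and both helper functions are assumptions for illustration, not markcrawl's on-disk format.

```python
import json
import tempfile
from pathlib import Path


def load_or_init_state(state_path: Path, base_url: str) -> dict:
    """Return saved crawl state if the file exists, else a fresh state."""
    if state_path.exists():
        return json.loads(state_path.read_text(encoding="utf-8"))
    return {"base": base_url, "visited": [], "queue": [base_url]}


def save_state(state_path: Path, state: dict) -> None:
    state_path.write_text(json.dumps(state), encoding="utf-8")


with tempfile.TemporaryDirectory() as out_dir:
    state_file = Path(out_dir) / "crawl_state.json"  # hypothetical file name
    base = "https://docs.example.com"

    fresh = load_or_init_state(state_file, base)     # no file yet: fresh crawl
    fresh["visited"].append(fresh["queue"].pop(0))   # pretend one page was crawled
    save_state(state_file, fresh)

    resumed = load_or_init_state(state_file, base)   # file exists: resume
```

The same command line works for the first run and every later run, which is convenient for scheduled jobs.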

Stats

  • 200 tests passing
  • 9 new features across 3 phases
  • 926 lines of new code across 9 files


About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
