AIMLPM/markcrawl

v0.8.0 Feature

This release adds three notable features: the --screenshot flag, the markcrawl discover subcommand, and multi-site crawling via --seed-file.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Full-page screenshots can now be captured for each crawled page via the new --screenshot flag.

Full changelog

New

--screenshot — full-page captures during crawl

Full-page PNG/JPEG screenshots of every crawled page via Playwright. Paths are recorded on each pages.jsonl row (or screenshot_error on failure). Auto-enables --render-js.

markcrawl --base https://dashboard.example.com/ --out ./out \
  --screenshot --max-pages 20 --show-progress

Flags: --screenshot-viewport WxH, --screenshot-selector CSS (crop to element), --screenshot-format {png,jpeg}, --screenshot-wait-ms N, --no-screenshot-full-page. Failures are recorded rather than aborting extraction.
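Because failures are recorded on the row instead of stopping the crawl, a post-run pass over pages.jsonl is one way to find pages that need a retry. The following is a minimal sketch, not part of MarkCrawl: the only field name taken from these notes is screenshot_error, while the helper name and the `url` key used in the usage note are assumptions for illustration.

```python
import json
from pathlib import Path


def split_screenshot_rows(jsonl_path):
    """Split pages.jsonl rows into (ok, failed) based on `screenshot_error`.

    Hypothetical helper: only the `screenshot_error` field name comes from
    the release notes; everything else about the row schema is assumed.
    """
    ok, failed = [], []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            row = json.loads(line)
            # A non-empty `screenshot_error` marks a failed capture.
            if row.get("screenshot_error"):
                failed.append(row)
            else:
                ok.append(row)
    return ok, failed
```

Calling `split_screenshot_rows("out/pages.jsonl")` (path assumed from the --out ./out example above) returns two lists; printing `row.get("url")` and `row["screenshot_error"]` for each failed row would give a retry list, provided rows carry a `url` key.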

Wait strategy is wait_until="load" plus a configurable post-load pause — not networkidle, which real sites rarely reach due to analytics pings.
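The wait strategy described above can be sketched against Playwright's Python sync API. This is an illustration of the pattern, not MarkCrawl's actual implementation: the function name, the 1000 ms default pause, and the parameter names are assumptions. `page.goto`, `page.wait_for_timeout`, and `page.screenshot` are standard Playwright Page methods.

```python
def capture_after_load(page, url, path, wait_ms=1000, full_page=True):
    """Navigate, wait for the load event, pause, then take a screenshot.

    `page` is expected to be a Playwright sync-API Page. The default pause
    of 1000 ms is an assumed value, not MarkCrawl's default.
    """
    # Wait for the window "load" event rather than "networkidle", which
    # pages that send periodic analytics requests may never reach.
    page.goto(url, wait_until="load")
    # Fixed post-load pause so late-rendering content has time to settle.
    if wait_ms > 0:
        page.wait_for_timeout(wait_ms)
    page.screenshot(path=path, full_page=full_page)
    return path
```

With Playwright installed, `page` would typically come from `sync_playwright()`, then `chromium.launch()` and `browser.new_page()`. Taking the page as a parameter keeps the function easy to exercise with a stub object in tests.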

markcrawl discover — seed pack subcommand

Emits URLs to stdout, pipeable into the main crawler:

markcrawl discover --pack game-dashboards | \
  markcrawl --seed-file - --out ./out \
    --screenshot --max-pages-per-site 5

Ships a curated game-dashboards pack (15 URLs: trackers, aggregators, esports wikis, patch notes). --list-packs enumerates available packs. --provider flag reserved for future search-API integration (currently returns a "not yet implemented" message).

Multi-site crawling via --seed-file

New flag on the main CLI: reads URLs from a file (or from stdin with -). Each seed runs as its own crawl() into a per-netloc subdirectory under --out. Seeds that share a netloc (e.g. multiple Liquipedia sub-wikis) are disambiguated with a path slug so each gets its own output directory. --max-pages-per-site caps the number of pages crawled for each site.
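To make the per-netloc layout concrete, here is a minimal sketch of one way seeds could be mapped to output directory names. The naming scheme (netloc, plus an underscore and a path slug when a netloc is shared) and the function names are assumptions for illustration; the notes do not specify MarkCrawl's actual slug format.

```python
import re
from urllib.parse import urlsplit


def _slug(text):
    """Lower-case and collapse runs of non-alphanumerics into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def plan_output_dirs(seeds):
    """Map each seed URL to a directory name (hypothetical scheme).

    A seed whose netloc is unique uses the netloc alone; seeds sharing a
    netloc get a slug of their path appended so their outputs stay apart.
    """
    by_netloc = {}
    for url in seeds:
        by_netloc.setdefault(urlsplit(url).netloc, []).append(url)

    dirs = {}
    for netloc, urls in by_netloc.items():
        for url in urls:
            slug = _slug(urlsplit(url).path)
            if len(urls) > 1 and slug:
                dirs[url] = f"{netloc}_{slug}"
            else:
                dirs[url] = netloc
    return dirs
```

For example, seeds at `https://liquipedia.net/dota2/` and `https://liquipedia.net/counterstrike/` (illustrative URLs) would map to `liquipedia.net_dota2` and `liquipedia.net_counterstrike`, while a lone `https://example.com/` seed maps to `example.com`. The sketch does not handle two seeds with identical paths on the same netloc.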

New recipe

docs/recipes/game-dashboards.md covers end-to-end capture including a companion yt-dlp + ffmpeg frame-extraction path for gamer YouTube content (kept deliberately separate since video is out of MarkCrawl's scope).

Tested

Full pipeline verified live against the bundled pack: 13/15 sites render correctly, 2 are Cloudflare-blocked at the robots.txt layer (annotated in the pack). All 5 --screenshot-* tweak flags exercised against real sites. YouTube recipe commands run end-to-end. 269 tests, ruff clean.

Known limitations

  • Cloudflare-aggressive sites (wowprogress, hltv) return 403 on robots.txt from non-residential IPs. Not fixable from the crawler side.
  • Default --timeout 15s can be tight when combined with --screenshot on slower sites. Consider --timeout 30 for screenshot runs.

Install

pip install 'markcrawl[js]==0.8.0'
playwright install chromium

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
