AIMLPM/markcrawl

v0.8.0 Feature

This release adds three notable features: full-page screenshots via --screenshot, the markcrawl discover subcommand, and multi-site crawling via --seed-file.

Published 3mo ago · RAG & Retrieval

✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Full-page screenshots can now be captured for each crawled page via the new --screenshot flag.

Full changelog

New

`--screenshot` — full-page captures during crawl

Full-page PNG/JPEG screenshots of every crawled page via Playwright. Paths are recorded on each pages.jsonl row (or screenshot_error on failure). Auto-enables --render-js.

markcrawl --base https://dashboard.example.com/ --out ./out \
  --screenshot --max-pages 20 --show-progress

Flags: --screenshot-viewport WxH, --screenshot-selector CSS (crop to element), --screenshot-format {png,jpeg}, --screenshot-wait-ms N, --no-screenshot-full-page. Failures are recorded rather than aborting extraction.
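To illustrate how the recorded failures might be consumed downstream, here is a minimal Python sketch that splits pages.jsonl rows into captured and failed pages. The `screenshot_error` key is taken from the notes above; the `screenshot` and `url` key names and the sample rows are assumptions for illustration, since the notes only say that paths are recorded on each row.

```python
import json
from typing import Iterable


def split_screenshot_rows(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Partition pages.jsonl rows into (captured, failed)."""
    captured, failed = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("screenshot_error"):
            failed.append(row)
        elif row.get("screenshot"):  # assumed key name for the image path
            captured.append(row)
    return captured, failed


# Hypothetical rows; a real run would iterate over open("out/pages.jsonl").
sample = [
    '{"url": "https://dashboard.example.com/", "screenshot": "screenshots/index.png"}',
    '{"url": "https://dashboard.example.com/slow", "screenshot_error": "timeout"}',
]
captured, failed = split_screenshot_rows(sample)
print(len(captured), "captured;", len(failed), "failed")
```

Because a failed capture still produces a row, a script like this can retry only the failed URLs, for example with a longer timeout.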

Wait strategy is wait_until="load" plus a configurable post-load pause — not networkidle, which real sites rarely reach due to analytics pings.
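The wait strategy described above can be sketched with Playwright's Python sync API. This is an illustration under stated assumptions, not MarkCrawl's actual implementation: `capture_page`, its parameters, and the `screenshot` key are hypothetical names, and `page` is expected to be a Playwright `Page` object.

```python
from typing import Any, Optional


def capture_page(
    page: Any,
    url: str,
    out_path: str,
    wait_ms: int = 1000,
    selector: Optional[str] = None,
    full_page: bool = True,
    timeout_ms: int = 30_000,
) -> dict:
    """Navigate, pause, then screenshot; record errors instead of raising."""
    row = {"url": url}
    try:
        # Wait for the load event only. "networkidle" may never fire on
        # pages that keep sending analytics pings.
        page.goto(url, wait_until="load", timeout=timeout_ms)
        # Fixed post-load pause so late-rendering content can settle.
        page.wait_for_timeout(wait_ms)
        if selector:
            # Crop the capture to a single element.
            page.locator(selector).screenshot(path=out_path)
        else:
            page.screenshot(path=out_path, full_page=full_page)
        row["screenshot"] = out_path  # assumed key name
    except Exception as exc:
        # A failed capture is recorded on the row, not treated as fatal.
        row["screenshot_error"] = f"{type(exc).__name__}: {exc}"
    return row
```

With a real browser, `page` would come from `browser.new_page()` inside a `sync_playwright()` context. The fixed pause trades a little speed for predictability: the capture always proceeds after `wait_ms`, whether or not background requests are still in flight.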

`markcrawl discover` — seed pack subcommand

Emits URLs to stdout, pipeable into the main crawler:

markcrawl discover --pack game-dashboards | \
  markcrawl --seed-file - --out ./out \
    --screenshot --max-pages-per-site 5

Ships a curated game-dashboards pack (15 URLs: trackers, aggregators, esports wikis, patch notes). --list-packs enumerates available packs. --provider flag reserved for future search-API integration (currently returns a "not yet implemented" message).

Multi-site crawling via `--seed-file`

New flag on the main CLI that reads URLs from a file (or from stdin with -). Each seed runs as its own crawl() into a per-netloc subdirectory under --out. Seeds sharing a netloc (e.g. multiple Liquipedia sub-wikis) are disambiguated with a path slug so each gets its own output directory. --max-pages-per-site caps the number of pages crawled per site.
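One way the per-netloc layout with path-slug disambiguation could work is sketched below in Python. The directory naming scheme, the `output_dirs` helper, and the seed URLs are assumptions for illustration; the notes do not specify the exact format MarkCrawl uses.

```python
import re
from collections import Counter
from pathlib import Path
from urllib.parse import urlsplit


def output_dirs(seeds: list[str], out_root: str) -> dict[str, Path]:
    """Map each seed URL to its own directory under out_root."""
    netloc_counts = Counter(urlsplit(seed).netloc for seed in seeds)
    dirs = {}
    for seed in seeds:
        parts = urlsplit(seed)
        name = parts.netloc
        if netloc_counts[parts.netloc] > 1:
            # Several seeds share this host: append a slug built from the path.
            slug = re.sub(r"[^A-Za-z0-9]+", "-", parts.path).strip("-")
            if slug:
                name = f"{name}_{slug}"
        dirs[seed] = Path(out_root) / name
    return dirs


# Illustrative seeds: two sub-wikis on one host plus one unrelated site.
seeds = [
    "https://liquipedia.net/dota2/",
    "https://liquipedia.net/counterstrike/",
    "https://dashboard.example.com/",
]
for seed, path in output_dirs(seeds, "out").items():
    print(seed, "->", path)
```

In this sketch, a host that appears only once keeps the bare netloc as its directory name, so single-site output paths stay short. Seeds with an identical host and path would still collide and would need an extra suffix.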

New recipe

docs/recipes/game-dashboards.md covers end-to-end capture including a companion yt-dlp + ffmpeg frame-extraction path for gamer YouTube content (kept deliberately separate since video is out of MarkCrawl's scope).

Tested

Full pipeline verified live against the bundled pack: 13/15 sites render correctly, 2 are Cloudflare-blocked at the robots.txt layer (annotated in the pack). All 5 --screenshot-* tweak flags exercised against real sites. YouTube recipe commands run end-to-end. 269 tests, ruff clean.

Known limitations

Cloudflare-aggressive sites (wowprogress, hltv) return 403 on robots.txt from non-residential IPs. Not fixable from the crawler side.
Default --timeout 15s can be tight when combined with --screenshot on slower sites. Consider --timeout 30 for screenshot runs.

Install

pip install 'markcrawl[js]==0.8.0'
playwright install chromium

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
