This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
Full-page screenshots can now be captured per crawled page via the new --screenshot flag.
Full changelog
New
--screenshot — full-page captures during crawl
Full-page PNG/JPEG screenshots of every crawled page via Playwright. Paths are recorded on each pages.jsonl row (or screenshot_error on failure). Auto-enables --render-js.
markcrawl --base https://dashboard.example.com/ --out ./out \
--screenshot --max-pages 20 --show-progress
Flags: --screenshot-viewport WxH, --screenshot-selector CSS (crop to element), --screenshot-format {png,jpeg}, --screenshot-wait-ms N, --no-screenshot-full-page. Failures are recorded rather than aborting extraction.
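Because failures are recorded per row instead of aborting the crawl, a post-run check over pages.jsonl can separate captured pages from failed ones. The following is a minimal sketch, not part of MarkCrawl itself. It assumes the screenshot path is stored under a key named "screenshot" (a hypothetical name; the notes above only name screenshot_error explicitly) and that pages.jsonl holds one JSON object per line.

```python
import json
from pathlib import Path

# Assumption: the path key is called "screenshot". Only "screenshot_error"
# is named in the release notes, so adjust PATH_KEY to the real field name.
PATH_KEY = "screenshot"
ERROR_KEY = "screenshot_error"


def split_screenshot_rows(jsonl_path):
    """Return (captured, failed) lists of rows read from a pages.jsonl file."""
    captured, failed = [], []
    with Path(jsonl_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            row = json.loads(line)
            if row.get(ERROR_KEY):
                failed.append(row)  # capture failed; extraction still ran
            elif row.get(PATH_KEY):
                captured.append(row)  # screenshot path was recorded
    return captured, failed
```

Calling split_screenshot_rows("./out/pages.jsonl") after a run would give two lists that can be counted or retried; the exact location of pages.jsonl under --out is also an assumption here.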
Wait strategy is wait_until="load" plus a configurable post-load pause — not networkidle, which real sites rarely reach due to analytics pings.
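The wait strategy described above can be sketched with Playwright's sync API as follows. This illustrates the general pattern (load event, fixed pause, then capture) and is not MarkCrawl's actual implementation. The function name and the 1000 ms default are hypothetical.

```python
def capture_after_load(page, url, out_path, wait_ms=1000, full_page=True):
    """Navigate, wait for the load event, pause, then take a screenshot.

    `page` is expected to be a Playwright sync-API Page object. The 1000 ms
    default is a placeholder, not MarkCrawl's real default.
    """
    # Wait for the "load" event rather than "networkidle", which pages with
    # periodic analytics requests may never reach.
    page.goto(url, wait_until="load")
    # Fixed post-load pause so late-rendering content can settle.
    if wait_ms > 0:
        page.wait_for_timeout(wait_ms)
    page.screenshot(path=out_path, full_page=full_page)
    return out_path
```

With Playwright installed, `page` would typically come from `sync_playwright()`, `chromium.launch()` and `browser.new_page()`. Keeping the pause as a parameter mirrors the role --screenshot-wait-ms plays on the command line.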
markcrawl discover — seed pack subcommand
Emits URLs to stdout, pipeable into the main crawler:
markcrawl discover --pack game-dashboards | \
markcrawl --seed-file - --out ./out \
--screenshot --max-pages-per-site 5
Ships a curated game-dashboards pack (15 URLs: trackers, aggregators, esports wikis, patch notes). --list-packs enumerates available packs. The --provider flag is reserved for future search-API integration and currently returns a "not yet implemented" message.
Multi-site crawling via --seed-file
New flag on the main CLI that reads URLs from a file (or from stdin with -). Each seed runs as its own crawl() into a per-netloc subdirectory under --out. Seeds sharing a netloc (e.g. multiple Liquipedia sub-wikis) are disambiguated with a path slug so each gets its own output directory. --max-pages-per-site caps the page count per site.
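To make the per-netloc layout concrete, here is one way such a mapping could be computed. This is a sketch under assumptions, not MarkCrawl's code: the directory naming scheme (netloc, underscore, path slug), the slug rules, and the function names are all hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


def _slug(path):
    """Turn a URL path into a filesystem-friendly slug (hypothetical rules)."""
    return re.sub(r"[^A-Za-z0-9]+", "-", path).strip("-").lower()


def plan_output_dirs(seeds):
    """Map each seed URL to a subdirectory name under the output root."""
    netloc_counts = Counter(urlsplit(s).netloc for s in seeds)
    plan = {}
    for seed in seeds:
        parts = urlsplit(seed)
        name = parts.netloc
        # Only seeds that share a netloc need the extra path slug.
        if netloc_counts[parts.netloc] > 1:
            slug = _slug(parts.path)
            if slug:
                name = f"{name}_{slug}"
        plan[seed] = name
    return plan
```

For example, two illustrative seeds on the same host with paths /dota2/ and /valorant/ would map to separate directories, while a seed on a host that appears only once keeps the bare netloc. Identical duplicate seeds would still collide in this sketch.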
New recipe
docs/recipes/game-dashboards.md covers end-to-end capture including a companion yt-dlp + ffmpeg frame-extraction path for gamer YouTube content (kept deliberately separate since video is out of MarkCrawl's scope).
Tested
Full pipeline verified live against the bundled pack: 13/15 sites render correctly; the other 2 are Cloudflare-blocked at the robots.txt layer (annotated in the pack). All 5 --screenshot-* tweak flags exercised against real sites. YouTube recipe commands run end-to-end. 269 tests, ruff clean.
Known limitations
- Cloudflare-aggressive sites (wowprogress, hltv) return 403 on robots.txt from non-residential IPs. Not fixable from the crawler side.
- Default --timeout 15s can be tight when combined with --screenshot on slower sites. Consider --timeout 30 for screenshot runs.
Install
pip install 'markcrawl[js]==0.8.0'
playwright install chromium
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
Beta — feedback welcome: [email protected]