AIMLPM/markcrawl

v0.4.1 Feature

This release adds two notable features, image alt text preservation and a result.pages list on CrawlResult, along with new benchmark documentation.

Published 1mo RAG & Retrieval
✓ No known CVEs patched in this version

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm
markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

Images now preserve alt text as inline references and CrawlResult includes a pages list for direct programmatic access.

Full changelog

What's new

Image alt text preservation

Images are no longer silently stripped. Alt text and figcaptions are extracted as [Image: description] inline references, preserving context from diagrams, architecture charts, and annotated screenshots. Figcaptions take priority over alt text when both are present.
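The priority rule above can be illustrated with a minimal sketch. This is not markcrawl's implementation; it only assumes the behaviour described in this changelog (a figcaption wins over alt text, and the result is emitted as an [Image: description] reference). The names ImageRefExtractor and image_refs are hypothetical and use only the Python standard library.

```python
from html.parser import HTMLParser


class ImageRefExtractor(HTMLParser):
    """Hypothetical helper: collect [Image: ...] references from HTML.

    Inside a <figure>, the <figcaption> text overrides the <img> alt text.
    """

    def __init__(self):
        super().__init__()
        self.refs = []          # finished "[Image: ...]" strings, in document order
        self._in_figure = False
        self._in_caption = False
        self._alt = ""          # alt text of the image in the current <figure>
        self._caption = []      # text fragments of the current <figcaption>

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self._in_figure, self._alt, self._caption = True, "", []
        elif tag == "figcaption" and self._in_figure:
            self._in_caption = True
        elif tag == "img":
            alt = (dict(attrs).get("alt") or "").strip()
            if self._in_figure:
                self._alt = alt  # defer the decision: a figcaption may follow
            elif alt:
                self.refs.append(f"[Image: {alt}]")

    def handle_data(self, data):
        if self._in_caption:
            self._caption.append(data)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self._in_caption = False
        elif tag == "figure" and self._in_figure:
            caption = " ".join("".join(self._caption).split())
            description = caption or self._alt  # figcaption takes priority
            if description:
                self.refs.append(f"[Image: {description}]")
            self._in_figure = False


def image_refs(html):
    """Return the [Image: ...] references found in an HTML string."""
    parser = ImageRefExtractor()
    parser.feed(html)
    parser.close()
    return parser.refs
```

Images with neither a caption nor alt text produce no reference in this sketch; how markcrawl treats that case is not stated in the release notes.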

Python API: result.pages

CrawlResult now includes a pages list of PageData objects for direct programmatic access:

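import markcrawl

# Crawl the site; out_dir is where the crawl output is written
result = markcrawl.crawl("https://example.com", out_dir="./output")
# Each entry in result.pages is a PageData object
for page in result.pages:
    print(page.url, page.title)
    # Split the page's Markdown content into chunks
    chunks = markcrawl.chunk_markdown(page.content)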

No more parsing JSONL files to use crawl results in code.
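As a further illustration of working with pages in memory, the sketch below gathers the [Image: description] references per page. It assumes only the url, title, and content attributes shown in the example above. Page is a hypothetical stand-in so the snippet runs without a crawl, and image_descriptions_by_url is not part of markcrawl.

```python
import re
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for PageData with the attributes used above."""
    url: str
    title: str
    content: str


# Matches the inline reference format described in this release.
IMAGE_REF = re.compile(r"\[Image: ([^\]]+)\]")


def image_descriptions_by_url(pages):
    """Map each page URL to the image descriptions found in its Markdown."""
    return {page.url: IMAGE_REF.findall(page.content) for page in pages}


pages = [
    Page("https://example.com/a", "A", "Intro\n\n[Image: System overview]\n\nText"),
    Page("https://example.com/b", "B", "No images here"),
]
by_url = image_descriptions_by_url(pages)
```

In real use, the same function would presumably accept result.pages directly, provided the objects expose url and content as in the example above.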

Benchmark documentation

New docs/BENCHMARKS.md with self-contained speed, quality, and cost comparisons across 7 tools. Full methodology at llm-crawler-benchmarks.

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
