AIMLPM/markcrawl

v0.10.1 Breaking

This release includes 2 breaking changes that platform teams should review before upgrading.

Published 1mo RAG & Retrieval
✓ No known CVEs patched

Topics

ai-agents anthropic-claude data-extraction gemini ingestion-pipeline llm markdown-extraction openai pgvector python sitemap-crawler structured-data supabase vector-db webcrawler

Summary

AI summary

The default embedder switches to the local mixedbread-ai/mxbai-embed-large-v1 model, reducing embedding API cost to zero.

Full changelog

tl;dr

pip install markcrawl now ships a complete crawl-and-embed stack with zero API cost. The default embedder flips from OpenAI text-embedding-3-small to the bake-off-winning local mixedbread-ai/mxbai-embed-large-v1. Combined with the v0.10.0 chunker work, v0.10.1 closes the leaderboard story:

| Metric (vs v0.9.9-rc1) | v0.10.1 default | Δ |
|---|---:|---:|
| Mean MRR (11-site local pool) | 0.3859 | +0.040 (+11.5%) |
| Cost at 50M pages | $0 | −$10,152/yr |
| Chunks per page | 10.49 | −48% (smaller index) |

Multi-trial validation of the chunker change: +14% MRR on all-MiniLM-L6-v2 (6 trials, all positive) and +15% on OpenAI text-embedding-3-small (3 trials, all positive). The mxbai swap is MRR-neutral (Δ −0.018, within the ±0.020 SC-B2 noise band) and costs $0/yr at scale.
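
As a quick sanity check on the table, the implied v0.9.9-rc1 baselines can be backed out from the reported values. This is only a sketch: it assumes the Δ column is measured against v0.9.9-rc1 (as the header states) and that the −48% figure refers to chunks per page. The variable names are illustrative.

```python
# Values copied from the metrics table above.
mrr_now = 0.3859          # mean MRR with the v0.10.1 default
mrr_delta = 0.040         # absolute gain vs v0.9.9-rc1
chunks_now = 10.49        # chunks per page in v0.10.1
chunks_reduction = 0.48   # "-48%" from the table, as a fraction

# Implied baseline MRR and the relative gain it corresponds to.
mrr_before = mrr_now - mrr_delta
mrr_gain_pct = 100 * mrr_delta / mrr_before

# Implied chunks per page before the chunker change.
chunks_before = chunks_now / (1 - chunks_reduction)

print(f"baseline MRR      ~ {mrr_before:.4f}")
print(f"relative MRR gain ~ {mrr_gain_pct:.1f}%")
print(f"baseline chunks   ~ {chunks_before:.1f} per page")
```

The relative gain comes out near the reported +11.5%. Any small gap is consistent with the absolute Δ having been rounded to three decimals.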

What's new in 0.10.1

  • pip install markcrawl now bundles the ML stack (torch + transformers + sentence-transformers + sentencepiece). The chunker's chunk_semantic and the new default embedder work out of the box.
  • Default embedder = mixedbread-ai/mxbai-embed-large-v1 (local, zero API cost). Replaces the previous OpenAI 3-small default.
  • markcrawl[ml] kept as a no-op alias — existing install commands keep working.
  • Override paths: MARKCRAWL_EMBEDDER=text-embedding-3-small env var, or embedding_model="..." / embedder=... kwargs on upload(...).
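
The override paths above suggest a precedence between an explicit kwarg, the environment variable, and the built-in default. The release notes do not spell out that order, so the sketch below simply assumes "kwarg, then MARKCRAWL_EMBEDDER, then default". `resolve_embedder` is a hypothetical helper for illustration, not part of markcrawl's API.

```python
import os

# The v0.10.1 default, as stated in the notes above.
DEFAULT_EMBEDDER = "mixedbread-ai/mxbai-embed-large-v1"


def resolve_embedder(embedding_model=None, env=None):
    """Hypothetical helper: choose an embedder name by precedence.

    Assumed order (not confirmed by the release notes):
    explicit kwarg, then MARKCRAWL_EMBEDDER, then the default.
    """
    env = os.environ if env is None else env
    if embedding_model:
        return embedding_model
    return env.get("MARKCRAWL_EMBEDDER") or DEFAULT_EMBEDDER


# No override: falls back to the local mxbai model.
print(resolve_embedder(env={}))
# Environment override: pins the previous OpenAI default.
print(resolve_embedder(env={"MARKCRAWL_EMBEDDER": "text-embedding-3-small"}))
```

Passing `env` explicitly keeps the helper easy to test without mutating the real process environment.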

Lean install (no ML deps)

pip install --no-deps markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity
# Then either set OPENAI_API_KEY for the OpenAI fallback, or skip embedding entirely.
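
After a lean install it can be useful to confirm which parts of the ML stack are actually importable before calling anything that embeds. The check below is a standalone sketch using only the standard library. The module names are the usual import names for the four packages listed above, and `ml_stack_status` is an illustrative helper, not something markcrawl provides.

```python
from importlib.util import find_spec

# Import names for the bundled ML packages named in the release notes.
ML_MODULES = ("torch", "transformers", "sentence_transformers", "sentencepiece")


def ml_stack_status(modules=ML_MODULES):
    """Return {module_name: True/False} without importing the modules."""
    return {name: find_spec(name) is not None for name in modules}


status = ml_stack_status()
missing = [name for name, present in status.items() if not present]

if missing:
    print("Lean install detected; missing:", ", ".join(missing))
else:
    print("Full ML stack is importable.")
```

`find_spec` only locates a module, so the check stays fast even when torch is installed.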

Migration

Default kwargs to upload(...) now produce mxbai-embedded rows automatically — callers simply stop being charged for OpenAI. To stay on OpenAI explicitly:

from markcrawl.upload import upload
upload(jsonl_path=..., supabase_url=..., supabase_key=...,
       embedding_model="text-embedding-3-small")

Or set MARKCRAWL_EMBEDDER=text-embedding-3-small in your environment.
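
One point worth checking during migration: the two models produce vectors of different sizes (1536 dimensions by default for text-embedding-3-small, 1024 for mxbai-embed-large-v1), and vectors from different models are not comparable even when sizes match. The release notes do not say how markcrawl handles an existing index, so the guard below is only a sketch. `index_action` and the model-to-dimension table are illustrative, not part of markcrawl.

```python
# Default output sizes for the two embedders discussed in this release.
EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "mixedbread-ai/mxbai-embed-large-v1": 1024,
}


def index_action(stored_model, active_model, dims=EMBEDDING_DIMS):
    """Hypothetical guard: decide what to do with an existing vector index."""
    if stored_model == active_model:
        return "reuse"
    if dims.get(stored_model) != dims.get(active_model):
        # A fixed-width vector column would not accept the new rows.
        return "re-embed (dimension mismatch)"
    # Same width, different model: similarity scores would be meaningless.
    return "re-embed (different model)"


# An index built under the old default, queried under the new one.
print(index_action("text-embedding-3-small",
                   "mixedbread-ai/mxbai-embed-large-v1"))
```

If an existing index was built with OpenAI embeddings, pinning `embedding_model="text-embedding-3-small"` as shown above avoids mixing the two.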

Breaking Changes

  • Default embedder changed from `OpenAI text-embedding-3-small` to local `mixedbread-ai/mxbai-embed-large-v1`
  • Minimum Python version bumped to support bundled torch, transformers, sentence-transformers, and sentencepiece

About AIMLPM/markcrawl

Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.
