This release includes 2 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: The default embedder switches to the local mixedbread-ai/mxbai-embed-large-v1, reducing API cost to zero.
Full changelog
tl;dr
pip install markcrawl now ships a complete crawl-and-embed stack with zero API cost. The default embedder flips from OpenAI 3-small to the bake-off-winning local mixedbread-ai/mxbai-embed-large-v1. Combined with the v0.10.0 chunker work, v0.10.1 closes the leaderboard story:
| Metric (vs v0.9.9-rc1) | v0.10.1 default | Δ |
|---|---:|---:|
| Mean MRR (11-site local pool) | 0.3859 | +0.040 (+11.5%) |
| Cost at 50M pages | $0 | −$10,152/yr |
| Chunks per page | 10.49 | −48% (smaller index) |
The chunker change is validated across multiple trials: +14% MRR on all-MiniLM-L6-v2 (6 trials, all positive) and +15% on OpenAI 3-small (3 trials, all positive). The mxbai swap is MRR-neutral (Δ −0.018, within the ±0.020 SC-B2 noise band) and costs $0/yr at scale.
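For readers unfamiliar with the metric, here is a minimal sketch of how a mean reciprocal rank and a noise-band check could be computed. The function names, the convention of scoring a miss as 0, and the inclusive band comparison are assumptions for illustration; the benchmark's actual scoring code is not shown in these notes.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over queries.

    `ranks` holds the 1-based rank of the first relevant result per query,
    or None when no relevant result was retrieved (assumed to score 0).
    """
    if not ranks:
        raise ValueError("need at least one query")
    return sum(1.0 / r if r else 0.0 for r in ranks) / len(ranks)


def is_neutral(delta: float, noise_band: float) -> bool:
    """True when a metric change falls inside the +/- noise band."""
    return abs(delta) <= noise_band


# Three queries: hit at rank 1, hit at rank 2, and a miss.
print(mean_reciprocal_rank([1, 2, None]))  # 0.5
# The mxbai swap: delta of -0.018 against a +/-0.020 band.
print(is_neutral(-0.018, 0.020))  # True
```

Under these assumptions, a change counts as neutral whenever its magnitude does not exceed the band, which is how the −0.018 figure above reads as "no measurable regression".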
What's new in 0.10.1
- `pip install markcrawl` now bundles the ML stack (torch + transformers + sentence-transformers + sentencepiece). The chunker's `chunk_semantic` and the new default embedder work out of the box.
- Default embedder = `mixedbread-ai/mxbai-embed-large-v1` (local, zero API cost). Replaces the previous OpenAI 3-small default.
- `markcrawl[ml]` kept as a no-op alias, so existing install commands keep working.
- Override paths: `MARKCRAWL_EMBEDDER=text-embedding-3-small` env var, or `embedding_model="..."` / `embedder=...` kwargs on `upload(...)`.
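The override paths can be pictured as a small resolution function. This is only a sketch: `resolve_embedder` is a hypothetical helper, not part of markcrawl, and the precedence shown (explicit kwarg, then environment variable, then the bundled default) is an assumption that the release notes do not confirm.

```python
import os
from typing import Mapping, Optional

# Default named in the release notes.
DEFAULT_EMBEDDER = "mixedbread-ai/mxbai-embed-large-v1"


def resolve_embedder(
    embedding_model: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Hypothetical helper: pick the embedder name to use.

    Assumed precedence: explicit kwarg > MARKCRAWL_EMBEDDER > default.
    """
    env = os.environ if env is None else env
    if embedding_model:
        return embedding_model
    return env.get("MARKCRAWL_EMBEDDER") or DEFAULT_EMBEDDER


# No overrides: falls back to the local default.
print(resolve_embedder(env={}))
# Environment override selects the OpenAI model.
print(resolve_embedder(env={"MARKCRAWL_EMBEDDER": "text-embedding-3-small"}))
```

Passing `env` explicitly keeps the sketch easy to test without mutating the real process environment.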
Lean install (no ML deps)
pip install --no-deps markcrawl beautifulsoup4 lxml markdownify requests certifi tenacity
# Then either set OPENAI_API_KEY for the OpenAI fallback, or skip embedding entirely.
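After a lean install it can be useful to confirm which of the ML packages are actually importable. The sketch below uses only the standard library; `ml_stack_status` is a hypothetical helper name, and the import names are the usual ones for the four packages listed above.

```python
from importlib.util import find_spec
from typing import Dict, Iterable

# Import names for torch, transformers, sentence-transformers, sentencepiece.
ML_MODULES = ("torch", "transformers", "sentence_transformers", "sentencepiece")


def ml_stack_status(modules: Iterable[str] = ML_MODULES) -> Dict[str, bool]:
    """Map each top-level module name to whether it can be found for import."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in ml_stack_status().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

If every entry reports missing, the local embedder cannot load in that environment, which is the case the lean install is meant for.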
Migration
Default kwargs to upload(...) now produce mxbai-embedded rows automatically — callers simply stop being charged for OpenAI. To stay on OpenAI explicitly:
from markcrawl.upload import upload
upload(jsonl_path=..., supabase_url=..., supabase_key=...,
       embedding_model="text-embedding-3-small")
Or set MARKCRAWL_EMBEDDER=text-embedding-3-small in your environment.
Reports
- `bench/local_replica/v010_release_report.md`: full v0.10 release report.
- `bench/local_replica/track_b_report.md`: embedder bake-off (4 of 5 candidates run on the canonical 11-site pool).
- `bench/local_replica/track_d_report.md`: chunker sweep (56 configs).
Breaking Changes
- Default embedder changed from `OpenAI text-embedding-3-small` to local `mixedbread-ai/mxbai-embed-large-v1`
- Minimum Python version bumped to support bundled torch, transformers, sentence-transformers, and sentencepiece
About AIMLPM/markcrawl
Crawl websites into clean Markdown, search pages, and extract structured data with LLMs. Built-in MCP server for web research and RAG pipelines.