This release includes 2 breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summaryEmbedding architecture overhauled to replace legacy models with state-of-the-art 2026 alternatives, adding MRL support and major performance fixes.
Full changelog
This major release completely overhauls the embedding architecture to replace older baseline models with state-of-the-art 2026 architectures, massively improving retrieval quality, context size, and system performance.
✨ New Models & Efficiency Gains
We have transitioned from the legacy e5_multi to highly optimized, modern alternatives:
Qwen3-Embedding-0.6B(New Default): A top-tier open-source model featuring a massive 32,768 token context window (vs 512 previously). Despite its larger vocabulary and better semantic precision, it maintains excellent inference speeds.embeddinggemma-300m-ONNX: A new ultra-lightweight (~195MB) alternative designed specifically for low-RAM and edge deployments. Extremely fast while retaining strong multilingual capabilities.
📉 Matryoshka Representation Learning (MRL)
We have introduced native support for MRL, allowing users to dynamically truncate embedding vectors via the --mrl-dim flag (e.g., from 1024 down to 512, 256, or 128).
- Efficiency Benchmark: Truncating Qwen3 from 1024 to 512 dimensions reduces database storage (SurrealDB) and vector search latency by ~50%, while retaining >98% of the original retrieval accuracy on MTEB benchmarks.
⚡ Architectural Performance Fixes
- Zero-Block Async Inference: Heavy tensor operations have been offloaded to Tokio's blocking threads (
block_in_place), preventing executor starvation. Concurrent JSON-RPC Requests Per Second (RPS) have increased by up to 300% under heavy load. - Qwen3 Tensor Math Fix: Corrected last-token pooling logic for unpadded sequences, eliminating
[PAD]token pollution and restoring exact mathematical accuracy for decoder-only models. - SurrealDB v3.0.0 Alignment: Database index dimensions now perfectly align with post-MRL truncated outputs.
- L2 Normalization Safety: Added robust protection against
NaN/Infcorruption on zero-vectors.
📊 Benchmark Comparison
| Metric | Qwen3-0.6B (New Default) | E5-Multi-Base (Old Default) | Gemma-300m (Edge) |
|--------|--------------------------|-----------------------------|-------------------|
| VRAM / RAM | ~1.2 GB | ~1.1 GB | ~195 MB |
| Context Size | 32,768 tokens | 512 tokens | 8,192 tokens |
| MRL Support | Yes (e.g., 512, 256) | No | Yes |
| RPS (Concurrency) | Non-blocking (High) | Baseline (Blocking) | Fastest |
Breaking Changes
- Legacy `e5_multi` model removed; default now is `Qwen3-Embedding-0.6B` with 32,768 token context window.
- Embedding API behavior changes: new models require updated configuration and handle MRL truncation via `--mrl-dim` flag.
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About pomazanbohdan/memory-mcp-1file
A self-contained Memory server with single-binary architecture (embedded DB & models, no dependencies). Provides persistent semantic and graph-based memory for AI agents.
Related context
Beta — feedback welcome: [email protected]