claude-flow

v3.10.25 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

agentic-ai agentic-framework agentic-rag agentic-workflow agents ai-agents
ai-assistant ai-coding ai-skills autonomous-agents claude-code codex mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

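Adds a lazy-loaded BGE embedder and a direct-dense BEIR benchmark runner, reports a reproducible NFCorpus result (nDCG@10 = 0.352), and works around silent embedding degradation on darwin-arm64.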

Changes in this release

Feature Low

Added `src/memory/bge-embedder.ts` supporting lazy-loaded BGE models (small, base, large).

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Added `scripts/run-beir-bge.mjs`, a direct-dense BEIR benchmark runner with an on-disk embedding cache.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Added `docs/benchmarks/BEIR-MATRIX.md` public benchmark tracking page.

Source: llm_adapter@2026-05-30

Confidence: high

Performance Medium

Achieved nDCG@10 = 0.352 on BEIR NFCorpus using BGE‑base, ranking top‑2 among listed baselines.

Source: llm_adapter@2026-05-30

Confidence: high

Bugfix Medium

Fixed silent degradation of the embedding path on darwin-arm64 caused by a `sharp`/`libvips` issue.

Source: llm_adapter@2026-05-30

Confidence: high

Full changelog

ruflo 3.10.25 — reproducible BEIR NFCorpus benchmark, nDCG@10 0.352, top-2 against listed public baselines

We now have a reproducible BEIR benchmark harness, run JSONs, per-query metrics
(in 3.10.26), and a clean direct BGE dense path.

First public result: BEIR NFCorpus

nDCG@10 = 0.352 using BGE-base-en-v1.5 (110M params) via the direct
dense path (no fine-tuning, no hybrid BM25+dense fusion, no cross-encoder
reranker). The internal hybrid pipeline is kept out of this comparison so the
dense-vs-dense numbers stay honest.
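
As a rough illustration of what a "direct dense path" amounts to, the sketch below ranks documents by dot product against a query vector, with no lexical fusion and no reranking step. It assumes all vectors are already unit-length (so dot product equals cosine similarity) and uses a brute-force scan; the `Doc` shape and the `topK` name are hypothetical and not taken from the ruflo source.

```typescript
// Hypothetical document record: an id plus a precomputed, unit-length embedding.
interface Doc {
  id: string;
  vector: number[];
}

// Dot product of two equal-length vectors.
function dot(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    sum += (a[i] ?? 0) * (b[i] ?? 0);
  }
  return sum;
}

// Brute-force top-k: score every document, sort by score descending, keep k.
function topK(query: number[], docs: Doc[], k = 10): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: dot(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

The ranked ids returned by a function like this are what a metric such as nDCG@10 is computed over.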

| Rank | Method | Params | nDCG@10 |
|---:|---|---:|---:|
| 1 | BGE-large-v1.5 (listed) | 335M | 0.380 |
| 2 | ruflo + BGE-base-en-v1.5 ← us | 110M | 0.352 |
| 3 | SPLADE++ | 110M | 0.347 |
| 4 | GTR-XL | 1.2B | 0.343 |
| 5-6 | DocT5query / Contriever (tied) | — | 0.328 |
| 7 | BM25 (Lucene) | — | 0.325 |
| 8-9 | TAS-B / GenQ (tied) | — | 0.319 |
| 10 | ColBERT | 110M | 0.305 |
| 11 | SBERT msmarco | 110M | 0.272 |

This is top-2 on BEIR NFCorpus, NOT "top-2 on BEIR." BEIR is an 18-dataset
suite; NFCorpus is one dataset. The broader BEIR average requires TREC-COVID,
FiQA, ArguAna, HotpotQA, NQ, etc. SciFact (2nd dataset) is queued.
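
For readers unfamiliar with the metric, here is a minimal sketch of how nDCG@10 can be computed per query and averaged over a dataset. It assumes the linear-gain form (gain divided by log2 of rank + 1) and BEIR-style graded relevance labels; the function names are illustrative and this is not the project's evaluation code.

```typescript
// Graded relevance labels for one query, keyed by document id.
type Qrels = Record<string, number>;

// DCG over the first k gains: gain / log2(rank + 1), with rank starting at 1.
function dcgAtK(gains: number[], k: number): number {
  let dcg = 0;
  for (let i = 0; i < Math.min(k, gains.length); i++) {
    dcg += (gains[i] ?? 0) / Math.log2(i + 2);
  }
  return dcg;
}

// nDCG@k for one query: DCG of the system ranking divided by the ideal DCG.
// Documents without a label count as gain 0.
function ndcgAtK(rankedDocIds: string[], qrels: Qrels, k = 10): number {
  const gains = rankedDocIds.map((id) => qrels[id] ?? 0);
  const ideal = Object.values(qrels).sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg;
}

// Dataset-level score: mean nDCG@k over all queries.
function meanNdcg(runs: { ranked: string[]; qrels: Qrels }[], k = 10): number {
  if (runs.length === 0) return 0;
  const total = runs.reduce((sum, r) => sum + ndcgAtK(r.ranked, r.qrels, k), 0);
  return total / runs.length;
}
```

A perfect ranking scores 1.0 and a ranking with no judged-relevant documents in the top k scores 0, so a dataset-level figure such as 0.352 is an average over many queries of very different difficulty.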

The more important part — the audit trail

We found and fixed a real environment bug where the embedding path could
silently degrade into hash fallback because of a sharp/libvips issue on
darwin-arm64. The neural store reported _realEmbedding: true because the
import succeeded — but per-call embeds threw and got swallowed by an inner
catch. The pure-BM25 path (with broken random cosine) was carrying the entire
"hybrid" signal undetected.

The new path bypasses that dependency by loading BGE directly through
@xenova/transformers's AutoTokenizer + AutoModel. Text bi-encoders
don't need image preprocessing; sharp is a transitive dep that's never
needed for retrieval.
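
The failure mode described above can be reproduced in a few lines. The sketch below is a simplified, synchronous stand-in (real embedding calls would be async) with hypothetical names throughout; it is not the code in neural-tools.ts. It contrasts a store whose flag is set once at load time with one possible guard that derives the flag from what actually served each call.

```typescript
type Embed = (text: string) => number[];

// Deterministic stand-in for a hash fallback: stable numbers, no semantics.
function hashEmbed(text: string, dim = 8): number[] {
  const v = new Array<number>(dim).fill(0);
  for (let i = 0; i < text.length; i++) {
    const j = i % dim;
    v[j] = (v[j] ?? 0) + text.charCodeAt(i) / 255;
  }
  return v;
}

// The bug pattern: the flag reflects a successful import, not successful calls.
function makeSilentStore(neural: Embed) {
  return {
    realEmbedding: true, // set once because loading succeeded
    embed(text: string): number[] {
      try {
        return neural(text);
      } catch {
        return hashEmbed(text); // error swallowed, caller never finds out
      }
    },
  };
}

// One possible guard: count which path served each call and report honestly.
function makeAuditedStore(neural: Embed) {
  const stats = { neuralCalls: 0, fallbackCalls: 0, lastError: "" };
  return {
    stats,
    get realEmbedding(): boolean {
      return stats.neuralCalls > 0 && stats.fallbackCalls === 0;
    },
    embed(text: string): number[] {
      try {
        const v = neural(text);
        stats.neuralCalls++;
        return v;
      } catch (err) {
        stats.fallbackCalls++;
        stats.lastError = err instanceof Error ? err.message : String(err);
        return hashEmbed(text);
      }
    },
  };
}
```

A benchmark harness could refuse to report a score whenever `stats.fallbackCalls` is non-zero, which would have surfaced the degradation instead of folding it into the result.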

What changed in code

  1. src/memory/bge-embedder.ts — lazy-loaded singleton, supports
    bge-small (33M, 384-dim), bge-base (110M, 768-dim, default),
    bge-large (335M, 1024-dim). CLS-token pooling + L2 normalisation
    per BAAI spec.
  2. scripts/run-beir-nfcorpus.mjs — hybrid-pipeline harness; with the
    embedder broken this collapses to pure-BM25 (measured 0.289 vs published
    BM25 0.325).
  3. scripts/run-beir-bge.mjs — direct-dense BEIR runner, on-disk
    embedding cache, dataset auto-detect.
  4. docs/benchmarks/BEIR-MATRIX.md — public benchmark tracking page
    (added in 3.10.26).
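
The embedder behaviour listed in item 1 can be sketched without any model runtime. The example below operates on plain arrays of per-token hidden states and uses a synchronous placeholder where the real module would load a model; `getEmbedder`, `Embedder`, and `loadCount` are hypothetical names, and only the dimensions come from the list above.

```typescript
type BgeSize = "small" | "base" | "large";

// Embedding widths as listed in the release notes.
const BGE_DIM: Record<BgeSize, number> = { small: 384, base: 768, large: 1024 };

// CLS pooling: use the hidden state of the first token as the sentence vector.
function clsPool(tokenStates: number[][]): number[] {
  const cls = tokenStates[0];
  if (cls === undefined) throw new Error("empty sequence");
  return cls.slice();
}

// L2 normalisation: scale to unit length so dot product equals cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
  return norm === 0 ? v.slice() : v.map((x) => x / norm);
}

interface Embedder {
  size: BgeSize;
  dim: number;
  pool(tokenStates: number[][]): number[];
}

// Lazy singleton: the expensive setup runs at most once per model size.
const loaded = new Map<BgeSize, Embedder>();
let loadCount = 0;

function getEmbedder(size: BgeSize = "base"): Embedder {
  let embedder = loaded.get(size);
  if (embedder === undefined) {
    loadCount++; // placeholder for the real model load
    const dim = BGE_DIM[size];
    embedder = {
      size,
      dim,
      pool(tokenStates: number[][]): number[] {
        const v = clsPool(tokenStates);
        if (v.length !== dim) throw new Error(`expected ${dim} dims, got ${v.length}`);
        return l2Normalize(v);
      },
    };
    loaded.set(size, embedder);
  }
  return embedder;
}
```

Checking the vector width against the expected dimension is a cheap way to catch a wrong model or a fallback vector being passed through the same code path.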

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nfcorpus.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip'
unzip -q nfcorpus.zip

# BGE-base direct dense (one-time ~25min ingest + ~2min full eval)
node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs
# → nDCG@10 0.352, rank 2/11 against listed baselines

# Cached subsequent runs (~2 min)
SKIP_INGEST=1 node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs

Honest limits

  • One BEIR dataset measured. SciFact in progress; broader BEIR average
    tracked.
  • Zero-shot, no fine-tuning. NFCorpus has a 110K-pair train split that
    could be used for fine-tuning, for an estimated additional ~0.02-0.05 nDCG.
  • The 0.005 gap to SPLADE++ is small. Paired bootstrap CI shipping
    in 3.10.26 will determine if it's statistically significant.
  • The _realEmbedding: true lie in neural-tools.ts is bypassed, not
    fixed. The BGE direct-API path is the workaround; the underlying flag
    bug is tracked.
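
To make the significance caveat concrete, here is a minimal sketch of a paired bootstrap over per-query nDCG scores from two systems. It is not the implementation planned for 3.10.26; the percentile method, the resample count, the seed, and all names are assumptions chosen for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is repeatable.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for the mean per-query difference (A minus B).
// Queries are resampled with replacement; pairing is kept by resampling differences.
function pairedBootstrapCI(
  a: number[],
  b: number[],
  resamples = 2000,
  alpha = 0.05,
  seed = 42,
): { mean: number; lower: number; upper: number; excludesZero: boolean } {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("need equal-length, non-empty score lists");
  }
  const diffs = a.map((x, i) => x - (b[i] ?? 0));
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += diffs[Math.floor(rand() * n)] ?? 0;
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lower = means[Math.floor((alpha / 2) * resamples)] ?? NaN;
  const upperIdx = Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples));
  const upper = means[upperIdx] ?? NaN;
  const mean = diffs.reduce((s, d) => s + d, 0) / n;
  return { mean, lower, upper, excludesZero: lower > 0 || upper < 0 };
}
```

If the resulting interval for the 0.352 vs 0.347 comparison includes zero, the rank-2 vs rank-3 ordering should be read as a tie. Running this requires per-query scores for both systems, not just the published averages.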

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-085-beir-public-benchmark.md

About claude-flow

Deploy multi-agent swarms with coordinated workflows.
