
claude-flow

v3.10.26 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

agentic-ai agentic-framework agentic-rag agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

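A mixed release focused on benchmark infrastructure: it adds the BEIR-MATRIX documentation, a paired bootstrap significance script, and per-query metrics in every BEIR run JSON. It also documents the benchmark's limits and reports a non-significant (n.s.) gain over BM25 on NFCorpus alongside a significant loss on SciFact.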

Changes in this release

Feature Low

Adds BEIR-MATRIX benchmark grid documentation in docs/benchmarks/BEIR-MATRIX.md

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Adds paired bootstrap significance script scripts/beir-bootstrap-significance.mjs with 10K resamples and mulberry32 seed=42

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Adds perQuery metrics to every BEIR run JSON for external verification

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Updates install command to npx [email protected] for latest/alpha alignment

Source: llm_adapter@2026-05-30

Confidence: high

Dependency Low

Files upstream ruvector issues #523 (API contract bugs) and #524 (bundle BGE‑base/small)

Source: llm_adapter@2026-05-30

Confidence: high

Bugfix Medium

Fixes silent‑fallback bug described in ADR-086 and implements bootstrap method

Source: llm_adapter@2026-05-30

Confidence: high

Full changelog

What ships

This release acts on the user's "release hype → benchmark infrastructure" feedback. What ships is the infrastructure itself, not a headline rank.

Honest two-dataset picture

| Dataset | nDCG@10 | 95% CI | Rank | vs BM25 |
|---|---:|---|---:|---|
| NFCorpus | 0.352 | [0.317, 0.387] | 2/11 | +0.027 (n.s.) |
| SciFact | 0.626 | [0.577, 0.672] | 10/11 | -0.053 (p<0.05) — significant LOSS |

The user's acceptance test ("ruflo beats BM25 on both") fails on SciFact, where we lose to BM25 by 0.053 nDCG@10, a statistically significant gap. On NFCorpus, the only statistically significant wins are over SBERT msmarco and ColBERT; the gaps to SPLADE++, GTR-XL, and BM25 fall within CI overlap.
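
The nDCG@10 figures above are averages of per-query scores. For readers checking them externally, here is a minimal sketch of per-query nDCG@10, assuming linear gain and a log2 rank discount (a common convention). The function name `ndcgAtK` and the input shapes are hypothetical; this is not taken from `run-beir-bge.mjs`.

```javascript
// Hypothetical helper: nDCG@k for one query.
// rankedDocIds: doc ids in retrieved order; qrels: { docId: relevanceGrade }.
function ndcgAtK(rankedDocIds, qrels, k = 10) {
  // DCG over the first k gains, discounting rank i (0-based) by log2(i + 2).
  const dcg = (gains) =>
    gains.slice(0, k).reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);

  // Gains in retrieved order; unjudged documents count as 0.
  const gains = rankedDocIds.map((id) => qrels[id] ?? 0);

  // Ideal ordering: all judged-relevant grades, highest first.
  const ideal = Object.values(qrels)
    .filter((g) => g > 0)
    .sort((a, b) => b - a);

  const idcg = dcg(ideal);
  return idcg === 0 ? 0 : dcg(gains) / idcg;
}

// One relevant document retrieved at rank 2 scores 1 / log2(3).
console.log(ndcgAtK(["x", "d1"], { d1: 1 }));
```

Averaging this value over all queries of a dataset gives the kind of dataset-level score shown in the table, under the stated assumptions.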

Two-dataset mean: ours 0.489, BM25 0.502, SPLADE++ 0.526, BGE-large 0.551. That puts us below BM25 on the mean. BGE-base zero-shot is competent on NFCorpus (medical IR) but weak on SciFact (fact verification favours lexical retrieval). The NFCorpus rank-2 is real but not representative.
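
The two-dataset means can be checked by hand from the table. The sketch below does that arithmetic; the BM25 per-dataset scores are not stated in this release and are inferred here by subtracting the "vs BM25" deltas from our scores, so treat them as an assumption.

```javascript
// Values copied from the two-dataset table.
const ours = { nfcorpus: 0.352, scifact: 0.626 };
const deltaVsBm25 = { nfcorpus: 0.027, scifact: -0.053 };

// Assumption: BM25 score = our score minus the reported delta.
const bm25 = {
  nfcorpus: ours.nfcorpus - deltaVsBm25.nfcorpus,
  scifact: ours.scifact - deltaVsBm25.scifact,
};

const mean = (scores) => {
  const values = Object.values(scores);
  return values.reduce((sum, v) => sum + v, 0) / values.length;
};
const round3 = (x) => Math.round(x * 1000) / 1000;

console.log(round3(mean(ours))); // 0.489
console.log(round3(mean(bm25))); // 0.502
```

Both results match the means quoted above, which suggests the inferred BM25 scores (0.325 and 0.679) are consistent with the release's own numbers.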

What's in the box

  • docs/benchmarks/BEIR-MATRIX.md — dataset × pipeline × metric grid with bootstrap CIs and pipeline disclosure
  • scripts/beir-bootstrap-significance.mjs — paired bootstrap, 10K resamples, mulberry32 seed=42
  • perQuery metrics now saved in every BEIR run JSON (external bootstrap verification by anyone)
  • ADR-085 hedged appropriately ("TOP-2 on BEIR NFCorpus, not BEIR average", "direct dense, no fine-tune, no rerank")
  • ADR-086 documents the silent-fallback bug story + bootstrap method + honest two-dataset picture
  • 2 ruvector upstream issues filed: #523 (API contract bugs), #524 (bundle BGE-base/small)
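
For anyone who wants to verify significance from the saved per-query metrics, here is a minimal sketch of a paired bootstrap with a mulberry32 PRNG, 10K resamples, and seed 42. It is an illustration under assumptions, not the contents of `scripts/beir-bootstrap-significance.mjs`: the function name `pairedBootstrap`, the plain-array inputs, the percentile CI, and the two-sided p-value rule are all hypothetical choices.

```javascript
// mulberry32: a small seeded 32-bit PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: paired bootstrap over per-query scores of two systems.
// scoresA[i] and scoresB[i] must refer to the same query.
function pairedBootstrap(scoresA, scoresB, { resamples = 10000, seed = 42 } = {}) {
  if (scoresA.length === 0 || scoresA.length !== scoresB.length) {
    throw new Error("score arrays must be non-empty and the same length");
  }
  const n = scoresA.length;
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const observed = diffs.reduce((sum, d) => sum + d, 0) / n;

  const rand = mulberry32(seed);
  const means = new Float64Array(resamples);
  let atOrBelowZero = 0;
  let atOrAboveZero = 0;

  for (let r = 0; r < resamples; r++) {
    // Resample queries with replacement, keeping each A/B pair together.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    const m = sum / n;
    means[r] = m;
    if (m <= 0) atOrBelowZero++;
    if (m >= 0) atOrAboveZero++;
  }

  means.sort(); // typed arrays sort numerically
  const ci = [
    means[Math.floor(0.025 * resamples)],
    means[Math.min(resamples - 1, Math.floor(0.975 * resamples))],
  ];
  // Two-sided p-value: twice the smaller tail fraction, capped at 1.
  const p = Math.min(1, (2 * Math.min(atOrBelowZero, atOrAboveZero)) / resamples);
  return { observed, ci, p };
}

// Usage with made-up per-query scores (not real benchmark data).
const result = pairedBootstrap([0.2, 0.9, 0.4, 0.7, 0.1, 0.6], [0.3, 0.5, 0.6, 0.2, 0.4, 0.1]);
console.log(result);
```

Because the PRNG is seeded, repeated runs on the same inputs return the same interval, which is what makes external verification of a reported CI practical.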

Reproduce

```bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# NFCorpus + SciFact (~30 min ingest each)
mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs

mkdir -p /tmp/beir-scifact && cd /tmp/beir-scifact
curl -sL -o sf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip' && unzip -q sf.zip
BEIR_DATA_DIR=/tmp/beir-scifact/scifact node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs

# Bootstrap significance (10k resamples, ~1s)
node /path/to/scripts/beir-bootstrap-significance.mjs /path/to/run.json
```

Honest limits

  • Two datasets only. BEIR ships 18. The 2-dataset mean is suggestive, not definitive.
  • Zero-shot. NFCorpus has a 110K-pair train split; fine-tuning on it would likely close ~0.02-0.05 nDCG.
  • Single annotator on internal labels (separate from BEIR's external qrels).
  • Larger datasets (TREC-COVID, HotpotQA, NQ) are not run yet. GPU compute via Tailscale on ruvultra has been discussed and is tracked.

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-086-silent-fallback-and-bootstrap.md
