claude-flow

v3.10.26 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 1mo AI Agents & Assistants

✓ No known CVEs patched

Topics

agentic-ai agentic-framework agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex harness mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

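Adds BEIR benchmark infrastructure (matrix documentation, a paired bootstrap significance script, and per-query metrics in every run JSON), fixes the silent-fallback bug described in ADR-086, and reports two-dataset results together with their limits.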

Changes in this release

| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Feature | Low | Adds BEIR-MATRIX benchmark grid documentation in docs/benchmarks/BEIR-MATRIX.md | llm_adapter@2026-05-30 | high | none |
| Feature | Low | Adds paired bootstrap significance script scripts/beir-bootstrap-significance.mjs with 10K resamples and mulberry32 seed=42 | llm_adapter@2026-05-30 | high | none |
| Feature | Low | Adds perQuery metrics to every BEIR run JSON for external verification | llm_adapter@2026-05-30 | high | none |
| Feature | Low | Updates install command to npx [email protected] for latest/alpha alignment | llm_adapter@2026-05-30 | high | none |
| Dependency | Low | Files upstream ruvector issues #523 (API contract bugs) and #524 (bundle BGE-base/small) | llm_adapter@2026-05-30 | high | none |
| Bugfix | Medium | Fixes silent-fallback bug described in ADR-086 and implements bootstrap method | llm_adapter@2026-05-30 | high | none |

Full changelog

What ships

This release acts on the user's "release hype → benchmark infrastructure" feedback. What ships here is the infrastructure, not a ranking claim.

Honest two-dataset picture

| Dataset | nDCG@10 | 95% CI | Rank | vs BM25 |
|---|---:|---|---:|---|
| NFCorpus | 0.352 | [0.317, 0.387] | 2/11 | +0.027 (n.s.) |
| SciFact | 0.626 | [0.577, 0.672] | 10/11 | -0.053 (p<0.05) — significant LOSS |

The user's acceptance test ("ruflo beats BM25 on both") fails on SciFact — we significantly lose to BM25 by 0.053. On NFCorpus, only SBERT msmarco and ColBERT are statistically significant wins; the gaps to SPLADE++/GTR-XL/BM25 are within CI overlap.
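
For readers who want to check per-query numbers independently, nDCG@10 for a single query can be sketched as below. This assumes the linear-gain formulation with a log2(rank + 1) discount that trec_eval-style tooling uses; the repository's own script may differ in details. The function name `ndcgAtK`, the input shapes, and the sample judgments are hypothetical.

```javascript
// nDCG@k for one query with linear gain and a log2(rank + 1) discount.
// Hypothetical helper: the name and input shapes are assumptions for illustration.
function ndcgAtK(rankedDocIds, qrels, k = 10) {
  // Index i is zero-based, so rank = i + 1 and the discount is log2(i + 2).
  const dcg = (gains) => gains.reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
  // Documents without a judgment count as non-relevant (gain 0).
  const gains = rankedDocIds.slice(0, k).map((id) => qrels[id] ?? 0);
  // The ideal ranking puts the highest judged gains first.
  const ideal = Object.values(qrels)
    .filter((g) => g > 0)
    .sort((a, b) => b - a)
    .slice(0, k);
  const idcg = dcg(ideal);
  return idcg === 0 ? 0 : dcg(gains) / idcg;
}

// Made-up graded judgments for one query, for illustration only.
const qrels = { d1: 2, d2: 1 };
const perfect = ndcgAtK(["d1", "d2", "d9"], qrels); // ideal order
const swapped = ndcgAtK(["d2", "d1", "d9"], qrels); // top two swapped
console.log(perfect, swapped);
```

The dataset-level figure in the table would then be the mean of this value over all queries.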

Two-dataset mean: ours 0.489, BM25 0.502, SPLADE++ 0.526, BGE-large 0.551. Below BM25 on the mean. BGE-base zero-shot is competent on NFCorpus (medical IR), weak on SciFact (fact-verification favours lexical retrieval). The NFCorpus rank-2 is real but not representative.
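
The two means for ours and BM25 can be re-derived from the table. BM25's per-dataset scores are not listed there, so this sketch recovers them by subtracting the reported "vs BM25" deltas from our scores; that reading of the column is an assumption.

```javascript
// Per-dataset nDCG@10 for our pipeline, copied from the table.
const ours = { NFCorpus: 0.352, SciFact: 0.626 };
// Reported deltas versus BM25 (ours minus BM25), copied from the table.
const deltaVsBm25 = { NFCorpus: 0.027, SciFact: -0.053 };

const mean = (values) => values.reduce((a, b) => a + b, 0) / values.length;

// Assumption: BM25 score = our score minus the reported delta.
const bm25 = Object.fromEntries(
  Object.keys(ours).map((name) => [name, ours[name] - deltaVsBm25[name]])
);

const oursMean = mean(Object.values(ours));
const bm25Mean = mean(Object.values(bm25));

console.log(oursMean.toFixed(3)); // 0.489
console.log(bm25Mean.toFixed(3)); // 0.502
```

Both values match the stated means, which confirms that the mean is a plain unweighted average over the two datasets.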

What's in the box

docs/benchmarks/BEIR-MATRIX.md — dataset × pipeline × metric grid with bootstrap CIs and pipeline disclosure
scripts/beir-bootstrap-significance.mjs — paired bootstrap, 10K resamples, mulberry32 seed=42
perQuery metrics now saved in every BEIR run JSON (external bootstrap verification by anyone)
ADR-085 hedged appropriately ("TOP-2 on BEIR NFCorpus, not BEIR average", "direct dense, no fine-tune, no rerank")
ADR-086 documents the silent-fallback bug story + bootstrap method + honest two-dataset picture
2 ruvector upstream issues filed: #523 (API contract bugs), #524 (bundle BGE-base/small)
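
As a rough illustration of the paired bootstrap listed above, the sketch below resamples per-query score differences 10,000 times with a mulberry32 generator seeded at 42 and reports a 95% percentile interval. It is not the contents of scripts/beir-bootstrap-significance.mjs: the function name `pairedBootstrap`, its return shape, the percentile-interval significance rule, and the sample scores are all assumptions made for illustration.

```javascript
// mulberry32: a small, widely used 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap over per-query differences (system A minus system B).
// Hypothetical helper: name, signature, and return shape are illustrative.
function pairedBootstrap(scoresA, scoresB, { resamples = 10000, seed = 42 } = {}) {
  if (scoresA.length === 0 || scoresA.length !== scoresB.length) {
    throw new Error("score arrays must be non-empty and aligned by query");
  }
  const n = scoresA.length;
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const rand = mulberry32(seed);
  const means = new Float64Array(resamples);
  for (let r = 0; r < resamples; r++) {
    // Draw n queries with replacement and average their differences.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means[r] = sum / n;
  }
  means.sort(); // typed arrays sort numerically by default
  const lower = means[Math.floor(0.025 * (resamples - 1))];
  const upper = means[Math.ceil(0.975 * (resamples - 1))];
  const observed = diffs.reduce((s, d) => s + d, 0) / n;
  // Assumed rule: significant when the 95% interval excludes zero.
  return { observed, ci95: [lower, upper], significant: lower > 0 || upper < 0 };
}

// Made-up per-query nDCG@10 values, for illustration only.
const systemA = [0.42, 0.55, 0.31, 0.78, 0.6, 0.49];
const systemB = [0.4, 0.5, 0.35, 0.7, 0.52, 0.47];
const result = pairedBootstrap(systemA, systemB);
console.log(result);
```

Because the generator is seeded, repeated runs on the same per-query scores give the same interval, which is what makes external verification from the saved perQuery metrics practical.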

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# NFCorpus + SciFact (~30 min ingest each)
mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs

mkdir -p /tmp/beir-scifact && cd /tmp/beir-scifact
curl -sL -o sf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip' && unzip -q sf.zip
BEIR_DATA_DIR=/tmp/beir-scifact/scifact node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs

# Bootstrap significance (10k resamples, ~1s)
node /path/to/scripts/beir-bootstrap-significance.mjs /path/to/run.json

Honest limits

Two datasets only. BEIR ships 18. The 2-dataset mean is suggestive, not definitive.
Zero-shot. NFCorpus has a 110K-pair train split; fine-tuning on it would close ~0.02-0.05 nDCG.
Single annotator on internal labels (separate from BEIR's external qrels).
Tailscale-via-ruvultra GPU compute discussed for larger datasets (TREC-COVID, HotpotQA, NQ) — tracked.

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-086-silent-fallback-and-bootstrap.md

About claude-flow

Deploy multi-agent swarms with coordinated workflows.
