claude-flow

v3.10.27 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo ago · AI Agents & Assistants

✓ No known CVEs patched in this version

Topics

agentic-ai agentic-framework agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex harness mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

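Adds an RRF ablation harness and a hybrid BEIR runner, fixes a cache-path bug that let the SciFact run overwrite the NFCorpus cache, and documents a negative result: BM25+dense RRF lowers nDCG@10 on both NFCorpus and SciFact relative to dense-only BGE-base retrieval.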

Changes in this release

| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Added `scripts/run-beir-rrf-ablation.mjs`, a runnable ablation harness with bootstrap CI. | none |
| Feature | Low | Added `scripts/run-beir-hybrid.mjs` for RRF plus optional cross-encoder rerank. | none |
| Bugfix | Medium | Fixed cache path hardcoding that caused SciFact to overwrite the NFCorpus cache. | none |
| Refactor | Low | Updated `BEIR-MATRIX.md` with ablation rows and an honest 2-dataset mean comparison. | none |

Source for all entries: llm_adapter@2026-05-31. Confidence: high.

Full changelog

What ships

Honest negative result. The textbook "lowest-regret" first move — BM25+dense RRF k=60 — degrades nDCG@10 on both NFCorpus and SciFact because our multi-field BM25 is materially weaker than Lucene's. We ship the ablation harness + the finding anyway.

Acceptance test outcome

"RRF improves or preserves nDCG@10 on both NFCorpus and SciFact, bootstrap CI does not undermine the claim, defaults fixed before viewing test result." FAILS.

| Config (defaults fixed before viewing results) | NFCorpus nDCG@10 | SciFact nDCG@10 | Mean |
|---|---:|---:|---:|
| dense alone (BGE-base) | 0.352 | 0.626 | 0.489 |
| RRF k=60 equal (textbook default) | 0.328 ↓ | 0.569 ↓ | 0.449 ↓ |
| RRF k=30 equal (best ablation) | 0.335 ↓ | 0.582 ↓ | 0.459 ↓ |
| RRF k=60 dense=1.2, bm25=0.8 | 0.334 ↓ | 0.577 ↓ | 0.456 ↓ |
| RRF k=60 dense=0.8, bm25=1.2 | 0.323 ↓ | 0.558 ↓ | 0.441 ↓ |

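Every RRF variant underperforms dense-alone on the 2-dataset mean: the textbook k=60 default loses 0.040 nDCG@10, and the ablations lose between 0.030 (k=30 equal) and 0.048 (dense=0.8, bm25=1.2).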
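The acceptance criterion leans on a bootstrap confidence interval. As a rough illustration only (this is not the code in `run-beir-rrf-ablation.mjs`), a paired bootstrap over per-query nDCG@10 scores could look like the sketch below. The function names, the resample count, the seed, and the percentile method are all assumptions made for the example.

```javascript
// Small seeded PRNG (mulberry32-style) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap CI for mean(candidate - baseline), resampling queries
// with replacement. Both arrays hold one score per query, in the same order.
function pairedBootstrapCI(baseline, candidate, { resamples = 2000, alpha = 0.05, seed = 42 } = {}) {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("expected two non-empty arrays of equal length");
  }
  const n = baseline.length;
  const diffs = candidate.map((score, i) => score - baseline[i]);
  const meanDiff = diffs.reduce((sum, d) => sum + d, 0) / n;

  const rand = mulberry32(seed);
  const means = new Array(resamples);
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means[r] = sum / n;
  }
  means.sort((x, y) => x - y);

  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  return { meanDiff, lo, hi };
}
```

Under this reading, an interval whose upper bound sits below zero would indicate that the candidate (for example RRF) is reliably worse than the baseline (dense alone), while an interval straddling zero would leave the comparison undecided.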

What DID work — recall

Recall@100 IS up on both:

| Dataset | Dense R@100 | RRF R@100 | Δ |
|---|---:|---:|---:|
| NFCorpus | 0.305 | 0.321 | +0.016 |
| SciFact | 0.828 | 0.951 | +0.123 |

RRF surfaces more candidates correctly — it just ranks them worse at top-K. This is the right setup for stage 2: cross-encoder rerank on the wider candidate pool (ADR-088 / 3.10.28).
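To make the "more candidates, worse top-K ordering" distinction concrete, here is a minimal sketch of the two metrics. It assumes the usual definitions (Recall@k as the fraction of relevant documents in the top k, nDCG@k with linear gain and a log2 position discount); the function names and the `qrels` shape (`docId -> relevance grade`) are hypothetical and not taken from the repository.

```javascript
// Recall@k: share of the relevant documents that appear in the top k.
function recallAtK(ranking, qrels, k) {
  const relevant = Object.keys(qrels).filter((id) => qrels[id] > 0);
  if (relevant.length === 0) return 0;
  const topK = new Set(ranking.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// nDCG@k: discounted gain of the ranking divided by that of the ideal ordering.
function ndcgAtK(ranking, qrels, k) {
  const dcg = (gains) => gains.reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
  const gains = ranking.slice(0, k).map((id) => qrels[id] ?? 0);
  const ideal = Object.values(qrels)
    .filter((g) => g > 0)
    .sort((a, b) => b - a)
    .slice(0, k);
  const idcg = dcg(ideal);
  return idcg === 0 ? 0 : dcg(gains) / idcg;
}

// Two rankings that retrieve the same relevant documents in different positions.
const exampleQrels = { a: 1, b: 1 };
const wellOrdered = ["a", "b", "x", "y"];
const poorlyOrdered = ["x", "a", "y", "b"];
console.log(recallAtK(wellOrdered, exampleQrels, 4), recallAtK(poorlyOrdered, exampleQrels, 4));
console.log(ndcgAtK(wellOrdered, exampleQrels, 4), ndcgAtK(poorlyOrdered, exampleQrels, 4));
```

Both rankings have identical recall at depth 4, but the second scores lower on nDCG because its relevant documents sit further down. That is the same pattern the tables above show at larger scale.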

Diagnosis (why RRF hurt)

The classic RRF win assumes comparably strong systems with different failure modes. Our setup is asymmetric: BGE-base dense is strong (0.626 nDCG@10 on SciFact), while our multi-field BM25 is weak (0.576 on SciFact vs. Lucene's published 0.679, about 15% relative below). On NFCorpus, pure BM25 reaches 0.279 nDCG@10 vs. Lucene's 0.325, about 14% relative below.

When one input is weak, RRF averages its noise into top positions instead of cancelling it. The math works perfectly for the documented Lucene+strong-dense case; we don't match that profile yet.
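For readers who want the fusion rule itself, weighted reciprocal rank fusion scores each document as the sum over input rankings of `weight / (k + rank)`, with ranks starting at 1. The sketch below is a minimal illustration under that standard definition; `rrfFuse` and its option names are hypothetical and are not the API of `run-beir-hybrid.mjs`.

```javascript
// Weighted Reciprocal Rank Fusion.
// rankings: array of arrays of doc ids, best first.
// weights: one multiplier per ranking (defaults to equal weights).
function rrfFuse(rankings, { k = 60, weights } = {}) {
  const w = weights ?? rankings.map(() => 1);
  if (w.length !== rankings.length) {
    throw new Error("need exactly one weight per ranking");
  }
  const scores = new Map();
  rankings.forEach((ranking, i) => {
    ranking.forEach((docId, idx) => {
      const rank = idx + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + w[i] / (k + rank));
    });
  });
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0))
    .map(([docId, score]) => ({ docId, score }));
}

// Toy inputs: the two rankers agree on d1 and d3 but order them differently.
const denseRanking = ["d1", "d2", "d3"];
const bm25Ranking = ["d3", "d1", "d4"];
console.log(rrfFuse([denseRanking, bm25Ranking], { k: 60 }).map((r) => r.docId));
// → [ 'd1', 'd3', 'd2', 'd4' ]
```

Note that every input contributes on equal footing regardless of how accurate it is, apart from the fixed weights. A document the weaker ranker places highly gets the same boost as one the stronger ranker places highly, which is consistent with the diagnosis above.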

Bug found and fixed

bge-cache/ was hardcoded to /tmp/beir-nfcorpus/bge-cache/ — the SciFact run silently overwrote the NFCorpus cache. Caught only when the first RRF run returned nDCG=0.14 (random-noise level), forcing investigation. Now per-dataset path. 3.10.25 and 3.10.26 NFCorpus numbers were computed before the overwrite and are still valid.
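The shape of the fix can be sketched as deriving the cache directory from the dataset name rather than hardcoding it. The helper name `bgeCacheDir`, the `/tmp/beir-<dataset>/bge-cache` layout, and the name validation are assumptions for illustration; the actual change lives in `run-beir-bge.mjs` and may differ.

```javascript
// Hypothetical helper: one cache directory per dataset, so runs cannot collide.
function bgeCacheDir(dataset, root = "/tmp") {
  // Restrict names to a conservative character set to avoid path surprises.
  if (!/^[a-z0-9][a-z0-9_-]*$/.test(dataset)) {
    throw new Error(`invalid dataset name: ${dataset}`);
  }
  return `${root}/beir-${dataset}/bge-cache`;
}

console.log(bgeCacheDir("nfcorpus")); // → /tmp/beir-nfcorpus/bge-cache
console.log(bgeCacheDir("scifact"));  // → /tmp/beir-scifact/bge-cache
```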

What's in the box

scripts/run-beir-rrf-ablation.mjs — re-runnable ablation harness with bootstrap CI on the fixed default config + full ablation matrix.
scripts/run-beir-hybrid.mjs — full RRF + opt-in cross-encoder rerank runner (rerank wired but pending ADR-088 measurement).
bge-cache/ per-dataset path fix in run-beir-bge.mjs.
ADR-087 — full negative-result writeup with diagnosis + tracked next steps.
Updated BEIR-MATRIX.md with ablation rows + the honest 2-dataset mean comparison.
No default change — dense-only stays the BEIR runner default. RRF is opt-in for callers with Lucene-strength BM25.

Next steps (already tracked)

ADR-088 / 3.10.28: Cross-encoder rerank on RRF's wider candidate pool (Recall@100 0.951 on SciFact says the candidates ARE there).
Lucene-style BM25: Porter/Snowball stemmer + Lucene stopword list + length norm. Would make RRF actually work as designed.
ruvnet/ruvector#524: bundle BGE in ruvector so downstream packages stop hitting the sharp dependency.
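As background for the Lucene-style BM25 item above, the sketch below shows where stopword filtering and length normalization enter a textbook Okapi BM25 term score. It is an illustration only: the stopword set is a tiny placeholder rather than the Lucene list, no stemmer is included, the `k1 = 1.2` and `b = 0.75` values are the commonly cited defaults, and Lucene's own implementation differs in details.

```javascript
// Placeholder stopword set for illustration (not the Lucene list).
const EXAMPLE_STOPWORDS = new Set(["a", "an", "and", "the", "of", "in", "is", "to"]);

// Lowercase, split on non-alphanumerics, drop stopwords. No stemming here.
function tokenize(text, stopwords = EXAMPLE_STOPWORDS) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((tok) => tok.length > 0 && !stopwords.has(tok));
}

// Textbook BM25 contribution of one query term to one document.
// b controls length normalization: b = 0 ignores document length entirely.
function bm25TermScore({ tf, docLen, avgDocLen, docFreq, numDocs, k1 = 1.2, b = 0.75 }) {
  if (tf <= 0) return 0;
  const idf = Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
  const lengthNorm = 1 - b + b * (docLen / avgDocLen);
  return (idf * tf * (k1 + 1)) / (tf + k1 * lengthNorm);
}

// Same term frequency, different document lengths.
const shared = { tf: 2, avgDocLen: 100, docFreq: 10, numDocs: 1000 };
console.log(bm25TermScore({ ...shared, docLen: 50 }) > bm25TermScore({ ...shared, docLen: 400 }));
// → true
```

With length normalization on, a term that appears twice in a short document counts for more than the same two occurrences in a long one. Differences in tokenization, stopwords, stemming, and this normalization are the kind of gap the next-step item targets.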

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs              # ingest
node /path/to/v3/@claude-flow/cli/scripts/run-beir-rrf-ablation.mjs    # ablation matrix

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-087-rrf-negative-result.md


About claude-flow

Deploy multi-agent swarms with coordinated workflows.
