claude-flow

v3.10.27 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched in this version

Topics

agentic-ai agentic-framework agentic-rag agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

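Mixed release: adds two BEIR evaluation scripts (an RRF ablation harness and a hybrid RRF runner), fixes a cache-path bug, and documents a negative result for BM25+dense RRF against the BGE-base dense baseline.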

Changes in this release

Feature Low

Added `scripts/run-beir-rrf-ablation.mjs` runnable ablation harness with bootstrap CI.

Source: llm_adapter@2026-05-31

Confidence: high

Feature Low

Added `scripts/run-beir-hybrid.mjs` for RRF plus optional cross‑encoder rerank.

Source: llm_adapter@2026-05-31

Confidence: high

Bugfix Medium

Fixed cache path hardcoding that caused SciFact to overwrite NFCorpus cache.

Source: llm_adapter@2026-05-31

Confidence: high

Refactor Low

Updated `BEIR-MATRIX.md` with ablation rows and honest 2‑dataset mean comparison.

Source: llm_adapter@2026-05-31

Confidence: high

Full changelog

What ships

Honest negative result. The textbook "lowest-regret" first move — BM25+dense RRF k=60 — degrades nDCG@10 on both NFCorpus and SciFact because our multi-field BM25 is materially weaker than Lucene's. We ship the ablation harness + the finding anyway.

Acceptance test outcome

"RRF improves or preserves nDCG@10 on both NFCorpus and SciFact, bootstrap CI does not undermine the claim, defaults fixed before viewing test result." FAILS.

| Config (nDCG@10; defaults fixed before viewing results) | NFCorpus | SciFact | Mean |
|---|---:|---:|---:|
| dense alone (BGE-base) | 0.352 | 0.626 | 0.489 |
| RRF k=60 equal (textbook default) | 0.328 ↓ | 0.569 ↓ | 0.449 ↓ |
| RRF k=30 equal (best ablation) | 0.335 ↓ | 0.582 ↓ | 0.459 ↓ |
| RRF k=60 dense=1.2, bm25=0.8 | 0.334 ↓ | 0.577 ↓ | 0.456 ↓ |
| RRF k=60 dense=0.8, bm25=1.2 | 0.323 ↓ | 0.558 ↓ | 0.441 ↓ |

Every RRF variant underperforms dense-alone on the 2-dataset mean, by 0.030 to 0.048 nDCG@10 (0.040 for the textbook k=60 default).
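
For readers checking the table, nDCG@10 for a single query can be sketched as below. This is a minimal illustration, not the code in `run-beir-rrf-ablation.mjs`. It assumes linear gain (graded relevance used directly), and `dcgAtK`, `ndcgAtK`, `ranked`, and `qrels` are hypothetical names chosen for the example.

```javascript
// Discounted cumulative gain over the first k gains (rank is 1-based, so the
// discount for array index i is log2(i + 2)).
function dcgAtK(gains, k) {
  return gains.slice(0, k).reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
}

// nDCG@k for one query.
// ranked: doc IDs in rank order. qrels: doc ID -> graded relevance (missing = 0).
function ndcgAtK(ranked, qrels, k = 10) {
  const gains = ranked.map((id) => qrels[id] ?? 0);
  const ideal = Object.values(qrels).sort((a, b) => b - a);
  const idcg = dcgAtK(ideal, k);
  return idcg === 0 ? 0 : dcgAtK(gains, k) / idcg;
}

// A relevant doc at rank 2 instead of rank 1 costs part of the score.
console.log(ndcgAtK(['x', 'a'], { a: 1 }));
```

The reported figures are the mean of this per-query value over all test queries in each dataset.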

What DID work — recall

Recall@100 IS up on both:

| Dataset | Dense R@100 | RRF R@100 | Δ |
|---|---:|---:|---:|
| NFCorpus | 0.305 | 0.321 | +0.016 |
| SciFact | 0.828 | 0.951 | +0.123 |

RRF surfaces more relevant candidates; it just ranks them worse at top-K. This is the right setup for stage 2: cross-encoder rerank on the wider candidate pool (ADR-088 / 3.10.28).
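
To make the recall claim concrete, here is a small sketch of Recall@k for one query. It is an illustration under assumptions, not the harness code: `recallAtK` and the toy doc IDs are hypothetical, and any doc with relevance greater than 0 is counted as relevant.

```javascript
// Recall@k for one query: fraction of relevant docs that appear in the top k.
// ranked: doc IDs in rank order. qrels: doc ID -> relevance (> 0 means relevant).
function recallAtK(ranked, qrels, k = 100) {
  const relevant = Object.keys(qrels).filter((id) => qrels[id] > 0);
  if (relevant.length === 0) return 0;
  const topK = new Set(ranked.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// Toy case: the dense list misses r2, a fused list picks it up from BM25.
const qrels = { r1: 1, r2: 1 };
const denseOnly = ['r1', 'x1', 'x2'];
const fused = ['r1', 'x1', 'r2', 'x2'];
console.log(recallAtK(denseOnly, qrels), recallAtK(fused, qrels));
```

A wider pool with higher recall is what gives a second-stage reranker something to work with, even when the first-stage ordering is worse.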

Diagnosis (why RRF hurt)

The classic RRF win assumes comparably strong systems with different failure modes. Our setup is asymmetric: BGE-base dense is strong (0.626 nDCG@10 on SciFact), while our multi-field BM25 is weak (0.576 on SciFact vs. Lucene's published 0.679). Pure BM25 nDCG@10 on NFCorpus is 0.279 vs. Lucene's 0.325, about 14% lower in relative terms.

When one input is weak, RRF averages its noise into top positions instead of cancelling it. The math works perfectly for the documented Lucene+strong-dense case; we don't match that profile yet.
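
The effect can be seen in a minimal sketch of weighted RRF, where each doc scores the sum of `weight / (k + rank)` over the input rankings. This is not the code in `run-beir-hybrid.mjs`: `rrfFuse` is a hypothetical name, and applying the dense/BM25 weights as multipliers on each reciprocal-rank term is an assumption about how the ablation weights are used.

```javascript
// Weighted Reciprocal Rank Fusion.
// rankings: array of ranked doc-ID lists, best first.
// weights: one multiplier per ranking (assumed multiplicative; defaults to 1).
function rrfFuse(rankings, { k = 60, weights } = {}) {
  const w = weights ?? rankings.map(() => 1);
  const scores = new Map();
  rankings.forEach((ranked, s) => {
    ranked.forEach((id, i) => {
      // i is 0-based, so the 1-based rank is i + 1.
      scores.set(id, (scores.get(id) ?? 0) + w[s] / (k + i + 1));
    });
  });
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Toy case: dense is right (d1, d2, d3), BM25 puts a noise doc n1 first.
const dense = ['d1', 'd2', 'd3'];
const bm25 = ['n1', 'd1', 'n2'];
console.log(rrfFuse([dense, bm25]));                           // n1 lands above d2
console.log(rrfFuse([dense, bm25], { weights: [1.2, 0.8] }));  // d2 moves back above n1
```

With equal weights, a doc ranked first by the weak system alone outscores a doc ranked second by the strong system alone, which is how BM25 noise reaches the top positions.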

Bug found and fixed

bge-cache/ was hardcoded to /tmp/beir-nfcorpus/bge-cache/, so the SciFact run silently overwrote the NFCorpus cache. This was caught only when the first RRF run returned nDCG=0.14 (random-noise level), forcing investigation. The cache path is now per-dataset. The 3.10.25 and 3.10.26 NFCorpus numbers were computed before the overwrite and are still valid.
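
The shape of the fix can be sketched as deriving the cache directory from the dataset name instead of hardcoding it. This is an illustration only: `bgeCacheDir` is a hypothetical helper, and the `/tmp/beir-<dataset>/bge-cache` layout is an assumption extrapolated from the NFCorpus path, not necessarily what `run-beir-bge.mjs` does.

```javascript
// Build a per-dataset embedding cache directory so runs cannot collide.
// The directory layout is an assumption for this example.
function bgeCacheDir(dataset, root = '/tmp') {
  // Reject names that could escape the root or be empty.
  if (!/^[a-z0-9_-]+$/i.test(dataset)) {
    throw new Error(`unexpected dataset name: ${dataset}`);
  }
  return `${root}/beir-${dataset}/bge-cache`;
}

console.log(bgeCacheDir('nfcorpus'));
console.log(bgeCacheDir('scifact'));
```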

What's in the box

  1. scripts/run-beir-rrf-ablation.mjs — re-runnable ablation harness with bootstrap CI on the fixed default config + full ablation matrix.
  2. scripts/run-beir-hybrid.mjs — full RRF + opt-in cross-encoder rerank runner (rerank wired but pending ADR-088 measurement).
  3. bge-cache/ per-dataset path fix in run-beir-bge.mjs.
  4. ADR-087 — full negative-result writeup with diagnosis + tracked next steps.
  5. Updated BEIR-MATRIX.md with ablation rows + the honest 2-dataset mean comparison.
  6. No default change — dense-only stays the BEIR runner default. RRF is opt-in for callers with Lucene-strength BM25.
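
The bootstrap procedure the harness uses is not described in these notes. One common choice, a percentile bootstrap over per-query nDCG@10 differences (RRF minus dense), is sketched below as an assumption. `bootstrapMeanCI` and its options are hypothetical names, not the harness API.

```javascript
// Percentile bootstrap CI for the mean of per-query deltas.
// deltas: one value per query, e.g. nDCG@10(RRF) - nDCG@10(dense).
// rng: a function returning floats in [0, 1); injectable for reproducibility.
function bootstrapMeanCI(deltas, { samples = 1000, alpha = 0.05, rng = Math.random } = {}) {
  const n = deltas.length;
  if (n === 0) throw new Error('need at least one delta');
  const means = [];
  for (let s = 0; s < samples; s++) {
    // Resample n queries with replacement and record the resampled mean.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rng() * n)];
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor((alpha / 2) * samples)];
  const hi = means[Math.min(samples - 1, Math.ceil((1 - alpha / 2) * samples) - 1)];
  const mean = deltas.reduce((a, b) => a + b, 0) / n;
  return { mean, lo, hi };
}

// Toy deltas: if the whole interval sits below zero, the regression is not noise.
console.log(bootstrapMeanCI([-0.05, -0.02, -0.04, -0.01, -0.03]));
```

Under this reading, an interval for the delta that lies entirely below zero would be what makes the acceptance test fail rather than merely come out inconclusive.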

Next steps (already tracked)

  • ADR-088 / 3.10.28: Cross-encoder rerank on RRF's wider candidate pool (Recall@100 0.951 on SciFact says the candidates ARE there).
  • Lucene-style BM25: Porter/Snowball stemmer + Lucene stopword list + length norm. Would make RRF actually work as designed.
  • ruvnet/ruvector#524: bundle BGE in ruvector so downstream packages stop hitting the sharp dependency.
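
For the Lucene-style BM25 item above, the length-normalization part can be sketched with the classic BM25 per-term score. This is an illustration, not the project's BM25 and not Lucene's exact implementation: `bm25TermScore` is a hypothetical name, the idf form and the `k1 = 1.2`, `b = 0.75` defaults are the commonly cited ones, and stemming and stopword removal are assumed to happen before term counts are computed.

```javascript
// Classic BM25 contribution of one query term to one document's score.
// tf: term frequency in the doc; docLen / avgDocLen: lengths in tokens;
// docCount: docs in the corpus; docFreq: docs containing the term.
function bm25TermScore({ tf, docLen, avgDocLen, docCount, docFreq, k1 = 1.2, b = 0.75 }) {
  // Non-negative idf variant: ln(1 + (N - n + 0.5) / (n + 0.5)).
  const idf = Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
  // Length normalization: docs longer than average are penalized when b > 0.
  const norm = 1 - b + b * (docLen / avgDocLen);
  return (idf * tf * (k1 + 1)) / (tf + k1 * norm);
}

// Same term count, different lengths: the shorter doc scores higher.
const base = { tf: 2, avgDocLen: 100, docCount: 1000, docFreq: 50 };
console.log(bm25TermScore({ ...base, docLen: 50 }), bm25TermScore({ ...base, docLen: 400 }));
```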

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs              # ingest
node /path/to/v3/@claude-flow/cli/scripts/run-beir-rrf-ablation.mjs    # ablation matrix

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-087-rrf-negative-result.md
