claude-flow

v3.10.21 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched in this version

Topics

agentic-ai agentic-framework agentic-rag agentic-workflow agents ai-agents
ai-assistant ai-coding ai-skills autonomous-agents claude-code codex mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

Covers the "What changed in code", "What's next", and "Honest limits" sections of a mixed release.

Changes in this release

Feature Medium

Add `expectedSubstrings` array to each query in QUERIES for hand‑curated labels.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Medium

Introduce `isRelevant(name, substrings)` helper for case‑insensitive substring relevance checks.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Medium

Implement `ndcgAtK(rankedRelevance, k)` function to compute normalized DCG for ranking metrics.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Medium

Add six new labelled metrics (label_top1HitRate, label_top3HitRate, label_mrr3, label_precision3, label_ndcg3, label_ndcg5) to summary JSON and console output.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Medium

Preserve existing regex‑proxy metrics under "regex proxy" labels for reproducibility of ADR 077‑080 numbers.

Source: llm_adapter@2026-05-30

Confidence: high

Feature Low

Document that default `rerank: false` suits top‑1‑first callers; opt‑in `rerank: true` benefits richer top‑K consumers.

Source: granite4.1:30b@2026-05-30-audit

Confidence: low

Bugfix Medium

Surface the cross-encoder trade-off: enabling rerank raises labelled precision@3 from 0.40 to 0.67 but lowers the labelled top-1 hit rate from 90% to 80%.

Source: llm_adapter@2026-05-30

Confidence: low

Bugfix Medium

Report the latency cost: the hybrid config averages ~42 ms per query, up from 29 ms for cosine-only retrieval in 3.10.17. The release measures this increase; it does not remove it.

Source: llm_adapter@2026-05-30

Confidence: low

Refactor Low

Replace regex relevance proxy with labelled held‑out corpus and nDCG/precision metrics as the canonical measurement method.

Source: granite4.1:30b@2026-05-30-audit

Confidence: low

Full changelog

What ships

Labelled held-out corpus + nDCG/precision metrics replace the regex-over-subject
relevance proxy used in ADRs 077-080. Honest SOTA needs honest measurement.

The honest-measurement finding

The regex proxy was both over- and under-reporting. When the same 4 configs run
through the labelled corpus, the truth shifts in both directions:

| Config | Regex top-1 | Labelled top-1 | Direction |
|---|---:|---:|---|
| Hybrid (3.10.19) | 80% | 90% | regex under-reported |
| Hybrid + Rerank (3.10.20) | 90% | 80% | regex over-reported |

Real numbers (labelled metric, the new canonical)

| Metric | 3.10.17 cosine | 3.10.19 hybrid | 3.10.20 +rerank |
|---|---:|---:|---:|
| Label top-1 hit rate | 0% | 90% | 80% |
| Label top-3 hit rate | 0% | 90% | 100% |
| Label MRR@3 | 0.000 | 0.900 | 0.883 |
| Label precision@3 | 0.000 | 0.400 | 0.667 |
| Label nDCG@3 | 0.000 | 0.900 | 0.913 |
| Avg query latency | 29 ms | 42 ms | 977 ms (opt-in) |
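The rank-based rows in the table above follow the usual definitions of hit rate, MRR, and precision at a cutoff. The sketch below shows one way to compute them from per-query relevance flags. The function names (`hitRateAtK`, `mrrAtK`, `precisionAtK`) and the toy data are illustrative assumptions, not the bench script's actual code or corpus.

```typescript
// One entry per query: whether the result at each rank is relevant.
type RankedRelevance = boolean[];

// Fraction of queries with at least one relevant result in the top k.
function hitRateAtK(runs: RankedRelevance[], k: number): number {
  const hits = runs.filter((r) => r.slice(0, k).some(Boolean)).length;
  return hits / runs.length;
}

// Mean reciprocal rank of the first relevant result within the top k (0 if none).
function mrrAtK(runs: RankedRelevance[], k: number): number {
  const total = runs.reduce((sum, r) => {
    const idx = r.slice(0, k).indexOf(true);
    return sum + (idx === -1 ? 0 : 1 / (idx + 1));
  }, 0);
  return total / runs.length;
}

// Mean share of the top k slots that hold a relevant result.
// Assumption: missing slots (fewer than k results) count as non-relevant.
function precisionAtK(runs: RankedRelevance[], k: number): number {
  const total = runs.reduce(
    (sum, r) => sum + r.slice(0, k).filter(Boolean).length / k,
    0,
  );
  return total / runs.length;
}

// Hypothetical toy data: 4 queries, top-3 results each (not the release's corpus).
const toyRuns: RankedRelevance[] = [
  [true, false, false],
  [false, true, true],
  [false, false, false],
  [true, true, false],
];

console.log(hitRateAtK(toyRuns, 1)); // 0.5
console.log(hitRateAtK(toyRuns, 3)); // 0.75
console.log(mrrAtK(toyRuns, 3)); // 0.625
console.log(precisionAtK(toyRuns, 3).toFixed(3)); // 0.417
```

The toy data shows how the metrics can disagree: the second query misses at rank 1 but still contributes two relevant results to precision@3, the same pattern as the rerank column above.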

The cross-encoder trade-off, now visible

The cross-encoder optimises for filling the top 3 with relevant docs (precision@3 0.40 → 0.67),
while hybrid alone optimises for putting THE right doc first (top-1 90%, versus 80% with rerank).
Neither is universally better: it depends on whether the caller wants the single
best match or a relevant set.

Default `rerank: false` is still correct for top-1-first callers; opt-in
`rerank: true` is now better-documented for richer top-K consumers.

Why the regex was wrong

  • Under-reporting: for "self-learning wiring task-completed pretrain", the regex missed the issue title "Self-learning reports success but persists nothing", which is exactly the right answer, because no hyphenation variant matched.
  • Over-reporting: for "how was the Opus model alias fixed", the regex matched the release-bump commit "chore(release): bump 3.10.10 → 3.10.11 (4-issue bug cluster)" because its body mentioned Opus, but the release commit isn't the work.

What changed in code

  1. QUERIES array gains expectedSubstrings: string[] — hand-curated labels per query, encoded directly in the bench script.
  2. isRelevant(name, substrings) helper — case-insensitive substring match (any-of semantics).
  3. ndcgAtK(rankedRelevance, k) — standard binary-relevance nDCG with ideal-DCG normalisation. Smoke-checked against canonical fixture: [T,T,T]→1.0, [F,F,T]→0.5, [F,T,F]→0.631, [T,F,T]→0.920.
  4. 6 new metrics in summary JSON + console output (label_top1HitRate, label_top3HitRate, label_mrr3, label_precision3, label_ndcg3, label_ndcg5).
  5. Regex proxy metrics preserved under "regex proxy" labels so historical ADR 077-080 numbers stay reproducible.
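Items 1 to 3 above can be sketched as follows. This is a minimal reconstruction from the descriptions in this changelog, not the bench script itself. The `BenchQuery` interface name and the sample label `"persists nothing"` are assumptions for illustration, and the ideal DCG here is computed from the relevant results found in the top k, which is one reading consistent with the quoted fixture.

```typescript
// Hypothetical shape of a QUERIES entry with hand-curated labels.
interface BenchQuery {
  query: string;
  expectedSubstrings: string[];
}

// Case-insensitive substring match with any-of semantics.
function isRelevant(name: string, substrings: string[]): boolean {
  const haystack = name.toLowerCase();
  return substrings.some((s) => haystack.includes(s.toLowerCase()));
}

// Binary-relevance nDCG@k: gain 1 for a relevant result, discounted by log2(rank + 1).
function ndcgAtK(rankedRelevance: boolean[], k: number): number {
  const topK = rankedRelevance.slice(0, k);
  const dcg = topK.reduce(
    (sum, rel, i) => sum + (rel ? 1 / Math.log2(i + 2) : 0),
    0,
  );
  // Ideal ordering: all relevant results found in the top k placed first.
  const relevantCount = topK.filter(Boolean).length;
  let idcg = 0;
  for (let i = 0; i < relevantCount; i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}

// Illustrative entry; the label is an assumption, not the script's real label.
const sampleQuery: BenchQuery = {
  query: "self-learning wiring task-completed pretrain",
  expectedSubstrings: ["persists nothing"],
};

console.log(
  isRelevant(
    "Self-learning reports success but persists nothing",
    sampleQuery.expectedSubstrings,
  ),
); // true

// The fixture quoted in item 3.
console.log(ndcgAtK([true, true, true], 3).toFixed(3)); // 1.000
console.log(ndcgAtK([false, false, true], 3).toFixed(3)); // 0.500
console.log(ndcgAtK([false, true, false], 3).toFixed(3)); // 0.631
console.log(ndcgAtK([true, false, true], 3).toFixed(3)); // 0.920
```

Returning 0 when no result is relevant avoids a division by zero and matches the all-zero cosine column in the results table.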

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# Pretrain (415 patterns)
node v3/@claude-flow/cli/scripts/pretrain-from-github.mjs

# All four configs through the labelled bench
( cd v3/@claude-flow/cli && {
  HYBRID=0 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
  BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
  RERANK=1 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
})

Honest limits (acknowledged in ADR)

  • N=10 queries is still small; 50-200 would tighten confidence intervals.
  • Binary relevance — graded scheme (exact=3, close=2, related=1) would distinguish "perfect" from "passable".
  • Single annotator — I curated the labels; inter-annotator agreement is a nice-to-have.
  • No truly held-out test split — labels were authored after seeing outputs, so subsequent tuning against this set has confirmation bias risk. New queries are the right next step.
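To put a number on the first limit, the sketch below computes a 95% Wilson score interval for a hit rate. The Wilson interval is a choice made here for illustration; the ADR may use a different method. The 180/200 case is hypothetical and assumes the same 90% rate would hold on a larger corpus.

```typescript
// 95% Wilson score interval for a binomial proportion (z = 1.96 by default).
function wilsonInterval(
  successes: number,
  n: number,
  z: number = 1.96,
): [number, number] {
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const centre = (p + z2 / (2 * n)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [centre - half, centre + half];
}

// 9 hits out of 10 queries, as in the hybrid top-1 result.
const [low10, high10] = wilsonInterval(9, 10);
console.log(low10.toFixed(2), high10.toFixed(2)); // 0.60 0.98

// Hypothetical: the same 90% rate measured on 200 queries.
const [low200, high200] = wilsonInterval(180, 200);
console.log(low200.toFixed(2), high200.toFixed(2)); // 0.85 0.93
```

At N=10 the interval for a 90% hit rate spans roughly 0.60 to 0.98, so the 90% versus 80% top-1 gap between configs sits well inside the noise.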

What's next

  • Larger labelled corpus (50-200 queries)
  • Graded relevance
  • Larger cross-encoder (ms-marco-MiniLM-L-12-v2) if quality > latency
  • Learned distiller (#2241 round-D)
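For the graded-relevance item, one possible shape is an nDCG variant that takes numeric grades instead of booleans. The sketch below assumes the scale mentioned under "Honest limits" (exact=3, close=2, related=1, irrelevant=0) and a linear gain. The function name `gradedNdcgAtK` is hypothetical and nothing like it ships in this release.

```typescript
// Graded nDCG@k with linear gain: each grade is discounted by log2(rank + 1).
// Assumption: the ideal ordering sorts the same returned grades descending.
function gradedNdcgAtK(rankedGrades: number[], k: number): number {
  const dcg = (grades: number[]): number =>
    grades.slice(0, k).reduce((sum, g, i) => sum + g / Math.log2(i + 2), 0);
  const ideal = [...rankedGrades].sort((a, b) => b - a);
  const idcg = dcg(ideal);
  return idcg === 0 ? 0 : dcg(rankedGrades) / idcg;
}

// Exact match ranked first, close match second.
console.log(gradedNdcgAtK([3, 2, 0], 3).toFixed(3)); // 1.000

// Close match ranked above the exact match.
console.log(gradedNdcgAtK([2, 3, 0], 3).toFixed(3)); // 0.913
```

Under binary relevance both rankings collapse to [T, T, F] and score 1.0, which is the "perfect" versus "passable" distinction the limits section says is currently lost.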

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-081-labelled-corpus-and-ndcg.md

About claude-flow

Deploy multi-agent swarms with coordinated workflows.
