claude-flow

v3.10.21 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 1mo AI Agents & Assistants


✓ No known CVEs patched


Topics

agentic-ai agentic-framework agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex harness mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

Replaces the regex relevance proxy in the retrieval benchmark with a labelled corpus and nDCG/precision metrics, and documents the remaining measurement limits and next steps.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	Add `expectedSubstrings` array to each query in QUERIES for hand-curated labels.	llm_adapter@2026-05-30	high	—
Feature	Medium	Introduce `isRelevant(name, substrings)` helper for case-insensitive substring relevance checks.	llm_adapter@2026-05-30	high	—
Feature	Medium	Implement `ndcgAtK(rankedRelevance, k)` function to compute normalized DCG for ranking metrics.	llm_adapter@2026-05-30	high	—
Feature	Medium	Add six new labelled metrics (label_top1HitRate, label_top3HitRate, label_mrr3, label_precision3, label_ndcg3, label_ndcg5) to summary JSON and console output.	llm_adapter@2026-05-30	high	—
Feature	Medium	Preserve existing regex-proxy metrics under "regex proxy" labels for reproducibility of ADR 077-080 numbers.	llm_adapter@2026-05-30	high	—
Feature	Low	Document that default `rerank: false` suits top-1-first callers; opt-in `rerank: true` benefits richer top-K consumers.	granite4.1:30b@2026-05-30-audit	low	—
Bugfix	Medium	Document the cross-encoder trade-off: enabling rerank raises precision@3 from 0.40 to 0.67 while lowering the top-1 hit rate from 90% to 80%.	llm_adapter@2026-05-30	low	—
Bugfix	Medium	Report the latency cost: the hybrid config averages ~42 ms per query, up from 29 ms for cosine in 3.10.17.	llm_adapter@2026-05-30	low	—
Refactor	Low	Replace regex relevance proxy with labelled held-out corpus and nDCG/precision metrics as the canonical measurement method.	granite4.1:30b@2026-05-30-audit	low	—

Full changelog

What ships

Labelled held-out corpus + nDCG/precision metrics replace the regex-over-subject
relevance proxy used in ADRs 077-080. Honest SOTA needs honest measurement.

The honest-measurement finding

The regex proxy was both over- and under-reporting. When the same 4 configs run
through the labelled corpus, the truth shifts in both directions:

| Config | Regex top-1 | Labelled top-1 | Direction |
|---|---:|---:|---|
| Hybrid (3.10.19) | 80% | 90% | regex under-reported |
| Hybrid + Rerank (3.10.20) | 90% | 80% | regex over-reported |

Real numbers (labelled metric, the new canonical)

| Metric | 3.10.17 cosine | 3.10.19 hybrid | 3.10.20 +rerank |
|---|---:|---:|---:|
| Label top-1 hit rate | 0% | 90% | 80% |
| Label top-3 hit rate | 0% | 90% | 100% |
| Label MRR@3 | 0.000 | 0.900 | 0.883 |
| Label precision@3 | 0.000 | 0.400 | 0.667 |
| Label nDCG@3 | 0.000 | 0.900 | 0.913 |
| Avg query latency | 29 ms | 42 ms | 977 ms (opt-in) |
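
For readers who want to sanity-check the table, here is a minimal sketch of how hit rate, MRR@3 and precision@3 could be computed from per-query relevance vectors. The function names and the `example` data are illustrative assumptions, not the bench script's code. The example ranks (first hit at rank 1 for eight queries, rank 2 for one, rank 3 for one) are chosen because they reproduce the top-1, top-3 and MRR@3 values in the +rerank column; the precision value of the example is not meant to match the table.

```typescript
// Per-query relevance of the ranked results, best-ranked first (true = relevant).
type RankedRelevance = boolean[];

// Fraction of queries with at least one relevant result in the top k.
function hitRateAtK(queries: RankedRelevance[], k: number): number {
  const hits = queries.filter((q) => q.slice(0, k).some(Boolean)).length;
  return hits / queries.length;
}

// Mean reciprocal rank of the first relevant result within the top k (0 if none).
function mrrAtK(queries: RankedRelevance[], k: number): number {
  const total = queries.reduce((sum, q) => {
    const rank = q.slice(0, k).indexOf(true); // -1 when nothing relevant in top k
    return sum + (rank === -1 ? 0 : 1 / (rank + 1));
  }, 0);
  return total / queries.length;
}

// Mean fraction of the top k results that are relevant.
function precisionAtK(queries: RankedRelevance[], k: number): number {
  const total = queries.reduce(
    (sum, q) => sum + q.slice(0, k).filter(Boolean).length / k,
    0
  );
  return total / queries.length;
}

// Illustrative data only: ten queries, first hit at rank 1 (x8), rank 2 (x1), rank 3 (x1).
const example: RankedRelevance[] = [];
for (let i = 0; i < 8; i++) example.push([true, false, false]);
example.push([false, true, false]);
example.push([false, false, true]);

const top1 = hitRateAtK(example, 1); // 8 of 10 queries
const top3 = hitRateAtK(example, 3); // 10 of 10 queries
const mrr3 = mrrAtK(example, 3);     // (8 * 1 + 1/2 + 1/3) / 10
```

Working the MRR by hand gives (8 + 0.5 + 0.333) / 10 ≈ 0.883, which is the value reported for 3.10.20 +rerank.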

The cross-encoder trade-off, now visible

The cross-encoder optimises for finding all relevant docs (precision@3 0.40 → 0.67 with rerank),
while hybrid alone optimises for finding THE right doc first (top-1 90%, versus 80% with rerank).
Neither is universally better: it depends on whether the caller wants the single
best match or a relevant set.

Default `rerank: false` is still correct for top-1-first callers; opt-in
`rerank: true` is now better-documented for richer top-K consumers.

Why the regex was wrong

Under-reporting: for "self-learning wiring task-completed pretrain", the regex missed the issue title "Self-learning reports success but persists nothing" (exactly the right answer) because no hyphenation variant matched.
Over-reporting: for "how was the Opus model alias fixed", the regex matched the release-bump commit `chore(release): bump 3.10.10 → 3.10.11 (4-issue bug cluster)` because its body mentioned Opus, but the release commit isn't the work.
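
To make the two failure modes concrete, here is a minimal sketch of a label check over result names, following the any-of, case-insensitive substring semantics this release describes for `isRelevant`. The body of the function and both label arrays are assumptions for illustration; the real `expectedSubstrings` values live in the bench script and are not reproduced here.

```typescript
// Any-of, case-insensitive substring match against a result's name (title or subject line).
function isRelevant(name: string, substrings: string[]): boolean {
  const haystack = name.toLowerCase();
  return substrings.some((s) => haystack.indexOf(s.toLowerCase()) !== -1);
}

// Hypothetical labels, invented for this example.
const selfLearningLabels = ["reports success but persists nothing"];
const opusAliasLabels = ["opus model alias"];

// Names quoted from the two cases above.
const issueTitle = "Self-learning reports success but persists nothing";
const releaseCommit = "chore(release): bump 3.10.10 → 3.10.11 (4-issue bug cluster)";

// The issue title is counted as a hit because a label is a substring of its name.
const underReportedCase = isRelevant(issueTitle, selfLearningLabels);

// The release commit is rejected: only its body mentioned Opus, and the name is all that is checked.
const overReportedCase = isRelevant(releaseCommit, opusAliasLabels);
```

Matching on the name alone is what keeps the release-bump commit out; a check that also searched commit bodies would presumably reintroduce the over-reporting.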

What changed in code

`QUERIES` array gains `expectedSubstrings: string[]`: hand-curated labels per query, encoded directly in the bench script.
`isRelevant(name, substrings)` helper: case-insensitive substring match (any-of semantics).
`ndcgAtK(rankedRelevance, k)`: standard binary-relevance nDCG with ideal-DCG normalisation. Smoke-checked against canonical fixture: [T,T,T]→1.0, [F,F,T]→0.5, [F,T,F]→0.631, [T,F,T]→0.920.
6 new metrics in summary JSON + console output (`label_top1HitRate`, `label_top3HitRate`, `label_mrr3`, `label_precision3`, `label_ndcg3`, `label_ndcg5`).
Regex proxy metrics preserved under "regex proxy" labels so historical ADR 077-080 numbers stay reproducible.
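
A minimal sketch of binary-relevance nDCG that reproduces the four fixture values quoted above. It is not the bench script's implementation: `dcgAtK` is a helper name introduced here, and the ideal ranking is assumed to be the same result list with every relevant item moved to the front.

```typescript
// DCG over the first k positions with binary gain:
// a relevant hit at 1-based rank r contributes 1 / log2(r + 1).
function dcgAtK(rankedRelevance: boolean[], k: number): number {
  let dcg = 0;
  const limit = Math.min(k, rankedRelevance.length);
  for (let i = 0; i < limit; i++) {
    // Math.LN2 / Math.log(x) equals 1 / log2(x); rank r = i + 1, so x = i + 2.
    if (rankedRelevance[i]) dcg += Math.LN2 / Math.log(i + 2);
  }
  return dcg;
}

function ndcgAtK(rankedRelevance: boolean[], k: number): number {
  // Assumption: ideal order = this list sorted so that relevant items come first.
  const ideal = rankedRelevance.slice().sort((a, b) => Number(b) - Number(a));
  const idealDcg = dcgAtK(ideal, k);
  // A list with no relevant item scores 0 rather than dividing by zero.
  return idealDcg === 0 ? 0 : dcgAtK(rankedRelevance, k) / idealDcg;
}

// The fixture from the changelog, rounded to three decimals.
const T = true;
const F = false;
const fixture = [
  ndcgAtK([T, T, T], 3).toFixed(3),
  ndcgAtK([F, F, T], 3).toFixed(3),
  ndcgAtK([F, T, F], 3).toFixed(3),
  ndcgAtK([T, F, T], 3).toFixed(3),
];
```

The last case shows the normalisation at work: [T,F,T] earns 1 + 0.5 = 1.5, the ideal order [T,T,F] earns 1 + 0.631 = 1.631, and 1.5 / 1.631 ≈ 0.920.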

Reproduce

git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )

# Pretrain (415 patterns)
node v3/@claude-flow/cli/scripts/pretrain-from-github.mjs

# All four configs through the labelled bench
( cd v3/@claude-flow/cli && {
  HYBRID=0 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
  BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
  RERANK=1 BENCH_NO_WRITE=1 node scripts/benchmark-pretrained-retrieval.mjs | grep -E "^(Top|MRR|Precision|nDCG)"
})

Honest limits (acknowledged in ADR)

N=10 queries is still small; 50-200 would tighten confidence intervals.
Binary relevance — graded scheme (exact=3, close=2, related=1) would distinguish "perfect" from "passable".
Single annotator — I curated the labels; inter-annotator agreement is a nice-to-have.
No truly held-out test split — labels were authored after seeing outputs, so subsequent tuning against this set has confirmation bias risk. New queries are the right next step.
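
To put a rough number on the N=10 point, here is a sketch that computes a Wilson score interval for a hit rate. It is illustrative only: `wilsonInterval` is a name introduced here, the ADR may use a different interval method, and the N=100 case is hypothetical.

```typescript
// Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95% coverage).
function wilsonInterval(hits: number, n: number, z = 1.96): [number, number] {
  const p = hits / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const centre = (p + z2 / (2 * n)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n))) / denom;
  return [centre - half, centre + half];
}

const small = wilsonInterval(9, 10);   // 90% top-1 observed on N=10
const large = wilsonInterval(90, 100); // the same rate on a hypothetical N=100
```

With these inputs the N=10 interval spans roughly 0.60 to 0.98, while the N=100 interval narrows to roughly 0.83 to 0.94. On ten queries, the 90% and 80% top-1 figures above are therefore hard to separate statistically.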

What's next

Larger labelled corpus (50-200 queries)
Graded relevance
Larger cross-encoder (ms-marco-MiniLM-L-12-v2) if quality > latency
Learned distiller (#2241 round-D)
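
As a sketch of what graded relevance could look like, the function below applies the exact=3, close=2, related=1 scheme from the limits section. It is hypothetical: nothing like it ships in this release, `gradedNdcgAtK` is a name introduced here, and using the grade directly as the gain (rather than, say, 2^grade - 1) is an assumption.

```typescript
// 0 = irrelevant, 1 = related, 2 = close, 3 = exact (scheme from the limits section).
type Grade = 0 | 1 | 2 | 3;

function gradedNdcgAtK(grades: Grade[], k: number): number {
  // DCG with linear gain: a result with grade g at 1-based rank r adds g / log2(r + 1).
  const dcg = (list: Grade[]): number => {
    let total = 0;
    for (let i = 0; i < Math.min(k, list.length); i++) {
      total += (list[i] * Math.LN2) / Math.log(i + 2);
    }
    return total;
  };
  // Assumption: ideal order = the same grades sorted from highest to lowest.
  const ideal = grades.slice().sort((a, b) => b - a);
  const idealDcg = dcg(ideal);
  return idealDcg === 0 ? 0 : dcg(grades) / idealDcg;
}

const exactFirst = gradedNdcgAtK([3, 1, 0], 3);   // exact match ranked first
const relatedFirst = gradedNdcgAtK([1, 3, 0], 3); // exact match demoted to rank 2
```

Under binary relevance both lists reduce to [T,T,F] and score 1.0. With grades, the second list drops to about 0.80, which is the "perfect" versus "passable" distinction the limits section asks for.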

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-081-labelled-corpus-and-ndcg.md
