
claude-flow

v3.10.30 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

agentic-ai agentic-framework agentic-rag agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

Adds SciDocs as a fourth BEIR dataset, compares results against published baselines of up to 1.2B parameters, and documents the benchmark's honest limits.

Changes in this release

Feature Low

Adds SciDocs dataset to BEIR suite.


Source: llm_adapter@2026-05-31

Confidence: high

Feature Low

Updates installation command to npx [email protected].


Source: llm_adapter@2026-05-31

Confidence: high

Performance Medium

Dense retrieval alone (BGE-base) achieves nDCG@10 of 0.211 on SciDocs, ranking 2 of 11.

Source: llm_adapter@2026-05-31

Confidence: high

Performance Medium

Lucene RRF without rerank scores nDCG@10 of 0.203 on SciDocs, 0.008 below dense alone.

Source: llm_adapter@2026-05-31

Confidence: high

Full changelog

What ships

4th BEIR dataset (SciDocs) joins NFCorpus + SciFact + ArguAna. New finding: no single pipeline wins everywhere.

SciDocs results

| Pipeline | nDCG@10 | Rank / note |
|---|---:|---|
| dense alone (BGE-base) | 0.211 | 2/11 |
| Lucene RRF (no rerank) | 0.203 | -0.008 vs. dense alone (RRF hurt) |

Dense alone trails only BGE-large (335M, 0.225). It beats BM25, GTR-XL (1.2B), and every other published baseline.
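For readers unfamiliar with RRF (reciprocal rank fusion), here is a minimal sketch of the standard technique. It is not ruflo's implementation, which this release does not show. It assumes that "Lucene RRF" means fusing a Lucene BM25 ranking with the dense ranking, and it uses the common default `k=60`. The function name `rrf_fuse` and the toy document ids are hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings, not taken from SciDocs.
dense = ["d1", "d2", "d3"]
bm25 = ["d2", "d4", "d1"]
fused = rrf_fuse([dense, bm25])
print(fused)
```

In this toy case the dense ranker's top hit `d1` drops to second place after fusion, because `d2` ranks highly in both lists. That is the kind of effect that can lower nDCG@10 when the dense ranking was already the better one.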

4-dataset mean leaderboard

| System | Params | NFCorpus | SciFact | ArguAna | SciDocs | Mean |
|---|---:|---:|---:|---:|---:|---:|
| BGE-large (published) | 335M | 0.380 | 0.722 | 0.636 | 0.225 | 0.491 |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.521 | 0.159 | 0.433 |
| ruflo best (per-dataset) | 110M | 0.358 | 0.683 | 0.432 | 0.211 | 0.421 |
| GTR-XL (1.2B) | 1.2B | 0.343 | 0.662 | 0.439 | 0.174 | 0.405 |
| GenQ | 110M | 0.319 | 0.644 | 0.493 | 0.143 | 0.400 |
| BM25 (Lucene published) | — | 0.325 | 0.679 | 0.397 | 0.158 | 0.390 |

Rank 3 of 11 on the 4-dataset mean. Beats GTR-XL with roughly one-tenth the params (110M vs 1.2B). Loses only to SPLADE++ (-0.012, basically tied) and BGE-large (-0.070, mostly the ArguAna gap).
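The Mean column and the gap to BGE-large can be rechecked with a few lines of arithmetic. This sketch copies the per-dataset scores from the leaderboard table above. That table lists only six of the eleven systems, so the ordering here covers those six only.

```python
# nDCG@10 per dataset, copied from the leaderboard table
# (order: NFCorpus, SciFact, ArguAna, SciDocs).
scores = {
    "BGE-large (published)": [0.380, 0.722, 0.636, 0.225],
    "SPLADE++ (published)": [0.347, 0.704, 0.521, 0.159],
    "ruflo best (per-dataset)": [0.358, 0.683, 0.432, 0.211],
    "GTR-XL (1.2B)": [0.343, 0.662, 0.439, 0.174],
    "GenQ": [0.319, 0.644, 0.493, 0.143],
    "BM25 (Lucene published)": [0.325, 0.679, 0.397, 0.158],
}


def mean(values):
    return sum(values) / len(values)


# Order systems by their unweighted 4-dataset mean, best first.
leaderboard = sorted(scores, key=lambda name: mean(scores[name]), reverse=True)
for position, name in enumerate(leaderboard, start=1):
    print(f"{position}. {name}: {mean(scores[name]):.3f}")

# How much of the mean gap to BGE-large comes from ArguAna alone?
ruflo = scores["ruflo best (per-dataset)"]
bge = scores["BGE-large (published)"]
gap = mean(bge) - mean(ruflo)
arguana_share = (bge[2] - ruflo[2]) / len(ruflo)
print(f"gap={gap:.3f}, of which ArguAna={arguana_share:.3f}")
```

The ArguAna difference (0.636 vs 0.432) contributes about 0.051 of the roughly 0.070 mean gap, which is what "mostly the ArguAna gap" refers to.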

The config-divergence finding

After 4 datasets, no single pipeline wins everywhere:

| Dataset | Best config | What hurts |
|---|---|---|
| NFCorpus | Lucene + RRF + CE rerank | nothing |
| SciFact | Lucene + RRF + CE rerank | nothing |
| ArguAna | Lucene + RRF (no CE) | CE rerank actively hurts |
| SciDocs | dense alone | RRF hurt by 0.008 |

The four datasets split across three different best configs: only NFCorpus and SciFact agree. Auto-pipeline-selection would need a per-corpus calibrator (cheap, doesn't need GPU; tracked).
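One way such a per-corpus calibrator could work is sketched below. This is a guess at the idea, not the tracked design. It scores each candidate config by nDCG@10 on a small labelled query sample and keeps the best one. The names `ndcg_at_k` and `pick_config`, the data shapes, and the toy sample are all hypothetical. The nDCG here is the linear-gain variant with a log2 discount.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain: sum(rel / log2(rank + 1)), normalised by the ideal ordering."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def pick_config(runs, qrels, k=10):
    """Return (best_config_name, mean_scores) over a small labelled query sample.

    runs:  {config_name: {query_id: [doc ids, best first]}}
    qrels: {query_id: {doc_id: graded relevance}}, assumed non-empty
    """
    mean_scores = {}
    for name, per_query in runs.items():
        per_query_scores = [
            ndcg_at_k(per_query.get(qid, []), rels, k) for qid, rels in qrels.items()
        ]
        mean_scores[name] = sum(per_query_scores) / len(per_query_scores)
    best = max(mean_scores, key=mean_scores.get)
    return best, mean_scores


# Hypothetical calibration sample: two queries, two candidate configs.
qrels = {"q1": {"a": 1}, "q2": {"b": 1}}
runs = {
    "dense alone": {"q1": ["a", "x"], "q2": ["b", "y"]},
    "Lucene + RRF": {"q1": ["x", "a"], "q2": ["b", "y"]},
}
best, mean_scores = pick_config(runs, qrels)
print(best, mean_scores)
```

Because this only re-scores rankings that have already been produced, the selection step itself needs no GPU, which fits the "cheap" note above. How many labelled queries a corpus needs for a stable choice is left open here.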

Honest limits

  • 4/18 BEIR datasets. The 0.421 mean is suggestive, not a full BEIR average.
  • Zero-shot: the NFCorpus and ArguAna train splits remain unused.
  • Five larger BEIR datasets (TREC-COVID, FiQA, HotpotQA, NQ, DBPedia, all >50k docs) remain GPU-gated.

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-091-scidocs-and-config-divergence.md


About claude-flow

Deploy multi-agent swarms with coordinated workflows.
