claude-flow

v3.10.30 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 1mo AI Agents & Assistants


✓ No known CVEs patched


Topics

agentic-ai agentic-framework agentic-workflow agents ai-agents ai-assistant ai-coding ai-skills autonomous-agents claude-code codex harness mcp-server multi-agent multi-agent-systems npm skills swarm swarm-intelligence typescript

Summary

AI summary

Adds the SciDocs dataset to the BEIR suite, reports SciDocs results against baselines of up to 1.2B parameters, and documents the benchmark's honest limits.

Changes in this release

| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Adds SciDocs dataset to BEIR suite. | — |
| Feature | Low | Updates installation command to npx [email protected]. | — |
| Performance | Medium | Dense alone (BGE-base) achieves nDCG@10 0.211 on SciDocs, ranking 2/11. | — |
| Performance | Medium | Lucene RRF without rerank scores 0.203 on SciDocs, a slight regression (-0.008). | — |

All entries: Source: llm_adapter@2026-05-31, Confidence: high.

Full changelog

What ships

4th BEIR dataset (SciDocs) joins NFCorpus + SciFact + ArguAna. New finding: no single pipeline wins everywhere.

SciDocs results

| Pipeline | nDCG@10 | Rank / note |
|---|---:|---|
| dense alone (BGE-base) | 0.211 | 2/11 |
| Lucene RRF (no rerank) | 0.203 | -0.008 vs. dense alone (RRF hurt) |

Dense alone trails only BGE-large (335M, 0.225). It beats BM25, GTR-XL (1.2B), and every other published baseline.
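
To make the RRF comparison concrete, here is a minimal Python sketch of reciprocal rank fusion and nDCG@10 on a toy query. It is an illustration, not ruflo's implementation: the constant `k=60` is the commonly used RRF default and an assumption here, and the document IDs, rankings, and relevance labels are made up. It shows how fusing a strong dense ranking with a weaker lexical one can push a relevant document down and lower nDCG@10.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_k(ranking, relevance, k=10):
    """nDCG@k with linear gain and log2 position discount."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranking[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy query: d1 and d2 are the relevant documents.
relevance = {"d1": 1, "d2": 1}
dense = ["d1", "d2", "d3", "d4"]      # dense retriever ranks both relevant docs on top
lexical = ["d4", "d3", "d1", "d2"]    # lexical retriever ranks them last

fused = rrf_fuse([dense, lexical])
print("fused order:", fused)
print("dense nDCG@10:", round(ndcg_at_k(dense, relevance), 3))
print("fused nDCG@10:", round(ndcg_at_k(fused, relevance), 3))
```

In this toy case the fused list moves `d2` from position 2 to position 4, so the fused nDCG@10 is lower than the dense-only score. That is the same direction of effect as the SciDocs result, though the real cause there is not established by this sketch.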

4-dataset mean leaderboard

| System | Params | NFCorpus | SciFact | ArguAna | SciDocs | Mean |
|---|---:|---:|---:|---:|---:|---:|
| BGE-large (published) | 335M | 0.380 | 0.722 | 0.636 | 0.225 | 0.491 |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.521 | 0.159 | 0.433 |
| ruflo best (per-dataset) | 110M | 0.358 | 0.683 | 0.432 | 0.211 | 0.421 |
| GTR-XL (1.2B) | 1.2B | 0.343 | 0.662 | 0.439 | 0.174 | 0.405 |
| GenQ | 110M | 0.319 | 0.644 | 0.493 | 0.143 | 0.400 |
| BM25 (Lucene published) | — | 0.325 | 0.679 | 0.397 | 0.158 | 0.390 |

ruflo ranks 3 of 11 on the 4-dataset mean. It beats GTR-XL with roughly one-tenth the parameters (110M vs. 1.2B) and loses only to SPLADE++ (-0.012, basically tied) and BGE-large (-0.070, mostly the ArguAna gap).
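
The means and gaps quoted for the leaderboard can be recomputed from the per-dataset scores. The sketch below is a plain arithmetic check in Python, using only the six systems listed in the table (the remaining systems of the 11 are not shown in this release note, so they are omitted). The variable and function names are illustrative.

```python
# Per-dataset nDCG@10 in the order: NFCorpus, SciFact, ArguAna, SciDocs.
scores = {
    "BGE-large (published)": [0.380, 0.722, 0.636, 0.225],
    "SPLADE++ (published)": [0.347, 0.704, 0.521, 0.159],
    "ruflo best (per-dataset)": [0.358, 0.683, 0.432, 0.211],
    "GTR-XL (1.2B)": [0.343, 0.662, 0.439, 0.174],
    "GenQ": [0.319, 0.644, 0.493, 0.143],
    "BM25 (Lucene published)": [0.325, 0.679, 0.397, 0.158],
}


def mean_leaderboard(per_dataset):
    """Return (system, unweighted mean) pairs sorted by mean, best first."""
    rows = [(name, sum(vals) / len(vals)) for name, vals in per_dataset.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


board = mean_leaderboard(scores)
for position, (name, mean) in enumerate(board, start=1):
    print(f"{position}. {name}: {mean:.3f}")

# Share of the gap to BGE-large that comes from ArguAna alone.
ruflo = scores["ruflo best (per-dataset)"]
bge = scores["BGE-large (published)"]
total_gap = sum(bge) / 4 - sum(ruflo) / 4
arguana_gap = (bge[2] - ruflo[2]) / 4
print(f"gap to BGE-large: {total_gap:.3f}, of which ArguAna: {arguana_gap:.3f}")
```

The ArguAna term accounts for about 0.051 of the roughly 0.070 mean gap, which is consistent with describing the deficit as mostly the ArguAna gap.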

The config-divergence finding

After 4 datasets, no single pipeline wins everywhere:

| Dataset | Best config | What hurts |
|---|---|---|
| NFCorpus | Lucene + RRF + CE rerank | nothing |
| SciFact | Lucene + RRF + CE rerank | nothing |
| ArguAna | Lucene + RRF (no CE) | CE rerank actively hurts |
| SciDocs | dense alone | RRF hurt by 0.008 |

The four datasets settle on three distinct best configs; only NFCorpus and SciFact agree. Auto-pipeline-selection would need a per-corpus calibrator (cheap, no GPU needed; tracked).
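
One way such a per-corpus calibrator could work is sketched below in Python. This is a hypothetical design, not the tracked implementation: the function `pick_config`, the `min_gain` threshold, and the idea of scoring each candidate config on a small calibration sample are all assumptions made for illustration. The SciDocs numbers are the full-benchmark scores from this release, used here only as stand-ins for calibration scores.

```python
def pick_config(calib_scores, default, min_gain=0.005):
    """Pick a pipeline config for one corpus.

    calib_scores maps config name -> nDCG@10 measured on a calibration sample.
    The default config is kept unless another config beats it by at least
    min_gain, which guards against switching on noise-level differences.
    """
    best = max(calib_scores, key=calib_scores.get)
    baseline = calib_scores.get(default, float("-inf"))
    if best != default and calib_scores[best] - baseline >= min_gain:
        return best
    return default


# Stand-in calibration scores for SciDocs (values taken from the results table).
scidocs = {"dense alone": 0.211, "Lucene + RRF (no CE)": 0.203}

print(pick_config(scidocs, default="Lucene + RRF (no CE)"))
print(pick_config(scidocs, default="Lucene + RRF (no CE)", min_gain=0.01))
```

With the assumed 0.005 threshold the 0.008 gap is enough to switch SciDocs to dense alone; with a stricter 0.01 threshold the default is kept. Choosing that threshold is the actual calibration problem and is left open here.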

Honest limits

Only 4 of 18 BEIR datasets are covered, so the 0.421 mean is suggestive, not a BEIR average.
All runs are zero-shot: the NFCorpus and ArguAna train splits remain unused.
The 5 biggest BEIR datasets (TREC-COVID, FiQA, HotpotQA, NQ, DBPedia; all >50k docs) remain GPU-gated.

Install

npx [email protected]    # latest / alpha / v3alpha all aligned

Full ADR: v3/docs/adr/ADR-091-scidocs-and-config-divergence.md


About claude-flow

Deploy multi-agent swarms with coordinated workflows.
