This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Adds BEIR-MATRIX benchmark grid documentation in docs/benchmarks/BEIR-MATRIX.md | None |
| Feature | Low | Adds paired bootstrap significance script scripts/beir-bootstrap-significance.mjs with 10K resamples and mulberry32 seed=42 | None |
| Feature | Low | Adds perQuery metrics to every BEIR run JSON for external verification | None |
| Feature | Low | Updates install command to npx [email protected] for latest/alpha alignment | None |
| Dependency | Low | Files upstream ruvector issues #523 (API contract bugs) and #524 (bundle BGE-base/small) | None |
| Bugfix | Medium | Fixes silent-fallback bug described in ADR-086 and implements bootstrap method | None |

Source for all entries: llm_adapter@2026-05-30 (confidence: high).
Full changelog
What ships
Acted on the user's "release hype → benchmark infrastructure" feedback: this release is the infrastructure, not the rank.
Honest two-dataset picture
| Dataset | nDCG@10 | 95% CI | Rank | vs BM25 |
|---|---:|---|---:|---|
| NFCorpus | 0.352 | [0.317, 0.387] | 2/11 | +0.027 (n.s.) |
| SciFact | 0.626 | [0.577, 0.672] | 10/11 | -0.053 (p<0.05) — significant LOSS |
The user's acceptance test ("ruflo beats BM25 on both") fails on SciFact — we significantly lose to BM25 by 0.053. On NFCorpus, only SBERT msmarco and ColBERT are statistically significant wins; the gaps to SPLADE++/GTR-XL/BM25 are within CI overlap.
Two-dataset mean: ours 0.489, BM25 0.502, SPLADE++ 0.526, BGE-large 0.551. Below BM25 on the mean. BGE-base zero-shot is competent on NFCorpus (medical IR), weak on SciFact (fact-verification favours lexical retrieval). The NFCorpus rank-2 is real but not representative.
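The two-dataset means for ruflo and BM25 can be rechecked from the table alone, since each BM25 score is our nDCG@10 minus the reported delta. A small sketch of that arithmetic follows; the variable names are illustrative and not taken from the repository.

```javascript
// nDCG@10 values and "vs BM25" deltas copied from the two-dataset table.
const results = [
  { dataset: "NFCorpus", ours: 0.352, deltaVsBm25: 0.027 },
  { dataset: "SciFact", ours: 0.626, deltaVsBm25: -0.053 },
];

// Arithmetic mean of a non-empty array of numbers.
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

const oursMean = mean(results.map((r) => r.ours));
// BM25 score per dataset = ours - delta.
const bm25Mean = mean(results.map((r) => r.ours - r.deltaVsBm25));

console.log(oursMean.toFixed(3)); // 0.489
console.log(bm25Mean.toFixed(3)); // 0.502
```

The SPLADE++ and BGE-large means cannot be rechecked this way because their per-dataset scores are not listed here.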
What's in the box
- docs/benchmarks/BEIR-MATRIX.md: dataset × pipeline × metric grid with bootstrap CIs and pipeline disclosure
- scripts/beir-bootstrap-significance.mjs: paired bootstrap, 10K resamples, mulberry32 seed=42
- perQuery metrics now saved in every BEIR run JSON (external bootstrap verification by anyone)
- ADR-085 hedged appropriately ("TOP-2 on BEIR NFCorpus, not BEIR average", "direct dense, no fine-tune, no rerank")
- ADR-086 documents the silent-fallback bug story + bootstrap method + honest two-dataset picture
- 2 ruvector upstream issues filed: #523 (API contract bugs), #524 (bundle BGE-base/small)
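
For readers who want to see what a paired bootstrap with a seeded mulberry32 generator looks like, here is a minimal sketch. It is not the code in scripts/beir-bootstrap-significance.mjs. The function names, the percentile CI, and the two-sided p-value convention are assumptions chosen for illustration, and the shipped script may differ on any of them.

```javascript
// mulberry32: a small, widely used 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap over per-query score differences (system A minus system B).
// scoresA[i] and scoresB[i] must refer to the same query.
function pairedBootstrap(scoresA, scoresB, { resamples = 10000, seed = 42 } = {}) {
  const n = scoresA.length;
  if (n === 0 || n !== scoresB.length) {
    throw new Error("score arrays must be non-empty and of equal length");
  }
  const diffs = scoresA.map((a, i) => a - scoresB[i]);
  const observed = diffs.reduce((s, d) => s + d, 0) / n;

  const rand = mulberry32(seed);
  const means = new Float64Array(resamples);
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample n queries with replacement, keeping each A/B pair together.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means[r] = sum / n;
  }
  means.sort(); // typed arrays sort numerically by default

  // 95% percentile confidence interval on the mean difference.
  const lo = means[Math.floor(0.025 * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor(0.975 * resamples))];

  // Two-sided p-value (one common convention): twice the smaller tail mass at zero.
  let atOrBelowZero = 0;
  let atOrAboveZero = 0;
  for (const m of means) {
    if (m <= 0) atOrBelowZero++;
    if (m >= 0) atOrAboveZero++;
  }
  const p = Math.min(1, (2 * Math.min(atOrBelowZero, atOrAboveZero)) / resamples);

  return { observed, ci: [lo, hi], p };
}

// Toy usage with made-up per-query nDCG@10 values (not real benchmark data).
const toyA = [0.5, 0.6, 0.7, 0.8];
const toyB = [0.4, 0.4, 0.5, 0.5];
const toyResult = pairedBootstrap(toyA, toyB);
console.log(toyResult);
```

Because the generator is seeded, repeated calls with the same inputs return identical intervals, which is what makes third-party verification from the perQuery values practical.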
Reproduce
```bash
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )
# NFCorpus + SciFact (~30 min ingest each)
mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs
mkdir -p /tmp/beir-scifact && cd /tmp/beir-scifact
curl -sL -o sf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip' && unzip -q sf.zip
BEIR_DATA_DIR=/tmp/beir-scifact/scifact node /path/to/ruflo/v3/@claude-flow/cli/scripts/run-beir-bge.mjs
# Bootstrap significance (10k resamples, ~1s)
node /path/to/scripts/beir-bootstrap-significance.mjs /path/to/run.json
```
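
To spot-check a single per-query value after a run, nDCG@10 can be recomputed from a ranked list and the dataset's qrels. The sketch below assumes linear gain with a log2 rank discount, which is the trec_eval-style definition. The `ndcgAtK` name and the qrels shape (a plain object mapping document ID to graded relevance) are hypothetical and do not come from the repository or its run JSON.

```javascript
// nDCG@k for one query.
// rankedDocIds: document IDs in the order the system returned them.
// qrels: { [docId]: relevance grade }, where 0 or a missing key means non-relevant.
function ndcgAtK(rankedDocIds, qrels, k = 10) {
  // Position i (0-based) is rank i + 1, so the discount is log2(rank + 1) = log2(i + 2).
  const discounted = (rel, i) => rel / Math.log2(i + 2);

  const dcg = rankedDocIds
    .slice(0, k)
    .reduce((sum, id, i) => sum + discounted(qrels[id] ?? 0, i), 0);

  // Ideal ordering: all judged-relevant grades, highest first, cut at k.
  const idcg = Object.values(qrels)
    .filter((rel) => rel > 0)
    .sort((a, b) => b - a)
    .slice(0, k)
    .reduce((sum, rel, i) => sum + discounted(rel, i), 0);

  return idcg === 0 ? 0 : dcg / idcg;
}

// Toy usage with made-up judgments: a perfect ranking scores 1.
const toyQrels = { d1: 2, d2: 1 };
console.log(ndcgAtK(["d1", "d2", "d3"], toyQrels)); // 1
```

Averaging this value over all queries in a dataset gives the headline nDCG@10, assuming the run uses the same gain and discount.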
Honest limits
- Two datasets only. BEIR ships 18. The 2-dataset mean is suggestive, not definitive.
- Zero-shot. NFCorpus has a 110K-pair train split; fine-tuning on it would likely close ~0.02-0.05 nDCG.
- Single annotator on internal labels (separate from BEIR's external qrels).
- Tailscale-via-ruvultra GPU compute discussed for larger datasets (TREC-COVID, HotpotQA, NQ) — tracked.
Install
npx [email protected] # latest / alpha / v3alpha all aligned
Full ADR: v3/docs/adr/ADR-086-silent-fallback-and-bootstrap.md