This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Updates What's in the box, Next steps, and BGE-base across a mixed release.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Added `scripts/run-beir-rrf-ablation.mjs` runnable ablation harness with bootstrap CI. | None |
| Feature | Low | Added `scripts/run-beir-hybrid.mjs` for RRF plus optional cross-encoder rerank. | None |
| Bugfix | Medium | Fixed cache path hardcoding that caused SciFact to overwrite NFCorpus cache. | None |
| Refactor | Low | Updated `BEIR-MATRIX.md` with ablation rows and honest 2-dataset mean comparison. | None |

Source for all entries: llm_adapter@2026-05-31. Confidence: high.
Full changelog
What ships
Honest negative result. The textbook "lowest-regret" first move — BM25+dense RRF k=60 — degrades nDCG@10 on both NFCorpus and SciFact because our multi-field BM25 is materially weaker than Lucene's. We ship the ablation harness + the finding anyway.
Acceptance test outcome
"RRF improves or preserves nDCG@10 on both NFCorpus and SciFact, bootstrap CI does not undermine the claim, defaults fixed before viewing test result." FAILS.
| Config (BOTH datasets, fixed defaults BEFORE viewing) | NFCorpus | SciFact | Mean |
|---|---:|---:|---:|
| dense alone (BGE-base) | 0.352 | 0.626 | 0.489 |
| RRF k=60 equal (textbook default) | 0.328 ↓ | 0.569 ↓ | 0.449 ↓ |
| RRF k=30 equal (best ablation) | 0.335 ↓ | 0.582 ↓ | 0.459 ↓ |
| RRF k=60 dense=1.2, bm25=0.8 | 0.334 ↓ | 0.577 ↓ | 0.456 ↓ |
| RRF k=60 dense=0.8, bm25=1.2 | 0.323 ↓ | 0.558 ↓ | 0.441 ↓ |
Every RRF variant underperforms dense-alone on the 2-dataset mean, by 0.030 (k=30, the best ablation) to 0.048 nDCG@10; the textbook k=60 default is 0.040 lower.
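The acceptance test relies on a bootstrap CI, but these notes do not describe how the harness computes it. As a minimal sketch of one common approach (a paired percentile bootstrap over per-query nDCG@10 differences), something like the following could be used. The function names, the seeded PRNG, and the per-query scores are illustrative assumptions, not taken from `run-beir-rrf-ablation.mjs`.

```javascript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired percentile bootstrap on (candidate - baseline) per-query scores.
// Hypothetical helper: the real harness may resample differently.
function pairedBootstrapCI(baseline, candidate, { resamples = 2000, alpha = 0.05, seed = 42 } = {}) {
  if (baseline.length === 0 || baseline.length !== candidate.length) {
    throw new Error('baseline and candidate must be non-empty and the same length');
  }
  const n = baseline.length;
  const diffs = baseline.map((b, i) => candidate[i] - b);
  const rand = mulberry32(seed);
  const means = new Array(resamples);
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rand() * n)];
    means[r] = sum / n;
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  const mean = diffs.reduce((s, d) => s + d, 0) / n;
  return { mean, lo, hi };
}

// Made-up per-query nDCG@10 values, for illustration only (not BEIR results).
const densePerQuery = [0.6, 0.4, 0.7, 0.5, 0.3];
const rrfPerQuery = [0.55, 0.38, 0.6, 0.45, 0.29];
const ci = pairedBootstrapCI(densePerQuery, rrfPerQuery);
console.log(ci); // an interval entirely below zero means RRF is reliably worse
```

If the upper bound `hi` stays below zero, the degradation is unlikely to be resampling noise, which is the sense in which a CI can "undermine" or support a claim.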
What DID work — recall
Recall@100 IS up on both:
| Dataset | Dense R@100 | RRF R@100 | Δ |
|---|---:|---:|---:|
| NFCorpus | 0.305 | 0.321 | +0.016 |
| SciFact | 0.828 | 0.951 | +0.123 |
RRF surfaces more candidates correctly — it just ranks them worse at top-K. This is the right setup for stage 2: cross-encoder rerank on the wider candidate pool (ADR-088 / 3.10.28).
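For readers unfamiliar with the metric, here is a minimal sketch of per-query Recall@k: the fraction of a query's relevant documents that appear in the top k results. The notes do not say which recall variant the harness uses (some BEIR tooling reports a capped variant), so treat this as the plain definition under that assumption. The function name and the toy IDs are hypothetical.

```javascript
// Plain Recall@k for one query: |relevant ∩ top-k| / |relevant|.
function recallAtK(rankedDocIds, relevantDocIds, k = 100) {
  const relevant = new Set(relevantDocIds);
  if (relevant.size === 0) return 0; // no judged-relevant docs for this query
  const topK = new Set(rankedDocIds.slice(0, k));
  let hits = 0;
  for (const id of relevant) {
    if (topK.has(id)) hits++;
  }
  return hits / relevant.size;
}

// Toy example: the dense list misses relevant doc "x"; a wider fused pool contains it.
const relevantDocs = ['a', 'x'];
const denseRecall = recallAtK(['a', 'b', 'c'], relevantDocs, 3);
const fusedRecall = recallAtK(['b', 'a', 'x'], relevantDocs, 3);
console.log({ denseRecall, fusedRecall });
```

Note that the fused list here has higher recall even though it ranks the first relevant document lower, which mirrors the pattern reported above: a wider pool, ranked worse at the top.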
Diagnosis (why RRF hurt)
The classic RRF win assumes comparably strong systems with different failure modes. Our setup is asymmetric: BGE-base dense is strong (0.626 on SciFact), while our multi-field BM25 is weak (0.576 on SciFact vs. Lucene's published 0.679). Pure BM25 nDCG@10 on NFCorpus is 0.279 vs. Lucene's 0.325, about 14% lower in relative terms.
When one input is weak, RRF averages its noise into top positions instead of cancelling it. The math works perfectly for the documented Lucene+strong-dense case; we don't match that profile yet.
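To make the mechanism concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, where each document scores the sum over systems of `weight / (k + rank)` with ranks starting at 1. The function name, option names, and toy rankings are assumptions for illustration; this is not the code in `run-beir-hybrid.mjs`.

```javascript
// Weighted RRF: score(d) = sum over systems of w / (k + rank(d)), rank is 1-based.
// `rankings` maps a system name to doc IDs ordered best-first.
function rrfFuse(rankings, { k = 60, weights = {} } = {}) {
  const scores = new Map();
  for (const [system, docIds] of Object.entries(rankings)) {
    const w = weights[system] ?? 1.0; // equal weighting unless overridden
    docIds.forEach((docId, i) => {
      scores.set(docId, (scores.get(docId) ?? 0) + w / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId, score]) => ({ docId, score }));
}

// Toy case: dense puts the relevant doc "d1" first, but a weak BM25 misses it entirely.
const fused = rrfFuse({
  dense: ['d1', 'd2', 'd3', 'd4'],
  bm25: ['d4', 'd3', 'd2'],
});
console.log(fused.map((r) => r.docId)); // "d4" comes first and "d1" drops to last
```

In this toy case, every document that appears in both lists collects two reciprocal-rank terms, so the dense system's top hit is pushed to the bottom simply because the weaker system never retrieved it. Passing `weights: { dense: 1.2, bm25: 0.8 }` corresponds to the weighted rows in the ablation table.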
Bug found and fixed
`bge-cache/` was hardcoded to `/tmp/beir-nfcorpus/bge-cache/`, so the SciFact run silently overwrote the NFCorpus cache. This was caught only when the first RRF run returned nDCG=0.14 (random-noise level), forcing investigation. The cache path is now per-dataset. The 3.10.25 and 3.10.26 NFCorpus numbers were computed before the overwrite and are still valid.
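The shape of this kind of fix can be sketched as follows: derive the cache directory from the dataset name rather than hardcoding it. The helper name, the validation rule, and the `beir-<dataset>` layout (extrapolated from the NFCorpus path above) are assumptions; the actual change in `run-beir-bge.mjs` may differ.

```javascript
// Hypothetical helper: one cache directory per dataset, so runs cannot collide.
function cacheDirFor(dataset, root = '/tmp') {
  // Reject names that could escape the root or collapse two datasets into one path.
  if (!/^[a-z0-9_-]+$/i.test(dataset)) {
    throw new Error(`unexpected dataset name: ${dataset}`);
  }
  return `${root}/beir-${dataset.toLowerCase()}/bge-cache`;
}

console.log(cacheDirFor('nfcorpus')); // /tmp/beir-nfcorpus/bge-cache
console.log(cacheDirFor('scifact'));  // /tmp/beir-scifact/bge-cache
```

A cheap extra guard against the silent-overwrite failure mode would be to store the dataset name inside the cache and refuse to load it on mismatch, though nothing in these notes says the release does that.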
What's in the box
- `scripts/run-beir-rrf-ablation.mjs`: re-runnable ablation harness with bootstrap CI on the fixed default config + full ablation matrix.
- `scripts/run-beir-hybrid.mjs`: full RRF + opt-in cross-encoder rerank runner (rerank wired but pending ADR-088 measurement).
- `bge-cache/` per-dataset path fix in `run-beir-bge.mjs`.
- ADR-087: full negative-result writeup with diagnosis + tracked next steps.
- Updated BEIR-MATRIX.md with ablation rows + the honest 2-dataset mean comparison.
- No default change — dense-only stays the BEIR runner default. RRF is opt-in for callers with Lucene-strength BM25.
Next steps (already tracked)
- ADR-088 / 3.10.28: Cross-encoder rerank on RRF's wider candidate pool (Recall@100 0.951 on SciFact says the candidates ARE there).
- Lucene-style BM25: Porter/Snowball stemmer + Lucene stopword list + length norm. Would make RRF actually work as designed.
- ruvnet/ruvector#524: bundle BGE in ruvector so downstream packages stop hitting the sharp dependency.
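The first next step, reranking RRF's wider candidate pool, follows the usual retrieve-then-rerank pattern. Below is a minimal sketch with a pluggable scoring function. The helper names and option names are hypothetical, and the token-overlap scorer is only a stand-in so the example runs; a real cross-encoder would score each (query, document) pair with a model and would typically be asynchronous.

```javascript
// Stage 2: rescore the top `poolSize` fused candidates and keep the best `topK`.
function rerankTopN(query, candidates, scorePair, { poolSize = 100, topK = 10 } = {}) {
  return candidates
    .slice(0, poolSize)
    .map((c) => ({ ...c, rerankScore: scorePair(query, c.text) }))
    .sort((a, b) => b.rerankScore - a.rerankScore)
    .slice(0, topK);
}

// Placeholder scorer (NOT a cross-encoder): counts document tokens that occur in the query.
function overlapScore(query, text) {
  const queryTokens = new Set(query.toLowerCase().split(/\s+/));
  return text.toLowerCase().split(/\s+/).filter((t) => queryTokens.has(t)).length;
}

// Toy candidate pool in fused order; texts are invented for illustration.
const pool = [
  { docId: 'a', text: 'statins and cholesterol' },
  { docId: 'b', text: 'vitamin d and bone density' },
  { docId: 'c', text: 'bone fractures in adults' },
];
const reranked = rerankTopN('vitamin d bone', pool, overlapScore, { topK: 2 });
console.log(reranked.map((r) => r.docId)); // [ 'b', 'c' ]
```

The point of the design is that stage 1 only needs high recall (the relevant documents must be somewhere in the pool), while stage 2 is responsible for ordering at the top.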
Reproduce
git clone https://github.com/ruvnet/ruflo && cd ruflo
npm install && ( cd v3/@claude-flow/cli && npx tsc )
mkdir -p /tmp/beir-nfcorpus && cd /tmp/beir-nfcorpus
curl -sL -o nf.zip 'https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/nfcorpus.zip' && unzip -q nf.zip
node /path/to/v3/@claude-flow/cli/scripts/run-beir-bge.mjs # ingest
node /path/to/v3/@claude-flow/cli/scripts/run-beir-rrf-ablation.mjs # ablation matrix
Install
npx [email protected] # latest / alpha / v3alpha all aligned