This release adds 3 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Updates Honest limits, 1.2B, and SciDocs across a mixed release.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Adds SciDocs dataset to BEIR suite. Source: llm_adapter@2026-05-31. Confidence: high | None |
| Feature | Low | Updates installation command to npx [email protected]. Source: llm_adapter@2026-05-31. Confidence: high | None |
| Performance | Medium | Dense alone (BGE-base) achieves nDCG@10 0.211 on SciDocs, ranking 2/11. Source: llm_adapter@2026-05-31. Confidence: high | None |
| Performance | Medium | Lucene RRF without rerank scores 0.203 on SciDocs, a slight regression (-0.008). Source: llm_adapter@2026-05-31. Confidence: high | None |
Full changelog
What ships
4th BEIR dataset (SciDocs) joins NFCorpus + SciFact + ArguAna. New finding: no single pipeline wins everywhere.
SciDocs results
| Pipeline | nDCG@10 | Rank | Note |
|---|---:|---:|---|
| dense alone (BGE-base) | 0.211 | 2/11 | |
| Lucene RRF (no rerank) | 0.203 | | -0.008 vs dense alone (RRF hurt) |
Dense alone trails only BGE-large (335M params, 0.225). It beats BM25, GTR-XL (1.2B), and every other published baseline in the comparison.
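Why fusing can lose to dense alone is easier to see in code. Below is a minimal sketch of reciprocal rank fusion (RRF), not ruflo's actual implementation: the function name `rrf_fuse`, the constant `k=60` (the value commonly used in the literature), and the toy doc ids are all assumptions for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked doc-id lists with reciprocal rank fusion.

    rankings: list of lists of doc ids, best first.
    k: smoothing constant; 60 is an assumed default, not taken from this release.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy data (hypothetical): the dense retriever ranks "a" first,
# while the lexical retriever ranks it last.
dense = ["a", "b", "c"]
lexical = ["b", "c", "a"]

fused = rrf_fuse([dense, lexical])
print(fused)  # ['b', 'a', 'c']
```

In this toy case the dense top hit drops to second place after fusion. If "a" were the relevant document, fusion would lower nDCG@10, the same direction as the -0.008 seen on SciDocs.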
4-dataset mean leaderboard
| System | Params | NFCorpus | SciFact | ArguAna | SciDocs | Mean |
|---|---:|---:|---:|---:|---:|---:|
| BGE-large (published) | 335M | 0.380 | 0.722 | 0.636 | 0.225 | 0.491 |
| SPLADE++ (published) | 110M | 0.347 | 0.704 | 0.521 | 0.159 | 0.433 |
| ruflo best (per-dataset) | 110M | 0.358 | 0.683 | 0.432 | 0.211 | 0.421 |
| GTR-XL (1.2B) | 1.2B | 0.343 | 0.662 | 0.439 | 0.174 | 0.405 |
| GenQ | 110M | 0.319 | 0.644 | 0.493 | 0.143 | 0.400 |
| BM25 (Lucene published) | — | 0.325 | 0.679 | 0.397 | 0.158 | 0.390 |
Rank 3 of 11 on the 4-dataset mean. Beats GTR-XL with roughly one-tenth the parameters (110M vs 1.2B). Loses only to SPLADE++ (-0.012, basically tied) and BGE-large (-0.070, mostly the ArguAna gap).
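The means and gaps above can be rechecked from the per-dataset columns. A small sketch, using only the six systems shown in the table (the full comparison has 11, so positions here are relative to this subset); variable names are illustrative.

```python
# nDCG@10 per dataset, in table order: NFCorpus, SciFact, ArguAna, SciDocs.
scores = {
    "BGE-large (published)": [0.380, 0.722, 0.636, 0.225],
    "SPLADE++ (published)": [0.347, 0.704, 0.521, 0.159],
    "ruflo best (per-dataset)": [0.358, 0.683, 0.432, 0.211],
    "GTR-XL (1.2B)": [0.343, 0.662, 0.439, 0.174],
    "GenQ": [0.319, 0.644, 0.493, 0.143],
    "BM25 (Lucene published)": [0.325, 0.679, 0.397, 0.158],
}

# Unweighted mean over the four datasets.
means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

# Sort systems by mean, best first.
leaderboard = sorted(means, key=means.get, reverse=True)

ruflo = means["ruflo best (per-dataset)"]
gap_splade = ruflo - means["SPLADE++ (published)"]
gap_bge = ruflo - means["BGE-large (published)"]

for name in leaderboard:
    print(f"{name:28s} {means[name]:.3f}")
print(f"gap vs SPLADE++:  {gap_splade:+.3f}")
print(f"gap vs BGE-large: {gap_bge:+.3f}")
```

The ArguAna column accounts for most of the BGE-large gap: 0.636 vs 0.432 is a 0.204 difference on one dataset, or about 0.051 of the 0.070 mean gap.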
The config-divergence finding
After 4 datasets, no single pipeline wins everywhere:
| Dataset | Best config | What hurts |
|---|---|---|
| NFCorpus | Lucene + RRF + CE rerank | nothing |
| SciFact | Lucene + RRF + CE rerank | nothing |
| ArguAna | Lucene + RRF (no CE) | CE rerank actively hurts |
| SciDocs | dense alone | RRF hurt by 0.008 |
The four datasets split across three different best configs: NFCorpus and SciFact agree, ArguAna and SciDocs each pick their own. Auto-pipeline-selection would need a per-corpus calibrator (cheap, doesn't need GPU, tracked).
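One way such a calibrator could work is to score each candidate config on a small labelled query sample from the corpus and keep the winner. The sketch below is hypothetical, not the tracked design: `ndcg_at_k`, `pick_config`, the input shapes, and the toy data are all assumptions, and the nDCG uses linear gain with a log2 discount.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k with linear gain and log2(rank + 1) discount (assumed variant)."""
    dcg = sum(
        relevance.get(doc, 0) / math.log2(i + 2)
        for i, doc in enumerate(ranked_ids[:k])
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def pick_config(runs, qrels, k=10):
    """Return (best_config, mean nDCG@k per config).

    runs:  {config_name: {query_id: [doc ids, best first]}}
    qrels: {query_id: {doc_id: graded relevance}}
    """
    def mean_ndcg(run):
        total = sum(ndcg_at_k(run.get(q, []), rels, k) for q, rels in qrels.items())
        return total / len(qrels)

    scores = {name: mean_ndcg(run) for name, run in runs.items()}
    return max(scores, key=scores.get), scores


# Toy calibration sample (made up): two labelled queries, two candidate configs.
qrels = {"q1": {"a": 1}, "q2": {"b": 1}}
runs = {
    "dense alone": {"q1": ["a", "x"], "q2": ["b", "y"]},
    "Lucene + RRF (no CE)": {"q1": ["x", "a"], "q2": ["b", "y"]},
}

best, scores = pick_config(runs, qrels)
print(best, scores)
```

Scoring precomputed runs is pure CPU arithmetic, which fits the "cheap, doesn't need GPU" note. How many labelled queries a corpus needs before the choice is stable is an open question this sketch does not address.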
Honest limits
- 4/18 BEIR datasets. The 0.421 mean is suggestive, not BEIR-average.
- Zero-shot — NFCorpus and ArguAna train splits remain unused.
- The 5 biggest BEIR datasets (TREC-COVID, FiQA, HotpotQA, NQ, DBPedia, all >50k docs) remain GPU-gated.
Install
npx [email protected] # latest / alpha / v3alpha all aligned
Full ADR: v3/docs/adr/ADR-091-scidocs-and-config-divergence.md