This release includes 1 breaking change for platform teams planning a safe upgrade.

✓ No known CVEs patched

Summary

AI summary

Reranker upgrade to bge-reranker-v2-m3 and top‑K=10 boost LongMemEval score by ~18 pp.

Full changelog

New benchmark results and documentation parity. Central Intelligence's v2 hybrid retrieval pipeline (pgvector HNSW + BM25 + RRF fusion + bge-reranker-v2-m3 cross-encoder) scores among the top results published on both LifeBench and LongMemEval.
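The RRF fusion step in the pipeline above can be illustrated with a minimal sketch. This is not the project's code: the function name `rrf_fuse`, the sample ids, and the constant `k=60` (the value conventionally used for Reciprocal Rank Fusion) are assumptions chosen for illustration, since the release notes do not state them.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of ids with Reciprocal Rank Fusion.

    `k=60` is the commonly used default, assumed here; the release
    notes do not say which constant the project uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical result ids from the vector (HNSW) and BM25 searches.
vector_hits = ["m3", "m1", "m7"]
bm25_hits = ["m1", "m9", "m3"]
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)
```

Ids that appear in both lists (here `m1` and `m3`) accumulate two contributions and rise above ids found by only one retriever, which is the property that makes RRF useful for combining dense and lexical search.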

Benchmark Results

LongMemEval (ICLR 2025) — 75.0%

Conversational memory across 500 questions: single-session recall, multi-session reasoning, temporal reasoning, knowledge updates, preference tracking.

| Overall | Single-session | Multi-session | Temporal | Preference |
|---------|----------------|---------------|----------|------------|
| 75.0% | 91.9% | 66.2% | 69.9% | 76.7% |

Answer model: gpt-5.4-mini. Judge: gpt-4o.

LifeBench (2026) — 52.2%

Long-term multi-source memory: 2,003 questions across 10 users, 51K real-world events (messages, calendar, health records, notes, calls). The hardest published memory benchmark.

| Overall | Info Extraction | Multi-hop | Temporal | Nondeclarative |
|---------|-----------------|-----------|----------|----------------|
| 52.2% | 47.2% | 52.9% | 46.4% | 64.1% |

Answer model: gpt-5.4-mini. Judge: gpt-4.1-mini.

Evaluation harness: lifebench-eval.

What Got Us Here (since v1.2.1)

The journey from v1.2.1 to v1.2.2, measured on LongMemEval:

| Change | Delta |
|--------|-------|
| Reranker upgrade: MiniLM → bge-reranker-v2-m3 + top-K=10 | ~+18pp |
| Answer model: gpt-4o-mini → gpt-5.4-mini | ~+5pp |
| Answer prompt: CoT counting + factual precision + multi-hop grounding | ~+3pp |
| Infrastructure fixes (ef_search, performance CPU, 4GB RAM) | enabler |

Attribution is approximate: each step was measured amid the benchmark's run-to-run variance, so the individual deltas are estimates. The architecture that reached 75% is the clean v2 pipeline: pgvector HNSW + BM25 with RRF fusion, temporal decay, and a cross-encoder reranker. No fact extraction or graph traversal runs during recall.
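The release notes mention temporal decay without giving its form. One common choice is an exponential half-life, sketched below purely as an assumption: the function name `decayed_score` and the 30-day half-life are hypothetical and may not match what the project does.

```python
import math


def decayed_score(score, age_days, half_life_days=30.0):
    """Down-weight a relevance score by the age of the memory.

    Assumed form: the score halves every `half_life_days`.
    The 30-day default is a hypothetical value for illustration.
    """
    return score * math.pow(0.5, age_days / half_life_days)


fresh = decayed_score(0.8, age_days=0)      # no decay for a new memory
month_old = decayed_score(0.8, age_days=30)  # one half-life old
print(fresh, month_old)
```

Under this assumed scheme, an older memory needs a proportionally higher raw relevance score to outrank a recent one.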

Changes Since v1.2.1

Retrieval

  • Upgraded reranker to bge-reranker-v2-m3 (8K context, +14% nDCG@10 over MiniLM)
  • top-K=10 retrieval (NeurIPS 2024 finding: accuracy saturates at k=10, degrades with more docs)
  • Set hnsw.ef_search = 400 globally (at the pgvector default of 40, HNSW queries were silently capped at ~40 results)
  • Reverted to v2 architecture (vector + BM25 + reranker) — removed query decomposition and fact-path that regressed multi-session scores
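The rerank-then-truncate step described in the list above can be sketched as follows. The names `rerank_top_k` and `overlap_score` are hypothetical; `overlap_score` is a toy word-overlap stand-in so the example runs without a model, whereas the real pipeline would score each (query, document) pair with a cross-encoder such as bge-reranker-v2-m3.

```python
from typing import Callable, List, Tuple


def rerank_top_k(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and keep the best k."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    # Stable sort: candidates with equal scores keep their fused order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def overlap_score(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: share of query words found in doc."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


# Hypothetical candidate memories, as they might arrive from the fusion step.
docs = [f"note {i} about groceries" for i in range(12)]
docs.append("dentist appointment moved to friday")
top = rerank_top_k("when is the dentist appointment", docs, overlap_score)
print(top[0])
```

Passing the scorer in as a callable keeps the top-K logic independent of the model, so swapping MiniLM for another reranker would only change `score_fn`.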

Benchmark infrastructure

  • Added --answer-model parameter to benchmark entrypoint
  • Performance CPU on benchmark VM (fixes extraction stalls under shared-CPU throttling)
  • DB-driven extraction queue (removed in-memory queue that caused zombie processes)
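A DB-driven queue of the kind mentioned above keeps job state in a table, so a crashed worker leaves a visible row rather than a lost in-memory entry. The sketch below is an illustration only: it uses SQLite from the Python standard library as a stand-in, and the table name `extraction_jobs`, its columns, and the helper functions are all hypothetical. On PostgreSQL, a worker would more typically claim rows with `SELECT ... FOR UPDATE SKIP LOCKED`.

```python
import sqlite3


def make_queue(conn):
    # Hypothetical schema: one row per extraction job.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS extraction_jobs ("
        "id INTEGER PRIMARY KEY, "
        "payload TEXT NOT NULL, "
        "status TEXT NOT NULL DEFAULT 'pending')"
    )


def enqueue(conn, payload):
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO extraction_jobs (payload) VALUES (?)", (payload,)
        )
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest pending job as running and return (id, payload)."""
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM extraction_jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # The status guard makes the claim a no-op if another worker won.
        cur = conn.execute(
            "UPDATE extraction_jobs SET status = 'running' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        )
        return row if cur.rowcount == 1 else None


def complete(conn, job_id):
    with conn:
        conn.execute(
            "UPDATE extraction_jobs SET status = 'done' WHERE id = ?", (job_id,)
        )


conn = sqlite3.connect(":memory:")
make_queue(conn)
first_id = enqueue(conn, "memory-1")
second_id = enqueue(conn, "memory-2")
claimed = claim_next(conn)
complete(conn, claimed[0])
print(claimed)
```

Because every state change is a committed row update, a supervisor can find jobs stuck in `running` with a plain query and requeue them.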

Documentation

  • README Benchmarks section now includes full category tables for both LifeBench and LongMemEval
  • Landing page at centralintelligence.online matches README numbers exactly
  • Answer/judge model attribution documented inline for reproducibility

Try It

npx central-intelligence-local signup

Full docs: README · Benchmarks: live results

Breaking Changes

  • Removed the query decomposition and fact‑path components from recall; the pipeline reverts to the v2 architecture (vector + BM25 + reranker)

About AlekseiMarchenko/central-intelligence

Persistent memory for AI agents. Five tools (remember, recall, context, forget, share) with semantic search via vector embeddings and agent/user/org scoping. Works with Claude Code, Cursor, Windsurf, and any MCP client.
