AlekseiMarchenko/central-intelligence

v1.2.2 Breaking

This release includes one breaking change that platform teams should review before upgrading.

Published 3mo ago · AI Agents & Assistants

✓ No known CVEs patched

Summary

AI summary

Reranker upgrade to bge-reranker-v2-m3 and top‑K=10 boost LongMemEval score by ~18 pp.

Full changelog

New benchmark results and documentation parity. Central Intelligence's v2 hybrid retrieval pipeline (pgvector HNSW + BM25 + RRF fusion + bge-reranker-v2-m3 cross-encoder) scores among the top results published on both LifeBench and LongMemEval.

Benchmark Results

LongMemEval (ICLR 2025) — 75.0%

Conversational memory across 500 questions: single-session recall, multi-session reasoning, temporal reasoning, knowledge updates, preference tracking.

| Overall | Single-session | Multi-session | Temporal | Preference |
|---------|----------------|---------------|----------|------------|
| 75.0% | 91.9% | 66.2% | 69.9% | 76.7% |

Answer model: gpt-5.4-mini. Judge: gpt-4o.

LifeBench (2026) — 52.2%

Long-term multi-source memory: 2,003 questions across 10 users, 51K real-world events (messages, calendar, health records, notes, calls). The hardest published memory benchmark.

| Overall | Info Extraction | Multi-hop | Temporal | Nondeclarative |
|---------|-----------------|-----------|----------|----------------|
| 52.2% | 47.2% | 52.9% | 46.4% | 64.1% |

Answer model: gpt-5.4-mini. Judge: gpt-4.1-mini.

Evaluation harness: lifebench-eval.

What Got Us Here (since v1.2.1)

The journey from v1.2.1 to v1.2.2, measured on LongMemEval:

| Change | Delta |
|--------|-------|
| Reranker upgrade: MiniLM → bge-reranker-v2-m3 + top-K=10 | ~+18pp |
| Answer model: gpt-4o-mini → gpt-5.4-mini | ~+5pp |
| Answer prompt: CoT counting + factual precision + multi-hop grounding | ~+3pp |
| Infrastructure fixes (ef_search, performance CPU, 4GB RAM) | enabler |

Attribution is approximate because each step's delta includes the benchmark's run-to-run variance. The architecture that reached 75% is the clean v2 pipeline: pgvector HNSW + BM25 with RRF fusion, temporal decay, and a cross-encoder reranker. Recall involves no fact extraction or graph traversal.
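
To make the RRF fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is an illustration under stated assumptions, not the project's code: the function name `rrf_fuse`, the document ids, and the constant `k=60` (the value commonly used in the RRF literature) are all hypothetical and not confirmed by this release.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is an assumed default, not a value taken
    from this release.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_hits = ["m3", "m1", "m7"]  # hypothetical ids from the HNSW vector search
bm25_hits = ["m1", "m9", "m3"]    # hypothetical ids from the BM25 keyword search
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused)  # ['m1', 'm3', 'm9', 'm7']
```

Documents found by both retrievers (`m1`, `m3`) rise above those found by only one, which is the property that makes RRF a convenient way to merge vector and keyword results without calibrating their raw scores.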

Changes Since v1.2.1

Retrieval

- Upgraded the reranker to bge-reranker-v2-m3 (8K context, +14% nDCG@10 over MiniLM)
- Set retrieval to top-K=10 (NeurIPS 2024 finding: accuracy saturates at k=10 and degrades with more documents)
- Set hnsw.ef_search = 400 globally (results were previously silently capped at ~40)
- Reverted to the v2 architecture (vector + BM25 + reranker): removed the query decomposition and fact-path that regressed multi-session scores
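
The rerank-then-truncate step can be sketched as follows. This is a hedged illustration only: `rerank_top_k` and `overlap_score` are hypothetical names, and the word-overlap scorer is a toy stand-in for a cross-encoder such as bge-reranker-v2-m3, which is not loaded here.

```python
def rerank_top_k(query, candidates, score_fn, top_k=10):
    """Score every candidate against the query and keep the best top_k."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:top_k]]


def overlap_score(query, passage):
    """Toy stand-in for a cross-encoder: fraction of query words in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words) / len(query_words) if query_words else 0.0


# Hypothetical candidate pool, as it might come out of the fused retrieval stage.
candidates = [f"note {i} about the weather" for i in range(20)]
candidates.append("alice moved to berlin in march")

top = rerank_top_k("where did alice move", candidates, overlap_score)
print(len(top), top[0])
```

Only the ten highest-scoring passages reach the answer model, in line with the top-K=10 setting above; a real deployment would replace `overlap_score` with the reranker's relevance score.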

Benchmark infrastructure

- Added --answer-model parameter to benchmark entrypoint
- Performance CPU on benchmark VM (fixes extraction stalls under shared-CPU throttling)
- DB-driven extraction queue (removed in-memory queue that caused zombie processes)
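
As a rough illustration of what a DB-driven queue can look like, the sketch below keeps job state in a table instead of process memory. Everything in it is an assumption made for the example: the `extraction_jobs` table, its columns, and the helper names are hypothetical, and SQLite stands in for the project's Postgres database so the sketch runs without a server.

```python
import sqlite3


def make_queue():
    """Create an in-memory SQLite database holding a hypothetical job table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE extraction_jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def enqueue(conn, payload):
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO extraction_jobs (payload) VALUES (?)", (payload,))


def claim_next(conn):
    """Mark the oldest pending job as running and return (id, payload), or None.

    Single-connection sketch; with several Postgres workers one would
    typically select the row with FOR UPDATE SKIP LOCKED instead.
    """
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM extraction_jobs"
            " WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute(
            "UPDATE extraction_jobs SET status = 'running' WHERE id = ?", (row[0],)
        )
        return row


def complete(conn, job_id):
    with conn:
        conn.execute(
            "UPDATE extraction_jobs SET status = 'done' WHERE id = ?", (job_id,)
        )


conn = make_queue()
enqueue(conn, "memory-1")
enqueue(conn, "memory-2")
job = claim_next(conn)
complete(conn, job[0])
pending = conn.execute(
    "SELECT COUNT(*) FROM extraction_jobs WHERE status = 'pending'"
).fetchone()[0]
print(job, pending)  # (1, 'memory-1') 1
```

Because each job's status is a row rather than an entry in a worker's memory, a crashed worker leaves behind a visible `running` row that can be inspected or re-queued.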

Documentation

- README Benchmarks section now includes full category tables for both LifeBench and LongMemEval
- Landing page at centralintelligence.online matches README numbers exactly
- Answer/judge model attribution documented inline for reproducibility

Try It

npx central-intelligence-local signup

Full docs: README · Benchmarks: live results

Breaking Changes

Removed the query decomposition and fact-path components; recall now runs on the v2 architecture alone (vector + BM25 + reranker).

About AlekseiMarchenko/central-intelligence

Persistent memory for AI agents. Five tools (remember, recall, context, forget, share) with semantic search via vector embeddings and agent/user/org scoping. Works with Claude Code, Cursor, Windsurf, and any MCP client.
