This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal
v8.4.0 introduces semantic caching that displays the cache hit percentage in the routing footer, with a configurable similarity threshold set via LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD. The LLMResponse API now carries cache_hit and cache_similarity fields for observability.
Why it matters: Semantic caching deduplicates requests using local embeddings, with cache entries scoped by task type and expiring after a 24-hour TTL. Configure the similarity threshold and evaluate cache hit rates in dev before production rollout.
Summary
AI summary: Semantic caching now surfaces cache hits in the routing footer with configurable similarity thresholds.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Semantic caching surfaces cache hits with percentage in routing footer display. (Source: llm_adapter@2026-05-21; confidence: low) | None |
| Feature | Medium | Configurable semantic cache threshold via LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD environment variable. (Source: llm_adapter@2026-05-21; confidence: low) | None |
| Feature | Medium | LLMResponse carries new cache_hit and cache_similarity fields for cache information. (Source: llm_adapter@2026-05-21; confidence: low) | None |
| Feature | Medium | Semantic caching uses Ollama nomic-embed-text for local embedding-based deduplication. (Source: llm_adapter@2026-05-21; confidence: low) | None |
| Feature | Medium | Cache entries persist with 24-hour TTL, task-type scoped, SQLite storage. (Source: llm_adapter@2026-05-21; confidence: low) | None |
Full changelog
What's New
Semantic caching now surfaces cache hits in the routing footer and supports configurable thresholds.
Cache Hit Footer
When a semantically similar prompt is found, you see:
→ cache hit (97%) · gemini-2.5-flash · $0
Zero cost, zero latency — the cached response is returned instantly.
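As a rough illustration, the footer line above could be rendered from the two new response fields. This is only a sketch: the `LLMResponse` dataclass here is a stand-in (its `model` and `cost_usd` fields are assumed names), `format_footer` is a hypothetical helper, and the layout of the non-cached footer is a guess rather than something the release notes specify.

```python
from dataclasses import dataclass


@dataclass
class LLMResponse:
    """Stand-in class; only cache_hit and cache_similarity come from the release notes."""
    model: str                     # assumed field name
    cost_usd: float                # assumed field name
    cache_hit: bool = False
    cache_similarity: float = 0.0  # assumed to be a 0..1 similarity score


def format_footer(resp: LLMResponse) -> str:
    """Hypothetical helper that renders a routing footer line."""
    cost = "$0" if resp.cost_usd == 0 else f"${resp.cost_usd:.4f}"
    if resp.cache_hit:
        pct = round(resp.cache_similarity * 100)
        return f"→ cache hit ({pct}%) · {resp.model} · {cost}"
    # Assumed layout for a normal (non-cached) response.
    return f"→ {resp.model} · {cost}"


print(format_footer(LLMResponse("gemini-2.5-flash", 0.0, True, 0.97)))
```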
Configurable Threshold
export LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD=0.90 # more hits, lower precision
export LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD=0.95 # default (conservative)
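For teams wiring the same variable into their own tooling, a minimal sketch of reading it defensively might look like the following. `read_threshold` is a hypothetical helper, and the fallback behaviour for missing or invalid values is an assumption; the release notes only state the variable name and the 0.95 default.

```python
import os

DEFAULT_THRESHOLD = 0.95  # default stated in the release notes


def read_threshold(env=None) -> float:
    """Hypothetical helper: parse the threshold, falling back to the default."""
    env = os.environ if env is None else env
    raw = env.get("LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD")
    if raw is None or raw.strip() == "":
        return DEFAULT_THRESHOLD
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_THRESHOLD
    # Assumption: anything outside (0, 1] is treated as invalid.
    if not 0.0 < value <= 1.0:
        return DEFAULT_THRESHOLD
    return value


print(read_threshold({"LLM_ROUTER_SEMANTIC_CACHE_THRESHOLD": "0.90"}))
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.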
Technical Details
- LLMResponse now carries cache_hit and cache_similarity fields
- Embedding-based dedup via Ollama nomic-embed-text (free, local)
- 24h TTL, task-type scoped, SQLite storage
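The details above can be sketched as a small lookup routine. This is not llm-router's implementation: the `SemanticCache` class, its schema, and its method names are hypothetical, cosine similarity is an assumed metric, and embeddings are passed in as plain lists rather than being fetched from Ollama. It only shows how a threshold, a 24h TTL, and task-type scoping could combine over SQLite.

```python
import json
import math
import sqlite3
import time

TTL_SECONDS = 24 * 60 * 60  # 24h TTL from the release notes


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Hypothetical sketch; schema and method names are assumptions."""

    def __init__(self, path=":memory:", threshold=0.95):
        self.db = sqlite3.connect(path)
        self.threshold = threshold
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            "task_type TEXT, embedding TEXT, response TEXT, created REAL)"
        )

    def put(self, task_type, embedding, response, now=None):
        now = time.time() if now is None else now
        self.db.execute(
            "INSERT INTO cache VALUES (?, ?, ?, ?)",
            (task_type, json.dumps(embedding), response, now),
        )
        self.db.commit()

    def get(self, task_type, embedding, now=None):
        """Return (response, similarity) for the best match, or None on a miss."""
        now = time.time() if now is None else now
        # Only unexpired entries of the same task type are candidates.
        rows = self.db.execute(
            "SELECT embedding, response FROM cache "
            "WHERE task_type = ? AND created > ?",
            (task_type, now - TTL_SECONDS),
        )
        best = None
        for emb_json, response in rows:
            sim = cosine(embedding, json.loads(emb_json))
            if sim >= self.threshold and (best is None or sim > best[1]):
                best = (response, sim)
        return best


cache = SemanticCache()
cache.put("code", [1.0, 0.0], "cached answer", now=0)
print(cache.get("code", [0.99, 0.05], now=60))
```

A linear scan over stored embeddings keeps the sketch short; a larger cache would likely need an index or a vector store, which is outside what the release notes describe.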
Upgrade
pip install --upgrade llm-routing
Full Changelog: https://github.com/ypollak2/llm-router/compare/v8.3.0...v8.4.0
About ypollak2/llm-router
Subscription-aware LLM router for Claude Code. Routes tasks to 20+ providers (OpenAI, Gemini, Groq, Ollama, Codex) based on complexity classification, Claude subscription pressure, and cost. Free tasks stay on Claude subscription; expensive tasks fall back to the cheapest capable model. Includes 30 MCP tools, 6 auto-routing hooks, semantic dedup cache, prompt caching, daily spend cap, and a live web dashboard.