This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal. LLM Router v8.3.0 automatically compresses conversation context for paid models, reducing context tokens by 20-50%.
Why it matters: Operators can cut context tokens by up to 50% on paid model calls. Compression is on by default (LLM_ROUTER_CONTEXT_OPTIMIZER=auto); set the variable to off to disable it.
Summary
AI summary: Automatically compresses conversation context before sending it to paid models, saving up to 50% of context tokens.
Changes in this release
| Type | Severity | Summary | Source | Confidence | CVE |
|---|---|---|---|---|---|
| Feature | Medium | Two-stage compression pipeline: structural and recency-based optimization. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Free models automatically skip context compression. | llm_adapter@2026-05-21 | high | None |
| Feature | Medium | Context savings metrics display in routing footer. | llm_adapter@2026-05-21 | low | None |
| Feature | Medium | LLM_ROUTER_CONTEXT_OPTIMIZER environment variable controls compression. | llm_adapter@2026-05-21 | low | None |
| Feature | Low | Shows context token savings in the routing footer output. | granite4.1:30b@2026-05-23-audit | low | None |
| Feature | Low | Adds LLM_ROUTER_CONTEXT_OPTIMIZER env var to enable/disable/adjust compression. | granite4.1:30b@2026-05-23-audit | low | None |
| Performance | Medium | Automatically compresses conversation context before sending to paid models. | llm_adapter@2026-05-21 | high | None |
Full changelog
What's New
Automatically compresses conversation context before sending to paid models. Zero latency, 20-50% fewer context tokens.
How It Works
2-stage pipeline (pure Python, no LLM calls):
- Structural — collapses whitespace, removes code comments, deduplicates repeated blocks
- Recency — keeps last 2 exchanges verbatim, truncates older messages, drops old code blocks
Free models (Ollama, Codex, Gemini CLI) skip compression automatically.
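The two stages described above can be sketched roughly as follows. This is an illustrative approximation, not the project's actual code: the function names, the message shape (dicts with `role` and `content`), the 200-character cap for older messages, and the `[code omitted]` placeholder are all assumptions. Comment stripping is left out because it depends on the language of each code block.

```python
import re

# Built from single backticks so the sketch contains no literal fence marker.
FENCE = "`" * 3
FENCE_RE = re.compile(re.escape(FENCE) + r".*?" + re.escape(FENCE), re.DOTALL)


def structural_pass(text: str) -> str:
    """Stage 1 (sketch): trim whitespace and drop repeated blocks."""
    text = re.sub(r"[ \t]+\n", "\n", text)   # trailing spaces/tabs before a newline
    text = re.sub(r"\n{3,}", "\n\n", text)   # runs of blank lines -> one blank line
    seen, kept = set(), []
    for block in text.split("\n\n"):
        key = block.strip()
        if key and key in seen:
            continue                         # skip an exact repeat of an earlier block
        seen.add(key)
        kept.append(block)
    return "\n\n".join(kept).strip()


def recency_pass(messages, keep_exchanges=2, max_old_chars=200):
    """Stage 2 (sketch): keep recent exchanges verbatim, shrink older ones."""
    keep = keep_exchanges * 2                # assumes one exchange = user + assistant
    cutoff = max(len(messages) - keep, 0)
    out = []
    for i, msg in enumerate(messages):
        content = msg["content"]
        if i < cutoff:
            content = FENCE_RE.sub("[code omitted]", content)  # drop old code blocks
            if len(content) > max_old_chars:
                content = content[:max_old_chars] + "..."      # truncate old prose
        out.append({**msg, "content": content})
    return out


def compress(messages):
    """Run stage 1 on every message, then stage 2 on the whole conversation."""
    staged = [{**m, "content": structural_pass(m["content"])} for m in messages]
    return recency_pass(staged)
```

Both passes use only string operations and the standard `re` module, which is consistent with the "pure Python, no LLM calls" description; how the real pipeline orders or tunes these steps is not stated in the release notes.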
What You See
Context savings now appear in the routing footer:
→ gemini-2.5-flash · simple · $0.0002 (43x cheaper) | ctx 1500→920tok (39% saved)
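The savings figure in the footer is the share of context tokens removed. A minimal sketch of that arithmetic, using the numbers from the example line (the function name and the exact formatting are assumptions, not the router's API):

```python
ARROW = "\u2192"  # the arrow character used in the footer example


def context_savings(before_tokens: int, after_tokens: int) -> str:
    """Format a footer fragment showing before/after tokens and percent saved."""
    if before_tokens <= 0:
        return ""  # nothing to report for an empty context
    saved_pct = round(100 * (before_tokens - after_tokens) / before_tokens)
    return f"ctx {before_tokens}{ARROW}{after_tokens}tok ({saved_pct}% saved)"


# 1500 - 920 = 580 tokens removed; 580 / 1500 is about 38.7%, shown as 39%.
print(context_savings(1500, 920))
```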
Configuration
export LLM_ROUTER_CONTEXT_OPTIMIZER=auto # default — Stage 1+2
export LLM_ROUTER_CONTEXT_OPTIMIZER=off # disable
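As a rough illustration of how a setting like this might gate compression, the sketch below reads the variable and skips free models. Only the values `auto` and `off` are documented above; the helper name, the `model_is_free` flag, and the choice to treat any other value as `auto` are assumptions.

```python
import os


def optimizer_enabled(model_is_free: bool, env=None) -> bool:
    """Decide whether to compress context for a call (hypothetical helper)."""
    env = os.environ if env is None else env
    mode = env.get("LLM_ROUTER_CONTEXT_OPTIMIZER", "auto").strip().lower()
    if mode == "off":
        return False          # compression explicitly disabled
    return not model_is_free  # free models skip compression


# With the variable unset, paid models are compressed and free models are not.
print(optimizer_enabled(model_is_free=False, env={}))
print(optimizer_enabled(model_is_free=True, env={}))
```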
Upgrade
pip install --upgrade llm-routing
Full Changelog: https://github.com/ypollak2/llm-router/compare/v8.2.0...v8.3.0
About ypollak2/llm-router
Subscription-aware LLM router for Claude Code. Routes tasks to 20+ providers (OpenAI, Gemini, Groq, Ollama, Codex) based on complexity classification, Claude subscription pressure, and cost. Free tasks stay on Claude subscription; expensive tasks fall back to the cheapest capable model. Includes 30 MCP tools, 6 auto-routing hooks, semantic dedup cache, prompt caching, daily spend cap, and a live web dashboard.