This release includes 1 breaking change for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: The inline-dequant E8 KV-cache path is now the default read mechanism in vLLM.
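Why it matters: When the E8 KV cache is active, 4 bpw KV recipes now use the inline-dequant v3 attention path with no extra flag, while other recipes fall back to the 65 K workspace path automatically. The changelog documents correctness validation on Ampere, Ada, and Blackwell GPUs but publishes no latency or throughput figures, so measure any performance effect on your own workload.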
Summary
AI summary: Inline-dequant E8 KV cache path becomes the default read mechanism in vLLM.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Inline-dequant E8 KV-cache path becomes the default read path. Source: llm_adapter@2026-06-04. Confidence: high. | — |
| Feature | Low | v3 inline-dequant attention is used by default when E8 KV cache is active. Source: granite4.1:30b@2026-06-04-audit. Confidence: low. | — |
| Bugfix | Low | Prebuilt Docker image cannot serve vLLM on GPU due to CUDA version mismatch. Source: llm_adapter@2026-06-04. Confidence: high. | — |
| Refactor | Low | 4 bpw KV recipes automatically use the new default path; other recipes fall back to the 65 K workspace path. Source: granite4.1:30b@2026-06-04-audit. Confidence: low. | — |
Full changelog
v0.5.1 — inline-dequant E8 KV is now the default
v0.5.0 shipped the inline-dequant E8 KV-cache path as opt-in. After
validating it across the consumer GPU lineup, v0.5.1 makes it the
default E8-KV read path.
When the E8 KV cache is active, vLLM now uses the v3 inline-dequant
attention (4 K codebook, FHT-butterfly inverse Hadamard, flash-decoding
KV-split, FULL cudagraph capture) by default — no extra flag. 4 bpw KV
recipes use it; other recipes fall back to the 65 K workspace path
automatically.
Enabling (unchanged bundle, no extra flag)
GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw
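For programmatic launches, the same bundle can be assembled in Python. This is a minimal sketch, not part of glq: the helper names `build_serve_env` and `serve_command` are hypothetical, and only the variable names, their values, and the model ID are taken from the command above.

```python
import os
import shlex

# Variables copied verbatim from the enabling bundle in the release notes.
E8_KV_BUNDLE = {
    "GLQ_KV_QUANT": "e8_relaxed:2",
    "GLQ_KV_E8_SIDECAR": "1",
    "GLQ_KV_E8_SIDECAR_READ": "1",
    "GLQ_KV_E8_COMPRESSED_ALLOC": "1",
    "GLQ_KV_E8_FUSED_GATHER": "1",
    "GLQ_KV_E8_FUSED_WRITE": "1",
}
MODEL = "xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw"


def build_serve_env(base=None, extra=None):
    """Return a copy of the base environment with the E8 KV bundle applied.

    `extra` (hypothetical convenience) lets the caller merge further variables.
    """
    env = dict(os.environ if base is None else base)
    env.update(E8_KV_BUNDLE)
    env.update(extra or {})
    return env


def serve_command(model=MODEL):
    """Argument list equivalent to `vllm serve <model>`."""
    return ["vllm", "serve", model]


if __name__ == "__main__":
    env = build_serve_env()
    cmd = serve_command()
    print("would run:", shlex.join(cmd))
    # To actually launch the server, pass both to subprocess:
    # subprocess.run(cmd, env=env, check=True)
```

Keeping the bundle in one dictionary makes it easy to diff against the release notes when a later version changes the variables.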
Opt-outs: GLQ_KV_E8_INLINE_DEQUANT_V3=0 (revert to the 65 K
workspace path) or GLQ_KV_E8_FORCE_PIECEWISE=1 (keep inline, disable
the FULL decode graph). Fully reversible.
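The selection rules described above can be summarized as a small decision function. This is a sketch of one reading of the release notes, not glq's actual code: the function name, the `recipe_is_4bpw` argument, and the returned labels are hypothetical, and it assumes the E8 KV cache is already active and that an unset flag means the new default.

```python
def resolve_e8_read_path(env, recipe_is_4bpw):
    """Hypothetical model of the E8 KV read-path choice in v0.5.1.

    env            -- mapping of environment variables (e.g. os.environ)
    recipe_is_4bpw -- True when the active KV recipe is a 4 bpw recipe
    """
    # Opt-out 1: GLQ_KV_E8_INLINE_DEQUANT_V3=0 reverts to the workspace path.
    if env.get("GLQ_KV_E8_INLINE_DEQUANT_V3", "1") == "0":
        return "workspace_65k"
    # Recipes other than 4 bpw fall back automatically.
    if not recipe_is_4bpw:
        return "workspace_65k"
    # Opt-out 2: keep inline dequant but disable the FULL decode graph.
    if env.get("GLQ_KV_E8_FORCE_PIECEWISE", "0") == "1":
        return "inline_v3_piecewise"
    return "inline_v3_full"


if __name__ == "__main__":
    print(resolve_e8_read_path({}, recipe_is_4bpw=True))
    print(resolve_e8_read_path({"GLQ_KV_E8_INLINE_DEQUANT_V3": "0"}, True))
```

How glq treats flag values other than `0` and `1` is not stated in the notes, so the string comparisons here are an assumption.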
Consumer-GPU validation (what gated the flip)
| Arch | Card (class) | Result |
|---|---|---|
| sm_86 Ampere | A10G / 3090, 24 GB | NIAH-16k 3/3, MMLU n=24 0.292 |
| sm_89 Ada | L40S / 4090 | NIAH-16k 3/3, MMLU n=24 0.333 |
| sm_120 Blackwell | RTX PRO 6000 / 5090 | full A/B, FULL == PIECEWISE quality-neutral |
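To see whether a local GPU matches one of the architectures in the table, its compute capability can be compared against the three listed values. The sketch below is illustrative: `VALIDATED_ARCHS` and both helper names are hypothetical, and the check only reports whether the architecture appears in the table, not whether other architectures work.

```python
# Compute capabilities listed in the validation table.
VALIDATED_ARCHS = {
    (8, 6): "sm_86 Ampere",
    (8, 9): "sm_89 Ada",
    (12, 0): "sm_120 Blackwell",
}


def validated_arch(capability):
    """Return the table label for a (major, minor) pair, or None if unlisted."""
    return VALIDATED_ARCHS.get(tuple(capability))


def local_capability():
    """Return the current GPU's (major, minor), or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("no CUDA device visible")
    else:
        print(cap, "->", validated_arch(cap) or "not in the validation table")
```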
The v3 Triton kernels compile and produce correct output on all three
architectures (MMLU figures are within SmolLM3-3B's small-sample noise
band). FULL-vs-PIECEWISE quality-neutrality was established rigorously on
Blackwell (bit-identical on SmolLM3; within vLLM's own greedy
non-determinism on Gemma-4); the consumer-card runs are shorter FULL-only
smokes.
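The small-sample remark can be made concrete with a little arithmetic. Assuming the MMLU figures are plain accuracy over 24 questions, 0.292 and 0.333 correspond to 7 and 8 correct answers, so the two cards differ by a single question. The binomial standard error used below is an illustrative framing, not the method used in the release notes.

```python
import math

N = 24  # questions per run, from "MMLU n=24"
SCORES = {"sm_86": 0.292, "sm_89": 0.333}  # figures from the table


def correct_count(score, n=N):
    """Nearest whole number of correct answers for a reported accuracy."""
    return round(score * n)


def binomial_se(p, n=N):
    """Standard error of an accuracy estimate from n independent questions."""
    return math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for arch, score in SCORES.items():
        k = correct_count(score)
        print(f"{arch}: {k}/{N} correct, standard error ~ {binomial_se(k / N):.3f}")
```

With a standard error of roughly 0.09 to 0.10 at this sample size, a gap of about 0.04 between the two runs is well inside one standard error.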
Known issue
The prebuilt Docker image (ghcr.io/cnygaard/glq-env) currently cannot
serve via vLLM on GPU: its vLLM wheel is a CUDA-13 build while
the image pins CUDA-12.8 torch (import vllm._C → libcudart.so.13).
The pip package is unaffected; the HF-transformers path in the image
works. A Dockerfile CUDA-alignment fix is in progress.
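A quick way to check whether an environment is affected is to compare the CUDA version torch was built against with what the vLLM extension needs. The following is a diagnostic sketch with hypothetical helper names; it relies only on `torch.version.cuda` and on attempting the `vllm._C` import mentioned above, and it degrades gracefully when either package is missing.

```python
import importlib


def cuda_major(version):
    """Extract the major version from a string such as '12.8'; None if absent."""
    if not version:
        return None
    return int(str(version).split(".")[0])


def runtime_mismatch(torch_cuda, required_major):
    """True when torch's CUDA major version differs from the one required."""
    major = cuda_major(torch_cuda)
    return major is not None and major != required_major


def diagnose():
    """Collect torch's CUDA version and the result of importing vllm._C."""
    report = {}
    try:
        import torch
        report["torch_cuda"] = torch.version.cuda
    except ImportError:
        report["torch_cuda"] = None
    try:
        importlib.import_module("vllm._C")
        report["vllm_C"] = "ok"
    except Exception as exc:  # a missing shared library surfaces as ImportError
        report["vllm_C"] = f"failed: {exc}"
    return report


if __name__ == "__main__":
    info = diagnose()
    print(info)
    # The image described above pairs CUDA-12.8 torch with a CUDA-13 vLLM wheel.
    print("mismatch vs CUDA 13:", runtime_mismatch(info["torch_cuda"], 13))
```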
Install
pip install glq
Full changelog: https://github.com/cnygaard/glq/compare/v0.5.0...v0.5.1
Breaking Changes
- Changed default read path for E8 KV cache to inline-dequant v3; previous opt-in flag now implicit.