Glq

v0.5.1 Breaking

This release includes 1 breaking change that platform teams should review before upgrading.

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

ReleasePort's take

Light signal

The inline-dequant E8 KV-cache path is now the default E8-KV read path in vLLM.

Why it matters: When the E8 KV cache is active, 4 bpw KV recipes now use the v3 inline-dequant attention path with no extra flag; other recipes fall back to the 65 K workspace path automatically, preserving existing behavior. The release notes report correctness validation on three GPU architectures but publish no latency figures, and the change is reversible with GLQ_KV_E8_INLINE_DEQUANT_V3=0.

Summary

AI summary

Inline-dequant E8 KV cache path becomes the default read mechanism in vLLM.

Changes in this release

Feature Medium

Inline-dequant E8 KV-cache path becomes the default read path.

Source: llm_adapter@2026-06-04

Confidence: high

Feature Low

v3 inline-dequant attention is used by default when E8 KV cache is active.

Source: granite4.1:30b@2026-06-04-audit

Confidence: low

Bugfix Low

Known issue, not yet fixed: the prebuilt Docker image cannot serve vLLM on GPU because of a CUDA version mismatch. A Dockerfile fix is in progress.

Source: llm_adapter@2026-06-04

Confidence: high

Refactor Low

4 bpw KV recipes automatically use the new default path; other recipes fall back to the 65 K workspace path.

Source: granite4.1:30b@2026-06-04-audit

Confidence: low

Full changelog

v0.5.1 — inline-dequant E8 KV is now the default

v0.5.0 shipped the inline-dequant E8 KV-cache path as opt-in. After
validating it across the consumer GPU lineup, v0.5.1 makes it the
default E8-KV read path.

When the E8 KV cache is active, vLLM now uses the v3 inline-dequant
attention (4 K codebook, FHT-butterfly inverse Hadamard, flash-decoding
KV-split, FULL cudagraph capture) by default — no extra flag. 4 bpw KV
recipes use it; other recipes fall back to the 65 K workspace path
automatically.
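
The selection rule in the paragraph above can be modelled as a small function. This is only an illustrative sketch of the documented behaviour, not GLQ's dispatch code: the function name `select_kv_read_path`, its arguments, and the returned labels are all hypothetical.

```python
def select_kv_read_path(e8_kv_active: bool, kv_bpw: float) -> str:
    """Hypothetical model of the v0.5.1 default described in the release notes.

    e8_kv_active: whether the E8 KV cache is enabled at all.
    kv_bpw:       bits per weight of the KV recipe in use (assumed to be known).
    """
    if not e8_kv_active:
        # The release only changes behaviour when the E8 KV cache is active.
        return "e8-kv-inactive"
    if kv_bpw == 4:
        # 4 bpw KV recipes take the v3 inline-dequant attention path by default.
        return "inline-dequant-v3"
    # Every other recipe falls back automatically.
    return "workspace-65k"


print(select_kv_read_path(True, 4))  # inline-dequant-v3
print(select_kv_read_path(True, 2))  # workspace-65k
```

The point of the sketch is that the fallback needs no user action: only the recipe's bit width decides which path is taken.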

Enabling (unchanged bundle, no extra flag)

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

Opt-outs: GLQ_KV_E8_INLINE_DEQUANT_V3=0 (revert to the 65 K
workspace path) or GLQ_KV_E8_FORCE_PIECEWISE=1 (keep inline, disable
the FULL decode graph). Fully reversible.
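
For scripted launches, the bundle and the two opt-outs can be assembled in Python before starting the server. This is a minimal sketch: the environment variable names, their values, and the model ID are copied from these notes, while the helper names `build_env` and `serve` are hypothetical and not part of GLQ or vLLM.

```python
import os
import subprocess

# Variables copied verbatim from the "Enabling" bundle in these notes.
E8_KV_BUNDLE = {
    "GLQ_KV_QUANT": "e8_relaxed:2",
    "GLQ_KV_E8_SIDECAR": "1",
    "GLQ_KV_E8_SIDECAR_READ": "1",
    "GLQ_KV_E8_COMPRESSED_ALLOC": "1",
    "GLQ_KV_E8_FUSED_GATHER": "1",
    "GLQ_KV_E8_FUSED_WRITE": "1",
}


def build_env(inline_v3=True, full_graph=True, base=None):
    """Return an environment dict for `vllm serve` (hypothetical helper).

    inline_v3=False  sets GLQ_KV_E8_INLINE_DEQUANT_V3=0 (65 K workspace path).
    full_graph=False sets GLQ_KV_E8_FORCE_PIECEWISE=1 (inline, no FULL graph).
    """
    env = dict(os.environ if base is None else base)
    env.update(E8_KV_BUNDLE)
    if not inline_v3:
        env["GLQ_KV_E8_INLINE_DEQUANT_V3"] = "0"
    if not full_graph:
        env["GLQ_KV_E8_FORCE_PIECEWISE"] = "1"
    return env


def serve(model, **opt_outs):
    """Start `vllm serve MODEL` with the bundle applied; blocks until exit."""
    return subprocess.run(["vllm", "serve", model],
                          env=build_env(**opt_outs), check=True)
```

Calling `serve("xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw", inline_v3=False)` would then reproduce the v0.5.0-style workspace path, assuming `vllm` is on the PATH.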

Consumer-GPU validation (what gated the flip)

| Arch | Card (class) | Result |
|---|---|---|
| sm_86 Ampere | A10G / 3090, 24 GB | NIAH-16k 3/3, MMLU n=24 0.292 |
| sm_89 Ada | L40S / 4090 | NIAH-16k 3/3, MMLU n=24 0.333 |
| sm_120 Blackwell | RTX PRO 6000 / 5090 | full A/B, FULL == PIECEWISE quality-neutral |

The v3 Triton kernels compile and produce correct output on all three
architectures (MMLU figures are within SmolLM3-3B's small-sample noise
band). FULL-vs-PIECEWISE quality-neutrality was established rigorously on
Blackwell (bit-identical on SmolLM3; within vLLM's own greedy
non-determinism on Gemma-4); the consumer-card runs are shorter FULL-only
smokes.
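
To see why the two MMLU figures sit inside the noise band, it helps to work out what n=24 implies. The sketch below assumes each score is a plain fraction of 24 independently answered questions and uses the binomial standard error; both are assumptions made for illustration, not a method the release notes describe.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent questions."""
    return math.sqrt(p * (1.0 - p) / n)


N = 24  # sample size reported in the validation table
scores = {"sm_86": 0.292, "sm_89": 0.333}  # MMLU figures from the table

for arch, acc in scores.items():
    correct = round(acc * N)  # implied number of correct answers
    print(f"{arch}: about {correct}/{N} correct, "
          f"standard error about {binomial_se(acc, N):.3f}")

gap = abs(scores["sm_89"] - scores["sm_86"])
print(f"gap between the two cards: {gap:.3f}")
```

Under these assumptions the two scores correspond to 7 and 8 correct answers, so the gap is a single question, while the standard error of either score is more than twice that gap.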

Known issue

The prebuilt Docker image (ghcr.io/cnygaard/glq-env) currently cannot
serve via vLLM on GPU: its vLLM wheel is a CUDA-13 build while the
image pins CUDA-12.8 torch (import vllm._C → libcudart.so.13).
The pip package is unaffected, and the HF-transformers path in the image
works. A Dockerfile CUDA-alignment fix is in progress.
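
A quick way to check whether an environment has this kind of mismatch is to compare the CUDA version torch was built against with the runtime library the vLLM wheel needs. The following is a rough diagnostic sketch, not a tool shipped with GLQ: the helper names are hypothetical, and the library name `libcudart.so.13` is taken from the description above.

```python
import ctypes


def cuda_major(version: str) -> int:
    """Extract the major version from a string such as '12.8'."""
    return int(version.split(".")[0])


def runtime_loadable(soname: str) -> bool:
    """Return True if the dynamic loader can find and open `soname`."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


def versions_mismatch(torch_cuda: str, wheel_cuda_major: int) -> bool:
    """True when torch and the vLLM wheel target different CUDA majors."""
    return cuda_major(torch_cuda) != wheel_cuda_major


def installed_torch_cuda():
    """CUDA version torch was built with, or None if torch is absent or CPU-only."""
    try:
        import torch
    except ImportError:
        return None
    return torch.version.cuda


if __name__ == "__main__":
    torch_cuda = installed_torch_cuda()
    print("torch CUDA build:", torch_cuda)
    print("libcudart.so.13 loadable:", runtime_loadable("libcudart.so.13"))
    if torch_cuda is not None:
        print("mismatch with a CUDA-13 wheel:", versions_mismatch(torch_cuda, 13))
```

In the affected image one would expect a 12.x torch build together with a missing `libcudart.so.13`, which matches the mismatch described above.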

Install

pip install glq

Full changelog: https://github.com/cnygaard/glq/compare/v0.5.0...v0.5.1

Breaking Changes

  • Changed default read path for E8 KV cache to inline-dequant v3; previous opt-in flag now implicit.
