Glq

v0.5.1 Breaking

This release includes one breaking change that platform teams should plan for before upgrading.

Published 1 month ago · Model Serving & MLOps


✓ No known CVEs patched



Topics

inference llm model-compression pytorch quantization

ReleasePort's take


The inline-dequant E8 KV-cache path is now the default read path in vLLM.

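Why it matters: deployments that already enable the E8 KV cache move to the v3 inline-dequant attention path on upgrade, with no new flag. Only 4 bpw KV recipes take the new path; other recipes fall back automatically to the 65 K workspace path (the v0.5.0 default). The release notes report quality validation but no latency or throughput figures, so benchmark before counting on speedups. The change is reversible with GLQ_KV_E8_INLINE_DEQUANT_V3=0.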

Summary


Inline-dequant E8 KV cache path becomes the default read mechanism in vLLM.

Changes in this release

| Type | Severity | Summary | Confidence | CVE |
|---|---|---|---|---|
| Feature | Medium | The inline-dequant E8 KV-cache path becomes the default read path. | High | None |
| Feature | Low | v3 inline-dequant attention is used by default when the E8 KV cache is active. | Low | None |
| Known issue | Low | The prebuilt Docker image cannot serve via vLLM on GPU because of a CUDA version mismatch (fix in progress). | High | None |
| Refactor | Low | 4 bpw KV recipes use the new default path automatically; other recipes fall back to the 65 K workspace path. | Low | None |

Full changelog

v0.5.1 — inline-dequant E8 KV is now the default

v0.5.0 shipped the inline-dequant E8 KV-cache path as opt-in. After
validating it across the consumer GPU lineup, v0.5.1 makes it the
default E8-KV read path.

When the E8 KV cache is active, vLLM now uses the v3 inline-dequant
attention (4 K codebook, FHT-butterfly inverse Hadamard, flash-decoding
KV-split, FULL cudagraph capture) by default — no extra flag. 4 bpw KV
recipes use it; other recipes fall back to the 65 K workspace path
automatically.
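The selection rule above can be condensed into a small decision function. This is a hypothetical sketch, not glq's dispatch code: the function name e8_read_path, the labels it prints, and the convention that an unset GLQ_KV_E8_INLINE_DEQUANT_V3 behaves like 1 are assumptions for illustration. It assumes the E8 KV cache is already active, takes the KV recipe's bits per weight, and reads the two opt-out variables documented in these notes.

```shell
# Hypothetical sketch of the v0.5.1 read-path selection rule.
# e8_read_path is illustrative only; it is not part of glq or vLLM.
# Assumes the E8 KV cache is already active.
# $1: bits per weight of the KV recipe (for example 4).
e8_read_path() {
  bpw=$1
  # Explicit opt-out, or a recipe other than 4 bpw: 65 K workspace path.
  if [ "${GLQ_KV_E8_INLINE_DEQUANT_V3:-1}" = "0" ] || [ "$bpw" != "4" ]; then
    echo "workspace-65K"
  # Inline path kept, but the FULL decode cudagraph is disabled.
  elif [ "${GLQ_KV_E8_FORCE_PIECEWISE:-0}" = "1" ]; then
    echo "inline-v3/piecewise"
  # v0.5.1 default: v3 inline-dequant with FULL cudagraph capture.
  else
    echo "inline-v3/full"
  fi
}

# Example: a 4 bpw recipe with no opt-out variables set.
e8_read_path 4
```

The ordering matters: the workspace fallback is checked first, because GLQ_KV_E8_FORCE_PIECEWISE only modifies the inline path and has nothing to disable once the workspace path is chosen.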

Enabling (unchanged bundle, no extra flag)

```shell
GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw
```

Opt-outs: GLQ_KV_E8_INLINE_DEQUANT_V3=0 (revert to the 65 K
workspace path) or GLQ_KV_E8_FORCE_PIECEWISE=1 (keep inline, disable
the FULL decode graph). Fully reversible.
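To switch between the default and the two opt-outs without editing the enabling command by hand, the bundle can be wrapped in a small launcher. This is a sketch under stated assumptions: serve_e8, its mode names, and the DRY_RUN and MODEL variables are hypothetical conveniences, not part of glq or vLLM; only the GLQ_KV_* variables and the vllm serve invocation come from these notes.

```shell
# Hypothetical launcher; serve_e8 is not part of glq or vLLM.
# Usage: serve_e8 [default|workspace|piecewise]
# Set DRY_RUN=1 to print the command instead of running it.
serve_e8() {
  mode=${1:-default}
  case "$mode" in
    default)   extra="" ;;                               # v3 inline-dequant, FULL decode graph
    workspace) extra="GLQ_KV_E8_INLINE_DEQUANT_V3=0" ;;  # revert to the 65 K workspace path
    piecewise) extra="GLQ_KV_E8_FORCE_PIECEWISE=1" ;;    # keep inline, disable the FULL decode graph
    *) echo "unknown mode: $mode" >&2; return 2 ;;
  esac
  # Same bundle as the enabling example. $extra is deliberately unquoted
  # so that an empty value adds no argument.
  set -- env \
    GLQ_KV_QUANT=e8_relaxed:2 \
    GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
    GLQ_KV_E8_COMPRESSED_ALLOC=1 \
    GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
    $extra vllm serve "${MODEL:-xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw}"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 serve_e8 workspace` prints the enabling command with GLQ_KV_E8_INLINE_DEQUANT_V3=0 added after the bundle; without DRY_RUN the same call starts the server. Passing the variables through `env` scopes them to the vllm process instead of exporting them into the calling shell.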

Consumer-GPU validation (what gated the flip)

| Arch | Card (class) | Result |
|---|---|---|
| sm_86 Ampere | A10G / 3090, 24 GB | NIAH-16k 3/3, MMLU n=24 0.292 |
| sm_89 Ada | L40S / 4090 | NIAH-16k 3/3, MMLU n=24 0.333 |
| sm_120 Blackwell | RTX PRO 6000 / 5090 | full A/B, FULL == PIECEWISE quality-neutral |

The v3 Triton kernels compile and produce correct output on all three
architectures (MMLU figures are within SmolLM3-3B's small-sample noise
band). FULL-vs-PIECEWISE quality-neutrality was established rigorously on
Blackwell (bit-identical on SmolLM3; within vLLM's own greedy
non-determinism on Gemma-4); the consumer-card runs are shorter FULL-only
smokes.

Known issue

The prebuilt Docker image (ghcr.io/cnygaard/glq-env) currently cannot
serve via vLLM on GPU: its vLLM wheel is a CUDA-13 build, while the
image pins a CUDA-12.8 torch, so import vllm._C fails looking for
libcudart.so.13. The pip package is unaffected, and the HF-transformers
path in the image works. A Dockerfile CUDA-alignment fix is in progress.

Install

```shell
pip install glq
```

Full changelog: https://github.com/cnygaard/glq/compare/v0.5.0...v0.5.1

Breaking Changes

The default read path for the E8 KV cache is now inline-dequant v3. What was opt-in in v0.5.0 is now implicit; set GLQ_KV_E8_INLINE_DEQUANT_V3=0 to restore the 65 K workspace path.
