This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Medium | Adds opt-in inline-dequant path for E8 lattice KV cache in vLLM. Source: llm_adapter@2026-06-02. Confidence: high. | None |
| Feature | Low | Ensures bit-identical quality between inline and previous PIECEWISE E8-KV paths for SmolLM3. Source: llm_adapter@2026-06-02. Confidence: high. | None |
| Performance | Medium | Improves decode throughput up to 3.4× with inline dequantization (batch=4). Source: llm_adapter@2026-06-02. Confidence: low. | None |
| Bugfix | Low | Fixes CI regression caused by missing `transformers` skip-guard on CPU-only runner. Source: llm_adapter@2026-06-02. Confidence: high. | None |
Full changelog
Highlights — opt-in inline-dequant E8 KV cache
The headline of v0.5.0 is a new inline-dequant path for the E8 lattice KV
cache on vLLM — the recommended path for long-context / KV-bound serving.
A forked Triton attention kernel dequantizes the compressed E8 K/V inside
the attention tile loop (an 8-point FHT butterfly for the inverse Hadamard,
plus flash-decoding KV-split for long-context occupancy). There is no
decompress-to-workspace pass, and — because the read/write hooks are
host-sync-clean — the FULL CUDA graph captures the whole decode,
eliminating the per-token eager-dispatch overhead that dominated E8-KV
decode. This brings E8-KV decode to roughly weight-only parity.
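The butterfly structure mentioned above can be illustrated outside the kernel. The following is a minimal NumPy sketch of an unnormalized 8-point fast Hadamard transform and its inverse. It is not the project's Triton code: the function names (`fht8`, `inverse_fht8`), the Sylvester ordering, and the 1/8 scaling on the inverse are assumptions made for illustration.

```python
import numpy as np


def fht8(x):
    """Unnormalized 8-point fast Walsh-Hadamard transform (Sylvester order).

    Runs log2(8) = 3 butterfly stages; each stage combines 4 pairs
    (a, b) -> (a + b, a - b), with the pair distance doubling per stage.
    """
    y = np.asarray(x, dtype=np.float64).copy()
    if y.shape != (8,):
        raise ValueError("fht8 expects exactly 8 values")
    h = 1
    while h < 8:
        for start in range(0, 8, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


def inverse_fht8(y):
    """Inverse transform: the Sylvester Hadamard matrix H satisfies H @ H = 8 * I,
    so applying the same butterflies and dividing by 8 undoes fht8."""
    return fht8(y) / 8.0


if __name__ == "__main__":
    x = np.arange(8, dtype=np.float64)
    roundtrip = inverse_fht8(fht8(x))
    print(np.allclose(roundtrip, x))
```

Because the transform is its own inverse up to a scale factor, a kernel only needs one butterfly routine; where the scale factor is applied (forward, inverse, or split between both) is a convention the release notes do not specify.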
Validated decode throughput
SmolLM3-3B-GLQ-3.5bpw, RTX PRO 6000 Blackwell, vLLM 0.20.2 — inline vs the
pre-v0.5 E8-KV path (workspace, PIECEWISE):
| Configuration | Before v0.5 (tok/s) | Inline, v0.5 (tok/s) |
|---|--:|--:|
| B=1 | ~15 | 38 (2.5×) |
| B=4 | ~37 | 127 (3.4×) |
| ctx=16k, B=1 | ~15 | 36 (2.4×) |
On Gemma-4-E4B-it (large heads, already compute-bound) decode is roughly
unchanged, but quality and long-context behaviour match.
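The speedup factors quoted for SmolLM3 follow directly from the tokens-per-second figures. A small sketch of that arithmetic, assuming the approximate "before" values (marked ~ in the release notes) are taken at face value; the dictionary layout and variable names are illustrative only:

```python
# (before v0.5, inline v0.5) decode throughput in tok/s, as reported in the notes.
# The "before" numbers are approximate in the source.
measurements = {
    "B=1": (15, 38),
    "B=4": (37, 127),
    "ctx=16k, B=1": (15, 36),
}

# Speedup = inline throughput / previous throughput, rounded to one decimal.
speedups = {
    name: round(after / before, 1)
    for name, (before, after) in measurements.items()
}

for name, factor in speedups.items():
    print(f"{name}: {factor}x")
```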
Quality — neutral
- SmolLM3: the inline-FULL path is bit-identical to the previous
  PIECEWISE path (MMLU-Pro n=120 and NIAH-16k match exactly).
- Gemma-4-E4B-it: within vLLM's own run-to-run greedy non-determinism.
  MMLU-Pro n=120, thinking, 16384-token budget: PIECEWISE 0.742 vs
  inline-FULL 0.750 (a smaller gap than the difference between two
  PIECEWISE runs); NIAH-16k 10/10 for both.
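To put the Gemma MMLU-Pro gap in perspective, the scores can be converted back into question counts. This sketch assumes the reported scores are plain accuracy (correct answers divided by n=120), which the release notes do not state explicitly:

```python
n = 120  # MMLU-Pro sample size from the release notes

# Recover the number of correct answers from the rounded accuracy figures
# (assumes score = correct / n).
piecewise_correct = round(0.742 * n)
inline_correct = round(0.750 * n)

gap_in_questions = inline_correct - piecewise_correct
print(piecewise_correct, inline_correct, gap_in_questions)
```

Under that assumption the two paths differ by a single question out of 120, which is consistent with the notes attributing the gap to run-to-run non-determinism.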
Enabling it (opt-in)
GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
GLQ_KV_E8_INLINE_DEQUANT_V3=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw
FULL CUDA-graph capture is the default for this path;
GLQ_KV_E8_FORCE_PIECEWISE=1 reverts to PIECEWISE.
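For deployments launched from Python rather than a shell, the same settings can be assembled programmatically. This is a minimal sketch: the environment variable names and model ID are taken from the command above, while `build_launch` and its parameters are hypothetical helpers, not part of glq or vLLM. The sketch only builds the command and environment; it does not start a server.

```python
import os
import shlex

# Environment variables from the release notes' opt-in command.
E8_INLINE_ENV = {
    "GLQ_KV_QUANT": "e8_relaxed:2",
    "GLQ_KV_E8_SIDECAR": "1",
    "GLQ_KV_E8_SIDECAR_READ": "1",
    "GLQ_KV_E8_COMPRESSED_ALLOC": "1",
    "GLQ_KV_E8_FUSED_GATHER": "1",
    "GLQ_KV_E8_FUSED_WRITE": "1",
    "GLQ_KV_E8_INLINE_DEQUANT_V3": "1",
}


def build_launch(model, force_piecewise=False, base_env=None):
    """Return (cmd, env) for `vllm serve` with the inline-dequant path enabled.

    force_piecewise=True adds GLQ_KV_E8_FORCE_PIECEWISE=1, which the notes
    describe as reverting from FULL CUDA-graph capture to PIECEWISE.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(E8_INLINE_ENV)
    if force_piecewise:
        env["GLQ_KV_E8_FORCE_PIECEWISE"] = "1"
    cmd = ["vllm", "serve", model]
    return cmd, env


cmd, env = build_launch("xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw")
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```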
Scope (why it's opt-in this release)
- Covers the 4 bpw KV recipe (`e8_relaxed:2`); other recipes
  automatically fall back to the v0.3.x workspace path.
- Requires the Triton attention backend (auto-forced when E8 KV is active).
- Validated on Gemma-4-E4B-it + SmolLM3-3B / vLLM 0.20.2 / RTX PRO 6000
Blackwell — not yet on 24–32 GB consumer GPUs or other architectures,
which is why it remains opt-in for now.
Also in this release
- Weight-only GLQ and the v0.3.x workspace E8-KV path are unchanged;
  default behaviour for existing checkpoints is identical to 0.3.5.
- The test CI is green again (a missing `transformers` skip-guard on one HF
  comparison test had been failing every release tag on the CPU-only
  runner).
Install
pip install glq
Full changelog: https://github.com/cnygaard/glq/compare/v0.3.5...v0.5.0