Glq

v0.5.0 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

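Adds an opt-in inline-dequant path for the E8 lattice KV cache on vLLM, reports up to 3.4× faster decode on SmolLM3 at neutral quality, and fixes a CI regression on the CPU-only runner.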

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	Adds opt-in inline-dequant path for E8 lattice KV cache in vLLM.	llm_adapter@2026-06-02	high	—
Feature	Low	Ensures bit-identical quality between inline and previous PIECEWISE E8-KV paths for SmolLM3.	llm_adapter@2026-06-02	high	—
Performance	Medium	Improves decode throughput up to 3.4× with inline dequantization (batch=4).	llm_adapter@2026-06-02	low	—
Bugfix	Low	Fixes CI regression caused by missing `transformers` skip-guard on CPU-only runner.	llm_adapter@2026-06-02	high	—

Full changelog

Highlights — opt-in inline-dequant E8 KV cache

The headline of v0.5.0 is a new inline-dequant path for the E8 lattice KV
cache on vLLM — the recommended path for long-context / KV-bound serving.

A forked Triton attention kernel dequantizes the compressed E8 K/V inside
the attention tile loop (an 8-point FHT butterfly for the inverse Hadamard,
plus flash-decoding KV-split for long-context occupancy). There is no
decompress-to-workspace pass, and — because the read/write hooks are
host-sync-clean — the FULL CUDA graph captures the whole decode,
eliminating the per-token eager-dispatch overhead that dominated E8-KV
decode. This brings E8-KV decode to roughly weight-only parity.
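As a rough illustration of the "8-point FHT butterfly" mentioned above, here is a minimal pure-Python sketch of an 8-point fast Walsh-Hadamard transform and its inverse. It is not the forked Triton kernel: the function names are hypothetical, the Sylvester ordering and the 1/8 scaling are assumptions, and the real kernel may normalize or order its outputs differently.

```python
def fht8(x):
    """Unnormalized 8-point fast Walsh-Hadamard transform (3 butterfly stages)."""
    if len(x) != 8:
        raise ValueError("expected exactly 8 values")
    y = list(x)
    h = 1
    while h < 8:
        # Within each block of 2*h values, pair element i with element i+h.
        for start in range(0, 8, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y


def inverse_fht8(y):
    """Inverse transform: H8 @ H8 = 8 * I, so reuse the butterfly and divide by 8."""
    return [v / 8 for v in fht8(y)]


def hadamard_matrix(n):
    """Dense Sylvester Hadamard matrix, used here only as a reference check."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]
    print(fht8(x))
    print(inverse_fht8(fht8(x)))
```

The butterfly needs 3 stages of 4 add/subtract pairs (24 operations) instead of the 64 multiply-adds of a dense 8×8 matrix product, which is what makes it cheap enough to run inside an attention tile loop.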

Validated decode throughput

SmolLM3-3B-GLQ-3.5bpw, RTX PRO 6000 Blackwell, vLLM 0.20.2 — inline vs the
pre-v0.5 E8-KV path (workspace, PIECEWISE):

| Configuration | Before v0.5 (tok/s) | Inline, v0.5 (tok/s) |
|---|--:|--:|
| B=1 | ~15 | 38 (2.5×) |
| B=4 | ~37 | 127 (3.4×) |
| ctx=16k, B=1 | ~15 | 36 (2.4×) |

On Gemma-4-E4B-it (large heads, already compute-bound) decode is roughly
unchanged, but quality and long-context behaviour match.
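The SmolLM3 speedup factors quoted in the table can be recomputed from the throughput figures. This short sketch only repeats that arithmetic; the "before" values are approximate in the source (marked with ~), so the ratios are approximate too.

```python
# (before v0.5 tok/s, inline v0.5 tok/s), copied from the table above.
throughput = {
    "B=1": (15, 38),
    "B=4": (37, 127),
    "ctx=16k, B=1": (15, 36),
}

# Speedup = inline throughput / previous throughput, rounded to one decimal.
speedups = {
    name: round(after / before, 1)
    for name, (before, after) in throughput.items()
}

if __name__ == "__main__":
    for name, factor in speedups.items():
        print(f"{name}: {factor}x")
```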

Quality — neutral

- SmolLM3: the inline-FULL path is bit-identical to the previous
  PIECEWISE path (MMLU-Pro n=120 and NIAH-16k match exactly).
- Gemma-4-E4B-it: differences are within vLLM's own run-to-run greedy
  non-determinism. MMLU-Pro (n=120, thinking, 16384-token budget):
  PIECEWISE 0.742 vs inline-FULL 0.750, a smaller gap than the one
  between two PIECEWISE runs; NIAH-16k 10/10 for both.
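To put the Gemma-4-E4B-it MMLU-Pro gap in perspective, the sketch below converts the two scores back to question counts. It assumes each score is the fraction of the 120 questions answered correctly, rounded to three decimals; the release notes do not state the rounding, so treat this as an interpretation rather than a reported result.

```python
N_QUESTIONS = 120  # MMLU-Pro sample size stated in the release notes


def correct_count(score, n=N_QUESTIONS):
    """Nearest whole number of correct answers for a rounded accuracy score."""
    return round(score * n)


piecewise_correct = correct_count(0.742)  # PIECEWISE score
inline_correct = correct_count(0.750)     # inline-FULL score
gap_in_questions = inline_correct - piecewise_correct

if __name__ == "__main__":
    print(piecewise_correct, inline_correct, gap_in_questions)
```

Under that assumption the two paths differ by a single question out of 120, which is consistent with the statement that the gap is within run-to-run noise.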

Enabling it (opt-in)

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
GLQ_KV_E8_INLINE_DEQUANT_V3=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

FULL CUDA-graph capture is the default for this path;
GLQ_KV_E8_FORCE_PIECEWISE=1 reverts to PIECEWISE.
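For deployments that start the server from Python rather than a shell, the same configuration can be expressed as an environment dictionary. This is a minimal sketch: the variable names and values are copied from the command above, while the helper names `build_env` and `serve` are hypothetical and not part of GLQ or vLLM.

```python
import os
import subprocess

# Environment variables from the release notes' opt-in command.
INLINE_DEQUANT_ENV = {
    "GLQ_KV_QUANT": "e8_relaxed:2",
    "GLQ_KV_E8_SIDECAR": "1",
    "GLQ_KV_E8_SIDECAR_READ": "1",
    "GLQ_KV_E8_COMPRESSED_ALLOC": "1",
    "GLQ_KV_E8_FUSED_GATHER": "1",
    "GLQ_KV_E8_FUSED_WRITE": "1",
    "GLQ_KV_E8_INLINE_DEQUANT_V3": "1",
}


def build_env(force_piecewise=False, base=None):
    """Return a copy of the base environment with the inline-dequant flags set."""
    env = dict(os.environ if base is None else base)
    env.update(INLINE_DEQUANT_ENV)
    if force_piecewise:
        # Reverts from FULL CUDA-graph capture to PIECEWISE, per the notes.
        env["GLQ_KV_E8_FORCE_PIECEWISE"] = "1"
    return env


def serve(model="xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw", force_piecewise=False):
    """Start `vllm serve` with the flags applied (hypothetical helper)."""
    return subprocess.Popen(
        ["vllm", "serve", model],
        env=build_env(force_piecewise),
    )
```

Calling `serve()` starts the server with FULL CUDA-graph capture, and `serve(force_piecewise=True)` adds the PIECEWISE override. Keeping the flags in one dictionary makes it easier to compare the two modes on the same hardware.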

Scope (why it's opt-in this release)

- Covers the 4 bpw KV recipe (e8_relaxed:2); other recipes
  automatically fall back to the v0.3.x workspace path.
- Requires the Triton attention backend (auto-forced when E8 KV is
  active).
- Validated on Gemma-4-E4B-it + SmolLM3-3B / vLLM 0.20.2 / RTX PRO 6000
  Blackwell, but not yet on 24–32 GB consumer GPUs or other
  architectures, which is why it remains opt-in for now.

Also in this release

- Weight-only GLQ and the v0.3.x workspace E8-KV path are unchanged:
  default behaviour for existing checkpoints is identical to 0.3.5.
- The test CI is green again. A missing transformers skip-guard on one
  HF comparison test had been failing every release tag on the CPU-only
  runner.

Install

pip install glq

Full changelog: https://github.com/cnygaard/glq/compare/v0.3.5...v0.5.0
