Glq

v0.5.0 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Adds an opt-in inline-dequant path for the E8 lattice KV cache on vLLM, reports neutral quality and higher decode throughput on SmolLM3, and fixes a CI regression.

Changes in this release

Feature Medium

Adds opt‑in inline‑dequant path for E8 lattice KV cache in vLLM.

Source: llm_adapter@2026-06-02

Confidence: high

Feature Low

Ensures bit‑identical quality between inline and previous PIECEWISE E8-KV paths for SmolLM3.

Source: llm_adapter@2026-06-02

Confidence: high

Performance Medium

Improves decode throughput up to 3.4× with inline dequantization (batch=4).

Source: llm_adapter@2026-06-02

Confidence: low

Bugfix Low

Fixes CI regression caused by missing `transformers` skip‑guard on CPU‑only runner.

Source: llm_adapter@2026-06-02

Confidence: high

Full changelog

Highlights — opt-in inline-dequant E8 KV cache

The headline of v0.5.0 is a new inline-dequant path for the E8 lattice KV
cache on vLLM, the recommended path for long-context / KV-bound serving.

A forked Triton attention kernel dequantizes the compressed E8 K/V inside
the attention tile loop (an 8-point FHT butterfly for the inverse Hadamard,
plus flash-decoding KV-split for long-context occupancy). There is no
decompress-to-workspace pass, and — because the read/write hooks are
host-sync-clean — the FULL CUDA graph captures the whole decode,
eliminating the per-token eager-dispatch overhead that dominated E8-KV
decode. This brings E8-KV decode to roughly weight-only parity.
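To make the "8-point FHT butterfly" concrete, here is a minimal pure-Python sketch of an unnormalized 8-point fast Walsh-Hadamard transform and its inverse. It is a reference illustration only, not the forked Triton kernel: the function names `fht8` and `inverse_fht8`, the natural (Hadamard) output ordering, and the 1/8 normalization are assumptions, and the real kernel may scale or order its outputs differently.

```python
def fht8(values):
    """Unnormalized 8-point fast Walsh-Hadamard transform.

    Runs three butterfly stages (span 1, 2, 4); each stage replaces a pair
    (a, b) with (a + b, a - b). Illustrative reference, not the GLQ kernel.
    """
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    out = list(values)
    span = 1
    while span < 8:
        for start in range(0, 8, 2 * span):
            for i in range(start, start + span):
                a, b = out[i], out[i + span]
                out[i], out[i + span] = a + b, a - b
        span *= 2
    return out


def inverse_fht8(values):
    """Inverse transform: the same butterfly, then divide by 8 (assumed scaling)."""
    return [v / 8 for v in fht8(values)]
```

Because the 8×8 Hadamard matrix is symmetric and satisfies H·H = 8·I, running the same butterfly twice and dividing by 8 recovers the input, which is why the inverse needs no separate code path.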

Validated decode throughput

SmolLM3-3B-GLQ-3.5bpw, RTX PRO 6000 Blackwell, vLLM 0.20.2 — inline vs the
pre-v0.5 E8-KV path (workspace, PIECEWISE):

| Configuration | Before v0.5 (tok/s) | Inline, v0.5 (tok/s) |
|---|--:|--:|
| B=1 | ~15 | 38 (2.5×) |
| B=4 | ~37 | 127 (3.4×) |
| ctx=16k, B=1 | ~15 | 36 (2.4×) |
On Gemma-4-E4B-it (large heads, already compute-bound) decode is roughly
unchanged, but quality and long-context behaviour match.

Quality — neutral

  • SmolLM3: the inline-FULL path is bit-identical to the previous
    PIECEWISE path (MMLU-Pro n=120 and NIAH-16k match exactly).
  • Gemma-4-E4B-it: within vLLM's own run-to-run greedy non-determinism.
    MMLU-Pro n=120, thinking, 16384-token budget — PIECEWISE 0.742 vs
    inline-FULL 0.750 (a smaller gap than two PIECEWISE runs differ from each
    other); NIAH-16k 10/10 both.

Enabling it (opt-in)

GLQ_KV_QUANT=e8_relaxed:2 \
GLQ_KV_E8_SIDECAR=1 GLQ_KV_E8_SIDECAR_READ=1 \
GLQ_KV_E8_COMPRESSED_ALLOC=1 \
GLQ_KV_E8_FUSED_GATHER=1 GLQ_KV_E8_FUSED_WRITE=1 \
GLQ_KV_E8_INLINE_DEQUANT_V3=1 \
vllm serve xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw

FULL CUDA-graph capture is the default for this path;
GLQ_KV_E8_FORCE_PIECEWISE=1 reverts to PIECEWISE.
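When the server is launched from a script rather than a shell, the same flags can be assembled programmatically. The sketch below only builds the command and environment; the helper name `build_serve_invocation` and its parameters are hypothetical, while the environment variable names and values are the ones listed above.

```python
import os
import shlex

# Flags copied from the release notes for the opt-in inline-dequant path.
INLINE_DEQUANT_FLAGS = {
    "GLQ_KV_QUANT": "e8_relaxed:2",
    "GLQ_KV_E8_SIDECAR": "1",
    "GLQ_KV_E8_SIDECAR_READ": "1",
    "GLQ_KV_E8_COMPRESSED_ALLOC": "1",
    "GLQ_KV_E8_FUSED_GATHER": "1",
    "GLQ_KV_E8_FUSED_WRITE": "1",
    "GLQ_KV_E8_INLINE_DEQUANT_V3": "1",
}


def build_serve_invocation(model, force_piecewise=False, base_env=None):
    """Return (command, env) for `vllm serve` with the inline-dequant flags set.

    Hypothetical helper: it does not start the server itself.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(INLINE_DEQUANT_FLAGS)
    if force_piecewise:
        # Reverts from FULL CUDA-graph capture to PIECEWISE, per the notes.
        env["GLQ_KV_E8_FORCE_PIECEWISE"] = "1"
    return ["vllm", "serve", model], env


if __name__ == "__main__":
    command, env = build_serve_invocation("xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw")
    print(shlex.join(command))
    # To launch for real: subprocess.run(command, env=env, check=True)
```

Keeping the flags in one dictionary makes it easy to diff the opt-in configuration against the default one when comparing the inline and PIECEWISE paths.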

Scope (why it's opt-in this release)

  • Covers the 4 bpw KV recipe (e8_relaxed:2); other recipes
    automatically fall back to the v0.3.x workspace path.
  • Requires the Triton attention backend (auto-forced when E8 KV is active).
  • Validated on Gemma-4-E4B-it + SmolLM3-3B / vLLM 0.20.2 / RTX PRO 6000
    Blackwell — not yet on 24–32 GB consumer GPUs or other architectures,
    which is why it remains opt-in for now.

Also in this release

  • Weight-only GLQ and the v0.3.x workspace E8-KV path are unchanged;
    default behaviour for existing checkpoints is identical to 0.3.5.
  • Tests CI is green again (a missing transformers skip-guard on one HF
    comparison test had been reddening every release tag on the CPU-only
    runner).

Install

pip install glq

Full changelog: https://github.com/cnygaard/glq/compare/v0.3.5...v0.5.0
