Glq

v0.3.5 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 16d Model Serving & MLOps
✓ No known CVEs patched
Topics

inference llm model-compression pytorch quantization

Summary

AI summary

GLQ v0.3.5 auto‑downgrades cudagraph_mode to PIECEWISE when E8 KV is active, removing the need for --enforce-eager.

Full changelog

TL;DR: Users running GLQ models with the E8 KV cache stack (GLQ_KV_E8_*=1) no longer need --enforce-eager. glq_vllm auto-downgrades cudagraph_mode to PIECEWISE when E8 KV is active.

What changed

v0.3.2 made vLLM's FULL_AND_PIECEWISE capture mode the default for weight-only GLQ (2.78× on E4B at B=1). That same default crashed for users with GLQ_KV_E8_*=1: the FULL capture wraps our patched attention forward, where the block_table.flatten().unique() call is illegal during CUDA-graph capture (cudaErrorStreamCaptureUnsupported), because .unique() produces a data-dependent output size and has to synchronize with the host.

v0.3.5 hooks EngineArgs.create_engine_config and downgrades any FULL_* cudagraph_mode to PIECEWISE when E8 KV is active. PIECEWISE splits the graph at vllm::unified_attention_with_output, so our patched attention runs in the eager break between captured subgraphs: .unique() is legal there, the scoped-gather optimisation works, and no kernel changes are needed.
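The hook described above can be pictured roughly as follows. This is a minimal sketch, not the glq_vllm implementation: the classes below are stand-ins defined only so the example runs without vLLM, and `install_piecewise_downgrade` and the `e8_kv_active` callable are hypothetical names. It treats every mode whose name starts with `FULL` as a "FULL_*" mode, which is an assumption about how the release interprets that pattern.

```python
import enum
import functools
from dataclasses import dataclass, field


class CUDAGraphMode(enum.Enum):
    # Stand-in enum; only the member names matter for this sketch.
    NONE = "NONE"
    PIECEWISE = "PIECEWISE"
    FULL = "FULL"
    FULL_DECODE_ONLY = "FULL_DECODE_ONLY"
    FULL_AND_PIECEWISE = "FULL_AND_PIECEWISE"


@dataclass
class CompilationConfig:
    # Stand-in for the part of the engine config that carries the capture mode.
    cudagraph_mode: CUDAGraphMode = CUDAGraphMode.FULL_AND_PIECEWISE


@dataclass
class EngineConfig:
    compilation_config: CompilationConfig = field(default_factory=CompilationConfig)


class EngineArgs:
    # Stand-in for the class whose create_engine_config gets hooked.
    def create_engine_config(self) -> EngineConfig:
        return EngineConfig()


def install_piecewise_downgrade(engine_args_cls, e8_kv_active):
    """Wrap create_engine_config so FULL* modes become PIECEWISE when E8 KV is on.

    `e8_kv_active` is a zero-argument callable (hypothetical) that reports
    whether the E8 KV cache stack is enabled.
    """
    original = engine_args_cls.create_engine_config

    @functools.wraps(original)
    def wrapped(self, *args, **kwargs):
        config = original(self, *args, **kwargs)
        mode = config.compilation_config.cudagraph_mode
        # Only touch the config when E8 KV is active and a FULL capture is requested.
        if e8_kv_active() and mode.name.startswith("FULL"):
            config.compilation_config.cudagraph_mode = CUDAGraphMode.PIECEWISE
            print(f"[sketch] cudagraph_mode forced from {mode.name} to PIECEWISE")
        return config

    engine_args_cls.create_engine_config = wrapped
```

Wrapping the config factory, rather than editing user-supplied arguments, means the downgrade applies no matter how the mode was requested, and it leaves the result untouched when the gate is off.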

Verification on Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell

| Config | cudagraph_mode | Captures fire | Output |
|---|---|---|---|
| E8 KV (no --enforce-eager) | auto → PIECEWISE | PIECEWISE only | clean |
| Weight-only (no E8 KV envs) | FULL_AND_PIECEWISE | PIECEWISE + FULL | clean |

The hook is gated on the same E8 KV env conditional that activates the sidecar, so weight-only GLQ keeps the v0.3.4 FULL_AND_PIECEWISE default and the +18.5% FULL-graph win at B=4.
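One way to picture the env gate is a prefix check over the process environment. The sketch below is an assumption, not the actual conditional in glq_vllm: the notes only give the pattern GLQ_KV_E8_*=1, so the function treats any variable with that prefix set to "1" as enabling E8 KV, and `e8_kv_active` is a hypothetical name.

```python
import os

# Prefix taken from the release notes; the concrete variable names are not listed there.
E8_KV_PREFIX = "GLQ_KV_E8_"


def e8_kv_active(environ=None):
    """Return True if any GLQ_KV_E8_* variable is set to "1".

    `environ` defaults to os.environ; passing a dict makes the check testable.
    """
    env = os.environ if environ is None else environ
    return any(
        name.startswith(E8_KV_PREFIX) and value == "1"
        for name, value in env.items()
    )
```

With no GLQ_KV_E8_* variable set to "1", the function returns False, which corresponds to the weight-only row in the table above.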

Startup notice

When the override fires, glq_vllm prints:

[glq_vllm] E8 KV active → cudagraph_mode forced from FULL_AND_PIECEWISE to PIECEWISE (FULL captures incompatible with .unique() in _patched_unified_attention; fused-dequant paged_attention in v0.4 will lift this)

Cleanups

  • glq_vllm/__init__.py: replaced v0.3.3 "requires enforce_eager" notice with the hook-installed notice.
  • benchmarks/kv_compress_{niah,mmlu}_vllm.py: dropped enforce_eager=True defaults. NIAH gains an --enforce-eager / --no-enforce-eager CLI toggle.
  • README.md KV section: dropped the prominent --enforce-eager warning, documented the auto-PIECEWISE behaviour.
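A paired --enforce-eager / --no-enforce-eager flag like the one NIAH gains can be declared with the standard library's `argparse.BooleanOptionalAction` (Python 3.9+). This is only a sketch of that pattern; how the benchmark script actually defines the flag is not shown in these notes, and `build_parser` is a hypothetical name. The default of False mirrors the dropped enforce_eager=True default.

```python
import argparse


def build_parser():
    # Hypothetical parser; only the toggle pattern is the point here.
    parser = argparse.ArgumentParser(description="benchmark flags (sketch)")
    parser.add_argument(
        "--enforce-eager",
        action=argparse.BooleanOptionalAction,  # also registers --no-enforce-eager
        default=False,
        help="Disable CUDA-graph capture and run in eager mode.",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"enforce_eager={args.enforce_eager}")
```

Passing neither flag leaves `args.enforce_eager` at False, so the auto-PIECEWISE path is exercised by default and eager mode remains available for A/B comparison.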

Future (deferred)

Unlocking FULL capture for E8 KV requires the C++ paged_attention fork that fuses dequant into the attention kernel, so that the gather workspace disappears entirely and the per-token cost matches uncompressed KV. That's the v0.4 target. Until then, PIECEWISE-only is the right tradeoff: ~3× decode over today's eager mode, with no FULL-graph regression risk.

Install

pip install -U glq==0.3.5
