Summary
AI summary: GLQ v0.3.5 auto-downgrades cudagraph_mode to PIECEWISE when E8 KV is active, removing the need for --enforce-eager.
Full changelog
TL;DR: Users running GLQ models with the E8 KV cache stack (GLQ_KV_E8_*=1) no longer need --enforce-eager. glq_vllm auto-downgrades cudagraph_mode to PIECEWISE when E8 KV is active.
What changed
v0.3.2 made vLLM's FULL_AND_PIECEWISE capture mode the default for weight-only GLQ (2.78× on E4B at B=1). That same default crashed for users running with GLQ_KV_E8_*=1: the FULL capture wraps our patched attention forward, and the block_table.flatten().unique() call there is illegal during CUDA-graph capture (cudaErrorStreamCaptureUnsupported).
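One plausible reading of why .unique() trips capture: the size of its result depends on the values in the tensor, not only on its shape, so a GPU implementation typically has to report that size back to the host before the output can be allocated, and that kind of synchronisation is what stream capture rejects. The pure-Python stand-in below only illustrates the data-dependent output size; the function name and the sample tables are hypothetical and this is not the glq_vllm code.

```python
def unique_block_ids(block_table):
    """Flatten a 2-D block table and return its sorted distinct block ids."""
    flat = [block_id for row in block_table for block_id in row]
    return sorted(set(flat))


# Two hypothetical block tables with the same shape (2 x 3) ...
table_a = [[0, 1, 2], [3, 4, 5]]
table_b = [[0, 1, 1], [0, 1, 1]]

# ... whose unique results have different lengths, so the output size
# cannot be known from the input shape alone.
len_a = len(unique_block_ids(table_a))
len_b = len(unique_block_ids(table_b))
print(len_a, len_b)
```

A captured CUDA graph replays a fixed sequence of kernels over fixed-size buffers, which is why an operation like this has to run outside the captured region.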
v0.3.5 hooks EngineArgs.create_engine_config and downgrades any FULL_* cudagraph_mode to PIECEWISE when E8 KV is active. PIECEWISE splits the graph at vllm::unified_attention_with_output, so our patched attention runs eagerly in the break between captured subgraphs. There, .unique() is legal, the scoped-gather optimisation works, and no kernel changes are needed.
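The hook pattern can be sketched as a wrapper around create_engine_config that post-processes the returned config. This is a minimal illustration under stated assumptions, not the glq_vllm implementation: StandInEngineArgs is a hypothetical stand-in for vLLM's EngineArgs, the config is reduced to a single cudagraph_mode string field, and treating plain FULL the same as the FULL_* modes is an assumption.

```python
import functools
from types import SimpleNamespace


def downgrade_mode(mode: str, e8_kv_active: bool) -> str:
    """Return PIECEWISE for any FULL* mode when E8 KV is active, else the mode unchanged."""
    if e8_kv_active and mode.startswith("FULL"):
        return "PIECEWISE"
    return mode


class StandInEngineArgs:
    """Hypothetical stand-in for vLLM's EngineArgs; NOT the real class."""

    def __init__(self, cudagraph_mode="FULL_AND_PIECEWISE"):
        self.cudagraph_mode = cudagraph_mode

    def create_engine_config(self):
        # The real method builds a full engine config; this stand-in
        # returns only the one field the sketch cares about.
        return SimpleNamespace(cudagraph_mode=self.cudagraph_mode)


def install_hook(engine_args_cls, is_e8_kv_active):
    """Wrap create_engine_config so the returned config is post-processed."""
    original = engine_args_cls.create_engine_config

    @functools.wraps(original)
    def wrapped(self, *args, **kwargs):
        config = original(self, *args, **kwargs)
        new_mode = downgrade_mode(config.cudagraph_mode, is_e8_kv_active())
        if new_mode != config.cudagraph_mode:
            print(f"[sketch] cudagraph_mode forced from {config.cudagraph_mode} to {new_mode}")
            config.cudagraph_mode = new_mode
        return config

    engine_args_cls.create_engine_config = wrapped


# Pretend E8 KV is active and build a config through the hooked method.
install_hook(StandInEngineArgs, is_e8_kv_active=lambda: True)
config = StandInEngineArgs().create_engine_config()
```

Wrapping the method rather than editing the arguments up front means the downgrade applies however the mode was chosen, whether by default or by an explicit user setting.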
Verification on Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell
| Config | cudagraph_mode | Captures fire | Output |
|---|---|---|---|
| E8 KV (no --enforce-eager) | auto → PIECEWISE | PIECEWISE only | clean |
| Weight-only (no E8 KV envs) | FULL_AND_PIECEWISE | PIECEWISE + FULL | clean |
The hook is gated on the same E8 KV env conditional that activates the sidecar, so weight-only GLQ keeps the v0.3.4 FULL_AND_PIECEWISE default and its +18.5% FULL-graph win at B=4.
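A gate of this kind might look like the sketch below. It assumes that "E8 KV active" means at least one environment variable with the GLQ_KV_E8_ prefix is set to 1; the actual conditional in glq_vllm may differ (for example, it may require specific variables). The variable name GLQ_KV_E8_EXAMPLE is hypothetical and used only for illustration.

```python
import os

E8_KV_PREFIX = "GLQ_KV_E8_"


def e8_kv_active(env=None) -> bool:
    """Return True if any GLQ_KV_E8_* variable is set to "1" (assumed rule)."""
    env = os.environ if env is None else env
    return any(
        name.startswith(E8_KV_PREFIX) and value == "1"
        for name, value in env.items()
    )


# Hypothetical variable name, for illustration only.
with_e8 = e8_kv_active({"GLQ_KV_E8_EXAMPLE": "1"})
weight_only = e8_kv_active({})
print(with_e8, weight_only)
```

Passing the environment as a mapping keeps the check easy to test without mutating os.environ.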
Startup notice
When the override fires, glq_vllm prints:
[glq_vllm] E8 KV active → cudagraph_mode forced from FULL_AND_PIECEWISE to PIECEWISE (FULL captures incompatible with .unique() in _patched_unified_attention; fused-dequant paged_attention in v0.4 will lift this)
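To confirm in a script or CI job that the override fired, one option is to scan captured startup output for this notice. The sketch below assumes the message format shown above stays stable; the regex and the helper name find_override are illustrative and not part of glq_vllm.

```python
import re

# Matches the startup notice and captures the old and new modes.
NOTICE_RE = re.compile(
    r"\[glq_vllm\] E8 KV active → cudagraph_mode forced from (?P<old>\w+) to (?P<new>\w+)"
)


def find_override(log_text):
    """Return (old_mode, new_mode) if the notice is present, else None."""
    match = NOTICE_RE.search(log_text)
    return (match.group("old"), match.group("new")) if match else None


sample = (
    "[glq_vllm] E8 KV active → cudagraph_mode forced from FULL_AND_PIECEWISE "
    "to PIECEWISE (FULL captures incompatible with .unique() in "
    "_patched_unified_attention; fused-dequant paged_attention in v0.4 will lift this)"
)
result = find_override(sample)
print(result)
```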
Cleanups
- glq_vllm/__init__.py: replaced v0.3.3 "requires enforce_eager" notice with the hook-installed notice.
- benchmarks/kv_compress_{niah,mmlu}_vllm.py: dropped enforce_eager=True defaults. NIAH gains an --enforce-eager/--no-enforce-eager CLI toggle.
- README.md KV section: dropped the prominent --enforce-eager warning, documented the auto-PIECEWISE behaviour.
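A paired --enforce-eager/--no-enforce-eager toggle of the kind described for the NIAH benchmark can be expressed with argparse.BooleanOptionalAction (Python 3.9+). Whether the benchmark script uses this mechanism is an assumption, as are the default of False and the help text.

```python
import argparse


def build_parser():
    """Build a parser with a paired on/off flag (sketch, not the benchmark's code)."""
    parser = argparse.ArgumentParser(description="NIAH benchmark (sketch)")
    parser.add_argument(
        "--enforce-eager",
        action=argparse.BooleanOptionalAction,  # also registers --no-enforce-eager
        default=False,  # assumed default now that enforce_eager=True was dropped
        help="Run without CUDA graphs (eager mode).",
    )
    return parser


args_default = build_parser().parse_args([])
args_on = build_parser().parse_args(["--enforce-eager"])
args_off = build_parser().parse_args(["--no-enforce-eager"])
print(args_default.enforce_eager, args_on.enforce_eager, args_off.enforce_eager)
```

An explicit negative flag lets a benchmark run state its intent either way, which helps when comparing eager and PIECEWISE results side by side.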
Future (deferred)
Unlocking FULL capture for E8 KV requires the C++ paged_attention fork that fuses dequant into the attention kernel — the gather workspace disappears entirely and the per-token cost matches uncompressed KV. That's the v0.4 target. Until then, PIECEWISE-only is the right tradeoff: ~3× decode over today's eager mode, no FULL-graph regression risk.
Install
pip install -U glq==0.3.5