TL;DR: --enforce-eager / --disable-piecewise-cuda-graph are no longer required when serving GLQ checkpoints on vLLM 0.20.x. Drop the flag and decode gets ~2.8× faster on small/medium models.
Measured
RTX PRO 6000 Blackwell, vLLM 0.20.2, single sequence, 256 generated tokens (incl. prefill):
| Model | Mode | tok/s | Speedup |
|---|---|---:|---:|
| Gemma-4-E4B-it-GLQ-4bpw | eager (v0.3.1) | 14.46 | 1.00× |
| Gemma-4-E4B-it-GLQ-4bpw | piecewise (v0.3.2) | 40.20 | 2.78× |
Gemma-4-31B-it-GLQ-5.0bpw-mix3-8 piecewise smoke: loads at 17.25 GiB (unchanged vs v0.3.1), AOT-compiles in 132 s on first launch (cached at ~/.cache/vllm/torch_compile_cache/), produces bit-identical output.
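The Speedup column in the table above is the ratio of the two throughput figures. As a minimal sketch, the arithmetic can be reproduced from the tok/s values alone; the helper names below are hypothetical, and the wall-time estimate assumes the reported tok/s covers the whole 256-token run (prefill included), as the benchmark description states.

```python
# Throughput figures copied from the table above (tokens per second).
EAGER_TOK_S = 14.46       # v0.3.1, eager
PIECEWISE_TOK_S = 40.20   # v0.3.2, piecewise
GENERATED_TOKENS = 256    # tokens generated per run


def speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tok_s / baseline_tok_s


def wall_time_s(tokens: int, tok_s: float) -> float:
    """Approximate seconds needed to produce `tokens` at `tok_s`."""
    return tokens / tok_s


if __name__ == "__main__":
    print(f"speedup:   {speedup(EAGER_TOK_S, PIECEWISE_TOK_S):.2f}x")
    print(f"eager:     {wall_time_s(GENERATED_TOKENS, EAGER_TOK_S):.1f} s")
    print(f"piecewise: {wall_time_s(GENERATED_TOKENS, PIECEWISE_TOK_S):.1f} s")
```

Under that assumption, a 256-token run drops from roughly 17.7 s to roughly 6.4 s.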
What changed
glq_vllm/linear_method.py:_glq_apply_shard had three if _use_custom_ops … else _ik._glq_cuda.glq_*_cuda(...) branches. The else branch was dead in production (the custom ops are registered when glq_vllm is imported), but because it was still visible in the trace, TorchDynamo fell back to eager. Removing the fallback gives Dynamo a single, traceable code path through:
- torch.ops.glq.fused_linear_block_diag (block-diagonal fast path, n_pad ≤ 32768)
- torch.ops.glq.input_rht (n_pad ≤ 16384)
- torch.ops.glq.output_rht (m_pad ≤ 16384)
The > 16384 Triton fallback paths in _input_rht_kernel / _output_rht_kernel stay — current checkpoints (Gemma-4, SmolLM3, Devstral, Nemotron) don't hit them. Wrapping them as torch.library ops is a separate follow-up.
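To check whether a layer of a new checkpoint would stay within the size limits quoted above, a small helper along these lines may be useful. This is only a sketch: the three limits come from the notes above, but the function names are hypothetical and the way glq's real dispatch consults these limits is an assumption, not something taken from the glq source.

```python
# Size limits quoted in the release notes; everything else is illustrative.
FUSED_BLOCK_DIAG_MAX_N_PAD = 32768
INPUT_RHT_MAX_N_PAD = 16384
OUTPUT_RHT_MAX_M_PAD = 16384


def within_limits(n_pad: int, m_pad: int) -> dict:
    """Report which of the quoted limits a padded layer shape satisfies."""
    return {
        "fused_linear_block_diag": n_pad <= FUSED_BLOCK_DIAG_MAX_N_PAD,
        "input_rht": n_pad <= INPUT_RHT_MAX_N_PAD,
        "output_rht": m_pad <= OUTPUT_RHT_MAX_M_PAD,
    }


def hits_triton_fallback(n_pad: int, m_pad: int) -> bool:
    """True if either RHT dimension exceeds 16384 (assumed fallback trigger)."""
    return n_pad > INPUT_RHT_MAX_N_PAD or m_pad > OUTPUT_RHT_MAX_M_PAD


if __name__ == "__main__":
    # Example padded shapes, chosen arbitrarily for illustration.
    for n_pad, m_pad in [(4096, 16384), (32768, 8192)]:
        print(n_pad, m_pad, within_limits(n_pad, m_pad),
              "fallback" if hits_triton_fallback(n_pad, m_pad) else "custom ops")
```

A shape that reports a fallback here is a candidate for testing with --enforce-eager first.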
Side changes
- README KV-cache + sglang sections drop the --enforce-eager / --disable-piecewise-cuda-graph caveats (they apply to v0.3.1 and earlier).
- examples/inference_vllm.py no longer sets enforce_eager=True by default.
- benchmarks/kv_compress_{mmlu,niah}_vllm.py and benchmarks/probe_vllm_kv_capacity.py no longer default to eager. NIAH script gains an --enforce-eager CLI flag for debugging.
- benchmarks/profile_vllm_* keep enforce_eager=True (intentional: eager simplifies nsys / py-spy attribution).
Install
pip install -U glq==0.3.2
vLLM users can now just:
vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw
If you hit a compile-time issue on an untested model architecture, the --enforce-eager flag is still available as a fallback.
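For launch scripts that need to switch between the default compiled mode and the eager fallback, a small wrapper can keep the choice in one place. The sketch below only assembles the command line from the model ID and flag named in these notes; build_serve_argv is a hypothetical helper, not part of glq or vLLM.

```python
import shlex
from typing import List


def build_serve_argv(model: str, enforce_eager: bool = False) -> List[str]:
    """Assemble a `vllm serve` command, optionally with the eager fallback."""
    argv = ["vllm", "serve", model]
    if enforce_eager:
        # Only needed if compilation fails on an untested architecture.
        argv.append("--enforce-eager")
    return argv


if __name__ == "__main__":
    model = "xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw"
    print(shlex.join(build_serve_argv(model)))
    print(shlex.join(build_serve_argv(model, enforce_eager=True)))
```

The resulting list can be passed to subprocess.run as-is; keeping eager off by default matches the behavior of the updated examples.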
Breaking Changes
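- --enforce-eager and --disable-piecewise-cuda-graph are no longer required for GLQ checkpoints on vLLM 0.20.x. --enforce-eager remains available as a fallback. The bundled example and KV-compression benchmark scripts no longer default to eager mode.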