Glq

v0.3.2 Breaking

This release includes 1 breaking change for platform teams planning a safe upgrade.

Published 2mo Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

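v0.3.2 removes the need for --enforce-eager / --disable-piecewise-cuda-graph when serving GLQ checkpoints on vLLM 0.20.x, and updates the README, example script, and benchmark defaults to match.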

Full changelog

TL;DR: --enforce-eager / --disable-piecewise-cuda-graph are no longer required when serving GLQ checkpoints on vLLM 0.20.x. Drop the flag and decode gets ~2.8× faster on small/medium models.

Measured

RTX PRO 6000 Blackwell, vLLM 0.20.2, single sequence, 256 generated tokens (incl. prefill):

| Model | Mode | tok/s | Speedup |
|---|---|---:|---:|
| Gemma-4-E4B-it-GLQ-4bpw | eager (v0.3.1) | 14.46 | 1.00× |
| Gemma-4-E4B-it-GLQ-4bpw | piecewise (v0.3.2) | 40.20 | 2.78× |

Gemma-4-31B-it-GLQ-5.0bpw-mix3-8 piecewise smoke: loads at 17.25 GiB (unchanged vs v0.3.1), AOT-compiles in 132 s on first launch (cached at ~/.cache/vllm/torch_compile_cache/), produces bit-identical output.
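The speedup column in the table above can be reproduced from the two throughput figures. The sketch below is only a sanity check on the published numbers: the function names are illustrative, and the per-token latency is an average over a single-sequence run that includes prefill, so it is not a pure decode latency.

```python
# Throughput figures copied from the "Measured" table (tokens per second).
EAGER_TOK_S = 14.46      # v0.3.1, eager mode
PIECEWISE_TOK_S = 40.20  # v0.3.2, piecewise CUDA graphs


def speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Ratio of the new throughput to the baseline throughput."""
    return new_tok_s / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average wall-clock time per generated token, in milliseconds."""
    return 1000.0 / tok_s


if __name__ == "__main__":
    print(f"speedup:   {speedup(EAGER_TOK_S, PIECEWISE_TOK_S):.2f}x")
    print(f"eager:     {ms_per_token(EAGER_TOK_S):.1f} ms/token")
    print(f"piecewise: {ms_per_token(PIECEWISE_TOK_S):.1f} ms/token")
```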

What changed

glq_vllm/linear_method.py:_glq_apply_shard had three if _use_custom_ops … else _ik._glq_cuda.glq_*_cuda(...) branches. The else branch was dead in production (the custom ops register when glq_vllm is imported), but its presence in the trace forced TorchDynamo to fall back to eager. Removing the fallback gives Dynamo a single, traceable code path through:

torch.ops.glq.fused_linear_block_diag (block-diagonal fast path, n_pad ≤ 32768)
torch.ops.glq.input_rht (n_pad ≤ 16384)
torch.ops.glq.output_rht (m_pad ≤ 16384)

The > 16384 Triton fallback paths in _input_rht_kernel / _output_rht_kernel stay — current checkpoints (Gemma-4, SmolLM3, Devstral, Nemotron) don't hit them. Wrapping them as torch.library ops is a separate follow-up.
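To check whether a layer's padded dimensions stay within the limits quoted above, a small helper like the following may be useful. This is a hypothetical illustration of the thresholds only: the function names are made up, and the notes do not show how _glq_apply_shard actually combines these conditions.

```python
# Size limits quoted in the release notes. The dispatch logic inside
# _glq_apply_shard is not shown there; everything below is an assumption
# made for illustration.
FUSED_BLOCK_DIAG_MAX_N_PAD = 32768
INPUT_RHT_MAX_N_PAD = 16384
OUTPUT_RHT_MAX_M_PAD = 16384


def fused_fast_path_eligible(n_pad: int) -> bool:
    """True if n_pad is within the stated block-diagonal fast-path limit."""
    return n_pad <= FUSED_BLOCK_DIAG_MAX_N_PAD


def input_rht_path(n_pad: int) -> str:
    """Name the input-RHT path implied by the stated n_pad limit."""
    if n_pad <= INPUT_RHT_MAX_N_PAD:
        return "torch.ops.glq.input_rht"
    return "Triton fallback (_input_rht_kernel)"


def output_rht_path(m_pad: int) -> str:
    """Name the output-RHT path implied by the stated m_pad limit."""
    if m_pad <= OUTPUT_RHT_MAX_M_PAD:
        return "torch.ops.glq.output_rht"
    return "Triton fallback (_output_rht_kernel)"


if __name__ == "__main__":
    # Example padded sizes, chosen arbitrarily.
    for n_pad, m_pad in [(4096, 4096), (16384, 32768), (65536, 8192)]:
        print(n_pad, m_pad,
              fused_fast_path_eligible(n_pad),
              input_rht_path(n_pad),
              output_rht_path(m_pad))
```

A layer that lands on a Triton fallback is outside the paths this release covers, so it is a reasonable candidate for the eager fallback until the follow-up work lands.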

Side changes

README KV-cache + sglang sections drop the --enforce-eager / --disable-piecewise-cuda-graph caveats (they apply to v0.3.1 and earlier).
examples/inference_vllm.py no longer sets enforce_eager=True by default.
benchmarks/kv_compress_{mmlu,niah}_vllm.py and benchmarks/probe_vllm_kv_capacity.py no longer default to eager. NIAH script gains an --enforce-eager CLI flag for debugging.
benchmarks/profile_vllm_* keep enforce_eager=True (intentional — eager simplifies nsys / py-spy attribution).

Install

pip install -U glq==0.3.2

vLLM users can now just:

vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw

If you hit a compile-time issue on an untested model architecture, the --enforce-eager flag is still available as a fallback.
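For deployment scripts that need to switch between the default path and the eager fallback, one option is to build the command line conditionally. The sketch below assumes the vllm CLI is on PATH; build_serve_command is a hypothetical helper, not part of glq or vLLM.

```python
import shlex


def build_serve_command(model: str, enforce_eager: bool = False) -> list[str]:
    """Return the argv for `vllm serve`, optionally with the eager fallback."""
    cmd = ["vllm", "serve", model]
    if enforce_eager:
        # Fallback for architectures that fail at compile time.
        cmd.append("--enforce-eager")
    return cmd


if __name__ == "__main__":
    model = "xv0y5ncu/Gemma-4-E4B-it-GLQ-4bpw"
    print(shlex.join(build_serve_command(model)))
    print(shlex.join(build_serve_command(model, enforce_eager=True)))
```

The returned list can be passed to subprocess.run as-is, which avoids shell quoting issues with model names.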

Breaking Changes

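Flags --enforce-eager and --disable-piecewise-cuda-graph are no longer required for GLQ checkpoints on vLLM 0.20.x, and examples/inference_vllm.py and the KV-cache benchmarks no longer default to eager mode. --enforce-eager remains available as a fallback.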
