This release adds 1 notable feature for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Topics
Summary
AI summaryUpdates Fix, loaded_weight, and device across a mixed release.
Full changelog
TL;DR: Loading xv0y5ncu/Gemma-4-31B-it-GLQ-5.0bpw-mix3-8 under vLLM now uses 17.25 GiB instead of 66.21 GiB on a single RTX PRO 6000 Blackwell. Available KV cache budget at gpu_memory_utilization=0.95 rises from 14.88 GiB → 63.91 GiB. No checkpoint format changes — every existing GLQ checkpoint benefits.
Root cause
GLQShardedParameter (the per-shard buffer for fused QKV / gate_up layers) was allocating two copies of every compressed weight:
- the full concat dummy in
param.data(intended for vLLM shape checks, but never read on our custom loader path) - the real per-shard buffer in
_shard_data[i]
Stage-3 / stage-4 RVQ indices (Qidxs3, Qidxs4) were also pre-allocated for every fused layer regardless of bpw, even though they only populate for bpw ≥ 5 / ≥ 7. The non-fused _register_glq_buffers path already used empty(0) sentinels — the fused path was the divergent / wasteful branch.
Fix
GLQShardedParameter.__new__allocatesparam.dataas a zero-byte placeholder. The fused weight loader uses_shard_data[idx]directly via_glq_shard_loader, so the concat dummy was dead weight.GLQShardedParameter.__init__acceptssentinel=True;Qidxs3/Qidxs4are now lazy-allocated empty(0) per shard, resized by the loader on first store viatorch.empty_like(loaded_weight).cuda()/to()rewritten in place — emptyparam.datamade PyTorch's defaultTensor.cuda()downcast to plain Tensor and lose_shard_data.
Verification
| Metric | v0.3.0 | v0.3.1 |
|---|---|---|
| Gemma-4-31B model load | 66.21 GiB | 17.25 GiB |
| Available KV cache @ gpu_mem=0.95 | 14.88 GiB | 63.91 GiB |
| Chat "Paris" regression | ok | ok |
| NIAH 16k @ e8_relaxed:2 | 3/3 | 3/3 |
| pytest (excluding heavy model-load tests) | — | 386 passed / 0 failed / 4 xfailed |
10 new unit tests in tests/test_glq_sharded_param.py cover the sentinel resize path, zero-byte param.data, and .to(device) propagation.
Side change
benchmarks/kv_compress_niah_vllm.py exposes --max-batched-tokens and --max-num-seqs so long-context runs can cap profile_run activation peak independently of max_model_len.
Install
pip install glq==0.3.1
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About Glq
All releases →Related context
Related tools
Beta — feedback welcome: [email protected]