Glq

v0.3.1 Feature

This release adds 1 notable feature for engineering teams evaluating rollout.

Published 2mo Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Fixes duplicate and unnecessary buffer allocation in GLQShardedParameter for fused layers, which sharply reduces model-load memory under vLLM.

Full changelog

TL;DR: Loading xv0y5ncu/Gemma-4-31B-it-GLQ-5.0bpw-mix3-8 under vLLM now uses 17.25 GiB instead of 66.21 GiB on a single RTX PRO 6000 Blackwell. The available KV cache budget at gpu_memory_utilization=0.95 rises from 14.88 GiB to 63.91 GiB. The checkpoint format is unchanged, so every existing GLQ checkpoint benefits.

Root cause

GLQShardedParameter (the per-shard buffer for fused QKV / gate_up layers) was allocating two copies of every compressed weight:

- the full concat dummy in param.data (intended for vLLM shape checks, but never read on our custom loader path)
- the real per-shard buffer in _shard_data[i]

Stage-3 and stage-4 RVQ indices (Qidxs3, Qidxs4) were also pre-allocated for every fused layer regardless of bpw, even though they are populated only at bpw ≥ 5 and bpw ≥ 7 respectively. The non-fused _register_glq_buffers path already used empty(0) sentinels; the fused path was the one that diverged and wasted memory.
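To make the two sources of waste concrete, here is a toy accounting model of the index buffers held for one fused layer. It is only a sketch under stated assumptions: the function name, the idea that every stage's index buffer has the same size, and the count of two always-populated stages are all hypothetical. Only the bpw thresholds (5 and 7) and the doubling between param.data and _shard_data come from the description above.

```python
def fused_layer_index_bytes(stage_bytes: int, bpw: float, *, fixed: bool) -> int:
    """Illustrative byte count for one fused layer's RVQ index buffers.

    stage_bytes is an assumed, uniform size for a single stage's index buffer.
    """
    always_populated = 2  # assumption: stages 1 and 2 are always present
    if not fixed:
        # Old behaviour: all four stages pre-allocated regardless of bpw,
        # and held twice (concat dummy in param.data plus _shard_data).
        return 2 * 4 * stage_bytes
    # New behaviour: one copy, and stages 3 / 4 only when bpw needs them.
    populated = always_populated + int(bpw >= 5) + int(bpw >= 7)
    return populated * stage_bytes


if __name__ == "__main__":
    for bpw in (4.0, 5.0, 7.0):
        old = fused_layer_index_bytes(100, bpw, fixed=False)
        new = fused_layer_index_bytes(100, bpw, fixed=True)
        print(f"bpw={bpw}: old={old} new={new}")
```

Under this model the saving is largest for low-bpw layers, where both the duplicate copy and the unused stage-3 and stage-4 buffers disappear.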

Fix

- GLQShardedParameter.__new__ allocates param.data as a zero-byte placeholder. The fused weight loader uses _shard_data[idx] directly via _glq_shard_loader, so the concat dummy was dead weight.
- GLQShardedParameter.__init__ accepts sentinel=True; Qidxs3 / Qidxs4 are now lazily allocated: each shard starts as an empty(0) sentinel and the loader resizes it on first store via torch.empty_like(loaded_weight).
- cuda() / to() are rewritten to operate in place. With an empty param.data, PyTorch's default Tensor.cuda() downcast the result to a plain Tensor and lost _shard_data.
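The sentinel-and-resize pattern described in the list above can be sketched roughly as follows. This is not the GLQ implementation: ShardBuffers, store, and nbytes are hypothetical names, and a plain Python container stands in for the real tensor subclass. Only torch.empty(0) as the sentinel, torch.empty_like(loaded_weight) as the resize, and the in-place device move are taken from the notes.

```python
import torch


class ShardBuffers:
    """Hypothetical stand-in for per-shard storage (not GLQShardedParameter)."""

    def __init__(self, num_shards: int, dtype: torch.dtype = torch.int16):
        # Each shard starts as a zero-element sentinel, so nothing is
        # allocated until the loader actually stores data.
        self.shards = [torch.empty(0, dtype=dtype) for _ in range(num_shards)]

    def store(self, idx: int, loaded_weight: torch.Tensor) -> None:
        # On first store, replace the sentinel with a buffer shaped like
        # the incoming weight, then copy the data in.
        if self.shards[idx].numel() == 0:
            self.shards[idx] = torch.empty_like(loaded_weight)
        self.shards[idx].copy_(loaded_weight)

    def to(self, device) -> "ShardBuffers":
        # Move shards and return self, so the wrapper object (and its
        # shard list) survives the device move instead of being replaced.
        self.shards = [s.to(device) for s in self.shards]
        return self

    def nbytes(self) -> int:
        return sum(s.numel() * s.element_size() for s in self.shards)


if __name__ == "__main__":
    buf = ShardBuffers(num_shards=3)
    print("before load:", buf.nbytes(), "bytes")
    buf.store(1, torch.ones(4, 8, dtype=torch.int16))
    print("after one store:", buf.nbytes(), "bytes")
```

Shards that are never stored to stay at zero elements, which is the property that lets low-bpw layers skip the Qidxs3 / Qidxs4 allocation entirely.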

Verification

| Metric | v0.3.0 | v0.3.1 |
|---|---|---|
| Gemma-4-31B model load | 66.21 GiB | 17.25 GiB |
| Available KV cache @ gpu_mem=0.95 | 14.88 GiB | 63.91 GiB |
| Chat "Paris" regression | ok | ok |
| NIAH 16k @ e8_relaxed:2 | 3/3 | 3/3 |
| pytest (excluding heavy model-load tests) | — | 386 passed / 0 failed / 4 xfailed |
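As a quick cross-check of the table, the memory freed at load time should show up almost entirely as extra KV cache budget, since gpu_memory_utilization is held at 0.95 in both runs. The snippet below only redoes that arithmetic on the figures reported above; the variable names are illustrative.

```python
# Figures from the table above, in GiB.
load_old, load_new = 66.21, 17.25
kv_old, kv_new = 14.88, 63.91

freed = load_old - load_new   # memory no longer spent on model weights
kv_gain = kv_new - kv_old     # extra budget available to the KV cache

print(f"model load reduced by {freed:.2f} GiB ({load_old / load_new:.2f}x smaller)")
print(f"KV cache budget grew by {kv_gain:.2f} GiB ({kv_new / kv_old:.2f}x larger)")
```

Both differences come out at roughly 49 GiB, which is consistent with the freed weight memory being handed to the KV cache.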

10 new unit tests in tests/test_glq_sharded_param.py cover the sentinel resize path, zero-byte param.data, and .to(device) propagation.

Side change

benchmarks/kv_compress_niah_vllm.py exposes --max-batched-tokens and --max-num-seqs so long-context runs can cap profile_run activation peak independently of max_model_len.
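A minimal sketch of how flags like these might be forwarded to the engine, assuming the script builds vLLM keyword arguments from argparse. The wiring inside benchmarks/kv_compress_niah_vllm.py is not shown in these notes, so build_parser, engine_kwargs, the --max-model-len flag, and its default are hypothetical. max_num_batched_tokens, max_num_seqs, and max_model_len are existing vLLM engine arguments.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(description="illustrative flag wiring only")
    # Hypothetical flag and default; the real script's options may differ.
    p.add_argument("--max-model-len", type=int, default=16384)
    # The two flags named in the release notes.
    p.add_argument("--max-batched-tokens", type=int, default=None)
    p.add_argument("--max-num-seqs", type=int, default=None)
    return p


def engine_kwargs(args: argparse.Namespace) -> dict:
    """Map CLI flags to vLLM engine keyword arguments, skipping unset caps."""
    kwargs = {"max_model_len": args.max_model_len}
    if args.max_batched_tokens is not None:
        kwargs["max_num_batched_tokens"] = args.max_batched_tokens
    if args.max_num_seqs is not None:
        kwargs["max_num_seqs"] = args.max_num_seqs
    return kwargs


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--max-batched-tokens", "4096", "--max-num-seqs", "1"]
    )
    print(engine_kwargs(args))
```

Leaving a cap unset keeps the engine's own default, so the context length can stay large while the profiling batch is bounded separately.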

Install

pip install glq==0.3.1
