This release adds 2 notable features for engineering teams evaluating rollout.
Highlights
Opt-in shape-bucketed CUDA graphs (CUDAGraphBucketWrapper)
New glq.cuda_graph.CUDAGraphBucketWrapper captures one CUDA graph per (B, seqlen) bucket for stateless prefill / batched scoring (use_cache=False). At call time, it pads the input up to the smallest bucket that fits and replays that bucket's graph. All captures share a single torch.cuda.graph_pool_handle(), so peak VRAM is bounded by the largest bucket.
wrap_hflm(hflm_instance, buckets=...) monkey-patches lm-eval's HFLM _model_call for transparent integration. --bucket-graph CLI flag added to examples/inference_hf.py.
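The bucket lookup and padding step described above can be sketched in plain Python. This is a minimal illustration, not the wrapper's actual code: the helper names pick_bucket and pad_to_bucket are hypothetical, "smallest" is assumed to mean the fewest total elements, and right-padding with a pad_id of 0 is an assumption (the test suite mentions both right- and left-padding).

```python
from typing import List, Optional, Sequence, Tuple

Bucket = Tuple[int, int]  # (batch, seqlen)


def pick_bucket(buckets: Sequence[Bucket], batch: int, seqlen: int) -> Optional[Bucket]:
    """Return the smallest bucket that fits (batch, seqlen), or None if none fits."""
    fitting = [(b, s) for b, s in buckets if b >= batch and s >= seqlen]
    if not fitting:
        return None  # a caller would fall back to eager execution here
    # Assumption: "smallest" means fewest total elements (batch * seqlen).
    return min(fitting, key=lambda bs: (bs[0] * bs[1], bs))


def pad_to_bucket(rows: List[List[int]], bucket: Bucket, pad_id: int = 0) -> List[List[int]]:
    """Right-pad each row to the bucket seqlen, then add all-pad rows up to its batch."""
    b, s = bucket
    padded = [row + [pad_id] * (s - len(row)) for row in rows]
    padded += [[pad_id] * s for _ in range(b - len(rows))]
    return padded


buckets = [(1, 128), (4, 128), (4, 512)]
chosen = pick_bucket(buckets, batch=2, seqlen=100)   # fits (4, 128) and (4, 512)
oversized = pick_bucket(buckets, batch=1, seqlen=600)  # nothing fits
```

In a real wrapper the padded tensor would be copied into the static input buffer of the chosen bucket's captured graph before replay; inputs for which no bucket fits would take the eager path, which matches the oversized-input fallback listed under Tests.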
Ships as opt-in only — not enabled by default. Docstring documents honest tradeoffs:
| Scenario | Result |
|---|---|
| Small models + small batches, Python-dispatch-bound forward | Win (SmolLM2-135M 4bpw winogrande limit=5: 4.16s → 3.33s, 1.25×) |
| Larger models where forward is already GPU-bound | Loss (SmolLM3-3B 6bpw limit=5: 0.39×, capture cost dominates) |
| Variable-shape workloads at scale | Loss (SmolLM2-135M limit=200: 0.74×, padding waste) |
Productive niches: fixed-shape serving, known-workload batched inference. Not a default under lm-eval.
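To make the "padding waste" row concrete, the sketch below estimates what fraction of the computed elements would be padding for a given mix of input shapes. The function name padding_waste and the example shapes are hypothetical and are not measurements from this release; the bucket rule (fewest total elements) is the same assumption as in any simple bucketing scheme.

```python
from typing import Sequence, Tuple

Shape = Tuple[int, int]  # (batch, seqlen)


def padding_waste(shapes: Sequence[Shape], buckets: Sequence[Shape]) -> float:
    """Fraction of computed elements that are padding across all bucketed calls."""
    real = 0
    computed = 0
    for b, s in shapes:
        fitting = [(bb, ss) for bb, ss in buckets if bb >= b and ss >= s]
        if not fitting:
            continue  # oversized input: assumed to run eagerly, so no padding
        bb, ss = min(fitting, key=lambda x: x[0] * x[1])
        real += b * s
        computed += bb * ss
    return 0.0 if computed == 0 else 1.0 - real / computed


# Hypothetical workload: three single-sequence calls of varying length.
waste = padding_waste([(1, 40), (1, 70), (1, 128)], [(1, 64), (1, 128)])
```

For this toy workload the result is about 0.256, meaning roughly a quarter of the replayed work is spent on pad tokens. A fixed-shape workload that matches a bucket exactly would give 0.0, which is consistent with the niches named above.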
glq_dequant_matmul Python helper now N-stage-aware
The Python helper at glq.inference_kernel.glq_dequant_matmul now accepts Qidxs3/codebook3/inv_resid_scale2 and Qidxs4/codebook4/inv_resid_scale3 kwargs, so sglang, glq_vllm, and the glq fallback path can thread Phase D stage-3/4 tensors through without calling the raw 14-arg pybind. Defaults are None/0.0, so every pre-Phase-D caller is unchanged.
This unblocked the Phase B + Phase D ports to the sglang fork:
cnygaard/sglang @ glq-quantization: commits f3f9a3219 (N-stage support) and e1739fdd3 (block-diagonal dispatch). Validated end-to-end on SmolLM2-135M 4bpw (pow2 and block-diag) and SmolLM3-3B 6bpw block-diag 3-stage; all return "The capital of France is Paris." via sglang's /generate.
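The backward-compatibility pattern described for glq_dequant_matmul (extra stage arguments defaulting to None/0.0) can be illustrated with a toy stand-in. Everything here is hypothetical: toy_multistage_sum is not part of glq, its parameter names are simplified, and the way each residual is scaled and added is an illustrative assumption rather than glq's actual arithmetic.

```python
from typing import List, Optional


def toy_multistage_sum(
    base: List[float],
    resid3: Optional[List[float]] = None,
    inv_resid_scale2: float = 0.0,
    resid4: Optional[List[float]] = None,
    inv_resid_scale3: float = 0.0,
) -> List[float]:
    """Hypothetical stand-in: add optional residual stages onto a base result."""
    out = list(base)
    for resid, scale in ((resid3, inv_resid_scale2), (resid4, inv_resid_scale3)):
        if resid is None:
            continue  # stage not supplied: callers that predate it take this path
        out = [o + scale * r for o, r in zip(out, resid)]
    return out


legacy = toy_multistage_sum([1.0, 2.0])  # old-style call, no extra stages
three_stage = toy_multistage_sum([1.0, 2.0], resid3=[2.0, 4.0], inv_resid_scale2=0.5)
```

Because the added parameters are keyword arguments with inert defaults, an existing call site keeps its behavior, while a caller that has the extra stage tensors can pass them through the same Python entry point.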
Tests
- tests/test_cuda_graph_buckets.py: 6/6 pass (right/left-pad equivalence vs eager, oversized-input fallback, use_cache fallback, replay-faster-than-capture, Phase A decode wrapper still works).
- Full suite: 273 passed, 3 skipped (was 267/3 at v0.2.10).
Non-breaking
- Phase A CUDAGraphWrapper (B=1 decode) untouched.
- All existing kernels and dispatch paths unchanged.
- Version bumped; API additions only.
Install
pip install glq==0.2.11