Glq

v0.2.11 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo Model Serving & MLOps
✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Adds opt-in shape-bucketed CUDA graphs and N-stage support in the glq_dequant_matmul Python helper, with new tests. All changes are non-breaking API additions.

Full changelog

Highlights

Opt-in shape-bucketed CUDA graphs (CUDAGraphBucketWrapper)

The new glq.cuda_graph.CUDAGraphBucketWrapper captures one CUDA graph per (B, seqlen) bucket for stateless prefill / batched scoring (use_cache=False). At call time, it pads the input up to the smallest bucket that fits and replays that bucket's graph. All captures share a single torch.cuda.graph_pool_handle(), so peak VRAM is bounded by the largest bucket.
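The bucket lookup and padding step described above can be sketched in plain Python. This is an illustration only, not glq's implementation: `pick_bucket` and `pad_to_bucket` are hypothetical names, "smallest" is assumed to mean the fewest padded elements (B × seqlen), and right-padding with a `pad_id` of 0 is an assumption.

```python
from typing import List, Optional, Sequence, Tuple

Bucket = Tuple[int, int]  # (batch, seqlen)


def pick_bucket(batch: int, seqlen: int, buckets: Sequence[Bucket]) -> Optional[Bucket]:
    """Return the smallest bucket that fits (batch, seqlen), or None if none fits."""
    fitting = [b for b in buckets if b[0] >= batch and b[1] >= seqlen]
    if not fitting:
        return None  # caller would fall back to eager execution
    # Assumption: "smallest" means the fewest total elements (B * seqlen).
    return min(fitting, key=lambda b: (b[0] * b[1], b))


def pad_to_bucket(rows: List[List[int]], bucket: Bucket, pad_id: int = 0) -> List[List[int]]:
    """Right-pad each row to the bucket seqlen, then add all-pad rows up to the bucket batch."""
    b, s = bucket
    padded = [row + [pad_id] * (s - len(row)) for row in rows]
    padded += [[pad_id] * s for _ in range(b - len(rows))]
    return padded
```

With buckets `[(1, 128), (4, 128), (4, 512)]`, a (2, 100) input would map to `(4, 128)`, while a batch of 8 fits no bucket and would take the eager path.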

wrap_hflm(hflm_instance, buckets=...) monkey-patches lm-eval's HFLM _model_call for transparent integration. --bucket-graph CLI flag added to examples/inference_hf.py.
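The monkey-patching pattern mentioned above can be sketched as follows. This is a generic illustration, not the body of wrap_hflm: `FakeHFLM` is a stand-in for lm-eval's HFLM, and `wrap_model_call` and `on_call` are hypothetical names.

```python
import types


class FakeHFLM:
    """Stand-in for lm-eval's HFLM; only the patched method is modeled."""

    def _model_call(self, inps):
        return [x * 2 for x in inps]


def wrap_model_call(lm, on_call):
    """Replace lm._model_call on this one instance with a wrapper around the original."""
    original = lm._model_call  # bound method, captured before patching

    def patched(self, inps, *args, **kwargs):
        # A real wrapper would pick a bucket, pad, and replay a graph here.
        on_call(len(inps))
        return original(inps, *args, **kwargs)

    # Bind to the instance so other instances of the class are unaffected.
    lm._model_call = types.MethodType(patched, lm)
    return lm
```

Patching the instance rather than the class keeps the change scoped to the object the caller passed in, which fits an opt-in feature.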

Ships as opt-in only; it is not enabled by default. The docstring documents the tradeoffs:

| Scenario | Result |
|---|---|
| Small models + small batches, Python-dispatch-bound forward | Win (SmolLM2-135M 4bpw winogrande limit=5: 4.16s → 3.33s, 1.25×) |
| Larger models where forward is already GPU-bound | Loss (SmolLM3-3B 6bpw limit=5: 0.39×, capture cost dominates) |
| Variable-shape workloads at scale | Loss (SmolLM2-135M limit=200: 0.74×, padding waste) |
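To make the "padding waste" row concrete, here is a rough way to estimate how much of a bucketed workload is padding. It is a sketch under stated assumptions: `padding_waste` is a hypothetical helper, only the sequence-length dimension is modeled, and inputs larger than every bucket are assumed to run eagerly and unpadded.

```python
from typing import Sequence


def padding_waste(seqlens: Sequence[int], bucket_lens: Sequence[int]) -> float:
    """Fraction of computed token positions that are padding."""
    real = 0
    padded = 0
    for n in seqlens:
        fitting = [b for b in bucket_lens if b >= n]
        if not fitting:
            continue  # assumed eager fallback: no padding, not counted
        real += n
        padded += min(fitting)  # smallest bucket length that fits
    return 0.0 if padded == 0 else 1.0 - real / padded
```

A workload whose lengths sit just above a bucket boundary (for example 129 tokens with buckets of 128 and 512) pays for mostly padding, which is one plausible reading of the loss reported at limit=200.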

Productive niches: fixed-shape serving, known-workload batched inference. Not a default under lm-eval.

glq_dequant_matmul Python helper now N-stage-aware

The Python helper at glq.inference_kernel.glq_dequant_matmul now accepts Qidxs3/codebook3/inv_resid_scale2 and Qidxs4/codebook4/inv_resid_scale3 kwargs, so sglang, glq_vllm, and the glq fallback path can thread Phase D stage-3/4 tensors through without calling the raw 14-arg pybind. Defaults are None/0.0, so every pre-Phase-D caller is unchanged.
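The backward-compatibility argument (defaults of None/0.0 leave older callers unchanged) can be illustrated with a toy function. This is not the real glq_dequant_matmul signature or math: `dequant_row` and its parameter names are hypothetical, and the assumption that a later stage adds a codebook lookup scaled by `inv_resid_scale` is made only for the example.

```python
from typing import List, Optional


def dequant_row(
    qidxs: List[int],
    codebook: List[float],
    qidxs2: Optional[List[int]] = None,
    codebook2: Optional[List[float]] = None,
    inv_resid_scale: float = 0.0,
) -> List[float]:
    """Toy decode: base codebook lookup plus an optional scaled residual stage."""
    out = [codebook[i] for i in qidxs]
    if qidxs2 is not None and codebook2 is not None:
        # Assumption: the extra stage adds inv_resid_scale * codebook2[index].
        out = [w + inv_resid_scale * codebook2[j] for w, j in zip(out, qidxs2)]
    return out
```

A caller written before the extra stage existed passes only the first two arguments and gets the same result as before; a newer caller opts in by supplying the optional keyword arguments.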

This unblocked the Phase B + Phase D ports to the sglang fork:

  • cnygaard/sglang @ glq-quantization: commits f3f9a3219 (N-stage support) and e1739fdd3 (block-diagonal dispatch). Validated end-to-end on SmolLM2-135M 4bpw (pow2 and block-diag) and SmolLM3-3B 6bpw block-diag 3-stage — all return "The capital of France is Paris." via sglang's /generate.

Tests

  • tests/test_cuda_graph_buckets.py: 6/6 pass (right/left-pad equivalence vs eager, oversized-input fallback, use_cache fallback, replay-faster-than-capture, Phase A decode wrapper still works).
  • Full suite: 273 passed, 3 skipped (was 267/3 at v0.2.10).

Non-breaking

  • Phase A CUDAGraphWrapper (B=1 decode) untouched.
  • All existing kernels and dispatch paths unchanged.
  • Version bumped; API additions only.

Install

pip install glq==0.2.11
