Glq

v0.3.3 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 17d · Model Serving & MLOps
✓ No known CVEs patched

Topics

inference, llm, model-compression, pytorch, quantization

Summary

AI summary

Fixes the v0.3.2 regression in which combining the E8 KV cache with the new piecewise CUDA-graph default raised cudaErrorStreamCaptureUnsupported, and adds four new torch.library ops.

Full changelog

TL;DR: Every Triton kernel GLQ launches is now a registered torch.library op with a meta kernel. This release also fixes the v0.3.2 regression for users combining the E8 KV cache with the new piecewise CUDA-graph default: that combination now forces eager mode (or shows a clear startup warning) instead of failing with cudaErrorStreamCaptureUnsupported.

What changed

Kernel wrapping (4 new ops in torch.ops.glq.*):

| Op | Triggered by | Status |
|---|---|---|
| glq::input_rht_triton | n_pad > 16384 (70B-class MLPs) | dormant today, future-proofs the path |
| glq::output_rht_triton | m_pad > 16384 | dormant today, future-proofs the path |
| glq::gather_kv_paged_dequant | every E8 KV attention read | active |
| glq::scatter_kv_paged_quant | every E8 KV attention write | active |

3 new fullgraph regression tests in tests/test_glq_compile_fullgraph.py capture toy modules under torch.compile(fullgraph=True). If anyone re-introduces a raw pybind/Triton call on the hot path, CI surfaces it with BackendCompilerFailed.

v0.3.2 E8 KV regression fix:

v0.3.2 dropped --enforce-eager as the default. That broke setups with GLQ_KV_E8_*=1, because _patched_unified_attention calls block_table.flatten().unique(), which is illegal during CUDA-graph capture. v0.3.3 makes the following changes:

  • glq_vllm/kv_compression.py: skip the .unique() scoped-decompress optimisation when torch.cuda.is_current_stream_capturing() returns True (defensive fallback to all-blocks gather).
  • glq_vllm/__init__.py: startup notice that E8 KV requires enforce_eager=True in v0.3.x.
  • benchmarks/kv_compress_{niah,mmlu}_vllm.py: enforce_eager=True is back as the default for these scripts; the NIAH script has an --enforce-eager / --no-enforce-eager CLI toggle.
  • README: the KV-cache section now explicitly states that E8 KV requires --enforce-eager until v0.3.4.
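
The capture guard in the first bullet can be pictured with the minimal sketch below. It is an assumption-laden illustration, not the code in glq_vllm/kv_compression.py: the function name blocks_to_gather and the num_blocks parameter are hypothetical. The underlying issue is that unique() produces an output whose size depends on the data, which requires a host synchronisation that CUDA-graph capture does not allow.

```python
import torch


def blocks_to_gather(block_table: torch.Tensor, num_blocks: int) -> torch.Tensor:
    """Return the ids of the KV blocks to decompress (hypothetical helper)."""
    # is_current_stream_capturing() is only meaningful with CUDA present,
    # so guard it to keep the sketch runnable on CPU-only installs.
    capturing = (
        torch.cuda.is_available() and torch.cuda.is_current_stream_capturing()
    )
    if capturing:
        # Defensive fallback: gather every block. The result has a fixed
        # shape, so nothing data-dependent happens during capture.
        return torch.arange(num_blocks, device=block_table.device)
    # Eager path: decompress only the blocks that are actually referenced.
    return block_table.flatten().unique()


if __name__ == "__main__":
    table = torch.tensor([[3, 1], [1, 0]])
    print(blocks_to_gather(table, num_blocks=8))
```

The fallback trades extra work (all blocks are gathered) for a shape that stays fixed while a graph is being captured.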

Weight-only GLQ (no GLQ_KV_E8_* environment variables set) keeps the v0.3.2 piecewise win: 2.78× E4B decode is preserved.

Known limitation (v0.3.4 target)

E8 KV under piecewise / full CUDA-graph capture still fails on graph replay with cudaErrorIllegalAddress because the gather workspace is allocated via torch.empty() per attention call (not graph-safe). Refactoring the gather to use a pre-allocated workspace will unlock piecewise speedup for the E8 KV path too. Tracked for v0.3.4.
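
To make the difference between a per-call torch.empty() and a pre-allocated workspace concrete, here is a minimal sketch. It is not GLQ's planned v0.3.4 design: the GatherWorkspace class, its parameters, and the use of index_select are all assumptions for illustration, and it runs on CPU so it says nothing about actual graph-replay behaviour.

```python
import torch


class GatherWorkspace:
    """Hypothetical fixed buffer reused across attention calls."""

    def __init__(self, max_blocks, block_shape, dtype=torch.float32, device="cpu"):
        # Allocated once, up front, so its address stays stable afterwards.
        self.buf = torch.empty((max_blocks, *block_shape), dtype=dtype, device=device)

    def gather(self, cache: torch.Tensor, block_ids: torch.Tensor) -> torch.Tensor:
        n = block_ids.numel()
        out = self.buf[:n]  # view into the workspace, no new allocation
        # Write the selected blocks directly into the pre-allocated view.
        torch.index_select(cache, 0, block_ids, out=out)
        return out


if __name__ == "__main__":
    cache = torch.arange(12.0).reshape(4, 3)
    ws = GatherWorkspace(max_blocks=4, block_shape=(3,))
    out = ws.gather(cache, torch.tensor([2, 0]))
    print(out)
```

The returned tensor is a view of the shared buffer, so it is only valid until the next gather call overwrites it.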

Install

pip install -U glq==0.3.3

Compatibility

| Setup | Mode | Status |
|---|---|---|
| Weight-only GLQ (no E8 KV) | piecewise (default) | works; matches v0.3.2 perf |
| Weight-only GLQ | eager (--enforce-eager) | works |
| GLQ + E8 KV cache | eager (--enforce-eager) — required | works; matches v0.3.0 perf |
| GLQ + E8 KV cache | piecewise | not supported yet (v0.3.4 target) |
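
The matrix above can be expressed as a small pre-flight check, sketched below. The helper names are hypothetical and this is not GLQ's startup notice; it assumes only what the notes state, namely that E8 KV is switched on by environment variables matching GLQ_KV_E8_* set to 1, and that it requires enforce_eager=True in v0.3.x. The key GLQ_KV_E8_EXAMPLE is a placeholder, not a real variable name.

```python
import os


def e8_kv_enabled(env=None):
    """True if any GLQ_KV_E8_* variable is set to "1" (hypothetical helper)."""
    env = os.environ if env is None else env
    return any(k.startswith("GLQ_KV_E8_") and v == "1" for k, v in env.items())


def check_mode(enforce_eager, env=None):
    """Map a setup onto the compatibility table for v0.3.3."""
    if e8_kv_enabled(env) and not enforce_eager:
        return "unsupported: E8 KV requires enforce_eager=True in v0.3.x"
    return "ok"


if __name__ == "__main__":
    # GLQ_KV_E8_EXAMPLE is a placeholder key used only for this sketch.
    print(check_mode(enforce_eager=False, env={}))
    print(check_mode(enforce_eager=True, env={"GLQ_KV_E8_EXAMPLE": "1"}))
    print(check_mode(enforce_eager=False, env={"GLQ_KV_E8_EXAMPLE": "1"}))
```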
