Summary
AI summary: Fixes the v0.3.2 regression where E8 KV cache combined with the CUDA-graph default caused cudaErrorStreamCaptureUnsupported, and adds four new torch.library ops.
Full changelog
TL;DR: Every Triton kernel GLQ launches is now a registered torch.library op with a meta kernel. Fixes the v0.3.2 regression for users combining E8 KV cache with the new piecewise CUDA-graph default — that combo now correctly forces eager mode (or hits a clear startup warning instead of cudaErrorStreamCaptureUnsupported).
What changed
Kernel wrapping (4 new ops in torch.ops.glq.*):
| Op | Triggered by | Status |
|---|---|---|
| glq::input_rht_triton | n_pad > 16384 (70B-class MLPs) | dormant today, future-proofs the path |
| glq::output_rht_triton | m_pad > 16384 | same |
| glq::gather_kv_paged_dequant | every E8 KV attention read | active |
| glq::scatter_kv_paged_quant | every E8 KV attention write | active |
3 new fullgraph regression tests in tests/test_glq_compile_fullgraph.py capture toy modules under torch.compile(fullgraph=True). If anyone re-introduces a raw pybind/Triton call on the hot path, CI surfaces it with BackendCompilerFailed.
v0.3.2 E8 KV regression fix:
v0.3.2 dropped --enforce-eager as the default — that broke users with GLQ_KV_E8_*=1 because _patched_unified_attention calls block_table.flatten().unique() (illegal during CUDA-graph capture). v0.3.3:
- glq_vllm/kv_compression.py: skip the .unique() scoped-decompress optimisation when torch.cuda.is_current_stream_capturing() returns True (defensive fallback to all-blocks gather).
- glq_vllm/__init__.py: startup notice that E8 KV requires enforce_eager=True in v0.3.x.
- benchmarks/kv_compress_{niah,mmlu}_vllm.py: enforce_eager=True is back as the default for these scripts; NIAH has a --enforce-eager / --no-enforce-eager CLI toggle.
- README KV-cache section explicitly says E8 KV requires --enforce-eager until v0.3.4.
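The defensive fallback in kv_compression.py can be pictured roughly as follows. This is a sketch of the idea only, not the project's code: the function name blocks_to_gather, its parameters, and the capturing override (added so the capture branch can be exercised without a GPU) are all assumptions.

```python
from typing import Optional

import torch


def blocks_to_gather(
    block_table: torch.Tensor,
    num_blocks: int,
    capturing: Optional[bool] = None,
) -> torch.Tensor:
    """Return the physical block ids to decompress for one attention call."""
    if capturing is None:
        # Only query capture state when CUDA is actually usable.
        capturing = (
            torch.cuda.is_available()
            and torch.cuda.is_current_stream_capturing()
        )
    if capturing:
        # unique() has a data-dependent output size, so it cannot run inside
        # CUDA-graph capture. Fall back to gathering every block.
        return torch.arange(num_blocks, device=block_table.device)
    # Eager path: decompress only the blocks this batch references.
    return block_table.flatten().unique()


table = torch.tensor([[3, 1], [1, 5]])
scoped = blocks_to_gather(table, num_blocks=8, capturing=False)
fallback = blocks_to_gather(table, num_blocks=8, capturing=True)
```

The fallback trades extra decompression work for a fixed output shape.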
Weight-only GLQ (no GLQ_KV_E8_* envs) keeps the v0.3.2 piecewise win — 2.78× E4B decode is preserved.
Known limitation (v0.3.4 target)
E8 KV under piecewise / full CUDA-graph capture still fails on graph replay with cudaErrorIllegalAddress because the gather workspace is allocated via torch.empty() per attention call (not graph-safe). Refactoring the gather to use a pre-allocated workspace will unlock piecewise speedup for the E8 KV path too. Tracked for v0.3.4.
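As a rough picture of the planned refactor, a pre-allocated workspace could look like the sketch below. The class name GatherWorkspace, its parameters, and the sizing policy are hypothetical; the actual v0.3.4 design is not described in these notes.

```python
import torch


class GatherWorkspace:
    """Hypothetical fixed-size buffer allocated once, outside graph capture."""

    def __init__(self, max_tokens: int, head_dim: int,
                 dtype: torch.dtype = torch.float16, device: str = "cpu"):
        # One allocation up front; its address stays stable for the
        # lifetime of the object, unlike a torch.empty() per call.
        self._buf = torch.empty((max_tokens, head_dim), dtype=dtype, device=device)

    def view(self, num_tokens: int) -> torch.Tensor:
        # Hand out a slice of the same storage instead of allocating.
        if num_tokens > self._buf.shape[0]:
            raise ValueError("request exceeds pre-allocated workspace")
        return self._buf[:num_tokens]


ws = GatherWorkspace(max_tokens=16, head_dim=4)
first = ws.view(4)
second = ws.view(8)
```

Both views start at the same address, so a replayed graph would keep reading and writing memory that is still owned by the workspace.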
Install
pip install -U glq==0.3.3
Compatibility
| Setup | Mode | Status |
|---|---|---|
| Weight-only GLQ (no E8 KV) | piecewise (default) | works; matches v0.3.2 perf |
| Weight-only GLQ | eager (--enforce-eager) | works |
| GLQ + E8 KV cache | eager (--enforce-eager) — required | works; matches v0.3.0 perf |
| GLQ + E8 KV cache | piecewise | not supported yet (v0.3.4 target) |
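The table can be read as a small decision rule. The sketch below encodes it in plain Python; the helper names are hypothetical, and matching any variable with the GLQ_KV_E8_ prefix set to "1" is an assumption, since these notes only give the GLQ_KV_E8_*=1 pattern rather than the individual variable names.

```python
import os
from typing import Mapping, Optional


def e8_kv_enabled(env: Mapping[str, str]) -> bool:
    # Assumption: any GLQ_KV_E8_* variable set to "1" turns on the E8 KV path.
    return any(k.startswith("GLQ_KV_E8_") and v == "1" for k, v in env.items())


def resolve_enforce_eager(env: Mapping[str, str],
                          requested: Optional[bool] = None) -> bool:
    """Pick the enforce_eager value implied by the compatibility table."""
    if e8_kv_enabled(env):
        # E8 KV requires eager mode in v0.3.x, whatever the caller asked for.
        return True
    # Weight-only GLQ: piecewise is the default unless eager is requested.
    return bool(requested)


# Example: derive the setting from the current process environment.
use_eager = resolve_enforce_eager(os.environ)
```

The resulting boolean would then be passed as enforce_eager when constructing the engine.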