Glq

v0.2.10 Feature

This release adds two notable features: a CUDA-graph-safe forward pass for block-diagonal E8RHTLinear and a fused multi-block FHT kernel.

Published 3mo Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Block-diagonal decode with CUDA graphs now beats the legacy power-of-2 path by ~10% on Blackwell RTX PRO 6000.

Full changelog

Highlights

Block-diagonal GLQ decode is now faster than the legacy power-of-2 fused path. On Blackwell RTX PRO 6000 with SmolLM2-135M 4bpw:

| Path | v0.2.9 | v0.2.10 |
|------|--------|---------|
| Block-diag eager | 51.1 tok/s | 53.6 tok/s |
| Block-diag + CUDA graph | 121.6 tok/s | 136.3 tok/s |
| Legacy pow2 + CUDA graph | 124.3 tok/s | 124.3 tok/s |

Block-diag with CUDA graph now beats pow2 with CUDA graph by ~10% (136.3 vs 124.3 tok/s): same butterfly work, 3 launches total, exact in_features (no padding).
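The percentages behind these figures can be rechecked with a few lines of arithmetic. The snippet below is only a convenience sketch: the throughput numbers are copied from the table above, and the helper name `pct_gain` is made up for illustration.

```python
# Throughput in tok/s, copied from the table above: (v0.2.9, v0.2.10).
results = {
    "Block-diag eager": (51.1, 53.6),
    "Block-diag + CUDA graph": (121.6, 136.3),
    "Legacy pow2 + CUDA graph": (124.3, 124.3),
}


def pct_gain(old, new):
    """Relative throughput change from old to new, in percent."""
    return (new / old - 1.0) * 100.0


# Release-over-release change for each path.
for path, (v029, v0210) in results.items():
    print(f"{path}: {pct_gain(v029, v0210):+.1f}% vs v0.2.9")

# Block-diag graph vs legacy pow2 graph, both measured on v0.2.10.
graph_vs_pow2 = pct_gain(
    results["Legacy pow2 + CUDA graph"][1],
    results["Block-diag + CUDA graph"][1],
)
print(f"Block-diag graph vs pow2 graph: {graph_vs_pow2:+.1f}%")
```

The last figure lands a little under 10%, which matches the "~10%" quoted in the highlights.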

Features

Phase A: CUDA-graph-safe forward for block-diagonal E8RHTLinear. The _blocks_n/m_tensor bookkeeping tensors are now built eagerly in __init__, empty placeholders are cached, and an explicit device="cpu" pin keeps HF's init_empty_weights meta-default-device from silently promoting bookkeeping tensors to meta.
Phase B: Fused multi-block FHT kernel (glq_{input,output}_rht_multiblock_kernel). Collapses N per-sub-block launches into one: gridDim.y = num_blocks, and blockIdx.y selects the sub-block via packed int4 device metadata. The fused path is dispatched when max_bs ≤ 8192; the legacy per-block loop is retained for larger blocks.
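The multi-block idea can be sketched on the CPU in plain Python. This is not GLQ's kernel. It assumes sub-blocks are power-of-2 sized and come from the binary expansion of in_features, and it models the device metadata as a simple list of (offset, size) pairs. The names `fwht`, `pow2_blocks`, `per_block_loop` and `fused_multiblock` are hypothetical, and "launches" here only counts how many times a transform pass is started.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-2-length list."""
    n = len(vec)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    out = list(vec)
    h = 1
    while h < n:
        # One butterfly stage: combine elements that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def pow2_blocks(n):
    """ASSUMPTION: split n into descending power-of-2 blocks (binary expansion)."""
    assert n > 0
    blocks, offset = [], 0
    bit = 1 << (n.bit_length() - 1)
    while bit:
        if n & bit:
            blocks.append((offset, bit))
            offset += bit
        bit >>= 1
    return blocks


def per_block_loop(x, blocks):
    """Legacy style: one simulated launch per sub-block."""
    out, launches = list(x), 0
    for offset, size in blocks:
        out[offset:offset + size] = fwht(x[offset:offset + size])
        launches += 1
    return out, launches


def fused_multiblock(x, blocks, max_bs=8192):
    """Fused style: one simulated launch; a y index selects the sub-block."""
    if max(size for _, size in blocks) > max_bs:
        return per_block_loop(x, blocks)  # large-block fallback
    out = list(x)
    for block_idx_y in range(len(blocks)):  # stands in for gridDim.y
        offset, size = blocks[block_idx_y]  # stands in for the metadata lookup
        out[offset:offset + size] = fwht(x[offset:offset + size])
    return out, 1


blocks = pow2_blocks(576)  # 576 = 512 + 64, so no padding to 1024 is needed
x = [float(i % 7) for i in range(576)]
legacy_out, legacy_launches = per_block_loop(x, blocks)
fused_out, fused_launches = fused_multiblock(x, blocks)
print(blocks, legacy_launches, fused_launches, legacy_out == fused_out)
```

Both paths perform the same butterfly arithmetic per sub-block, so their outputs agree exactly. The only difference in this model is how many passes are started, which is the launch overhead the fused kernel is described as removing.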

Fixes

fast_hadamard_transform now falls back to the PyTorch implementation when the input is on CPU (previously it raised an error even with the CUDA package installed).
_process_model_before_weight_loading no longer crashes on models without .config (e.g. bare nn.Sequential).
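Both fixes follow a common defensive pattern, sketched below with stand-in objects rather than GLQ's real code. Every name here (`choose_fht_backend`, `get_config_or_none`, `BareModel`, `ConfiguredModel`) is hypothetical and only illustrates the behavior described above.

```python
def choose_fht_backend(device_type, cuda_pkg_installed):
    """Hypothetical dispatch: use the CUDA kernel only for CUDA inputs."""
    if device_type == "cuda" and cuda_pkg_installed:
        return "cuda_kernel"
    # CPU inputs (or a missing CUDA package) take the PyTorch path
    # instead of raising.
    return "pytorch_fallback"


def get_config_or_none(model):
    """Read model.config if present; return None instead of raising."""
    return getattr(model, "config", None)


class BareModel:
    """Stand-in for a model without .config, e.g. a bare nn.Sequential."""


class ConfiguredModel:
    """Stand-in for a model that carries a config attribute."""
    config = {"model_type": "example"}


print(choose_fht_backend("cpu", cuda_pkg_installed=True))
print(get_config_or_none(BareModel()))
print(get_config_or_none(ConfiguredModel()))
```

The point of the sketch is that the device check, not the mere presence of the CUDA package, decides which backend runs, and that a missing attribute is treated as "no config" rather than as an error.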

Testing

30 new CUDA fast-path tests for block-diagonal E8RHTLinear covering multiblock↔legacy equivalence, B=1 matvec vs B≥2 TC consistency, CUDA graph bit-exactness, and large-block fallback.
8 stale tests updated for block-diag-default and nsamples=128-default semantics.
Full suite: 260 passed, 3 skipped, 0 failed.

Install

pip install glq==0.2.10
