
Glq

v0.2.10 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 1mo Model Serving & MLOps
✓ No known CVEs patched


Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Block-diagonal decode with CUDA graphs now beats the legacy power-of-2 path by ~10% on Blackwell RTX PRO 6000.

Full changelog

Highlights

Block-diagonal GLQ decode is now faster than the legacy power-of-2 fused path. On Blackwell RTX PRO 6000 with SmolLM2-135M 4bpw:

| Path | v0.2.9 | v0.2.10 |
|------|--------|---------|
| Block-diag eager | 51.1 tok/s | 53.6 tok/s |
| Block-diag + CUDA graph | 121.6 tok/s | 136.3 tok/s |
| Legacy pow2 + CUDA graph | 124.3 tok/s | 124.3 tok/s |

With CUDA graphs, the block-diag path now beats the pow2 path by ~10% (136.3 vs. 124.3 tok/s). It does the same butterfly work in 3 launches total and uses the exact in_features, with no padding.
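The headline percentage can be re-derived from the table. The snippet below is only an arithmetic check on the published numbers; the dictionary keys and the `pct_gain` helper are illustrative names, not part of GLQ.

```python
# Throughput figures (tok/s) copied from the benchmark table above.
v0_2_9 = {"blockdiag_eager": 51.1, "blockdiag_graph": 121.6, "pow2_graph": 124.3}
v0_2_10 = {"blockdiag_eager": 53.6, "blockdiag_graph": 136.3, "pow2_graph": 124.3}


def pct_gain(new, old):
    """Relative change of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Block-diag + CUDA graph against legacy pow2 + CUDA graph, same release.
graph_vs_pow2 = pct_gain(v0_2_10["blockdiag_graph"], v0_2_10["pow2_graph"])
# The same comparison one release earlier, when block-diag was still behind.
graph_vs_pow2_prev = pct_gain(v0_2_9["blockdiag_graph"], v0_2_9["pow2_graph"])
# Release-over-release change for the two block-diag paths.
graph_vs_prev = pct_gain(v0_2_10["blockdiag_graph"], v0_2_9["blockdiag_graph"])
eager_vs_prev = pct_gain(v0_2_10["blockdiag_eager"], v0_2_9["blockdiag_eager"])

print(f"graph vs pow2 (v0.2.10): {graph_vs_pow2:+.1f}%")
print(f"graph vs pow2 (v0.2.9):  {graph_vs_pow2_prev:+.1f}%")
print(f"graph, v0.2.9 -> v0.2.10: {graph_vs_prev:+.1f}%")
print(f"eager, v0.2.9 -> v0.2.10: {eager_vs_prev:+.1f}%")
```

The exact ratio is closer to 9.7%, which the notes round to ~10%. Most of the change comes from the CUDA-graph path; the eager path moves by about 5%.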

Features

  • Phase A — CUDA-graph-safe forward for block-diagonal E8RHTLinear. Eager _blocks_n/m_tensor construction in __init__, cached empty placeholders, explicit device="cpu" pin so HF's init_empty_weights meta-default-device doesn't silently promote bookkeeping tensors to meta.
  • Phase B — Fused multi-block FHT kernel (glq_{input,output}_rht_multiblock_kernel). Collapses N per-sub-block launches into one. gridDim.y = num_blocks, blockIdx.y selects the sub-block via packed int4 device metadata. Dispatches via max_bs ≤ 8192 gate; legacy per-block loop retained for larger blocks.
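To make the multi-block idea concrete, here is a minimal pure-Python sketch. It assumes that a block-diagonal transform splits in_features into power-of-two sub-blocks and runs an independent fast Hadamard transform on each; the notes imply this but do not spell it out. Every function name below is hypothetical, the random sign flips of a full RHT are omitted, and the real kernels are CUDA. The width 576 is purely illustrative.

```python
def split_pow2_blocks(n):
    """Decompose n into descending power-of-two sizes (its binary expansion)."""
    if n <= 0:
        raise ValueError("n must be positive")
    sizes = []
    bit = 1 << (n.bit_length() - 1)
    while bit:
        if n & bit:
            sizes.append(bit)
        bit >>= 1
    return sizes


def build_meta(sizes):
    """Per-block (offset, size) table; stands in for packed device metadata."""
    meta, offset = [], 0
    for size in sizes:
        meta.append((offset, size))
        offset += size
    return meta


def fht_inplace(x, start, size):
    """Unnormalized in-place Walsh-Hadamard butterfly on x[start:start+size]."""
    h = 1
    while h < size:
        for i in range(start, start + size, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2


def rht_per_block(x, sizes):
    """Per-block style: one call (think: one kernel launch) per sub-block."""
    out, offset = list(x), 0
    for size in sizes:
        fht_inplace(out, offset, size)
        offset += size
    return out


def rht_multiblock(x, meta):
    """Fused style: a single entry point; the block index picks its slice."""
    out = list(x)
    for block_idx in range(len(meta)):  # plays the role of blockIdx.y
        offset, size = meta[block_idx]
        fht_inplace(out, offset, size)
    return out


sizes = split_pow2_blocks(576)  # 576 = 512 + 64, so no padding to 1024
meta = build_meta(sizes)
x = list(range(576))
assert rht_per_block(x, sizes) == rht_multiblock(x, meta)
```

In this sketch the two paths differ only in how work is dispatched, which is why they must agree exactly. On a GPU the loop over `block_idx` would become the grid's y dimension, so the number of launches no longer grows with the number of sub-blocks.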

Fixes

  • fast_hadamard_transform falls back to the PyTorch implementation when the input is on CPU (previously it raised an error even with the CUDA package installed).
  • _process_model_before_weight_loading no longer crashes on models without .config (e.g. bare nn.Sequential).
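Both fixes follow a common defensive pattern, sketched below in plain Python. This is not GLQ's code: `pick_fht_backend`, `get_model_config`, and the backend labels are hypothetical names chosen for illustration, and the stand-in objects replace real models.

```python
from types import SimpleNamespace


def pick_fht_backend(device_type, cuda_ext_available):
    """Use the compiled kernel only for CUDA inputs; otherwise fall back.

    Hypothetical helper: the check is on where the input lives, not merely
    on whether the CUDA extension happens to be installed.
    """
    if device_type == "cuda" and cuda_ext_available:
        return "cuda_ext"
    return "pytorch"


def get_model_config(model):
    """Return model.config when present, otherwise None instead of raising."""
    return getattr(model, "config", None)


# A CPU input stays on the portable path even when the extension is installed.
cpu_choice = pick_fht_backend("cpu", cuda_ext_available=True)
gpu_choice = pick_fht_backend("cuda", cuda_ext_available=True)

# Stand-ins for a model with a config and a bare container without one.
with_config = SimpleNamespace(config={"hidden_size": 576})
bare_model = SimpleNamespace()
cfg_present = get_model_config(with_config)
cfg_missing = get_model_config(bare_model)
```

Callers of a guard like `get_model_config` still need to handle the `None` case explicitly, for example by skipping config-dependent setup.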

Testing

  • 30 new CUDA fast-path tests for block-diagonal E8RHTLinear covering multiblock↔legacy equivalence, B=1 matvec vs B≥2 TC consistency, CUDA graph bit-exactness, and large-block fallback.
  • 8 stale tests updated for block-diag-default and nsamples=128-default semantics.
  • Full suite: 260 passed, 3 skipped, 0 failed.

Install

pip install glq==0.2.10
