Glq

v0.1.6 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 4mo ago | Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Tiled Triton kernel achieves up to 12.5x faster codebook nearest‑neighbor quantization.

Full changelog

Tiled Tensor Core codebook kernel

5-12x faster quantization via rewritten Triton codebook nearest-neighbor kernel.

Changes

Tiled Triton kernel: each program now processes a tile of BLOCK_N query rows, with the vector dimension zero-padded from D=8 to 16 so the dot products run on fp16 Tensor Cores (mma.m16n8k16). This amortizes codebook L2 reads across the rows of a tile, instead of every program independently scanning the full 1MB codebook.
FP16 feedback matmul + incremental residual in LDLQ loop
Pre-computed codebook_half passed to Triton kernel (avoids redundant fp32→fp16 conversion per call)
Fix device=="cuda" checks so that device strings such as "cuda:0" are handled correctly when CPU offloading is used
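
The tiling and zero-padding idea in the first change can be illustrated with a small NumPy stand-in. This is a sketch only, not the project's Triton kernel: the function names, `block_n`, `pad_to`, the float64 dtype, and the tiny 256-entry codebook are all assumptions chosen for readability. It relies on the identity argmin ||x - c||² = argmin (||c||² - 2 x·c), which turns the search into a matrix multiply, and on the fact that appending zeros to both operands leaves every dot product unchanged.

```python
import numpy as np

def nearest_codes_bruteforce(x, codebook):
    """Reference: explicit squared distances, shape (N, K), then argmin."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def nearest_codes_tiled(x, codebook, block_n=64, pad_to=16):
    """Hypothetical tiled variant: block_n rows share one pass over the codebook."""
    n, d = x.shape
    k = codebook.shape[0]

    # Zero-pad the codebook from d to pad_to columns (8 -> 16 in this demo).
    cb_pad = np.zeros((k, pad_to), dtype=x.dtype)
    cb_pad[:, :d] = codebook
    cb_norm = (cb_pad ** 2).sum(axis=1)  # ||c||^2, computed once

    out = np.empty(n, dtype=np.int64)
    for start in range(0, n, block_n):
        tile = x[start:start + block_n]
        tile_pad = np.zeros((tile.shape[0], pad_to), dtype=x.dtype)
        tile_pad[:, :d] = tile
        # ||x||^2 is constant per row, so it can be dropped from the argmin.
        scores = cb_norm[None, :] - 2.0 * (tile_pad @ cb_pad.T)
        out[start:start + tile.shape[0]] = scores.argmin(axis=1)
    return out

rng = np.random.default_rng(0)
demo_codebook = rng.standard_normal((256, 8))
demo_rows = rng.standard_normal((1000, 8))
demo_idx = nearest_codes_tiled(demo_rows, demo_codebook)
```

In this sketch the padded columns contribute nothing to either the dot products or the codebook norms, so the tiled result matches the brute-force reference. The real kernel works in fp16, where rounding could in principle change which of two near-equal candidates wins.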

Benchmarks (NVIDIA A10G)

| Benchmark | v0.1.5 | v0.1.6 | Speedup |
|---|---|---|---|
| Codebook NN (9216 rows) | 12.2ms | 0.98ms | 12.5x |
| LDLQ gate_proj 9216×3072 | 4.92s | 0.53s | 9.3x |
| SmolLM2-360M full quantize | 167s | 84s | 2.0x |

Larger models (3B+) see greater improvement due to bigger weight matrices (5-9x on LDLQ step).
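
One way to read the gap between the 9.3x LDLQ speedup and the 2.0x end-to-end speedup is Amdahl's law. The sketch below is an illustration under a strong assumption, not a measurement from the release: it supposes that a single fraction `p` of the v0.1.5 runtime was accelerated by the per-layer LDLQ factor and that everything else was unchanged. The function names are hypothetical.

```python
def speedup(before, after):
    """Ratio of old to new wall-clock time."""
    return before / after

def accelerated_fraction(overall, stage):
    """Solve Amdahl's law, 1/overall = (1 - p) + p/stage, for p."""
    return (1 - 1 / overall) / (1 - 1 / stage)

ldlq = speedup(4.92, 0.53)    # LDLQ gate_proj row of the table
full = speedup(167.0, 84.0)   # SmolLM2-360M full quantize row
p = accelerated_fraction(full, ldlq)

print(f"LDLQ {ldlq:.1f}x, end-to-end {full:.1f}x, implied accelerated fraction {p:.2f}")
```

Under that assumption, a little over half of the old SmolLM2-360M runtime sat in the accelerated path. If that path takes a larger share of the runtime on bigger models, the end-to-end speedup would move closer to the per-layer figure.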

Perplexity is verified unchanged (SmolLM2-360M at 2bpw: PPL=18.10 with 32 calibration samples, matching v0.1.5).
