This release adds 3 notable features for engineering teams evaluating rollout.
Published 1mo
Model Serving & MLOps
✓ No known CVEs patched in this version
Topics
inference
llm
model-compression
pytorch
quantization
Summary
AI summary: Block-diagonal FHT eliminates padding overhead, and N-stage RVQ enables true 2-8 bpw quantization.
Full changelog
Features
- Block-diagonal FHT: eliminates power-of-2 padding overhead (6.8 → 4.0 effective bpw for Nemotron-30B)
- N-stage RVQ: true 2-8 bpw quantization via multi-stage codebooks
- CUDA kernel support for block-diagonal FHT (via the col_offset parameter)
- Default nsamples=128, with a warning when nsamples < 64
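
To illustrate where the padding overhead comes from, here is a minimal NumPy sketch of a block-diagonal fast Hadamard transform. It is not GLQ's implementation: the function names, the choice of block sizes (the binary decomposition of the width), and the example width of 576 are all assumptions made for illustration. The `offset` variable only hints at the role a column offset such as `col_offset` might play in a kernel.

```python
import numpy as np

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis.

    The last-axis length must be a power of 2.
    """
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    h = 1
    while h < n:
        # Butterfly step: combine neighbouring length-h halves as (a+b, a-b).
        x = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a, b = x[..., 0, :], x[..., 1, :]
        x = np.stack((a + b, a - b), axis=-2).reshape(*x.shape[:-3], n)
        h *= 2
    return x / np.sqrt(n)

def pow2_blocks(n):
    """Split n >= 1 into power-of-2 block sizes, largest first (assumed scheme)."""
    sizes = []
    bit = 1 << (n.bit_length() - 1)
    while bit:
        if n & bit:
            sizes.append(bit)
        bit >>= 1
    return sizes

def block_diag_fwht(x):
    """Apply an independent FWHT to each power-of-2 block of the last axis."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    offset = 0  # start column of the current block
    for size in pow2_blocks(x.shape[-1]):
        out[..., offset:offset + size] = fwht(x[..., offset:offset + size])
        offset += size
    return out

def padded_len(n):
    """Length after padding n up to the next power of 2."""
    return 1 << (n - 1).bit_length()

if __name__ == "__main__":
    width = 576  # hypothetical layer width, not taken from the release notes
    print(pow2_blocks(width))         # block sizes used instead of padding
    print(padded_len(width) / width)  # storage inflation if padded instead
```

Under these assumptions, padding a width of 576 to 1024 stores about 1.78 codes per real weight, while the block-diagonal form transforms exactly 576 columns. Each block is orthonormal and self-inverse, so applying `block_diag_fwht` twice recovers the input.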
Quality (SmolLM2-135M lm-eval 5-task)
| bpw | Stages | % of bf16 |
|-----|--------|-----------|
| 4 | 2 | 97.0% |
| 5 | 3 | 99.3% |
| 6 | 3 | 100.2% |
| 8 | 4 | 100.5% |
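
The link between stage count and bit budget can be sketched with generic multi-stage residual vector quantization (RVQ). This is not GLQ's implementation. It assumes 2-dimensional vectors and 16-entry k-means codebooks (4 bits per 2 weights, so 2 bpw per stage), which lines up with the 4, 6, and 8 bpw rows above at 2, 3, and 4 stages; the release does not state the actual codebook shape, and every function name here is hypothetical.

```python
import numpy as np

def nearest(x, codebook):
    """Index of the closest codebook entry for every row of x."""
    d2 = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def fit_codebook(x, k, iters=10, seed=0):
    """Fit a k-entry codebook to the rows of x with a few Lloyd (k-means) steps."""
    rng = np.random.default_rng(seed)
    codebook = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        idx = nearest(x, codebook)
        for j in range(k):
            members = x[idx == j]
            if len(members):  # leave an empty cluster's entry unchanged
                codebook[j] = members.mean(axis=0)
    return codebook

def rvq_encode(x, n_stages, bits_per_stage=4):
    """Quantize x in n_stages passes; each pass encodes the previous residual."""
    residual = np.asarray(x, dtype=np.float64).copy()
    codebooks, codes = [], []
    for stage in range(n_stages):
        cb = fit_codebook(residual, 2 ** bits_per_stage, seed=stage)
        idx = nearest(residual, cb)
        residual = residual - cb[idx]  # what the next stage has to explain
        codebooks.append(cb)
        codes.append(idx)
    return codebooks, codes

def rvq_decode(codebooks, codes):
    """Reconstruct by summing the selected entry from every stage."""
    return sum(cb[idx] for cb, idx in zip(codebooks, codes))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((2048, 2))  # stand-in for grouped weights
    for n_stages in (1, 2, 3, 4):
        cbs, codes = rvq_encode(w, n_stages)
        mse = np.mean((w - rvq_decode(cbs, codes)) ** 2)
        bpw = n_stages * 4 / w.shape[1]
        print(f"{n_stages} stage(s), {bpw:.0f} bpw, mse={mse:.5f}")
```

Each added stage spends the same number of bits on a smaller residual, so the reconstruction error shrinks as stages are added. A row such as 5 bpw at 3 stages would need stages of unequal size, which this sketch does not model.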
New Model
- xv0y5ncu/SmolLM3-3B-GLQ-6bpw — 99.6% of bf16 at true 6.0 bpw (block-diagonal FHT, zero padding)
Compatibility
- Fully backward compatible: existing power-of-2 models load and run unchanged
- Requires transformers >= 5.0 for small models