CUDA C Kernels
Dequant split-K matvec (glq/csrc/glq_cuda.cu):
- 4 rows/warp with `__shfl_xor_sync` reduction, `__launch_bounds__(256,2)`
- Beats cuBLAS dense fp16 matmul on 2/3 benchmark shapes
- 2.7-3.0× faster than Triton kernels
| Shape | CUDA C | Triton | cuBLAS |
|-------|--------|--------|--------|
| 3072×3072 | 39μs | 104μs | 47μs |
| 3072×9216 | 51μs | 142μs | 39μs |
| 9216×3072 | 52μs | 158μs | 99μs |
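For readers unfamiliar with the two ideas named above, here is a rough pure-Python model of a split-K matvec combined with an XOR butterfly reduction. It is only a sketch: the function names, the `splits` and `warp` parameters, and the lane-to-element mapping are assumptions made for illustration. It does not model dequantization, the 4 rows/warp layout, or the actual code in `glq/csrc/glq_cuda.cu`.

```python
def warp_xor_reduce(lanes):
    """Simulate a butterfly all-reduce across the lanes of one warp.

    Mirrors the pattern `v += __shfl_xor_sync(mask, v, offset)` with the
    offset halving each step; afterwards every lane holds the full sum.
    """
    width = len(lanes)
    assert width > 0 and width & (width - 1) == 0, "lane count must be a power of two"
    vals = list(lanes)
    offset = width // 2
    while offset >= 1:
        # Each lane adds the value held by its partner lane (lane XOR offset).
        vals = [vals[lane] + vals[lane ^ offset] for lane in range(width)]
        offset //= 2
    return vals


def split_k_matvec(weight, x, splits=4, warp=32):
    """Compute y = weight @ x with the reduction (K) dimension cut into slices."""
    k = len(x)
    chunk = (k + splits - 1) // splits
    y = [0.0] * len(weight)
    for s in range(splits):  # on a GPU, each slice would go to a different block
        lo, hi = s * chunk, min((s + 1) * chunk, k)
        for r, row in enumerate(weight):
            # Each simulated lane accumulates a strided share of the slice.
            lanes = [0.0] * warp
            for j in range(lo, hi):
                lanes[(j - lo) % warp] += row[j] * x[j]
            # Combine the partial sum of this slice into the output row.
            y[r] += warp_xor_reduce(lanes)[0]
    return y
```

Splitting K gives a single-vector (B=1) product more independent work items than one per output row, which is the usual motivation for the technique on wide GPUs.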
Shared-memory FHT for input/output RHT:
- Double-buffered butterfly stages in shared memory
- 1.6-3.1× faster than Triton global-memory FHT (n_pad ≤ 8192)
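The double-buffered butterfly structure can be illustrated with a small reference implementation. This is a plain-Python sketch of an unnormalized fast Hadamard transform that ping-pongs between two buffers, one read and one written per stage. The function name is hypothetical, the two lists only stand in for shared-memory buffers, and any random sign flips or scaling that an RHT would typically apply are left out.

```python
def fht(values):
    """Unnormalized fast Hadamard transform; length must be a power of two."""
    n = len(values)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    src = list(values)  # buffer read during the current stage
    dst = [0] * n       # buffer written during the current stage
    h = 1
    while h < n:
        for i in range(n):
            partner = i ^ h
            if i & h == 0:
                dst[i] = src[i] + src[partner]  # upper butterfly output: a + b
            else:
                dst[i] = src[partner] - src[i]  # lower butterfly output: a - b
        src, dst = dst, src  # swap buffers instead of updating in place
        h *= 2
    return src
```

Because every stage reads only from `src` and writes only to `dst`, no element is overwritten before its partner has been read, which is the property double buffering provides. Applying `fht` twice returns the input scaled by `n`.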
Triton Now Optional
CUDA C handles all batch sizes:
- B=1: split-K matvec (decode)
- B>1: batched matvec (prefill)
- Dispatch: CUDA C > Triton > PyTorch fallback
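The selection logic described above could look roughly like the following. This is a hypothetical sketch: the backend labels, kernel labels, and function names are invented for illustration and are not GLQ's actual API.

```python
def pick_backend(available, order=("cuda_c", "triton", "pytorch")):
    """Return the first backend in priority order that is available."""
    for name in order:
        if name in available:
            return name
    raise RuntimeError("no backend available")


def pick_kernel(batch_size):
    """B=1 uses the split-K matvec (decode); B>1 uses the batched matvec (prefill)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return "split_k_matvec" if batch_size == 1 else "batched_matvec"
```

For example, `pick_backend({"triton", "pytorch"})` falls through to `"triton"` when the CUDA C extension is not built.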
Performance (SmolLM3-3B 3.5bpw, L40S)
| Metric | v0.2.2 (Triton) | v0.2.5 (CUDA C) | Speedup |
|--------|-----------------|------------------|---------|
| Decode (B=1) | 12.8 tok/s | 17.7 tok/s | +38% |
| Prefill (B=16) | — | 59 tok/s | new |
| Generate 128 | 14.0 tok/s | 17.1 tok/s | +22% |
Perplexity unchanged (7.20).
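As a quick check on the speedup column, the percentages follow directly from the throughput figures. The helper names below are illustrative only. The second helper converts throughput into per-token latency, which some readers find easier to reason about.

```python
def speedup_percent(old_tok_s, new_tok_s):
    """Relative throughput gain, in percent."""
    return (new_tok_s / old_tok_s - 1.0) * 100.0


def ms_per_token(tok_s):
    """Average latency per generated token, in milliseconds."""
    return 1000.0 / tok_s


print(round(speedup_percent(12.8, 17.7)))  # decode row
print(round(speedup_percent(14.0, 17.1)))  # generate-128 row
print(round(ms_per_token(12.8), 1), round(ms_per_token(17.7), 1))
```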
Other Changes
- Fix `ProcessPoolExecutor` fork+CUDA deadlock (`mp_context='spawn'`)
- GLQ 3.5bpw mixed lm-eval results: 96.6% of bf16 accuracy
- 217 tests pass
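For context on the `ProcessPoolExecutor` fix: a forked child inherits the parent's CUDA state, which is a common cause of hangs, while the spawn start method gives each worker a fresh interpreter. A minimal sketch of requesting spawn with the standard library follows. The helper name and worker count are assumptions, and this is not necessarily how GLQ wires it up. Note that the standard-library API takes a context object from `multiprocessing.get_context`, not a bare string.

```python
import multiprocessing
from concurrent.futures import ProcessPoolExecutor


def make_spawn_pool(max_workers=2):
    """Create a process pool whose workers start via 'spawn' instead of 'fork'."""
    ctx = multiprocessing.get_context("spawn")
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=ctx)


if __name__ == "__main__":
    # Spawned workers re-import the main module, so pool use belongs
    # under this guard. The built-in `abs` stands in for real work.
    with make_spawn_pool() as pool:
        print(list(pool.map(abs, [-1, -2, 3])))
```

Spawned workers start more slowly than forked ones, so this trades some startup time for safety when CUDA is already initialized in the parent.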