
Glq

v0.2.8 Feature

This release groups its changes into three areas relevant to teams evaluating a rollout: kernel optimizations, inference integration, and quantization fixes.

Published 2 months ago · Model Serving & MLOps
✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

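Kernel optimizations for large dimensions (up to 32768), CUDA-graph support in generate(), automatic vLLM plugin registration, and Mistral3 streaming-quantization fixes improve GLQ's performance and compatibility.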

Full changelog

What's New

Kernel Optimizations

  • Two-pass FHT for n_pad=32768: 38 μs vs. 128 μs for the Triton kernel (3.4× faster). This enables Devstral-24B and other models with large intermediate dimensions.
  • Matvec inner loop unrolled by 2: two codebook gathers are issued back-to-back so their L2 latencies overlap.
  • Fused linear extended to 32768: all dimensions now use the fused C++ path, with stream-safe temporary buffers from PyTorch's CUDACachingAllocator.
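The notes report only timings for the two-pass FHT. Assuming FHT here means the fast Walsh-Hadamard transform commonly used for incoherence rotations in quantization, the pure-Python sketch below shows why a large transform can be split: for a power-of-two length n = n1·n2, the Sylvester Hadamard matrix factors as H_n = H_n1 ⊗ H_n2, so one pass over contiguous rows and one pass over strided columns reproduce the full transform. The names `next_pow2`, `fwht`, and `fwht_two_pass` and the split size `n2` are hypothetical; GLQ's actual kernel is CUDA and its pass layout is not described in these notes.

```python
# Minimal sketch, not GLQ's CUDA kernel: an unnormalized fast Walsh-Hadamard
# transform (Sylvester order) and the same transform split into two passes.
# The split size n2 is a hypothetical parameter chosen only for illustration.


def next_pow2(n: int) -> int:
    """Smallest power of two >= n, i.e. the padded length n_pad."""
    p = 1
    while p < n:
        p <<= 1
    return p


def fwht(x: list[float]) -> list[float]:
    """Single-pass butterfly FWHT; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y


def fwht_two_pass(x: list[float], n2: int) -> list[float]:
    """Same result via H_n = kron(H_n1, H_n2) on an (n1 x n2) row-major view of x.

    Pass 1 transforms each contiguous row of length n2; pass 2 transforms each
    stride-n2 column of length n1. Each pass works on n1 or n2 elements at a
    time instead of all n at once.
    """
    n = len(x)
    if n % n2:
        raise ValueError("n2 must divide len(x)")
    n1 = n // n2
    # Pass 1: contiguous rows.
    y: list[float] = []
    for r in range(n1):
        y.extend(fwht(x[r * n2:(r + 1) * n2]))
    # Pass 2: strided columns, written back in place.
    for c in range(n2):
        y[c::n2] = fwht(y[c::n2])
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    n_pad = next_pow2(len(x))              # 8
    x = x + [0.0] * (n_pad - len(x))       # zero-pad to the power-of-two length
    print(fwht(x) == fwht_two_pass(x, n2=4))  # True
```

Because each pass only touches n1 or n2 elements at a time, a split plausibly helps at n_pad = 32768, though the notes do not say how the kernel maps its passes onto on-chip memory.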

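To make the unroll-by-2 bullet concrete, the next sketch shows the loop shape in Python under a hypothetical layout in which each weight row is a list of codebook indices and each index selects a length-d vector. Python gains nothing from unrolling; the point is the structure: two gathers feeding independent accumulators per iteration, plus a remainder step. GLQ's real codebook format and kernel are not described in these notes.

```python
# Minimal sketch with a hypothetical weight layout (not GLQ's actual format):
# each weight row is a list of codebook indices, and index j selects a length-d
# codebook vector that multiplies x[j*d:(j+1)*d].


def dot_at(c: list[float], x: list[float], off: int) -> float:
    """Dot product of codebook vector c with the slice of x starting at off."""
    return sum(c[k] * x[off + k] for k in range(len(c)))


def matvec_codebook_unroll2(indices, codebook, x, d):
    """y = W @ x where W's rows are reconstructed from codebook gathers."""
    y = []
    for row in indices:
        acc0 = 0.0  # two independent accumulators, one per unrolled step,
        acc1 = 0.0  # so the two halves of the loop body do not depend on each other
        m = len(row)
        j = 0
        while j + 1 < m:
            # Both gathers are issued before either result is used; on a GPU this
            # lets the second load's L2 latency overlap the first's.
            c0 = codebook[row[j]]
            c1 = codebook[row[j + 1]]
            acc0 += dot_at(c0, x, j * d)
            acc1 += dot_at(c1, x, (j + 1) * d)
            j += 2
        if j < m:  # remainder when the row has an odd number of groups
            acc0 += dot_at(codebook[row[j]], x, j * d)
        y.append(acc0 + acc1)
    return y


if __name__ == "__main__":
    codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    indices = [[0, 2, 1], [2, 2, 0]]      # 2 rows, 3 groups of d=2 each
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(matvec_codebook_unroll2(indices, codebook, x, d=2))  # [14.0, 15.0]
```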
Inference

  • CUDAGraphWrapper now works with generate(): it uses StaticCache so the KV buffers have fixed shapes. SmolLM3-3B reaches 37 tok/s, 1.79× faster than eager mode.
  • vLLM general_plugins entry point: GLQ now registers itself automatically in every vLLM process, including the v1 engine subprocess.
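CUDA graphs replay recorded kernels against fixed memory addresses, so every buffer a captured decode step touches must keep its size and location from one step to the next. The toy classes below (hypothetical, not the transformers StaticCache API) contrast a preallocated cache that is written in place with one that grows every step. In transformers itself, a static cache can be requested with generate(..., cache_implementation="static"); how CUDAGraphWrapper wires this up is not specified in the notes.

```python
# Toy illustration only (not the transformers StaticCache API): a KV cache that is
# preallocated to max_len and written in place, compared with one that grows.


class ToyStaticKVCache:
    def __init__(self, max_len: int, head_dim: int):
        self.max_len = max_len
        self.head_dim = head_dim
        # All storage is allocated once, up front.
        self.keys = [[0.0] * head_dim for _ in range(max_len)]
        self.values = [[0.0] * head_dim for _ in range(max_len)]

    def update(self, pos: int, k: list[float], v: list[float]) -> None:
        if not 0 <= pos < self.max_len:
            raise IndexError("position exceeds the preallocated cache length")
        if len(k) != self.head_dim or len(v) != self.head_dim:
            raise ValueError("k and v must have length head_dim")
        # Slice assignment writes into the existing rows, so the buffers keep
        # their identity and size (the analogue of fixed device addresses).
        self.keys[pos][:] = k
        self.values[pos][:] = v

    def shape(self) -> tuple[int, int]:
        return (len(self.keys), self.head_dim)


class ToyDynamicKVCache:
    def __init__(self):
        self.keys: list[list[float]] = []
        self.values: list[list[float]] = []

    def update(self, k: list[float], v: list[float]) -> None:
        # Appending changes the buffer length on every decode step.
        self.keys.append(list(k))
        self.values.append(list(v))

    def shape(self) -> tuple[int, int]:
        return (len(self.keys), len(self.keys[0]) if self.keys else 0)


if __name__ == "__main__":
    static, dynamic = ToyStaticKVCache(max_len=4, head_dim=2), ToyDynamicKVCache()
    row0 = static.keys[0]
    for step in range(3):
        static.update(step, [float(step)] * 2, [float(step)] * 2)
        dynamic.update([float(step)] * 2, [float(step)] * 2)
        print(static.shape(), dynamic.shape())  # static stays (4, 2); dynamic grows
    print(static.keys[0] is row0)  # True: same buffer, written in place
```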

Quantization

  • Mistral3 streaming quantization: adds FP8 dequantization, text_config extraction, and rotary_emb support in streaming mode.
  • tokenizer_class stripping: removes the tokenizer_class entry to prevent compatibility issues when loading with vLLM or transformers.
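Two of the Mistral3 streaming steps can be sketched in plain Python. The assumptions are loud ones: FP8 weights are taken to be float8_e4m3fn bytes with a single per-tensor scale (real checkpoints may use per-channel or block-wise scales), and a multimodal config is assumed to nest the language-model settings under text_config. The helper names are hypothetical and are not GLQ's API.

```python
# Sketch of two Mistral3 streaming-quantization steps. Assumptions (hypothetical
# helpers, not GLQ's code): FP8 weights are E4M3FN bytes with one per-tensor scale,
# and multimodal configs nest the language-model settings under "text_config".
import math


def e4m3fn_to_float(byte: int) -> float:
    """Decode one float8_e4m3fn byte: 1 sign, 4 exponent (bias 7), 3 mantissa bits."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return math.nan  # E4M3FN has NaN but no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequant_fp8(raw: bytes, scale: float) -> list[float]:
    """Per-tensor dequantization: w = decode(q) * scale."""
    return [e4m3fn_to_float(b) * scale for b in raw]


def get_text_config(config: dict) -> dict:
    """Use the nested text_config when present (multimodal), else the config itself."""
    return config.get("text_config", config)


if __name__ == "__main__":
    raw = bytes([0x38, 0x40, 0xB8, 0x7E])   # 1.0, 2.0, -1.0, 448.0 in E4M3FN
    print(dequant_fp8(raw, scale=0.5))       # [0.5, 1.0, -0.5, 224.0]
    toy_cfg = {"text_config": {"hidden_size": 8}}  # toy values, not a real checkpoint
    print(get_text_config(toy_cfg))          # {'hidden_size': 8}
```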

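The tokenizer_class bullet does not say where the key is stripped. A minimal sketch, assuming it is removed from a saved tokenizer_config.json so that vLLM and transformers fall back to their own tokenizer resolution, looks like this; strip_tokenizer_class is a hypothetical name.

```python
# Hypothetical helper, not GLQ's code: drop "tokenizer_class" from a saved
# tokenizer_config.json. Assumption: without the key, the loader falls back to
# its own tokenizer resolution instead of a class name it may not recognize.
import json
import os
import tempfile


def strip_tokenizer_class(model_dir: str) -> bool:
    """Remove the tokenizer_class entry; return True if one was removed."""
    path = os.path.join(model_dir, "tokenizer_config.json")
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    if "tokenizer_class" not in cfg:
        return False
    del cfg["tokenizer_class"]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        with open(os.path.join(d, "tokenizer_config.json"), "w", encoding="utf-8") as f:
            json.dump({"tokenizer_class": "SomeTokenizerClass", "model_max_length": 4096}, f)
        print(strip_tokenizer_class(d))  # True
        print(strip_tokenizer_class(d))  # False: already stripped
```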
Benchmarks (NVIDIA L40S)

| Model | Method | vLLM tok/s (% of bf16) | Quality, 5-task avg (% of bf16) |
|-------|--------|-----------|---------------------|
| SmolLM3-3B | bf16 | 39.4 | 0.709 |
| SmolLM3-3B | GLQ 3.5bpw | 37.1 (94%) | 0.685 (96.6%) |
| SmolLM3-3B | GPTQ W4 g128 | 34.6 (88%) | 0.698 (98.5%) |
| SmolLM2-360M | bf16 | — | 0.557 |
| SmolLM2-360M | GLQ 4bpw | — | 0.555 (99.6%) |
| SmolLM2-360M | GPTQ W4 g64 | — | 0.486 (87.2%) |
