Glq

v0.2.6 Feature

This release adds two notable features: an inline PTX Tensor Core kernel for prefill and cleaner HuggingFace loading.

Published 4 months ago | Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Inline PTX Tensor Core kernel yields up to 5× faster model-level prefill throughput.

Full changelog

Inline PTX Tensor Core Kernel

Rewrote the batched (B>=2) prefill kernel using inline PTX mma.sync.aligned.m16n8k16, with the register-to-element mapping taken from the PTX ISA spec. Codebook entries are loaded directly into registers, with no shared-memory staging.
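The register-to-element mapping mentioned above can be illustrated outside of CUDA. The following is a minimal Python sketch of the per-thread fragment layout for m16n8k16, assuming 16-bit A/B operands and 32-bit accumulators, as I recall it from the PTX ISA documentation. It is not taken from the Glq source, the function names are illustrative, and the layout should be checked against the spec before relying on it.

```python
# Sketch of the per-thread fragment layout for mma.sync.aligned.m16n8k16,
# assuming 16-bit A/B operands and 32-bit accumulators (assumption, recalled
# from the PTX ISA; function names are illustrative and not from Glq).

def a_coord(lane, i):
    """(row, col) in the 16x16 A tile for element i (0..7) held by `lane`."""
    group, tid = lane >> 2, lane & 3
    row = group if (i < 2 or 4 <= i < 6) else group + 8
    col = tid * 2 + (i & 1) + (8 if i >= 4 else 0)
    return row, col

def b_coord(lane, i):
    """(row, col) in the 16x8 B tile for element i (0..3) held by `lane`."""
    group, tid = lane >> 2, lane & 3
    row = tid * 2 + (i & 1) + (8 if i >= 2 else 0)
    return row, group

def c_coord(lane, i):
    """(row, col) in the 16x8 accumulator tile for element i (0..3)."""
    group, tid = lane >> 2, lane & 3
    row = group + (8 if i >= 2 else 0)
    col = tid * 2 + (i & 1)
    return row, col

def covers_tile_once(coord_fn, elems_per_lane, rows, cols):
    """True if the 32 lanes together touch every tile element exactly once."""
    coords = [coord_fn(lane, i)
              for lane in range(32)
              for i in range(elems_per_lane)]
    full_tile = {(r, c) for r in range(rows) for c in range(cols)}
    return len(coords) == len(set(coords)) and set(coords) == full_tile

if __name__ == "__main__":
    print("A covered once:", covers_tile_once(a_coord, 8, 16, 16))
    print("B covered once:", covers_tile_once(b_coord, 4, 16, 8))
    print("C covered once:", covers_tile_once(c_coord, 4, 16, 8))
```

A coverage check like this is a cheap way to catch an incorrect mapping before writing the kernel: every element of each tile should be owned by exactly one (lane, register slot) pair.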

| Batch size | CUDA C (inline PTX) | Triton Tensor Core | Speedup |
|---|-------------|-----------|---------|
| B=8 | 30μs | ~100μs | 3.3× |
| B=16 | 37μs | ~120μs | 3.2× |

Model-level prefill: 292 tok/s at B=16, up from 59 tok/s with Triton (about 5× faster).
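The speedup figures follow directly from the reported numbers. A small Python sketch of the arithmetic is given here as a sanity check; the Triton latencies are approximate in the source, so the ratios are approximate as well, and the variable names are illustrative.

```python
# Recompute the reported speedups from the figures in the release notes.
# Triton latencies are approximate ("~100", "~120") in the source.

def speedup(baseline, optimized):
    """Ratio of baseline to optimized; works for latencies (lower is better)."""
    return baseline / optimized

# batch size -> (inline PTX latency in microseconds, Triton latency in microseconds)
kernel_latency_us = {8: (30, 100), 16: (37, 120)}

for batch, (ptx_us, triton_us) in kernel_latency_us.items():
    print(f"B={batch}: {speedup(triton_us, ptx_us):.1f}x kernel speedup")

# Throughput is higher-is-better, so the ratio is new / old.
prefill_ratio = 292 / 59
print(f"Model-level prefill: {prefill_ratio:.2f}x (rounds to {round(prefill_ratio)}x)")
```

The model-level gain (about 5×) is larger than the isolated kernel gain (about 3.2× to 3.3×); the release notes do not say what accounts for the difference.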

Key lesson: wmma with shared memory staging was 5.6× slower than inline PTX with direct register loading.

Clean HuggingFace Loading

- Suppress MISSING warnings for 2bpw layers via _keys_to_ignore_on_load_missing
- First quantized model published: xv0y5ncu/SmolLM3-3B-GLQ-3.5bpw
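In HuggingFace Transformers, _keys_to_ignore_on_load_missing is a class attribute holding a list of regular-expression patterns; missing checkpoint keys that match any pattern are left out of the load-time warning. The Python sketch below imitates that filtering step without importing Transformers. The key names and the pattern are hypothetical, since the release notes do not show which keys Glq ignores.

```python
import re

# Hypothetical pattern: the actual patterns Glq registers are not shown
# in the release notes.
KEYS_TO_IGNORE_ON_LOAD_MISSING = [r"\.stage2_codes$"]

def filter_missing_keys(missing_keys, ignore_patterns):
    """Drop keys matching any ignore pattern, keeping the rest for the warning."""
    return [
        key for key in missing_keys
        if not any(re.search(pattern, key) for pattern in ignore_patterns)
    ]

if __name__ == "__main__":
    # Hypothetical missing keys reported while loading a checkpoint.
    missing = [
        "model.layers.0.mlp.down_proj.stage2_codes",
        "model.layers.1.mlp.down_proj.stage2_codes",
        "lm_head.weight",
    ]
    print(filter_missing_keys(missing, KEYS_TO_IGNORE_ON_LOAD_MISSING))
```

Only keys that match no pattern would still be reported, so genuinely missing weights remain visible while expected gaps are silenced.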

Performance (SmolLM3-3B 3.5bpw, L40S)

| Metric | Speed |
|--------|-------|
| B=1 decode | 17 tok/s |
| B=16 prefill | 292 tok/s |
| B=64 prefill | 882 tok/s |
| Generate 128 | 17.3 tok/s |

Perplexity unchanged (7.20). 217 tests pass.
