Glq

v0.2.13 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 3mo Model Serving & MLOps

✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Performance optimizations increase token throughput by 19 % on Cascade‑2‑30B GLQ decoding.

Full changelog

Two CUDA-side optimisations on top of v0.2.12's fused MoE kernels. Cumulative tok/s gain: 13.24 → 15.73 (+19 %) on Cascade-2-30B GLQ 4.5bpw long-prompt cached decode (RTX PRO 6000 Blackwell, transformers 5.6.2 native nemotron_h).

v3a: hoist per-expert scalar sync points (`0cc838f`)

Each .item<float>() call inside the per-(token, expert) host loop in glq_fused_moe_cuda / glq_fused_moe_block_diag_cuda was a full GPU→CPU sync (~7 µs each). With 6 reads × ~8 active experts × 23 MoE layers per decoded token, this dominated the host budget — nsys showed 19,200 cudaStreamSynchronize calls / 150 ms / 7.6 % of decode wall time on a 1.97 s capture.
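The figures in that paragraph can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the per-token estimate assumes every read costs the quoted ~7 µs.

```python
# Back-of-envelope sync budget using only the figures quoted above.
READS_PER_EXPERT = 6     # scalar reads per (token, expert)
ACTIVE_EXPERTS = 8       # approximate active experts per token
MOE_LAYERS = 23          # MoE layers per decoded token
SYNC_COST_US = 7.0       # approximate cost of one GPU->CPU sync

syncs_per_token = READS_PER_EXPERT * ACTIVE_EXPERTS * MOE_LAYERS
sync_ms_per_token = syncs_per_token * SYNC_COST_US / 1000.0

# Cross-check against the nsys capture: 19,200 syncs, 150 ms, 1.97 s wall.
capture_syncs = 19_200
capture_sync_ms = 150.0
capture_wall_ms = 1970.0
avg_sync_us = capture_sync_ms * 1000.0 / capture_syncs
share_of_wall_pct = 100.0 * capture_sync_ms / capture_wall_ms

print(syncs_per_token, round(sync_ms_per_token, 2))
print(round(avg_sync_us, 1), round(share_of_wall_pct, 1))
```

Under these assumptions a decoded token pays for roughly 1,100 syncs, and the capture's average sync cost lands close to the quoted ~7 µs.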

Pre-fetch Wscale, inv_resid_scale, inv_resid_scale2 (and topk_ids / topk_weights) to CPU once at the top of each entry function, then index via raw data_ptr<float>() pointers inside the loop. The cumulative effect was bigger than the standalone sync time would suggest — removing the syncs lets host-side dispatch run ahead of GPU work, amortising both launch and sync costs simultaneously.
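The hoisting pattern can be illustrated with a toy model in plain Python. This is not the glq code: `FakeDeviceTensor`, `scales_per_read` and `scales_hoisted` are hypothetical names, and a "sync" here is just a counter standing in for a real GPU→CPU round trip.

```python
class FakeDeviceTensor:
    """Stand-in for a GPU tensor; counts simulated GPU->CPU syncs."""
    syncs = 0

    def __init__(self, values):
        self._values = list(values)

    def item(self, index):
        # Each scalar read forces one sync, like a per-element read in the loop.
        FakeDeviceTensor.syncs += 1
        return self._values[index]

    def to_host(self):
        # One bulk copy costs one sync and returns plain host memory.
        FakeDeviceTensor.syncs += 1
        return list(self._values)


def scales_per_read(wscale, inv_rs, inv_rs2, expert_ids):
    # Before: three scalar reads (three syncs) for every active expert.
    return [(wscale.item(e), inv_rs.item(e), inv_rs2.item(e)) for e in expert_ids]


def scales_hoisted(wscale, inv_rs, inv_rs2, expert_ids):
    # After: copy each tensor to the host once, then index host memory.
    w, r1, r2 = wscale.to_host(), inv_rs.to_host(), inv_rs2.to_host()
    return [(w[e], r1[e], r2[e]) for e in expert_ids]


def count_syncs(fn, *args):
    FakeDeviceTensor.syncs = 0
    result = fn(*args)
    return result, FakeDeviceTensor.syncs


wscale = FakeDeviceTensor([0.5, 1.0, 2.0, 4.0])
inv_rs = FakeDeviceTensor([2.0, 1.0, 0.5, 0.25])
inv_rs2 = FakeDeviceTensor([4.0, 1.0, 0.25, 0.0625])
expert_ids = [3, 0, 2]

slow, slow_syncs = count_syncs(scales_per_read, wscale, inv_rs, inv_rs2, expert_ids)
fast, fast_syncs = count_syncs(scales_hoisted, wscale, inv_rs, inv_rs2, expert_ids)
print(slow == fast, slow_syncs, fast_syncs)
```

In the toy model the sync count of the hoisted version no longer grows with the number of active experts, which is the property the change relies on.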

v3b: drop dead memsets, fuse `output_rht_w13` + activation (`3f6d56c`)

The deterministic split-K matvec path (glq_matvec_splitk_scratch_kernel → glq_reduce_splits_kernel) writes its result via plain store, not atomicAdd, so the y_rht_*.zero_() calls before each matvec were wasted work. New glq_output_rht_act_multiblock_kernel + launch_output_rht_act_block_diag fuse the post-matvec output RHT with the per-element activation (relu² for non-gated NemotronH); the activation is applied in the kernel's final-store loop before the global write. Saves 3 launches per (token, expert) total.
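Why the zeroing was dead work can be seen in a small host-side sketch of a two-stage split-K matvec. This is a plain-Python illustration under the assumption that the reduce stage overwrites its output, as described above; `matvec_splitk` is a hypothetical name, not the kernel itself.

```python
def matvec_splitk(W, x, y, splits):
    """Split-K matvec: partial sums go to scratch, then a reduce pass stores y."""
    rows, cols = len(W), len(x)
    chunk = (cols + splits - 1) // splits
    # Stage 1: each split writes its own scratch slot (nothing accumulates into y).
    scratch = [[0.0] * rows for _ in range(splits)]
    for s in range(splits):
        lo, hi = s * chunk, min((s + 1) * chunk, cols)
        for r in range(rows):
            scratch[s][r] = sum(W[r][k] * x[k] for k in range(lo, hi))
    # Stage 2: fixed-order reduction, written with a plain store.
    for r in range(rows):
        acc = 0.0
        for s in range(splits):
            acc += scratch[s][r]
        y[r] = acc  # overwrites whatever y held before
    return y


W = [[1, 2, 3, 4], [0, 1, 0, 1], [2, 0, 2, 0]]
x = [1, 1, 2, 2]

y_zeroed = matvec_splitk(W, x, [0.0, 0.0, 0.0], splits=2)
y_dirty = matvec_splitk(W, x, [99.0, -7.0, 123.0], splits=2)
print(y_zeroed, y_dirty)
```

Because the final write is a store rather than an accumulation, the prior contents of `y` never reach the result. An `atomicAdd`-style path would add onto the old values and would need the zeroing.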

Output bit-identical to v3a / v0.2.12. Headline tok/s unchanged from v3a — the cuts are correctness/cleanliness wins; the next limiter is GPU-side per-expert matvec compute.

Where the time goes now (RTX PRO 6000, post v3b nsys)

- GLQ kernels: ~62 % of GPU time
  - matvec_splitk_scratch<2/3>: ~42 %
  - output_rht / input_rht: ~20 %
- Mamba SSM (selective_state_update + causal_conv1d): <1 %
- Host overhead: ~27 % (down from 32 % pre-v3a)

Remaining gap to bf16-native (~34 tok/s)

The remaining gap is GPU compute time on the per-expert matvec kernels, not host overhead. Each matvec already launches ~7 400 thread blocks across ~120 Blackwell SMs, so multi-stream concurrency was profiled and skipped: it won't help when individual kernels already saturate the GPU. Closing the gap further requires either a single-launch parallel-experts kernel (multi-day CUDA work) or accepting GLQ at ~46 % of bf16 on this MoE topology.
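As a rough check on those two figures, the sketch below computes the oversubscription ratio and the fraction of bf16 throughput from the numbers quoted in this section. Whether a given blocks-per-SM ratio actually saturates the device depends on per-block resource use, which is not stated here, so treat the first number as an indication only.

```python
# Figures quoted in the release notes; names are illustrative.
thread_blocks_per_matvec = 7400
blackwell_sms = 120
glq_tok_s = 15.73
bf16_tok_s = 34.0   # approximate bf16-native throughput

blocks_per_sm = thread_blocks_per_matvec / blackwell_sms
fraction_of_bf16_pct = 100.0 * glq_tok_s / bf16_tok_s

print(round(blocks_per_sm, 1), round(fraction_of_bf16_pct))
```

With dozens of blocks queued per SM for a single kernel, running further kernels on other streams has little idle hardware to fill, which is consistent with the decision to skip multi-stream concurrency.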

Cumulative since v0.2.11

| Path | tok/s | Notes |
|---|---|---|
| v0.2.11 (trust-remote-code + auto-patch) | 1.46 | only path that worked |
| v0.2.12 (native + fused MoE block-diag + stage-3) | 13.24 | 9.1× over v0.2.11 |
| v0.2.13 (+ host-overhead optimisations) | 15.73 | 10.8× over v0.2.11 |
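The multipliers in the table, and the headline +19 %, follow directly from the tok/s column. A minimal sketch of that arithmetic, using only the table values:

```python
# tok/s values from the table above.
tok_s = {"v0.2.11": 1.46, "v0.2.12": 13.24, "v0.2.13": 15.73}

baseline = tok_s["v0.2.11"]
speedup_v12 = tok_s["v0.2.12"] / baseline
speedup_v13 = tok_s["v0.2.13"] / baseline

# Gain of this release alone, relative to v0.2.12.
release_gain_pct = 100.0 * (tok_s["v0.2.13"] / tok_s["v0.2.12"] - 1.0)

print(round(speedup_v12, 1), round(speedup_v13, 1), round(release_gain_pct, 1))
```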

Compatibility unchanged: glq>=0.2.13, transformers>=4.45,<5 for trust-remote-code path or transformers>=5.2 for the native nemotron_h path. mamba-ssm and causal-conv1d still required for NemotronH variants.
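The version constraints above can be expressed as a small check. This is a sketch only: `parse_version` and `transformers_path` are hypothetical helpers, they handle plain `X.Y.Z` strings and not pre-release suffixes, and a real project would more likely rely on a dedicated version-parsing library.

```python
def parse_version(text):
    """Parse a plain 'X.Y.Z' string into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def transformers_path(version):
    """Return which loading path the stated constraints allow, or None."""
    v = parse_version(version)
    if (4, 45) <= v < (5,):
        return "trust-remote-code"   # transformers>=4.45,<5
    if v >= (5, 2):
        return "native nemotron_h"   # transformers>=5.2
    return None                      # e.g. 5.0.x and 5.1.x match neither range


def glq_is_supported(version):
    return parse_version(version) >= (0, 2, 13)   # glq>=0.2.13


for candidate in ("4.44.2", "4.45.0", "5.0.0", "5.6.2"):
    print(candidate, transformers_path(candidate))
```

Note that, read literally, the two ranges leave transformers 5.0 and 5.1 uncovered; the sketch returns `None` for those rather than guessing.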
