This release adds 2 notable features for engineering teams evaluating rollout.
✓ No known CVEs patched in this version
Summary
AI summary: Performance optimizations increase token throughput by 19 % on Cascade-2-30B GLQ decoding.
Full changelog
Two CUDA-side optimisations on top of v0.2.12's fused MoE kernels. Cumulative tok/s gain: 13.24 → 15.73 (+19 %) on Cascade-2-30B GLQ 4.5bpw long-prompt cached decode (RTX PRO 6000 Blackwell, transformers 5.6.2 native nemotron_h).
v3a: hoist per-expert scalar sync points (0cc838f)
Each .item<float>() call inside the per-(token, expert) host loop in glq_fused_moe_cuda / glq_fused_moe_block_diag_cuda was a full GPU→CPU sync (~7 µs each). With 6 reads × ~8 active experts × 23 MoE layers per decoded token, this dominated the host budget — nsys showed 19,200 cudaStreamSynchronize calls / 150 ms / 7.6 % of decode wall time on a 1.97 s capture.
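The figures in the paragraph above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the quoted numbers; the variable names are illustrative, and the "~8 experts" and "~7 µs" inputs are the approximate values from the text.

```python
# Figures quoted in the changelog paragraph above.
reads_per_expert = 6        # .item<float>() reads per (token, expert)
active_experts = 8          # approximate ("~8 active experts")
moe_layers = 23
sync_cost_us = 7.0          # approximate ("~7 µs each")

syncs_per_token = reads_per_expert * active_experts * moe_layers
host_stall_ms_per_token = syncs_per_token * sync_cost_us / 1000.0

# nsys capture: 19,200 cudaStreamSynchronize calls, 150 ms, 1.97 s wall time.
nsys_syncs, nsys_sync_ms, capture_s = 19_200, 150.0, 1.97
measured_us_per_sync = nsys_sync_ms * 1000.0 / nsys_syncs
share_of_capture = nsys_sync_ms / (capture_s * 1000.0)

print(syncs_per_token)                      # 1104
print(round(host_stall_ms_per_token, 2))    # 7.73
print(round(measured_us_per_sync, 2))       # 7.81
print(round(share_of_capture * 100, 1))     # 7.6
```

Roughly 1,100 blocking reads per decoded token at about 7 to 8 µs each is several milliseconds of pure host stall, which is consistent with the 7.6 % share reported by nsys.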
Pre-fetch Wscale, inv_resid_scale, inv_resid_scale2 (and topk_ids / topk_weights) to CPU once at the top of each entry function, then index via raw data_ptr<float>() pointers inside the loop. The cumulative effect was bigger than the standalone sync time would suggest — removing the syncs lets host-side dispatch run ahead of GPU work, amortising both launch and sync costs simultaneously.
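The hoisting pattern can be illustrated with a small pure-Python mock. This is a sketch of the idea, not GLQ's code: `FakeDeviceTensor`, `scales_per_item` and `scales_hoisted` are hypothetical names, and the sync counter merely stands in for the blocking behaviour of a scalar read on a CUDA tensor. The real change lives in the C++ entry functions and indexes host copies via `data_ptr<float>()`.

```python
class FakeDeviceTensor:
    """Hypothetical stand-in for a GPU tensor that counts device-to-host syncs."""
    sync_count = 0  # shared counter across all instances

    def __init__(self, values):
        self._values = list(values)

    def item(self, index):
        # Models one blocking scalar read (like .item<float>() on a CUDA tensor).
        FakeDeviceTensor.sync_count += 1
        return self._values[index]

    def to_host(self):
        # Models one bulk copy to CPU: a single sync for the whole tensor.
        FakeDeviceTensor.sync_count += 1
        return list(self._values)


def scales_per_item(wscale, expert_ids):
    # Before: one sync per expert inside the host loop.
    return [wscale.item(e) for e in expert_ids]


def scales_hoisted(wscale, expert_ids):
    # After: one sync up front, then plain host-memory indexing in the loop.
    host = wscale.to_host()
    return [host[e] for e in expert_ids]


wscale = FakeDeviceTensor([0.5, 0.25, 2.0, 1.0])
expert_ids = [2, 0, 3, 1, 2, 0, 3, 1]

FakeDeviceTensor.sync_count = 0
before = scales_per_item(wscale, expert_ids)
syncs_before = FakeDeviceTensor.sync_count

FakeDeviceTensor.sync_count = 0
after = scales_hoisted(wscale, expert_ids)
syncs_after = FakeDeviceTensor.sync_count

print(syncs_before, syncs_after, before == after)  # 8 1 True
```

The values read are identical in both versions; only the number of blocking round-trips changes, from one per loop iteration to one per entry-function call.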
v3b: drop dead memsets, fuse output_rht_w13 + activation (3f6d56c)
The deterministic split-K matvec path (glq_matvec_splitk_scratch_kernel → glq_reduce_splits_kernel) writes its result via plain store, not atomicAdd, so the y_rht_*.zero_() calls before each matvec were wasted work. New glq_output_rht_act_multiblock_kernel + launch_output_rht_act_block_diag fuse the post-matvec output RHT with the per-element activation (relu² for non-gated NemotronH); the activation is applied in the kernel's final-store loop before the global write. Saves 3 launches per (token, expert) total.
Output bit-identical to v3a / v0.2.12. Headline tok/s unchanged from v3a — the cuts are correctness/cleanliness wins; the next limiter is GPU-side per-expert matvec compute.
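Why the `zero_()` calls were dead work, and how an activation can ride along in a final-store loop, can be shown with a simplified host-side model. This is a sketch under stated assumptions: `splitk_matvec` and `relu2` are hypothetical names, the RHT step is omitted entirely, and the activation is folded into the reduce stage here purely for brevity (in v3b it is fused into the output RHT kernel instead).

```python
def relu2(x):
    # relu squared: max(x, 0) ** 2
    return max(x, 0.0) ** 2


def splitk_matvec(W, x, y, num_splits, activation=None):
    """Write W @ x into y via split-K partial sums held in a scratch buffer.

    y is overwritten with plain stores, so its previous contents are irrelevant.
    """
    rows, cols = len(W), len(x)
    chunk = (cols + num_splits - 1) // num_splits
    # Stage 1: each split stores its partial dot product into its own scratch slot.
    scratch = [[None] * rows for _ in range(num_splits)]
    for s in range(num_splits):
        lo, hi = s * chunk, min((s + 1) * chunk, cols)
        for r in range(rows):
            scratch[s][r] = sum(W[r][k] * x[k] for k in range(lo, hi))
    # Stage 2: reduce the splits in a fixed order and store (not accumulate) into y.
    for r in range(rows):
        total = 0.0
        for s in range(num_splits):
            total += scratch[s][r]
        # Fused epilogue: apply the activation just before the final store.
        y[r] = activation(total) if activation else total
    return y


W = [[1.0, -2.0, 3.0, 0.5], [0.0, 1.0, -1.0, 2.0]]
x = [2.0, 1.0, -1.0, 4.0]
y_zeroed = splitk_matvec(W, x, [0.0, 0.0], num_splits=2, activation=relu2)
y_dirty = splitk_matvec(W, x, [123.0, -456.0], num_splits=2, activation=relu2)
print(y_zeroed, y_zeroed == y_dirty)  # [0.0, 100.0] True
```

Because the reduce stage stores rather than accumulates, a pre-zeroed and a garbage-filled output buffer give the same result, and the fixed reduction order keeps the result deterministic. An `atomicAdd`-style accumulation would not have this property and would need the zeroing.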
Where the time goes now (RTX PRO 6000, post v3b nsys)
- GLQ kernels: ~62 % of GPU time
  - matvec_splitk_scratch<2/3>: ~42 %
  - output_rht/input_rht: ~20 %
- Mamba SSM (selective_state_update + causal_conv1d): <1 %
- Host overhead: ~27 % (down from 32 % pre-v3a)
Remaining gap to bf16-native (~34 tok/s)
The remaining gap is GPU compute time on the per-expert matvec kernels, not host overhead. Each matvec already launches ~7 400 thread blocks across ~120 Blackwell SMs, which is enough to saturate the GPU on its own, so multi-stream concurrency was profiled and skipped: overlapping kernels cannot help when each one already fills the device. Closing the gap further requires either a single-launch parallel-experts kernel (multi-day CUDA work) or accepting GLQ at ~46 % of bf16 on this MoE topology.
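As a rough check on the saturation argument and the ~46 % figure, using only the approximate numbers quoted above (variable names are illustrative):

```python
thread_blocks = 7400   # "~7 400 thread blocks" per matvec launch
sms = 120              # "~120 Blackwell SMs"
glq_tok_s = 15.73
bf16_tok_s = 34.0      # "~34 tok/s" bf16-native reference

blocks_per_sm = thread_blocks / sms
fraction_of_bf16 = glq_tok_s / bf16_tok_s

print(round(blocks_per_sm, 1))          # 61.7
print(round(fraction_of_bf16 * 100))    # 46
```

With around 60 blocks queued per SM for a single launch, there is little idle capacity left for a second stream to use.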
Cumulative since v0.2.11
| Path | tok/s | Notes |
|---|---|---|
| v0.2.11 (trust-remote-code + auto-patch) | 1.46 | only path that worked |
| v0.2.12 (native + fused MoE block-diag + stage-3) | 13.24 | +9.1× |
| v0.2.13 (+ host-overhead optimisations) | 15.73 | +10.8× over v0.2.11 |
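
The multipliers in the Notes column are throughput ratios relative to v0.2.11. A quick recomputation from the tok/s column (a sanity check only, with an illustrative dictionary) gives the same values, along with the +19 % step from v0.2.12 to v0.2.13:

```python
tok_s = {"v0.2.11": 1.46, "v0.2.12": 13.24, "v0.2.13": 15.73}
base = tok_s["v0.2.11"]

# Throughput of each release as a multiple of v0.2.11.
ratios = {version: round(rate / base, 1) for version, rate in tok_s.items()}
print(ratios)  # {'v0.2.11': 1.0, 'v0.2.12': 9.1, 'v0.2.13': 10.8}

# Relative gain of this release over the previous one.
step_gain = tok_s["v0.2.13"] / tok_s["v0.2.12"] - 1.0
print(round(step_gain * 100))  # 19
```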
Compatibility unchanged: glq>=0.2.13, with transformers>=4.45,<5 for the trust-remote-code path or transformers>=5.2 for the native nemotron_h path. mamba-ssm and causal-conv1d are still required for NemotronH variants.