Highlights since v0.2.11:
Native HF nemotron_h integration
GLQ checkpoints (e.g. `xv0y5ncu/Nemotron-Cascade-2-30B-A3B-GLQ-4.5bpw`) now load through transformers' native `nemotron_h` integration without `trust_remote_code=True`.
- `E8RHTFusedExperts` mirrors the native `NemotronHExperts` interface but is backed by per-expert `E8RHTLinear` modules (commits `ae849ad`, `41554a8`).
- State-dict prefix renamer translates legacy `backbone.*` keys to `model.*` via the standard `_checkpoint_conversion_mapping` hook.
- Auto-patch for NemotronH cache bugs: fixes 5 latent issues in NVIDIA's trust-remote-code modeling file so `use_cache=True` actually threads the cache. Auto-applied at `from_pretrained` time when `glq.hf_integration` is imported (commit `d12a3ee`).
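To illustrate the prefix renaming, here is a minimal sketch of a regex-based key translation in the style of a checkpoint conversion mapping. The `CONVERSION_MAPPING` dict, the `rename_keys` helper, and the example keys are hypothetical; the mapping GLQ actually registers may differ.

```python
import re

# Hypothetical mapping (regex pattern -> replacement), written in the style of
# a checkpoint conversion mapping. Not copied from the GLQ source.
CONVERSION_MAPPING = {r"^backbone\.": "model."}


def rename_keys(state_dict, mapping=CONVERSION_MAPPING):
    """Return a new dict whose keys have legacy prefixes rewritten."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in mapping.items():
            new_key = re.sub(pattern, replacement, new_key)
        if new_key in renamed:
            # Two source keys mapping to the same target would silently drop weights.
            raise KeyError(f"key collision after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


# Illustrative key names only; values stand in for tensors.
legacy = {
    "backbone.layers.0.mixer.weight": 1,
    "backbone.embeddings.weight": 2,
    "lm_head.weight": 3,
}
converted = rename_keys(legacy)
print(sorted(converted))
```

Keys that do not start with `backbone.` pass through unchanged, so a checkpoint already saved with `model.*` keys is unaffected by this kind of mapping.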
Fused MoE kernels
- `glq_fused_moe_cuda` extended with stage-3 RVQ support (`Qidxs3` / `inv_resid_scale2` / `codebook3`); backward-compatible defaults for pre-existing 2-stage callers.
- `glq_fused_moe_block_diag_cuda`: new entry point for non-power-of-2 expert dims (e.g. Cascade-2's 1856/2688). Reuses the existing block-diagonal multiblock FHT kernels; adds two `static` helpers (`launch_input_rht_block_diag`, `launch_output_rht_block_diag`).
- `E8RHTFusedExperts.forward` lazily stacks per-expert buffers and dispatches the kernel for B≤4-token decode; per-expert Python loop fallback for prefill or unsupported cases.
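For readers unfamiliar with block-diagonal Hadamard transforms, the following pure-Python sketch shows one way to apply a fast Walsh-Hadamard transform to a non-power-of-2 dimension. It assumes the dimension is split by its binary digits (1856 = 1024 + 512 + 256 + 64, 2688 = 2048 + 512 + 128); the release notes do not state how the CUDA kernels choose their blocks, so this split is an assumption, and any random sign flips applied before the transform are left out. All function names are hypothetical.

```python
import math


def power_of_two_blocks(dim):
    """Split a positive dim into descending power-of-2 sizes (its binary digits)."""
    blocks = []
    bit = 1 << (dim.bit_length() - 1)
    while bit:
        if dim & bit:
            blocks.append(bit)
        bit >>= 1
    return blocks


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]


def block_diag_fwht(x):
    """Apply an independent orthonormal FWHT to each power-of-2 block of x."""
    out, start = [], 0
    for size in power_of_two_blocks(len(x)):
        out.extend(fwht(x[start:start + size]))
        start += size
    return out


x = [float((i * 37) % 11 - 5) for i in range(1856)]
y = block_diag_fwht(x)
x_back = block_diag_fwht(y)  # each orthonormal block is its own inverse
print(power_of_two_blocks(1856), power_of_two_blocks(2688))
```

Because each block is orthonormal and symmetric, applying the transform twice recovers the input and the vector norm is preserved, which is the property that lets the same structure serve as both the input-side and output-side rotation.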
Measured (RTX PRO 6000 Blackwell, Cascade-2-30B-A3B GLQ 4.5bpw)
Long prompt, use_cache=True, forced 20 new tokens:
| Path | tok/s |
|---|---|
| v0.2.11 (trust-remote-code + auto-patch) | 1.46 |
| v0.2.12 (native + fused MoE) | 13.24 |
That's a ~9.1× headline speedup on cached decode. Output bit-similar across both paths.
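The headline ratio follows directly from the table. As a quick check, and assuming the tok/s figures can be read as average generation rates over the 20 forced tokens (the notes do not say whether prefill time is included), the 20-token run drops from roughly 13.7 s to roughly 1.5 s:

```python
baseline_tok_s = 1.46   # v0.2.11 (trust-remote-code + auto-patch)
fused_tok_s = 13.24     # v0.2.12 (native + fused MoE)
new_tokens = 20         # forced generation length from the benchmark

speedup = fused_tok_s / baseline_tok_s
baseline_seconds = new_tokens / baseline_tok_s
fused_seconds = new_tokens / fused_tok_s

print(f"speedup: {speedup:.2f}x")
print(f"time for {new_tokens} tokens: {baseline_seconds:.1f} s -> {fused_seconds:.1f} s")
```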
Infra
- Pinned `torch==2.11.0+cu128` in `setup.sh.tftpl` so `fast_hadamard_transform` builds against the cu12.9 host toolkit (`6ba48b9`).
Out of scope (deferred to future work)
- True GPU-parallel multi-expert dispatch (current kernel still iterates `(num_tokens × top_k)` on host); needed to close the remaining gap to bf16-native.
- Tensor-parallel sharding for fused experts.
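To make the host-side iteration concrete, here is a toy pure-Python sketch of a serial loop over `(num_tokens × top_k)` pairs. It is not the GLQ kernel: the router normalization, the `moe_host_loop` and `top_k_indices` names, and the scalar "experts" are all assumptions chosen for illustration, with each expert call standing in for one unit of serialized work.

```python
def top_k_indices(scores, k):
    """Indices of the k largest router scores, highest first."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def moe_host_loop(tokens, router_scores, experts, top_k):
    """Mix expert outputs per token, visiting (token, slot) pairs one at a time."""
    outputs = []
    calls = 0
    for t, x in enumerate(tokens):
        chosen = top_k_indices(router_scores[t], top_k)
        total = sum(router_scores[t][e] for e in chosen)
        acc = [0.0] * len(x)
        for e in chosen:
            weight = router_scores[t][e] / total  # renormalize over chosen experts
            y = experts[e](x)                     # one serialized expert call
            calls += 1
            acc = [a + weight * v for a, v in zip(acc, y)]
        outputs.append(acc)
    return outputs, calls


# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
tokens = [[1.0, 1.0], [2.0, 0.0]]
scores = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]

outputs, calls = moe_host_loop(tokens, scores, experts, top_k=2)
print(calls)
```

The number of serialized expert calls grows as `num_tokens × top_k`, which is the cost a GPU-parallel dispatch would aim to remove.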