Glq

v0.2.12 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 1mo Model Serving & MLOps
✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

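Adds native HF nemotron_h loading for GLQ checkpoints, extends the fused MoE kernels (stage-3 RVQ, non-power-of-2 expert dims), and pins the torch build; GPU-parallel multi-expert dispatch and tensor-parallel sharding are deferred.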

Full changelog

Highlights since v0.2.11:

Native HF nemotron_h integration

GLQ checkpoints (e.g. xv0y5ncu/Nemotron-Cascade-2-30B-A3B-GLQ-4.5bpw) now load through transformers' native nemotron_h integration without trust_remote_code=True.
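
As a rough illustration of what this looks like from the caller's side, here is a minimal sketch. It assumes that importing glq.hf_integration is what registers the GLQ hooks with transformers (the notes only say the cache auto-patch is applied when that module is imported), and the wrapper function name is hypothetical. Only standard transformers entry points are used.

```python
def load_glq_checkpoint(
    model_id: str = "xv0y5ncu/Nemotron-Cascade-2-30B-A3B-GLQ-4.5bpw",
):
    """Load a GLQ checkpoint through transformers' native nemotron_h path.

    Hypothetical helper for illustration; not part of GLQ itself.
    """
    # Imports live inside the function so this module can be imported on
    # machines that do not have glq or transformers installed.
    import glq.hf_integration  # noqa: F401  (assumption: import registers GLQ hooks)
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # Note the absence of trust_remote_code=True: the model class comes from
    # transformers itself rather than from a modeling file in the repo.
    model = AutoModelForCausalLM.from_pretrained(model_id)
    return tokenizer, model
```

Calling load_glq_checkpoint() downloads the checkpoint, so it is best run on the target GPU host rather than in CI.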

  • E8RHTFusedExperts mirrors the native NemotronHExperts interface but is backed by per-expert E8RHTLinear modules (commits ae849ad, 41554a8).
  • State-dict prefix renamer translates legacy backbone.* keys to model.* via the standard _checkpoint_conversion_mapping hook.
  • Auto-patch for NemotronH cache bugs — fixes 5 latent issues in NVIDIA's trust-remote-code modeling file so use_cache=True actually threads the cache. Auto-applied at from_pretrained time when glq.hf_integration is imported (commit d12a3ee).
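
The prefix renaming in the second bullet can be sketched in a few lines. This is an illustration only: the mapping table and function name below are hypothetical, and the exact format transformers expects for _checkpoint_conversion_mapping may differ between versions.

```python
import re
from typing import Dict, Mapping

# Hypothetical table in the style of a regex-pattern -> replacement mapping.
LEGACY_PREFIX_MAPPING = {r"^backbone\.": "model."}


def rename_legacy_keys(
    state_dict: Mapping[str, object],
    mapping: Mapping[str, str] = LEGACY_PREFIX_MAPPING,
) -> Dict[str, object]:
    """Return a copy of state_dict with legacy key prefixes rewritten."""
    renamed: Dict[str, object] = {}
    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in mapping.items():
            # Anchored patterns only touch the leading prefix of the key.
            new_key = re.sub(pattern, replacement, new_key)
        renamed[new_key] = value
    return renamed


legacy = {"backbone.layers.0.mixer.weight": "w0", "lm_head.weight": "w1"}
print(rename_legacy_keys(legacy))
```

Keys that do not start with backbone. pass through unchanged, so the same function is safe to run on checkpoints that already use the model.* layout.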

Fused MoE kernels

  • glq_fused_moe_cuda extended with stage-3 RVQ support (Qidxs3/inv_resid_scale2/codebook3); backward-compatible defaults for pre-existing 2-stage callers.
  • glq_fused_moe_block_diag_cuda — new entry point for non-power-of-2 expert dims (e.g. Cascade-2's 1856/2688). Reuses the existing block-diagonal multiblock FHT kernels; adds two static helpers (launch_input_rht_block_diag, launch_output_rht_block_diag).
  • E8RHTFusedExperts.forward lazily stacks per-expert buffers and dispatches the kernel for B≤4-token decode; per-expert Python loop fallback for prefill or unsupported cases.
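
The dispatch rule in the last bullet (lazy stacking, fused path for B≤4, per-expert loop otherwise) can be sketched with a toy model. Everything here is a stand-in: the class is hypothetical, each "expert" is a single scalar multiplier rather than an E8RHTLinear, and the "fused" branch only marks where a real kernel call would go.

```python
from typing import List, Optional, Sequence, Tuple

MAX_FUSED_TOKENS = 4  # from the notes: fused kernel is used for B <= 4 decode


class LazyFusedExpertsSketch:
    """Toy stand-in for the dispatch logic; not GLQ's actual class."""

    def __init__(self, expert_scales: Sequence[float], fused_supported: bool = True):
        self.expert_scales = list(expert_scales)  # toy weights, one scalar per expert
        self.fused_supported = fused_supported
        self._stacked: Optional[Tuple[float, ...]] = None  # built on first fused call
        self.last_path: Optional[str] = None

    def _stack(self) -> Tuple[float, ...]:
        # Lazy stacking: pay the cost once, only if the fused path is ever taken.
        if self._stacked is None:
            self._stacked = tuple(self.expert_scales)
        return self._stacked

    def forward(
        self,
        tokens: Sequence[float],
        top_k_ids: Sequence[Sequence[int]],
        top_k_weights: Sequence[Sequence[float]],
    ) -> List[float]:
        if self.fused_supported and len(tokens) <= MAX_FUSED_TOKENS:
            self.last_path = "fused"  # a real implementation would launch the kernel here
            table: Sequence[float] = self._stack()
        else:
            self.last_path = "loop"  # prefill or unsupported case
            table = self.expert_scales
        # Weighted sum over each token's selected experts.
        return [
            sum(w * table[e] * x for e, w in zip(ids, ws))
            for x, ids, ws in zip(tokens, top_k_ids, top_k_weights)
        ]


moe = LazyFusedExpertsSketch([2.0, 3.0])
print(moe.forward([1.0], [[0, 1]], [[0.5, 0.5]]), moe.last_path)
```

Both branches compute the same result in this sketch, which mirrors the intent that the fused kernel is an optimization of the fallback loop rather than a different computation.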

Measured (RTX PRO 6000 Blackwell, Cascade-2-30B-A3B GLQ 4.5bpw)

Long prompt, use_cache=True, forced 20 new tokens:

| Path | tok/s |
|---|---|
| v0.2.11 (trust-remote-code + auto-patch) | 1.46 |
| v0.2.12 (native + fused MoE) | 13.24 |

That is a roughly 9.1× speedup on cached decode (13.24 / 1.46 ≈ 9.07). Output is bit-similar across both paths.
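
For readers who want to redo the arithmetic or plug in their own measurements, a small sketch follows. The function names are illustrative; the two throughput figures are the ones from the table above.

```python
def speedup(baseline_tok_s: float, new_tok_s: float) -> float:
    """Ratio of new throughput to baseline throughput."""
    return new_tok_s / baseline_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average decode latency per token, in milliseconds."""
    return 1000.0 / tok_s


baseline, fused = 1.46, 13.24  # tok/s for v0.2.11 and v0.2.12
print(f"speedup: {speedup(baseline, fused):.2f}x")
print(f"v0.2.11: {ms_per_token(baseline):.1f} ms/token")
print(f"v0.2.12: {ms_per_token(fused):.1f} ms/token")
```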

Infra

  • Pinned torch==2.11.0+cu128 in setup.sh.tftpl so fast_hadamard_transform builds against the cu12.9 host toolkit (6ba48b9).
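
The notes do not say why a cu128 wheel pairs with a 12.9 host toolkit. One plausible reading is that extension builds need the CUDA major version to match while tolerating a minor-version difference. The sketch below checks that condition under this assumption; the helper names are hypothetical.

```python
import re
from typing import Optional, Tuple


def wheel_cuda_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a local tag such as '2.11.0+cu128'."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None  # CPU-only or untagged build
    return int(match.group(1)), int(match.group(2))


def same_cuda_major(torch_version: str, toolkit_version: str) -> bool:
    """True if the wheel's CUDA major version equals the host toolkit's."""
    wheel = wheel_cuda_version(torch_version)
    if wheel is None:
        return False
    toolkit_major = int(toolkit_version.split(".")[0])
    return wheel[0] == toolkit_major


print(wheel_cuda_version("2.11.0+cu128"))
print(same_cuda_major("2.11.0+cu128", "12.9"))
```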

Out of scope (deferred to future work)

  • True GPU-parallel multi-expert dispatch (current kernel still iterates (num_tokens × top_k) on host) — needed to close the remaining gap to bf16-native.
  • Tensor-parallel sharding for fused experts.
