Glq

v0.2.7 Feature

This release adds 2 notable features for engineering teams evaluating rollout.

Published 2mo Model Serving & MLOps
✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

CUDA Graph wrapper delivers a 2.3× decode speedup and INT8 KV cache halves memory usage for long-context serving.

Full changelog

What's New

  • CUDA Graph wrapper — 2.3× decode speedup (16.4 → 38.0 tok/s on SmolLM3-3B) by eliminating Python dispatch overhead between kernel launches
  • INT8 KV cache — pure PyTorch quantized KV cache (no quanto/hqq dependency), halves cache memory for long-context serving
  • NVTX profiling annotations: input_rht, dequant_matmul, output_rht ranges in E8RHTLinear.forward() for nsys/ncu profiling
  • nsys/ncu profiling harness: benchmarks/profile_nsys.py for GPU kernel analysis
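
To put the "halves cache memory" claim in numbers, here is a rough back-of-the-envelope estimate, not the project's own accounting. The model shape, the 32K context length, and the choice of one fp16 scale per key/value vector are all assumptions made for illustration; the changelog does not state the quantization granularity or the model configuration.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value, scale_bytes_per_vector=0):
    """Bytes needed to cache keys and values for a single sequence."""
    # One key vector and one value vector per layer, per KV head, per token.
    vectors = 2 * num_layers * num_kv_heads * seq_len
    return vectors * (head_dim * bytes_per_value + scale_bytes_per_vector)


# Hypothetical shape, chosen only for illustration (not a real model config).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=32_768)

fp16 = kv_cache_bytes(**shape, bytes_per_value=2)
# Assumption: INT8 payload plus one fp16 scale (2 bytes) per cached vector.
int8 = kv_cache_bytes(**shape, bytes_per_value=1, scale_bytes_per_vector=2)

print(f"fp16 cache: {fp16 / 2**30:.2f} GiB")
print(f"int8 cache: {int8 / 2**30:.2f} GiB ({int8 / fp16:.3f}x of fp16)")
```

Under these assumptions the INT8 cache comes out slightly above half the fp16 size, because the per-vector scales add a small overhead on top of the 1-byte payload.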

Profiling Findings (SmolLM3-3B 3.5bpw, L40S)

  • 54% of GPU time in the dequant+matmul TC kernel (memory-bound, at 42% of peak memory bandwidth)
  • 35% in standard transformer ops (RMS norm, residual adds)
  • 60% of wall-clock was Python dispatch overhead — eliminated by CUDA graphs
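
As a quick sanity check, the throughput numbers and the dispatch-overhead figure can be compared with simple arithmetic. This sketch assumes the 60% wall-clock figure was measured on the same decode workload as the 16.4 and 38.0 tok/s numbers, which the changelog does not state explicitly; the function names are illustrative only.

```python
def speedup(before_tok_s, after_tok_s):
    """Throughput ratio after/before."""
    return after_tok_s / before_tok_s


def time_saved_fraction(before_tok_s, after_tok_s):
    """Fraction of per-token wall-clock time removed by the change."""
    # Per-token time is 1 / throughput, so the saved fraction is 1 - before/after.
    return 1 - before_tok_s / after_tok_s


def max_speedup(removed_fraction):
    """Amdahl-style upper bound if that fraction of wall-clock time vanished entirely."""
    return 1 / (1 - removed_fraction)


observed = speedup(16.4, 38.0)
saved = time_saved_fraction(16.4, 38.0)
bound = max_speedup(0.60)

print(f"observed speedup:      {observed:.2f}x")
print(f"wall-clock time saved: {saved:.1%}")
print(f"bound at 60% overhead: {bound:.2f}x")
```

The observed speedup of about 2.3× corresponds to roughly 57% of per-token time being removed, which sits just under the 2.5× ceiling implied by eliminating a 60% overhead completely.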
