This release adds two notable features for engineering teams evaluating rollout.
Published 2mo
Model Serving & MLOps
✓ No known CVEs patched in this version
Topics
inference
llm
model-compression
pytorch
quantization
Summary
AI summary: The CUDA Graph wrapper delivers a 2.3× decode speedup, and the INT8 KV cache halves KV cache memory for long-context serving.
Full changelog
What's New
- CUDA Graph wrapper: 2.3× decode speedup (16.4 → 38.0 tok/s on SmolLM3-3B) by eliminating Python dispatch overhead between kernel launches
- INT8 KV cache: pure PyTorch quantized KV cache (no quanto/hqq dependency), halves cache memory for long-context serving
- NVTX profiling annotations: input_rht, dequant_matmul, output_rht ranges in E8RHTLinear.forward() for nsys/ncu profiling
- nsys/ncu profiling harness: benchmarks/profile_nsys.py for GPU kernel analysis
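The changelog does not say how the INT8 KV cache quantizes its tensors. As a rough illustration of why such a cache lands near half the FP16 footprint, here is a minimal sketch that assumes symmetric absmax scaling with one FP16 scale per token vector. The function names and the cache shape are hypothetical and are not taken from the project's API.

```python
import torch


def quantize_kv_int8(x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Symmetric INT8 quantization with one scale per vector along head_dim.

    Assumed scheme for illustration only; the release may use a different one.
    """
    x32 = x.float()
    # One absmax per (batch, head, token) vector; the clamp avoids dividing by zero.
    absmax = x32.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = absmax / 127.0
    q = torch.round(x32 / scale).clamp(-127, 127).to(torch.int8)
    return q, scale.to(torch.float16)


def dequantize_kv_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an FP16 approximation of the original tensor."""
    return (q.float() * scale.float()).to(torch.float16)


def nbytes(*tensors: torch.Tensor) -> int:
    """Total storage in bytes for the given tensors."""
    return sum(t.numel() * t.element_size() for t in tensors)


torch.manual_seed(0)
# Hypothetical cache slice: (batch, kv_heads, seq_len, head_dim)
k = torch.randn(1, 4, 1024, 128).to(torch.float16)

q, scale = quantize_kv_int8(k)
k_hat = dequantize_kv_int8(q, scale)

ratio = nbytes(q, scale) / nbytes(k)
max_err = (k.float() - k_hat.float()).abs().max().item()
print(f"int8 cache / fp16 cache bytes: {ratio:.3f}")
print(f"max abs round-trip error: {max_err:.4f}")
```

Under these assumptions the INT8 payload is exactly half the FP16 size, and the per-vector scales add a small overhead that shrinks as head_dim grows, so the total stays close to one half.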
Profiling Findings (SmolLM3-3B 3.5bpw, L40S)
- 54% of GPU time is spent in the dequant+matmul TC kernel (memory-bound, 42% of peak memory bandwidth)
- 35% of GPU time is spent in standard transformer ops (RMS norm, residual adds)
- 60% of wall-clock time was Python dispatch overhead, which CUDA graphs eliminate
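The overhead figure and the measured throughput can be cross-checked with Amdahl's law. The sketch below assumes that the 60% wall-clock figure and the 16.4 → 38.0 tok/s measurement describe the same workload, which the release notes do not state explicitly; the helper name is hypothetical.

```python
def speedup_if_removed(fraction: float) -> float:
    """Amdahl's law: speedup when a fraction of wall-clock time is removed entirely."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


baseline_tok_s = 16.4      # decode throughput before CUDA graphs (from the changelog)
graphed_tok_s = 38.0       # decode throughput with CUDA graphs (from the changelog)
dispatch_fraction = 0.60   # wall-clock share of Python dispatch (from the findings)

measured = graphed_tok_s / baseline_tok_s
ceiling = speedup_if_removed(dispatch_fraction)
# Share of the original wall-clock time that the measured speedup actually removed.
implied_fraction = 1.0 - baseline_tok_s / graphed_tok_s

print(f"measured speedup: {measured:.2f}x")
print(f"ceiling if all dispatch overhead vanished: {ceiling:.2f}x")
print(f"wall-clock fraction actually removed: {implied_fraction:.1%}")
```

The measured speedup of about 2.32× sits just under the 2.5× ceiling implied by a 60% overhead, so the two reported numbers are consistent with CUDA graphs removing most of the dispatch cost, though not all of it.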