Glq

v0.3.4 Feature

This release adds 1 notable feature for engineering teams evaluating rollout.

Published 16d Model Serving & MLOps
✓ No known CVEs patched

Topics

inference llm model-compression pytorch quantization

Summary

AI summary

Explicit cudagraph_capture_sizes list yields up to +18.5 % batched‑decode throughput on E4B.

Full changelog

TL;DR: examples/inference_vllm.py now sets compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16]}. Batched-decode throughput at B=4 rises from 132.7 to 157.3 tok/s (+18.5 %) on E4B. Combined with the v0.3.2 piecewise default, that is now 4.5× over eager at B=4 (157.3 vs 35.0 tok/s).

What changed

vLLM 0.20 derives its capture set from max_num_seqs * 2, which collapses to [1, 2] for single-sequence harnesses. Decode at B ≥ 3 then falls back to PIECEWISE-only — the attention forward still replays from a captured subgraph, but the linear path pays a full Python dispatch every iteration. Setting the list explicitly keeps the FULL model-forward graph active up to B=16.
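
To make the fallback concrete, here is a minimal sketch of the size lookup this paragraph implies. It is not vLLM's code: `pick_capture_size` is a hypothetical helper, and it assumes a decode batch is served by the smallest captured size that can hold it, with no FULL graph available when none can.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def pick_capture_size(batch_size: int, capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits.

    Hypothetical helper for illustration only; not part of vLLM's API.
    """
    sizes = sorted(capture_sizes)
    idx = bisect_left(sizes, batch_size)
    return sizes[idx] if idx < len(sizes) else None


default_sizes = [1, 2]             # capture set reported for single-sequence harnesses
explicit_sizes = [1, 2, 4, 8, 16]  # list set in examples/inference_vllm.py

for b in (1, 2, 3, 4, 16, 17):
    # None means "no captured FULL graph covers this batch size".
    print(b, pick_capture_size(b, default_sizes), pick_capture_size(b, explicit_sizes))
```

Under that assumption, B=3 and B=4 find no match in the default list but map to the size-4 graph in the explicit one, which matches the behaviour described in this section.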

Measured

Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell, 256-tok decode:

| Mode | B=1 tok/s | B=4 tok/s (total) | B=4 tok/s (per seq) |
|---|---:|---:|---:|
| Eager | 14.4 | 35.0 | 8.8 |
| Piecewise + default [1, 2] | 39.4 | 132.7 | 33.2 |
| Piecewise + [1, 2, 4, 8, 16] | 40.0 | 157.3 | 39.3 |

B=1 is essentially unchanged (39.4 vs 40.0 tok/s), because the FULL graph was already captured at that size. At B=4 the per-sequence rate reaches 39.3 tok/s, close to the B=1 ceiling of 40.0, so capturing the FULL graph at higher batch sizes recovers the launch overhead that PIECEWISE-only was leaving on the table.
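
The headline figures follow directly from the measured table. The short script below recomputes them from the published totals only; the dictionary keys and variable names are illustrative.

```python
# Total tok/s taken from the measurements in this section: (B=1, B=4).
measured = {
    "eager": (14.4, 35.0),
    "piecewise_default": (39.4, 132.7),
    "piecewise_explicit": (40.0, 157.3),
}

BATCH = 4


def per_seq(total_tok_s: float, batch: int = BATCH) -> float:
    """Per-sequence throughput given the batch total."""
    return total_tok_s / batch


b1_explicit, b4_explicit = measured["piecewise_explicit"]
b4_default = measured["piecewise_default"][1]
b4_eager = measured["eager"][1]

gain_vs_default = b4_explicit / b4_default - 1       # relative gain from the explicit list
speedup_vs_eager = b4_explicit / b4_eager            # combined speedup over eager
share_of_b1 = per_seq(b4_explicit) / b1_explicit     # how close per-seq B=4 gets to B=1

print(f"gain vs default list at B=4: {gain_vs_default:+.1%}")
print(f"speedup vs eager at B=4:     {speedup_vs_eager:.1f}x")
print(f"per-seq B=4 as share of B=1: {share_of_b1:.1%}")
```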

Tradeoffs (documented in README)

  • VRAM cost: ~10-20 MB per captured shape on 3B/E4B-class models; ~100-200 MB per shape on 24-31B.
  • Capture time: ~1 s per shape, one-time at LLM init.
  • vLLM filters out capture sizes ≥ max_num_batched_tokens, so very small token budgets may silently drop the larger entries.
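
As a rough budgeting aid, the tradeoffs above can be turned into a back-of-the-envelope estimate. This is a sketch, not a vLLM API: `estimate_capture_cost` is a hypothetical helper, it assumes cost scales linearly with the number of captured shapes, and it applies the filtering rule exactly as stated in the list above.

```python
from typing import Dict, List, Tuple


def estimate_capture_cost(
    capture_sizes: List[int],
    max_num_batched_tokens: int,
    mb_per_shape: Tuple[float, float] = (10.0, 20.0),  # 3B/E4B-class range from the notes
    seconds_per_shape: float = 1.0,                    # ~1 s per shape from the notes
) -> Dict[str, object]:
    """Estimate VRAM and init-time cost of a capture list (illustrative only)."""
    unique = sorted(set(capture_sizes))
    # Per the notes, sizes >= max_num_batched_tokens are filtered out.
    kept = [s for s in unique if s < max_num_batched_tokens]
    dropped = [s for s in unique if s >= max_num_batched_tokens]
    n = len(kept)
    return {
        "kept": kept,
        "dropped": dropped,
        "vram_mb_low": n * mb_per_shape[0],
        "vram_mb_high": n * mb_per_shape[1],
        "capture_seconds": n * seconds_per_shape,
    }


# Default list from this release with a generous token budget (8192 is an example value).
roomy = estimate_capture_cost([1, 2, 4, 8, 16], max_num_batched_tokens=8192)
# A very small token budget drops the larger entries.
tight = estimate_capture_cost([1, 2, 4, 8, 16], max_num_batched_tokens=8)

print(roomy)
print(tight)
```

For 24-31B models, pass `mb_per_shape=(100.0, 200.0)` to use the larger range quoted above.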

Tuning

Users can extend or trim the list directly:

from vllm import LLM

llm = LLM(model=..., compilation_config={
    "cudagraph_capture_sizes": [1, 4, 16, 32, 64],  # high-throughput
})

See the new "Tuning vLLM CUDA-graph capture sizes" sub-section under Advanced in the README for guidance.

Install

pip install -U glq==0.3.4
