Summary
AI summary: An explicit cudagraph_capture_sizes list yields up to +18.5 % batched-decode throughput on E4B.
Full changelog
TL;DR: examples/inference_vllm.py now sets compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16]}. Batched-decode throughput at B=4 rises from 132.7 → 157.3 tok/s (+18.5 %) on E4B. Combined with the v0.3.2 piecewise default, that's now 4.5× over eager at B=4.
What changed
vLLM 0.20 derives its capture set from max_num_seqs * 2, which collapses to [1, 2] for single-sequence harnesses. Decode at B ≥ 3 then falls back to PIECEWISE-only — the attention forward still replays from a captured subgraph, but the linear path pays a full Python dispatch every iteration. Setting the list explicitly keeps the FULL model-forward graph active up to B=16.
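To make the fallback concrete, here is a minimal sketch of the lookup, assuming a decode batch is padded up to the smallest captured size that can hold it and loses the FULL graph when no such size exists. The helper name `covering_capture_size` is hypothetical, and this is a simplified model of the behaviour described above, not vLLM's actual dispatch code.

```python
from bisect import bisect_left
from typing import Optional, Sequence


def covering_capture_size(batch_size: int, capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size >= batch_size, or None if none fits.

    Assumption: a batch is padded up to the next captured size; when no
    captured size is large enough, the FULL graph cannot be replayed.
    """
    sizes = sorted(capture_sizes)
    idx = bisect_left(sizes, batch_size)
    return sizes[idx] if idx < len(sizes) else None


default_sizes = [1, 2]              # what a single-sequence harness ends up with
explicit_sizes = [1, 2, 4, 8, 16]   # the list this release sets explicitly

for b in (1, 3, 4, 16, 17):
    print(
        f"B={b}: default -> {covering_capture_size(b, default_sizes)}, "
        f"explicit -> {covering_capture_size(b, explicit_sizes)}"
    )
```

Under this model, B=3 and B=4 have no covering size in the default list but map to 4 in the explicit list, which matches the B=4 row in the measurements.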
Measured
Gemma-4-E4B-it-GLQ-4bpw, RTX PRO 6000 Blackwell, 256-tok decode:
| Mode | B=1 tok/s | B=4 tok/s (total) | per-seq @ B=4 |
|---|---:|---:|---:|
| Eager | 14.4 | 35.0 | 8.8 |
| Piecewise + default [1, 2] | 39.4 | 132.7 | 33.2 |
| Piecewise + [1, 2, 4, 8, 16] | 40.0 | 157.3 | 39.3 |
B=1 unchanged (FULL was already captured). At B=4 the per-sequence tok/s reaches 39.3 — essentially matching the B=1 ceiling, so the FULL graph at higher batch sizes reclaims the launch-overhead delta that PIECEWISE-only was leaving on the table.
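The headline figures can be re-derived from the table. The sketch below only redoes that arithmetic; the dictionary keys and helper names are illustrative, and the numbers are copied from the rows above.

```python
# Throughput figures copied from the table above (tok/s).
measured = {
    "eager": {"b1": 14.4, "b4_total": 35.0},
    "piecewise_default": {"b1": 39.4, "b4_total": 132.7},
    "piecewise_explicit": {"b1": 40.0, "b4_total": 157.3},
}
BATCH = 4


def per_seq(total_tok_s: float, batch: int = BATCH) -> float:
    """Per-sequence throughput when `batch` sequences decode together."""
    return total_tok_s / batch


def pct_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


explicit = measured["piecewise_explicit"]["b4_total"]
default = measured["piecewise_default"]["b4_total"]
eager = measured["eager"]["b4_total"]

print(f"gain over default capture list at B=4: {pct_gain(explicit, default):+.1f} %")
print(f"speedup over eager at B=4: {explicit / eager:.1f}x")
for mode, row in measured.items():
    print(f"{mode}: per-seq @ B=4 = {per_seq(row['b4_total']):.1f} tok/s, B=1 = {row['b1']:.1f} tok/s")
```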
Tradeoffs (documented in README)
- VRAM cost: ~10-20 MB per captured shape on 3B/E4B-class models; ~100-200 MB per shape on 24-31B.
- Capture time: ~1 s per shape, one-time at LLM init.
- vLLM filters out capture sizes ≥ max_num_batched_tokens, so very small token budgets may silently drop the larger entries.
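For a rough budget before choosing a list, the tradeoffs above can be turned into a back-of-the-envelope estimate. This is a sketch under stated assumptions: the helper names are hypothetical, the per-shape costs are the approximate 3B/E4B-class ranges quoted above, and the filter simply applies the rule as described here rather than calling into vLLM.

```python
from typing import List, Sequence, Tuple


def surviving_sizes(capture_sizes: Sequence[int], max_num_batched_tokens: int) -> List[int]:
    """Apply the filter as described above: drop sizes >= max_num_batched_tokens."""
    return [s for s in capture_sizes if s < max_num_batched_tokens]


def capture_budget(
    capture_sizes: Sequence[int],
    mb_per_shape: Tuple[float, float] = (10.0, 20.0),  # approx. range for 3B/E4B-class models
    seconds_per_shape: float = 1.0,                     # approx. one-time capture cost per shape
) -> Tuple[float, float, float]:
    """Return (low MB, high MB, seconds) for capturing every size in the list."""
    n = len(capture_sizes)
    low, high = mb_per_shape
    return n * low, n * high, n * seconds_per_shape


requested = [1, 2, 4, 8, 16]
kept = surviving_sizes(requested, max_num_batched_tokens=8192)
low_mb, high_mb, secs = capture_budget(kept)
print(f"kept sizes: {kept}")
print(f"estimated VRAM: {low_mb:.0f}-{high_mb:.0f} MB, capture time: ~{secs:.0f} s")

# A very small token budget drops the larger entries.
print(surviving_sizes(requested, max_num_batched_tokens=8))
```

For 24-31B models, pass `mb_per_shape=(100.0, 200.0)` to reflect the larger per-shape cost quoted above.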
Tuning
Users can extend or trim the list directly:
from vllm import LLM

llm = LLM(model=..., compilation_config={
    "cudagraph_capture_sizes": [1, 4, 16, 32, 64],  # high-throughput
})
See the new "Tuning vLLM CUDA-graph capture sizes" sub-section under Advanced in the README for guidance.
Install
pip install -U glq==0.3.4