This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Fixed DeepSeek V4 sparse attention hang and KV cache allocation error.
Full changelog
vLLM v0.20.2
Highlights
This release features 6 commits from 6 contributors (0 new)!
This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL.
Bug Fixes
- DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605).
- DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
- gpt-oss MXFP4 + torch.compile: Plumbed hidden_dim_unpadded through the moe_forward fake op so MXFP4 works under torch.compile on v0.20.x (#42002, backport of #41646).
- Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).
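
For teams gating a rollout on these fixes, a deployment script could compare the installed version against v0.20.2. The following is a minimal sketch, not part of vLLM: the helper names parse_version and has_v0_20_2_fixes are hypothetical, and it assumes that any release numbered 0.20.2 or higher carries the fixes listed above, which the changelog itself does not state.

```python
import re
from importlib import metadata

# The release described in this changelog.
FIXED_IN = (0, 20, 2)


def parse_version(text: str) -> tuple:
    """Return the leading major.minor.patch numbers of a version string.

    Suffixes such as ".post1" or "+cu128" are ignored, so pre-release or
    dev builds of 0.20.2 are treated the same as the final release.
    """
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_v0_20_2_fixes(installed: str) -> bool:
    """Hypothetical helper: True if `installed` is at least 0.20.2.

    Assumption: later releases keep these fixes.
    """
    return parse_version(installed) >= FIXED_IN


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        print(installed, "includes fixes:", has_v0_20_2_fixes(installed))
```

A three-part numeric comparison keeps the check dependency-free. A stricter gate would use a full version parser that orders pre-release and dev builds correctly.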
Contributors
@ywang96, @zyongye, @stecasta, @wzhao18, @Isotr0py, @khluu
About vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Earlier breaking changes
- v0.21.0 C++20 compiler requirement for PyTorch compatibility