vllm

v0.20.2 Breaking

This release is flagged as containing breaking changes. Platform teams should review the changelog below before upgrading.

Published 24d ago · Model Serving & MLOps
✓ No known CVEs patched

Topics

amd blackwell cuda deepseek deepseek-v3 llm gpt-oss inference kimi llama llm-serving model-serving moe openai pytorch qwen qwen3 tpu transformer

Summary

AI summary

Fixes a DeepSeek V4 sparse attention hang and a KV cache block allocation error, along with bugs affecting gpt-oss MXFP4 under torch.compile and Qwen3-VL.

Full changelog

vLLM v0.20.2

Highlights

This release features 6 commits from 6 contributors (0 new)!

This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL.

Bug Fixes

  • DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605).
  • DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
  • gpt-oss MXFP4 + torch.compile: Plumbed hidden_dim_unpadded through the moe_forward fake op so MXFP4 works under torch.compile on v0.20.x (#42002, backport of #41646).
  • Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).
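
To confirm that a deployment already carries the fixes listed above, one option is to compare the installed vLLM version against 0.20.2. The following is a minimal sketch, not part of vLLM: the helper names `parse_release`, `includes_fixes`, and `installed_vllm_version` are hypothetical, the distribution name `vllm` is assumed to match the installed package, and pre-release or local suffixes (such as `rc1` or `+cu128`) are deliberately ignored.

```python
import re
from importlib import metadata

# Release described in this changelog.
FIXED_IN = (0, 20, 2)


def parse_release(version: str) -> tuple:
    """Return the leading numeric components, e.g. 'v0.20.2+cu128' -> (0, 20, 2).

    Assumption: suffixes such as 'rc1', '.dev5', or '+cu128' are ignored.
    """
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def includes_fixes(version: str, fixed_in: tuple = FIXED_IN) -> bool:
    """True if `version` is at or above `fixed_in`, comparing numeric parts only."""
    release = parse_release(version)
    # Pad with zeros so that '0.21' compares as (0, 21, 0).
    width = max(len(release), len(fixed_in))
    padded = release + (0,) * (width - len(release))
    target = tuple(fixed_in) + (0,) * (width - len(fixed_in))
    return padded >= target


def installed_vllm_version():
    """Return the installed vLLM version string, or None if it is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_vllm_version()
    if found is None:
        print("vllm is not installed in this environment")
    elif includes_fixes(found):
        print(f"vllm {found} includes the v0.20.2 fixes")
    else:
        print(f"vllm {found} predates v0.20.2; consider upgrading")
```

Because the comparison drops pre-release markers, a release candidate of 0.20.2 would be reported as including the fixes. A stricter check would need a full PEP 440 parser.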

Contributors

@ywang96, @zyongye, @stecasta, @wzhao18, @Isotr0py, @khluu

About vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Related context

Earlier breaking changes

  • v0.21.0 C++20 compiler requirement for PyTorch compatibility
