vllm

v0.20.2 Breaking

This release is flagged as breaking. Platform teams should review the changelog before upgrading.

Published 24d Model Serving & MLOps

✓ No known CVEs patched in this version

Topics

amd blackwell cuda deepseek deepseek-v3 llm gpt-oss inference kimi llama llm-serving model-serving moe openai pytorch qwen qwen3 tpu transformer

Summary

AI summary

Fixed DeepSeek V4 sparse attention hang and KV cache allocation error.

Full changelog

This release features 6 commits from 6 contributors (0 new)!

This is a small patch release with bug fixes for DeepSeek V4, gpt-oss, and Qwen3-VL.

DeepSeek V4 sparse attention: Re-enable the persistent topk path on Hopper and ensure the memset kernel runs at CUDA graph capture time regardless of max_seq_len, fixing the MTP=1 hang on DeepSeek V4 (#41665, revert of #41605).
DeepSeek V4 KV cache: Fixed a "failure to allocate KV blocks" error in the V1 engine KV cache manager (#41282).
gpt-oss MXFP4 + torch.compile: Plumbed hidden_dim_unpadded through the moe_forward fake op so MXFP4 works under torch.compile on v0.20.x (#42002, backport of #41646).
Qwen3-VL: Removed an invalid deepstack boundary check that could fail under heavy load (#40932).
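
For teams gating a rollout on these fixes, a pre-flight check along the following lines may help. It is a minimal sketch, not part of vLLM: the helper names (`parse_version`, `has_patch_fixes`, `installed_vllm_version`) are hypothetical, and it assumes the package is installed under the distribution name `vllm` and that the fixes listed above are present in 0.20.2 and later. It compares only the leading numeric `X.Y.Z` part of the version string, so pre-release and local build suffixes are ignored.

```python
from importlib import metadata

# Release described in this changelog (assumed to be the first one with these fixes).
FIXED_IN = (0, 20, 2)


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints.

    Local build tags such as '+cu128' are dropped, and suffixes such as
    'rc1' are ignored, so '0.20.2rc1' parses the same as '0.20.2'.
    """
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev5'
        parts.append(int(digits))
    return tuple(parts)


def has_patch_fixes(version_text, minimum=FIXED_IN):
    """True if the version is at least `minimum` (numeric comparison only)."""
    return parse_version(version_text)[:3] >= minimum


def installed_vllm_version():
    """Return the installed vllm version string, or None if it is not installed."""
    try:
        return metadata.version("vllm")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_vllm_version()
    if found is None:
        print("vllm is not installed in this environment")
    elif has_patch_fixes(found):
        print(f"vllm {found} is at or above 0.20.2")
    else:
        print(f"vllm {found} predates 0.20.2")
```

Running the script in the serving environment prints which side of 0.20.2 the installed build falls on. A stricter check could use a full PEP 440 parser instead of the numeric-prefix comparison used here.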

@ywang96, @zyongye, @stecasta, @wzhao18, @Isotr0py, @khluu

About vllm

A high-throughput and memory-efficient inference and serving engine for LLMs