vllm

v0.22.1 Breaking

This release is flagged as breaking; platform teams should review the changes below before upgrading.

Published 1mo Model Serving & MLOps

✓ No known CVEs patched

Topics

amd, blackwell, cuda, deepseek, deepseek-v3, llm, gpt-oss, inference, kimi, llama, llm-serving, model-serving, moe, openai, pytorch, qwen, qwen3, tpu, transformer

ReleasePort's take

Light signal

editorial:auto 1mo

Version v0.22.1 adds JetBrains Mellum v2 model support and resolves multiple initialization, loading, performance, and build issues.

Why it matters: Fixes bugs affecting DeepSeek-V4 initialization, `OlmoHybridForCausalLM` initialization after a checkpoint change, HyperCLOVAX loading, and multi-node Ray data-parallel hangs; speeds up quantized linear inference on AMD Zen CPUs; and fixes Docker builds that depended on the quarantined `flashinfer-jit-cache`.

Summary

AI summary

A mixed patch release with changes to Model Support, Hardware & Performance, Large Scale Serving, and Build & CI.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Low	Adds support for JetBrains' Mellum v2 model.	llm_adapter@2026-06-05	high	None
Dependency	Low	Stops installing `flashinfer-jit-cache` via extra index URL while it is quarantined on PyPI, fixing Docker image builds.	llm_adapter@2026-06-05	high	None
Performance	Medium	Routes W8A8 and W4A16 linear inference through zentorch kernels on AMD Zen CPUs for faster execution.	llm_adapter@2026-06-05	high	None
Bugfix	Medium	Fixes DeepSeek-V4 initialization issue caused by CUTLASS `fmin` compatibility problem.	llm_adapter@2026-06-05	high	None
Bugfix	Medium	Resolves `OlmoHybridForCausalLM` failure after checkpoint changed `rope_parameters`.	llm_adapter@2026-06-05	high	None
Bugfix	Medium	Restores loading of HyperCLOVAX after upstream repo removal by using vendored config.	llm_adapter@2026-06-05	high	None
Bugfix	Medium	Fixes deterministic hang in multi-node Ray data-parallel serving when `num_api_servers > 1`.	llm_adapter@2026-06-05	high	None
Bugfix	Low	Normalizes NIXL KV-connector wheel installs to match the CUDA major version, preventing `ImportError: libcudart.so.12`.	llm_adapter@2026-06-05	high	None

Full changelog

Highlights

This release features 8 commits from 6 contributors (1 new)!

v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.

Model Support

New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (#43992).
DeepSeek-V4: resolve a CUTLASS `fmin` compatibility issue that broke initialization (0decac0d).
Fix `OlmoHybridForCausalLM` failing to initialise after the checkpoint changed `rope_parameters` from `None` to `{"rope_type": None}` (#43846).
Fix HyperCLOVAX loading after the upstream HuggingFace repo removed its remote code (now native in `transformers >= 5.9.0`): register the `hyperclovax` `model_type` so vLLM uses its vendored config instead of the stale `auto_map` (#43860).
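
The `OlmoHybridForCausalLM` item involves two encodings of the same field: `None` and `{"rope_type": None}`. A minimal sketch of treating both the same way, assuming that both forms mean no RoPE type is set. The helper name `has_rope` and its rule are hypothetical and are not taken from vLLM's code:

```python
from typing import Any, Optional


def has_rope(rope_parameters: Optional[dict[str, Any]]) -> bool:
    """Return True only when a concrete rope_type is configured.

    Hypothetical helper for illustration; not vLLM's implementation.
    """
    if rope_parameters is None:
        # Older checkpoint form: the field is absent entirely.
        return False
    # Newer checkpoint form: the dict exists but rope_type may be None.
    return rope_parameters.get("rope_type") is not None


print(has_rope(None))                      # old form
print(has_rope({"rope_type": None}))       # new form
print(has_rope({"rope_type": "default"}))  # a configured type
```

Code that only checks `rope_parameters is None` would treat the new form as configured, which is the kind of mismatch such a normalization avoids.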

Hardware & Performance

AMD Zen CPUs: route W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference through zentorch kernels, registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU (#41813).
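
Registering one kernel ahead of another, with fallback, can be pictured as a priority-ordered list where the first supported entry wins. The sketch below illustrates that general pattern only; the `Kernel` class, `select_kernel`, the registry contents, and the platform keys are all hypothetical and do not reflect vLLM's or zentorch's actual APIs:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Kernel:
    name: str
    is_supported: Callable[[dict], bool]


def select_kernel(kernels: list[Kernel], platform: dict) -> str:
    """Return the name of the first registered kernel that supports the platform."""
    for kernel in kernels:
        if kernel.is_supported(platform):
            return kernel.name
    raise RuntimeError("no kernel supports this platform")


# List order is priority order: the specialised kernel is tried first.
REGISTRY = [
    Kernel("zentorch", lambda p: p.get("device") == "cpu" and p.get("cpu") == "amd-zen"),
    Kernel("onednn", lambda p: p.get("device") == "cpu"),
    Kernel("default", lambda p: True),
]

print(select_kernel(REGISTRY, {"device": "cpu", "cpu": "amd-zen"}))
print(select_kernel(REGISTRY, {"device": "cpu", "cpu": "other"}))
print(select_kernel(REGISTRY, {"device": "gpu"}))
```

Because unsupported platforms simply fall through to the next entry, callers do not need to know which kernel was chosen.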

Large Scale Serving

Fix a deterministic hang in multi-node Ray data-parallel serving with `num_api_servers > 1` by excluding the Ray DP backend from the deferred (kernel-assigned) port allocation introduced in #42585 (#43864).
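
Kernel-assigned port allocation generally means binding to port 0 and reading back whichever port the operating system picked, so the port is unknown until the bind happens. The sketch below shows that mechanism with the standard library, plus a switch that excludes one backend from it. The backend label `"ray"`, the set `DEFERRED_PORT_EXCLUDED`, and both function names are assumptions for illustration, not vLLM identifiers:

```python
import socket

# Hypothetical backend labels excluded from deferred allocation.
DEFERRED_PORT_EXCLUDED = {"ray"}


def uses_deferred_ports(backend: str) -> bool:
    """Return True if this backend may wait for a kernel-assigned port."""
    return backend not in DEFERRED_PORT_EXCLUDED


def bind_kernel_assigned(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind to port 0 so the OS chooses a free port, then report it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    return sock, sock.getsockname()[1]


sock, port = bind_kernel_assigned()
print(f"OS picked port {port}")
sock.close()
print(uses_deferred_ports("ray"))
```

A component that must know every address before any socket is bound cannot use this scheme, which is one plausible reason to exclude a backend from it.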

Build & CI

Docker: stop installing `flashinfer-jit-cache` via `--extra-index-url` while it is quarantined on PyPI, fixing image builds (#44366).
Normalize NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major version is kept, fixing `ImportError: libcudart.so.12` when importing `nixl_ep` on CUDA 13 images (#44266).
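
Keeping only the wheel that matches the image's CUDA major version amounts to a filter over candidate wheel names. A minimal sketch, assuming wheels carry a `_cuNN-` tag in their filename; that naming scheme, the example filenames, and the function `keep_matching_cuda_wheels` are hypothetical and not the actual NIXL packaging:

```python
import re

# Assumed tag format: "_cu<major>-" somewhere in the wheel filename.
CUDA_TAG = re.compile(r"_cu(\d+)-")


def keep_matching_cuda_wheels(wheels: list[str], cuda_major: int) -> list[str]:
    """Keep untagged wheels and wheels whose CUDA tag equals cuda_major."""
    kept = []
    for name in wheels:
        match = CUDA_TAG.search(name)
        if match is None or int(match.group(1)) == cuda_major:
            kept.append(name)
    return kept


candidates = [
    "example_pkg_cu12-1.0-py3-none-any.whl",
    "example_pkg_cu13-1.0-py3-none-any.whl",
    "example_common-1.0-py3-none-any.whl",
]
print(keep_matching_cuda_wheels(candidates, 13))
```

On a CUDA 13 image this drops the CUDA 12 build, so nothing remains that would try to load a CUDA 12 runtime library.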

Contributors

@khluu, @vadiklyutiy, @aadwived, @shadeMe, @alec-flowers, @hmellor

New Contributors

@aadwived made their first contribution in https://github.com/vllm-project/vllm/pull/41813

About vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Earlier breaking changes

v0.25.0 Makes Model Runner V2 the default for all dense models.
v0.25.0 Deletes PagedAttention implementation.
v0.21.0 C++20 compiler requirement for PyTorch compatibility