vllm

v0.22.1 Breaking

This release is flagged as containing breaking changes; platform teams should review the changes below before upgrading.

Published 12h Model Serving & MLOps
✓ No known CVEs patched

Topics

amd blackwell cuda deepseek deepseek-v3 llm gpt-oss inference kimi llama llm-serving model-serving moe openai pytorch qwen qwen3 tpu transformer

ReleasePort's take

Light signal

Version v0.22.1 adds JetBrains Mellum v2 model support and resolves multiple initialization, loading, performance, and build issues.

Why it matters: Fixes bugs affecting DeepSeek-V4 initialization, `OlmoHybridForCausalLM` loading after a checkpoint change, HyperCLOVAX loading, and hangs in multi-node Ray data-parallel serving; speeds up quantized inference on AMD Zen CPUs; and fixes Docker builds that depended on the quarantined `flashinfer-jit-cache`.

Summary

AI summary

A mixed patch release with changes across Model Support, Hardware & Performance, Large Scale Serving, and Build & CI.

Changes in this release

Feature Low

Adds support for JetBrains' Mellum v2 model.

Source: llm_adapter@2026-06-05

Confidence: high

Dependency Low

Stops installing `flashinfer-jit-cache` via extra index URL while it is quarantined on PyPI, fixing Docker image builds.

Source: llm_adapter@2026-06-05

Confidence: high

Performance Medium

Routes W8A8 and W4A16 linear inference through zentorch kernels on AMD Zen CPUs for faster execution.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Fixes DeepSeek-V4 initialization issue caused by CUTLASS `fmin` compatibility problem.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Resolves `OlmoHybridForCausalLM` failure after checkpoint changed `rope_parameters`.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Restores loading of HyperCLOVAX after upstream repo removal by using vendored config.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Medium

Fixes deterministic hang in multi-node Ray data-parallel serving when `num_api_servers > 1`.

Source: llm_adapter@2026-06-05

Confidence: high

Bugfix Low

Normalizes NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major version is kept, preventing `ImportError: libcudart.so.12` on CUDA 13 images.

Source: llm_adapter@2026-06-05

Confidence: high

Full changelog

Highlights

This release features 8 commits from 6 contributors (1 new)!

v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.

Model Support

  • New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (#43992).
  • DeepSeek-V4: resolve a CUTLASS `fmin` compatibility issue that broke initialization (0decac0d).
  • Fix `OlmoHybridForCausalLM` failing to initialize after the checkpoint changed `rope_parameters` from `None` to `{"rope_type": None}` (#43846).
  • Fix HyperCLOVAX loading after the upstream HuggingFace repo removed its remote code (now native in `transformers >= 5.9.0`): register the `hyperclovax` `model_type` so vLLM uses its vendored config instead of the stale `auto_map` (#43860).
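
The `OlmoHybridForCausalLM` item above comes down to a config value changing shape: `rope_parameters` went from `None` to `{"rope_type": None}`, so a plain `is None` check no longer sees the two as equivalent. The sketch below shows one tolerant way to read such a field. It assumes both shapes mean "no rotary embedding type configured", and `has_rope` is a hypothetical helper for illustration, not the code from #43846.

```python
from typing import Any, Optional


def has_rope(rope_parameters: Optional[dict[str, Any]]) -> bool:
    """Return True only when a concrete rope_type is configured.

    Hypothetical helper: treats both `None` and `{"rope_type": None}`
    as "no rotary embedding type configured" (an assumption).
    """
    if rope_parameters is None:
        return False
    return rope_parameters.get("rope_type") is not None


print(has_rope(None))                      # False
print(has_rope({"rope_type": None}))       # False
print(has_rope({"rope_type": "default"}))  # True
```

Checking the inner key rather than the outer object keeps the caller working whichever shape a checkpoint ships.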

Hardware & Performance

  • AMD Zen CPUs: route W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference through zentorch kernels, registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU (#41813).
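
The registration order described above (zentorch ahead of the generic kernels, with fallback elsewhere) can be pictured as a priority-ordered candidate list where the first supported kernel wins. This is a minimal sketch under stated assumptions: `KernelCandidate`, `select_kernel`, and the `is_zen_cpu` flag are hypothetical names, not vLLM or zentorch APIs, and `"generic"` stands in for whatever kernel the platform would otherwise use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class KernelCandidate:
    name: str
    # Predicate deciding whether this kernel can run on the given platform.
    is_supported: Callable[[dict], bool]


# Order matters: earlier entries are preferred when supported.
CANDIDATES = [
    KernelCandidate("zentorch", lambda p: p.get("is_zen_cpu", False)),
    KernelCandidate("generic", lambda p: True),  # always-available fallback
]


def select_kernel(platform: dict) -> str:
    """Return the name of the first candidate that supports `platform`."""
    for candidate in CANDIDATES:
        if candidate.is_supported(platform):
            return candidate.name
    raise RuntimeError("no kernel supports this platform")


print(select_kernel({"device": "cpu", "is_zen_cpu": True}))  # zentorch
print(select_kernel({"device": "cpu"}))                      # generic
```

Because the fallback entry accepts every platform, callers on non-Zen hardware see no behavior change, which is what "transparent fallback" implies.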

Large Scale Serving

  • Fix a deterministic hang in multi-node Ray data-parallel serving with `num_api_servers > 1` by excluding the Ray DP backend from the deferred (kernel-assigned) port allocation introduced in #42585 (#43864).
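
Kernel-assigned port allocation generally means binding to port 0 and reading back whichever port the operating system chose, so the port number is unknown until the bind happens. The following is a generic standard-library sketch of that mechanism, not vLLM's implementation; the function name is hypothetical.

```python
import socket


def bind_kernel_assigned_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind to port 0 so the OS picks a free port; return the socket and port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))
    # The actual port is only known after bind() returns.
    port = sock.getsockname()[1]
    return sock, port


sock, port = bind_kernel_assigned_port()
print(f"OS assigned port {port}")
sock.close()
```

Any component that needs the port number before the bind takes place cannot use this scheme directly, which is one plausible reason a backend might be excluded from it.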

Build & CI

  • Docker: stop installing `flashinfer-jit-cache` via `--extra-index-url` while it is quarantined on PyPI, fixing image builds (#44366).
  • Normalize NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major is kept, fixing `ImportError: libcudart.so.12` when importing `nixl_ep` on CUDA 13 images (#44266).
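
The NIXL fix above amounts to keeping only the wheel whose CUDA major matches the image. Here is a rough sketch of that filtering idea, assuming packages carry a `-cu<major>` name suffix. The package names and the `keep_matching_cuda_wheels` helper are illustrative assumptions, not the logic from #44266.

```python
import re


def keep_matching_cuda_wheels(packages: list[str], cuda_major: int) -> list[str]:
    """Keep packages without a CUDA suffix and those whose suffix matches."""
    kept = []
    for name in packages:
        match = re.search(r"-cu(\d+)$", name)
        # No suffix: not CUDA-specific, so keep it. Otherwise compare majors.
        if match is None or int(match.group(1)) == cuda_major:
            kept.append(name)
    return kept


print(keep_matching_cuda_wheels(["nixl-cu12", "nixl-cu13", "numpy"], 13))
# prints ['nixl-cu13', 'numpy']
```

Leaving a CUDA 12 wheel installed on a CUDA 13 image is the kind of mismatch that surfaces as a missing `libcudart.so.12` at import time.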

Contributors

@khluu, @vadiklyutiy, @aadwived, @shadeMe, @alec-flowers, @hmellor

New Contributors

  • @aadwived made their first contribution in https://github.com/vllm-project/vllm/pull/41813

About vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

Related context

Earlier breaking changes

  • v0.21.0 C++20 compiler requirement for PyTorch compatibility
