This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Light signal: Version v0.22.1 adds JetBrains Mellum v2 model support and resolves multiple initialization, loading, performance, and build issues.
Why it matters: Addresses critical bugs affecting DeepSeek‑V4 init, OlmoHybridForCausalLM checkpoint changes, HyperCLOVAX loading, AMD Zen inference speed, and Ray multi‑node hangs; also fixes Docker builds dependent on quarantined `flashinfer-jit-cache`.
Summary
AI summary: Updates Model Support, Build & CI, and Hardware & Performance across a mixed release.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Feature | Low | Adds support for JetBrains' Mellum v2 model. | None |
| Dependency | Low | Stops installing `flashinfer-jit-cache` via extra index URL while it is quarantined on PyPI, fixing Docker image builds. | None |
| Performance | Medium | Routes W8A8 and W4A16 linear inference through zentorch kernels on AMD Zen CPUs for faster execution. | None |
| Bugfix | Medium | Fixes DeepSeek-V4 initialization issue caused by CUTLASS `fmin` compatibility problem. | None |
| Bugfix | Medium | Resolves `OlmoHybridForCausalLM` failure after checkpoint changed `rope_parameters`. | None |
| Bugfix | Medium | Restores loading of HyperCLOVAX after upstream repo removal by using vendored config. | None |
| Bugfix | Medium | Fixes deterministic hang in multi-node Ray data-parallel serving when `num_api_servers > 1`. | None |
| Bugfix | Low | Normalizes NIXL KV-connector wheel installs to match the CUDA major version, preventing `ImportError: libcudart.so.12`. | None |

Source for all entries: llm_adapter@2026-06-05. Confidence: high.
Full changelog
Highlights
This release features 8 commits from 6 contributors (1 new)!
v0.22.1 is a patch release on top of v0.22.0 with targeted bug fixes plus a couple of additions: new model support for JetBrains' Mellum v2, zentorch-accelerated quantized linear inference on AMD Zen CPUs, and fixes for multi-node Ray data-parallel serving, DeepSeek-V4 initialization, and a few model-loading regressions.
Model Support
- New model: JetBrains' Mellum v2, an open-weights Mixture-of-Experts code-generation model (#43992).
- DeepSeek-V4: resolve a CUTLASS `fmin` compatibility issue that broke initialization (0decac0d).
- Fix `OlmoHybridForCausalLM` failing to initialise after the checkpoint changed `rope_parameters` from `None` to `{"rope_type": None}` (#43846).
- Fix HyperCLOVAX loading after the upstream HuggingFace repo removed its remote code (now native in `transformers >= 5.9.0`): register the `hyperclovax` model_type so vLLM uses its vendored config instead of the stale `auto_map` (#43860).
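The OlmoHybrid entry above describes one setting arriving in two shapes: `rope_parameters` as `None`, and as `{"rope_type": None}`. A minimal sketch of reading both shapes the same way might look like this. `rope_type_of` is a hypothetical helper written for illustration, not the code from #43846.

```python
from typing import Any, Optional


def rope_type_of(rope_parameters: Optional[dict[str, Any]]) -> Optional[str]:
    """Hypothetical helper: return the configured rope type, or None.

    Treats a missing dict and a dict whose "rope_type" is None identically,
    so callers do not need to care which checkpoint format they received.
    """
    if rope_parameters is None:
        return None
    return rope_parameters.get("rope_type")


# Both checkpoint shapes yield the same answer.
print(rope_type_of(None))
print(rope_type_of({"rope_type": None}))
```

Funnelling both shapes through one accessor keeps the format difference out of the model code itself.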
Hardware & Performance
- AMD Zen CPUs: route W8A8 (int8 dynamic-symmetric) and W4A16 (GPTQ) linear inference through zentorch kernels, registered ahead of the generic oneDNN CPU kernels, with transparent fallback on non-Zen CPUs, GPUs, and XPU (#41813).
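The phrase "registered ahead of the generic oneDNN CPU kernels, with transparent fallback" describes an ordered lookup: the first kernel that supports the platform wins. The sketch below illustrates that general pattern only. The `Kernel` class, the platform tags, and `select_kernel` are all hypothetical and do not mirror vLLM's registry; GPU and XPU paths are left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Kernel:
    name: str
    is_supported: Callable[[str], bool]  # receives a hypothetical platform tag


# Order matters: earlier entries are tried first.
KERNELS = [
    Kernel("zentorch", lambda platform: platform == "cpu-amd-zen"),
    Kernel("onednn", lambda platform: platform.startswith("cpu")),
]


def select_kernel(platform: str) -> str:
    """Return the name of the first registered kernel that supports the platform."""
    for kernel in KERNELS:
        if kernel.is_supported(platform):
            return kernel.name
    raise RuntimeError(f"no kernel registered for platform {platform!r}")


print(select_kernel("cpu-amd-zen"))
print(select_kernel("cpu-generic"))
```

Because the specialised kernel sits first and declines unsupported platforms, callers on other CPUs reach the generic kernel without any configuration change.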
Large Scale Serving
- Fix a deterministic hang in multi-node Ray data-parallel serving with `num_api_servers > 1` by excluding the Ray DP backend from the deferred (kernel-assigned) port allocation introduced in #42585 (#43864).
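For readers unfamiliar with the term, kernel-assigned port allocation generally means binding a socket to port 0 and reading back the port the operating system picked. The sketch below shows that general technique with the Python standard library. It is not vLLM's implementation, and `bind_kernel_assigned_port` is a hypothetical name.

```python
import socket


def bind_kernel_assigned_port(host: str = "127.0.0.1") -> tuple[socket.socket, int]:
    """Bind to port 0 so the OS chooses a free port, then report which one."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 asks the kernel to assign a free port
    port = sock.getsockname()[1]
    return sock, port


sock, port = bind_kernel_assigned_port()
print(f"kernel assigned port {port}")
sock.close()
```

The port number is only known after the bind, and only on the host that performed it, so any other process that needs it has to be told explicitly.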
Build & CI
- Docker: stop installing `flashinfer-jit-cache` via `--extra-index-url` while it is quarantined on PyPI, fixing image builds (#44366).
- Normalize NIXL KV-connector wheel installs so only the wheel matching the image's CUDA major is kept, fixing `ImportError: libcudart.so.12` when importing `nixl_ep` on CUDA 13 images (#44266).
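Keeping "only the wheel matching the image's CUDA major" can be pictured as a filter over wheel filenames. The sketch below assumes a `_cu<major>-` tag in the filename; the wheel names and the `keep_wheels_for_cuda` helper are hypothetical and are not taken from #44266 or from the real NIXL packages.

```python
import re

# Assumed naming convention: "<name>_cu<major>-<version>-...whl".
CUDA_TAG = re.compile(r"_cu(\d+)-")


def keep_wheels_for_cuda(wheel_names: list[str], cuda_major: int) -> list[str]:
    """Drop wheels tagged for a different CUDA major; keep untagged wheels."""
    kept = []
    for name in wheel_names:
        match = CUDA_TAG.search(name)
        if match is None or int(match.group(1)) == cuda_major:
            kept.append(name)
    return kept


wheels = [
    "example_cu12-1.0-py3-none-any.whl",  # hypothetical filenames
    "example_cu13-1.0-py3-none-any.whl",
    "helper-2.0-py3-none-any.whl",
]
print(keep_wheels_for_cuda(wheels, cuda_major=13))
```

On a CUDA 13 image, a wheel built against CUDA 12 would look for `libcudart.so.12`, so removing it before import avoids the `ImportError` described above.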
Contributors
@khluu, @vadiklyutiy, @aadwived, @shadeMe, @alec-flowers, @hmellor
New Contributors
- @aadwived made their first contribution in https://github.com/vllm-project/vllm/pull/41813
About vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Earlier breaking changes
- v0.21.0 C++20 compiler requirement for PyTorch compatibility