This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
ReleasePort's take
Moderate signal: SGLang v0.5.12.post1 resolves DeepSeek V4 crashes in disaggregation decode after ~2000 requests and restores HiSparse GSM8K accuracy to 0.960 when the compressor flag is enabled.
Why it matters: Fixes SWA allocator assertion failures that appear after ~2000 requests in DSV4 + EAGLE/MTP disaggregation decode, and raises HiSparse GSM8K accuracy from 0.825 to 0.960 with `SGLANG_OPT_USE_COMPRESSOR_V2=1`.
Summary
AI summary: Stability patch fixes DeepSeek V4 crashes, restores HiSparse accuracy, resolves disaggregation and PD issues, adds performance warm‑up for MHC buckets, and updates cu13 dependency.
Changes in this release
| Type | Severity | Summary | CVE |
|---|---|---|---|
| Dependency | Low | Uses [cu13] extra for nvidia-cutlass-dsl, defaulting to CUDA 13 (required for sm_103 / B300). | None |
| Performance | Medium | Restores DSV4 HiSparse GSM8K accuracy from 0.825 to 0.960 when `SGLANG_OPT_USE_COMPRESSOR_V2=1` is enabled. | None |
| Performance | Medium | Warm MHC token‑count buckets at DSV4 startup (gated by specific options) to eliminate 20–40 s cold‑bucket forward stalls. | None |
| Performance | Low | Precompiles DeepGEMM branch for `_dispatch_bf16_fp32_backend` in DSV4‑Pro to cut runtime JIT compile cost. | None |
| Bugfix | Medium | Fixes garbled text in DSV4-Pro single-token decode on B200/B300 by ceiling activation scales before packing deep_gemm UE8M0 path. | None |
| Bugfix | Medium | Resolves SWA allocator assertion crashes in DSV4 + EAGLE/MTP disaggregation decode after ~2000 requests by fixing stale sliding-window KV page mappings. | None |
| Bugfix | Medium | Prevents scheduler crash at startup for DSV4 NSA prefill context‑parallel mode with round‑robin‑split in disaggregation. | None |
| Bugfix | Medium | Enables DSV4 PD disaggregation to work with pipeline parallelism greater than 1 by removing stale `pp_size=1` assertion. | None |
| Bugfix | Medium | Prevents CUDA illegal memory access in DSV4‑Flash with dummy load format during CUDA‑graph capture by initializing `HashTopK.tid2eid` lookup table. | None |
| Bugfix | Medium | Corrects stale translation indices in DSV4 HiCache when `SGLANG_OPT_CACHE_SWA_TRANSLATION=1` after a cache rebuild, avoiding OOB writes and wrong outputs. | None |
| Bugfix | Low | Fixes missing `group` argument in `get_dp_buffer` function. | None |
Full changelog
v0.5.12.post1 is a stability patch on top of v0.5.12. It cherry-picks 12 fixes — primarily for DeepSeek V4 — onto the release branch.
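For teams gating a rollout on this patch, a quick check that the installed build is at least v0.5.12.post1 might look like the sketch below. It assumes the distribution is installed under the name `sglang`; `version_key` and `installed_has_patch` are hypothetical helpers written for this example, and the parser deliberately handles only plain `X.Y.Z` and `X.Y.Z.postN` strings.

```python
import re
from importlib import metadata

TARGET = "0.5.12.post1"  # the patch release described in these notes


def version_key(version: str) -> tuple:
    """Turn 'X.Y.Z' or 'X.Y.Z.postN' into a sortable tuple."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    release = tuple(int(part) for part in match.group(1).split("."))
    # A missing post segment sorts before any explicit .postN.
    post = int(match.group(2)) if match.group(2) else -1
    return release, post


def installed_has_patch(package: str = "sglang") -> bool:
    """Return True if `package` is installed at TARGET or newer."""
    try:
        return version_key(metadata.version(package)) >= version_key(TARGET)
    except metadata.PackageNotFoundError:
        return False


# The post-release sorts after the base release it patches.
print(version_key("0.5.12.post1") > version_key("0.5.12"))
```

A full PEP 440 parser would be the safer choice in production tooling; the hand-rolled key above only keeps the example dependency-free.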
Bug Fixes
DeepSeek V4
- DSV4-Pro emits garbled text during single-token decode on B200/B300 (fix `deep_gemm` UE8M0 scale-packing path by ceiling activation scales before packing): #25733
- DSV4 + EAGLE/MTP in disaggregation decode crashes around 2000 requests with a SWA allocator assertion (recycled KV pages kept stale sliding-window mappings): #25805
- DSV4 NSA prefill context-parallel (`--enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split`) in `--disaggregation-mode prefill`: scheduler crash at startup: #25396
- DSV4 HiSparse + `SGLANG_OPT_USE_COMPRESSOR_V2=1`: GSM8K accuracy restored from 0.825 → 0.960: #25646
- DSV4 PD disaggregation now works with pipeline parallelism > 1 (removed stale `pp_size=1` assertion): #25771
- DSV4-Flash with `--load-format dummy` + FlashInfer mxfp4 hits CUDA illegal memory access during CUDA-graph capture (the integer `HashTopK.tid2eid` lookup table was left uninitialized by dummy load): #25892
- DSV4 HiCache + `SGLANG_OPT_CACHE_SWA_TRANSLATION=1` returns stale translation indices after a cache rebuild, causing OOB writes / wrong outputs: #25889
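The UE8M0 fix in #25733 is described only as "ceiling activation scales before packing". The sketch below illustrates, with plain arithmetic, why the rounding direction of a power-of-two scale can matter. It is not the code from that PR: the helper names, the sample `amax`, and the choice of FP8 e4m3 as the quantized type are assumptions made for the example. UE8M0 is taken here to be an 8-bit exponent-only scale with bias 127, as in the OCP MX E8M0 format.

```python
import math

E8M0_BIAS = 127        # exponent bias of the E8M0 scale format
FP8_E4M3_MAX = 448.0   # largest finite float8 e4m3fn magnitude (assumed target type)


def ue8m0_exponent(scale: float, round_up: bool) -> int:
    """Return the biased 8-bit exponent of a power-of-two scale near `scale`."""
    log2_scale = math.log2(scale)
    exp = math.ceil(log2_scale) if round_up else math.floor(log2_scale)
    return max(0, min(254, exp + E8M0_BIAS))  # 255 is reserved for NaN


def decode_ue8m0(biased_exp: int) -> float:
    """Decode a biased exponent back to the power-of-two scale it represents."""
    return 2.0 ** (biased_exp - E8M0_BIAS)


amax = 300.0                     # hypothetical per-block activation maximum
raw_scale = amax / FP8_E4M3_MAX  # about 0.67, not a power of two

floor_scale = decode_ue8m0(ue8m0_exponent(raw_scale, round_up=False))
ceil_scale = decode_ue8m0(ue8m0_exponent(raw_scale, round_up=True))

# Dividing by the floored scale overshoots the FP8 range; the ceiled scale stays inside it.
print(floor_scale, amax / floor_scale > FP8_E4M3_MAX)
print(ceil_scale, amax / ceil_scale <= FP8_E4M3_MAX)
```

With a floored scale the largest activation maps outside the representable range and would have to be clipped, while a ceiled scale keeps every value in range at the cost of slightly coarser resolution.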
Disaggregation
- [PD][NIXL] Always send aux on `is_last`; only expect state when truthy: #25699
Other
- Fix missing `group` arg in `get_dp_buffer`: #25585
Performance
- DSV4: warm MHC token-count buckets at startup (gated to `SGLANG_OPT_DEEPGEMM_HC_PRENORM=1` + `SGLANG_OPT_USE_TILELANG_MHC_PRE=1` + hybrid SWA) to eliminate 20–40s cold-bucket forward stalls: #25810
- DSV4-Pro: precompile a DeepGEMM branch for `_dispatch_bf16_fp32_backend` to cut runtime JIT compile cost: #25860
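Because the MHC warm-up is gated on two environment variables, a deployment script may want to confirm both are set before expecting the cold-bucket stalls to disappear. The following is a minimal sketch: the variable names come from the notes above, but `missing_mhc_warmup_flags` is a hypothetical helper, and treating only the exact string `"1"` as enabled is an assumption about how the flags are read. The hybrid SWA condition is a model/server property and is not checked here.

```python
from __future__ import annotations

import os
from typing import Mapping

# Environment variables named in the release notes for the MHC warm-up.
MHC_WARMUP_FLAGS = (
    "SGLANG_OPT_DEEPGEMM_HC_PRENORM",
    "SGLANG_OPT_USE_TILELANG_MHC_PRE",
)


def missing_mhc_warmup_flags(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the warm-up flags that are not set to "1" in `env`.

    Assumption: only the literal value "1" counts as enabled.
    """
    return [name for name in MHC_WARMUP_FLAGS if env.get(name) != "1"]


missing = missing_mhc_warmup_flags()
if missing:
    print("MHC warm-up gate not satisfied; unset flags:", ", ".join(missing))
else:
    print("Both MHC warm-up flags are set (hybrid SWA is not checked here).")
```

Passing an explicit mapping instead of `os.environ` makes the check easy to run against a rendered container or launcher environment before the server starts.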
Dependencies
- Use `[cu13]` extra for `nvidia-cutlass-dsl` (default to CUDA 13; required for sm_103 / B300): #25576
All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.12...v0.5.12.post1
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.12...v0.5.12.post1
About sglang
SGLang is a high-performance serving framework for large language models and multimodal models.