sglang

v0.5.12.post1 Breaking

This release includes breaking changes that platform teams should review before upgrading.

Published 2mo Model Serving & MLOps

✓ No known CVEs patched

Topics

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

ReleasePort's take

Moderate signal

editorial:auto 2mo

v0.5.12.post1 resolves DeepSeek V4 (DSV4) crashes in disaggregation decode after ~2000 requests and restores DSV4 HiSparse GSM8K accuracy to 0.960 when `SGLANG_OPT_USE_COMPRESSOR_V2=1` is enabled.

Why it matters: Fixes SWA allocator assertion failures that appeared after ~2000 requests in DSV4 + EAGLE/MTP disaggregation decode, and raises DSV4 HiSparse GSM8K accuracy from 0.825 to 0.960 with `SGLANG_OPT_USE_COMPRESSOR_V2=1`.

Summary

AI summary

Stability patch that fixes DeepSeek V4 crashes, restores HiSparse accuracy, resolves PD disaggregation issues, warms MHC token-count buckets at startup, and switches nvidia-cutlass-dsl to the [cu13] extra.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Dependency	Low	Uses [cu13] extra for nvidia-cutlass-dsl, defaulting to CUDA 13 (required for sm_103 / B300).	granite4.1:30b@2026-05-27-audit	low	—
Performance	Medium	Restores DSV4 HiSparse GSM8K accuracy from 0.825 to 0.960 when `SGLANG_OPT_USE_COMPRESSOR_V2=1` is enabled.	llm_adapter@2026-05-27	high	—
Performance	Medium	Warms MHC token-count buckets at DSV4 startup (gated by specific options) to eliminate 20–40 s cold-bucket forward stalls.	llm_adapter@2026-05-27	high	—
Performance	Low	Precompiles DeepGEMM branch for `_dispatch_bf16_fp32_backend` in DSV4-Pro to cut runtime JIT compile cost.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Fixes garbled text in DSV4-Pro single-token decode on B200/B300 by ceiling activation scales before packing in the deep_gemm UE8M0 path.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Resolves SWA allocator assertion crashes in DSV4 + EAGLE/MTP disaggregation decode after ~2000 requests by fixing stale sliding-window KV page mappings.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Prevents scheduler crash at startup for DSV4 NSA prefill context-parallel mode with round-robin-split in disaggregation.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Enables DSV4 PD disaggregation to work with pipeline parallelism greater than 1 by removing stale `pp_size=1` assertion.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Prevents CUDA illegal memory access in DSV4-Flash with dummy load format during CUDA-graph capture by initializing the `HashTopK.tid2eid` lookup table.	llm_adapter@2026-05-27	high	—
Bugfix	Medium	Corrects stale translation indices in DSV4 HiCache when `SGLANG_OPT_CACHE_SWA_TRANSLATION=1` after a cache rebuild, avoiding OOB writes and wrong outputs.	llm_adapter@2026-05-27	high	—
Bugfix	Low	Fixes missing `group` argument in `get_dp_buffer` function.	llm_adapter@2026-05-27	high	—
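
Two of the rows above depend on opt-in environment flags. As a minimal sketch of how a launcher script might set them for a child process: the flag names are quoted from this page, while `build_server_env` and `DSV4_OPT_FLAGS` are hypothetical helpers and not part of SGLang.

```python
import os

# Flag names are quoted from the release notes. "1" is the only value the
# notes mention, so no other values are modelled here.
DSV4_OPT_FLAGS = (
    "SGLANG_OPT_USE_COMPRESSOR_V2",
    "SGLANG_OPT_CACHE_SWA_TRANSLATION",
)


def build_server_env(base_env, enable):
    """Return a copy of base_env with each requested opt-in flag set to "1"."""
    unknown = set(enable) - set(DSV4_OPT_FLAGS)
    if unknown:
        raise ValueError(f"unexpected flag(s): {sorted(unknown)}")
    env = dict(base_env)  # copy, so the caller's mapping is left untouched
    for name in enable:
        env[name] = "1"
    return env


# Hypothetical usage: pass the result as env= to subprocess.Popen.
env = build_server_env(os.environ, ["SGLANG_OPT_USE_COMPRESSOR_V2"])
print(env["SGLANG_OPT_USE_COMPRESSOR_V2"])
```

Building a copy rather than mutating `os.environ` keeps the flags scoped to the server process being launched.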

Full changelog

v0.5.12.post1 is a stability patch on top of v0.5.12. It cherry-picks 12 fixes — primarily for DeepSeek V4 — onto the release branch.

Bug Fixes

DeepSeek V4

DSV4-Pro emits garbled text during single-token decode on B200/B300 (fix deep_gemm UE8M0 scale-packing path by ceiling activation scales before packing): #25733
DSV4 + EAGLE/MTP in disaggregation decode crashes around 2000 requests with a SWA allocator assertion (recycled KV pages kept stale sliding-window mappings): #25805
DSV4 NSA prefill context-parallel (--enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split) in --disaggregation-mode prefill: scheduler crash at startup: #25396
DSV4 HiSparse + SGLANG_OPT_USE_COMPRESSOR_V2=1: GSM8K accuracy restored from 0.825 → 0.960: #25646
DSV4 PD disaggregation now works with pipeline parallelism > 1 (removed stale pp_size=1 assertion): #25771
DSV4-Flash with --load-format dummy + FlashInfer mxfp4 hits CUDA illegal memory access during CUDA-graph capture (the integer HashTopK.tid2eid lookup table was left uninitialized by dummy load): #25892
DSV4 HiCache + SGLANG_OPT_CACHE_SWA_TRANSLATION=1 returns stale translation indices after a cache rebuild, causing OOB writes / wrong outputs: #25889
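
The UE8M0 fix in #25733 is described only as "ceiling activation scales before packing". The sketch below illustrates one plausible reading of why that matters, not the deep_gemm code itself. It assumes the scale is stored as a bare power-of-two exponent with bias 127 and that the quantized values target FP8 E4M3 (maximum 448). The helpers `pow2_exponent` and `pack_ue8m0` are hypothetical names.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite FP8 E4M3 value (assumed target format)
UE8M0_BIAS = 127      # bias of the 8-bit exponent-only scale format


def pow2_exponent(scale, round_up):
    """Exponent e such that 2**e is scale rounded up (or down) to a power of two."""
    if scale <= 0.0 or not math.isfinite(scale):
        raise ValueError("scale must be positive and finite")
    mantissa, exp = math.frexp(scale)  # scale == mantissa * 2**exp, 0.5 <= mantissa < 1
    floor_exp = exp - 1
    if round_up and mantissa != 0.5:   # mantissa == 0.5 means already a power of two
        return floor_exp + 1
    return floor_exp


def pack_ue8m0(exponent):
    """Biased exponent byte for 2**exponent, clamped to 0..254 (255 is reserved)."""
    return max(0, min(254, exponent + UE8M0_BIAS))


amax = 500.0                      # illustrative activation magnitude
raw_scale = amax / FP8_E4M3_MAX   # about 1.116, not a power of two
for round_up in (False, True):
    e = pow2_exponent(raw_scale, round_up)
    scaled = amax / 2.0 ** e
    print(round_up, pack_ue8m0(e), scaled, scaled <= FP8_E4M3_MAX)
```

Rounding the scale down leaves the scaled value above the representable maximum, while rounding up keeps it in range at the cost of some precision. That trade-off is the general motivation for a ceiling in power-of-two scale formats.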

Disaggregation

[PD][NIXL] Always send aux on is_last; only expect state when truthy: #25699

Other

Fix missing group arg in get_dp_buffer: #25585

Performance

DSV4: warm MHC token-count buckets at startup (gated to SGLANG_OPT_DEEPGEMM_HC_PRENORM=1 + SGLANG_OPT_USE_TILELANG_MHC_PRE=1 + hybrid SWA) to eliminate 20–40s cold-bucket forward stalls: #25810
DSV4-Pro: precompile a DeepGEMM branch for _dispatch_bf16_fp32_backend to cut runtime JIT compile cost: #25860
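
The warm-up in #25810 can be pictured with a toy model: a kernel that is compiled lazily once per token-count bucket stalls the first request that lands in each bucket, unless every bucket is touched at startup. The sketch below is an illustration under assumptions. The two environment flag names come from the changelog, while `BucketedKernel`, `warmup_gated_on`, and the bucket sizes are hypothetical and do not reflect SGLang internals.

```python
# Env flag names come from the release notes; "1" is the only value they mention.
GATE_FLAGS = ("SGLANG_OPT_DEEPGEMM_HC_PRENORM", "SGLANG_OPT_USE_TILELANG_MHC_PRE")


def warmup_gated_on(env, hybrid_swa):
    """True when both env flags are "1" and the model uses hybrid SWA."""
    return bool(hybrid_swa) and all(env.get(name) == "1" for name in GATE_FLAGS)


class BucketedKernel:
    """Toy stand-in for a kernel that is JIT-compiled once per token-count bucket."""

    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.compiled = set()

    def bucket_for(self, num_tokens):
        """Smallest bucket that can hold num_tokens."""
        for bucket in self.buckets:
            if num_tokens <= bucket:
                return bucket
        raise ValueError("num_tokens exceeds the largest bucket")

    def forward(self, num_tokens):
        """Return True if this call hit a cold bucket and had to compile."""
        bucket = self.bucket_for(num_tokens)
        cold = bucket not in self.compiled
        if cold:
            self.compiled.add(bucket)  # a real system pays the compile cost here
        return cold

    def warm_all(self):
        """Touch every bucket once so later requests never compile."""
        for bucket in self.buckets:
            self.forward(bucket)


kernel = BucketedKernel([16, 64, 256])  # hypothetical bucket sizes
flags = {name: "1" for name in GATE_FLAGS}
if warmup_gated_on(flags, hybrid_swa=True):
    kernel.warm_all()
print(kernel.forward(40))  # reports whether the first real request was cold
```

Moving the compile cost to startup lengthens boot time but takes the stall out of the request path.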

Dependencies

Use [cu13] extra for nvidia-cutlass-dsl (default to CUDA 13; required for sm_103 / B300): #25576
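
Before upgrading, it can help to record which `nvidia-cutlass-dsl` build is installed and whether the target GPU is the sm_103 part the note refers to. This is a minimal sketch using only the standard library. The distribution name is taken from the changelog entry, and `installed_version` and `is_sm_103` are hypothetical helpers. How the `[cu13]` extra affects the reported version is not stated in the release notes and is not assumed here.

```python
from importlib import metadata


def installed_version(dist_name):
    """Installed version string of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def is_sm_103(capability):
    """True for compute capability (10, 3), the sm_103 target named in the notes."""
    return tuple(capability) == (10, 3)


print("nvidia-cutlass-dsl:", installed_version("nvidia-cutlass-dsl"))
print("sm_103:", is_sm_103((10, 3)))
```

With PyTorch available, the capability tuple could be taken from `torch.cuda.get_device_capability()`; it is passed in explicitly here so the sketch runs without a GPU.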

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.12...v0.5.12.post1

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.