sglang

v0.5.12.post1 Breaking

This release is flagged as breaking. Platform teams should review the changes below before upgrading.

✓ No known CVEs patched
Topics

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

ReleasePort's take

Moderate signal

sglang v0.5.12.post1 resolves DeepSeek V4 (DSV4) crashes in disaggregation decode after ~2000 requests and restores DSV4 HiSparse GSM8K accuracy to 0.960 when `SGLANG_OPT_USE_COMPRESSOR_V2=1` is set.

Why it matters: DSV4 + EAGLE/MTP disaggregation decode no longer hits SWA allocator assertion failures after roughly 2000 requests, and DSV4 HiSparse GSM8K accuracy returns from 0.825 to 0.960 with `SGLANG_OPT_USE_COMPRESSOR_V2=1`.

Summary

AI summary

Stability patch that fixes DeepSeek V4 crashes, restores HiSparse accuracy, resolves PD disaggregation issues, warms MHC token-count buckets at startup, and switches the nvidia-cutlass-dsl dependency to the [cu13] extra.

Changes in this release

Dependency Low

Uses [cu13] extra for nvidia-cutlass-dsl, defaulting to CUDA 13 (required for sm_103 / B300).

Source: granite4.1:30b@2026-05-27-audit

Confidence: low

Bugfix Medium

Restores DSV4 HiSparse GSM8K accuracy from 0.825 to 0.960 when `SGLANG_OPT_USE_COMPRESSOR_V2=1` is enabled.

Source: llm_adapter@2026-05-27

Confidence: high

Performance Medium

Warms MHC token-count buckets at DSV4 startup to eliminate 20–40 s cold-bucket forward stalls. The warm-up is gated on `SGLANG_OPT_DEEPGEMM_HC_PRENORM=1`, `SGLANG_OPT_USE_TILELANG_MHC_PRE=1`, and hybrid SWA.

Source: llm_adapter@2026-05-27

Confidence: high

Performance Low

Precompiles DeepGEMM branch for `_dispatch_bf16_fp32_backend` in DSV4‑Pro to cut runtime JIT compile cost.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Medium

Fixes garbled text in DSV4-Pro single-token decode on B200/B300 by taking the ceiling of activation scales before packing them in the deep_gemm UE8M0 path.

Source: llm_adapter@2026-05-27

Confidence: high
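
As a rough illustration of why the rounding direction can matter in this fix: UE8M0 stores only an exponent, so a scale has to be snapped to a power of two before it is packed. The sketch below is not the deep_gemm or sglang code. The function names, the exponent bias of 127, and the FP8 E4M3 maximum of 448 are assumptions chosen for the example. It shows that rounding the scale down can push a scaled activation past the representable range, while rounding up keeps it in range.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed target format: largest finite FP8 E4M3 value


def pow2_scale(amax: float, round_up: bool) -> float:
    """Snap the ideal scale amax / FP8_E4M3_MAX to a power of two."""
    exact = amax / FP8_E4M3_MAX
    exp = math.log2(exact)
    exp = math.ceil(exp) if round_up else math.floor(exp)
    return 2.0 ** exp


def ue8m0_byte(scale: float) -> int:
    """Encode a power-of-two scale as a biased 8-bit exponent (bias 127 assumed)."""
    return int(math.log2(scale)) + 127


amax = 3.0  # hypothetical largest absolute activation in a block
floor_scale = pow2_scale(amax, round_up=False)
ceil_scale = pow2_scale(amax, round_up=True)

# With the floored scale the largest activation no longer fits in FP8.
overflow_with_floor = amax / floor_scale > FP8_E4M3_MAX
overflow_with_ceil = amax / ceil_scale > FP8_E4M3_MAX
```

In this toy case the floored scale yields 768 after scaling, which exceeds 448, while the ceiled scale yields 384. Whether overflow was the exact failure mode in #25733 is not stated in the release notes.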

Bugfix Medium

Resolves SWA allocator assertion crashes in DSV4 + EAGLE/MTP disaggregation decode after ~2000 requests by fixing stale sliding-window KV page mappings.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Medium

Prevents a scheduler crash at startup when DSV4 NSA prefill context-parallel mode (`--nsa-prefill-cp-mode round-robin-split`) runs with `--disaggregation-mode prefill`.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Medium

Enables DSV4 PD disaggregation to work with pipeline parallelism greater than 1 by removing stale `pp_size=1` assertion.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Medium

Prevents a CUDA illegal memory access during CUDA-graph capture in DSV4-Flash with `--load-format dummy` and FlashInfer mxfp4 by initializing the integer `HashTopK.tid2eid` lookup table, which dummy load previously left uninitialized.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Medium

Corrects stale translation indices in DSV4 HiCache when `SGLANG_OPT_CACHE_SWA_TRANSLATION=1` after a cache rebuild, avoiding OOB writes and wrong outputs.

Source: llm_adapter@2026-05-27

Confidence: high

Bugfix Low

Fixes missing `group` argument in `get_dp_buffer` function.

Source: llm_adapter@2026-05-27

Confidence: high

Full changelog

v0.5.12.post1 is a stability patch on top of v0.5.12. It cherry-picks 12 fixes — primarily for DeepSeek V4 — onto the release branch.

Bug Fixes

DeepSeek V4

  • DSV4-Pro emits garbled text during single-token decode on B200/B300 (fixed by taking the ceiling of activation scales before packing in the deep_gemm UE8M0 scale-packing path): #25733
  • DSV4 + EAGLE/MTP in disaggregation decode crashes around 2000 requests with a SWA allocator assertion (recycled KV pages kept stale sliding-window mappings): #25805
  • DSV4 NSA prefill context-parallel (--enable-nsa-prefill-context-parallel --nsa-prefill-cp-mode round-robin-split) in --disaggregation-mode prefill: scheduler crash at startup: #25396
  • DSV4 HiSparse + SGLANG_OPT_USE_COMPRESSOR_V2=1: GSM8K accuracy restored from 0.825 → 0.960: #25646
  • DSV4 PD disaggregation now works with pipeline parallelism > 1 (removed stale pp_size=1 assertion): #25771
  • DSV4-Flash with --load-format dummy + FlashInfer mxfp4 hits CUDA illegal memory access during CUDA-graph capture (the integer HashTopK.tid2eid lookup table was left uninitialized by dummy load): #25892
  • DSV4 HiCache + SGLANG_OPT_CACHE_SWA_TRANSLATION=1 returns stale translation indices after a cache rebuild, causing OOB writes / wrong outputs: #25889
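
The stale-mapping crash in #25805 can be pictured with a toy page pool. The release notes only say that recycled KV pages kept stale sliding-window mappings, so everything below is a hypothetical sketch: `ToyPagePool`, `swa_map`, and the `clear_mapping` switch are invented for illustration and do not mirror sglang's allocator.

```python
class ToyPagePool:
    """Toy KV page pool with a side table: full page id -> sliding-window page id."""

    def __init__(self, num_pages: int) -> None:
        self.free_pages = list(range(num_pages))
        self.swa_map: dict[int, int] = {}

    def alloc(self, swa_page: int) -> int:
        page = self.free_pages.pop()
        # A recycled page must not carry a mapping left by its previous owner.
        assert page not in self.swa_map, "stale SWA mapping on recycled page"
        self.swa_map[page] = swa_page
        return page

    def free(self, page: int, clear_mapping: bool = True) -> None:
        # clear_mapping=False models the bug: the page is recycled
        # while its old sliding-window mapping stays in the table.
        if clear_mapping:
            self.swa_map.pop(page, None)
        self.free_pages.append(page)


pool = ToyPagePool(num_pages=1)
first = pool.alloc(swa_page=7)
pool.free(first)                # mapping cleared, so the page is safe to reuse
second = pool.alloc(swa_page=9)  # succeeds
```

Freeing with `clear_mapping=False` and allocating again trips the assertion immediately in this toy. In a large pool the same mistake would only surface once enough pages have been recycled, which fits the "around 2000 requests" symptom.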

Disaggregation

  • [PD][NIXL] Always send aux on is_last; only expect state when truthy: #25699

Other

  • Fix missing group arg in get_dp_buffer: #25585

Performance

  • DSV4: warm MHC token-count buckets at startup (gated to SGLANG_OPT_DEEPGEMM_HC_PRENORM=1 + SGLANG_OPT_USE_TILELANG_MHC_PRE=1 + hybrid SWA) to eliminate 20–40s cold-bucket forward stalls: #25810
  • DSV4-Pro: precompile a DeepGEMM branch for _dispatch_bf16_fp32_backend to cut runtime JIT compile cost: #25860
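
The gating for the MHC warm-up in #25810 is a conjunction of three conditions. A minimal sketch of that check follows, assuming the flags are compared against the literal string "1"; the helper name `mhc_warmup_enabled` and the `hybrid_swa` parameter are hypothetical, and sglang's own environment parsing may accept other truthy values.

```python
import os
from typing import Mapping


def mhc_warmup_enabled(env: Mapping[str, str], hybrid_swa: bool) -> bool:
    """Return True only when all three gating conditions from the changelog hold."""
    return (
        env.get("SGLANG_OPT_DEEPGEMM_HC_PRENORM") == "1"
        and env.get("SGLANG_OPT_USE_TILELANG_MHC_PRE") == "1"
        and hybrid_swa
    )


# Check the current process environment, assuming a hybrid SWA model is loaded.
enabled_now = mhc_warmup_enabled(os.environ, hybrid_swa=True)
```

If either variable is unset, or the model does not use hybrid SWA, this check returns False and, per the changelog, the startup warm-up would not run.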

Dependencies

  • Use [cu13] extra for nvidia-cutlass-dsl (default to CUDA 13; required for sm_103 / B300): #25576

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.12...v0.5.12.post1

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.
