v0.5.16 Breaking risk 2d

A broad release touching NPU / Ascend, Speculative Decoding, DeepSeek V4, and Parallelism & Disaggregation.


Highlights

574 PRs from 169 contributors.

DSpark: confidence-driven speculative decoding: A new speculative algorithm. It drafts semi-autoregressively in blocks, then sizes each verify window from the draft's own confidence instead of a fixed draft length. Reaches 383.7 tok/s at accept length ~5 on DeepSeek-V4-Pro, TP8 on B300 (bs=1). Enable with --speculative-algorithm DSPARK and SGLANG_RAGGED_VERIFY_MODE=compact; tune the block with --speculative-dspark-block-size (#30261, #31434, blog).
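
As a minimal sketch of the DSpark opt-in described above, the helper below assembles the environment and argument list for a launch. The flag and environment variable names come from these notes; the `python -m sglang.launch_server --model-path ...` entry point is assumed to be unchanged in this release, and the model path and block size of 8 are placeholders, not recommendations.

```python
import shlex


def dspark_launch(model_path, block_size=None, base_env=None):
    """Return (env, argv) for a server launch with DSpark enabled.

    Hypothetical helper for illustration; it only assembles strings and
    does not start a server.
    """
    env = dict(base_env or {})
    # Environment variable named in the release notes.
    env["SGLANG_RAGGED_VERIFY_MODE"] = "compact"
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--speculative-algorithm", "DSPARK",
    ]
    if block_size is not None:
        # Optional tuning knob; a good value depends on the workload.
        argv += ["--speculative-dspark-block-size", str(block_size)]
    return env, argv


env, argv = dspark_launch("/path/to/model", block_size=8)
print("SGLANG_RAGGED_VERIFY_MODE=" + env["SGLANG_RAGGED_VERIFY_MODE"], shlex.join(argv))
```

Pass `env` merged with `os.environ` and `argv` to `subprocess.run` to actually start the server.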

Inkling support: A 975B-parameter multimodal MoE with a 1M-token context. It mixes sliding-window, full and Mamba2 linear attention, and adds an NVFP4 MoE, optional vision/audio towers and native MTP. On Blackwell it reaches up to 71.7k tok/s input and 171.0 tok/s per-user decode. Verified on Blackwell TP4/TP8, H200 and AMD MI350X / MI355X (#31681, blog, cookbook).

Other new models added: LongCat 2.0 FP8, JetBrains Mellum v2, Pi0.5, plus diffusion support for LongLive 2.0.

UnifiedRadixTree is now the default for SWA, Mamba and DSA models. Replay SSM and Mamba int8 checkpoints are synced onto it, and a cache hit now resets only the state it used (#30468, #30636, #30626, #31643).

GLM-5.2 DSA cache layer split under prefill CP: KV and indexer cache layers are sharded across CP ranks. Each rank owns a disjoint layer range instead of all layers. That cuts per-rank KV memory by ~74% (0.77 to 0.20 GB/rank) at 8192 tokens on GLM-5.2-FP8, 78 layers, cp_size=4. Enable with --enable-dsa-cache-layer-split, which needs --enable-prefill-cp --cp-strategy interleave (#29421).
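
The reported saving is roughly what an even layer split predicts. The sketch below assumes each rank gets a contiguous, near-even range of the 78 layers; the actual assignment in #29421 may differ, so treat this as a sanity check on the numbers rather than a description of the implementation.

```python
def split_layers(num_layers, cp_size):
    """Give each CP rank a contiguous, near-even, disjoint layer range.

    The contiguous layout is an assumption made for illustration.
    """
    base, extra = divmod(num_layers, cp_size)
    ranges, start = [], 0
    for rank in range(cp_size):
        count = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


ranges = split_layers(78, 4)
layers_per_rank = [len(r) for r in ranges]

# The rank holding the most layers bounds per-rank KV memory.
largest_share = max(layers_per_rank) / 78
estimated_gb = 0.77 * largest_share

# Saving implied by the figures quoted in the notes (0.77 -> 0.20 GB/rank).
reported_saving = 1 - 0.20 / 0.77

print(layers_per_rank, round(estimated_gb, 2), round(reported_saving * 100))
```

With 78 layers over 4 ranks, the largest rank holds 20 layers, which lines up with the quoted 0.20 GB/rank and the ~74% reduction.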

ReplaySSM Ring Spec-Verify (GDN): Drops the per-draft SSM snapshot. Speculative scratch goes from 11.5 GB to 1.8 GB per GPU (6.4x smaller) on Qwen3.5-35B-A3B at TP1, at accuracy and throughput parity. Opt in with --enable-gdn-replayssm-spec (default off; GDN with a linear draft chain only, --speculative-eagle-topk in {None, 1}), and tune the ring via --linear-replayssm-cache-len (#28695).

Linear attention on Blackwell (SM100): The first correct KDA MTP path. Its recurrent_kda decode kernel runs at 29.6 us vs 36.8 us for Triton (ncu, B=64). The full decode path is slower at batch sizes below B=128, reaches parity by B=128, and is 1.35x faster at B=256 (#30113). Separately, GDN/KDA CuteDSL prefill fuses state I/O into the chunk-h kernel (#30169).

QServe and FBGEMM FP8 quantization are removed: the experimental QServe (QoQ) W4A8 and FBGEMM FP8 paths are gone. --fp4-gemm-backend cutlass goes too, along with the in-tree NVFP4 JIT kernels, so NVFP4 GEMM now requires FlashInfer (#31109, #30448).

Dependencies: flashinfer 0.6.14 (#29910), CuTe DSL 4.6.0 (#31714), sgl-kernel 0.4.5 (#31496), llguidance 1.7.6 (#31484).

Breaking Changes & Upgrade Notes

The experimental QServe (QoQ) W4A8 and FBGEMM FP8 quantization paths are removed (per #28543): #31109
CUTLASS FP8 blockwise is deleted for SM90 and SM100, and the SM120 path moves to JIT: #30438
--fp4-gemm-backend cutlass is removed along with the in-tree NVFP4 JIT kernels, so NVFP4 GEMM now requires FlashInfer. Use auto, which picks flashinfer_cutedsl on SM100 and flashinfer_cutlass on SM120: #30448
UnifiedRadixTree is now the default for SWA, Mamba and DSA models, which is a behavior change on those architectures: #30468
Chunked input-logprob processing is now on by default to cap peak memory: #31498
FA3 sparse mask kernels are off by default: #30356
The legacy Sphinx docs/ tree is removed; the Mintlify cutover is complete: #28964
sglang.kernels namespace: kernels are relocated verbatim and only import paths change; public wrappers keep defaulting to the AOT sgl_kernel backend, so code reaching past them to internal paths must update (RFC #29630): #30044, #31582
num_tokens_per_bs renamed to num_tokens_per_req across spec-decoding runners: #30977
--enable-deepep-waterfill is renamed to --enable-waterfill with no deprecated alias, so existing launch commands fail with unrecognized arguments: #27350
--optimistic-prefill-retries is renamed to --optimistic-prefill-attempts with no deprecated alias: #30951
The SGLang-Diffusion post-training rollout endpoint now returns application/msgpack instead of JSON, with tensors as raw msgpack bytes rather than base64 (tensor_to_base64 / base64_to_tensor become tensor_to_bytes / bytes_to_tensor), so RL rollout consumers must be upgraded in lockstep with the server: #31565
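
Because the renamed flags have no deprecated aliases, existing launch scripts need rewriting before upgrade. The helper below is a hypothetical migration sketch, not part of SGLang: it renames the two flags listed above and swaps `--fp4-gemm-backend cutlass` for `auto`. It does not check whether the value passed to `--optimistic-prefill-attempts` keeps the same meaning as the old retries count, so review that against #30951.

```python
RENAMED_FLAGS = {
    "--enable-deepep-waterfill": "--enable-waterfill",
    "--optimistic-prefill-retries": "--optimistic-prefill-attempts",
}


def migrate_args(argv):
    """Rewrite a pre-0.5.16 launch argv to the flag names in this release."""
    out = []
    it = iter(argv)
    for arg in it:
        # Split "--flag=value" so both spellings of a flag are handled.
        name, sep, value = arg.partition("=")
        name = RENAMED_FLAGS.get(name, name)
        if name == "--fp4-gemm-backend":
            if not sep:
                value = next(it, None)  # value is the following token
            if value == "cutlass":
                value = "auto"  # the cutlass backend was removed
            out.append(name)  # emitted in two-token form
            if value is not None:
                out.append(value)
        elif sep:
            out.append(f"{name}={value}")
        else:
            out.append(name)
    return out


old = ["--enable-deepep-waterfill", "--optimistic-prefill-retries", "3",
       "--fp4-gemm-backend", "cutlass"]
new = migrate_args(old)
print(new)
```

Arguments that are not listed in the mapping pass through unchanged.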

Known Issues

Temperature-0 nondeterminism under DP attention with breakable prefill CUDA graph. On the DSV4-Flash FP4 recipe, the idle-rank dummy extend introduced by #30898 perturbs real requests' logits, so identical temperature-0 requests can diverge. The guarding determinism test is disabled as a stopgap rather than fixed (#31125); not enabling breakable prefill CUDA graph avoids the path.
A bump to flashinfer 0.6.15 landed and was then reverted this cycle; this release pins 0.6.14 (#31502, #31625).
Mamba track-boundary seqlen under the overlap scheduler was fixed and then reverted (#31369, #31622). The underlying issue is still open.
CPU AMX optimizations for diffusion were reverted (#28527, #30716).
GB300 CI jobs were temporarily disabled for runner availability during this cycle (#31764), so GB300 coverage rests on the cookbook's manual end-to-end validation.

Full release notes by category below.

New Model Support

| Model | Type | PRs | Cookbook |
|---|---|---|---|
| Inkling | autoregressive | #31681 | link |
| LongCat 2.0 | autoregressive | #30275, #30320 | link |
| JetBrains Mellum v2 | autoregressive | #27375 | wip |
| Pi0.5 | vla | #30633 | link |
| LongLive 2.0 | diffusion | #27639 | link |

Landed this cycle but not yet usable end-to-end: MiniMax-M3 completes its four-part landing (#28715, begun in v0.5.14) but its cookbook still points at a dev image (#31819).

Inkling

Add Inkling model support: #31681 ⭐
Add Inkling cookbook: #31360
[Docs] Inkling cookbook: mark B300/GB300 recipes verified, tune B300 MTP mem fractions: #31550
[Cookbook] Inkling: add measured accuracy numbers to benchmark cards: #31823
[Docs] Inkling cookbook: LoRA cells require --disable-prefill-cuda-graph: #31418
Fix dropped Inkling reasoning at stream end: #31787
[Spec] fix inkling multi layer mtp draft extend cuda graph: #32254 (cherry-picked as #32260)

GLM-5.2

[Feature][GLM5.2] Add DSA Cache Layer Split under Prefill CP: #29421 ⭐
support GLM-5.2 MTP index sharing with prefill CP: #30992
[Fix] Stabilize GLM-5.2 MTP IndexShare across PD and CUDA graph replay: #30839
[GLM5][MoE] perf: Write FlashInfer TRT-LLM MoE output directly: #28416
Fix GLM/DeepSeek NVFP4 + flashinfer_trtllm long-context "!!!!" collapse (NaN routing): #31001
[Docs] Update GLM5.2 Cookbook with LayerSplit usage: #31577

DeepSeek V4

[DSA] Integrate Q8KV8 FP8 Sparse MLA Prefill into the DSA Backend (DeepSeek-V3.2): #30514
[DeepSeek-V4] Enable non-paged indexer by default for large prefill chunks: #30140
[Feature] Support DeepSeek-V4 Wint4Abf16 and Win4Afp8: #25763
[DeepSeek-V4] Support BF16 Compress State for Online C128: #29609
Implement SM120 DeepSeek V4 flashinfer_mxfp4 moe runner backend + TP2: #30272
[DSV4] Remove per-step seqlen D2H from speculative to make overlap scheduler work: #30365
[DSV4] Use BF16 instead of FP32 for indexer score computation: #30012
[DSA] Fix top-k v2 emitting invalid indices under tie overflow / inf scores (IMA in FA3 sparse decode): #30645
[DeepSeek-V4] Fix idle-rank dummy-extend sparse-prefill crash under DP breakable CUDA graph: #31705
Fix nvfp4 online scale with pcg: #32246 (cherry-picked as #32259)
Fix stale flashinfer-MLA fallback poisoning spec verify capture (trtllm_mla + tc_piecewise): #32288 (cherry-picked as #32346)

Speculative Decoding

[Spec] Add DSpark: confidence-scheduled speculative decoding: #30261 ⭐
[GDN] Support ReplaySSM Ring Spec-Verify: #28695 ⭐
fa3/fa4: sync-free for all backends and phases: #29589
fa3: sync-free eagle spec via fixed-window draft-extend metadata: #31364
fa3: build the topk>1 verify replay page table on-device: #31381
flashmla: sync-free spec via device-side draft-extend: #31090
[Spec] DFlash: remove per-step host syncs so the CPU runs a full step ahead (spec-v2 overlap): #31468
[Perf] Cache uniform ragged-verify layout for DSpark verify-all compact: #31434
Support speculative decoding on CPU: #27862

Piecewise & Breakable CUDA Graph

Enable breakable prefill CUDA graph for DP attention: #30898
feat: enable piecewise prefill graph for Kimi K2.5/K2.7: #30889
[Diffusion] Enable breakable CUDA graph (BCG) for diffusion DiTs: #27436

Attention Backends

[KDA] Add FlashInfer SM100 KDA decode + MTP (target_verify) backend: #30113 ⭐
[GDN/KDA] Fuse SM100 CuteDSL prefill state I/O into the chunk h kernel: #30169 ⭐
[GDN] Auto-select FlashInfer GDN prefill on validated SM100 configs: #29734
[Feature] Add FP4 KV Cache Design and support SM120 GPUs: #21601
Fix KDA prefix caching under mamba extra_buffer and enable it for kimi_linear: #31474
Fuse the preprocess kernels of trtllm-gen attention: #29690

MoE & Expert Parallelism

[1/N] elastic-ep: Add runtime EP scale-up: #30164
Support Waterfill with MegaMoE backend: #27350
Support Flashinfer one-sided A2A + CuteDSL MoE for Nemotron Ultra: #28309
Improve EPLB dispatch handling and diagnostics: #30646

Quantization

Remove QServe and FBGEMM FP8 quantization: #31109 ⭐
Delete CUTLASS FP8 blockwise for SM90 and SM100, move SM120 to JIT and add SwapAB: #30438
Refactor FP4 quantization and remove deprecated JIT kernels: #30448
[Quantization] add humming quantization kernel: #23754

Parallelism & Disaggregation

[CP] Migrate MLA prefill CP (DeepSeek V3) to CP-v2 zigzag strategy: #31619
Support MiMo V2.5 with zigzag context parallelism: #29972
Support GPT-OSS zigzag CP with TRTLLM-MHA: #31732
[DCP] Enable decode context parallel for Kimi K2.5 NVFP4: #31514
[PDD] Add true request retraction for PDD: #25372
[PD] Improve optimistic prefill: #30951
[PD] Fix optimistic prefill inflight-queue hangs on parked/aborted reqs: #31075
feat(grpc): support disaggregated generation requests: #30440
[gRPC] Native server: launcher + HTTP + server args wiring (3/4): #23508
feat: add native gRPC sidecar module launcher: #31076 (cherry-picked as #32074)

Scheduler & Runtime

Using UnifiedRadixTree by default for SWA, Mamba, and DSA models: #30468 ⭐
[Feature] Add --default-chat-template-kwargs server arg: #29579
[Scheduler] Add SGLANG_MAX_NEW_TOKENS_LIMIT to cap per-request max_new_tokens: #22591
Support priority request header override: #30811
Align reasoning_effort schema across chat, tokenize, and responses: #31784
Return top-p/top-k sampling mask/nucleus: #27408
[Scheduler] Move the WAR barrier to right after each run_batch launch: #31687
[Fix] Enable chunked input-logprob processing by default to cap peak memory: #31498
[Refactor] Unify logprob results into a single LogprobResult and rename chunk env vars: #31733
[dLLM] Make FDFO a framework capability for all dLLM algorithms: #27551
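
The chunked input-logprob change (#31498) caps peak memory by not normalizing the whole prompt's logits at once. The toy sketch below illustrates the general idea in pure Python; the function names and chunk size are hypothetical, and the real implementation operates on GPU tensors with its own chunking controls.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def input_logprobs_chunked(logits, token_ids, chunk_size):
    """Logprob of each input token, normalizing chunk_size rows at a time.

    Only chunk_size x vocab normalized values are alive at once, instead of
    seq_len x vocab when the whole prompt is processed in one pass.
    """
    out = []
    for start in range(0, len(logits), chunk_size):
        rows = [log_softmax(r) for r in logits[start:start + chunk_size]]
        toks = token_ids[start:start + chunk_size]
        out.extend(row[tok] for row, tok in zip(rows, toks))
    return out


logits = [[0.0, 1.0, 2.0], [2.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
token_ids = [2, 0, 1]
one_pass = input_logprobs_chunked(logits, token_ids, chunk_size=len(logits))
chunked = input_logprobs_chunked(logits, token_ids, chunk_size=1)
print(chunked == one_pass)
```

Each row is normalized independently, so chunking changes peak memory but not the resulting logprobs.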

HiCache & Radix Cache

[HiCache] Add the FlexKV storage connector: --enable-flexkv routes the KV cache through FlexKV's KVManager for host-tier offload, configured via --flexkv-config-file: #29701
[HiCache] Add a client-side metadata cache for the HiCacheFile backend, bypassing directory traversal on lookups (SGLANG_HICACHE_FILE_BACKEND_ENABLE_METADATA_CACHE, off by default): #29716
[HiCache] Optimize L2 mem allocation when cache miss in L3: #19320
[HiCache] Optimize HiCache host pool free-list release: #30658
[UnifiedTree] Sync Replay SSM: #30636
[UnifiedTree] Sync mamba int8 checkpoint: #30626
Reset only the used mamba state on unified radix cache: #31648
Reset only the used mamba state on radix cache hit: #31643

LoRA

[Diffusion] post_training: Add LoRA IPC weight sync via lora_merge mode: #31029
Move LoRA cuda-graph buffers and logging into LoRAManager: #31151

Multimodal

feat: unify multimodal feature transport: #30904
vlm: batch cross-request vit encoding and reuse attention metadata: #24013
[Multimodal] Support n>1 outputs for GLM-Image generation: #31027

Model Support & Optimizations

Add DeepReinforce Ornith-1.0 to cookbook: #29404
Fix MiMo-V2 on Blackwell: FA3 fallback and TP-aware audio weight loading: #31343
Fix Ministral3 accuracy issue by aligning YaRN RoPE scaling with Transformers implementation: #31232
Fix garbage output for bare-tekken Mistral checkpoints (e.g. Leanstral): #30396
[Fix] Map reasoning_effort=low to Nemotron-3 Super low_effort + warn on unsupported levels: #30463

Kernel Library (`sglang.kernels`, RFC #29630)

[Kernel] Introduce sglang.kernels namespace and migrate scattered triton_ops kernels (Phase 2): #30044
[Kernel] Migrate scattered quantization, MoE, srt/layers, generic-attention, DSA/DSV4, linear-attention and vendored fla/mamba kernels (Phase 2.5, 1-7/7): #30784, #30786, #30787, #30789, #30792, #30793, #30795
[Kernel] Decouple KernelBackend from device + device-based CapabilityRequirement: #31292
[Kernel] Fill non-CUDA coverage: HIP (aiter/rocm-triton) + Ascend NPU backends: #31307
[Kernel] Sweep decoupled scattered kernels into sglang.kernels.ops: #31582

SGLang-Diffusion

[Diffusion] model: support fal Ideogram V4 Fast and Instant: #31177
[Diffusion] SGLang backend for GLM Image AR. Step 1 - Separate server: #25381
[Diffusion] Support SP for Krea-2: #29777
[Diffusion] msgpack raw-bytes transport (drop base64/JSON): #31565
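
The motivation for dropping base64 in the rollout transport (#31565) can be seen with a small payload. This sketch uses only the standard library and hypothetical helper names; it is not the `tensor_to_bytes` / `bytes_to_tensor` API, whose signatures are not given here, and it leaves out msgpack framing itself.

```python
import base64
import struct


def floats_to_raw(values):
    """Pack float32 values as little-endian raw bytes (hypothetical helper)."""
    return struct.pack(f"<{len(values)}f", *values)


def raw_to_floats(payload):
    """Inverse of floats_to_raw."""
    return list(struct.unpack(f"<{len(payload) // 4}f", payload))


values = [0.5, -1.25, 3.0, 8.0]   # exactly representable in float32
raw = floats_to_raw(values)       # what a binary field would carry
b64 = base64.b64encode(raw)       # what a JSON string field had to carry

print(len(raw), len(b64), raw_to_floats(raw) == values)
```

Base64 expands every 3 bytes into 4 characters, so the raw form is about 25% smaller and skips an encode and decode pass on each side. The cost is that server and consumer must agree on the binary format, which is why the notes call for a lockstep upgrade.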

AMD / ROCm

[AMD] Reuse fused FP8 KV cache write on standard aiter prefill/decode: #26852
[AMD] Enable mamba-extra-buffer for Qwen3.5 on ROCm: #30359
[AMD] [Fix] Fix --attention-backend triton work for DeepSeek MLA on MI355 (null-K + decode dispatch + RoPE): #30355
[AMD] Fix DeepSeek MLA prefill shape mismatch on HIP eager fallback (missing mha_companion_layers): #31675
[AMD] Remove ROCm page_first+kernel -> layer_first HiCache fallback: #30622
[Fix] fix quickreduce acc error in cudagraph mode: #29508
Fix ROCm fused KV and KDA paths: #31688
cookbook(deepseek-v4): add MORI disagg backend for AMD + bump MI355X image: #30651

NPU / Ascend

[NPU] Add support --pre-warm-nccl: #30312
[NPU] use standalone group for moe ep: #29030
[NPU] Add extra topk_weights input in deepep ll dispatch: #29480
[NPU] Determine the topk norm_type through scoring_func: #31107
[NPU] custom-ops adapt: #30731
[MoE Refactor] [NPU] Refactor Ascend MoE implementation to reduce code duplication and align with community design: #25663
[NPU][Quantization] Add W4A4 MXFP4 quantization support for Qwen3 Dense on Ascend NPU: #23795
[Fix][NPU] Fix/Refactor routed scaling factor application in MoE routing: #31449
[NPU] FIX CMB illusion of garbled characters acc problems, in prefix cache mtp scenarios: #31659

CPU / Intel / XPU

[Intel GPU] DeepSeek V4 5/N, 9/N, 11/N, 12/N, 13/N: move fused indexer RoPE/Hadamard, paged MQA logits, silu_and_mul_clamp and V2 Compressor kernels onto sgl-kernel for XPU: #27873, #28046, #28059, #28428, #28439
[Intel XPU] Enable (biased) grouped topk for xpu: #31126
[XPU] Route topk_sigmoid and topk_softmax to AOT sgl-kernel-xpu symbols: #31038
[CPU] add fused input proj for qwen3.5: #31171
[CPU] improve silu performance by replacing fp32 div with rcp14: #31304
Make UTs compatible for XPU: #27106
[MLX] Honor --max-running-requests in the model runner stub: #30547

Dependencies

[Dep] Upgrade flashinfer to 0.6.14: #29910 ⭐
Bump CuTe DSL to 4.6.0: #31714 ⭐
chore: bump sgl-kernel version to 0.4.5: #31496, #31618
Upgrade llguidance to 1.7.6: #31484

Full Changelog: v0.5.15...v0.5.16

New Contributors

@linhu-nv made their first contribution in https://github.com/sgl-project/sglang/pull/29701
@averyjones4 made their first contribution in https://github.com/sgl-project/sglang/pull/29404
@tyuchn made their first contribution in https://github.com/sgl-project/sglang/pull/29716
@hdt98 made their first contribution in https://github.com/sgl-project/sglang/pull/29275
@connorcarpenter15 made their first contribution in https://github.com/sgl-project/sglang/pull/30440
@wangjiaxin99 made their first contribution in https://github.com/sgl-project/sglang/pull/30265
@htzo made their first contribution in https://github.com/sgl-project/sglang/pull/27862
@ICENacl made their first contribution in https://github.com/sgl-project/sglang/pull/28982
@spandantiwari made their first contribution in https://github.com/sgl-project/sglang/pull/25467
@rwang5203 made their first contribution in https://github.com/sgl-project/sglang/pull/27576
@Junjie650 made their first contribution in https://github.com/sgl-project/sglang/pull/30408
@Hayden727 made their first contribution in https://github.com/sgl-project/sglang/pull/27551
@starkwj made their first contribution in https://github.com/sgl-project/sglang/pull/30747
@ZYHowell made their first contribution in https://github.com/sgl-project/sglang/pull/30828
@yz-wqf made their first contribution in https://github.com/sgl-project/sglang/pull/30846
@auroter made their first contribution in https://github.com/sgl-project/sglang/pull/30331
@N3u0ns made their first contribution in https://github.com/sgl-project/sglang/pull/28113
@jinzhen-lin made their first contribution in https://github.com/sgl-project/sglang/pull/23754
@shadeMe made their first contribution in https://github.com/sgl-project/sglang/pull/27375
@ankith117 made their first contribution in https://github.com/sgl-project/sglang/pull/31143
@hunhokim made their first contribution in https://github.com/sgl-project/sglang/pull/30351
@AuFlow made their first contribution in https://github.com/sgl-project/sglang/pull/30621
@zhihengy made their first contribution in https://github.com/sgl-project/sglang/pull/30036
@jorgeantonio21 made their first contribution in https://github.com/sgl-project/sglang/pull/30182
@sunjiweiswift made their first contribution in https://github.com/sgl-project/sglang/pull/31140
@guzekai01 made their first contribution in https://github.com/sgl-project/sglang/pull/31185
@IzacharyI made their first contribution in https://github.com/sgl-project/sglang/pull/26852
@jojoakm made their first contribution in https://github.com/sgl-project/sglang/pull/30535
@beef9999 made their first contribution in https://github.com/sgl-project/sglang/pull/19320
@nzr-niu made their first contribution in https://github.com/sgl-project/sglang/pull/31174
@Safiullah136 made their first contribution in https://github.com/sgl-project/sglang/pull/31584
@twb1235 made their first contribution in https://github.com/sgl-project/sglang/pull/25213


