v0.5.16 Breaking risk 2d

Broad release touches NPU / Ascend, Speculative Decoding, DeepSeek V4, and Parallelism & Disaggregation.

Full changelog

Highlights

574 PRs from 169 contributors.

DSpark: confidence-driven speculative decoding: A new speculative algorithm. It drafts semi-autoregressively in blocks, then sizes each verify window from the draft's own confidence instead of a fixed draft length. Reaches 383.7 tok/s at accept length ~5 on DeepSeek-V4-Pro, TP8 on B300 (bs=1). Enable with --speculative-algorithm DSPARK and SGLANG_RAGGED_VERIFY_MODE=compact; tune the block with --speculative-dspark-block-size (#30261, #31434, blog).
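
As a rough illustration of the flags named above, the sketch below assembles a DSpark launch command. The flag names and environment variable come from these notes; the model path and the block size of 8 are placeholders, and `dspark_launch` is a hypothetical helper, not part of SGLang.

```python
import os
import shlex


def dspark_launch(model_path, block_size=None):
    """Return (env, argv) for a server launch with DSpark enabled."""
    # DSpark is paired with the compact ragged-verify mode in these notes.
    env = dict(os.environ, SGLANG_RAGGED_VERIFY_MODE="compact")
    argv = [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--speculative-algorithm", "DSPARK",
    ]
    if block_size is not None:
        # Optional tuning knob; omit it to keep the server default.
        argv += ["--speculative-dspark-block-size", str(block_size)]
    return env, argv


# Placeholder model path and block size, for illustration only.
env, argv = dspark_launch("path/to/model", block_size=8)
print("SGLANG_RAGGED_VERIFY_MODE=compact " + shlex.join(argv))
```

Pass the returned `env` and `argv` to `subprocess.run(argv, env=env)` to start the server. Parallelism flags are left out and depend on the deployment.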

Inkling support: A 975B-parameter multimodal MoE with a 1M-token context. It mixes sliding-window, full and Mamba2 linear attention, and adds an NVFP4 MoE, optional vision/audio towers and native MTP. On Blackwell it reaches up to 71.7k tok/s input and 171.0 tok/s per-user decode. Verified on Blackwell TP4/TP8, H200 and AMD MI350X / MI355X (#31681, blog, cookbook).

Other new models added: LongCat 2.0 FP8, JetBrains Mellum v2, Pi0.5, plus diffusion support for LongLive 2.0.

UnifiedRadixTree is now the default for SWA, Mamba and DSA models. Replay SSM and Mamba int8 checkpoints are synced onto it, and a cache hit now resets only the state it used (#30468, #30636, #30626, #31643).

GLM-5.2 DSA cache layer split under prefill CP: KV and indexer cache layers are sharded across CP ranks. Each rank owns a disjoint layer range instead of all layers. That cuts per-rank KV memory by ~74% (0.77 to 0.20 GB/rank) at 8192 tokens on GLM-5.2-FP8, 78 layers, cp_size=4. Enable with --enable-dsa-cache-layer-split, which needs --enable-prefill-cp --cp-strategy interleave (#29421).
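
To see where a figure near 74% can come from, the sketch below splits 78 layers across 4 CP ranks and compares the quoted memory numbers with the best case for an even split. The contiguous, near-even partition in `split_layers` is an assumption made for illustration; the notes only state that each rank owns a disjoint layer range.

```python
def split_layers(num_layers, cp_size):
    """Contiguous, near-even layer ranges, one per CP rank (assumed policy)."""
    base, extra = divmod(num_layers, cp_size)
    ranges, start = [], 0
    for rank in range(cp_size):
        # The first `extra` ranks take one additional layer each.
        count = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + count))
        start += count
    return ranges


ranges = split_layers(78, 4)
for rank, r in enumerate(ranges):
    print(f"rank {rank}: layers {r.start}..{r.stop - 1} ({len(r)} layers)")

# Reduction implied by the 0.77 -> 0.20 GB/rank figures quoted above.
reported_cut = 1 - 0.20 / 0.77
# Best case if per-rank memory scaled purely with the layers a rank owns.
ideal_cut = 1 - max(len(r) for r in ranges) / 78
print(f"reported cut {reported_cut:.1%}, even-split bound {ideal_cut:.1%}")
```

Under this assumed split the largest rank holds 20 of 78 layers, so the bound lands close to the reported reduction.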

ReplaySSM Ring Spec-Verify (GDN): Drops the per-draft SSM snapshot. Speculative scratch goes from 11.5 GB to 1.8 GB per GPU (6.4x smaller) on Qwen3.5-35B-A3B at TP1, at accuracy and throughput parity. Opt in with --enable-gdn-replayssm-spec (default off; GDN with a linear draft chain only, --speculative-eagle-topk in {None, 1}), and tune the ring via --linear-replayssm-cache-len (#28695).
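
The quoted saving and the top-k constraint can be checked with a few lines. This is only a sketch of the rule as stated above; `replayssm_spec_allowed` is a hypothetical helper, not an SGLang function.

```python
def replayssm_spec_allowed(eagle_topk):
    """Mirror the stated constraint: linear draft chain only (top-k None or 1)."""
    return eagle_topk in (None, 1)


before_gb, after_gb = 11.5, 1.8  # per-GPU scratch figures quoted above
ratio = before_gb / after_gb
saved_gb = before_gb - after_gb
print(f"scratch shrinks {ratio:.1f}x, saving {saved_gb:.1f} GB per GPU")

for topk in (None, 1, 4):
    print(f"--speculative-eagle-topk={topk}: allowed={replayssm_spec_allowed(topk)}")
```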

Linear attention on Blackwell (SM100): The first correct KDA MTP path. Its recurrent_kda decode kernel runs at 29.6 us vs 36.8 us for Triton (ncu, B=64). The full decode path is slower below B=128, reaches parity by B=128 and is 1.35x faster at B=256 (#30113). Separately, GDN/KDA CuteDSL prefill fuses state I/O into the chunk-h kernel (#30169).

QServe and FBGEMM FP8 quantization are removed: the experimental QServe (QoQ) W4A8 and FBGEMM FP8 paths are gone. --fp4-gemm-backend cutlass goes too, along with the in-tree NVFP4 JIT kernels, so NVFP4 GEMM now requires FlashInfer (#31109, #30448).

Dependencies: flashinfer 0.6.14 (#29910), CuTe DSL 4.6.0 (#31714), sgl-kernel 0.4.5 (#31496), llguidance 1.7.6 (#31484).
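
One way to confirm an environment matches these pins is to compare installed versions with `importlib.metadata`. The versions are the ones listed above; the distribution names in `EXPECTED` are assumptions about how the packages are published and may need adjusting. The comparison is exact, so a build with a local suffix would be reported as a mismatch.

```python
from importlib import metadata

# Versions from these notes; distribution names are assumed, not confirmed here.
EXPECTED = {
    "flashinfer-python": "0.6.14",
    "nvidia-cutlass-dsl": "4.6.0",
    "sgl-kernel": "0.4.5",
    "llguidance": "1.7.6",
}


def check_pins(expected, lookup=metadata.version):
    """Return {name: (wanted, found)} for each mismatched or missing package."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed under this name
        if found != wanted:
            problems[name] = (wanted, found)
    return problems


for name, (wanted, found) in check_pins(EXPECTED).items():
    print(f"{name}: expected {wanted}, found {found}")
```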

Breaking Changes & Upgrade Notes

The experimental QServe (QoQ) W4A8 and FBGEMM FP8 quantization paths are removed (per #28543): #31109
CUTLASS FP8 blockwise deleted for SM90 / SM100, SM120 moved to JIT: #30438
--fp4-gemm-backend cutlass is removed along with the in-tree NVFP4 JIT kernels, so NVFP4 GEMM now requires FlashInfer. Use auto, which picks flashinfer_cutedsl on SM100 and flashinfer_cutlass on SM120: #30448
UnifiedRadixTree is now the default for SWA, Mamba and DSA models, which is a behavior change on those architectures: #30468
Chunked input-logprob processing is now on by default to cap peak memory: #31498
FA3 sparse mask kernels are off by default: #30356
Legacy Sphinx docs/ removed; the Mintlify cutover is complete: #28964
sglang.kernels namespace: kernels are relocated verbatim and only import paths change; public wrappers keep defaulting to the AOT sgl_kernel backend, so code reaching past them to internal paths must update (RFC #29630): #30044, #31582
num_tokens_per_bs renamed to num_tokens_per_req across spec-decoding runners: #30977
--enable-deepep-waterfill is renamed to --enable-waterfill with no deprecated alias, so existing launch commands fail with unrecognized arguments: #27350
--optimistic-prefill-retries is renamed to --optimistic-prefill-attempts with no deprecated alias: #30951
The SGLang-Diffusion post-training rollout endpoint now returns application/msgpack instead of JSON, with tensors as raw msgpack bytes rather than base64 (tensor_to_base64 / base64_to_tensor become tensor_to_bytes / bytes_to_tensor), so RL rollout consumers must be upgraded in lockstep with the server: #31565
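
For launch scripts that still carry the old spellings, a small rewrite pass can apply the renames and the removed backend value listed above. This is a sketch: `migrate_args` is a hypothetical helper, the value `3` in the usage is a placeholder, and it only renames flags. Whether a renamed flag keeps the same numeric meaning should be checked against the linked PR.

```python
# Flag renames listed in the upgrade notes above.
RENAMED = {
    "--enable-deepep-waterfill": "--enable-waterfill",
    "--optimistic-prefill-retries": "--optimistic-prefill-attempts",
}


def migrate_args(argv):
    """Return (new_argv, notes) with the v0.5.16 renames and removal applied."""
    out, notes = [], []
    i = 0
    while i < len(argv):
        arg = argv[i]
        flag, sep, value = arg.partition("=")
        if flag in RENAMED:
            notes.append(f"{flag} -> {RENAMED[flag]}")
            out.append(RENAMED[flag] + sep + value)
        elif flag == "--fp4-gemm-backend":
            # The value may be attached with "=" or given as the next token.
            if not sep and i + 1 < len(argv):
                i += 1
                value = argv[i]
            if value == "cutlass":
                notes.append("--fp4-gemm-backend cutlass -> auto")
                value = "auto"
            out += [flag, value] if value else [flag]
        else:
            out.append(arg)
        i += 1
    return out, notes


old = ["--enable-deepep-waterfill", "--optimistic-prefill-retries", "3",
       "--fp4-gemm-backend", "cutlass"]
new, notes = migrate_args(old)
print(new)
print(notes)
```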

Known Issues

Temperature-0 nondeterminism under DP attention with breakable prefill CUDA graph. On the DSV4-Flash FP4 recipe, the idle-rank dummy extend introduced by #30898 perturbs real requests' logits, so identical temperature-0 requests can diverge. The guarding determinism test has been disabled as a stopgap and the underlying bug is not fixed (#31125); leaving breakable prefill CUDA graph disabled avoids the path.
A bump to flashinfer 0.6.15 landed and was then reverted this cycle; this release pins 0.6.14 (#31502, #31625).
Mamba track-boundary seqlen under the overlap scheduler was fixed and then reverted (#31369, #31622). The underlying issue is still open.
CPU AMX optimizations for diffusion were reverted (#28527, #30716).
GB300 CI jobs were temporarily disabled for runner availability during this cycle (#31764), so GB300 coverage rests on the cookbook's manual end-to-end validation.

Full release notes by category below.

New Model Support

| Model | Type | PRs | Cookbook |
|---|---|---|---|
| Inkling | autoregressive | #31681 | link |
| LongCat 2.0 | autoregressive | #30275, #30320 | link |
| JetBrains Mellum v2 | autoregressive | #27375 | wip |
| Pi0.5 | vla | #30633 | link |
| LongLive 2.0 | diffusion | #27639 | link |

Landed this cycle but not yet usable end-to-end: MiniMax-M3 completes its four-part landing (#28715, begun in v0.5.14) but its cookbook still points at a dev image (#31819).

Inkling

Add Inkling model support: #31681 ⭐
Add Inkling cookbook: #31360
[Docs] Inkling cookbook: mark B300/GB300 recipes verified, tune B300 MTP mem fractions: #31550
[Cookbook] Inkling: add measured accuracy numbers to benchmark cards: #31823
[Docs] Inkling cookbook: LoRA cells require --disable-prefill-cuda-graph: #31418
Fix dropped Inkling reasoning at stream end: #31787
[Spec] fix inkling multi layer mtp draft extend cuda graph: #32254 (cherry-picked as #32260)

GLM-5.2

[Feature][GLM5.2] Add DSA Cache Layer Split under Prefill CP: #29421 ⭐
support GLM-5.2 MTP index sharing with prefill CP: #30992
[Fix] Stabilize GLM-5.2 MTP IndexShare across PD and CUDA graph replay: #30839
[GLM5][MoE] perf: Write FlashInfer TRT-LLM MoE output directly: #28416
Fix GLM/DeepSeek NVFP4 + flashinfer_trtllm long-context "!!!!" collapse (NaN routing): #31001
[Docs] Update GLM5.2 Cookbook with LayerSplit usage: #31577

DeepSeek V4

[DSA] Integrate Q8KV8 FP8 Sparse MLA Prefill into the DSA Backend (DeepSeek-V3.2): #30514
[DeepSeek-V4] Enable non-paged indexer by default for large prefill chunks: #30140
[Feature] Support DeepSeek-V4 Wint4Abf16 and Wint4Afp8: #25763
[DeepSeek-V4] Support BF16 Compress State for Online C128: #29609
Implement SM120 DeepSeek V4 flashinfer_mxfp4 moe runner backend + TP2: #30272
[DSV4] Remove per-step seqlen D2H from speculative to make overlap scheduler work: #30365
[DSV4] Use BF16 instead of FP32 for indexer score computation: #30012
[DSA] Fix top-k v2 emitting invalid indices under tie overflow / inf scores (IMA in FA3 sparse decode): #30645
[DeepSeek-V4] Fix idle-rank dummy-extend sparse-prefill crash under DP breakable CUDA graph: #31705
Fix nvfp4 online scale with pcg: #32246 (cherry-picked as #32259)
Fix stale flashinfer-MLA fallback poisoning spec verify capture (trtllm_mla + tc_piecewise): #32288 (cherry-picked as #32346)

Speculative Decoding

[Spec] Add DSpark: confidence-scheduled speculative decoding: #30261 ⭐
[GDN] Support ReplaySSM Ring Spec-Verify: #28695 ⭐
fa3/fa4: sync-free for all backends and phases: #29589
fa3: sync-free eagle spec via fixed-window draft-extend metadata: #31364
fa3: build the topk>1 verify replay page table on-device: #31381
flashmla: sync-free spec via device-side draft-extend: #31090
[Spec] DFlash: remove per-step host syncs so the CPU runs a full step ahead (spec-v2 overlap): #31468
[Perf] Cache uniform ragged-verify layout for DSpark verify-all compact: #31434
Support speculative decoding on CPU: #27862

Piecewise & Breakable CUDA Graph

Enable breakable prefill CUDA graph for DP attention: #30898
feat: enable piecewise prefill graph for Kimi K2.5/K2.7: #30889
[Diffusion] Enable breakable CUDA graph (BCG) for diffusion DiTs: #27436

Attention Backends

[KDA] Add FlashInfer SM100 KDA decode + MTP (target_verify) backend: #30113 ⭐
[GDN/KDA] Fuse SM100 CuteDSL prefill state I/O into the chunk h kernel: #30169 ⭐
[GDN] Auto-select FlashInfer GDN prefill on validated SM100 configs: #29734
[Feature] Add FP4 KV Cache Design and support SM120 GPUs: #21601
Fix KDA prefix caching under mamba extra_buffer and enable it for kimi_linear: #31474
Fuse the preprocess kernels of trtllm-gen attention: #29690

MoE & Expert Parallelism

[1/N] elastic-ep: Add runtime EP scale-up: #30164
Support Waterfill with MegaMoE backend: #27350
Support Flashinfer one-sided A2A + CuteDSL MoE for Nemotron Ultra: #28309
Improve EPLB dispatch handling and diagnostics: #30646

Quantization

Remove QServe and FBGEMM FP8 quantization: #31109 ⭐
Delete CUTLASS FP8 blockwise for SM90 and SM100, move SM120 to JIT and add SwapAB: #30438
Refactor FP4 quantization and remove deprecated JIT kernels: #30448
[Quantization] add humming quantization kernel: #23754

Parallelism & Disaggregation

[CP] Migrate MLA prefill CP (DeepSeek V3) to CP-v2 zigzag strategy: #31619
Support MiMo V2.5 with zigzag context parallelism: #29972
Support GPT-OSS zigzag CP with TRTLLM-MHA: #31732
[DCP] Enable decode context parallel for Kimi K2.5 NVFP4: #31514
[PDD] Add true request retraction for PDD: #25372
[PD] Improve optimistic prefill: #30951
[PD] Fix optimistic prefill inflight-queue hangs on parked/aborted reqs: #31075
feat(grpc): support disaggregated generation requests: #30440
[gRPC] Native server: launcher + HTTP + server args wiring (3/4): #23508
feat: add native gRPC sidecar module launcher: #31076 (cherry-picked as #32074)

Scheduler & Runtime

Using UnifiedRadixTree by default for SWA, Mamba, and DSA models: #30468 ⭐
[Feature] Add --default-chat-template-kwargs server arg: #29579
[Scheduler] Add SGLANG_MAX_NEW_TOKENS_LIMIT to cap per-request max_new_tokens: #22591
Support priority request header override: #30811
Align reasoning_effort schema across chat, tokenize, and responses: #31784
Return top-p/top-k sampling mask/nucleus: #27408
[Scheduler] Move the WAR barrier to right after each run_batch launch: #31687
[Fix] Enable chunked input-logprob processing by default to cap peak memory: #31498
[Refactor] Unify logprob results into a single LogprobResult and rename chunk env vars: #31733
[dLLM] Make FDFO a framework capability for all dLLM algorithms: #27551

HiCache & Radix Cache

[HiCache] Add the FlexKV storage connector: --enable-flexkv routes the KV cache through FlexKV's KVManager for host-tier offload, configured via --flexkv-config-file: #29701
[HiCache] Add a client-side metadata cache for the HiCacheFile backend, bypassing directory traversal on lookups (SGLANG_HICACHE_FILE_BACKEND_ENABLE_METADATA_CACHE, off by default): #29716
[HiCache] Optimize L2 mem allocation when cache miss in L3: #19320
[HiCache] Optimize HiCache host pool free-list release: #30658
[UnifiedTree] Sync Replay SSM: #30636
[UnifiedTree] Sync mamba int8 checkpoint: #30626
Reset only the used mamba state on unified radix cache: #31648
Reset only the used mamba state on radix cache hit: #31643

LoRA

[Diffusion] post_training: Add LoRA IPC weight sync via lora_merge mode: #31029
Move LoRA cuda-graph buffers and logging into LoRAManager: #31151

Multimodal

feat: unify multimodal feature transport: #30904
vlm: batch cross-request vit encoding and reuse attention metadata: #24013
[Multimodal] Support n>1 outputs for GLM-Image generation: #31027

Model Support & Optimizations

Add DeepReinforce Ornith-1.0 to cookbook: #29404
Fix MiMo-V2 on Blackwell: FA3 fallback and TP-aware audio weight loading: #31343
Fix Ministral3 accuracy issue by aligning YaRN RoPE scaling with Transformers implementation: #31232
Fix garbage output for bare-tekken Mistral checkpoints (e.g. Leanstral): #30396
[Fix] Map reasoning_effort=low to Nemotron-3 Super low_effort + warn on unsupported levels: #30463

Kernel Library (`sglang.kernels`, RFC #29630)

[Kernel] Introduce sglang.kernels namespace and migrate scattered triton_ops kernels (Phase 2): #30044
[Kernel] Migrate scattered quantization, MoE, srt/layers, generic-attention, DSA/DSV4, linear-attention and vendored fla/mamba kernels (Phase 2.5, 1-7/7): #30784, #30786, #30787, #30789, #30792, #30793, #30795
[Kernel] Decouple KernelBackend from device + device-based CapabilityRequirement: #31292
[Kernel] Fill non-CUDA coverage: HIP (aiter/rocm-triton) + Ascend NPU backends: #31307
[Kernel] Sweep decoupled scattered kernels into sglang.kernels.ops: #31582

SGLang-Diffusion

[Diffusion] model: support fal Ideogram V4 Fast and Instant: #31177
[Diffusion] SGLang backend for GLM Image AR. Step 1 - Separate server: #25381
[Diffusion] Support SP for Krea-2: #29777
[Diffusion] msgpack raw-bytes transport (drop base64/JSON): #31565
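
The size effect of dropping base64 can be shown with the standard library alone. The sketch below packs stand-in float32 data and compares the raw payload with its base64-in-JSON form. It does not use msgpack or the real tensor_to_bytes / bytes_to_tensor helpers, and the `"tensor"` key is an invented field name.

```python
import base64
import json
import struct

# Stand-in for tensor data: 1024 float32 values.
values = [0.0, 0.5, 1.0, 1.5] * 256
raw = struct.pack(f"<{len(values)}f", *values)  # 4 bytes per value

# Old-style transport: base64 text embedded in a JSON document.
b64_json = json.dumps({"tensor": base64.b64encode(raw).decode("ascii")})

overhead = len(b64_json) / len(raw) - 1
print(f"raw bytes: {len(raw)}, base64+JSON: {len(b64_json)} ({overhead:.0%} larger)")

# Raw bytes decode straight back without a base64 pass.
restored = struct.unpack(f"<{len(raw) // 4}f", raw)
```

Base64 encodes every 3 bytes as 4 characters, so the text form is roughly a third larger before any JSON framing is counted.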

AMD / ROCm

[AMD] Reuse fused FP8 KV cache write on standard aiter prefill/decode: #26852
[AMD] Enable mamba-extra-buffer for Qwen3.5 on ROCm: #30359
[AMD] [Fix] Fix --attention-backend triton work for DeepSeek MLA on MI355 (null-K + decode dispatch + RoPE): #30355
[AMD] Fix DeepSeek MLA prefill shape mismatch on HIP eager fallback (missing mha_companion_layers): #31675
[AMD] Remove ROCm page_first+kernel -> layer_first HiCache fallback: #30622
[Fix] fix quickreduce acc error in cudagraph mode: #29508
Fix ROCm fused KV and KDA paths: #31688
cookbook(deepseek-v4): add MORI disagg backend for AMD + bump MI355X image: #30651

NPU / Ascend

[NPU] Add support --pre-warm-nccl: #30312
[NPU] use standalone group for moe ep: #29030
[NPU] Add extra topk_weights input in deepep ll dispatch: #29480
[NPU] Determine the topk norm_type through scoring_func: #31107
[NPU] custom-ops adapt: #30731
[MoE Refactor] [NPU] Refactor Ascend MoE implementation to reduce code duplication and align with community design: #25663
[NPU][Quantization] Add W4A4 MXFP4 quantization support for Qwen3 Dense on Ascend NPU: #23795
[Fix][NPU] Fix/Refactor routed scaling factor application in MoE routing: #31449
[NPU] FIX CMB illusion of garbled characters acc problems, in prefix cache mtp scenarios: #31659

CPU / Intel / XPU

[Intel GPU] DeepSeek V4 5/N, 9/N, 11/N, 12/N, 13/N: move fused indexer RoPE/Hadamard, paged MQA logits, silu_and_mul_clamp and V2 Compressor kernels onto sgl-kernel for XPU: #27873, #28046, #28059, #28428, #28439
[Intel XPU] Enable (biased) grouped topk for xpu: #31126
[XPU] Route topk_sigmoid and topk_softmax to AOT sgl-kernel-xpu symbols: #31038
[CPU] add fused input proj for qwen3.5: #31171
[CPU] improve silu performance by replacing fp32 div with rcp14: #31304
Make UTs compatible for XPU: #27106
[MLX] Honor --max-running-requests in the model runner stub: #30547

Dependencies

[Dep] Upgrade flashinfer to 0.6.14: #29910 ⭐
Bump CuTe DSL to 4.6.0: #31714 ⭐
chore: bump sgl-kernel version to 0.4.5: #31496, #31618
Upgrade llguidance to 1.7.6: #31484

Full Changelog: v0.5.15...v0.5.16

New Contributors

@linhu-nv made their first contribution in https://github.com/sgl-project/sglang/pull/29701
@averyjones4 made their first contribution in https://github.com/sgl-project/sglang/pull/29404
@tyuchn made their first contribution in https://github.com/sgl-project/sglang/pull/29716
@hdt98 made their first contribution in https://github.com/sgl-project/sglang/pull/29275
@connorcarpenter15 made their first contribution in https://github.com/sgl-project/sglang/pull/30440
@wangjiaxin99 made their first contribution in https://github.com/sgl-project/sglang/pull/30265
@htzo made their first contribution in https://github.com/sgl-project/sglang/pull/27862
@ICENacl made their first contribution in https://github.com/sgl-project/sglang/pull/28982
@spandantiwari made their first contribution in https://github.com/sgl-project/sglang/pull/25467
@rwang5203 made their first contribution in https://github.com/sgl-project/sglang/pull/27576
@Junjie650 made their first contribution in https://github.com/sgl-project/sglang/pull/30408
@Hayden727 made their first contribution in https://github.com/sgl-project/sglang/pull/27551
@starkwj made their first contribution in https://github.com/sgl-project/sglang/pull/30747
@ZYHowell made their first contribution in https://github.com/sgl-project/sglang/pull/30828
@yz-wqf made their first contribution in https://github.com/sgl-project/sglang/pull/30846
@auroter made their first contribution in https://github.com/sgl-project/sglang/pull/30331
@N3u0ns made their first contribution in https://github.com/sgl-project/sglang/pull/28113
@jinzhen-lin made their first contribution in https://github.com/sgl-project/sglang/pull/23754
@shadeMe made their first contribution in https://github.com/sgl-project/sglang/pull/27375
@ankith117 made their first contribution in https://github.com/sgl-project/sglang/pull/31143
@hunhokim made their first contribution in https://github.com/sgl-project/sglang/pull/30351
@AuFlow made their first contribution in https://github.com/sgl-project/sglang/pull/30621
@zhihengy made their first contribution in https://github.com/sgl-project/sglang/pull/30036
@jorgeantonio21 made their first contribution in https://github.com/sgl-project/sglang/pull/30182
@sunjiweiswift made their first contribution in https://github.com/sgl-project/sglang/pull/31140
@guzekai01 made their first contribution in https://github.com/sgl-project/sglang/pull/31185
@IzacharyI made their first contribution in https://github.com/sgl-project/sglang/pull/26852
@jojoakm made their first contribution in https://github.com/sgl-project/sglang/pull/30535
@beef9999 made their first contribution in https://github.com/sgl-project/sglang/pull/19320
@nzr-niu made their first contribution in https://github.com/sgl-project/sglang/pull/31174
@Safiullah136 made their first contribution in https://github.com/sgl-project/sglang/pull/31584
@twb1235 made their first contribution in https://github.com/sgl-project/sglang/pull/25213

No immediate action

v0.5.15.post1 Bug fix 12d

NaN output fix

No immediate action

v0.5.15 16d

Routine maintenance and dependency updates.

No immediate action

v0.5.14 New feature 1mo

New models + DeepSeek‑V4 boost + MoE balancing + kernels

No immediate action

v0.5.13 Breaking risk 1mo

Breaking changes — review before upgrading.

No immediate action

v0.5.12.post1 Breaking risk 2mo

DeepSeek V4 stability + performance

No immediate action

v0.5.12 Security relevant 2mo

DeepSeek V4 support

patches CVE-2023-4863

v0.5.11 Security relevant 2mo

Security fixes

CVE-2026-5760 — fixed in #23660

Notable features

Default CUDA version upgraded to 13.0 across sglang, sgl-kernel, and Docker images
PyTorch upgraded from 2.9 to 2.11

Full changelog

Highlights

CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11 — modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)
Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062
Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746
Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394
DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553
FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796
LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381
Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003
FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
GLM-5.1: #22543, #23037 (see cookbook)
Qwen3.6: #23486 (see cookbook)
MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
Ling-2.6-Flash: #23947 (see cookbook)
Mistral Medium 3.5: see cookbook
Kimi-K2.6: #23394, #23408 (see cookbook)
Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
Qwen3-ASR (chunk-based streaming): #22073, #22089
Voxtral (Mistral speech-to-text): #21635
Parakeet (NVIDIA Nemotron encoder): #23568
Moss-VL: #23454
SequenceClassification model architecture (powers the Score API): #22118
Stable Diffusion 3 medium (Diffusion): #19225
ERNIE-Image (Diffusion): #22439
JoyAI-Image-Edit (Diffusion): #22625

Speculative Decoding

DFLASH speculative decoding initial support: #22077
DFLASH enabled across additional model backends: #22358
DFLASH speculative decoding on AMD ROCm: #22342
Spec V2 enabled by default with overlap scheduling: #21062
Penalty support for Spec V2 overlap scheduling: #22049
Adaptive speculative_num_steps for EAGLE topk=1: #21599
Allow piecewise CUDA graph with speculative decoding: #22128
Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
DFLASH speculative decoding documentation: #23553

PD Disaggregation

Decode-side radix cache support: #19746
Incremental transfer for Mooncake transfer engine: #24257
Allow PrefillDelayer in disaggregated-prefill mode: #23588
NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990

Context Parallel & Parallelism

All-reduce fusion support under CP: #21249
moe_dp_size = 1 paired with arbitrary attention_cp_size: #22003
All-reduce fusion enabled for DSA models: #22390
Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
Step3p5: optimize all-reduce in MoE layers: #22773
Pipeline parallelism on Intel XPU: #23472
OpenTelemetry tracing for pipeline parallelism: #23169

LoRA

DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
Kimi K2 LoRA support: #22381
LoRADrainer to address high P99 TTFT: #17913
Decoupled LoRA MoE backend with Marlin support: #21858
Virtual experts for LoRA MoE (1/n): #22122, #24007
CSGMV kernel offline auto-tuning: #20391
Triton sgemm speedup with better grid selection: #22386
Dual MoE CUDA graph capture for lora/nolora batches: #22809

Performance

FA3 kernels from the kernel community: #20796
Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
Precompute gemma_weight to avoid redundant add on every forward: #22673
Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
Skip KV cache in FA backend for embedding mode: #21971
O(1) RadixKey view for EAGLE bigram key: #23106
PCG inductor path optimization for FP8 models: #23227
Combo-kernels for horizontal fusion: #21977
Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
Restore torch.compile fusion for topk postprocessing: #21771
Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

Pending token count surfaced in prefill log and get_load: #22480
OpenTelemetry tracing for speculative decoding: #19545
OpenTelemetry tracing for pipeline parallelism: #23169
OpenTelemetry tracing in DiffGenerator: #21254
Prometheus metrics endpoint for gRPC mode: #20801
HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
Raw KV cache pool token counts as Prometheus gauges: #22726

SGLang-Diffusion

New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
Disaggregated diffusion: #21701
Dynamic batching v0: #18764
CPU platform support for SGLang Diffusion: #20816
AITER backends in Flux 2 pipeline (AMD): #22802
LTX-2 feed-forward tensor parallelism optimization: #23221
In-memory loading for URL/base64 image inputs (default): #23118
Mixed-resolution benchmark support: #20863
Auto-enable best parallel setting if unspecified: #22763

AMD

MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
Fused QK Gemma norm kernels (4 → fewer kernels): #23575
Fused all-reduce + RMSNorm simplification: #21986
GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
MTP for GLM-5-mxfp4: #23219
Aiter v0.1.12.post1 upgrade: #22264
DFLASH speculative decoding enabled on ROCm: #22342
Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

Ascend backend supports Qwen3 MoE attention CP: #21685
GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
MTP for Qwen3.5: #20918
TP communications compression for Qwen3 on NPU: #20520
Add support-new-models documentation for NPU: #23824
GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

GPTQ / AWQ 4-bit quantization on CPU: #22685
gemma4_rmsnorm_cpu kernel: #22842
Qwen3.5 model optimization for CPU: #19484
Apply routed scaling factor on output for biased grouped topk fusion: #22413
Fix extend_attention_cpu / flash_attn_varlen_func NaN for large seq: #22434

Quantization

MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
MXFP8 sm100 path cleanup: #21881
GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543

Dependencies

Torch upgraded 2.9 → 2.11: #21247
Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
sgl-kernel bumped to 0.4.1.post1: #23720, #23733
sgl-kernel bumped to 0.4.2: #24170
Aiter v0.1.12.post1 (AMD): #22264

Security

Fix for CVE-2026-5760: #23660
Fix Trivy CVEs and cubin download 403s in Docker image: #22322

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

New Contributors

@AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
@AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
@alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
@AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
@amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
@AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
@Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
@ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
@ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
@charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
@chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
@chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
@ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
@cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
@divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
@dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
@egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
@erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
@fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
@fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
@fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
@gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
@he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
@Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
@icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
@iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
@is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
@JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
@jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
@jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
@JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
@JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
@jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
@kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
@kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
@kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
@KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
@lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
@lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
@lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
@loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
@luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
@mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
@minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
@mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
@mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
@Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
@nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
@officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
@opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
@ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
@RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
@robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
@SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
@Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
@shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
@siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
@stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
@tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
@vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
@WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
@Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
@xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
@yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
@YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
@yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
@Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
@ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
@zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
@zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
@zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

View release on GitHub

v0.5.10.post1 Bugfix 3mo

Bumps flashinfer from v0.6.7.post2 to v0.6.7.post3 to resolve an issue in its jit cubin downloader.
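
To confirm that an environment actually picked up the fixed flashinfer build, a version check along these lines may help. This is a minimal sketch, not part of SGLang or flashinfer: `parse_version` and `has_fix` are hypothetical helper names, and the distribution name `flashinfer-python` is an assumption about how the package is installed in your environment.

```python
import re
from importlib import metadata


def parse_version(text):
    """Turn 'v0.6.7.post3' into (0, 6, 7, 3); a missing .postN counts as post 0."""
    # Drop any local build suffix such as '+cu128' before matching.
    core = text.strip().split("+", 1)[0]
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", core)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    major, minor, patch, post = match.groups()
    return (int(major), int(minor), int(patch), int(post or 0))


def has_fix(installed, fixed="0.6.7.post3"):
    """Return True if `installed` is at least the release named in these notes."""
    return parse_version(installed) >= parse_version(fixed)


if __name__ == "__main__":
    try:
        # Assumed distribution name; adjust if your install uses a different one.
        installed = metadata.version("flashinfer-python")
    except metadata.PackageNotFoundError:
        print("flashinfer is not installed in this environment")
    else:
        status = "includes the fix" if has_fix(installed) else "older than v0.6.7.post3"
        print(f"flashinfer {installed}: {status}")
```

Tuple comparison keeps the sketch dependency-free; it only understands the `X.Y.Z[.postN]` shape used in this release note, so pre-release or dev version strings raise `ValueError` rather than being silently misordered.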

View release on GitHub

v0.5.10 New feature 3mo

Notable features

Transformers 5.3.0 upgrade with GLM-5 support on main branch
Piecewise CUDA graph enabled by default, reducing memory overhead
Elastic EP for partial failure tolerance in MoE deployments

View release on GitHub

v0.5.9 New feature 5mo

Notable features

Native Anthropic API compatible endpoint for seamless integration
LoRA weight loading overlapped with computation, reducing TTFT by 78%
TRT-LLM NSA kernel integration for DeepSeek V3.2 with 3-5x speedup

View release on GitHub

v0.5.8 New feature 6mo

Security fixes

Fixed urllib and gpgv vulnerabilities

Notable features

1.5x performance improvement for all major diffusion models
Linear scaling with chunked pipeline parallelism for million-token contexts
DeepSeek V3.2 optimization with 65% TTFT improvement

View release on GitHub

gateway-v0.3.1 New feature 6mo

Notable features

Radix tree cache-aware routing with 10-12x performance improvement
99% memory reduction per tree node for cache operations
JWT/OIDC authentication for enterprise deployment

View release on GitHub

v0.5.7 New feature 6mo

Notable features

Model Gateway v0.3.0 with improved routing and multi-modal support
Scalable pipeline parallelism with dynamic chunking for ultra-long contexts
Day 0 support for Mimo-V2-Flash, Nemotron-Nano-v3, LLaDA 2.0, and Qwen-Image models

View release on GitHub

Kernel Library (`sglang.kernels`, RFC #29630)