Release history
sglang releases
SGLang is a high-performance serving framework for large language models and multimodal models.
- CVE-2026-5760 — fixed in #23660
- Default CUDA version upgraded to 13.0 across sglang, sgl-kernel, and Docker images
- PyTorch upgraded from 2.9 to 2.11
Full changelog
Highlights
- CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11, modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)
- Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062
- Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746
- Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6, with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394
- DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553
- FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796
- LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2, enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381
- Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003
- FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339
New Model Support
Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.
- Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
- GLM-5.1: #22543, #23037 (see cookbook)
- Qwen3.6: #23486 (see cookbook)
- MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
- Ling-2.6-Flash: #23947 (see cookbook)
- Mistral Medium 3.5: see cookbook
- Kimi-K2.6: #23394, #23408 (see cookbook)
- Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
- FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
- FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
- Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
- LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
- Qwen3-ASR (chunk-based streaming): #22073, #22089
- Voxtral (Mistral speech-to-text): #21635
- Parakeet (NVIDIA Nemotron encoder): #23568
- Moss-VL: #23454
- SequenceClassification model architecture (powers the Score API): #22118
- Stable Diffusion 3 medium (Diffusion): #19225
- ERNIE-Image (Diffusion): #22439
- JoyAI-Image-Edit (Diffusion): #22625
Speculative Decoding
- DFLASH speculative decoding initial support: #22077
- DFLASH enabled across additional model backends: #22358
- DFLASH speculative decoding on AMD ROCm: #22342
- Spec V2 enabled by default with overlap scheduling: #21062
- Penalty support for Spec V2 overlap scheduling: #22049
- Adaptive speculative_num_steps for EAGLE topk=1: #21599
- Allow piecewise CUDA graph with speculative decoding: #22128
- Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
- Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
- DFLASH speculative decoding documentation: #23553
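The two counters from the accept_length split can be illustrated with a minimal sketch of greedy draft verification. It assumes the common convention that one verification step emits every accepted draft token plus one token chosen by the target model, so num_accepted_tokens = num_accepted_drafts + 1. The function name count_accepted and that relationship are illustrative assumptions, not taken from #23962.

```python
from typing import List, Tuple


def count_accepted(draft: List[int], target: List[int]) -> Tuple[int, int]:
    """Greedy verification for one speculative step (illustrative only).

    draft:  token ids proposed by the draft model.
    target: token ids the target model picks at the same positions, plus one
            extra position after the last draft token (len(draft) + 1 items).
    Returns (num_accepted_drafts, num_accepted_tokens).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    # Accept draft tokens until the first disagreement with the target model.
    num_accepted_drafts = 0
    for proposed, verified in zip(draft, target):
        if proposed != verified:
            break
        num_accepted_drafts += 1

    # Assumption: the step also emits the target token at the first mismatch
    # (or the extra trailing token if every draft token matched).
    num_accepted_tokens = num_accepted_drafts + 1
    return num_accepted_drafts, num_accepted_tokens


print(count_accepted([5, 9, 2], [5, 9, 7, 1]))  # (2, 3)
```

Under this convention the two numbers always differ by one, which is why a single ambiguous accept_length can be read either way and is worth splitting.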
PD Disaggregation
- Decode-side radix cache support: #19746
- Incremental transfer for Mooncake transfer engine: #24257
- Allow PrefillDelayer in disaggregated-prefill mode: #23588
- NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
- NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
- Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
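As a rough illustration of what a prefix-cache hit means for the decode-side radix cache entry in this section, the sketch below uses a toy token-level trie. It is not SGLang's radix cache: the class names ToyPrefixCache and PrefixNode are hypothetical, and a real radix tree compresses token chains into edges and tracks KV memory, which is omitted here.

```python
from typing import Dict, List


class PrefixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Token-level trie; a real radix tree compresses chains into edges."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        """Record a token sequence whose KV entries are (notionally) cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = ToyPrefixCache()
cache.insert([1, 2, 3, 4, 5])           # e.g. a long shared system prompt
hit = cache.match_prefix([1, 2, 3, 9])  # new request sharing three tokens
print(hit)  # 3
```

The matched length is the part of a request whose KV state need not be recomputed or transferred again, which is where the hit-rate and TTFT savings for long shared prefixes come from.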
Context Parallel & Parallelism
- All-reduce fusion support under CP: #21249
- moe_dp_size = 1 paired with arbitrary attention_cp_size: #22003
- All-reduce fusion enabled for DSA models: #22390
- Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
- Step3p5: optimize all-reduce in MoE layers: #22773
- Pipeline parallelism on Intel XPU: #23472
- OpenTelemetry tracing for pipeline parallelism: #23169
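The reduce_scatterv entry in this section relies on a general identity: an all-reduce followed by each rank keeping only its own slice gives the same result as a single reduce-scatter with variable slice sizes. The sketch below checks that identity in plain Python with lists standing in for per-rank buffers. The helper names and the uneven split are hypothetical, and no real communication library is involved.

```python
from typing import List


def all_reduce(per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the element-wise sum over all ranks."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def scatter_v(full: List[float], counts: List[int], rank: int) -> List[float]:
    """Take this rank's variable-length slice of a full vector."""
    start = sum(counts[:rank])
    return full[start:start + counts[rank]]


def reduce_scatter_v(per_rank: List[List[float]], counts: List[int]) -> List[List[float]]:
    """Each rank receives only the summed slice it owns."""
    out = []
    for rank in range(len(per_rank)):
        start = sum(counts[:rank])
        end = start + counts[rank]
        out.append([sum(buf[i] for buf in per_rank) for i in range(start, end)])
    return out


inputs = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
counts = [2, 3]  # hypothetical uneven split across two ranks

two_step = [scatter_v(buf, counts, r) for r, buf in enumerate(all_reduce(inputs))]
one_step = reduce_scatter_v(inputs, counts)
print(two_step == one_step)  # True
print(one_step)              # [[11.0, 22.0], [33.0, 44.0, 55.0]]
```

In the two-step form every rank materializes the full summed vector before discarding most of it; the one-step form only ever produces the slice each rank needs.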
LoRA
- DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
- Kimi K2 LoRA support: #22381
- LoRADrainer to address high P99 TTFT: #17913
- Decoupled LoRA MoE backend with Marlin support: #21858
- Virtual experts for LoRA MoE (1/n): #22122, #24007
- CSGMV kernel offline auto-tuning: #20391
- Triton sgemm speedup with better grid selection: #22386
- Dual MoE CUDA graph capture for lora/nolora batches: #22809
Performance
- FA3 kernels from the kernel community: #20796
- Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
- Precompute gemma_weight to avoid redundant add on every forward: #22673
- Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
- Skip KV cache in FA backend for embedding mode: #21971
- O(1) RadixKey view for EAGLE bigram key: #23106
- PCG inductor path optimization for FP8 models: #23227
- Combo-kernels for horizontal fusion: #21977
- Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
- Restore torch.compile fusion for topk postprocessing: #21771
- Reduce unnecessary kernels and copies in the NSA indexer: #22232
Observability
- Pending token count surfaced in prefill log and get_load: #22480
- OpenTelemetry tracing for speculative decoding: #19545
- OpenTelemetry tracing for pipeline parallelism: #23169
- OpenTelemetry tracing in DiffGenerator: #21254
- Prometheus metrics endpoint for gRPC mode: #20801
- HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
- Raw KV cache pool token counts as Prometheus gauges: #22726
SGLang-Diffusion
- New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
- ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
- Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
- Disaggregated diffusion: #21701
- Dynamic batching v0: #18764
- CPU platform support for SGLang Diffusion: #20816
- AITER backends in Flux 2 pipeline (AMD): #22802
- LTX-2 feed-forward tensor parallelism optimization: #23221
- In-memory loading for URL/base64 image inputs (default): #23118
- Mixed-resolution benchmark support: #20863
- Auto-enable best parallel setting if unspecified: #22763
AMD
- MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
- Fused QK Gemma norm kernels (4 → fewer kernels): #23575
- Fused all-reduce + RMSNorm simplification: #21986
- GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
- MTP for GLM-5-mxfp4: #23219
- Aiter v0.1.12.post1 upgrade: #22264
- DFLASH speculative decoding enabled on ROCm: #22342
- Fix --page-size > 1 memory access fault with speculative decoding: #23596
NPU / Ascend
- Ascend backend supports Qwen3 MoE attention CP: #21685
- GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
- MTP for Qwen3.5: #20918
- TP communications compression for Qwen3 on NPU: #20520
- Add support-new-models documentation for NPU: #23824
- GGUF quantization for Ascend NPU (dense + MoE): #17883
CPU
- GPTQ / AWQ 4-bit quantization on CPU: #22685
- gemma4_rmsnorm_cpu kernel: #22842
- Qwen3.5 model optimization for CPU: #19484
- Apply routed scaling factor on output for biased grouped topk fusion: #22413
- Fix extend_attention_cpu / flash_attn_varlen_func NaN for large seq: #22434
Quantization
- MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
- NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
- DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
- MXFP8 sm100 path cleanup: #21881
- GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543
Dependencies
- Torch upgraded 2.9 → 2.11: #21247
- Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
- Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
- sgl-kernel bumped to 0.4.1.post1: #23720, #23733
- sgl-kernel bumped to 0.4.2: #24170
- Aiter v0.1.12.post1 (AMD): #22264
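When checking an existing environment against these bumps, note that version strings must be compared numerically: as plain strings, "2.9" sorts after "2.11". The sketch below is a minimal, dependency-free comparison of the numeric release segment only. RELEASE_PINS and the helper names are hypothetical, and suffixes such as .post1 or +cu130 are deliberately ignored, so a full implementation would be better served by packaging.version.Version.

```python
import re
from typing import Dict, Tuple

# Versions named in the Dependencies list of this release.
RELEASE_PINS: Dict[str, str] = {
    "torch": "2.11",
    "cuda": "13.0",
    "flashinfer": "0.6.8.post1",
    "sgl-kernel": "0.4.2",
}


def release_tuple(version: str) -> Tuple[int, ...]:
    """Keep the leading numeric release segment: '2.11.0+cu130' -> (2, 11, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_pin(installed: str, pinned: str) -> bool:
    """True if installed >= pinned, comparing numeric release segments only."""
    a, b = release_tuple(installed), release_tuple(pinned)
    width = max(len(a), len(b))
    # Pad with zeros so that (2, 11) and (2, 11, 0) compare as equal.
    a_padded = a + (0,) * (width - len(a))
    b_padded = b + (0,) * (width - len(b))
    return a_padded >= b_padded


print(meets_pin("2.11.0+cu130", RELEASE_PINS["torch"]))  # True
print(meets_pin("2.9.1", RELEASE_PINS["torch"]))         # False
```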
Security
- Fix for CVE-2026-5760: #23660
- Fix Trivy CVEs and cubin download 403s in Docker image: #22322
All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11
New Contributors
- @AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
- @AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
- @alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
- @AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
- @amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
- @AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
- @Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
- @ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
- @ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
- @charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
- @chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
- @chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
- @ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
- @cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
- @divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
- @dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
- @egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
- @erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
- @fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
- @fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
- @fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
- @gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
- @he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
- @Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
- @icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
- @iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
- @is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
- @JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
- @jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
- @jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
- @JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
- @JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
- @jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
- @kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
- @kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
- @kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
- @KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
- @lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
- @lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
- @lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
- @loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
- @luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
- @mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
- @minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
- @mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
- @mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
- @Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
- @nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
- @officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
- @opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
- @ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
- @RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
- @robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
- @SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
- @Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
- @shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
- @siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
- @stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
- @tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
- @vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
- @WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
- @Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
- @xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
- @yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
- @YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
- @yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
- @Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
- @ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
- @zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
- @zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
- @zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11
Bumps flashinfer from v0.6.7.post2 to v0.6.7.post3 to resolve an issue in its JIT cubin downloader.
- Transformers 5.3.0 upgrade with GLM-5 support on main branch
- Piecewise CUDA graph enabled by default, reducing memory overhead
- Elastic EP for partial failure tolerance in MoE deployments
- Native Anthropic-API-compatible endpoint for seamless integration
- LoRA weight loading overlapped with computation, reducing TTFT by 78%
- TRT-LLM NSA kernel integration for DeepSeek V3.2 with 3-5x speedup
- Fixed urllib and gpgv vulnerabilities
- 1.5x performance improvement for all major diffusion models
- Linear scaling with chunked pipeline parallelism for million-token contexts
- DeepSeek V3.2 optimization with 65% TTFT improvement
- Radix tree cache-aware routing with 10-12x performance improvement
- 99% memory reduction per tree node for cache operations
- JWT/OIDC authentication for enterprise deployment
- Model Gateway v0.3.0 with improved routing and multi-modal support
- Scalable pipeline parallelism with dynamic chunking for ultra-long contexts
- Day 0 support for Mimo-V2-Flash, Nemotron-Nano-v3, LLaDA 2.0, and Qwen-Image models