Release history

sglang releases

SGLang is a high-performance serving framework for large language models and multimodal models.

All releases

No immediate action
v0.5.12.post1 Breaking risk

DeepSeek V4 stability + performance

No immediate action
v0.5.12 Security relevant

DeepSeek V4 support

patches CVE-2023-4863
v0.5.11 Security relevant
Security fixes
  • CVE-2026-5760 — fixed in #23660
Notable features
  • Default CUDA version upgraded to 13.0 across sglang, sgl-kernel, and Docker images
  • PyTorch upgraded from 2.9 to 2.11
Full changelog

Highlights

  • CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11 — modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)

  • Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062

  • Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746

  • Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394

  • DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553

  • FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796

  • LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381

  • Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003

  • FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

  • Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
  • GLM-5.1: #22543, #23037 (see cookbook)
  • Qwen3.6: #23486 (see cookbook)
  • MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
  • Ling-2.6-Flash: #23947 (see cookbook)
  • Mistral Medium 3.5: see cookbook
  • Kimi-K2.6: #23394, #23408 (see cookbook)
  • Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
  • FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
  • FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
  • Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
  • LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
  • Qwen3-ASR (chunk-based streaming): #22073, #22089
  • Voxtral (Mistral speech-to-text): #21635
  • Parakeet (NVIDIA Nemotron encoder): #23568
  • Moss-VL: #23454
  • SequenceClassification model architecture (powers the Score API): #22118
  • Stable Diffusion 3 medium (Diffusion): #19225
  • ERNIE-Image (Diffusion): #22439
  • JoyAI-Image-Edit (Diffusion): #22625

Speculative Decoding

  • DFLASH speculative decoding initial support: #22077
  • DFLASH enabled across additional model backends: #22358
  • DFLASH speculative decoding on AMD ROCm: #22342
  • Spec V2 enabled by default with overlap scheduling: #21062
  • Penalty support for Spec V2 overlap scheduling: #22049
  • Adaptive speculative_num_steps for EAGLE topk=1: #21599
  • Allow piecewise CUDA graph with speculative decoding: #22128
  • Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
  • Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
  • DFLASH speculative decoding documentation: #23553
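
The split of accept_length into num_accepted_drafts and num_accepted_tokens is easier to follow with a toy verifier in hand. The sketch below is not SGLang's code. It assumes greedy verification, and it assumes that num_accepted_tokens counts the accepted drafts plus the one token the target model contributes in every step; the function name verify_greedy and that interpretation are illustrative assumptions, so check #23962 for the actual definitions.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], target: List[int]) -> Tuple[int, int]:
    """Toy greedy verification for one speculative step.

    draft:  tokens proposed by the draft model (length k).
    target: tokens the target model picks at the same positions, plus one
            extra position after the last draft (length k + 1).
    Returns (num_accepted_drafts, num_accepted_tokens), where the second
    value ASSUMES the target's own next token is always counted.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")

    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # the first mismatch ends the accepted prefix
        accepted += 1

    # The target token at index `accepted` is emitted either way: it is the
    # correction after a mismatch, or the bonus token after a full match.
    return accepted, accepted + 1


print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # two drafts accepted, three tokens emitted
```

Under these assumptions the two counters always differ by one, which is why a single accept_length value can be ambiguous about which of the two it reports.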

PD Disaggregation

  • Decode-side radix cache support: #19746
  • Incremental transfer for Mooncake transfer engine: #24257
  • Allow PrefillDelayer in disaggregated-prefill mode: #23588
  • NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
  • NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
  • Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
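
To see why a decode-side prefix cache helps with long shared prefixes, here is a minimal sketch of prefix matching over token IDs. It uses a plain trie instead of a compressed radix tree, stores no KV data, and has no eviction; the class name PrefixCache and its methods are hypothetical and unrelated to SGLang's internal implementation.

```python
from typing import Dict, List


class PrefixCache:
    """Toy token-level prefix cache backed by a plain (uncompressed) trie."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        """Record a token sequence whose KV entries are assumed to be cached."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                # e.g. a shared system prompt
hit = cache.match_prefix([1, 2, 3, 9, 9])  # new request sharing 3 tokens
print(hit)  # 3
```

In this toy model, the matched length is the number of tokens whose attention state would not need to be recomputed or transferred again.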

Context Parallel & Parallelism

  • All-reduce fusion support under CP: #21249
  • moe_dp_size = 1 paired with arbitrary attention_cp_size: #22003
  • All-reduce fusion enabled for DSA models: #22390
  • Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
  • Step3p5: optimize all-reduce in MoE layers: #22773
  • Pipeline parallelism on Intel XPU: #23472
  • OpenTelemetry tracing for pipeline parallelism: #23169

LoRA

  • DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
  • Kimi K2 LoRA support: #22381
  • LoRADrainer to address high P99 TTFT: #17913
  • Decoupled LoRA MoE backend with Marlin support: #21858
  • Virtual experts for LoRA MoE (1/n): #22122, #24007
  • CSGMV kernel offline auto-tuning: #20391
  • Triton sgemm speedup with better grid selection: #22386
  • Dual MoE CUDA graph capture for LoRA / non-LoRA batches: #22809

Performance

  • FA3 kernels from the kernel community: #20796
  • Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
  • Precompute gemma_weight to avoid redundant add on every forward: #22673
  • Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
  • Skip KV cache in FA backend for embedding mode: #21971
  • O(1) RadixKey view for EAGLE bigram key: #23106
  • PCG inductor path optimization for FP8 models: #23227
  • Combo-kernels for horizontal fusion: #21977
  • Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
  • Restore torch.compile fusion for topk postprocessing: #21771
  • Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

  • Pending token count surfaced in prefill log and get_load: #22480
  • OpenTelemetry tracing for speculative decoding: #19545
  • OpenTelemetry tracing for pipeline parallelism: #23169
  • OpenTelemetry tracing in DiffGenerator: #21254
  • Prometheus metrics endpoint for gRPC mode: #20801
  • HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
  • Raw KV cache pool token counts as Prometheus gauges: #22726

SGLang-Diffusion

  • New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
  • ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
  • Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
  • Disaggregated diffusion: #21701
  • Dynamic batching v0: #18764
  • CPU platform support for SGLang Diffusion: #20816
  • AITER backends in Flux 2 pipeline (AMD): #22802
  • LTX-2 feed-forward tensor parallelism optimization: #23221
  • In-memory loading for URL/base64 image inputs (default): #23118
  • Mixed-resolution benchmark support: #20863
  • Auto-enable best parallel setting if unspecified: #22763

AMD

  • MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
  • Fused QK Gemma norm kernels (4 → fewer kernels): #23575
  • Fused all-reduce + RMSNorm simplification: #21986
  • GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
  • MTP for GLM-5-mxfp4: #23219
  • Aiter v0.1.12.post1 upgrade: #22264
  • DFLASH speculative decoding enabled on ROCm: #22342
  • Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

  • Ascend backend supports Qwen3 MoE attention CP: #21685
  • GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
  • MTP for Qwen3.5: #20918
  • TP communications compression for Qwen3 on NPU: #20520
  • Add support-new-models documentation for NPU: #23824
  • GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

  • GPTQ / AWQ 4-bit quantization on CPU: #22685
  • gemma4_rmsnorm_cpu kernel: #22842
  • Qwen3.5 model optimization for CPU: #19484
  • Apply routed scaling factor on output for biased grouped topk fusion: #22413
  • Fix extend_attention_cpu / flash_attn_varlen_func NaN for large seq: #22434

Quantization

  • MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031; a follow-up is forthcoming)
  • NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
  • DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
  • MXFP8 sm100 path cleanup: #21881
  • GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543

Dependencies

  • Torch upgraded 2.9 → 2.11: #21247
  • Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
  • Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
  • sgl-kernel bumped to 0.4.1.post1: #23720, #23733
  • sgl-kernel bumped to 0.4.2: #24170
  • Aiter v0.1.12.post1 (AMD): #22264

Security

  • Fix for CVE-2026-5760: #23660
  • Fix Trivy CVEs and cubin download 403s in Docker image: #22322

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

New Contributors

  • @AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
  • @AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
  • @alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
  • @AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
  • @amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
  • @AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
  • @Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
  • @ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
  • @ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
  • @charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
  • @chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
  • @chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
  • @ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
  • @cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
  • @divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
  • @dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
  • @egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
  • @erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
  • @fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
  • @fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
  • @fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
  • @gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
  • @he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
  • @Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
  • @icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
  • @iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
  • @is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
  • @JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
  • @jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
  • @jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
  • @JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
  • @JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
  • @jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
  • @kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
  • @kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
  • @kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
  • @KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
  • @lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
  • @lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
  • @lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
  • @loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
  • @luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
  • @mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
  • @minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
  • @mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
  • @mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
  • @Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
  • @nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
  • @officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
  • @opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
  • @ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
  • @RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
  • @robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
  • @SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
  • @Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
  • @shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
  • @siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
  • @stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
  • @tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
  • @vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
  • @WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
  • @Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
  • @xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
  • @yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
  • @YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
  • @yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
  • @Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
  • @ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
  • @zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
  • @zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
  • @zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

v0.5.10.post1 Bugfix

Bumps FlashInfer from v0.6.7.post2 to v0.6.7.post3 to fix an issue in its JIT cubin downloader.

v0.5.10 New feature
Notable features
  • Transformers 5.3.0 upgrade with GLM-5 support on main branch
  • Piecewise CUDA graph enabled by default, reducing memory overhead
  • Elastic EP for partial failure tolerance in MoE deployments
v0.5.9 New feature
Notable features
  • Native Anthropic-API-compatible endpoint for seamless integration
  • LoRA weight loading overlapped with computation, reducing TTFT by 78%
  • TRT-LLM NSA kernel integration for DeepSeek V3.2 with 3-5x speedup
v0.5.8 New feature
Security fixes
  • Fixed urllib and gpgv vulnerabilities
Notable features
  • 1.5x performance improvement for all major diffusion models
  • Linear scaling with chunked pipeline parallelism for million-token contexts
  • DeepSeek V3.2 optimization with 65% TTFT improvement
gateway-v0.3.1 New feature
Notable features
  • Radix tree cache-aware routing with 10-12x performance improvement
  • 99% memory reduction per tree node for cache operations
  • JWT/OIDC authentication for enterprise deployment
v0.5.7 New feature
Notable features
  • Model Gateway v0.3.0 with improved routing and multi-modal support
  • Scalable pipeline parallelism with dynamic chunking for ultra-long contexts
  • Day-0 support for MiMo-V2-Flash, Nemotron-Nano-v3, LLaDA 2.0, and Qwen-Image models
