sglang
Model Serving & MLOps
SGLang is a high-performance serving framework for large language models and multimodal models.
Features
- High‑performance LLM serving infrastructure handling trillions of tokens daily
- Zero‑overhead batch scheduler and cache‑aware load balancer for efficient GPU utilization
- Native support for multiple hardware backends (NVIDIA GPUs, AMD Instinct, TPUs)
- Day‑0 integration with popular open‑source models such as DeepSeek, Llama 3, Mistral Large 3
Security Response History
1 CVE

| CVE | Severity | Disclosed | Patched (this tool) | vs Ecosystem Median |
|---|---|---|---|---|
| CVE-2023-4863 (KEV) | high (CVSS 8.8) | 2023-09-13 | 2026-01-01 | 2y 4mo / median 2y 4mo |
Recent releases
- CVE-2026-5760: fixed in #23660
- Default CUDA version upgraded to 13.0 across sglang, sgl-kernel, and Docker images
- PyTorch upgraded from 2.9 to 2.11
Full changelog
Highlights
- CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11, modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)
- Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062
- Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746
- Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6, with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394
- DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553
- FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796
- LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2, enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381
- Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for `moe_dp_size = 1` paired with arbitrary `attention_cp_size` so MoE and attention parallelism can be tuned independently: #21249, #22003
- FlashInfer CuteDSL MoE Runner Backend: New dedicated `FlashInferCuteDslMoE` layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339
New Model Support
Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.
- Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
- GLM-5.1: #22543, #23037 (see cookbook)
- Qwen3.6: #23486 (see cookbook)
- MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
- Ling-2.6-Flash: #23947 (see cookbook)
- Mistral Medium 3.5: see cookbook
- Kimi-K2.6: #23394, #23408 (see cookbook)
- Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
- FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
- FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
- Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
- LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
- Qwen3-ASR (chunk-based streaming): #22073, #22089
- Voxtral (Mistral speech-to-text): #21635
- Parakeet (NVIDIA Nemotron encoder): #23568
- Moss-VL: #23454
- SequenceClassification model architecture (powers the Score API): #22118
- Stable Diffusion 3 medium (Diffusion): #19225
- ERNIE-Image (Diffusion): #22439
- JoyAI-Image-Edit (Diffusion): #22625
Speculative Decoding
- DFLASH speculative decoding initial support: #22077
- DFLASH enabled across additional model backends: #22358
- DFLASH speculative decoding on AMD ROCm: #22342
- Spec V2 enabled by default with overlap scheduling: #21062
- Penalty support for Spec V2 overlap scheduling: #22049
- Adaptive `speculative_num_steps` for EAGLE topk=1: #21599
- Allow piecewise CUDA graph with speculative decoding: #22128
- Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
- Split `accept_length` into `num_accepted_drafts` / `num_accepted_tokens`: #23962
- DFLASH speculative decoding documentation: #23553
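
To make the two counters from #23962 easier to reason about, here is a minimal toy sketch of greedy draft verification. It is not SGLang code: the function name `verify_greedy` is hypothetical, and the reading that accepted tokens equal accepted drafts plus one token from the target model is an assumption based on how speculative decoding generally works, not something the release notes state.

```python
from typing import List, Tuple


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[int, int]:
    """Toy greedy verification of one speculative step (hypothetical helper).

    draft_tokens: k tokens proposed by the draft model.
    target_tokens: k + 1 tokens the target model would emit at the same
        positions; the last entry is the extra token after all drafts.
    Returns (num_accepted_drafts, num_accepted_tokens).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have one more entry than draft_tokens")

    accepted_drafts = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first mismatch ends the accepted run
        accepted_drafts += 1

    # Assumption: the target model always contributes one token, either the
    # correction at the first mismatch or the extra token after a full accept.
    accepted_tokens = accepted_drafts + 1
    return accepted_drafts, accepted_tokens


if __name__ == "__main__":
    print(verify_greedy([5, 6, 7], [5, 6, 9, 1]))
```

Under this reading, a step that rejects every draft still advances generation by one token, which is why a single "accept length" number can be ambiguous.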
PD Disaggregation
- Decode-side radix cache support: #19746
- Incremental transfer for Mooncake transfer engine: #24257
- Allow `PrefillDelayer` in disaggregated-prefill mode: #23588
- NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
- NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
- Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
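
As background for the decode-side radix cache item, the sketch below shows the basic idea of prefix matching over token IDs. It is a simplified per-token trie, not a compressed radix tree and not SGLang's implementation; `ToyPrefixCache` and its methods are hypothetical names used only for illustration.

```python
from typing import Dict, List


class PrefixNode:
    """One node per token; children are keyed by the next token ID."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Hypothetical prefix cache: remembers which token prefixes were seen."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_prefix(self, tokens: List[int]) -> int:
        """Return how many leading tokens are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.insert([1, 2, 3, 4])
    # Only the tokens after the matched prefix would need fresh KV computation.
    print(cache.match_prefix([1, 2, 3, 9, 9]))
```

A real cache would also store KV block references at each node and evict under memory pressure; the sketch only covers the lookup that determines the hit length.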
Context Parallel & Parallelism
- All-reduce fusion support under CP: #21249
- `moe_dp_size = 1` paired with arbitrary `attention_cp_size`: #22003
- All-reduce fusion enabled for DSA models: #22390
- Replace all-reduce + dp_scatter with `reduce_scatterv` for DP attention: #22642
- Step3p5: optimize all-reduce in MoE layers: #22773
- Pipeline parallelism on Intel XPU: #23472
- OpenTelemetry tracing for pipeline parallelism: #23169
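
The replacement of all-reduce + scatter with a variable-size reduce-scatter rests on the two producing the same per-rank result. The following pure-Python simulation illustrates that equivalence under simple assumptions (sum reduction, contiguous slices); all function names are hypothetical and nothing here calls a real communication library.

```python
from typing import List


def all_reduce_sum(per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def all_reduce_then_scatter(per_rank: List[List[float]], sizes: List[int]) -> List[List[float]]:
    """Two-step path: full all-reduce, then each rank keeps only its slice."""
    reduced = all_reduce_sum(per_rank)
    out = []
    for rank, buf in enumerate(reduced):
        start = sum(sizes[:rank])
        out.append(buf[start:start + sizes[rank]])
    return out


def reduce_scatter_v(per_rank: List[List[float]], sizes: List[int]) -> List[List[float]]:
    """One-step path: each rank receives only the reduced slice it owns."""
    if len(sizes) != len(per_rank) or sum(sizes) != len(per_rank[0]):
        raise ValueError("sizes must have one entry per rank and cover the buffer")
    out = []
    for rank in range(len(per_rank)):
        start = sum(sizes[:rank])
        out.append([sum(buf[i] for buf in per_rank)
                    for i in range(start, start + sizes[rank])])
    return out


if __name__ == "__main__":
    buffers = [[1.0, 2.0, 3.0, 4.0, 5.0], [10.0, 20.0, 30.0, 40.0, 50.0]]
    sizes = [2, 3]  # uneven slice sizes, hence the "v" variant
    print(all_reduce_then_scatter(buffers, sizes))
    print(reduce_scatter_v(buffers, sizes))
```

In the one-step path each rank never materializes the full reduced buffer, which is the usual motivation for preferring it when only a slice is consumed.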
LoRA
- DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
- Kimi K2 LoRA support: #22381
- LoRADrainer to address high P99 TTFT: #17913
- Decoupled LoRA MoE backend with Marlin support: #21858
- Virtual experts for LoRA MoE (1/n): #22122, #24007
- CSGMV kernel offline auto-tuning: #20391
- Triton `sgemm` speedup with better grid selection: #22386
- Dual MoE CUDA graph capture for lora/nolora batches: #22809
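
For readers new to LoRA, the standard low-rank update can be written as y = Wx + (alpha / r) · B(Ax). The sketch below computes that with plain Python lists; it illustrates the general formula only, not how SGLang's kernels or the MLA/MoE paths apply adapters, and `lora_forward` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, x: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Compute W x + (alpha / r) * B (A x).

    W: (out, in) frozen base weight.
    A: (r, in) down-projection, B: (out, r) up-projection.
    """
    r = len(A)  # adapter rank is the number of rows in A
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    A = [[1.0, 1.0]]    # rank 1
    B = [[2.0], [0.0]]
    print(lora_forward(W, A, B, [3.0, 4.0], alpha=1.0))
```

Because the adapter path is two thin matrix products, many adapters can share one base weight, which is what makes batched multi-adapter serving practical.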
Performance
- FA3 kernels from the kernel community: #20796
- Precompute FA3 `scheduler_metadata` to eliminate per-layer prepare cost: #21104
- Precompute `gemma_weight` to avoid redundant add on every forward: #22673
- Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
- Skip KV cache in FA backend for embedding mode: #21971
- O(1) `RadixKey` view for EAGLE bigram key: #23106
- PCG inductor path optimization for FP8 models: #23227
- Combo-kernels for horizontal fusion: #21977
- Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
- Restore torch.compile fusion for topk postprocessing: #21771
- Reduce unnecessary kernels and copies in the NSA indexer: #22232
Observability
- Pending token count surfaced in prefill log and `get_load`: #22480
- OpenTelemetry tracing for speculative decoding: #19545
- OpenTelemetry tracing for pipeline parallelism: #23169
- OpenTelemetry tracing in DiffGenerator: #21254
- Prometheus metrics endpoint for gRPC mode: #20801
- HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
- Raw KV cache pool token counts as Prometheus gauges: #22726
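
Gauges exposed in the Prometheus text format can be consumed with very little code. The sketch below parses a scrape body and derives a utilization ratio. The metric names `toy_kv_pool_used_tokens` and `toy_kv_pool_total_tokens` are hypothetical placeholders, not the names SGLang exports, and the parser assumes samples carry no trailing timestamp.

```python
from typing import Dict


def parse_samples(text: str) -> Dict[str, float]:
    """Parse 'series value' lines, skipping comments and blank lines.

    Assumption: no optional timestamp follows the value.
    """
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # malformed line without a value separator
        samples[series] = float(value)
    return samples


def pool_utilization(samples: Dict[str, float], used_key: str, total_key: str) -> float:
    """Fraction of the pool in use; 0.0 when the total is zero."""
    total = samples[total_key]
    return samples[used_key] / total if total else 0.0


if __name__ == "__main__":
    body = """\
# HELP toy_kv_pool_used_tokens Hypothetical gauge for illustration
# TYPE toy_kv_pool_used_tokens gauge
toy_kv_pool_used_tokens 1024
toy_kv_pool_total_tokens 4096
"""
    parsed = parse_samples(body)
    print(pool_utilization(parsed, "toy_kv_pool_used_tokens", "toy_kv_pool_total_tokens"))
```

In practice a Prometheus server does the scraping and the ratio would be a query expression; the sketch is mainly useful for quick checks against a metrics endpoint.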
SGLang-Diffusion
- New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
- ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
- Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
- Disaggregated diffusion: #21701
- Dynamic batching v0: #18764
- CPU platform support for SGLang Diffusion: #20816
- AITER backends in Flux 2 pipeline (AMD): #22802
- LTX-2 feed-forward tensor parallelism optimization: #23221
- In-memory loading for URL/base64 image inputs (default): #23118
- Mixed-resolution benchmark support: #20863
- Auto-enable best parallel setting if unspecified: #22763
AMD
- MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
- Fused QK Gemma norm kernels (4 → fewer kernels): #23575
- Fused all-reduce + RMSNorm simplification: #21986
- GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
- MTP for GLM-5-mxfp4: #23219
- Aiter v0.1.12.post1 upgrade: #22264
- DFLASH speculative decoding enabled on ROCm: #22342
- Fix `--page-size > 1` memory access fault with speculative decoding: #23596
NPU / Ascend
- Ascend backend supports Qwen3 MoE attention CP: #21685
- GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
- MTP for Qwen3.5: #20918
- TP communications compression for Qwen3 on NPU: #20520
- Add support-new-models documentation for NPU: #23824
- GGUF quantization for Ascend NPU (dense + MoE): #17883
CPU
- GPTQ / AWQ 4-bit quantization on CPU: #22685
- `gemma4_rmsnorm_cpu` kernel: #22842
- Qwen3.5 model optimization for CPU: #19484
- Apply routed scaling factor on output for biased grouped topk fusion: #22413
- Fix `extend_attention_cpu` / `flash_attn_varlen_func` NaN for large seq: #22434
Quantization
- MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
- NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
- DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
- MXFP8 sm100 path cleanup: #21881
- GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543
Dependencies
- Torch upgraded 2.9 → 2.11: #21247
- Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
- Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
- sgl-kernel bumped to 0.4.1.post1: #23720, #23733
- sgl-kernel bumped to 0.4.2: #24170
- Aiter v0.1.12.post1 (AMD): #22264
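
Since this release raises the default CUDA and PyTorch versions, it can help to check an environment before upgrading. The helper below is a small sketch that compares version strings numerically; `parse_version` and `meets_minimum` are hypothetical names, and the minimums passed in are simply the defaults listed above, not enforced requirements confirmed by the release notes.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the leading dotted numbers, ignoring suffixes like '+cu130'."""
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(found: str, minimum: str) -> bool:
    """True when `found` is at least `minimum`, padding missing parts with 0."""
    a, b = parse_version(found), parse_version(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b


if __name__ == "__main__":
    print(meets_minimum("2.11.0+cu130", "2.11"))
    print(meets_minimum("2.9.1", "2.11"))
```

With PyTorch installed, the inputs would typically come from `torch.__version__` and `torch.version.cuda`; the latter can be `None` on CPU-only builds, so guard for that before calling the helper.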
Security
- Fix for CVE-2026-5760: #23660
- Fix Trivy CVEs and cubin download 403s in Docker image: #22322
All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11
New Contributors
- @AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
- @AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
- @alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
- @AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
- @amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
- @AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
- @Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
- @ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
- @ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
- @charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
- @chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
- @chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
- @ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
- @cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
- @divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
- @dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
- @egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
- @erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
- @fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
- @fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
- @fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
- @gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
- @he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
- @Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
- @icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
- @iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
- @is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
- @JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
- @jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
- @jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
- @JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
- @JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
- @jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
- @kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
- @kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
- @kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
- @KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
- @lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
- @lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
- @lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
- @loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
- @luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
- @mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
- @minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
- @mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
- @mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
- @Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
- @nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
- @officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
- @opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
- @ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
- @RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
- @robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
- @SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
- @Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
- @shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
- @siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
- @stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
- @tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
- @vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
- @WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
- @Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
- @xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
- @yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
- @YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
- @yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
- @Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
- @ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
- @zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
- @zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
- @zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454
Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11
Bumps flashinfer from v0.6.7.post2 to v0.6.7.post3 to resolve an issue in its jit cubin downloader.
- Transformers 5.3.0 upgrade with GLM-5 support on main branch
- Piecewise CUDA graph enabled by default reducing memory overhead
- Elastic EP for partial failure tolerance in MoE deployments