sglang

v0.5.11 Security

This release includes 1 security fix that security teams reviewing exposed deployments should note.

Published 29d ago. Category: Model Serving & MLOps

This release patches 1 known CVE

Topics

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

Summary

AI summary

Fix for CVE-2026-5760.

Full changelog

Highlights

CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11 — modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)
Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062
Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746
Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394
DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553
FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796
LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381
Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003
FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
GLM-5.1: #22543, #23037 (see cookbook)
Qwen3.6: #23486 (see cookbook)
MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
Ling-2.6-Flash: #23947 (see cookbook)
Mistral Medium 3.5: see cookbook
Kimi-K2.6: #23394, #23408 (see cookbook)
Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
Qwen3-ASR (chunk-based streaming): #22073, #22089
Voxtral (Mistral speech-to-text): #21635
Parakeet (NVIDIA Nemotron encoder): #23568
Moss-VL: #23454
SequenceClassification model architecture (powers the Score API): #22118
Stable Diffusion 3 medium (Diffusion): #19225
ERNIE-Image (Diffusion): #22439
JoyAI-Image-Edit (Diffusion): #22625

Speculative Decoding

DFLASH speculative decoding initial support: #22077
DFLASH enabled across additional model backends: #22358
DFLASH speculative decoding on AMD ROCm: #22342
Spec V2 enabled by default with overlap scheduling: #21062
Penalty support for Spec V2 overlap scheduling: #22049
Adaptive speculative_num_steps for EAGLE topk=1: #21599
Allow piecewise CUDA graph with speculative decoding: #22128
Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
DFLASH speculative decoding documentation: #23553
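
The split of accept_length into two counters (#23962) is easier to read next to a small example. The following is a minimal sketch of textbook greedy draft verification, not SGLang's implementation; the function name `verify_greedy` and the exact meaning given to the two counts are assumptions made for illustration. It shows why "drafts accepted" and "tokens emitted" differ: the target model's own token is emitted on every step, even when no draft is accepted.

```python
from typing import List, Tuple


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[List[int], int, int]:
    """Greedy verification of one speculative step (illustrative only).

    draft_tokens:  k tokens proposed by the draft model.
    target_tokens: k + 1 tokens the target model picks greedily, one per
                   draft position plus one after the last draft token.
    Returns (emitted tokens, accepted draft count, emitted token count).
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    # Accept drafts until the first position where draft and target disagree.
    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break
        accepted += 1

    # Keep the accepted drafts, then append the target model's token at the
    # first mismatch (or after the last draft if everything matched).
    emitted = draft_tokens[:accepted] + [target_tokens[accepted]]
    return emitted, accepted, len(emitted)


emitted, accepted_drafts, emitted_tokens = verify_greedy([5, 7, 9], [5, 7, 2, 4])
print(emitted, accepted_drafts, emitted_tokens)
```

Under these assumptions the emitted count is always the accepted draft count plus one, which is one plausible reason to report the two numbers separately.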

PD Disaggregation

Decode-side radix cache support: #19746
Incremental transfer for Mooncake transfer engine: #24257
Allow PrefillDelayer in disaggregated-prefill mode: #23588
NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
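
To illustrate what a prefix cache recovers for long shared prefixes, here is a toy sketch. It uses a plain token-level trie rather than a real radix tree, and the class `PrefixCache` and its methods are hypothetical names, not SGLang APIs. It only shows the idea: tokens covered by a cached prefix do not need to be recomputed.

```python
from typing import Dict, List


class PrefixCache:
    """Toy token-level trie; a real radix cache compresses token runs into edges."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        """Record a token sequence so later requests can reuse its prefix."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: List[int]) -> int:
        """Return the length of the longest cached prefix of `tokens`."""
        node = self.root
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
system_prompt = [101, 7, 7, 42, 9]      # shared prefix, e.g. a system prompt
cache.insert(system_prompt + [3, 4])    # the first request populates the cache

request = system_prompt + [8, 8, 8]     # a later request with the same prefix
hit = cache.match_len(request)
print(f"reuse {hit} tokens, compute {len(request) - hit} new tokens")
```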

Context Parallel & Parallelism

All-reduce fusion support under CP: #21249
moe_dp_size = 1 paired with arbitrary attention_cp_size: #22003
All-reduce fusion enabled for DSA models: #22390
Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
Step3p5: optimize all-reduce in MoE layers: #22773
Pipeline parallelism on Intel XPU: #23472
OpenTelemetry tracing for pipeline parallelism: #23169
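
The replacement of all-reduce + dp_scatter with reduce_scatterv (#22642) relies on a general equivalence between the two collective patterns. The sketch below models that equivalence with plain Python lists; it assumes dp_scatter hands each rank a consecutive, possibly uneven chunk of the reduced buffer, and all function names are illustrative rather than taken from SGLang.

```python
from typing import List


def all_reduce(per_rank: List[List[float]]) -> List[List[float]]:
    """Every rank ends up with the element-wise sum of all ranks' buffers."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def scatter_by_counts(buffer: List[float], counts: List[int]) -> List[List[float]]:
    """Split one buffer into consecutive chunks; chunk i belongs to rank i."""
    chunks, start = [], 0
    for n in counts:
        chunks.append(buffer[start:start + n])
        start += n
    return chunks


def reduce_scatterv(per_rank: List[List[float]], counts: List[int]) -> List[List[float]]:
    """Rank i receives only the summed values of its own chunk."""
    out, start = [], 0
    for n in counts:
        out.append([sum(buf[j] for buf in per_rank) for j in range(start, start + n)])
        start += n
    return out


# Three ranks, each holding a partial result for 6 tokens (one value per token).
partials = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    [100.0, 200.0, 300.0, 400.0, 500.0, 600.0],
]
counts = [1, 3, 2]  # uneven number of tokens owned by each rank

two_step = scatter_by_counts(all_reduce(partials)[0], counts)
one_step = reduce_scatterv(partials, counts)
assert two_step == one_step
print(one_step)
```

In the two-step form every rank receives the full reduced buffer and then discards most of it; in the one-step form each rank receives only its own chunk.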

LoRA

DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
Kimi K2 LoRA support: #22381
LoRADrainer to address high P99 TTFT: #17913
Decoupled LoRA MoE backend with Marlin support: #21858
Virtual experts for LoRA MoE (1/n): #22122, #24007
CSGMV kernel offline auto-tuning: #20391
Triton sgemm speedup with better grid selection: #22386
Dual MoE CUDA graph capture for lora/nolora batches: #22809

Performance

FA3 kernels from the kernel community: #20796
Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
Precompute gemma_weight to avoid redundant add on every forward: #22673
Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
Skip KV cache in FA backend for embedding mode: #21971
O(1) RadixKey view for EAGLE bigram key: #23106
PCG inductor path optimization for FP8 models: #23227
Combo-kernels for horizontal fusion: #21977
Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
Restore torch.compile fusion for topk postprocessing: #21771
Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

Pending token count surfaced in prefill log and get_load: #22480
OpenTelemetry tracing for speculative decoding: #19545
OpenTelemetry tracing for pipeline parallelism: #23169
OpenTelemetry tracing in DiffGenerator: #21254
Prometheus metrics endpoint for gRPC mode: #20801
HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
Raw KV cache pool token counts as Prometheus gauges: #22726
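
As a rough sketch of how raw token-count gauges might be consumed, the snippet below parses a hand-written sample in the Prometheus text exposition format and derives a utilization ratio. The metric names (`example_kv_pool_used_tokens`, `example_kv_pool_total_tokens`) are hypothetical placeholders, not the names SGLang exports, and the parser skips details such as optional timestamps.

```python
from typing import Dict

# Hand-written sample; the metric names are hypothetical placeholders.
SAMPLE = """\
# HELP example_kv_pool_used_tokens Hypothetical gauge: tokens in use.
# TYPE example_kv_pool_used_tokens gauge
example_kv_pool_used_tokens{pool="kv"} 1536
# TYPE example_kv_pool_total_tokens gauge
example_kv_pool_total_tokens{pool="kv"} 8192
"""


def parse_gauges(text: str) -> Dict[str, float]:
    """Parse `name{labels} value` lines, ignoring comments and labels."""
    values: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The sample value is the last space-separated field on the line.
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        values[name] = float(value)
    return values


gauges = parse_gauges(SAMPLE)
utilization = gauges["example_kv_pool_used_tokens"] / gauges["example_kv_pool_total_tokens"]
print(f"KV pool utilization: {utilization:.1%}")
```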

SGLang-Diffusion

New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
Disaggregated diffusion: #21701
Dynamic batching v0: #18764
CPU platform support for SGLang Diffusion: #20816
AITER backends in Flux 2 pipeline (AMD): #22802
LTX-2 feed-forward tensor parallelism optimization: #23221
In-memory loading for URL/base64 image inputs (default): #23118
Mixed-resolution benchmark support: #20863
Auto-enable best parallel setting if unspecified: #22763

AMD

MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
Fused QK Gemma norm kernels (4 → fewer kernels): #23575
Fused all-reduce + RMSNorm simplification: #21986
GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
MTP for GLM-5-mxfp4: #23219
Aiter v0.1.12.post1 upgrade: #22264
DFLASH speculative decoding enabled on ROCm: #22342
Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

Ascend backend supports Qwen3 MoE attention CP: #21685
GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
MTP for Qwen3.5: #20918
TP communications compression for Qwen3 on NPU: #20520
Add support-new-models documentation for NPU: #23824
GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

GPTQ / AWQ 4-bit quantization on CPU: #22685
gemma4_rmsnorm_cpu kernel: #22842
Qwen3.5 model optimization for CPU: #19484
Apply routed scaling factor on output for biased grouped topk fusion: #22413
Fix extend_attention_cpu / flash_attn_varlen_func NaN for large seq: #22434

Quantization

MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
MXFP8 sm100 path cleanup: #21881
GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543

Dependencies

Torch upgraded 2.9 → 2.11: #21247
Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
sgl-kernel bumped to 0.4.1.post1: #23720, #23733
sgl-kernel bumped to 0.4.2: #24170
Aiter v0.1.12.post1 (AMD): #22264

Security

Fix for CVE-2026-5760: #23660
Fix Trivy CVEs and cubin download 403s in Docker image: #22322
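
For teams auditing deployments, a version comparison is often the first check. The sketch below assumes that any release at or above 0.5.11 contains the CVE-2026-5760 fix and that version strings follow the `X.Y.Z` or `X.Y.Z.postN` pattern seen in this changelog; the helper names are illustrative.

```python
import re
from typing import Tuple

FIXED_IN = "0.5.11"  # release whose notes list the CVE-2026-5760 fix


def version_key(version: str) -> Tuple[int, int, int, int]:
    """Map 'X.Y.Z' or 'X.Y.Z.postN' to a sortable tuple; plain releases get post=0."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, post = match.groups()
    return int(major), int(minor), int(patch), int(post or 0)


def includes_fix(installed: str, fixed_in: str = FIXED_IN) -> bool:
    """True if `installed` is at or after the release that carries the fix."""
    return version_key(installed) >= version_key(fixed_in)


for candidate in ["0.5.10.post1", "0.5.11", "v0.5.11"]:
    status = "includes fix" if includes_fix(candidate) else "upgrade needed"
    print(candidate, "->", status)
```

The installed version string could be read with `importlib.metadata.version("sglang")` in the serving environment; builds from source or vendored forks would need a separate check against #23660.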

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

New Contributors

@AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
@AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
@alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
@AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
@amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
@AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
@Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
@ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
@ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
@charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
@chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
@chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
@ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
@cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
@divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
@dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
@egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
@erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
@fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
@fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
@fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
@gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
@he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
@Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
@icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
@iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
@is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
@JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
@jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
@jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
@JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
@JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
@jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
@kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
@kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
@kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
@KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
@lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
@lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
@lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
@loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
@luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
@mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
@minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
@mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
@mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
@Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
@nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
@officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
@opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
@ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
@RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
@robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
@SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
@Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
@shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
@siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
@stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
@tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
@vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
@WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
@Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
@xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
@yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
@YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
@yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
@Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
@ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
@zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
@zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
@zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

Security Fixes

CVE-2026-5760 — fixed in #23660

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.
