sglang

v0.5.11 Security

This release includes 1 security fix, relevant to security teams reviewing exposed deployments.

Published 29d ago · Model Serving & MLOps
This release patches 1 known CVE.

Topics

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

Summary

AI summary

Fix for CVE-2026-5760.

Full changelog

Highlights

  • CUDA 13 + Torch 2.11: Default CUDA version moves to 13.0 across SGLang, sgl-kernel, and Docker images, and PyTorch is upgraded from 2.9 to 2.11 — modernizing the build matrix and unlocking newer kernels: #21247, #24162, #24183, #23593 (tracking issue #21498)

  • Speculative Decoding V2 by default: Spec V2 (with overlap scheduling to hide CPU overhead) is now the default, materially reducing per-step CPU cost for EAGLE/MTP/DFLASH paths: #21062

  • Decode Radix Cache for PD Disaggregation: Decode-side prefix caching now works under prefill/decode disaggregation, recovering radix-cache hit rates and TTFT savings for long shared prefixes in disaggregated deployments: #19746

  • Day-0 / New Model Support: Gemma 4, GLM-5.1, Qwen3.6, MiMo-V2.5 / V2.5-Pro, Ling-2.6-Flash, Mistral Medium 3.5, and Kimi-K2.6 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook: #21952, #23808, #23811, #23851, #23947, #23486, #23394

  • DFLASH Speculative Decoding: New high-throughput spec-decode kernel from the kernel community, expanded across model backends and AMD ROCm: #22077, #22358, #22342, #23553

  • FA3 Kernels from the Kernel Community: Drop-in FA3 kernels contributed by the community, integrated alongside FA4 to give users a high-performance option that's easy to maintain: #20796

  • LoRA support for DeepSeek-V3 and Kimi-K2: LoRA now works on the largest MLA-based MoE models, including DeepSeek-V3 MLA LoRA and Kimi K2 — enabling adapter-based fine-tuning of frontier-scale models: #22323, #22381

  • Context Parallel (CP) Enhancements: All-reduce + RMSNorm fusion under CP for end-to-end speedups, plus support for moe_dp_size = 1 paired with arbitrary attention_cp_size so MoE and attention parallelism can be tuned independently: #21249, #22003

  • FlashInfer CuteDSL MoE Runner Backend: New dedicated FlashInferCuteDslMoE layer for the standard FP4 MoE path, giving an additional high-performance fused-MoE option: #21339

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

  • Gemma 4: #21952 (and follow-ups #22079, #24048, #22842; see cookbook)
  • GLM-5.1: #22543, #23037 (see cookbook)
  • Qwen3.6: #23486 (see cookbook)
  • MiMo-V2.5 / MiMo-V2.5-Pro: #23808, #23811, #23851, #23945, #24118 (see cookbook)
  • Ling-2.6-Flash: #23947 (see cookbook)
  • Mistral Medium 3.5: see cookbook
  • Kimi-K2.6: #23394, #23408 (see cookbook)
  • Hunyuan v3 (Tencent, preview): #23533 (see cookbook)
  • FLUX.1-dev ModelOpt NVFP4 (Diffusion): #22672 (see FLUX cookbook)
  • FLUX.2-small-decoder (Diffusion): #22414 (see FLUX cookbook)
  • Qwen Image ModelOpt FP8 (Diffusion): #23155 (see Qwen-Image cookbook)
  • LTX-2.3 / LTX-2.3 two-stage / TI2V (Diffusion): #22182, #22667, #22869 (see LTX cookbook)
  • Qwen3-ASR (chunk-based streaming): #22073, #22089
  • Voxtral (Mistral speech-to-text): #21635
  • Parakeet (NVIDIA Nemotron encoder): #23568
  • Moss-VL: #23454
  • SequenceClassification model architecture (powers the Score API): #22118
  • Stable Diffusion 3 medium (Diffusion): #19225
  • ERNIE-Image (Diffusion): #22439
  • JoyAI-Image-Edit (Diffusion): #22625

Speculative Decoding

  • DFLASH speculative decoding initial support: #22077
  • DFLASH enabled across additional model backends: #22358
  • DFLASH speculative decoding on AMD ROCm: #22342
  • Spec V2 enabled by default with overlap scheduling: #21062
  • Penalty support for Spec V2 overlap scheduling: #22049
  • Adaptive speculative_num_steps for EAGLE topk=1: #21599
  • Allow piecewise CUDA graph with speculative decoding: #22128
  • Eagle3 / DFLASH aux hidden state capture during CUDA graph init fixed: #22836
  • Split accept_length into num_accepted_drafts / num_accepted_tokens: #23962
  • DFLASH speculative decoding documentation: #23553
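
As a rough illustration of the draft-and-verify idea behind the entries above, the toy sketch below counts how many draft tokens a greedy verifier would accept in one step. It is not SGLang's implementation: the function name verify_greedy is hypothetical, and the convention that each step emits the accepted drafts plus one token chosen by the target model is an assumption made for this example. The actual definitions of num_accepted_drafts and num_accepted_tokens are those in #23962.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], target: List[int]) -> Tuple[int, List[int]]:
    """Toy greedy verification for one speculative step (illustrative only).

    draft:  k tokens proposed by the draft model.
    target: k + 1 tokens; target[i] is the token the target model would
            pick at position i, given that draft[:i] was accepted.
    Returns (number of accepted draft tokens, tokens emitted this step).
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must contain exactly one more token than draft")

    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first disagreement: stop accepting drafts
        accepted += 1

    # Assumption: the step emits the accepted drafts plus one target token
    # (the correction at the first mismatch, or a bonus token if all matched).
    emitted = draft[:accepted] + [target[accepted]]
    return accepted, emitted


if __name__ == "__main__":
    n_drafts, tokens = verify_greedy([5, 6, 7], [5, 6, 9, 1])
    print(n_drafts, tokens)
```

Under this toy convention the emitted token count is always the accepted draft count plus one, which is one reason a single "accept length" figure can be ambiguous.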

PD Disaggregation

  • Decode-side radix cache support: #19746
  • Incremental transfer for Mooncake transfer engine: #24257
  • Allow PrefillDelayer in disaggregated-prefill mode: #23588
  • NIXL: heterogeneous TP KV transfer for non-MLA models (Step 1/2 for Qwen3.5): #22145
  • NIXL: Mamba state slice transfer for heterogeneous TP (Step 2/2 for Qwen3.5): #22240
  • Bug fixes for IntraNode NVLink, MTP-layer KV transfer, and disagg-prefill DP rank resolution: #23252, #23539, #22901, #22990
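
To make the benefit of prefix caching for long shared prefixes concrete, here is a minimal sketch of longest-prefix matching over token IDs. It is a toy with one trie node per token rather than a compressed radix tree, and it is not SGLang's cache: the class and method names (ToyPrefixCache, insert, match_len) are hypothetical.

```python
from typing import Dict, List


class _Node:
    """One trie node; children are keyed by token ID."""

    def __init__(self) -> None:
        self.children: Dict[int, "_Node"] = {}


class ToyPrefixCache:
    """Illustrative token-level prefix cache (uncompressed trie)."""

    def __init__(self) -> None:
        self.root = _Node()

    def insert(self, tokens: List[int]) -> None:
        """Record a token sequence whose KV entries are assumed cached."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, _Node())

    def match_len(self, tokens: List[int]) -> int:
        """Return how many leading tokens of `tokens` are already cached."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.insert([1, 2, 3, 4])          # e.g. a shared system prompt
    request = [1, 2, 3, 9, 9]
    hit = cache.match_len(request)
    print(f"{hit} of {len(request)} tokens could reuse cached state")
```

Only the unmatched suffix of a request needs fresh computation in this model, which is where the savings for long shared prefixes come from.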

Context Parallel & Parallelism

  • All-reduce fusion support under CP: #21249
  • moe_dp_size = 1 paired with arbitrary attention_cp_size: #22003
  • All-reduce fusion enabled for DSA models: #22390
  • Replace all-reduce + dp_scatter with reduce_scatterv for DP attention: #22642
  • Step3p5: optimize all-reduce in MoE layers: #22773
  • Pipeline parallelism on Intel XPU: #23472
  • OpenTelemetry tracing for pipeline parallelism: #23169

LoRA

  • DeepSeek-V3 MLA LoRA support and quantization-info refactor: #22323
  • Kimi K2 LoRA support: #22381
  • LoRADrainer to address high P99 TTFT: #17913
  • Decoupled LoRA MoE backend with Marlin support: #21858
  • Virtual experts for LoRA MoE (1/n): #22122, #24007
  • CSGMV kernel offline auto-tuning: #20391
  • Triton sgemm speedup with better grid selection: #22386
  • Dual MoE CUDA graph capture for lora/nolora batches: #22809

Performance

  • FA3 kernels from the kernel community: #20796
  • Precompute FA3 scheduler_metadata to eliminate per-layer prepare cost: #21104
  • Precompute gemma_weight to avoid redundant add on every forward: #22673
  • Eliminate attention DtoD copy by passing pre-allocated output to FA: #21985
  • Skip KV cache in FA backend for embedding mode: #21971
  • O(1) RadixKey view for EAGLE bigram key: #23106
  • PCG inductor path optimization for FP8 models: #23227
  • Combo-kernels for horizontal fusion: #21977
  • Optimize Gemma4 VLM with PCG and fused RMSNorm + residual add + scalar: #24048
  • Restore torch.compile fusion for topk postprocessing: #21771
  • Reduce unnecessary kernels and copies in the NSA indexer: #22232

Observability

  • Pending token count surfaced in prefill log and get_load: #22480
  • OpenTelemetry tracing for speculative decoding: #19545
  • OpenTelemetry tracing for pipeline parallelism: #23169
  • OpenTelemetry tracing in DiffGenerator: #21254
  • Prometheus metrics endpoint for gRPC mode: #20801
  • HTTP sidecar endpoints and FlushCache gRPC RPC for gRPC mode: #22500
  • Raw KV cache pool token counts as Prometheus gauges: #22726
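
For readers who want to inspect gauges such as the KV cache pool counts, the sketch below parses samples from the Prometheus text exposition format. It makes no claim about SGLang's endpoint paths or metric names: the names in SAMPLE (example_kv_pool_used_tokens, example_kv_pool_free_tokens) and the helper parse_samples are hypothetical, and the parser handles only simple lines, not every corner of the format.

```python
import re
from typing import Dict

# Simplified sample line: name, optional {labels}, value, optional timestamp.
_LINE_RE = re.compile(
    r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)(?:\s+-?\d+)?$"
)

# Hypothetical metric names, used purely for illustration.
SAMPLE = """\
# HELP example_kv_pool_used_tokens Hypothetical gauge for illustration.
# TYPE example_kv_pool_used_tokens gauge
example_kv_pool_used_tokens 1024
example_kv_pool_free_tokens{pool="decode"} 3072.0
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Map 'name{labels}' to its float value, skipping comments and blanks."""
    samples: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no sample
        match = _LINE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot handle
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        try:
            samples[name + labels] = float(value)
        except ValueError:
            continue  # malformed value: skip rather than fail
    return samples


if __name__ == "__main__":
    for key, value in parse_samples(SAMPLE).items():
        print(key, value)
```

In practice the text would come from scraping the server's metrics endpoint; a production setup would normally rely on a Prometheus server or an official client library instead of hand parsing.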

SGLang-Diffusion

  • New model support: LTX-2.3 (#22182, #22667, #22869), ERNIE-Image (#22439), FLUX.2-small-decoder (#22414), JoyAI-Image-Edit (#22625), FLUX.1-dev ModelOpt NVFP4 (#22672), Qwen Image ModelOpt FP8 (#23155), Stable Diffusion 3 medium (#19225)
  • ModelOpt diffusion FP8 support for Flux1/Flux2 and Wan2.2: #22365
  • Standalone Rollout API + Denoising Environment Backpass + SP-Aligned Log-Prob for T2I post-training: #22604
  • Disaggregated diffusion: #21701
  • Dynamic batching v0: #18764
  • CPU platform support for SGLang Diffusion: #20816
  • AITER backends in Flux 2 pipeline (AMD): #22802
  • LTX-2 feed-forward tensor parallelism optimization: #23221
  • In-memory loading for URL/base64 image inputs (default): #23118
  • Mixed-resolution benchmark support: #20863
  • Auto-enable best parallel setting if unspecified: #22763

AMD

  • MiniMax-M2.5 optimizations (aiter biased grouped topk; fused FP8 KV cache write): #23611, #23620
  • Fused QK Gemma norm kernels (4 → fewer kernels): #23575
  • Fused all-reduce + RMSNorm simplification: #21986
  • GLM-5 / GLM-5.1 MXFP4 nightly accuracy + perf benchmarks (MI30x / MI35x): #21773, #22336
  • MTP for GLM-5-mxfp4: #23219
  • Aiter v0.1.12.post1 upgrade: #22264
  • DFLASH speculative decoding enabled on ROCm: #22342
  • Fix --page-size > 1 memory access fault with speculative decoding: #23596

NPU / Ascend

  • Ascend backend supports Qwen3 MoE attention CP: #21685
  • GLM-4.5V and GLM-4.7-Flash NPU support / fixes: #22961, #22509
  • MTP for Qwen3.5: #20918
  • TP communications compression for Qwen3 on NPU: #20520
  • Add support-new-models documentation for NPU: #23824
  • GGUF quantization for Ascend NPU (dense + MoE): #17883

CPU

  • GPTQ / AWQ 4-bit quantization on CPU: #22685
  • gemma4_rmsnorm_cpu kernel: #22842
  • Qwen3.5 model optimization for CPU: #19484
  • Apply routed scaling factor on output for biased grouped topk fusion: #22413
  • Fix extend_attention_cpu / flash_attn_varlen_func NaN for large seq: #22434

Quantization

  • MXFP4 quantized dense models on AMD CDNA2/CDNA3 GPUs: #19143 (later reverted in #23031, follow-up forthcoming)
  • NVFP4 KV cache: quantization strategy abstraction and kernel: #21954
  • DeepSeek-R1-0528-w4a8 + DeepEP Low-Latency FP8 dispatch: #22316
  • MXFP8 sm100 path cleanup: #21881
  • GLM-5/5.1 MXFP4 checkpoint inference compatibility fix: #22543

Dependencies

  • Torch upgraded 2.9 → 2.11: #21247
  • Default CUDA bumped to 13.0 across sglang, sgl-kernel, and Docker images: #21498 (tracking), #24162, #24183, #23593, #23119
  • Flashinfer 0.6.7.post2 → 0.6.8.post1: #23281
  • sgl-kernel bumped to 0.4.1.post1: #23720, #23733
  • sgl-kernel bumped to 0.4.2: #24170
  • Aiter v0.1.12.post1 (AMD): #22264

Security

  • Fix for CVE-2026-5760: #23660
  • Fix Trivy CVEs and cubin download 403s in Docker image: #22322
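
Since the CVE-2026-5760 fix ships in v0.5.11, a quick first check is whether an environment's installed sglang version is at least that release. The sketch below is one way to do this with the standard library only; the helper names (release_tuple, includes_fix, installed_sglang_version) are hypothetical, the parser looks only at the leading numeric components (so pre-release suffixes are not handled), and a passing version check alone does not establish that a deployment was never exposed.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Release that contains the fix, per the notes above.
FIXED_IN: Tuple[int, ...] = (0, 5, 11)


def release_tuple(version: str) -> Tuple[int, ...]:
    """Extract leading numeric components, e.g. '0.5.10.post1' -> (0, 5, 10)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def includes_fix(version: str) -> bool:
    """True if the numeric release is at or after v0.5.11."""
    return release_tuple(version) >= FIXED_IN


def installed_sglang_version() -> Optional[str]:
    """Return the installed sglang version, or None if it is not installed."""
    try:
        return metadata.version("sglang")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_sglang_version()
    if found is None:
        print("sglang is not installed in this environment")
    elif includes_fix(found):
        print(f"sglang {found}: at or after v0.5.11")
    else:
        print(f"sglang {found}: older than v0.5.11, consider upgrading")
```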

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.10.post1...v0.5.11

New Contributors

  • @AethoceSora made their first contribution in https://github.com/sgl-project/sglang/pull/23426
  • @AlbeeSo made their first contribution in https://github.com/sgl-project/sglang/pull/23710
  • @alec-flowers made their first contribution in https://github.com/sgl-project/sglang/pull/24090
  • @AlonKejzman made their first contribution in https://github.com/sgl-project/sglang/pull/23753
  • @amacaskill made their first contribution in https://github.com/sgl-project/sglang/pull/22537
  • @AndyLi429 made their first contribution in https://github.com/sgl-project/sglang/pull/21685
  • @Baichuan7 made their first contribution in https://github.com/sgl-project/sglang/pull/23060
  • @ccullen-cert made their first contribution in https://github.com/sgl-project/sglang/pull/23660
  • @ChangLiu0709 made their first contribution in https://github.com/sgl-project/sglang/pull/22908
  • @charlotte12l made their first contribution in https://github.com/sgl-project/sglang/pull/21983
  • @chenkaiyue made their first contribution in https://github.com/sgl-project/sglang/pull/17195
  • @chx96642264 made their first contribution in https://github.com/sgl-project/sglang/pull/22705
  • @ColinZ22 made their first contribution in https://github.com/sgl-project/sglang/pull/22543
  • @cyyc0310 made their first contribution in https://github.com/sgl-project/sglang/pull/22920
  • @divyamagrawal06 made their first contribution in https://github.com/sgl-project/sglang/pull/23325
  • @dyhsup made their first contribution in https://github.com/sgl-project/sglang/pull/22439
  • @egvenediktov made their first contribution in https://github.com/sgl-project/sglang/pull/20520
  • @erikwijmans made their first contribution in https://github.com/sgl-project/sglang/pull/21974
  • @fengli1702 made their first contribution in https://github.com/sgl-project/sglang/pull/19143
  • @fergusfinn made their first contribution in https://github.com/sgl-project/sglang/pull/21035
  • @fortunecookiee made their first contribution in https://github.com/sgl-project/sglang/pull/20960
  • @gxlvera made their first contribution in https://github.com/sgl-project/sglang/pull/19225
  • @he-yufeng made their first contribution in https://github.com/sgl-project/sglang/pull/20739
  • @Henson-Zh-Ali made their first contribution in https://github.com/sgl-project/sglang/pull/20522
  • @icepoint666 made their first contribution in https://github.com/sgl-project/sglang/pull/22592
  • @iridiumine made their first contribution in https://github.com/sgl-project/sglang/pull/20918
  • @is-not made their first contribution in https://github.com/sgl-project/sglang/pull/18349
  • @JasonHe-WQ made their first contribution in https://github.com/sgl-project/sglang/pull/21944
  • @jh-nv made their first contribution in https://github.com/sgl-project/sglang/pull/21254
  • @jiangyinzuo made their first contribution in https://github.com/sgl-project/sglang/pull/23169
  • @JieTang66 made their first contribution in https://github.com/sgl-project/sglang/pull/23983
  • @JoyFuture made their first contribution in https://github.com/sgl-project/sglang/pull/23808
  • @jthakurH made their first contribution in https://github.com/sgl-project/sglang/pull/16793
  • @kangyifei made their first contribution in https://github.com/sgl-project/sglang/pull/23241
  • @kingkingleeljj made their first contribution in https://github.com/sgl-project/sglang/pull/20967
  • @kkyyxhll made their first contribution in https://github.com/sgl-project/sglang/pull/23062
  • @KrishnanPrash made their first contribution in https://github.com/sgl-project/sglang/pull/22175
  • @lahmuller made their first contribution in https://github.com/sgl-project/sglang/pull/22625
  • @lixuwei2333 made their first contribution in https://github.com/sgl-project/sglang/pull/22247
  • @lkhl made their first contribution in https://github.com/sgl-project/sglang/pull/22431
  • @loading66 made their first contribution in https://github.com/sgl-project/sglang/pull/22700
  • @luccafong made their first contribution in https://github.com/sgl-project/sglang/pull/24165
  • @mingyue300 made their first contribution in https://github.com/sgl-project/sglang/pull/21723
  • @minosfuture made their first contribution in https://github.com/sgl-project/sglang/pull/23419
  • @mispa-ms made their first contribution in https://github.com/sgl-project/sglang/pull/23097
  • @mlleo made their first contribution in https://github.com/sgl-project/sglang/pull/23537
  • @Napkin-AI made their first contribution in https://github.com/sgl-project/sglang/pull/23572
  • @nvpohanh made their first contribution in https://github.com/sgl-project/sglang/pull/22852
  • @officialasishkumar made their first contribution in https://github.com/sgl-project/sglang/pull/22600
  • @opherlieber made their first contribution in https://github.com/sgl-project/sglang/pull/22547
  • @ranjiewen made their first contribution in https://github.com/sgl-project/sglang/pull/21698
  • @RichardoMrMu made their first contribution in https://github.com/sgl-project/sglang/pull/19545
  • @robellliu-dev made their first contribution in https://github.com/sgl-project/sglang/pull/20835
  • @SammLSH made their first contribution in https://github.com/sgl-project/sglang/pull/22089
  • @Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
  • @shenxiul made their first contribution in https://github.com/sgl-project/sglang/pull/23327
  • @siju-samuel made their first contribution in https://github.com/sgl-project/sglang/pull/23472
  • @stepinto made their first contribution in https://github.com/sgl-project/sglang/pull/23478
  • @tfhddd made their first contribution in https://github.com/sgl-project/sglang/pull/22029
  • @vvagaytsev made their first contribution in https://github.com/sgl-project/sglang/pull/22363
  • @WangHao-hw made their first contribution in https://github.com/sgl-project/sglang/pull/22778
  • @Wen-xuan-Xu made their first contribution in https://github.com/sgl-project/sglang/pull/22923
  • @xiaobochen-amd made their first contribution in https://github.com/sgl-project/sglang/pull/22626
  • @yaya159456 made their first contribution in https://github.com/sgl-project/sglang/pull/21694
  • @YMbmzy made their first contribution in https://github.com/sgl-project/sglang/pull/22049
  • @yuki-brook made their first contribution in https://github.com/sgl-project/sglang/pull/18016
  • @Zaire404 made their first contribution in https://github.com/sgl-project/sglang/pull/22982
  • @ZeyuanChen2000 made their first contribution in https://github.com/sgl-project/sglang/pull/21543
  • @zhaozx-cn made their first contribution in https://github.com/sgl-project/sglang/pull/22266
  • @zhsurpass made their first contribution in https://github.com/sgl-project/sglang/pull/22697
  • @zsj555 made their first contribution in https://github.com/sgl-project/sglang/pull/23454

Security Fixes

  • CVE-2026-5760 — fixed in #23660

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.
