sglang

v0.5.12 Security

This release patches 1 CVE for security teams tracking exposure across their dependency inventory.

Published 2mo Model Serving & MLOps

This release patches 1 known CVE: CVE-2023-4863 (EPSS 100%).

Topics

attention blackwell cuda deepseek diffusion glm

gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

ReleasePort's take

Moderate signal

editorial:auto 2mo

sglang v0.5.12 introduces DeepSeek V4 support with Tensor, Expert, and Context Parallelism, plus a unified Docker tag `lmsysorg/sglang:v0.5.12`.

Why it matters: DeepSeek V4 support adds a full inference path with new kernels, and the unified Docker tag simplifies deployment across all Nvidia GPUs.

Summary

AI summary

DeepSeek V4 support adds a full inference path with new kernels and a unified Docker tag.

Changes in this release

Type	Severity	Summary	Source	Confidence	CVE
Feature	Medium	DeepSeek V4 support with Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallel Attention, and HiSparse offloading to CPU memory.	llm_adapter@2026-05-21	high	—
Feature	Medium	HiCache framework for UnifiedRadixTree and SSD offload through Mooncake store for DeepSeek V4.	llm_adapter@2026-05-21	high	—
Feature	Medium	TokenSpeed MLA attention backend with FP8 KV cache on SM100 GPUs (Blackwell).	llm_adapter@2026-05-21	high	—
Feature	Medium	New Model Support: DeepSeek V4, Intern-S2-Preview, MiniCPM-V 4.6, Laguna-XS.2, Ring-2.6-1T, Gemma 4 MTP.	llm_adapter@2026-05-21	high	—
Feature	Medium	Unified docker tag `lmsysorg/sglang:v0.5.12` for all Nvidia GPUs supporting DeepSeek V4 features.	llm_adapter@2026-05-21	high	—
Feature	Medium	Speculative Decoding V2 maturation including adaptive Spec V2, EAGLE-3 SWA support, and Kimi K2.5 MLA spec decoding.	llm_adapter@2026-05-21	high	—
Feature	Medium	CUDA 13 DeepEP migration to `deepseek-ai/DeepEP@hybrid-ep` for clean builds on CUDA 13.	llm_adapter@2026-05-21	high	—
Feature	Medium	HiCache + UnifiedRadixTree support with SWA, SSD offload, and stability fixes across eviction paths.	llm_adapter@2026-05-21	high	—
Feature	Medium	PD Disaggregation improvements including DSv4 flash disaggregation tests, Mooncake state transfer, and priority scheduling fix.	llm_adapter@2026-05-21	high	—
Feature	Medium	LoRA MLA attention (q_b_proj / kv_b_proj) and CSGMV backend with virtual experts for MoE LoRA.	llm_adapter@2026-05-21	high	—
Feature	Medium	Performance optimizations such as TMA bulk-store set_mla_kv_buffer, Kimi tokenizer TTFT optimization, and DeepSeekV2MoE deferring shared experts.	llm_adapter@2026-05-21	high	—
Feature	Medium	Observability additions including `sglang:get_loads_duration_seconds` Prometheus metric and decode-side bootstrap metrics.	llm_adapter@2026-05-21	high	—
Feature	Medium	Frontend & API enhancements like `/v1/tokenize` chat-completion support, multi-detokenizer, and auto-detect reasoning tool-call parser.	llm_adapter@2026-05-21	high	—
Feature	Medium	SGLang-Diffusion new model support HunyuanVideo ModelOpt FP8 and Qwen Image ModelOpt FP8 with CFG parallelism framework.	llm_adapter@2026-05-21	high	—
Feature	Medium	AMD/ROCm improvements including DSv4 Flash tests on MI35x ROCm 7.2, gfx950 aiter `_skip_rope_for_aiter_fused_mla`, and JIT kernel PR-CI.	llm_adapter@2026-05-21	high	—
Feature	Medium	NPU/Ascend enhancements such as `zbal` support, Trinity-mini (~90% accuracy), and MLA KV transfer in pipeline parallel.	llm_adapter@2026-05-21	high	—
Feature	Medium	CPU/Intel/MUSA/MLX improvements including MUSA FlashInfer sampling backend, MLX on-the-fly quantization on Apple Silicon, and Intel CPU test migration.	llm_adapter@2026-05-21	high	—
Feature	Medium	Quantization & Kernels updates like NVFP4 hot-reload-safe weight loading, Cute-DSL FP4 dense GEMM, and DSv3.2 indexer GEMM via `torch.mm`.	llm_adapter@2026-05-21	high	—
Feature	Medium	Dependencies upgrades: FlashInfer 0.6.8.post1 → 0.6.11.post1, sgl-kernel 0.4.2.post1/2 updates, and custom `sgl-deep-gemm` wheel release workflow.	llm_adapter@2026-05-21	high	—
Feature	Medium	Adds DeepSeek V4 support with Tensor, Expert, Context parallelism, Data Parallel Attention, HiSparse offloading, Prefill-Decode disaggregation, Reasoning parser, Tool Call Parser, DeepGemm/FlashMLA kernels (including MegaMoE).	granite4.1:30b@2026-05-22-audit	high	—
Feature	Medium	Enables PDL across DSv3.2 and GLM-5 kernels, uses torch.mm for DeepSeek V3.2 indexer GEMM, and relands Cute-DSL FP4 dense GEMM to trim low-latency overheads.	granite4.1:30b@2026-05-22-audit	high	—
Feature	Medium	PDL enabled across DSv3.2 / GLM-5 kernels, reducing low-latency overheads on FP4 paths.	llm_adapter@2026-05-21	low	—

Full changelog

Highlights

DeepSeek V4 support: Full inference path for DeepSeek-V4 (#23882), including:

Day-0 Features: #23882
- Parallelism: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallel Attention
- Hardware: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
- Prefill-Decode Disaggregation
- HiSparse for offloading inactive KV cache to CPU memory
- Reasoning parser and Tool Call Parser
- DeepGemm and FlashMLA kernels for DeepSeek V4, including MegaMoE
Post-Day-0 additions:
- HiCache for DeepSeek V4 under unified Radix Tree [UnifiedTree]: #24691
- W4A4 MegaMoE kernels (faster, with a negligible accuracy drop): #25052
- Marlin/FlashInfer W4A8 MoE kernels on Hopper: #24816 #24986
- Faster V2 fused compression kernels: #24890
- TP16 support on H100/H20: #24949
- Fused SiLU+clamp+FP8 quant kernel: #24897
- Optimized MHC + DeepGemm pipeline (fused norm, fused hc_head): #24775
- Non-standard chat template support for DSv4: #23915
- Multi-detokenizer support: #24944
- Pipeline Parallelism + PD support for DeepSeek-V4: #24700
- A unified Docker tag `lmsysorg/sglang:v0.5.12` for all Nvidia GPUs
See the LMSYS blog and the DeepSeek-V4 cookbook for more details.
TokenSpeed MLA attention backend (Blackwell, FP8 KV cache): New MLA prefill/decode kernels integrated as an attention backend on SM100, with FP8 KV cache support for low-latency MLA serving: #24925
DSv3.2 / GLM-5 FP4 low-latency perf: PDL enabled across DSv3.2 / GLM-5 kernels, torch.mm for the DeepSeek V3.2 indexer GEMM, and a reland of the Cute-DSL FP4 dense GEMM — materially trimming low-latency overheads on FP4 paths: #23965, #23856, #23590, #25311
New Model Support: DeepSeek V4 #23882, Intern-S2-Preview #24875, MiniCPM-V 4.6 #24855, Laguna-XS.2 #24204, Ring-2.6-1T #25360, and Gemma 4 MTP #24436 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook
HiCache + UnifiedRadixTree: HiCache framework support for UnifiedRadixTree (with SWA), HiCache for DeepSeek V4, SSD offload through Mooncake store, and stability fixes across cascade eviction, tombstone replay, and partial-match paths: #23316, #23391, #24691, #24277, #24943, #24972, #25068, #25277
Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3, and an extensive naming / shape-handling refactor across draft-extend paths: #23336, #24663, #24664, #24826, #23976, #24859
CUDA 13 DeepEP migration: Gateway DeepEP source swapped from a community fork to deepseek-ai/DeepEP@hybrid-ep so DeepEP builds and runs cleanly on the CUDA 13 default; FlashInfer pinned at 0.6.11.post1 alongside a gpt-oss triton-kernel fix: #25113
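
As a rough illustration of how the unified image tag might be used, the sketch below assembles a `docker run` command line in Python. Only the tag `lmsysorg/sglang:v0.5.12` comes from these notes. The model path is a caller-supplied placeholder (no checkpoint name is assumed), and the tensor-parallel size and port are arbitrary defaults chosen for the example. Consult the DeepSeek-V4 cookbook for tuned commands.

```python
import shlex
from typing import List


def build_docker_command(model_path: str, tp_size: int = 8, port: int = 30000) -> List[str]:
    """Build a `docker run` argv for the unified v0.5.12 image.

    model_path is supplied by the caller; the release notes do not name a
    checkpoint path, so none is assumed here. tp_size and port are
    illustrative defaults, not recommendations from the release.
    """
    image = "lmsysorg/sglang:v0.5.12"  # tag taken from the release notes
    return [
        "docker", "run", "--gpus", "all", "--ipc=host",
        "-p", f"{port}:{port}",
        image,
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--tp", str(tp_size),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]


if __name__ == "__main__":
    # "<your-deepseek-v4-checkpoint>" is a hypothetical placeholder.
    argv = build_docker_command("<your-deepseek-v4-checkpoint>")
    print(shlex.join(argv))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed to `subprocess.run` without shell quoting issues.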

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

DeepSeek V4 (see cookbook; LMSYS blog)
Intern-S2-Preview: #24875, #25115, #25134 (see cookbook)
MiniCPM-V 4.6: #24855, #24876, #24991, #24998 (see cookbook)
Laguna-XS.2 (Poolside): #24204, #24730 (see cookbook)
Ring-2.6-1T (InclusionAI, trillion-param reasoning): #25360, #25370 (see cookbook)
Gemma 4 MTP (MTP head for Gemma 4): #24436, #24433
Trinity-mini (Ascend NPU, ~90% accuracy): #18172
HunyuanVideo ModelOpt FP8 (Diffusion): #23199
Qwen Image ModelOpt FP8 (Diffusion): #23155

Speculative Decoding

TokenSpeed MLA prefill/decode kernels integrated as attention backend (FP8 KV cache, Blackwell): #24925
Adaptive Spec V2 (2/N): #23336
SWA support for EAGLE-3 drafter: #24664
Support newer EAGLE-3 drafters: #24663
Kimi K2.5 EAGLE-3 MLA spec decoding: #24826
Gemma 3 / Gemma 4 + EAGLE-3 support: #23976
Spec V1 — split draft-extend into EagleDraftExtendInput: #24859
Custom speculative-algorithm registry: #23991
Spec-V2 overlap stale-state fix: #23456
trtllm decode kernel for draft extend: #24566
AMD: EAGLE on Qwen3.5 FP8/MXFP4 via aiter unified attention: #23146
Fix Kimi K2.5 MLA EAGLE + DP attention: #25033
Fix ngram metric off-by-1 in num_accepted_drafts_per_req_cpu: #24965
Fix frozen-KV MTP crash when bonus_tokens is None: #25204
Fix stuck-MTP on DSA models: #24635
Reduce specdec CPU overhead: #23321
Spec-decoding naming-convention rule + refactors: #24094, #25014, #25038, #24081, #24724, #24735, #24881, #25010, #25012, #25030, #25029, #25037, #25109
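
Several entries above concern how accepted draft tokens and the bonus token are counted. The following is a minimal sketch of textbook greedy draft verification, not SGLang's implementation: the function name and the list-based interface are assumptions made for illustration. It shows why "accepted drafts" and "emitted tokens" differ by exactly one, which is the kind of boundary where an off-by-one can creep into a metric.

```python
from typing import List, Tuple


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> Tuple[int, List[int]]:
    """Return (num_accepted_drafts, emitted_tokens) under greedy verification.

    target_tokens[i] is the target model's greedy choice at position i given
    the prompt plus draft_tokens[:i]. It has one more entry than
    draft_tokens: the final entry is the "bonus" position that is used only
    when every draft token is accepted.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry than draft_tokens")

    accepted = 0
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first mismatch ends the accepted prefix
        accepted += 1

    # Emit the accepted drafts plus one token from the target model:
    # a correction on mismatch, or the bonus token if all drafts matched.
    emitted = draft_tokens[:accepted] + [target_tokens[accepted]]
    return accepted, emitted
```

For example, with drafts `[5, 6, 7]` and target choices `[5, 6, 9, 1]`, two drafts are accepted and three tokens are emitted.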

PD Disaggregation

DSv4 Flash disaggregation test: #24973
Unify DSv4 dispatch with SWA: #24888
DSv4 mooncake state_type branch: #24878
Hybrid state transfer refactor: #24932
Priority scheduling in PD mode fix: #25062
NIXL: staging buffer for heterogeneous-TP KV transfer: #22536
NIXL: async transfer: #23967
NIXL XPU: uint64 pointer overflow + mismatched P/D TP fixes: #24188, #24648
Mooncake: incremental transfer + SSD offload: #24257, #24277
Multi-node prefill bootstrap-port broadcast: #24378
Add retry-with-backoff for prefill bootstrap registration: #25125
PrefillDelayer: NCCL all-gather for cross-DP info sync: #24768
MORI-IO: state transfer + high-concurrency fixes: #22665
Per-room cleanup centralization; prevent update_status from cleared entries; fix abort update_status across KV backends: #24601, #24539, #24522
PD KV transfer metrics fix: #24416
SWA memory preallocation for disaggregated decode: #24857
IntraNode NVLink configuration docs: #23329
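
The retry-with-backoff entry above (#25125) names a general pattern that can be sketched independently of SGLang. The helper below is a generic illustration under stated assumptions: the function name, the choice of `ConnectionError` as the retryable error, and the delay parameters are all hypothetical and do not reflect the actual bootstrap registration code.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, doubling the wait after each failure.

    Only ConnectionError is retried (an assumption for this sketch); any
    other exception propagates immediately. The last failure is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    for attempt in range(max_attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Add up to 10% jitter so many workers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))

    raise RuntimeError("unreachable")  # loop always returns or raises
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.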

HiCache & Radix Cache

HiCache framework for UnifiedRadixTree: #23316
SWA HiCache for unified radix cache: #23391
HiCache for DeepSeek V4 + nightly CI for DSA model: #24691, #25369, #25348
SSD offload through Mooncake store: #24277
HiSparse FP8 KV cache via flashmla_kv backend: #23013
Default storage prefetch timeout: #23309
UnifiedRadixCache device match semantics with HiCache: #25277
UnifiedTree partial match on evicted+backuped nodes: #24943
UnifiedTree tombstone lock release replay fix: #24972
UnifiedTree _cascade_evict leaf determination fix: #25068
UnifiedRadixTree align cache_empty_result with RadixTree: #24779
Mamba radix cache KV events; SWA radix cache events: #23678, #24718
SWA chunk req deferred fix; SWA component host hit fix: #24318, #25085
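
For readers unfamiliar with prefix caching, the toy structure below illustrates the core idea behind a radix cache: requests that share a token prefix can reuse the KV entries already computed for it. This is a deliberately simplified per-token trie, not the UnifiedRadixTree. It has no edge compression, eviction, tombstones, or host/SSD tiers, and `kv_slot` is a hypothetical integer handle standing in for real cache memory.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[int, "TrieNode"] = {}
        self.kv_slot: Optional[int] = None  # hypothetical handle to cached KV


class PrefixCache:
    """Toy prefix cache: one trie node per token."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[int], kv_slots: List[int]) -> None:
        """Record the KV slot that holds each token of a processed sequence."""
        if len(tokens) != len(kv_slots):
            raise ValueError("tokens and kv_slots must have the same length")
        node = self.root
        for tok, slot in zip(tokens, kv_slots):
            node = node.children.setdefault(tok, TrieNode())
            node.kv_slot = slot

    def match_prefix(self, tokens: List[int]) -> List[int]:
        """Return KV slots for the longest cached prefix of tokens."""
        node = self.root
        slots: List[int] = []
        for tok in tokens:
            child = node.children.get(tok)
            if child is None or child.kv_slot is None:
                break  # no cached entry past this point
            slots.append(child.kv_slot)
            node = child
        return slots
```

A new request only needs to run prefill on the tokens after the matched prefix.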

LoRA

MLA attention LoRA (q_b_proj / kv_b_proj): #25001
CSGMV backend with virtual experts for MoE LoRA: #24007
MoE LoRA: remove CPU-GPU sync barriers and duplicate code (prefill optimize 2/n, 3/n): #24246, #24262
LoRADrainer for high P99 TTFT: #17913
qkv_proj buffer sizing when tp_size > num_key_value_heads: #24420
Torch-Native LoRA: embedding + graph optimization: #21885
Deterministic lora_id for multi-node --lora-paths: #24555
Fix broken sgemm_lora_a_graph_fwd due to invalid torch.mm(): #24760
Diffusion: fix RowParallel LoRA merged forwarding: #24410
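
The deterministic `lora_id` entry above (#24555) addresses a multi-node concern: every node must derive the same identifier for the same adapter. One common way to achieve that is to hash stable inputs instead of generating a random ID. The sketch below assumes a hash over the adapter name and path. The function name and the choice of SHA-256 are illustrative assumptions, and the actual scheme in the PR may differ.

```python
import hashlib


def deterministic_lora_id(name: str, path: str) -> str:
    """Derive a stable adapter ID from its name and path.

    Any process given the same (name, path) pair computes the same ID, so
    nodes agree without coordination. This is a hypothetical scheme, not
    necessarily what SGLang does.
    """
    # A NUL separator avoids collisions such as ("ab", "c") vs ("a", "bc").
    payload = f"{name}\x00{path}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:32]
```

A random identifier (for example from `uuid.uuid4()`) would differ per process, which is what a deterministic derivation avoids.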

Performance

TMA bulk-store set_mla_kv_buffer (up to 12× over baseline): #25311
Kimi tokenizer TTFT optimization: #25265
Avoid hidden-states D2H copy when return_hidden_states=false: #25155
DeepseekV2MoE: defer shared experts when routed kernel is non-mutating: #25279
SGLANG_OPT_FP8_WO_A_GEMM on by default: #25181
--prefill-only-disable-kv-cache to skip KV pool allocation: #23675
Gemma 4 MoE: fused Q/K/V RMSNorm + per-expert FP8 ckpt loader: #24696
Gemma 4 VLM: PCG + fused RMSNorm + residual: #24048
MHC pipeline: DeepGemm + fused norm + fused hc_head: #24775
JIT custom all-reduce default; non-NVL follow-up: #24363, #24742
SGLANG_USE_JIT_ALL_REDUCE → SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: #24297
Eliminate logits H2D blocking copy: #24627
Cache empty MatchResult in RadixCache: #24470
Breakable CUDA graph for bs > 1: #24662
FA3: skip scheduler_metadata precompute under DP attention: #24632
aten::rms_norm / aten::mm.dtype registration in batch-invariant mode: #24459
Optimize Helios fused norm modulation: #24059
Z-Image packed QKV optimization: #24117
KDA prefill kernels: diagonal + recompute fuse: #24271

Observability

sglang:get_loads_duration_seconds Prometheus metric: #25163
Per-iteration forward-pass metrics via ZMQ PUB: #22789
SGLANG_TRACE_LEVEL env for startup trace level: #24716
fwd_occupancy metric in SchedulerStats + Prometheus collector: #24458
SWA / Mamba cache metrics: #24396
Mamba radix cache + SWA radix cache KV events: #23678, #24718
PD KV transfer metrics fix: #24416
CP allgather buffer registered with symmetric memory: #24040
Decode-side bootstrap/alloc metrics + non-int token-id filter: #24684
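
To show how the new `sglang:get_loads_duration_seconds` metric might be consumed, here is a small parser for the Prometheus text exposition format. It assumes the metric is exported as a histogram or summary with `_sum` and `_count` series, which the release notes do not confirm. The sample scrape text, including the `model_name` label, is made up for the example.

```python
import re
from typing import Dict

# Matches "name{labels} value" or "name value" sample lines.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def sum_samples(exposition: str) -> Dict[str, float]:
    """Sum sample values per metric name across all label sets."""
    totals: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        name, value = match.group(1), float(match.group(3))
        totals[name] = totals.get(name, 0.0) + value
    return totals


def mean_duration(exposition: str, metric: str = "sglang:get_loads_duration_seconds") -> float:
    """Mean observed duration, assuming _sum and _count series exist."""
    totals = sum_samples(exposition)
    count = totals.get(f"{metric}_count", 0.0)
    if count == 0:
        raise ValueError(f"no observations found for {metric}")
    return totals[f"{metric}_sum"] / count


# Hypothetical scrape output; real label names and values will differ.
SAMPLE = """\
# TYPE sglang:get_loads_duration_seconds histogram
sglang:get_loads_duration_seconds_sum{model_name="demo"} 1.5
sglang:get_loads_duration_seconds_count{model_name="demo"} 6
"""

if __name__ == "__main__":
    print(mean_duration(SAMPLE))
```

In practice a Prometheus query would usually compute this, but a direct parse can be handy for a quick check against a server's metrics endpoint.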

Frontend & API

/v1/tokenize chat-completion-style support: #23981
Multi-detokenizer support: #24944
Structural tags for strict tool calling & reasoning across more models: #21722
Auto-detect reasoning / tool-call parser from chat template: #23952
Two-phase reasoning grammar + --enable-strict-thinking: #23953
OpenAI reasoning.enabled mapping to thinking + enable_thinking: #23951
Kimi-K2.5 bare-numeric tool-call IDs: #23950
Crusoe managed-inference backend: #20475
Azure Blob Storage connector (az:// and *.blob.core.windows.net): #23995
Adaptive queue-based prefill-delayer trigger: #23189
SGLANG_MAX_KV_CHUNK_CAPACITY env: #25120
SGLANG_RADIX_FORCE_MISS env: #24726, #24950
Reject repetition_penalty=0 in SamplingParams.verify(): #24874
--random-input-len for send_one.py: #24464
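
The `repetition_penalty=0` rejection listed above (#24874) is easy to motivate with the widely used formulation of repetition penalty, in which positive logits of already-seen tokens are divided by the penalty and negative ones are multiplied by it. A penalty of zero would divide by zero. The sketch below uses a hypothetical stand-in class rather than SGLang's `SamplingParams`, and rejecting negative values as well is an extra assumption of this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SamplingParamsSketch:
    """Hypothetical stand-in, not SGLang's SamplingParams."""

    repetition_penalty: float = 1.0

    def verify(self) -> None:
        # Zero is rejected per the changelog entry; rejecting negatives
        # too is an assumption made for this sketch.
        if self.repetition_penalty <= 0.0:
            raise ValueError(
                f"repetition_penalty must be > 0, got {self.repetition_penalty}"
            )


def apply_repetition_penalty(
    logits: List[float], seen_token_ids: List[int], penalty: float
) -> List[float]:
    """Penalize tokens that already appeared (common formulation).

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a penalty above 1 always lowers the score.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out
```

Validating once at request time gives the caller a clear error instead of a failure deep inside sampling.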

SGLang-Diffusion

New model support: HunyuanVideo ModelOpt FP8 (#23199), Qwen Image ModelOpt FP8 (#23155)
CFG parallelism framework + multi-branch CFG for LTX-2: #23736
Initial dynamic batching: #18764
Performance-mode server args: #24491
dit_precision config respected (no hardcoded bf16): #24988
Cache-DiT: mount before torch.compile in native denoising: #25328
Z-Image Cache-DiT sequence-parallel override fix: #25305
USP: direct all-to-all collectives; NCCL deadlock fix for remainder seq lengths: #24366, #24694
FA3 varlen out argument handling: #24688
RowParallel LoRA merged forwarding fix: #24410
CFG communication: handle non-contiguous tensors: #24332
LTX-2.3 alignment with official + HQ denoising split passes: #24313, #24298
LTX-2 feed-forward TP optimization (#23221) + Hunyuan3D shape denoising / export chunks: #24287, #24358
Encoder result cache for default negative prompt: #24304
Channels-last 3D VAE convs by default; disable VAE CPU offload by default: #23200, #24315
Component attention-backend override CLI: #24320
AMD: online MXFP4 + FP8 diffusion quantization; aiter RMSNorm; temporal-unfolded batched Conv2D for ROCm VAE decode; dual-stream MoE: #21431, #24360, #22971, #24005, #24677
NPU: MXFP8 quantization for Wan2.2 (#20922, #24918); fused-operator E2E perf for Wan (#24028); selectable parallel VAE decode strategies (#23248); SANA fix (#24798); Z-Image negative-branch rotary embed CFG fix (#23538)
MUSA: sage attention backend (#24752)

AMD / ROCm

DSv4 Flash / Pro nightly tests on MI35x ROCm 7.2: #24203, #24825, #25039
NSA indexer fallbacks + preshuffle paged MQA + GLM-5 NSA TileLang: #24125, #23562, #25205
fp8 blockwise quantization combine for MoRI EP: #24879
gfx950 + aiter _skip_rope_for_aiter_fused_mla: #24148
aiter fused_qk_rmsnorm API shim (pre/post #2958): #24799
TBO Spec-V2 seq_lens_cpu None handling: #24319
Kimi-K2.6 nightly tests (MI30x / MI35x): #23848
JIT kernel PR-CI through run_suite.py: #24987
AMD JIT benches: clamp position + resolve-token-ids: #24209, #24210 (#25209, #25210)
AMD CI hygiene (registration + cleanup + VRAM): #24569, #24572, #24586, #24612, #24614, #24615, #24665, #24924, #24981, #25112
Docker: cache-dit 1.3.0 pin; archive.ubuntu.com fallback: #24924, #24407

NPU / Ascend

zbal support: #24575
Trinity-mini support (~90% accuracy): #18172
Shared-expert dual-stream optimization: #23827
Mamba-extra-buffer radix cache (Qwen3.5): #23891
MLA KV transfer in pipeline parallel: #23893
Multi-batch FIA ops: #20177
GLM-5 docs: DeepEP enabled by default: #23708
GLM-4.5V / GLM-4.7-Flash NPU support / fixes (carry-over): existing
--disable-cuda-graph + MTP warmup fix: #23819
MRoPE position fix in Eagle Worker v2 with PlanStream: #23423
Z-Image negative-branch rotary embeddings for CFG: #23538
Wan quantization fix: #24540
causal_conv1d_update_v2 for performance: #24595
sgl-kernel-npu 2026.05.01 bump: #24951
Profiler revert + re-add: #24685, #24815
Doc / accuracy / FAQ work: #21537, #24658, #24676, #24777, #25114, #25130, #25268, #24668, #24918

CPU / Intel / MUSA / MLX / Apple Silicon

MUSA: FlashInfer sampling backend: #24978
MUSA: optimized kernels for piecewise CUDA graph: #23633
MUSA: optimized kernels for hot ops: #23255
MUSA: torchada 0.1.54 bump: #24592
MLX: on-the-fly --quantization mlx_q4 / mlx_q8 on Apple Silicon: #24907
MLX: auto-detect MLX-format quantization_config dict: #25191
MLX: thread --quantization through MlxModelRunner in bench_one_batch: #25221
MLX: Apple Silicon Metal kernel support in sgl-kernel: #23449
sgl-kernel/cpu: w8a8 int8 model support for arm cpu: #16045
Intel CPU tests migrated to test/registered (re-applied after revert): #25139, #22670, #25044
Arm64 CPU Phase-1A CI bootstrap: #22123
XPU pipeline parallelism on Intel: #23472

Quantization & Kernels

NVFP4 hot-reload-safe weight loading (alias-when-same-shape): #25190
NVFP4: free unused source scales after weight processing: #25107
Cute-DSL NVFP4 quantization kernels: #23745
Cute-DSL FP4 dense GEMM (reland): #23590
DSv3.2 indexer GEMM via torch.mm: #23856
PDL for DSv3.2 / GLM-5 kernels: #23965
DSv4: W4A4 MegaMoE; W4(MXFP4)A16 on Hopper: #25052, #24986
FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DSv4: #24816
Port KV Compression V2 + fused SiLU+clamp+FP8 quant from DSV4 dev branch: #24890, #24897
BF16 EP-MoE for DeepGEMM: #17392
DeepGEMM deprecated in sgl-kernel; custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
TRT-LLM A2A dispatch: NaN sanitization in padding slots: #24850
TRT-LLM BF16 MoE for MTP: #24260
MegaMoE decoupled from DeepEP backend (subsequently reverted): #24884, #25317
DeepEP waterfill load balancing for shared-expert dispatch: #19290
DeepEP support for --enable-return-routed-experts: #16859

Dependencies

FlashInfer 0.6.8.post1 → 0.6.11 → 0.6.11.post1 (with intermediate revert): #24452, #25129, #25310, #25335
sgl-kernel 0.4.2.post1, 0.4.2.post2: #24457, #25326
sgl-kernel: SM90 flashmla compile fix: #24130
Custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
sgl-kernel-build x86 + arm merged into reusable workflow; disk-reclaim cleanup: #25135, #25206
DeepEP swapped from fzyzcjy fork to deepseek-ai/DeepEP@hybrid-ep (CUDA 13): #25113
Torch 2.11 Docker prep + dependency cleanup: #23593
nixl stub installation alongside nixl-cuXX binary: #24369
aarch64 cubin handling + masked-failure fix: #24234
H20 stage on CUDA 13: #24916
CUDA-13 kernel installation docs: #24181, #24516
FlashInfer autotune cache: #24156
FlashInfer workspace OOM fix: #24172
FlashInfer allreduce fusion disabled under deterministic inference: #24629
trtllm allreduce fusion with PDL: #23765
TRTLLM MHA routing fix for draft-extend: #24856
torchcodec → soundfile WAV fallback for trailing metadata: #24185
sgl-kernel-npu 2026.05.01: #24951
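
Because this release moves FlashInfer through several versions, including `.post` releases, a deployment script may want to compare version strings numerically rather than as text. The helper below is a minimal sketch that handles only the `X.Y.Z` and `X.Y.Z.postN` forms seen in these notes. It is not a full PEP 440 parser, and the function name is an assumption.

```python
import re
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.Z.postN' into a sortable tuple.

    A release without a post segment sorts before any post release of the
    same X.Y.Z, so it gets -1 in the last position.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", version)
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, post = match.groups()
    return int(major), int(minor), int(patch), int(post) if post is not None else -1


def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)
```

Plain string comparison would rank `0.6.8.post1` above `0.6.11`, which is why the components are converted to integers first.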

Security

No security-tagged PRs in this window.

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.11...v0.5.12

New Contributors

@Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
@stargazerZJ made their first contribution in https://github.com/sgl-project/sglang/pull/24344
@Jianhong-Zhang made their first contribution in https://github.com/sgl-project/sglang/pull/24188
@gh1595 made their first contribution in https://github.com/sgl-project/sglang/pull/24420
@revanthreddy-hai made their first contribution in https://github.com/sgl-project/sglang/pull/24329
@TallMessiWu made their first contribution in https://github.com/sgl-project/sglang/pull/20922
@ranimandepudi made their first contribution in https://github.com/sgl-project/sglang/pull/22123
@Joey-gvwal made their first contribution in https://github.com/sgl-project/sglang/pull/23255
@fanxingran made their first contribution in https://github.com/sgl-project/sglang/pull/24129
@xz-keg made their first contribution in https://github.com/sgl-project/sglang/pull/24604
@zhongdaor-nv made their first contribution in https://github.com/sgl-project/sglang/pull/23678
@chfeng-cs made their first contribution in https://github.com/sgl-project/sglang/pull/24434
@sglang-npu-bot made their first contribution in https://github.com/sgl-project/sglang/pull/24815
@brian030128 made their first contribution in https://github.com/sgl-project/sglang/pull/24217
@tjdharamsi made their first contribution in https://github.com/sgl-project/sglang/pull/24871
@sytianhe made their first contribution in https://github.com/sgl-project/sglang/pull/24716
@Dogacel made their first contribution in https://github.com/sgl-project/sglang/pull/24663
@tangcy98 made their first contribution in https://github.com/sgl-project/sglang/pull/24967
@1pikachu made their first contribution in https://github.com/sgl-project/sglang/pull/22670
@flutist made their first contribution in https://github.com/sgl-project/sglang/pull/24760
@acheamponge made their first contribution in https://github.com/sgl-project/sglang/pull/20475
@taegeonum made their first contribution in https://github.com/sgl-project/sglang/pull/25022
@RulinJuice made their first contribution in https://github.com/sgl-project/sglang/pull/24874
@lluki made their first contribution in https://github.com/sgl-project/sglang/pull/24671
@Religious-J made their first contribution in https://github.com/sgl-project/sglang/pull/20930
@ltcs11 made their first contribution in https://github.com/sgl-project/sglang/pull/24575
@damahua made their first contribution in https://github.com/sgl-project/sglang/pull/24907
@ziang663 made their first contribution in https://github.com/sgl-project/sglang/pull/25126
@Emmanuel0612 made their first contribution in https://github.com/sgl-project/sglang/pull/25209
@Jialin made their first contribution in https://github.com/sgl-project/sglang/pull/25234
@jlee5814 made their first contribution in https://github.com/sgl-project/sglang/pull/25191
@unseenmars made their first contribution in https://github.com/sgl-project/sglang/pull/24935
@nano8259 made their first contribution in https://github.com/sgl-project/sglang/pull/25125
@imp2002 made their first contribution in https://github.com/sgl-project/sglang/pull/24130
@liuxianglong17 made their first contribution in https://github.com/sgl-project/sglang/pull/25080

Full Changelog: https://github.com/sgl-project/sglang/compare/v0.5.11...v0.5.12

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.
