sglang

v0.5.12 Security

This release patches 1 CVE for security teams tracking exposure across their dependency inventory.

Published 18d Model Serving & MLOps
This release patches 1 known CVE: CVE-2023-4863 (EPSS 93%).

Topics

attention blackwell cuda deepseek diffusion glm gpt-oss inference llama llm minimax moe qwen qwen-image reinforcement-learning transformer vlm wan

ReleasePort's take

Moderate signal

sglang v0.5.12 introduces DeepSeek V4 support with Tensor, Expert, and Context Parallelism, plus a unified Docker tag, `lmsysorg/sglang:v0.5.12`.

Why it matters: the release adds a full inference path for DeepSeek V4 with new kernels, and the unified Docker tag simplifies deployment across all Nvidia GPUs.

Summary

AI summary

Adds a full inference path for DeepSeek V4 with new kernels, along with a unified Docker tag.

Changes in this release

Feature Medium

DeepSeek V4 support with Tensor Parallelism, Expert Parallelism, Context Parallelism, Data Parallel Attention, and HiSparse offloading to CPU memory.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

HiCache framework for UnifiedRadixTree and SSD offload through Mooncake store for DeepSeek V4.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

TokenSpeed MLA attention backend with FP8 KV cache on SM100 GPUs (Blackwell).

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

New Model Support: DeepSeek V4, Intern-S2-Preview, MiniCPM-V 4.6, Laguna-XS.2, Ring-2.6-1T, Gemma 4 MTP.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Unified docker tag `lmsysorg/sglang:v0.5.12` for all Nvidia GPUs supporting DeepSeek V4 features.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Speculative Decoding V2 maturation including adaptive Spec V2, EAGLE-3 SWA support, and Kimi K2.5 MLA spec decoding.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

CUDA 13 DeepEP migration to `deepseek-ai/DeepEP@hybrid-ep` for clean builds on CUDA 13.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

HiCache + UnifiedRadixTree support with SWA, SSD offload, and stability fixes across eviction paths.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

PD Disaggregation improvements including DSv4 flash disaggregation tests, Mooncake state transfer, and priority scheduling fix.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

LoRA MLA attention (q_b_proj / kv_b_proj) and CSGMV backend with virtual experts for MoE LoRA.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Performance optimizations such as TMA bulk-store set_mla_kv_buffer, Kimi tokenizer TTFT optimization, and DeepSeekV2MoE deferring shared experts.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Observability additions including `sglang:get_loads_duration_seconds` Prometheus metric and decode-side bootstrap metrics.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Frontend & API enhancements like `/v1/tokenize` chat-completion support, multi-detokenizer, and auto-detect reasoning tool-call parser.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

SGLang-Diffusion adds model support for HunyuanVideo ModelOpt FP8 and Qwen Image ModelOpt FP8, plus a CFG parallelism framework.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

AMD/ROCm improvements including DSv4 Flash tests on MI35x ROCm 7.2, gfx950 aiter `_skip_rope_for_aiter_fused_mla`, and JIT kernel PR-CI.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

NPU/Ascend enhancements such as `zbal` support, Trinity-mini (~90% accuracy), and MLA KV transfer in pipeline parallel.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

CPU/Intel/MUSA/MLX improvements including MUSA FlashInfer sampling backend, MLX on-the-fly quantization on Apple Silicon, and Intel CPU test migration.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Quantization & Kernels updates like NVFP4 hot-reload-safe weight loading, Cute-DSL FP4 dense GEMM, and DSv3.2 indexer GEMM via `torch.mm`.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Dependency upgrades: FlashInfer 0.6.8.post1 → 0.6.11.post1, sgl-kernel 0.4.2.post1 and 0.4.2.post2, and a custom `sgl-deep-gemm` wheel release workflow.

Source: llm_adapter@2026-05-21

Confidence: high

Feature Medium

Adds DeepSeek V4 support with Tensor, Expert, Context parallelism, Data Parallel Attention, HiSparse offloading, Prefill‑Decode disaggregation, Reasoning parser, Tool Call Parser, DeepGemm/FlashMLA kernels (including MegaMoE).

Source: granite4.1:30b@2026-05-22-audit

Confidence: high

Feature Medium

Enables PDL across DSv3.2 and GLM‑5 kernels, uses torch.mm for DeepSeek V3.2 indexer GEMM, and relands Cute‑DSL FP4 dense GEMM to trim low‑latency overheads.

Source: granite4.1:30b@2026-05-22-audit

Confidence: high

Feature Medium

PDL enabled across DSv3.2 / GLM-5 kernels, reducing low-latency overheads on FP4 paths.

Source: llm_adapter@2026-05-21

Confidence: low

Full changelog

Highlights

  • DeepSeek V4 support: Full inference path for DeepSeek-V4 (#23882), including:

    Day-0 Features: #23882

    • Parallelism: Tensor Parallelism/Expert Parallelism/Context Parallelism/Data Parallel Attention
    • Hardware: Nvidia B300/B200/H200/H100/GB200/GB300, AMD MI35X
    • Prefill-Decode Disaggregation
    • HiSparse for offloading inactive KV cache to CPU memory
    • Reasoning parser and Tool Call Parser
    • DeepGemm and FlashMLA kernels for DeepSeek V4, including MegaMoE

    Post-Day-0 additions:

    • HiCache for DeepSeek V4 under unified Radix Tree [UnifiedTree]: #24691
    • W4A4 MegaMoE kernels — faster speed with negligible accuracy drop: #25052
    • Marlin/FlashInfer W4A8 MoE kernels on Hopper: #24816 #24986
    • Faster V2 fused compression kernels: #24890
    • TP16 support on H100/H20: #24949
    • Fused SiLU+clamp+FP8 quant kernel: #24897
    • Optimized MHC + DeepGemm pipeline (fused norm, fused hc_head): #24775
    • Non-standard chat template support for DSv4: #23915
    • Multi-detokenizer support: #24944
    • Pipeline Parallelism + PD support for DeepSeek-V4: #24700
    • A unified docker tag lmsysorg/sglang:v0.5.12 for all Nvidia GPUs

    See the LMSYS blog and the DeepSeek-V4 cookbook for more details.

  • TokenSpeed MLA attention backend (Blackwell, FP8 KV cache): New MLA prefill/decode kernels integrated as an attention backend on SM100, with FP8 KV cache support for low-latency MLA serving: #24925

  • DSv3.2 / GLM-5 FP4 low-latency perf: PDL enabled across DSv3.2 / GLM-5 kernels, torch.mm for the DeepSeek V3.2 indexer GEMM, and a reland of the Cute-DSL FP4 dense GEMM — materially trimming low-latency overheads on FP4 paths: #23965, #23856, #23590, #25311

  • New Model Support: DeepSeek V4 #23882, Intern-S2-Preview #24875, MiniCPM-V 4.6 #24855, Laguna-XS.2 #24204, Ring-2.6-1T #25360, and Gemma 4 MTP #24436 — with cookbook recipes for tuned deployment commands. See docs.sglang.io/cookbook

  • HiCache + UnifiedRadixTree: HiCache framework support for UnifiedRadixTree (with SWA), HiCache for DeepSeek V4, SSD offload through Mooncake store, and stability fixes across cascade eviction, tombstone replay, and partial-match paths: #23316, #23391, #24691, #24277, #24943, #24972, #25068, #25277

  • Speculative Decoding V2 maturation: Adaptive Spec V2, EAGLE-3 SWA + newer drafters, Kimi K2.5 EAGLE-3 MLA, Gemma 3/4 + EAGLE-3, and an extensive naming / shape-handling refactor across draft-extend paths: #23336, #24663, #24664, #24826, #23976, #24859

  • CUDA 13 DeepEP migration: Gateway DeepEP source swapped from a community fork to deepseek-ai/DeepEP@hybrid-ep so DeepEP builds and runs cleanly on the CUDA 13 default; FlashInfer pinned at 0.6.11.post1 alongside a gpt-oss triton-kernel fix: #25113

New Model Support

Entries with a published cookbook recipe come first; entries whose cookbook page is still pending are grouped at the bottom.

  • DeepSeek V4 (see cookbook; LMSYS blog)
  • Intern-S2-Preview: #24875, #25115, #25134 (see cookbook)
  • MiniCPM-V 4.6: #24855, #24876, #24991, #24998 (see cookbook)
  • Laguna-XS.2 (Poolside): #24204, #24730 (see cookbook)
  • Ring-2.6-1T (InclusionAI, trillion-param reasoning): #25360, #25370 (see cookbook)
  • Gemma 4 MTP (MTP head for Gemma 4): #24436, #24433
  • Trinity-mini (Ascend NPU, ~90% accuracy): #18172
  • HunyuanVideo ModelOpt FP8 (Diffusion): #23199
  • Qwen Image ModelOpt FP8 (Diffusion): #23155

Speculative Decoding

  • TokenSpeed MLA prefill/decode kernels integrated as attention backend (FP8 KV cache, Blackwell): #24925
  • Adaptive Spec V2 (2/N): #23336
  • SWA support for EAGLE-3 drafter: #24664
  • Support newer EAGLE-3 drafters: #24663
  • Kimi K2.5 EAGLE-3 MLA spec decoding: #24826
  • Gemma 3 / Gemma 4 + EAGLE-3 support: #23976
  • Spec V1 — split draft-extend into EagleDraftExtendInput: #24859
  • Custom speculative-algorithm registry: #23991
  • Spec-V2 overlap stale-state fix: #23456
  • trtllm decode kernel for draft extend: #24566
  • AMD: EAGLE on Qwen3.5 FP8/MXFP4 via aiter unified attention: #23146
  • Fix Kimi K2.5 MLA EAGLE + DP attention: #25033
  • Fix ngram metric off-by-1 in num_accepted_drafts_per_req_cpu: #24965
  • Fix frozen-KV MTP crash when bonus_tokens is None: #25204
  • Fix stuck-MTP on DSA models: #24635
  • Reduce specdec CPU overhead: #23321
  • Spec-decoding naming-convention rule + refactors: #24094, #25014, #25038, #24081, #24724, #24735, #24881, #25010, #25012, #25030, #25029, #25037, #25109
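For readers new to the terms used above, here is a toy sketch of the accept-and-verify step that speculative decoding relies on. It shows a generic greedy verification rule and is not SGLang's EAGLE-3 or Spec V2 code; the function names `count_accepted_drafts` and `verify_step` are hypothetical. It also illustrates why "accepted drafts" and "tokens emitted" differ by one, which is the kind of boundary an off-by-one metric fix has to get right.

```python
from typing import List, Sequence


def count_accepted_drafts(draft: Sequence[int], target: Sequence[int]) -> int:
    """Return how many leading draft tokens agree with the target model's tokens."""
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    return accepted


def verify_step(draft: Sequence[int], target: Sequence[int]) -> List[int]:
    """Emit the accepted draft prefix plus one token taken from the target model.

    `target` holds the target model's greedy choice at each draft position plus
    one extra position, so it must have len(draft) + 1 entries.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have exactly one more token than draft")
    n = count_accepted_drafts(draft, target)
    # target[n] is the correction token, or the bonus token when every draft
    # token was accepted; either way the step emits n + 1 tokens.
    return list(draft[:n]) + [target[n]]


if __name__ == "__main__":
    # Two draft tokens match, the third does not, so the step emits [5, 7, 2].
    print(verify_step([5, 7, 9], [5, 7, 2, 4]))
```

In this sketch a step always emits one more token than the number of accepted drafts, so a per-request counter needs to state clearly which of the two quantities it reports.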

PD Disaggregation

  • DSv4 Flash disaggregation test: #24973
  • Unify DSv4 dispatch with SWA: #24888
  • DSv4 mooncake state_type branch: #24878
  • Hybrid state transfer refactor: #24932
  • Priority scheduling in PD mode fix: #25062
  • NIXL: staging buffer for heterogeneous-TP KV transfer: #22536
  • NIXL: async transfer: #23967
  • NIXL XPU: uint64 pointer overflow + mismatched P/D TP fixes: #24188, #24648
  • Mooncake: incremental transfer + SSD offload: #24257, #24277
  • Multi-node prefill bootstrap-port broadcast: #24378
  • Add retry-with-backoff for prefill bootstrap registration: #25125
  • PrefillDelayer: NCCL all-gather for cross-DP info sync: #24768
  • MORI-IO: state transfer + high-concurrency fixes: #22665
  • Per-room cleanup centralization; prevent update_status from cleared entries; fix abort update_status across KV backends: #24601, #24539, #24522
  • PD KV transfer metrics fix: #24416
  • SWA memory preallocation for disaggregated decode: #24857
  • IntraNode NVLink configuration docs: #23329

HiCache & Radix Cache

  • HiCache framework for UnifiedRadixTree: #23316
  • SWA HiCache for unified radix cache: #23391
  • HiCache for DeepSeek V4 + nightly CI for DSA model: #24691, #25369, #25348
  • SSD offload through Mooncake store: #24277
  • HiSparse FP8 KV cache via flashmla_kv backend: #23013
  • Default storage prefetch timeout: #23309
  • UnifiedRadixCache device match semantics with HiCache: #25277
  • UnifiedTree partial match on evicted+backuped nodes: #24943
  • UnifiedTree tombstone lock release replay fix: #24972
  • UnifiedTree _cascade_evict leaf determination fix: #25068
  • UnifiedRadixTree align cache_empty_result with RadixTree: #24779
  • Mamba radix cache KV events; SWA radix cache events: #23678, #24718
  • SWA chunk req deferred fix; SWA component host hit fix: #24318, #25085
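As background for the radix-cache items above, the following is a toy illustration of prefix matching over token IDs, the basic idea behind reusing cached KV state for a shared prompt prefix. It uses an uncompressed token-level trie for brevity and does not model eviction, tombstones, host/SSD tiers, or SWA; the class names `PrefixNode` and `ToyPrefixCache` are hypothetical and are not SGLang's `RadixCache` or UnifiedRadixTree.

```python
from typing import Dict, Sequence


class PrefixNode:
    """One node per token; children are keyed by the next token ID."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class ToyPrefixCache:
    """Records token sequences and reports the longest cached prefix of a query."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            # Reuse an existing child when the prefix is already cached.
            node = node.children.setdefault(tok, PrefixNode())

    def match_prefix(self, tokens: Sequence[int]) -> int:
        """Return the number of leading tokens already present in the cache."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


if __name__ == "__main__":
    cache = ToyPrefixCache()
    cache.insert([1, 2, 3, 4])
    # The first two tokens are shared with the cached sequence.
    print(cache.match_prefix([1, 2, 9]))
```

A production radix tree compresses runs of tokens into single edges and attaches KV-cache handles to nodes; the matching logic stays the same in spirit.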

LoRA

  • MLA attention LoRA (q_b_proj / kv_b_proj): #25001
  • CSGMV backend with virtual experts for MoE LoRA: #24007
  • MoE LoRA: remove CPU-GPU sync barriers and duplicate code (prefill optimize 2/n, 3/n): #24246, #24262
  • LoRADrainer for high P99 TTFT: #17913
  • qkv_proj buffer sizing when tp_size > num_key_value_heads: #24420
  • Torch-Native LoRA: embedding + graph optimization: #21885
  • Deterministic lora_id for multi-node --lora-paths: #24555
  • Fix broken sgemm_lora_a_graph_fwd due to invalid torch.mm(): #24760
  • Diffusion: fix RowParallel LoRA merged forwarding: #24410

Performance

  • TMA bulk-store set_mla_kv_buffer (up to 12× over baseline): #25311
  • Kimi tokenizer TTFT optimization: #25265
  • Avoid hidden-states D2H copy when return_hidden_states=false: #25155
  • DeepseekV2MoE: defer shared experts when routed kernel is non-mutating: #25279
  • SGLANG_OPT_FP8_WO_A_GEMM on by default: #25181
  • --prefill-only-disable-kv-cache to skip KV pool allocation: #23675
  • Gemma 4 MoE: fused Q/K/V RMSNorm + per-expert FP8 ckpt loader: #24696
  • Gemma 4 VLM: PCG + fused RMSNorm + residual: #24048
  • MHC pipeline: DeepGemm + fused norm + fused hc_head: #24775
  • JIT custom all-reduce default; non-NVL follow-up: #24363, #24742
  • SGLANG_USE_JIT_ALL_REDUCE → SGLANG_OPT_USE_CUSTOM_ALL_REDUCE_V2: #24297
  • Eliminate logits H2D blocking copy: #24627
  • Cache empty MatchResult in RadixCache: #24470
  • Breakable CUDA graph for bs > 1: #24662
  • FA3: skip scheduler_metadata precompute under DP attention: #24632
  • aten::rms_norm / aten::mm.dtype registration in batch-invariant mode: #24459
  • Optimize Helios fused norm modulation: #24059
  • Z-Image packed QKV optimization: #24117
  • KDA prefill kernels: diagonal + recompute fuse: #24271

Observability

  • sglang:get_loads_duration_seconds Prometheus metric: #25163
  • Per-iteration forward-pass metrics via ZMQ PUB: #22789
  • SGLANG_TRACE_LEVEL env for startup trace level: #24716
  • fwd_occupancy metric in SchedulerStats + Prometheus collector: #24458
  • SWA / Mamba cache metrics: #24396
  • Mamba radix cache + SWA radix cache KV events: #23678, #24718
  • PD KV transfer metrics fix: #24416
  • CP allgather buffer registered with symmetric memory: #24040
  • Decode-side bootstrap/alloc metrics + non-int token-id filter: #24684
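A minimal sketch of how one might pick the new `sglang:get_loads_duration_seconds` samples out of a Prometheus text exposition. Only the metric name comes from the release notes. The sample text, the `_sum` and `_count` suffixes (which assume a histogram or summary type), and the helper name `extract_samples` are illustrative assumptions, and the parser assumes sample lines carry no trailing timestamp.

```python
from typing import Dict

# Metric name taken from the release notes.
METRIC_PREFIX = "sglang:get_loads_duration_seconds"

# Illustrative exposition text; the values and suffixes are made up.
SAMPLE_EXPOSITION = """\
# HELP sglang:get_loads_duration_seconds Illustrative help text (not from SGLang)
sglang:get_loads_duration_seconds_sum 0.42
sglang:get_loads_duration_seconds_count 7
some_other_metric{label="a b"} 3
"""


def extract_samples(exposition_text: str, prefix: str = METRIC_PREFIX) -> Dict[str, float]:
    """Collect samples whose name starts with `prefix`.

    Assumes each sample line has the form `<name>[{labels}] <value>` with no
    trailing timestamp.
    """
    samples: Dict[str, float] = {}
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The value is the last space-separated field, so label values that
        # contain spaces do not confuse the split.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples


if __name__ == "__main__":
    for name, value in extract_samples(SAMPLE_EXPOSITION).items():
        print(name, value)
```

For anything beyond a quick check, a real Prometheus client or scraper is the safer choice, since it handles timestamps, escaping, and metric types properly.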

Frontend & API

  • /v1/tokenize chat-completion-style support: #23981
  • Multi-detokenizer support: #24944
  • Structural tags for strict tool calling & reasoning across more models: #21722
  • Auto-detect reasoning / tool-call parser from chat template: #23952
  • Two-phase reasoning grammar + --enable-strict-thinking: #23953
  • OpenAI reasoning.enabled mapping to thinking + enable_thinking: #23951
  • Kimi-K2.5 bare-numeric tool-call IDs: #23950
  • Crusoe managed-inference backend: #20475
  • Azure Blob Storage connector (az:// and *.blob.core.windows.net): #23995
  • Adaptive queue-based prefill-delayer trigger: #23189
  • SGLANG_MAX_KV_CHUNK_CAPACITY env: #25120
  • SGLANG_RADIX_FORCE_MISS env: #24726, #24950
  • Reject repetition_penalty=0 in SamplingParams.verify(): #24874
  • --random-input-len for send_one.py: #24464
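The Azure Blob Storage item above names two URL shapes, `az://` and `*.blob.core.windows.net`. As a rough sketch of what recognizing them could look like, the hypothetical helper below classifies a URL using only the standard library. It is not the connector's code, and requiring `https` for the hostname form is an assumption.

```python
from urllib.parse import urlparse

AZURE_BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def is_azure_blob_url(url: str) -> bool:
    """Return True for `az://...` URLs and `https://<account>.blob.core.windows.net/...` URLs."""
    parsed = urlparse(url)
    if parsed.scheme == "az":
        return True
    # Assumption: the hostname form is only accepted over https.
    host = (parsed.hostname or "").lower()
    return parsed.scheme == "https" and host.endswith(AZURE_BLOB_HOST_SUFFIX)


if __name__ == "__main__":
    for candidate in (
        "az://container/models/checkpoint",
        "https://account.blob.core.windows.net/container/model",
        "s3://bucket/model",
    ):
        print(candidate, is_azure_blob_url(candidate))
```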

SGLang-Diffusion

  • New model support: HunyuanVideo ModelOpt FP8 (#23199), Qwen Image ModelOpt FP8 (#23155)
  • CFG parallelism framework + multi-branch CFG for LTX-2: #23736
  • Initial dynamic batching: #18764
  • Performance-mode server args: #24491
  • dit_precision config respected (no hardcoded bf16): #24988
  • Cache-DiT: mount before torch.compile in native denoising: #25328
  • Z-Image Cache-DiT sequence-parallel override fix: #25305
  • USP: direct all-to-all collectives; NCCL deadlock fix for remainder seq lengths: #24366, #24694
  • FA3 varlen out argument handling: #24688
  • RowParallel LoRA merged forwarding fix: #24410
  • CFG communication: handle non-contiguous tensors: #24332
  • LTX-2.3 alignment with official + HQ denoising split passes: #24313, #24298
  • LTX-2 feed-forward TP optimization (#23221) + Hunyuan3D shape denoising / export chunks: #24287, #24358
  • Encoder result cache for default negative prompt: #24304
  • Channels-last 3D VAE convs by default; disable VAE CPU offload by default: #23200, #24315
  • Component attention-backend override CLI: #24320
  • AMD: online MXFP4 + FP8 diffusion quantization; aiter RMSNorm; temporal-unfolded batched Conv2D for ROCm VAE decode; dual-stream MoE: #21431, #24360, #22971, #24005, #24677
  • NPU: MXFP8 quantization for Wan2.2 (#20922, #24918); fused-operator E2E perf for Wan (#24028); selectable parallel VAE decode strategies (#23248); SANA fix (#24798); Z-Image negative-branch rotary embed CFG fix (#23538)
  • MUSA: sage attention backend (#24752)

AMD / ROCm

  • DSv4 Flash / Pro nightly tests on MI35x ROCm 7.2: #24203, #24825, #25039
  • NSA indexer fallbacks + preshuffle paged MQA + GLM-5 NSA TileLang: #24125, #23562, #25205
  • fp8 blockwise quantization combine for MoRI EP: #24879
  • gfx950 + aiter _skip_rope_for_aiter_fused_mla: #24148
  • aiter fused_qk_rmsnorm API shim (pre/post #2958): #24799
  • TBO Spec-V2 seq_lens_cpu None handling: #24319
  • Kimi-K2.6 nightly tests (MI30x / MI35x): #23848
  • JIT kernel PR-CI through run_suite.py: #24987
  • AMD JIT benches: clamp position + resolve-token-ids: #24209, #24210 (#25209, #25210)
  • AMD CI hygiene (registration + cleanup + VRAM): #24569, #24572, #24586, #24612, #24614, #24615, #24665, #24924, #24981, #25112
  • Docker: cache-dit 1.3.0 pin; archive.ubuntu.com fallback: #24924, #24407

NPU / Ascend

  • zbal support: #24575
  • Trinity-mini support (~90% accuracy): #18172
  • Shared-expert dual-stream optimization: #23827
  • Mamba-extra-buffer radix cache (Qwen3.5): #23891
  • MLA KV transfer in pipeline parallel: #23893
  • Multi-batch FIA ops: #20177
  • GLM-5 docs: DeepEP enabled by default: #23708
  • GLM-4.5V / GLM-4.7-Flash NPU support / fixes (carry-over): existing
  • --disable-cuda-graph + MTP warmup fix: #23819
  • MRoPE position fix in Eagle Worker v2 with PlanStream: #23423
  • Z-Image negative-branch rotary embeddings for CFG: #23538
  • Wan quantization fix: #24540
  • causal_conv1d_update_v2 for performance: #24595
  • sgl-kernel-npu 2026.05.01 bump: #24951
  • Profiler revert + re-add: #24685, #24815
  • Doc / accuracy / FAQ work: #21537, #24658, #24676, #24777, #25114, #25130, #25268, #24668, #24918

CPU / Intel / MUSA / MLX / Apple Silicon

  • MUSA: FlashInfer sampling backend: #24978
  • MUSA: optimized kernels for piecewise CUDA graph: #23633
  • MUSA: optimized kernels for hot ops: #23255
  • MUSA: torchada 0.1.54 bump: #24592
  • MLX: on-the-fly --quantization mlx_q4 / mlx_q8 on Apple Silicon: #24907
  • MLX: auto-detect MLX-format quantization_config dict: #25191
  • MLX: thread --quantization through MlxModelRunner in bench_one_batch: #25221
  • MLX: Apple Silicon Metal kernel support in sgl-kernel: #23449
  • sgl-kernel/cpu: w8a8 int8 model support for arm cpu: #16045
  • Intel CPU tests migrated to test/registered (re-applied after revert): #25139, #22670, #25044
  • Arm64 CPU Phase-1A CI bootstrap: #22123
  • XPU pipeline parallelism on Intel: #23472

Quantization & Kernels

  • NVFP4 hot-reload-safe weight loading (alias-when-same-shape): #25190
  • NVFP4: free unused source scales after weight processing: #25107
  • Cute-DSL NVFP4 quantization kernels: #23745
  • Cute-DSL FP4 dense GEMM (reland): #23590
  • DSv3.2 indexer GEMM via torch.mm: #23856
  • PDL for DSv3.2 / GLM-5 kernels: #23965
  • DSv4: W4A4 MegaMoE; W4(MXFP4)A16 on Hopper: #25052, #24986
  • FlashInfer SM90 cutlass MXFP4 MoE backend (W4A16) for GPT-OSS + DSv4: #24816
  • Port KV Compression V2 + fused SiLU+clamp+FP8 quant from DSV4 dev branch: #24890, #24897
  • BF16 EP-MoE for DeepGEMM: #17392
  • DeepGEMM deprecated in sgl-kernel; custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • TRT-LLM A2A dispatch: NaN sanitization in padding slots: #24850
  • TRT-LLM BF16 MoE for MTP: #24260
  • MegaMoE decoupled from DeepEP backend (subsequently reverted): #24884, #25317
  • DeepEP waterfill load balancing for shared-expert dispatch: #19290
  • DeepEP support for --enable-return-routed-experts: #16859

Dependencies

  • FlashInfer 0.6.8.post1 → 0.6.11 → 0.6.11.post1 (with intermediate revert): #24452, #25129, #25310, #25335
  • sgl-kernel 0.4.2.post1, 0.4.2.post2: #24457, #25326
  • sgl-kernel: SM90 flashmla compile fix: #24130
  • Custom sgl-deep-gemm wheel + release workflow: #24268, #24348, #24385
  • sgl-kernel-build x86 + arm merged into reusable workflow; disk-reclaim cleanup: #25135, #25206
  • DeepEP swapped from fzyzcjy fork to deepseek-ai/DeepEP@hybrid-ep (CUDA 13): #25113
  • Torch 2.11 Docker prep + dependency cleanup: #23593
  • nixl stub installation alongside nixl-cuXX binary: #24369
  • aarch64 cubin handling + masked-failure fix: #24234
  • H20 stage on CUDA 13: #24916
  • CUDA-13 kernel installation docs: #24181, #24516
  • FlashInfer autotune cache: #24156
  • FlashInfer workspace OOM fix: #24172
  • FlashInfer allreduce fusion disabled under deterministic inference: #24629
  • trtllm allreduce fusion with PDL: #23765
  • TRTLLM MHA routing fix for draft-extend: #24856
  • torchcodec → soundfile WAV fallback for trailing metadata: #24185
  • sgl-kernel-npu 2026.05.01: #24951
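Version strings such as `0.6.8.post1` and `0.6.11.post1` do not sort correctly as plain text, because `"0.6.8"` compares greater than `"0.6.11"` character by character. The sketch below shows one stdlib-only way to check an installed version against a pin. The helper names `parse_version` and `meets_pin` are hypothetical, and the parser handles only the `X.Y.Z` and `X.Y.Z.postN` shapes that appear in this list, not the full PEP 440 grammar.

```python
import re
from typing import Tuple

_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?")


def parse_version(version: str) -> Tuple[int, int, int, int]:
    """Parse `X.Y.Z` or `X.Y.Z.postN` into a tuple that sorts in release order."""
    match = _VERSION_RE.fullmatch(version)
    if match is None:
        raise ValueError(f"unsupported version format: {version!r}")
    major, minor, patch, post = match.groups()
    # A release without a post segment sorts before any of its post-releases.
    post_number = int(post) if post is not None else -1
    return (int(major), int(minor), int(patch), post_number)


def meets_pin(installed: str, required: str) -> bool:
    """Return True when `installed` is at least `required`."""
    return parse_version(installed) >= parse_version(required)


if __name__ == "__main__":
    print(meets_pin("0.6.8.post1", "0.6.11.post1"))
    print(meets_pin("0.6.11.post1", "0.6.11.post1"))
```

In a real project the `packaging` library's `Version` class covers the full version grammar and is the more robust choice.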

Security

No security-tagged PRs in this window.

All PRs included in this release: https://github.com/sgl-project/sglang/compare/v0.5.11...v0.5.12

New Contributors

  • @Seven-Streams made their first contribution in https://github.com/sgl-project/sglang/pull/21722
  • @stargazerZJ made their first contribution in https://github.com/sgl-project/sglang/pull/24344
  • @Jianhong-Zhang made their first contribution in https://github.com/sgl-project/sglang/pull/24188
  • @gh1595 made their first contribution in https://github.com/sgl-project/sglang/pull/24420
  • @revanthreddy-hai made their first contribution in https://github.com/sgl-project/sglang/pull/24329
  • @TallMessiWu made their first contribution in https://github.com/sgl-project/sglang/pull/20922
  • @ranimandepudi made their first contribution in https://github.com/sgl-project/sglang/pull/22123
  • @Joey-gvwal made their first contribution in https://github.com/sgl-project/sglang/pull/23255
  • @fanxingran made their first contribution in https://github.com/sgl-project/sglang/pull/24129
  • @xz-keg made their first contribution in https://github.com/sgl-project/sglang/pull/24604
  • @zhongdaor-nv made their first contribution in https://github.com/sgl-project/sglang/pull/23678
  • @chfeng-cs made their first contribution in https://github.com/sgl-project/sglang/pull/24434
  • @sglang-npu-bot made their first contribution in https://github.com/sgl-project/sglang/pull/24815
  • @brian030128 made their first contribution in https://github.com/sgl-project/sglang/pull/24217
  • @tjdharamsi made their first contribution in https://github.com/sgl-project/sglang/pull/24871
  • @sytianhe made their first contribution in https://github.com/sgl-project/sglang/pull/24716
  • @Dogacel made their first contribution in https://github.com/sgl-project/sglang/pull/24663
  • @tangcy98 made their first contribution in https://github.com/sgl-project/sglang/pull/24967
  • @1pikachu made their first contribution in https://github.com/sgl-project/sglang/pull/22670
  • @flutist made their first contribution in https://github.com/sgl-project/sglang/pull/24760
  • @acheamponge made their first contribution in https://github.com/sgl-project/sglang/pull/20475
  • @taegeonum made their first contribution in https://github.com/sgl-project/sglang/pull/25022
  • @RulinJuice made their first contribution in https://github.com/sgl-project/sglang/pull/24874
  • @lluki made their first contribution in https://github.com/sgl-project/sglang/pull/24671
  • @Religious-J made their first contribution in https://github.com/sgl-project/sglang/pull/20930
  • @ltcs11 made their first contribution in https://github.com/sgl-project/sglang/pull/24575
  • @damahua made their first contribution in https://github.com/sgl-project/sglang/pull/24907
  • @ziang663 made their first contribution in https://github.com/sgl-project/sglang/pull/25126
  • @Emmanuel0612 made their first contribution in https://github.com/sgl-project/sglang/pull/25209
  • @Jialin made their first contribution in https://github.com/sgl-project/sglang/pull/25234
  • @jlee5814 made their first contribution in https://github.com/sgl-project/sglang/pull/25191
  • @unseenmars made their first contribution in https://github.com/sgl-project/sglang/pull/24935
  • @nano8259 made their first contribution in https://github.com/sgl-project/sglang/pull/25125
  • @imp2002 made their first contribution in https://github.com/sgl-project/sglang/pull/24130
  • @liuxianglong17 made their first contribution in https://github.com/sgl-project/sglang/pull/25080

About sglang

SGLang is a high-performance serving framework for large language models and multimodal models.
