Thaw

v0.1.4 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 3mo AI Agents & Assistants


✓ No known CVEs patched


Topics

agents inference kv-cache llm reinforcement-learning sglang vllm

Summary

AI summary

Adds slot-warm model swapping to thaw serve: each pool slot pins the snapshot mmap once and reuses it on later reloads, with no API change. Benchmark script: https://github.com/thaw-ai/thaw/blob/main/bench_slot_warm.py

Full changelog

Highlights

Sub-second model hot-swap at 55 GB/s — thaw serve pins the snapshot mmap once per pool slot, then reuses that pinned handle for every subsequent model swap. Steady-state reload hits PCIe Gen5-saturating throughput.
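The pin-once, reuse-after pattern described above can be illustrated with a toy model. This is only a sketch: `ToySlot`, `_map_and_register`, and `demo` are hypothetical names, not thaw's API, and a plain read-only `mmap` stands in for the page-locked (cudaHostRegister) buffer, so no GPU is involved. It shows that the expensive step runs on the first reload only and that every reload returns identical bytes.

```python
import mmap
import os
import tempfile


class ToySlot:
    """Hypothetical stand-in for a pool slot; not thaw's EngineSlot."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self._mapped = None       # persistent handle, created on first reload
        self.register_calls = 0   # counts how often the expensive step runs

    def _map_and_register(self):
        # A real implementation would page-lock the mapping here
        # (for example via cudaHostRegister). This toy only maps the file.
        self.register_calls += 1
        with open(self.snapshot_path, "rb") as f:
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    def reload(self):
        # Cold path runs once; later reloads reuse the same handle.
        if self._mapped is None:
            self._mapped = self._map_and_register()
        # Copying out of the mapping stands in for the host-to-device copy.
        return self._mapped[:]

    def close(self):
        if self._mapped is not None:
            self._mapped.close()
            self._mapped = None


def demo():
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(b"weights" * 4)
        path = f.name
    slot = ToySlot(path)
    outputs = [slot.reload() for _ in range(5)]
    calls = slot.register_calls
    slot.close()
    os.unlink(path)
    # (number of registrations, number of distinct outputs)
    return calls, len(set(outputs))
```

Calling `demo()` performs five reloads against one slot and reports how many times the mapping step ran and how many distinct byte strings came back.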

Benchmark (H100 SXM, Llama-3-8B, 16 GB fp16)

| Reload | Time | Throughput |
|--------|------|------------|
| 0 (cold, one-time cudaHostRegister) | 6.40s | — |
| 1 | 0.29s | 55.0 GB/s |
| 2 | 0.29s | 55.1 GB/s |
| 3 | 0.29s | 55.1 GB/s |
| 4 | 0.29s | 55.1 GB/s |

Bit-identical output verified across reloads. Extrapolates to ~2.5s for Llama-70B (140 GB).
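The throughput and extrapolation figures follow from simple division, sketched below. The helper names are illustrative, and the calculation assumes reload time scales linearly with snapshot size at the same sustained throughput, which is the assumption behind the extrapolation. Because the table's times are rounded to 0.29s, recomputing from them gives a slightly different throughput than the 55.0 to 55.1 GB/s reported.

```python
def throughput_gb_s(size_gb, seconds):
    """Sustained throughput for a reload of size_gb that took `seconds`."""
    return size_gb / seconds


def reload_seconds(size_gb, gb_per_s):
    """Expected reload time at a given sustained throughput."""
    return size_gb / gb_per_s


# 16 GB in the rounded 0.29 s from the table
print(round(throughput_gb_s(16, 0.29), 1))   # 55.2

# 140 GB at 55 GB/s, matching the "~2.5s" extrapolation
print(round(reload_seconds(140, 55.0), 2))   # 2.55
```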

What's new

- PinnedMmap PyO3 type: lifecycle-managed cudaHostRegister handle
- restore_from_pinned_mmap: skip registration, reuse pinned buffer directly
- EngineSlot._pinned_mmap: persistent per-slot pinned handle in thaw serve
- Automatic: any thaw serve deployment gets slot-warm swap with no API change

Required env

thaw serve currently requires VLLM_ENABLE_V1_MULTIPROCESSING=0. With multiprocessing disabled, vLLM V1 runs its engine core in the same process, so thaw does not have to reach across a process boundary.
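One way to make sure the variable is set is to build the environment explicitly before launching the server. The helper below is a hypothetical convenience, not part of thaw; the only detail taken from the release notes is the variable name and value.

```python
import os


def thaw_serve_env(base=None):
    """Return a copy of the environment with the variable thaw serve needs.

    `base` defaults to the current process environment. This helper is an
    illustration and is not shipped with thaw.
    """
    env = dict(os.environ if base is None else base)
    env["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when starting the server. The exact thaw serve command line is not given in these notes, so it is left out here.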

Bench: bench_slot_warm.py · correctness: bench_slot_warm_correctness.py
