Thaw

v0.1.4 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

✓ No known CVEs patched

Topics

agents inference kv-cache llm reinforcement-learning sglang vllm

Summary

AI summary

This release updates the What's new items, the automatic slot-warm swap in thaw serve, and the benchmark script at https://github.com/thaw-ai/thaw/blob/main/bench_slot_warm.py.

Full changelog

Highlights

Sub-second model hot-swap at 55 GB/s

thaw serve pins the snapshot mmap once per pool slot, then reuses that pinned handle for every subsequent model swap. Steady-state reload hits PCIe Gen5-saturating throughput.

Benchmark (H100 SXM, Llama-3-8B, 16 GB fp16)

| Reload | Time | Throughput |
|--------|------|------------|
| 0 (cold, one-time cudaHostRegister) | 6.40s | — |
| 1 | 0.29s | 55.0 GB/s |
| 2 | 0.29s | 55.1 GB/s |
| 3 | 0.29s | 55.1 GB/s |
| 4 | 0.29s | 55.1 GB/s |

Bit-identical output verified across reloads. Extrapolates to ~2.5s for Llama-70B (140 GB).
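
The extrapolation above is simple arithmetic on the table, and a short sketch makes it easy to rerun for other model sizes. It assumes reload time scales linearly with snapshot size at the measured steady-state rate, which the release notes imply but do not prove. The function names are illustrative and are not part of thaw.

```python
# Figures copied from the benchmark table above.
SNAPSHOT_GB = 16.0     # Llama-3-8B, fp16
WARM_RELOAD_S = 0.29   # reloads 1 to 4
LLAMA_70B_GB = 140.0   # size quoted in the release notes


def throughput_gb_s(size_gb: float, seconds: float) -> float:
    """Effective transfer rate for one reload."""
    return size_gb / seconds


def projected_reload_s(size_gb: float, gb_per_s: float) -> float:
    """Reload time for a snapshot of size_gb, assuming linear scaling."""
    return size_gb / gb_per_s


measured = throughput_gb_s(SNAPSHOT_GB, WARM_RELOAD_S)
projected = projected_reload_s(LLAMA_70B_GB, measured)
print(f"{measured:.1f} GB/s, projected 70B reload {projected:.2f} s")
```

The computed rate is slightly above the table's 55.0 to 55.1 GB/s because the table's 0.29s is itself rounded. The projection still lands at roughly 2.5 seconds.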

What's new

  • PinnedMmap PyO3 type — lifecycle-managed cudaHostRegister handle
  • restore_from_pinned_mmap — skip registration, reuse pinned buffer directly
  • EngineSlot._pinned_mmap — persistent per-slot pinned handle in thaw serve
  • Automatic: any thaw serve deployment gets slot-warm swap with no API change
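
The pin-once, reuse-many pattern behind these items can be illustrated with a toy model. This is a sketch only: ToyPinnedHandle and ToySlot are hypothetical stand-ins, not the real PinnedMmap or EngineSlot types, and no CUDA call is made. It shows only that the expensive registration step runs once per slot while later reloads reuse the same handle.

```python
class ToyPinnedHandle:
    """Hypothetical stand-in for a pinned host mapping."""

    registrations = 0  # counts how often the expensive step ran

    def __init__(self, path: str) -> None:
        self.path = path
        # Models the one-time registration cost (reload 0 in the benchmark).
        ToyPinnedHandle.registrations += 1


class ToySlot:
    """Hypothetical pool slot that keeps its pinned handle between reloads."""

    def __init__(self, snapshot_path: str) -> None:
        self.snapshot_path = snapshot_path
        self._pinned = None
        self.reloads = 0

    def reload(self) -> ToyPinnedHandle:
        if self._pinned is None:
            # Cold path: pin once for this slot.
            self._pinned = ToyPinnedHandle(self.snapshot_path)
        # Warm path: a real implementation would copy from the pinned buffer here.
        self.reloads += 1
        return self._pinned


slot = ToySlot("model.snapshot")  # placeholder path
handles = [slot.reload() for _ in range(5)]
print(ToyPinnedHandle.registrations, slot.reloads)
```

Holding the handle on the slot rather than on each request is what lets every reload after the first skip registration.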

Required env

thaw serve currently requires VLLM_ENABLE_V1_MULTIPROCESSING=0. With that setting vLLM V1 runs its engine core in-process, so thaw serve does not have to reach across vLLM V1's process boundary.
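
One way to apply this setting from a Python launcher is to build the child environment explicitly. This is a minimal sketch: thaw_serve_env is a hypothetical helper, and the thaw serve command line is omitted because these notes do not list its arguments.

```python
import os
from typing import Mapping, Optional


def thaw_serve_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the required vLLM setting applied."""
    env = dict(os.environ if base is None else base)
    # Required by thaw serve in this release, per the note above.
    env["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    return env


env = thaw_serve_env({"PATH": "/usr/bin"})
print(env["VLLM_ENABLE_V1_MULTIPROCESSING"])
```

The returned dict can be passed as the env argument of subprocess.run when starting the server. Copying the environment leaves the parent process unchanged.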

Bench: bench_slot_warm.py · correctness: bench_slot_warm_correctness.py
