Highlights
Sub-second model hot-swap at 55 GB/s — thaw serve pins the snapshot mmap once per pool slot, then reuses that pinned handle for every subsequent model swap. Steady-state reload hits PCIe Gen5-saturating throughput.
Benchmark (H100 SXM, Llama-3-8B, 16 GB fp16)
| Reload | Time | Throughput |
|--------|------|------------|
| 0 (cold, one-time cudaHostRegister) | 6.40s | — |
| 1 | 0.29s | 55.0 GB/s |
| 2 | 0.29s | 55.1 GB/s |
| 3 | 0.29s | 55.1 GB/s |
| 4 | 0.29s | 55.1 GB/s |
Bit-identical output verified across reloads. At the same steady-state throughput, this extrapolates to roughly 2.5 s for Llama-70B (140 GB); that figure is a projection, not a measurement.
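The throughput and extrapolation figures can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the numbers quoted in the table (16 GB, 0.29 s, 140 GB) and assumes decimal gigabytes; the function names are illustrative and not part of thaw.

```python
# Figures copied from the benchmark table; decimal GB is an assumption.
MODEL_GB = 16.0        # Llama-3-8B, fp16
WARM_RELOAD_S = 0.29   # reloads 1-4 in the table
LLAMA_70B_GB = 140.0   # size quoted for the 70B extrapolation


def throughput_gb_s(size_gb: float, seconds: float) -> float:
    """Effective transfer rate for one reload."""
    return size_gb / seconds


def projected_reload_s(size_gb: float, gb_per_s: float) -> float:
    """Reload time for a larger model at the same rate (a projection only)."""
    return size_gb / gb_per_s


warm = throughput_gb_s(MODEL_GB, WARM_RELOAD_S)
projected = projected_reload_s(LLAMA_70B_GB, warm)
print(f"warm throughput: {warm:.1f} GB/s")
print(f"projected 70B reload: {projected:.2f} s")
```

Because the table's reload times are rounded to two decimals, the recomputed rate lands slightly above the 55.0 to 55.1 GB/s reported; the projection still comes out at about 2.5 s.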
What's new
- `PinnedMmap` PyO3 type: lifecycle-managed `cudaHostRegister` handle
- `restore_from_pinned_mmap`: skip registration, reuse the pinned buffer directly
- `EngineSlot._pinned_mmap`: persistent per-slot pinned handle in `thaw serve`
- Automatic: any `thaw serve` deployment gets slot-warm swap with no API change
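The pin-once, reuse-afterwards pattern behind the items above can be illustrated with a small pure-Python simulation. This is a sketch of the caching idea only: `FakePinnedMmap` and `FakeSlot` are hypothetical stand-ins, they do not touch CUDA, and they are not thaw's actual `PinnedMmap` or `EngineSlot` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakePinnedMmap:
    """Hypothetical stand-in for a pinned snapshot mapping."""
    path: str
    registrations: int = 0

    def register(self) -> None:
        # In the real system, the one-time cudaHostRegister cost is paid here.
        self.registrations += 1


@dataclass
class FakeSlot:
    """Hypothetical pool slot that keeps its pinned handle across reloads."""
    snapshot_path: str
    _pinned: Optional[FakePinnedMmap] = None

    def reload(self) -> str:
        if self._pinned is None:
            # Cold path: create the handle and register it exactly once.
            self._pinned = FakePinnedMmap(self.snapshot_path)
            self._pinned.register()
            return "cold"
        # Warm path: the existing handle is reused, no re-registration.
        return "warm"


slot = FakeSlot("model.snapshot")  # placeholder path
kinds = [slot.reload() for _ in range(5)]
print(kinds)
```

Five reloads produce one cold pass followed by four warm ones, which mirrors the shape of the benchmark table: the registration cost appears once per slot and not once per swap.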
Required env
thaw serve currently requires VLLM_ENABLE_V1_MULTIPROCESSING=0 to reach across vLLM V1's process boundary.
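One way to make sure the variable is set is to build the child environment explicitly before launching the server. This is a minimal sketch: `thaw_serve_env` is a hypothetical helper name, and the launch command itself is left out because the release notes do not list the `thaw serve` arguments.

```python
import os
from typing import Mapping, Optional


def thaw_serve_env(base: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with the required variable set."""
    env = dict(os.environ if base is None else base)
    # Required by thaw serve per the release notes.
    env["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0"
    return env


env = thaw_serve_env()
print(env["VLLM_ENABLE_V1_MULTIPROCESSING"])
# Pass `env` to whatever launches the server, for example subprocess.run(cmd, env=env).
```

Copying the environment, instead of mutating `os.environ`, keeps the setting scoped to the server process.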
Bench: bench_slot_warm.py (https://github.com/thaw-ai/thaw/blob/main/bench_slot_warm.py) · correctness: bench_slot_warm_correctness.py