This release adds 3 notable features for engineering teams evaluating rollout.
Published 2mo
Model Serving & MLOps
✓ No known CVEs patched
✓ No known CVEs patched in this version
Topics
inference
llm
model-compression
pytorch
quantization
Summary
AI summaryCPU offloading for 7B+ model quantization prevents OOM on large models with limited VRAM.
Full changelog
What's new
- CPU offloading for 7B+ model quantization: Model layers are now loaded to CPU and moved to GPU one at a time during quantization, preventing OOM on large models with 24GB VRAM GPUs.
- Mistral-7B-v0.3 benchmark: GLQ 3-bit achieves 4.41 perplexity (1.05x vs bf16 4.20) using only 4.4 GB GPU memory vs 14.5 GB for bf16.
- Llama-3.2-3B benchmarks: GLQ 3-bit (6.78 PPL, 1.10x) and 2-bit (8.49 PPL, 1.38x) results added.
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About Glq
All releases →Related context
Related tools
Beta — feedback welcome: [email protected]