GLQ

Model Serving & MLOps

Post‑training weight quantization for large language models using E8 lattice codebooks, achieving 2–8 bits‑per‑weight with performance comparable to state‑of‑the‑art methods.

Python · Latest release: v0.5.1

Features

  • Encodes each 8‑weight group as a 16‑bit index into an E8 lattice codebook
  • Uses Randomized Hadamard Transform for decorrelation, enabling near‑optimal Euclidean nearest‑neighbour search
  • Provides fused CUDA kernels that matmul directly against compressed indices without full dequantization
  • Supports flexible bit‑width quantization (2–8 bpw, including fractional values)
  • Includes mixed‑precision allocation flow for per‑layer bpw tuning
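
The first two features can be illustrated with a small, pure-Python sketch. It is not GLQ's API: every function name below is hypothetical, and it snaps to the nearest point of the infinite E8 lattice (using the classic "D8 or D8 + 1/2" decoding rule) rather than looking up a 16-bit index in a finite codebook as the project does. It shows why a 16-bit index per 8-weight group costs 2 bits per weight, how a randomized Hadamard transform (random sign flips followed by an orthonormal Hadamard transform) can be applied and inverted, and how a group of 8 values is rounded to an E8 point.

```python
import math
import random


def bits_per_weight(index_bits=16, group_size=8):
    """Storage cost when one index covers a whole group of weights."""
    return index_bits / group_size


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for j in range(start, start + half):
                a, b = out[j], out[j + half]
                out[j], out[j + half] = a + b, a - b
        half *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def random_signs(n, seed=0):
    """Hypothetical helper: a reproducible vector of +/-1 sign flips."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def randomized_hadamard(values, signs):
    """Flip signs, then apply the orthonormal Hadamard transform."""
    return fwht([v * s for v, s in zip(values, signs)])


def inverse_randomized_hadamard(values, signs):
    """The orthonormal transform is its own inverse; then undo the sign flips."""
    return [v * s for v, s in zip(fwht(values), signs)]


def nearest_d8(x):
    """Nearest point of D8: integer coordinates whose sum is even."""
    r = [math.floor(v + 0.5) for v in x]
    if sum(r) % 2 != 0:
        # Re-round the coordinate with the largest rounding error the other way.
        k = max(range(len(x)), key=lambda i: abs(x[i] - r[i]))
        r[k] += 1 if x[k] > r[k] else -1
    return r


def nearest_e8(x):
    """Nearest point of E8, the union of D8 and D8 shifted by 1/2 in every coordinate."""
    a = nearest_d8(x)
    b = [v + 0.5 for v in nearest_d8([u - 0.5 for u in x])]
    dist_a = sum((u - v) ** 2 for u, v in zip(x, a))
    dist_b = sum((u - v) ** 2 for u, v in zip(x, b))
    return a if dist_a <= dist_b else b
```

A group of 8 weights would be passed through `randomized_hadamard`, snapped with `nearest_e8`, and mapped back with `inverse_randomized_hadamard`. Scaling, the mapping from lattice points to 16-bit indices, and the fused CUDA kernels are outside the scope of this sketch.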

Recent releases

Showing the 5 most recent of 24 releases.

  • v0.5.1 (breaking risk, no immediate action): Default inline-dequant E8 KV
  • v0.5.0 (new feature, no immediate action): Inline-dequant E8 KV cache
  • v0.3.5 (new feature, no immediate action): Auto‑PIECEWISE downgrade
  • v0.3.4 (new feature, no immediate action): CUDA-graph capture size tuning
  • v0.3.3 (bug fix, no immediate action): E8 KV regression fix + GLQ ops

About

Stars: 3
Forks: 1
Languages: Python, CUDA, C++

Install & Platforms

Install via: pip
Platforms: Linux, macOS, Windows, ARM64
