noonghunna/club-3090

v0.8.4 Feature

This release adds 3 notable features for engineering teams evaluating rollout.

Published 11d Model Serving & MLOps

View tool

✓ No known CVEs patched

Read the diff → Tool health → What is this tool? →

✓ No known CVEs patched in this version

Summary

AI summary

Broad release touches 📝 Documentation, 🐛 Bug fixes, 🧹 Other, and ✨ Features.

Changes in this release

Type	Severity	Summary	CVE
Feature
Feature	Medium	Adds ik iq4ks-mtp and iq4ks-mtp-vision to launch.sh and switch.sh. Adds ik iq4ks-mtp and iq4ks-mtp-vision to launch.sh and switch.sh. Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Medium	Adds ik_llama Qwen3.6-27B IQ4_KS composes for text (262K) and vision (160K). Adds ik_llama Qwen3.6-27B IQ4_KS composes for text (262K) and vision (160K). Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Medium	Exposes request‑level thinking toggles in eval. Exposes request‑level thinking toggles in eval. Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Medium	Exposes sampling defaults via environment variables in compose. Exposes sampling defaults via environment variables in compose. Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Medium	Sets WEIGHTS=gguf to fetch llama.cpp GGUF weights in setup. Sets WEIGHTS=gguf to fetch llama.cpp GGUF weights in setup. Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Medium	Passes --sampling-from-server through quality-test.sh and rebench-full.sh. Passes --sampling-from-server through quality-test.sh and rebench-full.sh. Source: llm_adapter@2026-05-28 Confidence: high	—
Feature	Low	Capture prefill throughput during NIAH rungs in verify-stress. Capture prefill throughput during NIAH rungs in verify-stress. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix
Bugfix	Medium	Fixes three live‑caught bugs in verify‑stress ceiling ladder. Fixes three live‑caught bugs in verify‑stress ceiling ladder. Source: llm_adapter@2026-05-28 Confidence: high	—
Bugfix	Medium	Adds CTX_SIZE‑scaled ceiling ladder to verify‑stress. Adds CTX_SIZE‑scaled ceiling ladder to verify‑stress. Source: llm_adapter@2026-05-28 Confidence: high	—
Bugfix	Medium	Recognizes llama‑cpp and ik‑llama containers in soak and preflight autodetect. Recognizes llama‑cpp and ik‑llama containers in soak and preflight autodetect. Source: llm_adapter@2026-05-28 Confidence: high	—
Bugfix	Medium	Lowers single‑card MTP CTX_SIZE default from 262144 to 200000 for llama.cpp and ik_llama. Lowers single‑card MTP CTX_SIZE default from 262144 to 200000 for llama.cpp and ik_llama. Source: llm_adapter@2026-05-28 Confidence: high	—
Bugfix	Low	Fix basename model ID handling in rebench aider/litellm step. Fix basename model ID handling in rebench aider/litellm step. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Record measured ceiling‑ladder result for ik iq4ks-mtp header in compose. Record measured ceiling‑ladder result for ik iq4ks-mtp header in compose. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Polish report handling of PyYAML, idle VRAM, P2P redaction, and kv-calc. Polish report handling of PyYAML, idle VRAM, P2P redaction, and kv-calc. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Guide users to MODEL_DIR/.env when weights are not found in launch script. Guide users to MODEL_DIR/.env when weights are not found in launch script. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Pin llama‑cpp Docker image to server‑cuda‑b9246 tag. Pin llama‑cpp Docker image to server‑cuda‑b9246 tag. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Change single‑card default suggestion in launch script to llamacpp/default. Change single‑card default suggestion in launch script to llamacpp/default. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—
Bugfix	Low	Always capture sandboxed-pack logs to per‑tag results directory in rebench. Always capture sandboxed-pack logs to per‑tag results directory in rebench. Source: granite4.1:30b@2026-05-28-audit Confidence: high	—

Full changelog

v0.8.4 — 2026-05-23

✨ Features

feat(verify-stress): capture prefill throughput during NIAH rungs (#199) (07d478c)
feat(eval): expose request-level thinking toggles (#196) (#196 by @noonghunna)
feat(scripts): pass --sampling-from-server through quality-test.sh + rebench-full.sh (dd1f070)
feat(compose): expose sampling defaults via env (#194) (#194 by @noonghunna)
feat(setup): WEIGHTS=gguf to fetch the llama.cpp GGUF (not just the vLLM model) (#191) (#191 by @noonghunna)
feat(ik-llama): wire iq4ks-mtp + iq4ks-mtp-vision into launch.sh + switch.sh (#189) (#189 by @noonghunna)
feat(models): add ik_llama Qwen3.6-27B IQ4_KS composes — text 262K + vision 160K (#180) (#180 by @noonghunna)

🐛 Bug fixes

fix(rebench): basename model id for the aider/litellm step (ik_llama full-path id → 0/30) (3b20ce3)
fix(soak,preflight): recognize llama-cpp / ik-llama containers in autodetect (#403) (d9fdab2)
fix(compose): ik iq4ks-mtp header — record measured ceiling-ladder result (200K confirmed) (1d93343)
fix(compose): lower single-card MTP CTX_SIZE default 262144 → 200000 (llama.cpp + ik_llama) (2e45928)
fix(verify-stress): three live-caught bugs in ceiling ladder (#199) (b84249c)
fix(verify-stress): add CTX_SIZE-scaled ceiling ladder (#199) (5a825a4)
fix(report): PyYAML/idle-VRAM/P2P/redaction/kv-calc polish + review fixes (#178/#137) (#192 by @noonghunna)
fix(launch): point users at MODEL_DIR/.env when weights aren't found (#190) (#190 by @noonghunna)
fix(llamacpp): pin image to server-cuda-b9246 (rolling tag broke at b9282) (#188) (#188 by @noonghunna)
fix(launch): single-card default suggestion → llamacpp/default (#185) (#185 by @noonghunna)
fix(rebench): always capture sandboxed-pack logs to the per-tag results dir (#179) (#179 by @noonghunna)

📝 Documentation

docs: correct ik_llama verdict — ~18-20% FASTER than mainline, not a "tie" (#184) (b7353da)
docs: add @mgabor3141 X399/TR-1950X dual.yml row + pre-Zen2 CPU-IPC note (#178) (6e49960)
docs(CLIFFS): document llama.cpp "boots ≠ fills" false ceiling; 200K = max-safe single-card CTX_SIZE (9be237d)
docs: QUALITY_TEST.md — fix stale pack-status (sandboxed packs now implemented) (f6bdc06)
docs: document sampling/temperature eval options (#193/#194 + benchlocal #19/#21) (9fd634a)
docs(single-card): strike Genesis-pinned vLLM rows (blocked by purged pin #167) (a30bdfd)
docs(upstream): correct the #40875 row (open tool-call-corruption bug, not "closed coexistence") (25f130a)
docs: correct ik_llama claims to the matched-power tie (#184) (c470d9a)
docs: surface WEIGHTS=gguf + switch.sh ik-llama paths (match #189/#191) (412315d)
docs(HARDWARE/FAQ): AMD-Vi IOMMU Xid 154 under TP=2 → iommu=pt fix (#178) (fe86b72)
docs: add ik_llama engine page + QUANTIZATION primer; surface IQK quants (554b85b)
docs(BENCHMARKS): @duart dual NVLink Proxmox VFIO-passthrough, stock-upstream no-Genesis (disc #162) (bc6e20b)
docs(BENCHMARKS): @mgabor3141 dual.yml — Z77/i7-3770K, PCIe 2.0 x4 slowest cross-card link (#178) (626fa68)
docs(mtp-vision): surface the -ub 512 → 192K context recipe in the compose header (70bf7e7)
docs: cross-link the -ub vs ctx trade-off into SINGLE_CARD + CLIFFS + FAQ (035261b)
charts: compose names on x-axis + description legend block below (07c7cd0)
charts: tighten single-card label format (line 1 = variant + ctx, line 2 = modifier) (9aa8fa7)

🛠️ Scripts + tooling

scripts: endpoint-first --url/--model/--engine for non-Docker engines (#174) (#174 by @noonghunna)
report.sh: capture image digest + OCI labels (build tag, upstream commit) (78556f8)

🧹 Maintenance

chore(compose): drop accidentally-committed qwopus3.6-27b-v2 llama.cpp compose (b8aeb93)
refactor(llamacpp): collapse single-card composes 3→2 (default = mtp alias) (#181) (#181 by @noonghunna)

🧹 Other

Fix verify-full to accept reasoning_content (3a04ae5)
quality-test: respect explicit MODEL/--model, don't clobber from /v1/models (#177) (#177 by @noonghunna)
sglang: park EAGLE-3 path for Qwen3-Next (MTP wins everywhere) (#176) (#176 by @noonghunna)
quality-test: expose --timeout-per-case + bump aider-polyglot-30 to 3600s (#175) (#175 by @noonghunna)
sglang: experimental EAGLE-3 + Qwen3-Next dual-3090 path (Codex-led patch) (941fa06)
SINGLE_CARD: refresh Luce DFlash + PFlash watch-list (2026-05-20) (f9f9640)
AGENTS: pin engine images only when we vendor patches (6810768)
llama-cpp: document speed-vs-context trade-off + fix stale ub default (1b2a76c)
llama-cpp: switch to rolling :server-cuda tag (no patches → no pin needed) (4a53eda)
llama-cpp: replace orphan llama-cpp:local with upstream pinned image (#170) (c3e7c7e)
gpu-mode status: probe :8020 + detect engine on :8030 (db9c5e1)

[Pin: git checkout v0.8.4] · Full diff

View diff on GitHub

Weekly OSS security release digest.

The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.

No spam, unsubscribe anytime.

Share this release

Share on X Share on Bluesky

Track noonghunna/club-3090

Get notified when new releases ship.

About noonghunna/club-3090

All releases →

Related context

Related tools

Earlier breaking changes

v0.8.7 Genesis vLLM composes deprecated; default to `vllm/minimal`.
v0.8.6 Compose paths moved to `models/<model>/<engine>/compose/<topology>/<quant>/<serving>.yml`.