Release history
LLMKube releases
Kubernetes operator for llama.cpp-native LLM inference with GPU scheduling, Apple Silicon Metal support, and OpenAI-compatible API.
- After upgrading to v0.8.1, re-apply all Agent CRs so existing Agents pick up explicit values for requestTimeoutSeconds, whose meaning changed, and for the new requestTurnTimeoutSeconds field.
- Agent.spec.requestTimeoutSeconds now represents a loop-wide wall-clock budget (default 3600, previously 600) instead of a per-request HTTP timeout; the former per-request behavior moves to Agent.spec.requestTurnTimeoutSeconds (default 120).
- **inferenceservice:** adds typed spec.ropeScaling for RoPE/YaRN context extension
Full changelog
0.8.1 (2026-06-01)
⚠ BREAKING CHANGES
- foreman: Agent.spec.requestTimeoutSeconds changes meaning from a per-request HTTP timeout to a loop-wide wall-clock budget, and its default moves from 600 to 3600. The former per-request bound is now the new Agent.spec.requestTurnTimeoutSeconds (default 120). Re-apply your Agent CRs after upgrade so existing Agents pick up explicit values.
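One way to act on this note is to set both fields explicitly instead of relying on defaults. The sketch below only builds a JSON merge patch. The helper name `agent_timeout_patch`, the resource name `agent`, the object name `my-agent`, and the assumption that both fields are plain integers directly under `spec` are illustrative and not taken from the release notes.

```shell
# Hypothetical helper: build a JSON merge patch that sets both timeout fields.
# $1 = loop-wide wall-clock budget in seconds (requestTimeoutSeconds)
# $2 = per-turn HTTP timeout in seconds (requestTurnTimeoutSeconds)
agent_timeout_patch() {
  printf '{"spec":{"requestTimeoutSeconds":%d,"requestTurnTimeoutSeconds":%d}}\n' "$1" "$2"
}

# Print the patch using the v0.8.1 defaults named above.
agent_timeout_patch 3600 120

# Applying it needs cluster access; "agent" and "my-agent" are assumed names:
# kubectl patch agent my-agent --type merge -p "$(agent_timeout_patch 3600 120)"
```

Agents that relied on the old 600-second per-request timeout may want a larger second argument than the new 120-second default.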
Features
Bug Fixes
- foreman: recover orphaned phase=Running tasks on agent restart (#542) (#598) (6dd2c44)
- foreman: split per-turn timeout from loop-wide budget (#532) (#602) (41e7663)
- foreman: warm-path reviewer scheduling on macOS (#578, #579) (#597) (a94d1ef)
- metal-agent: prefer routable interface for host-IP auto-detect (#526) (#599) (c780795)
Documentation
- foreman: absolute paths in overview README cross-refs (fix llmkube-web prerender) (#596) (b5f6f94)
- foreman: move docs/foreman to docs/site/foreman + register in site nav (#594) (9fd85bb)
Miscellaneous
Minor fixes and improvements.
Changelog
A Helm chart for LLMKube - Kubernetes operator for GPU-accelerated LLM inference
- OpenShift / OKD / MicroShift installs must use `helm ... -f charts/llmkube/values-openshift.yaml` to allow restricted-v2 SCC to inject fsGroup.
- Operators with a custom `--init-container-image` whose user is not curl (uid=101 gid=102) should set `spec.podSecurityContext` on each InferenceService or pass `--default-fsgroup=<gid>` to the controller.
- OpenShift made a first-class deploy target (ci+chart changes)
- VLLMConfig CRD now includes gpuMemoryUtilization and cpuOffloadGB fields
- metal-agent emits Kubernetes events for memory-pressure, evictions, skips, and respawn blocks
Full changelog
0.7.7 (2026-05-11)
Features
- agent: vllm-swift runtime + TurboQuant passthrough (#391) (#393) (2691e67)
- ci+chart: make OpenShift a first-class deploy target (closes #421) (#422) (798a13e)
- crd: add gpuMemoryUtilization and cpuOffloadGB to VLLMConfig (#394) (6883f78)
- metal-agent: emit Kubernetes events for memory-pressure transitions, evictions, skips, and respawn blocks (closes #390) (#411) (e0d17d1)
- observability: runtime label on inference pods + recording rules + starter dashboard (refs #409) (#410) (71743ed)
Bug Fixes
- controller: default FSGroup to curl_group + Longhorn-backed e2e job (closes #418, closes #420) (adce90f)
- controller: stop hot-spinning on unreachable file:// model sources (closes #405) (#412) (4ac6f57)
Documentation
- add NVIDIA Blackwell B200 (sm_100) validation matrix (refs #413) (#414) (bfda149)
- operations: seed runbooks index + first 2 entries (file:// hot-spin, metal-agent memory pressure) (#417) (d3bce8d)
- port concepts/comparison to markdown (first Phase 1C content port) (#403) (51c396b)
- readme: HN-launch readiness fixes (broken link, Apple Silicon CTA, quickstart memory) (#401) (3e44bfb)
- refresh quickstart cast for v0.7.6 (HN launch) (#404) (5abaddb)
- split docs/ into site/ and contributors/, prep for site rendering (#396) (9299a31)
- upgrade: OpenShift / OKD / MicroShift installs must use `helm ... -f charts/llmkube/values-openshift.yaml` so restricted-v2 SCC can inject fsGroup from the namespace's allocated range (adce90f)
- upgrade: operators using a custom `--init-container-image` whose user is not curl (uid=101 gid=102) should set `spec.podSecurityContext` on each InferenceService or pass `--default-fsgroup=<gid>` to the controller (adce90f)
- upgrade: v0.7.7 rolls every InferenceService Pod once on first reconcile (Deployment template gains fsGroup=102 and the new `inference.llmkube.dev/runtime` label) (adce90f)
- agent: added eviction safety floor, evictionProtection opt‑out, and late‑spawn condition fix
- agent: introduced memory‑pressure eviction with respawn protection
- api: passthrough podAnnotations and podLabels on InferenceServiceSpec
Full changelog
0.7.6 (2026-05-03)
Features
- agent: eviction safety floor + evictionProtection opt-out + late-spawn condition fix (#186) (#384) (6544747)
- agent: memory-pressure eviction and respawn protection (#186) (#382) (65a78b5)
- api: add podAnnotations and podLabels passthrough (closes #326) (#381) (baecd68)
- api: expose runtimeClassName on InferenceServiceSpec (closes #375) (#380) (cc44ff5)
- crd: add ParallelSlots support for vllm and fix llamacpp (#340) (d81babb)
Bug Fixes
- catalog: default phi-4-mini context to 8K (closes #386) (#387) (7bcd685)
- controller: drop model label from Deployment selector to make modelRef mutable (closes #301) (#385) (a1de3bf)
- derive metal InferenceService phase from Endpoints, not desiredReplicas (closes #374) (#376) (350dafe)
Documentation
Fixed Helm chart by syncing CRDs from kubebuilder source and adding a CI guard.
Full changelog
- Controller pins vLLM default image to version 0.20.0
Full changelog
- Added cacheTypeCustomK/V for non-enum llama.cpp KV cache types
- Added kvCacheCustomDtype for non-enum vLLM KV cache types
Full changelog
0.7.3 (2026-04-29)
Features
- agent: cache-type-aware memory estimator + TurboQuant docs (#355) (0697afd)
- api: add cacheTypeCustomK/V for non-enum llama.cpp KV cache types (#351) (71bd762)
- api: add kvCacheCustomDtype for non-enum vLLM KV cache types (#359) (5e796d0)
Bug Fixes
- agent: respawn on InferenceService spec drift, honor replicas=0, and plumb full spec to llama-server flags (#353) (ff54cad)
- controller: use GGUF metadata name for downloaded model file basename (#347) (e932c7a)
- vllm: set enableServiceLinks=false on vLLM Pod spec (#361) (01eb5c5)
- vllm: use positional model argument instead of deprecated --model (#360) (a17566c)
- Apple Silicon power gauges exposed via powermetrics
- One-command make targets for installing and uninstalling powermetrics-sudo
- Apple Silicon-optimized flags for llama-server in agent
- values.schema.json added to Helm chart for value validation
- agentic-coding flags extended in InferenceService vLLM config
Full changelog
0.7.1 (2026-04-25)
Features
- agent: pass Apple Silicon-optimized flags to llama-server (#327) (a69ab6a)
- chart: add values.schema.json for Helm value validation (#322) (1f8a34d)
- crd: extend InferenceService vLLM config for agentic-coding flags (#306) (cb2aa6a)
- security: supply-chain MVP — checksum install, govulncheck, gosec, codecov (#310) (f17f59d)
Bug Fixes
- agent: detect stalled K8s polling and exit for supervisor restart (#328) (c0636cc)
- agent: let the kernel pick free ports for llama-server (#321) (8111395)
- bump InferenceService spec.contextSize cap from 131072 to 2097152 (#300) (a46a1bf)
Documentation
- `sharding.strategy: tensor` now maps to llama.cpp `--split-mode row` instead of `--split-mode layer`; set `strategy: layer` to retain the previous behavior
- InferenceService `spec.extraArgs` is now forwarded to the vLLM runtime (previously it was ignored); vLLM configs that carry llama.cpp-only flags will fail at pod startup
- Hybrid GPU/CPU offloading support for MoE models
- Tensor overrides and batch size controls for hybrid offloading
- Additional runtime controls for llama.cpp and vllm
Full changelog
0.7.0 (2026-04-18)
⚠ BREAKING CHANGES
- sharding: `sharding.strategy: tensor` on a Model now correctly maps to llama.cpp's `--split-mode row` instead of silently falling back to `--split-mode layer`. Configs that set `strategy: tensor` expecting layer behavior may see performance regressions or new failure modes under concurrent load (particularly on consumer PCIe multi-GPU setups with quantized models). Explicitly set `strategy: layer` to retain the previous behavior. (#291)
- vllm: InferenceService `spec.extraArgs` is now forwarded to the vLLM runtime. Previously `extraArgs` was silently ignored when `runtime: vllm`. Configs that placed llama.cpp-only flags in `extraArgs` on a vLLM InferenceService will start failing at pod startup. Audit any vLLM InferenceService that sets `extraArgs` before upgrading. (#291)
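The audit recommended above could start from manifests kept in version control. The following is a minimal sketch, assuming the manifests are local `.yaml` or `.yml` files; the function name `audit_manifests` and the output labels are hypothetical. It matches plain text only and does not parse YAML, so each hit is a file to review by hand rather than a confirmed problem.

```shell
# Hypothetical audit helper for the 0.7.0 breaking changes. It scans local
# YAML manifests with plain text matching, so treat hits as leads to review.
audit_manifests() {
  dir="${1:-.}"

  # Files that set strategy: tensor (now mapped to --split-mode row).
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) \
    -exec grep -lE 'strategy:[[:space:]]*tensor' {} + |
    sed 's/^/tensor-sharding: /'

  # Files that mention both runtime: vllm and extraArgs (now forwarded to vLLM).
  find "$dir" -type f \( -name '*.yaml' -o -name '*.yml' \) \
    -exec grep -lE 'runtime:[[:space:]]*vllm' {} + |
    while IFS= read -r file; do
      if grep -q 'extraArgs:' "$file"; then
        printf 'vllm-extraArgs: %s\n' "$file"
      fi
    done
}
```

Running `audit_manifests ./manifests` prints one labeled line per matching file. A multi-document file that holds a vLLM service and an unrelated `extraArgs` entry is reported as well, which is acceptable for a pre-upgrade review.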
Features
- add hybrid GPU/CPU offloading support for MoE models (#281) (2287f66)
- add tensor overrides and batch size controls for hybrid offloading (#283) (8be4adc)
- expose additional runtime controls for llama.cpp and vllm (#291) (2245718)
- recognize runtime-resolved sources (HF repo IDs) in Model controller (#293) (953e8a7)
Bug Fixes
Documentation
- Default CUDA image changed to server-cuda13 for Qwen3.5 and Blackwell support.
- First-class PersonaPlex (Moshi) runtime backend added
- Grafana inference metrics dashboard added
- HPA autoscaling for InferenceService added
Full changelog
0.6.0 (2026-04-08)
⚠ BREAKING CHANGES
- update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262)
Features
- add first-class PersonaPlex (Moshi) runtime backend (#272) (2b1c948)
- add Grafana inference metrics dashboard (#269) (be376c6)
- add HPA autoscaling for InferenceService (#260) (2d16502)
- add pluggable runtime backends for non-llama.cpp inference engines (#271) (bb1576c)
- add vLLM and TGI runtime backends with per-runtime HPA metrics (#273) (441c7c7)
- separate image registry from repository in Helm chart (#268) (5c059a4)
- support custom layer splits from GPUShardingSpec (#267) (a37701c)
- update default CUDA image to server-cuda13 for Qwen3.5 and Blackwell support (#262) (cc9a95e)
- KV cache type configuration and extraArgs escape hatch
- Ollama runtime backend for Metal agent
- oMLX alternative runtime backend for Metal agent
Full changelog
- Add pod security context defaults and CRD overrides
Full changelog
- Memory pressure watchdog with runtime monitoring
- PVC:// model source and SHA256 integrity verification
- Auto-detect llama-server from Homebrew paths on macOS
Fixed Helm chart appVersion mismatch with the published controller image.
Changelog
Helm chart for LLMKube v0.5.0 — fixes appVersion to match published controller image
- Added per-model `memoryBudget` and `memoryFraction` CRD fields.
- Added pre‑flight memory validation for Metal agent.
- Added health checks, metrics, and continuous monitoring to Metal agent.
Full changelog
0.5.0 (2026-03-04)
Features
- add pre-flight memory validation for Metal agent (#204) (ba252ef)
- add health checks, metrics, and continuous monitoring to Metal agent (#205) (a113fd1)
- add per-model memoryBudget and memoryFraction CRD fields (#206) (e632369)
Bug Fixes
- agent: unregister service endpoints on metal process delete (#168) (147b9bc)
- enable controller metrics endpoint in Helm chart (#195) (70940af)
- prevent model re-download of cached models after helm upgrade (#203) (a8f9a88)
- use Recreate strategy for GPU workloads to prevent rolling update deadlock (#196) (2e45181)
Documentation
- Prevent command injection in init container shell commands — mitigates remote code execution vulnerability
- License compliance scanning for GGUF models
- Prometheus metrics, OpenTelemetry tracing, and inference observability added
- PVC inspection in cache list to detect orphaned entries
Full changelog
0.4.20 (2026-02-28)
Features
- add license compliance scanning for GGUF models (#188) (c26400a)
- add Prometheus metrics, OpenTelemetry tracing, and inference observability (#189) (c653ff1)
- add PVC inspection to cache list for orphaned entry detection (#183) (2723d92)
- agent: add structured zap logging to metal agent (#164) (e9d143c)
- deps: upgrade to Kubernetes 1.35 and controller-runtime v0.23.1 (#175) (3c323f4)
Bug Fixes
- correct Metal quickstart docs for selectorless services (#173) (89471ec)
- prevent command injection in init container shell commands (#172) (3aa9cc3)
- remove mutable latest tags and pin container images (#174) (3c4569a)
Documentation
- Added --jinja flag to enable Jinja templating for tool and function calls
Full changelog
Fixed reading contextSize from the InferenceService CRD in the agent.
Full changelog
Fixed agent filtering of InferenceServices to match the correct Metal accelerator type.
Full changelog
- Added --host-ip flag to agent for remote Kubernetes cluster support
Full changelog
Fixed inference flag passing for newer llama.cpp versions.
Full changelog
- Native Go GGUF parser with CRD integration and CLI inspect command
- FlashAttention support added to inference manifest
- ContextSize parameter introduced in sample manifest
Full changelog
- Controller init container image is now configurable.
- InferenceService CRD exposes llama.cpp parallel slots setting.
- Helm chart adds optional NetworkPolicy for controller manager.
Full changelog
0.4.13 (2026-02-07)
Features
- Support for custom Certificate Authorities (CA)
- Fixed deprecated image tags
Full changelog
- Air‑gapped deployment support for environments without internet access
- 32B model catalog with --context flag support
- GPU observability configuration and Grafana dashboard
Full changelog
What's New in v0.4.10
Features
- Air-gapped deployment support - Deploy models from local file paths for environments without internet access (#85)
- 32B models in catalog - Added larger models with `--context` flag support (#88)
- GPU observability - New configuration and Grafana dashboard for GPU metrics (#105)
- Benchmark test suites - Comprehensive benchmark sweeps for performance testing (#107)
- Stress testing mode - New stress testing capabilities in the benchmark command (#104)
Documentation
- Added community standards and security policy (#92)
- Updated documentation for v0.4.9 GPU scheduling features (#83)
Installation
Homebrew (Recommended for macOS)
brew install defilantech/tap/llmkube
Install Script (Linux/macOS)
curl -sSL https://raw.githubusercontent.com/defilantech/LLMKube/main/install.sh | bash
Manual Download
macOS
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.10/LLMKube_0.4.10_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.10/LLMKube_0.4.10_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.10/LLMKube_0.4.10_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.10/LLMKube_0.4.10_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows: Download the .zip file and add llmkube.exe to your PATH.
Verify Installation
llmkube version
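The manual download commands above differ only in OS and architecture, so the URL can be derived from `uname`. This is a small sketch with hypothetical function names (`detect_os`, `detect_arch`, `llmkube_asset_url`). It follows the `LLMKube_<version>_<os>_<arch>.tar.gz` naming shown for v0.4.10; other releases on this page (for example v0.4.4) use a lowercase `llmkube_` prefix, so the pattern should not be assumed for other versions.

```shell
# Map `uname -m` output to the architecture names used in the asset filenames.
detect_arch() {
  case "$(uname -m)" in
    x86_64|amd64) echo amd64 ;;
    arm64|aarch64) echo arm64 ;;
    *) echo "unsupported architecture: $(uname -m)" >&2; return 1 ;;
  esac
}

# Map `uname -s` output to darwin or linux.
detect_os() {
  case "$(uname -s)" in
    Darwin) echo darwin ;;
    Linux) echo linux ;;
    *) echo "unsupported OS: $(uname -s)" >&2; return 1 ;;
  esac
}

# Build the tarball URL following the v0.4.10 naming shown above.
llmkube_asset_url() {
  version="$1" os="$2" arch="$3"
  printf 'https://github.com/defilantech/LLMKube/releases/download/v%s/LLMKube_%s_%s_%s.tar.gz\n' \
    "$version" "$version" "$os" "$arch"
}

# Example: resolve the URL for this machine, then download and install it.
# url="$(llmkube_asset_url 0.4.10 "$(detect_os)" "$(detect_arch)")"
# curl -L "$url" | tar xz && sudo mv llmkube /usr/local/bin/
```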
Full Changelog: https://github.com/defilantech/LLMKube/compare/v0.4.9...v0.4.10
- GPU contention visibility with queue position and priority classes
Full changelog
- Support configurable context size for llama.cpp server
Full changelog
Fixed bug where Helm chart releases were incorrectly marked as the latest.
Fixed empty component causing "llmkube-" prefix in release identifiers.
Full changelog
Fixed CI workflow to trigger GoReleaser and Helm release.
Full changelog
LLMKube v0.4.4
Release Date: 2025-11-26T19:10:25Z
See RELEASE_NOTES_v0.4.4.md for complete details.
Changelog
Bug Fixes
- 9a37a77e556d6f811cb6a090125a4a73e2e9c346: fix: Trigger GoReleaser and Helm release from Release Please workflow (#64) (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.4/llmkube_0.4.4_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.4/llmkube_0.4.4_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.4/llmkube_0.4.4_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.4/llmkube_0.4.4_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.4.4.md
- Metal GPU support for macOS (Apple Silicon)
- Model catalog with 10 pre‑configured models
- Add benchmark command and reorganize documentation
Full changelog
0.4.3 (2025-11-26)
Features
- Add benchmark command and reorganize documentation (58307be)
- Add benchmark command and reorganize documentation (ac8888e), closes #6
- Add Helm chart for easy installation (5718804)
- Add Helm chart for easy installation with comprehensive CI testing (3ea3bfd), closes #9
- Add Metal GPU support for macOS (Apple Silicon) (f673c26), closes #33
- Add model catalog with 10 pre-configured models (404d722)
- Add model catalog with 10 pre-configured models (Phase 1) (0fd969a)
- Add persistent model cache to avoid re-downloading (83f844f), closes #52
- Add Release Please automation and version-agnostic docs (dc2d54e)
- helm: Add image digest support for production deployments (a38801d)
- Implement automatic port forwarding for benchmark command (472b3ae)
- Multi-GPU support with layer-based sharding (#47) (4797609)
- Persistent model cache with per-namespace PVC support (ab04261)
- Set up Helm repository on GitHub Pages (8d62737)
- Support per-namespace model cache PVCs (c3cb891)
Bug Fixes
- Add cacheKey to CRD and restrict cache to llmkube-system namespace (464c23d)
- Add CRD keep policy and improve security test reliability (ff32296)
- Add Helm chart publishing to release workflow (8baf9c4)
- Add Helm chart publishing to release workflow (03bab72)
- Add Homebrew archive IDs and v0.3.0 release notes (cea933b)
- Address lint issues in benchmark command (bf80610)
- Address linter errors in catalog implementation (8932e4f)
- Address linter issues in Metal agent code (3f1f678)
- controller: Add Model watch to InferenceService controller (cb4e201)
- Correct CLI binary path in E2E tests (41af555)
- Fix GoReleaser Homebrew tap configuration for v0.3.0 (4e95c04)
- Further increase Helm CI timeout and readiness probe delay (5453d66)
- Further increase Helm CI timeout and readiness probe delay (fd577d3)
- Handle resp.Body.Close error in version check (linter) (fb3adf5)
- Increase Helm chart CI timeout from 2m to 5m (7a08b45)
- Increase Helm chart CI timeout from 2m to 5m (ced2210)
- InferenceService stuck in Pending when Model becomes Ready (4d20aec)
- Metal agent production fixes and testing improvements (8744c7b)
- Resolve Helm chart CI test failures (9919696)
- Resolve staticcheck SA5011 lint errors and update CONTRIBUTING.md (#60) (c0b5824)
- Sanitize Service names for DNS-1035 compliance (v0.3.3) (db81990)
- Sanitize Service names to comply with DNS-1035 requirements (b431986)
- Skip containerized Deployment for Metal accelerator and add version check (d300e64)
- Skip containerized Deployment for Metal accelerator and add version check (8dab955)
- Suppress Endpoints API deprecation warnings (e70a4b3)
- Update operator deployment to use correct container image (00fee75)
- Update operator deployment to use correct container image (4c67a78)
- Update version.go to 0.2.1 and add automation for future releases (8dd613d)
- Update version.go to 0.2.1 and add automation for future releases (2ff68bd)
- Use simple v* tag format for releases (#62) (bda9f19)
- Use workspace path for kubeconform validation (fc066d8)
Documentation
- Add CLI option to quick start, keep kubectl as fallback (f6829ee)
- Add release notes for v0.3.2 (177abf8)
- Add release notes for v0.3.2 (ca1bb12)
- Add release notes for v0.4.0 (144b960)
- Add release notes for v0.4.0 (a61321f)
- Overhaul README and roadmap for public launch (b42c17e)
- Update binary download links to version 0.2.1 (fad530a)
- Update binary download links to version 0.2.1 (63bb0fa)
- Update Helm installation to use GitHub Pages repository (477e037)
- Update MODEL-CACHE.md for per-namespace PVC pattern (0be3f46)
- Add benchmark command and reorganize documentation
- Add persistent model cache to avoid re‑downloading with per‑namespace PVC support
- **helm:** Add image digest support for production deployments
Full changelog
0.4.1 (2025-11-26)
Features
- Add benchmark command and reorganize documentation (58307be)
- Add benchmark command and reorganize documentation (ac8888e), closes #6
- Add persistent model cache to avoid re-downloading (83f844f), closes #52
- Add Release Please automation and version-agnostic docs (dc2d54e)
- helm: Add image digest support for production deployments (a38801d)
- Implement automatic port forwarding for benchmark command (472b3ae)
- Persistent model cache with per-namespace PVC support (ab04261)
- Support per-namespace model cache PVCs (c3cb891)
Bug Fixes
- Add cacheKey to CRD and restrict cache to llmkube-system namespace (464c23d)
- Address lint issues in benchmark command (bf80610)
Documentation
- Update MODEL-CACHE.md for per-namespace PVC pattern (0be3f46)
- Multi‑GPU support with layer‑based sharding
Full changelog
LLMKube v0.4.0
Release Date: 2025-11-26T00:23:11Z
See RELEASE_NOTES_v0.4.0.md for complete details.
Changelog
New Features
- 479760973eb811a0b7a71c711f52ca3d8695b761: feat: Multi-GPU support with layer-based sharding (#47) (@Defilan)
Bug Fixes
- 03bab72a74496085b79e3c51838f9853ed674062: fix: Add Helm chart publishing to release workflow (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.0/llmkube_0.4.0_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.0/llmkube_0.4.0_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.0/llmkube_0.4.0_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.4.0/llmkube_0.4.0_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.4.0.md
Fixed Service names to comply with DNS-1035 requirements.
Full changelog
LLMKube v0.3.3
Release Date: 2025-11-24T17:07:23Z
See RELEASE_NOTES_v0.3.3.md for complete details.
Changelog
Bug Fixes
- b431986ceae6b383ee064bec595c922a42394a8e: fix: Sanitize Service names to comply with DNS-1035 requirements (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.3/llmkube_0.3.3_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.3/llmkube_0.3.3_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.3/llmkube_0.3.3_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.3/llmkube_0.3.3_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.3.3.md
Fixed resp.Body.Close error handling in version check and skipped containerized Deployment for Metal accelerator.
Full changelog
LLMKube v0.3.2
Release Date: 2025-11-24T16:28:19Z
See RELEASE_NOTES_v0.3.2.md for complete details.
Changelog
Bug Fixes
- fb3adf57913744e08ebffb58af6877bd15fbeb93: fix: Handle resp.Body.Close error in version check (linter) (@Defilan)
- 8dab955a2d1e728fe8a9b1b2971a4906454d71c3: fix: Skip containerized Deployment for Metal accelerator and add version check (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.2/llmkube_0.3.2_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.2/llmkube_0.3.2_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.2/llmkube_0.3.2_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.2/llmkube_0.3.2_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.3.2.md
Fixed controller OOM by increasing memory limits.
Full changelog
LLMKube v0.3.1
Release Date: 2025-11-24T09:17:13Z
See RELEASE_NOTES_v0.3.1.md for complete details.
Changelog
Bug Fixes
- fd577d3137da086346524f1802e47219feefa1fa: fix: Further increase Helm CI timeout and readiness probe delay (@Defilan)
- ced2210ea28d453fdac4c7346bc98f66684893b1: fix: Increase Helm chart CI timeout from 2m to 5m (@Defilan)
- 4c67a7806232c687b7b2450660735d9265d507b8: fix: Update operator deployment to use correct container image (@Defilan)
Other Changes
- 3e60a3031ef0f443209c0088e84f1a01dd1f6c1a: Release v0.3.1: Fix controller OOM with increased memory limits (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.1/llmkube_0.3.1_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.1/llmkube_0.3.1_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.1/llmkube_0.3.1_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.1/llmkube_0.3.1_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.3.1.md
- Add Metal GPU support for macOS (Apple Silicon)
Full changelog
LLMKube v0.3.0
Release Date: 2025-11-24T06:15:31Z
See RELEASE_NOTES_v0.3.0.md for complete details.
Changelog
New Features
- f673c26bd4ac1a285dc7e72ffe6a930bc586b855: feat: Add Metal GPU support for macOS (Apple Silicon) (@Defilan)
Bug Fixes
- cea933beac2607122772d14184b35da04738b7f9: fix: Add Homebrew archive IDs and v0.3.0 release notes (@Defilan)
- 3f1f678502c985b04d48a1c8c8bc44ea68d8a542: fix: Address linter issues in Metal agent code (@Defilan)
- 8744c7b54e23cbb77609a97340d9be9dd5da931c: fix: Metal agent production fixes and testing improvements (@Defilan)
- e70a4b391725a70a82d78d47a7d4f6d2b898dcc8: fix: Suppress Endpoints API deprecation warnings (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.0/llmkube_0.3.0_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.0/llmkube_0.3.0_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.0/llmkube_0.3.0_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.3.0/llmkube_0.3.0_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.3.0.md
- Add Helm chart for easy installation with comprehensive CI testing
- Add model catalog featuring ten pre‑configured models
Full changelog
LLMKube v0.2.2
Release Date: 2025-11-24T02:00:38Z
See RELEASE_NOTES_v0.2.2.md for complete details.
Changelog
New Features
- 3ea3bfd27ce864f7884f25ae9db65ed52eb68e01: feat: Add Helm chart for easy installation with comprehensive CI testing (@Defilan)
- 404d722e70d3e885f1e437ebdadf38fe43c7689a: feat: Add model catalog with 10 pre-configured models (@Defilan)
Bug Fixes
- ff32296a45174bdce6070844a68007e2c45cf3fe: fix: Add CRD keep policy and improve security test reliability (@Defilan)
- 8932e4fbb3fe8d1fea1fedba5bb18f3cd02808c8: fix: Address linter errors in catalog implementation (@Defilan)
- 41af55589ba6b17f07119b50d96db9c39eac6ea3: fix: Correct CLI binary path in E2E tests (@Defilan)
- 99196961bf91e4c285182211a7a6fdec574ae7e7: fix: Resolve Helm chart CI test failures (@Defilan)
- 2ff68bdc0e40ab9ee8337403af649fda7354ad7c: fix: Update version.go to 0.2.1 and add automation for future releases (@Defilan)
- fc066d8d0f9175382fa7cfab5f40c755739e175f: fix: Use workspace path for kubeconform validation (@Defilan)
Other Changes
- aa84b601d75753c585cacace76311fbbac598080: Add Minikube quickstart guide and improve CLI-first documentation (@Defilan)
- 5f08b27232102d17a0e2ae59f74176ed25a9689b: Update docs to recommend local controller for Minikube/local development (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.2/llmkube_0.2.2_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.2/llmkube_0.2.2_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.2/llmkube_0.2.2_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.2/llmkube_0.2.2_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.2.2.md
Added the missing Model watch to the InferenceService controller.
Full changelog
LLMKube v0.2.1
Release Date: 2025-11-18T16:21:32Z
See RELEASE_NOTES_v0.2.1.md for complete details.
Changelog
Other Changes
- cb4e2019583a811fa98af1a446bd0df6b6c3cba2: fix(controller): Add Model watch to InferenceService controller (@Defilan)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.1/llmkube_0.2.1_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.1/llmkube_0.2.1_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.1/llmkube_0.2.1_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.1/llmkube_0.2.1_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.2.1.md
Initial public release of LLMKube.
Full changelog
LLMKube v0.2.0
Release Date: 2025-11-18T06:34:01Z
See RELEASE_NOTES_v0.2.0.md for complete details.
Changelog
Other Changes
- f821f0f073040d82613e8ed809ab2d402f1fb2a7: Initial public release: LLMKube v0.2.0 (Christopher Maher)
Installation
macOS
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.0/llmkube_0.2.0_darwin_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64 (Apple Silicon)
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.0/llmkube_0.2.0_darwin_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Linux
# AMD64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.0/llmkube_0.2.0_linux_amd64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
# ARM64
curl -L https://github.com/defilantech/LLMKube/releases/download/v0.2.0/llmkube_0.2.0_linux_arm64.tar.gz | tar xz
sudo mv llmkube /usr/local/bin/
Windows
Download the .zip file for your architecture and add llmkube.exe to your PATH.
Verify Installation
llmkube version
Next Steps
Full Release Notes: RELEASE_NOTES_v0.2.0.md