This release includes 1 security fix for security teams reviewing exposed deployments.
Topics
+11 more
Affected surfaces
Summary
AI summaryingero trace --inference becomes a production‑grade daemon for GPU inference with phase‑aware baselines and comprehensive observability upgrades.
Full changelog
TL;DR
ingero trace --inference is now a first-class production daemon
shape for GPU inference workloads (vLLM, SGLang, TGI, Triton),
sitting alongside the training-flavoured --fleet-workload-type
path that v0.10..v0.15 built out. One flag bundles sub-second
causal windows, an event sampler attached to the on-disk store, DB
file rollover with retention, a typed UDS outlier protocol, and a
per-workload step-duration baseline + outlier classifier. The
classifier is phase-aware (prefill / decode / mixed / unknown)
so a 10x slow decode on a vLLM continuous-batching stream gets
compared against the decode baseline (~5 ms p95), not the prefill
tail (~180 ms p95) it used to be absorbed into.
The same release closes the loop on the rest of the inference
observability surface:
ingero.infer.*metrics emit on both OTLP and Prometheus,
attributed by(cgroup_path_hash, pid, stream_handle, phase),
withgen_ai.request.model+gen_ai.systemenrichment from
the engine scraper (vllm / tgi / sglang / triton). The
resource-attribute trio (ingero.node.id,
ingero.cluster.id,ingero.provider) is the join key
consumed by the Fleet OTel Collector distribution
(ingero-io/ingero-fleet)
for cross-pod peer-MAD outlier detection.- Per-outlier OTLP
/v1/tracesspans land in Tempo / Honeycomb /
any OTLP-compatible viewer with the workload key, bucket,
duration ratios, and optional KV-cache / memfrag / thermal
context as attributes. - KV-cache lineage tracker pairs
cudaMalloc/cudaFreeper PID
and attaches the top-N oldest live-block ages to decode-phase
outliers, so "fresh-allocation cold-start cost" vs "old blocks
accumulating without eviction" is visible at the source. - Memfrag IOCTL events from v0.15 now fan into the inference
engine and route KV-cache eviction storms into the decode bucket. - Thermal context (NVML throttle reasons OR-folded into a
process-level bitmap) is attached to every outlier, so a slow
step caused by HW_SLOWDOWN is visibly thermal without operators
having to join time series by hand. - Continuous engine re-detection picks up engines that started
after the agent (k8s rolling restart, pod swap, agent-first
ordering) without operator intervention.
Plus the supporting hardening, security, and dependency work that
shipped alongside in the same PR: Go 1.26.3 (closes three reachable
HIGH CVEs), an SQLite read-only enforcement at the engine level,
deterministic FNV-128a fallbacks for TraceID / SpanID /
EventID on CSPRNG failure, a contract drift-guard test that
catches future attr/metric renames, watchdog clock alignment on
CLOCK_BOOTTIME (fixes a silent post-suspend probe outage on WSL2
laptops), and safe defaults on every external surface (MCP bearer,
PagerDuty key, OTLP TLS, Prometheus bind, dashboard TLS,
fleet TLS 1.3 floor).
What's new in v0.16.0
ingero trace --inference umbrella
One flag, five bundled defaults appropriate for 24/7 inference
deployment:
- Sub-second causal window.
--fleet-workload-type=inference
(500 ms instead of the 10 s training default). - Event sampler attached to the SQLite store. 1% admission in
healthy state, 100% under degradation, 30 s cooldown back to
healthy. Caps storage on inference workloads that emit kernel
events at production QPS. Previously onlyfleet-pushwired the
sampler; v0.16.0 closes the storage-flood gap on the trace path
viaStore.SetSampler. - JSON output by default (no TUI), suitable for systemd /
k8s log collectors. - File-level DB rollover with retention sweep. Rotates
ingero.dbtoingero.<UTC-timestamp>.<rand>.dbwhen the file
crosses the threshold (default 1 GB), keeps the last 6.
Atomic POSIX rename; mutually exclusive with--max-db.
Rolled files open viaNewReadOnlyso the
_query_only=1connection pool refuses every write at the
engine level (write attempts via writable-CTE bypasses are
rejected by SQLite, not by a fragile substring filter). - Per-workload step-duration baseline + outlier classifier.
Tracks the running mean (EMA) and 95th-percentile (P^2) of
step durations per workload. Classifies each step against the
workload's own p95 once warmed (default 30 healthy steps);
emitsinference_outlierevents on the FOSS UDS socket and
as rate-limited INFO logs (one log per workload per minute).
Any individually-set flag wins over the umbrella default, so
ingero trace --inference --json=false keeps everything else
and re-enables the TUI.
Full quickstart: docs/advanced_features_quickstart.md.
Phase-aware baselines
A naive single-baseline-per-stream design produces false negatives
on heterogeneous-task streams. A vLLM continuous-batching server
interleaves prefill (~200 ms, kernel-heavy) and decode (~5 ms,
sparse-launch) on one hot-path stream; the mixed-bucket p95 lands
near the prefill tail (~180 ms), so a 10x slow decode (50 ms vs
5 ms baseline) gets absorbed and never fires.
v0.16.0 (well, v0.16.1 in the commit history; folded into this
cut) splits the per-(cgroup, pid, stream) baseline by phase,
classifying each step from observable signals BEFORE the duration
is compared to the baseline. The classifier is
duration-invariant by design: a slow decode is still a decode
(few launches, no memcpy, no NCCL), so it lands in the decode
bucket and gets compared against the decode-phase p95.
Phase set: prefill / decode / mixed / unknown. Rule order
(first match wins; defaults tuned for 7B-70B serving):
- NCCL > 0 ->
prefill(distributed tensor-parallel allreduce). - memfrag count >= threshold AND launch count < decode max ->
decode(KV-cache eviction storm during decode). - launches >= prefill min AND memcpy bytes >= prefill bytes ->
prefill. - launches <= decode max AND memcpy bytes <= decode bytes ->
decode. - launches between thresholds ->
mixed. - memcpy without launches in a meaningful pattern ->
mixed. - catch-all ->
unknown.
Outlier buckets are mutually exclusive at emission:
| Bucket | Trigger | Default action |
|---|---|---|
| 1.5x | step >= 1.5x baseline-p95 | log + UDS publish |
| 2x | step >= 2.0x baseline-p95 | log + UDS publish |
| 3x | step >= 3.0x baseline-p95 (configurable) | log + UDS publish + sampler bumped to 100% admission for cooldown window |
The 3x bucket flips the sampler to admit 100% of events for the
next 30 s, so when the outlier resolves you have full-fidelity
event history surrounding the spike.
ingero.infer.* on OTLP and Prometheus
Per-workload data points on every snapshot tick:
ingero.infer.step_duration_ns(histogram, geometric 100 us to
10 s, 15 buckets)ingero.infer.baseline_mean_ns(gauge)ingero.infer.baseline_p95_ns(gauge)ingero.infer.outlier_total(counter, labelbucket)ingero.infer.workloads_tracked(gauge)ingero.infer.sampler.degraded(gauge) +
ingero.infer.sampler.degradations_total(counter, labelcause)ingero.infer.throttle_active_total(counter)ingero.infer.kvcache.alloc_age_ms(histogram, geometric
10 ms to 600 s, 15 buckets; unattributed engine-wide rollup)
Data-point attributes: cgroup_path_hash, pid, stream_handle,
phase, optional kernel_fingerprint, optional
gen_ai.request.model, optional gen_ai.system. Resource
attributes: ingero.node.id, ingero.cluster.id,
ingero.provider. The latter trio is the join key for the
Fleet-side ingeroprocessor
cross-pod peer-MAD outlier detection (Layer 3 enabler); without
them peer-MAD has no way to identify peers across pods of one
job. Fleet's provider-lookup processor fills ingero.provider
from operator-supplied node_providers.yaml when the agent
leaves it empty.
The default-off case (no fingerprint key, no detected engine,
no --cluster) keeps the v0.15-era attribute set unchanged so
dashboards built against earlier versions keep working.
Per-outlier OTLP /v1/traces span emission
Every snapshot tick where the engine drained at least one outlier
POSTs an OTLP traces payload with one Span per classified outlier
alongside the existing /v1/metrics POST. Operators can jump from
a slow request in Tempo / Honeycomb / any OTLP-compatible trace
viewer to the workload-key context.
Span shape:
- Name:
infer.outlier.<bucket>(e.g.infer.outlier.3x) - Kind:
SPAN_KIND_INTERNAL - Status:
STATUS_CODE_ERRORwith a human-readable message - Start / end time: derived from
OutlierEvent.Atminus
StepDurationNs, so the span window matches the slow step's
wall-clock interval - Attributes: workload key (
cgroup_path_hash,pid,
stream_handle,phase), bucket + ratios,event_id(matches
the metric-sideAttrEventIDfor cross-channel correlation),
optionalkernel_fingerprint,gen_ai.request.model+
gen_ai.system, optionalmemfrag_events_in_step,
throttle_reasons,kvcache.oldest_alloc_age_ms
Each outlier carries a self-contained trace_id (no
parent_span_id); cross-outlier correlation is via the
workload-key attribute set, not parent-child trace structure.
The same event_id lands on the metric-side data point (via
AttrEventID) and on the
Fleet push, so a
slow request can be pivoted from any of the three channels.
KV-cache lineage tracker
Surfaces KV-cache fragmentation context on decode-phase outliers
so operators can distinguish "fresh-allocation cold-start cost"
from "old blocks accumulating without eviction" (the proximate
cause of many decode slowdowns in serving stacks).
New package internal/infer/kvcache with a per-PID allocation-LRU
(default 8192 entries per PID, configurable). Pairs cudaMalloc
with cudaFree by device pointer; on decode-phase outlier the
engine pulls the top-N oldest live alloc ages and attaches them
to the OutlierEvent. Cumulative ages fold into
ingero.infer.kvcache.alloc_age_ms.
BPF probe change: cudaMalloc / cudaMallocManaged uretprobes
now read the resolved device pointer (the value cudaMalloc
wrote to the void** out-parameter) and stamp it into
entry->arg1. Prior layout left arg1 as the void**
parameter address, which gave the tracker no stable key to pair
malloc with cudaFree. memtrack is unaffected.
Opt-in via --inference-kvcache-lineage (default off; ~64 KB per
inference PID when on). When engaged, the agent starts a 25 ms
self-heartbeat goroutine that writes CLOCK_BOOTTIME to the
ingero_watchdog BPF map; the cudaMalloc / cudaFree uprobes
have a 50 ms staleness gate driven by an external consumer's
heartbeat and would otherwise silently drop events when the
agent itself is the consumer.
Continuous engine re-detection
enginedetect.Detect() is now a continuous loop in the scrape
package, not a one-shot at agent startup. Agents pick up engines
that started after them (k8s rolling restart, pod swap,
agent-first ordering) without operator intervention.
Adaptive cadence: 30 s while no targets are known, 5 m once at
least one target is registered. Each tick confirms every
registered target via the detector (drops on PID-gone or
cmdline-mismatch, replaces on engine-identity change with INFO
log) and walks /proc for new candidates.
Opt-in via passing a non-nil PIDLister; existing tests that
just want a dumb scraper are unaffected.
Memfrag IOCTL fan-in to the inference engine
The v0.15 W1 memfrag kprobe (closed NVIDIA-driver
nvidia_unlocked_ioctl) was previously only a Prometheus counter
- MCP tool. v0.16.0 fans those events into the inference engine
viaOnMemfragEventand adds a high-priority phase rule
(memfrag_count >= MemfragDecodeMin AND launch_count < DecodeMaxLaunches->decode). KV-cache eviction storms now
classify into the decode bucket and fire as decode-bucket
outliers instead of being absorbed into prefill orunknown.
New flag: --inference-phase-memfrag-decode-min (default 1).
Thermal context on outliers
The v0.12.10 NVML throttle poller now publishes its OR-folded
reasons bitmap to a process-level atomic. The inference engine
reads it via SetThrottleReader callback when an outlier fires
and attaches ThrottleReasons to the OutlierEvent. Surfaces
on the UDS envelope as throttle_reasons and on OTLP /
Prometheus as ingero.infer.throttle_active_total. An outlier
caused by HW_SLOWDOWN is now visibly thermal without operators
having to join time series.
cudaLaunchKernel stream-handle capture
v0.16.0-v0.16.3 keyed per-step observables on evt.Args[0]. The
BPF probe puts the stream handle in Args[0] only for
cudaStreamSynchronize; for cudaLaunchKernel Args[0] is the
kernel function pointer, and for cudaMemcpy it's the byte count.
Launches and memcpys accumulated under completely different
observable keys than syncs read from, so the sync's
ResetAndRead always saw zero launches and every step
classified as PhaseUnknown via the catch-all rule.
The cudaLaunchKernel uprobe now reads the actual cudaStream_t
arg from the appropriate ABI slot (x86-64 SysV: spilled to caller
stack at [rsp+16]; arm64 AAPCS: still in X7 because dim3
splits keep eight register slots) and stamps it into Args[1].
The per-step observable bucket switches from PID-level
(cgroup, pid, 0) to per-stream (cgroup, pid, stream) for
launches and NCCL events; memcpy and memfrag remain PID-level
because their BPF arg slots are full or the signal is
process-wide by nature.
Multi-stream workloads get separate baselines via the per-stream
WorkloadKey on syncs; the legacy single-stream / default-stream
path is preserved.
Optional kernel-fingerprint workload key
--inference-fingerprint-key (default off) adds a fifth dimension
to WorkloadKey: an FNV-1a fold over the kernel function
pointers launched during the step. Useful when a single
(pid, stream) hosts multiple models / model versions and the
operator wants independent baselines per kernel sequence. Order-
dependent so a duplicate kernel does not cancel the running hash.
Hardening and security
Bundled into this cut and worth calling out separately:
- Go 1.26.3 + x/net 0.53.0. Closes three reachable HIGH CVEs
from the dep scan (html/templateescaper bypass / XSS,
net.DialNUL-byte panic on Windows,x/netHTTP/2 infinite
loop on badSETTINGS_MAX_FRAME_SIZE).govulncheckreports
0 vulnerabilities after the bump (was 4). - SQLite read-only enforcement at the engine level.
ExecuteReadOnlyruns against a sibling_query_only=1
connection pool. Writable-CTE bypasses with
tab/newline/CR/form-feed whitespace are rejected by SQLite,
not by the substring filter (which is retained as a friendly
pre-check). - Store rollover hardening. Schema-fail no longer leaves
s.dbpointing at the closedoldDB; rename-fail and
reopen-fail paths logError-severityRESTART REQUIREDand
surface both errors. Rolled filenames embed nanosecond
timestamps + a 4-byte random suffix (no same-second
collisions, no NTP-backwards-drift edge case). Concurrent
reads acquirerolloverMu.RLockso they wait for an in-flight
rollover instead of racing against a closed*sql.DB. - Safe defaults on external surfaces. MCP bearer + PagerDuty
key now ingested from env (refuses leaky flag values).
--otlp-insecuredefault false + non-loopback gate. Dashboard
refuses--no-tlson non-loopback bind. Prometheus
non-loopback bind logs a workload-identity warning.
fleet.Config.RequireTLSgate + TLS 1.3 floor for the
ingero-fleet
client + k8s flows. - Deterministic ID fallbacks on CSPRNG failure. New
pkg/events.DeterministicID(pid, streamHandle, tsNanos, salt)
helper is the single FNV-128a primitive used by both
internal/infernewEventIDandinternal/export/otlp_traces
newTraceAndSpanIDs. Salts 0x01 (event), 0x02 (trace), 0x03
(span) give three distinct deterministic digests from one
workload key, so a CSPRNG outage during an outlier storm no
longer collapses every fallback into one downstream row. - Watchdog clock alignment on
CLOCK_BOOTTIME.
watchdog_is_staleinbpf/common.bpf.hnow uses
bpf_ktime_get_boot_ns()(kernel 5.7+) instead of
bpf_ktime_get_ns()(CLOCK_MONOTONIC), aligning probe-side
with producer-sideclock_gettime(CLOCK_BOOTTIME). Without
this, after any host suspend the BOOTTIME-MONOTONIC delta
underflowed,watchdog_is_stalereturned1forever, and
cudaMalloc/cudaFreeprobes dropped all events for the
rest of the host's uptime. Practical risk was higher on WSL2 /
laptop dev rigs than on never-suspended GPU servers. - BPF struct field renames:
pidis now the userspace PID
everywhere.memfrag_ioctl_eventandkernel_launch_event
used kernel-naming (pid= kernel TID,tgid= userspace PID).
Now matchingero_event_hdr(pid= userspace PID,tid=
kernel TID). Wire bytes unchanged; only labels swapped.
Without this,gpu_kernel_launch_count{pid="..."}on a
multi-threaded CUDA workload (DDP, NCCL helper threads,
DataLoader workers) fragmented one Python rank across many
per-TID series. - CUDA graph correlation fix.
cudaStreamEndCaptureand
cudaGraphInstantiateuretprobes nowbpf_probe_read_userthe
output handle from the saved user pointer (PARM2/PARM1).
Previously everyEnd_Captureevent carriedgraph_handle=0
and everyInstantiatecarriedexec_handle=0, breaking the
Begin -> End -> Instantiate -> Launch correlation chain at two
of the four hops.
CLI / daemon-mode wire-up fixes
runJSONModewas missing the inference-engine hook entirely;
--inferencedefaults to--json, so phase classification,
baselines, and outlier detection were silently inert on the
umbrella path. Mirrored the integration block in both the TUI
loop and the JSON loop; the engine now sees the expected
sync events.runJSONModewas also missing the correlate.Engine wiring, so
the chain-of-evidence pipeline produced zero rows on the
production-shape path. Now acceptscorrPID+
*correlate.Engineand mirrors the snap-tick drains.emitInferOutlier/emitInferSamplerDegradednow pass the
resolvednodeIdentityand the operator-supplied
traceClusterso the UDSNodeID+ClusterIDfields match
the OTLP resource attrs.- SGLang's launch flag is
--model-path; older builds use
--model.extractModelnow tries the canonical key first
and falls back to the legacy form. - Outlier log line gained the
phasefield. Operators can now
tell from the agent log whether an outlier was a slow prefill
or a slow decode without joining log streams. - One
routeInferEvent(evt)free function, called from both
runTableModeandrunJSONMode. The duplicated switch block
is gone; future event-type additions land in one place.
Hygiene cluster
Smaller items folded into the same cut: zero throttle bitmap
after 3 consecutive NVML failures; open rolled DBs via
NewReadOnly (no migration churn); symmetric trim in --max-db
is-set check; collapse fleet_push poller duplication; swap
explain --last From/To bounds (was inverted); implement
/proc/<pid>/status UID filter for --user; merge.go
propagates Commit errors and emits per-row Exec warnings;
align nccl_barrier docstring with the collective-skew
computation; 32 MiB cap on fleet response body at 3 sites;
slog.Debug on update-cache writer error paths; drop unused
at param from OnFree / OnFreeEvent; clarify scrape parser
early-return + escape-sequence limits; pin fire-and-forget
convention with explanatory comments; --slow-decode-us flag
parameterizes the infer_phases.cu literal.
ingero demo --json incident nil-panic fix
runJSONMode called attachSysSnapshot(snap, sysColl) on every
snap-tick (mirroring runTableMode), but the pre-existing
conditional init only fired when onSnapshot or eventStore
was non-nil. The ingero demo --json incident path passes both
nil, so sysColl stayed nil and attachSysSnapshot crashed
dereferencing it. The init is now unconditional in runJSONMode
to match runTableMode.
New flags in v0.16.0
ingero trace:
--inference(umbrella; bundles the five defaults above)--inference-warmup(default 30, healthy steps before
classification activates per workload)--inference-outlier-ratio(default 3.0, multiplier on
baseline-p95 for the largest bucket)--inference-pause-on-severity(defaultHIGH, lowest causal-
chain severity that pauses baseline updates)--inference-sampler-degrade-on(default3x, smallest bucket
that bumps the sampler to 100%; valid values
{1.5x, 2x, 3x, off})--inference-phase-memfrag-decode-min(default 1)--inference-fingerprint-key(default off)--inference-kvcache-lineage(default off)--inference-scrape-redetect-interval(default 30 s)--cluster(operator-supplied cluster identity; becomes
ingero.cluster.idon OTLP resource attrs)--db-rollover-size(default1g, file rollover threshold)--db-rollover-keep(default 6, rolled files retained on disk)--slow-decode-us(parameter for theinfer_phases.cu
regression workload)
New YAML block under inference: mirroring the otlp: /
alerter: shape; CLI flags override file values via the
resolveInferenceConfig pattern.
New metrics in v0.16.0
Five inference metrics, two sampler-degraded metrics, one thermal
counter, one KV-cache histogram:
ingero.infer.step_duration_ns(histogram)ingero.infer.baseline_mean_ns(gauge)ingero.infer.baseline_p95_ns(gauge)ingero.infer.outlier_total(counter, labelbucket)ingero.infer.workloads_tracked(gauge)ingero.infer.sampler.degraded(gauge)ingero.infer.sampler.degradations_total(counter, label
cause)ingero.infer.throttle_active_total(counter)ingero.infer.kvcache.alloc_age_ms(histogram)
New attributes: ingero.infer.stream_handle,
ingero.infer.outlier_bucket,
ingero.infer.kernel_fingerprint,
ingero.infer.step_duration_ns,
ingero.infer.baseline_p95_ns,
ingero.infer.baseline_mean_ns,
ingero.infer.memfrag_events_in_step,
ingero.infer.throttle_reasons,
ingero.infer.kvcache.oldest_alloc_age_ms. New resource attrs:
ingero.node.id, ingero.cluster.id, ingero.provider. New
per-data-point GenAI attrs: gen_ai.request.model,
gen_ai.system.
AttrMemcpyDirection is now "ingero.memcpy.direction" (was
bare "direction"); the contract package documented the
namespaced form but the wire emitted the bare key. Aligned.
Known limitations carried into v0.16.0
- Per-stream attribution for memcpy and memfrag is not done
yet; the BPF probe arg slots are full or the signal is
process-wide. Multi-stream workloads still get separate
baselines via the per-streamWorkloadKeyon syncs; only the
per-step observables are PID-aggregated for those two event
types. - KV-cache lineage tracker is opt-in (
--inference-kvcache- lineage); ~64 KB per inference PID when on. - Kernel fingerprint workload-key dimension is opt-in
(--inference-fingerprint-key); off-by-default keeps the
per-workload LRU footprint at v0.16.4 levels. - The agent self-heartbeat for the watchdog gate only runs when
--inference-kvcache-lineageis set. Without it the
cudaMalloc/cudaFreeuprobes still attach but their 50 ms
staleness gate silently drops events unless an external
consumer is supplying the heartbeat.
Compatibility
- Wire format: All
inferenceOutlierMessagefield additions
(memfrag_events_in_step,throttle_reasons,
min_sm_clock_mhz,kv_cache_top_alloc_ages_ms) are
backwards-compatible optional fields on the JSON wire. Pre-
v0.16.0 UDS consumers ignore unknown fields. - OTLP / Prometheus: Existing v0.15 attribute sets are
unchanged when v0.16.0 features are not engaged. The
default-off case (no fingerprint key, no detected engine, no
--cluster) keeps dashboards built against v0.15 working. - SQLite schema: Rollover re-applies the same schema sequence
via the extractedschema_setup.go, so a rolled file is
bit-identical to a fresh file from a schema perspective. - CLI flags: All new flags are additive; existing flag
defaults are unchanged outside the explicit--inference
umbrella set.
What also landed between v0.10.0 and v0.15.0
The sixteen releases between v0.10.0 and v0.15.0 shipped
without human-readable notes (just commit lists). This rollup
catalogs the user-visible changes for anyone upgrading from a
pre-v0.16 binary. Grouped by surface area, not by release
tag. Version-in-parens tags at the end of each entry point at
the release that actually shipped that piece.
Added
NCCL collective tracing. 16 uprobes against libnccl.so (or
libtorch_cuda.so for statically-linked PyTorch):
ncclCommInitRank, ncclCommDestroy, ncclAllReduce,
ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend,
ncclRecv (each with uprobe + uretprobe). Auto-discovers
libnccl via /proc/<pid>/maps; falls back to libtorch_cuda.so
or libtorch_global_deps.so. Per-collective metrics
nccl.collective.duration_ms, barrier_wait_ms, bytes
emitted with op_type, comm_id_hash, rank, nranks,
datatype, reduce_op. Agent-side barrier correlator joins
NCCL uretprobes with the next cudaStreamSynchronize on the
same (pid, stream) and emits nccl.collective.barrier_wait_ms
(5-minute correlation window with GC). Per-tenant PID filter
via Tracer.SetTargetPID / ClearTargetPIDs. ncclSend /
ncclRecv peer-rank attribute (nccl.peer_rank). --nccl
privilege gate surfaces a clear "requires root or CAP_BPF +
CAP_PERFMON" warning instead of the libbpf attach error deep
in the stack. Contract docs explain world_size (resource-
level, job-wide) vs nranks (per-comm); same for single-comm
jobs, diverge for FSDP+TP / MoE. (v0.12.1, v0.12.2)
Runtime libnccl uprobe attachment. The agent now attaches
NCCL collective uprobes against libnccl paths the discovery
scanner finds at runtime, in addition to the systemwide
attachment that ran at startup. PyTorch + pip workloads ship
libnccl in a venv that startup-time attach could never see, so
the previous behavior captured zero NCCL events on bare cloud
GPU VMs. Tracer API exposes Prepare() + AttachAt(libPath)
for the runtime-attach flow; Attach(filterPIDs) is a
compatibility shim. (v0.15.0)
Prometheus running counters for NCCL collectives.
gpu_nccl_collective_count{op_type},
gpu_nccl_collective_bytes_total{op_type},
gpu_nccl_collective_barrier_events{op_type}. OTLP keeps the
per-event gauge view as the canonical channel; the counters
are the pull-friendly slice. (v0.15.0)
libnccl process discovery. Periodic /proc/PID/maps
scanner emits gpu.nccl.process_loaded (gauge=1 per discovered
PID, labelled with pid, comm, libnccl_path,
libnccl_version) and gpu.nccl.processes_total. Flag
--libnccl-discovery-interval (default 10s; 0 disables).
(v0.14.0)
CUDA memcpy tracing. Uprobes for cudaMemcpy2D,
cudaMemcpy2DAsync, cudaMemcpyPeer, cudaMemcpyPeerAsync.
Per-direction labels (h2h, h2d, d2h, d2d, default,
unknown). New gpu.memcpy.bytes_total{direction} (cumulative
counter) and gpu.memcpy.duration_ms{direction}. 2D variants
encode direction=unknown because the kind argument lives
where libbpf's PT_REGS_PARMn macros cannot read it. v0.15.0
promoted duration_ms to a per-event histogram with buckets
(ms) at 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50, 100, 250, 500,
1000. (v0.14.0, v0.15.0)
NVML clock-throttle reason metrics (W2-poller). Polls
nvmlDeviceGetCurrentClocksThrottleReasons every 5 s by
default (override via --throttle-poll-interval; 0
disables). Four OTel gauges per device labelled with
gpu.uuid: gpu.throttle.power_active, .thermal_active,
.sw_active, .hw_active (umbrella for any HW reason). Value
is 1 when the bucket is active and 0 otherwise. Poll
interval is the bursting floor: throttle events shorter than
the interval are missed by design. Consumer GPUs returning
[Not Supported] are skipped per device, logged once per GPU
at info level. (v0.12.10)
W1 NVML-poll memfrag heuristic.
gpu.memory.fragmentation_estimate (gauge 0..1, derived from
nvidia-smi memory.{used,free,total}) and per-PID
gpu.memory.process.allocated_bytes. New flag
--memfrag-poll-interval (default 10s; 0 disables). Polling-
based; the v0.15 experimental IOCTL kprobe variant superseded
this on allowlisted hosts. (v0.14.0)
Per-cgroup CUDA + CPU-stall metric variants alongside the
existing per-process series. New
internal/health/cgroup_cache.go builds the cgroup-path-hash
identity. Schema additive; older readers unaffected. (v0.14.0)
OTLP Histogram and ExponentialHistogram encoders.
Direct JSON, no OTel-SDK dependency. (v0.15.0)
OTLP traces emission for detection events. Detection-edge
spans on the same OTLP endpoint as metrics; one pipeline for
two signals. (v0.14.0)
Cost-of-problem gauges. ingero.node.info (descriptor
labels) and ingero.node.world_size (rank-count gauge per
training job). (v0.11.0)
ingero-alerter sidecar. Slack + PagerDuty backends for
per-node straggler alerts. Reads NDJSON events from the
agent's --remediate UDS; renders configurable templates.
(v0.11.0)
ingero check --support-bundle flag. Writes a diagnostic
tarball (config, recent log, BPF program names, kernel/driver
info) for support cases. (v0.11.0)
ingero migrate subcommand + framework. First-class schema
migrations for the trace DB; versioned migration registry so
v0.11+ schema changes apply forward without manual operator
action. (v0.11.0)
ingero rates update CLI subcommand. Live update with
embedded Markdown fallback for GPU hourly cost lookups.
(v0.14.0)
pagerduty_trigger MCP tool. PagerDuty Events v2 trigger
from the agent's detection-event shape. Off by default; opt in
via SetPagerDutyMCPEnabled(true) after pairing the agent MCP
with bearer auth. v0.15.0 auto-registers it when running on
stdio (loopback by definition) or on HTTP with
--mcp-bearer-token set. (v0.14.0, v0.15.0)
--mcp-bearer-token on ingero mcp. Requires
Authorization: Bearer <token> for HTTPS MCP requests when
set; empty disables auth (loopback-only deployments).
Constant-time compare via SHA-256 padding. (v0.15.0)
/investigate MCP prompt aware of Echo federation. Opens
with a one-paragraph note when reached via the
ingero-fleet
Echo-federated cluster-investigate flow: Echo has already
ranked the node, so this prompt's job is the per-node "WHY"
half (root cause + recommendation) while Echo handles the
cluster-side "WHERE" half. (v0.12.9)
YAML config loader. internal/config/config.go; CLI flags
override file values. (v0.14.0)
Trace-sampling policy (internal/sampling/sampler.go).
v0.14.0 shipped this as the fleet-push primitive; v0.16.0
wires it to the trace path's SQLite store. (v0.14.0)
internal/auth package. Bearer parsing + constant-time
compare + hardened TLS keypair loader. Wires every TLS load
site (MCP HTTPS, dashboard, fleet client, healthee poller).
Refuses world-readable keys unless
INGERO_TLS_ALLOW_LOOSE_KEY_PERMS=1; rejects directories.
(v0.15.0)
Six new metric-name constants in pkg/contract plus the
ingero.cgroup.path_hash attribute constant. (v0.14.0)
docs/troubleshooting.md — patterns Ingero detects,
operational cheat sheet, advanced configuration. README
trimmed in turn. (v0.14.0)
docs/otlp.md trust-model section. BPF-derived register
attributes (memcpy direction, NCCL op-type, IOCTL command
numbers, kernel grid/block dims) cannot be trusted as security
signals on multi-tenant hosts; cross-tenant correlation should
rely on cgroup attribution. Includes the v0.12.10 W2-poller
bit-to-bucket mapping for the four gpu.throttle.*_active
series. (v0.14.0, v0.15.0)
Architecture diagram in README. (v0.11.0)
Added (experimental; gated on --enable-experimental-kprobes)
--enable-experimental-kprobes flag (default off). The
agent reads /proc/driver/nvidia/version + uname -r and
only loads experimental probes if the pair is on the
internal/kprobe.DefaultAllowlist. Off-allowlist hosts get a
startup warning and the probes do NOT load. Allowlist seeded
with {driver 535.*, kernel 5.15.*} (Lambda A10),
{driver 535.*, kernel 6.5.*} (Lambda GH200), and
{driver 570.*, kernel 6.8.*} (Lambda current Ubuntu 22.04).
(v0.15.0)
Memfrag IOCTL kprobe. Hooks nvidia_unlocked_ioctl on the
closed NVIDIA driver. Records the IOCTL cmd field;
argument-buffer decode is a v0.16.x follow-up. Per-cmd counter
gpu.memfrag.ioctl_event_total{cmd}. New
fleet.cluster.memfrag_hotspots
MCP tool consumes it. (v0.15.0)
Throttle event counters via edge detection.
gpu.throttle.{power,thermal,sw,hw}.event_total. Layered on
the existing nvidia-smi throttle poller; each rising edge per
(gpu_uuid, bucket) increments once. (v0.15.0)
CUDA kernel launch dims uprobe. Hooks cuLaunchKernel in
libcuda.so and captures grid (X, Y, Z) + block (X, Y) dims per
launch. Block Z requires PARM7 access and is a follow-up. New
metrics gpu.kernel.launch.count, .threads_per_block
(histogram), .grid_blocks (histogram). New
fleet.cluster.kernel_launch_summary
MCP tool consumes it. (v0.15.0)
Changed
AWS provider detection hardened. Instance-id shape
validation added; the metadata URL is now injectable for
tests. (v0.12.4)
internal/store schema extended with per-cgroup attribution
columns. Additive; older readers unaffected. (v0.14.0)
PagerDuty client response body bounded by
io.LimitReader(64KiB) before close. (v0.14.0)
Sampler doc note. Intentionally not cryptographically
secure; load-shedder, not a security control. (v0.15.0)
In-tandem release records for v0.12.6, v0.12.7, v0.12.8.
v0.12.6 republished v0.12.5's scope after the v0.12.5 release
broke and produced no GitHub Release or GHCR image (v0.12.5
artifacts do not exist; users on that tag should upgrade to
v0.12.6 for the same code with working binaries). v0.12.7 was
a doc-sync release synchronizing version pins to v0.12.6.
v0.12.8 was an in-tandem release with
ingero-fleet
v0.12.8 (no agent code changes; tag maintained for sync per
the in-tandem release rule — fleet shipped the healthee OTel
Collector extension and the quanthealth library).
Fixed
- arm64 release binaries now load BPF probes correctly. The
v0.10.0 release archives shipped a single set of pre-compiled
BPF objects built for x86_64 and embedded that set into both
thelinux_amd64andlinux_arm64binaries. On any aarch64
kernel thept_regsCO-RE relocations resolved against the
wrong struct layout and the verifier rejected the load with
bad CO-RE relocation: invalid func unknown. bpf2go is now
invoked with-target amd64,arm64; per-arch.bpf.ofiles
are committed under
internal/ebpf/<pkg>/<name>_<arch>_bpfel.{go,o}and
selected at Go build time via build constraints. Reported by
@saiyam1814 against DGX Spark (issue #35); affected every
arm64 release-binary user, not only Spark. (v0.10.1) - MCP server reports build-time version.
ingero mcpnow
reports the version embedded at build time (e.g.v0.10.0)
in the MCPImplementation.Versionfield, instead of the
previously hardcoded"0.9.0"string. Affects how MCP
clients (Claude Desktop, IDE plugins, glama.ai) identify
the server. (v0.10.1) - NCCL ringbuf decode panic under Go 1.26. The
Event
struct had unexported_pad/_pad2ABI padding fields
thatbinary.Readpanicked on under Go 1.26's stricter
unexported-field reflect rules. Renamed toPad1/Pad2;
EventSize=104and field offsets stay identical. (v0.15.0) - NCCL discovery dispatcher panic when running without
--nccl. A typed-nil*ncclprobe.Tracerwidened to a
non-nilncclAttacherinterface bypassed the naive
att == nilcheck and segfaulted on the next method call.
Triggered whenever the agent ran with--prometheusbut
without--nccl(the typical inference deployment shape).
(v0.15.0) - NCCL probe attach-vs-PID-filter race. Closed the window
where target PIDs could change betweenAttach()and
SetTargetPIDs().ClearTargetPIDserrors now propagate
to the caller instead of being silently swallowed. (v0.12.2) cudaMemcpy2D/cudaMemcpy2DAsyncBPF probes encode
direction=5(unknown) instead ofdirection=0(h2h).
Earlier code relied on a userspace convention to interpret
direction=h2h on a 2D op as unknown; the fix makes the metric
label honest and prevents silent mixing of true H2H copies
with 2D-with-unknown-direction. (v0.14.0)- Prometheus
/metricsexporter now includes the v0.12.10 +
v0.14.0 metric set (gpu_throttle_*,
gpu_nccl_process_loaded,gpu_nccl_processes_total,
gpu_memory_{used,free,total}_bytes,
gpu_memory_fragmentation_estimate,
gpu_memory_process_allocated_bytes,
gpu_memcpy_bytes_total{direction},
gpu_memcpy_duration_ms{direction}). Operators scraping
Prometheus directly previously saw only the pre-v0.12.10
metric subset. (v0.14.1) - Reliability gaps from automated audit. Hardening pass on
agent paths flagged in v0.11.0's testing review. (v0.12.1)
Upgrading from v0.10.x
- From v0.10.x: Re-pull
ingerobinaries (arm64 users
especially: v0.10.0 arm64 BPF objects never worked). Existing
flags, OTLP, and Prometheus output are forward-compatible.
Operators on multi-tenant hosts should review the trust-model
section added todocs/otlp.mdin v0.15.0 before treating
BPF-derived register attributes as security signals. - From v0.11.x or v0.12.x: Schema migrations apply
automatically via the v0.11ingero migrateframework on
next startup. - From v0.13.x (private): v0.14.0 is the public successor;
see the v0.14.0 section above for the full set of additions. - From v0.14.x or v0.15.x: New
inference:YAML block is
additive. Existing OTLP / Prometheus dashboards keep working;
gen_ai.request.model+gen_ai.systemdata-point attrs and
theingero.node.id/ingero.cluster.id/
ingero.providerresource attrs only appear when the
inference engine is detected and--clusteris set. - MCP clients: Reauthorize with
--mcp-bearer-tokenon the
agent if you ran v0.14.x with HTTP MCP exposed; v0.15.0
defaults to no bearer (loopback-only) but the
pagerduty_triggertool is now auto-registered when bearer
auth is configured, so HTTP-exposed MCPs without a bearer
will no longer see that tool.
Security Fixes
- Go upgraded to 1.26.3; closes three HIGH CVEs (html/template escaper bypass, net.Dial NUL‑byte panic on Windows, x/net HTTP/2 infinite loop).
Weekly OSS security release digest.
The CVE patches and breaking changes that affected production tools this week. One email, every Sunday.
No spam, unsubscribe anytime.
Share this release
About ingero-io/ingero
eBPF-based GPU causal observability agent with MCP server. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency.
Related context
Related tools
Earlier breaking changes
- v0.17.0 Dropped 'annotate --socket' option from CLI.
Beta — feedback welcome: [email protected]