ingero-io/ingero

v0.16.0 Security

This release includes 1 security fix for security teams reviewing exposed deployments.

Published 2mo MCP Data & Storage

✓ No known CVEs patched

This release patches 1 known CVE

Topics

causal-tracing cuda cuda-graphs ebpf gpu gpu-monitoring

gpu-observability incident-response kubernetes machine-learning mcp model-context-protocol nvidia observability pytorch sre distributed-tracing

Affected surfaces

auth deps

Summary

AI summary

ingero trace --inference becomes a production‑grade daemon for GPU inference with phase‑aware baselines and comprehensive observability upgrades.

Full changelog

TL;DR

ingero trace --inference is now a first-class production daemon
shape for GPU inference workloads (vLLM, SGLang, TGI, Triton),
sitting alongside the training-flavoured --fleet-workload-type
path that v0.10..v0.15 built out. One flag bundles sub-second
causal windows, an event sampler attached to the on-disk store, DB
file rollover with retention, a typed UDS outlier protocol, and a
per-workload step-duration baseline + outlier classifier. The
classifier is phase-aware (prefill / decode / mixed / unknown)
so a 10x slow decode on a vLLM continuous-batching stream gets
compared against the decode baseline (~5 ms p95), not the prefill
tail (~180 ms p95) it used to be absorbed into.

The same release closes the loop on the rest of the inference
observability surface:

- ingero.infer.* metrics emit on both OTLP and Prometheus,
  attributed by (cgroup_path_hash, pid, stream_handle, phase),
  with gen_ai.request.model + gen_ai.system enrichment from
  the engine scraper (vllm / tgi / sglang / triton). The
  resource-attribute trio (ingero.node.id,
  ingero.cluster.id, ingero.provider) is the join key
  consumed by the Fleet OTel Collector distribution
  (ingero-io/ingero-fleet) for cross-pod peer-MAD outlier
  detection.
- Per-outlier OTLP /v1/traces spans land in Tempo / Honeycomb /
  any OTLP-compatible viewer with the workload key, bucket,
  duration ratios, and optional KV-cache / memfrag / thermal
  context as attributes.
- KV-cache lineage tracker pairs cudaMalloc / cudaFree per PID
  and attaches the top-N oldest live-block ages to decode-phase
  outliers, so "fresh-allocation cold-start cost" vs "old blocks
  accumulating without eviction" is visible at the source.
- Memfrag IOCTL events from v0.15 now fan into the inference
  engine and route KV-cache eviction storms into the decode bucket.
- Thermal context (NVML throttle reasons OR-folded into a
  process-level bitmap) is attached to every outlier, so a slow
  step caused by HW_SLOWDOWN is visibly thermal without operators
  having to join time series by hand.
- Continuous engine re-detection picks up engines that started
  after the agent (k8s rolling restart, pod swap, agent-first
  ordering) without operator intervention.

Plus the supporting hardening, security, and dependency work that
shipped alongside in the same PR: Go 1.26.3 (closes three reachable
HIGH CVEs), an SQLite read-only enforcement at the engine level,
deterministic FNV-128a fallbacks for TraceID / SpanID /
EventID on CSPRNG failure, a contract drift-guard test that
catches future attr/metric renames, watchdog clock alignment on
CLOCK_BOOTTIME (fixes a silent post-suspend probe outage on WSL2
laptops), and safe defaults on every external surface (MCP bearer,
PagerDuty key, OTLP TLS, Prometheus bind, dashboard TLS,
fleet TLS 1.3 floor).

What's new in v0.16.0

`ingero trace --inference` umbrella

One flag, five bundled defaults appropriate for 24/7 inference
deployment:

- Sub-second causal window. --fleet-workload-type=inference
  (500 ms instead of the 10 s training default).
- Event sampler attached to the SQLite store. 1% admission in
  healthy state, 100% under degradation, 30 s cooldown back to
  healthy. Caps storage on inference workloads that emit kernel
  events at production QPS. Previously only fleet-push wired the
  sampler; v0.16.0 closes the storage-flood gap on the trace path
  via Store.SetSampler.
- JSON output by default (no TUI), suitable for systemd /
  k8s log collectors.
- File-level DB rollover with retention sweep. Rotates
  ingero.db to ingero.<UTC-timestamp>.<rand>.db when the file
  crosses the threshold (default 1 GB), keeps the last 6.
  Atomic POSIX rename; mutually exclusive with --max-db.
  Rolled files open via NewReadOnly so the
  _query_only=1 connection pool refuses every write at the
  engine level (write attempts via writable-CTE bypasses are
  rejected by SQLite, not by a fragile substring filter).
- Per-workload step-duration baseline + outlier classifier.
  Tracks the running mean (EMA) and 95th-percentile (P^2) of
  step durations per workload. Classifies each step against the
  workload's own p95 once warmed (default 30 healthy steps);
  emits inference_outlier events on the FOSS UDS socket and
  as rate-limited INFO logs (one log per workload per minute).

Any individually-set flag wins over the umbrella default, so
ingero trace --inference --json=false keeps everything else
and re-enables the TUI.
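
The sampler behaviour described above (1% healthy, 100% degraded, 30 s cooldown) can be modelled as a small two-state gate. The following is a minimal sketch, not the shipped implementation: the `Sampler` type, its fields, and the deterministic 1-in-100 counter are assumptions made for illustration; the real sampler may use randomized admission.

```go
package main

import (
	"fmt"
	"time"
)

// Sampler is a hypothetical two-state admission gate: it admits a small
// fraction of events while healthy and every event while degraded.
type Sampler struct {
	healthyEvery  uint64        // admit 1 of every N events when healthy (100 => 1%)
	cooldown      time.Duration // how long degraded mode lasts after a trigger
	degradedUntil time.Time
	seen          uint64
}

// NewSampler uses the defaults quoted in the notes: 1% and 30 s.
func NewSampler() *Sampler {
	return &Sampler{healthyEvery: 100, cooldown: 30 * time.Second}
}

// Degrade switches to 100% admission until now+cooldown.
func (s *Sampler) Degrade(now time.Time) { s.degradedUntil = now.Add(s.cooldown) }

// Admit reports whether an event observed at now should be stored.
func (s *Sampler) Admit(now time.Time) bool {
	if now.Before(s.degradedUntil) {
		return true
	}
	s.seen++
	return s.seen%s.healthyEvery == 0
}

// at builds a timestamp a fixed number of seconds after the Unix epoch.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	s := NewSampler()
	admitted := 0
	for i := 0; i < 1000; i++ {
		if s.Admit(at(0)) {
			admitted++
		}
	}
	fmt.Println("healthy admitted out of 1000:", admitted)

	s.Degrade(at(10))
	fmt.Println("inside cooldown:", s.Admit(at(39)))
	fmt.Println("after cooldown:", s.Admit(at(41)))
}
```

A counter keeps the sketch deterministic and easy to test; swapping in a random draw would change only the last line of `Admit`.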

Full quickstart: docs/advanced_features_quickstart.md.

Phase-aware baselines

A naive single-baseline-per-stream design produces false negatives
on heterogeneous-task streams. A vLLM continuous-batching server
interleaves prefill (~200 ms, kernel-heavy) and decode (~5 ms,
sparse-launch) on one hot-path stream; the mixed-bucket p95 lands
near the prefill tail (~180 ms), so a 10x slow decode (50 ms vs
5 ms baseline) gets absorbed and never fires.

v0.16.0 (well, v0.16.1 in the commit history; folded into this
cut) splits the per-(cgroup, pid, stream) baseline by phase,
classifying each step from observable signals BEFORE the duration
is compared to the baseline. The classifier is
duration-invariant by design: a slow decode is still a decode
(few launches, no memcpy, no NCCL), so it lands in the decode
bucket and gets compared against the decode-phase p95.

Phase set: prefill / decode / mixed / unknown. Rule order
(first match wins; defaults tuned for 7B-70B serving):

1. NCCL > 0 -> prefill (distributed tensor-parallel allreduce).
2. memfrag count >= threshold AND launch count < decode max ->
   decode (KV-cache eviction storm during decode).
3. launches >= prefill min AND memcpy bytes >= prefill bytes ->
   prefill.
4. launches <= decode max AND memcpy bytes <= decode bytes ->
   decode.
5. launches between thresholds -> mixed.
6. memcpy without launches in a meaningful pattern -> mixed.
7. catch-all -> unknown.
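
The rule order above reads naturally as a first-match-wins switch. This is a minimal sketch under stated assumptions: the type names, field names, and every threshold value are hypothetical placeholders, not the shipped defaults. The sketch also requires at least one launch for the decode rule so that a step with no observed activity falls through to unknown; that guard is an assumption, not something the notes state.

```go
package main

import "fmt"

type Phase string

const (
	PhasePrefill Phase = "prefill"
	PhaseDecode  Phase = "decode"
	PhaseMixed   Phase = "mixed"
	PhaseUnknown Phase = "unknown"
)

// StepObservables holds per-step counts. Duration is deliberately absent:
// the classifier never looks at how long the step took.
type StepObservables struct {
	Launches    int
	MemcpyBytes int64
	NCCLEvents  int
	Memfrag     int
}

// Thresholds are illustrative values only.
type Thresholds struct {
	PrefillMinLaunches int
	PrefillMinBytes    int64
	DecodeMaxLaunches  int
	DecodeMaxBytes     int64
	MemfragDecodeMin   int
}

var exampleThresholds = Thresholds{
	PrefillMinLaunches: 64, PrefillMinBytes: 1 << 20,
	DecodeMaxLaunches: 8, DecodeMaxBytes: 4096,
	MemfragDecodeMin: 1,
}

// Classify applies the rules in order; the first match wins.
func Classify(o StepObservables, th Thresholds) Phase {
	switch {
	case o.NCCLEvents > 0:
		return PhasePrefill
	case o.Memfrag >= th.MemfragDecodeMin && o.Launches < th.DecodeMaxLaunches:
		return PhaseDecode
	case o.Launches >= th.PrefillMinLaunches && o.MemcpyBytes >= th.PrefillMinBytes:
		return PhasePrefill
	case o.Launches > 0 && o.Launches <= th.DecodeMaxLaunches && o.MemcpyBytes <= th.DecodeMaxBytes:
		return PhaseDecode
	case o.Launches > th.DecodeMaxLaunches && o.Launches < th.PrefillMinLaunches:
		return PhaseMixed
	case o.Launches == 0 && o.MemcpyBytes > 0:
		return PhaseMixed
	default:
		return PhaseUnknown
	}
}

func main() {
	fmt.Println(Classify(StepObservables{Launches: 3, MemcpyBytes: 512}, exampleThresholds))
	fmt.Println(Classify(StepObservables{Launches: 100, MemcpyBytes: 2 << 20}, exampleThresholds))
	fmt.Println(Classify(StepObservables{Launches: 20}, exampleThresholds))
}
```

Because no duration appears in `StepObservables`, a decode step that runs ten times slower than usual still classifies as decode, which is the duration-invariance property the section describes.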

Outlier buckets are mutually exclusive at emission:

| Bucket | Trigger | Default action |
|---|---|---|
| 1.5x | step >= 1.5x baseline-p95 | log + UDS publish |
| 2x | step >= 2.0x baseline-p95 | log + UDS publish |
| 3x | step >= 3.0x baseline-p95 (configurable) | log + UDS publish + sampler bumped to 100% admission for cooldown window |

The 3x bucket flips the sampler to admit 100% of events for the
next 30 s, so when the outlier resolves you have full-fidelity
event history surrounding the spike.
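
Mutual exclusivity at emission falls out of checking the largest bucket first. A minimal sketch, assuming a hypothetical `Bucket` function and assuming the top bucket keeps the label `3x` even when its ratio is reconfigured:

```go
package main

import "fmt"

// Bucket returns the single outlier bucket for a step, or "" when the step
// is not an outlier. topRatio stands in for the configurable multiplier of
// the largest bucket (3.0 by default in the notes).
func Bucket(stepNs, baselineP95Ns uint64, topRatio float64) string {
	if baselineP95Ns == 0 {
		return "" // no baseline yet, nothing to compare against
	}
	ratio := float64(stepNs) / float64(baselineP95Ns)
	switch {
	case ratio >= topRatio:
		return "3x"
	case ratio >= 2.0:
		return "2x"
	case ratio >= 1.5:
		return "1.5x"
	default:
		return ""
	}
}

func main() {
	const ms = 1_000_000
	// A 50 ms decode step against the decode baseline (5 ms p95).
	fmt.Printf("decode baseline: %q\n", Bucket(50*ms, 5*ms, 3.0))
	// The same step against a mixed baseline near the prefill tail (180 ms p95).
	fmt.Printf("mixed baseline:  %q\n", Bucket(50*ms, 180*ms, 3.0))
}
```

The second call returns the empty string, which is the false negative that phase-aware baselines are meant to remove.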

`ingero.infer.*` on OTLP and Prometheus

Per-workload data points on every snapshot tick:

- ingero.infer.step_duration_ns (histogram, geometric 100 us to
  10 s, 15 buckets)
- ingero.infer.baseline_mean_ns (gauge)
- ingero.infer.baseline_p95_ns (gauge)
- ingero.infer.outlier_total (counter, label bucket)
- ingero.infer.workloads_tracked (gauge)
- ingero.infer.sampler.degraded (gauge) +
  ingero.infer.sampler.degradations_total (counter, label cause)
- ingero.infer.throttle_active_total (counter)
- ingero.infer.kvcache.alloc_age_ms (histogram, geometric
  10 ms to 600 s, 15 buckets; unattributed engine-wide rollup)
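
For readers sizing dashboards, geometric bucket boundaries can be reproduced approximately as follows. This is a sketch under an assumption: it treats "15 buckets" as 15 boundaries spaced by a constant ratio between the two endpoints. The agent's exact boundary values are not given in these notes and may differ.

```go
package main

import (
	"fmt"
	"math"
)

// GeometricBounds returns n boundaries from minVal to maxVal where each
// boundary is the previous one multiplied by a constant ratio.
func GeometricBounds(minVal, maxVal float64, n int) []float64 {
	if n < 2 || minVal <= 0 || maxVal <= minVal {
		return nil
	}
	ratio := math.Pow(maxVal/minVal, 1/float64(n-1))
	bounds := make([]float64, n)
	for i := range bounds {
		bounds[i] = minVal * math.Pow(ratio, float64(i))
	}
	return bounds
}

func main() {
	// step_duration_ns: 100 us to 10 s, expressed in nanoseconds.
	for _, b := range GeometricBounds(1e5, 1e10, 15) {
		fmt.Printf("%.0f ns\n", b)
	}
}
```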

Data-point attributes: cgroup_path_hash, pid, stream_handle,
phase, optional kernel_fingerprint, optional
gen_ai.request.model, optional gen_ai.system. Resource
attributes: ingero.node.id, ingero.cluster.id,
ingero.provider. These three resource attributes are the join
key for the Fleet-side ingeroprocessor cross-pod peer-MAD
outlier detection (Layer 3 enabler); without them, peer-MAD
cannot identify peers across pods of one job. Fleet's
provider-lookup processor fills ingero.provider from the
operator-supplied node_providers.yaml when the agent leaves it
empty.

The default-off case (no fingerprint key, no detected engine,
no --cluster) keeps the v0.15-era attribute set unchanged so
dashboards built against earlier versions keep working.

Per-outlier OTLP `/v1/traces` span emission

Every snapshot tick where the engine drained at least one outlier
POSTs an OTLP traces payload with one Span per classified outlier
alongside the existing /v1/metrics POST. Operators can jump from
a slow request in Tempo / Honeycomb / any OTLP-compatible trace
viewer to the workload-key context.

Span shape:

- Name: infer.outlier.<bucket> (e.g. infer.outlier.3x)
- Kind: SPAN_KIND_INTERNAL
- Status: STATUS_CODE_ERROR with a human-readable message
- Start / end time: derived from OutlierEvent.At minus
  StepDurationNs, so the span window matches the slow step's
  wall-clock interval
- Attributes: workload key (cgroup_path_hash, pid,
  stream_handle, phase), bucket + ratios, event_id (matches
  the metric-side AttrEventID for cross-channel correlation),
  optional kernel_fingerprint, gen_ai.request.model +
  gen_ai.system, optional memfrag_events_in_step,
  throttle_reasons, kvcache.oldest_alloc_age_ms

Each outlier carries a self-contained trace_id (no
parent_span_id); cross-outlier correlation is via the
workload-key attribute set, not parent-child trace structure.
The same event_id lands on the metric-side data point (via
AttrEventID) and on the Fleet push, so a slow request can be
pivoted from any of the three channels.
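
The span name and time window described in the span shape can be derived in a few lines. A minimal sketch: the `Outlier` and `Span` structs are hypothetical stand-ins for the agent's own types, and treating the outlier timestamp as the end of the slow step is an assumption drawn from the "At minus StepDurationNs" wording.

```go
package main

import "fmt"

// Outlier is a simplified stand-in for the agent's OutlierEvent.
type Outlier struct {
	AtUnixNano     int64 // when the slow step was observed to finish
	StepDurationNs int64
	Bucket         string // "1.5x", "2x" or "3x"
}

// Span carries only the fields this sketch derives.
type Span struct {
	Name          string
	StartUnixNano int64
	EndUnixNano   int64
}

// SpanFor builds the span so that its window covers the slow step itself.
func SpanFor(o Outlier) Span {
	return Span{
		Name:          "infer.outlier." + o.Bucket,
		StartUnixNano: o.AtUnixNano - o.StepDurationNs,
		EndUnixNano:   o.AtUnixNano,
	}
}

func main() {
	sp := SpanFor(Outlier{AtUnixNano: 1_000_000_000, StepDurationNs: 50_000_000, Bucket: "3x"})
	fmt.Printf("%s [%d, %d]\n", sp.Name, sp.StartUnixNano, sp.EndUnixNano)
}
```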

KV-cache lineage tracker

Surfaces KV-cache fragmentation context on decode-phase outliers
so operators can distinguish "fresh-allocation cold-start cost"
from "old blocks accumulating without eviction" (the proximate
cause of many decode slowdowns in serving stacks).

New package internal/infer/kvcache with a per-PID allocation-LRU
(default 8192 entries per PID, configurable). Pairs cudaMalloc
with cudaFree by device pointer; on decode-phase outlier the
engine pulls the top-N oldest live alloc ages and attaches them
to the OutlierEvent. Cumulative ages fold into
ingero.infer.kvcache.alloc_age_ms.
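
The pairing logic can be pictured as a map from device pointer to allocation time. This is a minimal sketch for a single PID, assuming hypothetical names (`Tracker`, `OnMalloc`, `OldestAgesMs`); it omits the per-PID LRU bound that the real package applies.

```go
package main

import (
	"fmt"
	"sort"
)

// Tracker remembers when each live device pointer was allocated.
type Tracker struct {
	live map[uint64]int64 // device pointer -> allocation time (ns)
}

func NewTracker() *Tracker { return &Tracker{live: make(map[uint64]int64)} }

// OnMalloc records an allocation keyed by the resolved device pointer.
func (tr *Tracker) OnMalloc(devPtr uint64, nowNs int64) { tr.live[devPtr] = nowNs }

// OnFree pairs a cudaFree with its earlier cudaMalloc by device pointer.
func (tr *Tracker) OnFree(devPtr uint64) { delete(tr.live, devPtr) }

// OldestAgesMs returns the ages, in milliseconds, of the n oldest live
// allocations, oldest first.
func (tr *Tracker) OldestAgesMs(nowNs int64, n int) []int64 {
	ages := make([]int64, 0, len(tr.live))
	for _, allocNs := range tr.live {
		ages = append(ages, (nowNs-allocNs)/1_000_000)
	}
	sort.Slice(ages, func(i, j int) bool { return ages[i] > ages[j] })
	if len(ages) > n {
		ages = ages[:n]
	}
	return ages
}

func main() {
	const ms = int64(1_000_000)
	tr := NewTracker()
	tr.OnMalloc(0xA000, 0)
	tr.OnMalloc(0xB000, 100*ms)
	tr.OnMalloc(0xC000, 200*ms)
	tr.OnFree(0xB000)
	fmt.Println(tr.OldestAgesMs(1000*ms, 2))
}
```

Large ages on a decode outlier point at blocks that were never evicted; small ages point at fresh allocations.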

BPF probe change: cudaMalloc / cudaMallocManaged uretprobes
now read the resolved device pointer (the value cudaMalloc
wrote to the void** out-parameter) and stamp it into
entry->arg1. Prior layout left arg1 as the void**
parameter address, which gave the tracker no stable key to pair
malloc with cudaFree. memtrack is unaffected.

Opt-in via --inference-kvcache-lineage (default off; ~64 KB per
inference PID when on). When engaged, the agent starts a 25 ms
self-heartbeat goroutine that writes CLOCK_BOOTTIME to the
ingero_watchdog BPF map; the cudaMalloc / cudaFree uprobes
have a 50 ms staleness gate driven by an external consumer's
heartbeat and would otherwise silently drop events when the
agent itself is the consumer.
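
The interaction between the 25 ms heartbeat and the 50 ms gate can be modelled in userspace. The sketch below is an illustrative Go model, not the BPF code: `IsStale` is a hypothetical function, and unsigned subtraction is an assumption about how the probe compares timestamps. It also shows why both sides must read the same clock: if the heartbeat value is ahead of the probe's clock, the subtraction wraps and the gate reports stale.

```go
package main

import "fmt"

// stalenessNs is the 50 ms gate quoted in the notes.
const stalenessNs uint64 = 50_000_000

// IsStale reports whether the last heartbeat is older than the gate.
// Both arguments must come from the same clock source.
func IsStale(nowNs, lastHeartbeatNs uint64) bool {
	return nowNs-lastHeartbeatNs > stalenessNs
}

func main() {
	last := uint64(1_000_000_000)

	// Heartbeat written 25 ms ago: events pass the gate.
	fmt.Println("25 ms old:", IsStale(last+25_000_000, last))

	// No heartbeat for 60 ms: events are dropped.
	fmt.Println("60 ms old:", IsStale(last+60_000_000, last))

	// Heartbeat from a clock that is 5 s ahead of the probe's clock:
	// the unsigned difference wraps around and the gate stays closed.
	fmt.Println("mixed clocks:", IsStale(last, last+5_000_000_000))
}
```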

Continuous engine re-detection

enginedetect.Detect() is now a continuous loop in the scrape
package, not a one-shot at agent startup. Agents pick up engines
that started after them (k8s rolling restart, pod swap,
agent-first ordering) without operator intervention.

Adaptive cadence: 30 s while no targets are known, 5 m once at
least one target is registered. Each tick confirms every
registered target via the detector (drops on PID-gone or
cmdline-mismatch, replaces on engine-identity change with INFO
log) and walks /proc for new candidates.

Opt-in via passing a non-nil PIDLister; existing tests that
just want a dumb scraper are unaffected.

Memfrag IOCTL fan-in to the inference engine

The v0.15 W1 memfrag kprobe (closed NVIDIA-driver
nvidia_unlocked_ioctl) was previously surfaced only as a
Prometheus counter and an MCP tool. v0.16.0 fans those events
into the inference engine via OnMemfragEvent and adds a
high-priority phase rule
(memfrag_count >= MemfragDecodeMin AND
launch_count < DecodeMaxLaunches -> decode). KV-cache eviction
storms now classify into the decode bucket and fire as
decode-bucket outliers instead of being absorbed into prefill
or unknown.

New flag: --inference-phase-memfrag-decode-min (default 1).

Thermal context on outliers

The v0.12.10 NVML throttle poller now publishes its OR-folded
reasons bitmap to a process-level atomic. The inference engine
reads it via SetThrottleReader callback when an outlier fires
and attaches ThrottleReasons to the OutlierEvent. Surfaces
on the UDS envelope as throttle_reasons and on OTLP /
Prometheus as ingero.infer.throttle_active_total. An outlier
caused by HW_SLOWDOWN is now visibly thermal without operators
having to join time series.
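
The publish-and-read pattern described here is a plain atomic word shared between the poller and the engine. A minimal sketch: the function names are hypothetical, and the bit values are the NVML clocks-throttle-reason constants as the author of this sketch understands them from nvml.h, so they should be checked against the header in use.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Throttle-reason bits (assumed to mirror nvml.h; verify before relying on them).
const (
	ReasonHWSlowdown        uint64 = 0x08
	ReasonSWThermalSlowdown uint64 = 0x20
	ReasonHWThermalSlowdown uint64 = 0x40
)

// throttleReasons is the process-level bitmap, written by the poller and
// read by the inference engine when an outlier fires.
var throttleReasons uint64

// PublishThrottle OR-folds the per-device reason bitmaps into one word.
func PublishThrottle(perDevice []uint64) {
	var folded uint64
	for _, r := range perDevice {
		folded |= r
	}
	atomic.StoreUint64(&throttleReasons, folded)
}

// ReadThrottle is what a SetThrottleReader-style callback would return.
func ReadThrottle() uint64 { return atomic.LoadUint64(&throttleReasons) }

func main() {
	// GPU 0 is clean, GPU 1 reports HW slowdown, GPU 2 reports SW thermal.
	PublishThrottle([]uint64{0, ReasonHWSlowdown, ReasonSWThermalSlowdown})
	bits := ReadThrottle()
	fmt.Printf("bitmap=0x%x hw_slowdown=%v\n", bits, bits&ReasonHWSlowdown != 0)
}
```

OR-folding loses which device throttled, which is the trade-off for a single lock-free read on the outlier path.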

cudaLaunchKernel stream-handle capture

v0.16.0-v0.16.3 keyed per-step observables on evt.Args[0]. The
BPF probe puts the stream handle in Args[0] only for
cudaStreamSynchronize; for cudaLaunchKernel Args[0] is the
kernel function pointer, and for cudaMemcpy it's the byte count.
Launches and memcpys accumulated under completely different
observable keys than syncs read from, so the sync's
ResetAndRead always saw zero launches and every step
classified as PhaseUnknown via the catch-all rule.

The cudaLaunchKernel uprobe now reads the actual cudaStream_t
arg from the appropriate ABI slot (x86-64 SysV: spilled to caller
stack at [rsp+16]; arm64 AAPCS: still in X7 because dim3
splits keep eight register slots) and stamps it into Args[1].
The per-step observable bucket switches from PID-level
(cgroup, pid, 0) to per-stream (cgroup, pid, stream) for
launches and NCCL events; memcpy and memfrag remain PID-level
because their BPF arg slots are full or the signal is
process-wide by nature.

Multi-stream workloads get separate baselines via the per-stream
WorkloadKey on syncs; the legacy single-stream / default-stream
path is preserved.

Optional kernel-fingerprint workload key

--inference-fingerprint-key (default off) adds a fifth dimension
to WorkloadKey: an FNV-1a fold over the kernel function
pointers launched during the step. Useful when a single
(pid, stream) hosts multiple models / model versions and the
operator wants independent baselines per kernel sequence. Order-
dependent so a duplicate kernel does not cancel the running hash.
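
The point about order dependence is easiest to see next to a plain XOR fold, where launching the same kernel twice cancels out. A minimal sketch using Go's standard hash/fnv package; the 64-bit width and the little-endian encoding of each pointer are assumptions, not details taken from the agent.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// Fingerprint folds the kernel function pointers launched during a step
// into one FNV-1a value. Each pointer feeds into the running hash state,
// so both order and repetition change the result.
func Fingerprint(kernels []uint64) uint64 {
	h := fnv.New64a()
	var buf [8]byte
	for _, k := range kernels {
		binary.LittleEndian.PutUint64(buf[:], k)
		h.Write(buf[:])
	}
	return h.Sum64()
}

// XorFold is the naive alternative, shown only for contrast.
func XorFold(kernels []uint64) uint64 {
	var acc uint64
	for _, k := range kernels {
		acc ^= k
	}
	return acc
}

func main() {
	a, b := uint64(0x7f00_0000_1000), uint64(0x7f00_0000_2000)
	fmt.Println("xor, duplicate cancels:", XorFold([]uint64{a, a}) == XorFold(nil))
	fmt.Println("fnv, duplicate kept:   ", Fingerprint([]uint64{a, a}) != Fingerprint(nil))
	fmt.Println("fnv, order matters:    ", Fingerprint([]uint64{a, b}) != Fingerprint([]uint64{b, a}))
}
```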

Hardening and security

Bundled into this cut and worth calling out separately:

Go 1.26.3 + x/net 0.53.0. Closes three reachable HIGH CVEs
from the dep scan (html/template escaper bypass / XSS,
net.Dial NUL-byte panic on Windows, x/net HTTP/2 infinite
loop on bad SETTINGS_MAX_FRAME_SIZE). govulncheck reports
0 vulnerabilities after the bump (was 4).
SQLite read-only enforcement at the engine level.
ExecuteReadOnly runs against a sibling _query_only=1
connection pool. Writable-CTE bypasses with
tab/newline/CR/form-feed whitespace are rejected by SQLite,
not by the substring filter (which is retained as a friendly
pre-check).
Store rollover hardening. Schema-fail no longer leaves
s.db pointing at the closed oldDB; rename-fail and
reopen-fail paths log Error-severity RESTART REQUIRED and
surface both errors. Rolled filenames embed nanosecond
timestamps + a 4-byte random suffix (no same-second
collisions, no NTP-backwards-drift edge case). Concurrent
reads acquire rolloverMu.RLock so they wait for an in-flight
rollover instead of racing against a closed *sql.DB.
Safe defaults on external surfaces. MCP bearer + PagerDuty
key now ingested from env (refuses leaky flag values).
--otlp-insecure default false + non-loopback gate. Dashboard
refuses --no-tls on non-loopback bind. Prometheus
non-loopback bind logs a workload-identity warning.
fleet.Config.RequireTLS gate + TLS 1.3 floor for the
ingero-fleet
client + k8s flows.
Deterministic ID fallbacks on CSPRNG failure. New
pkg/events.DeterministicID(pid, streamHandle, tsNanos, salt)
helper is the single FNV-128a primitive used by both
internal/infer newEventID and internal/export/otlp_traces
newTraceAndSpanIDs. Salts 0x01 (event), 0x02 (trace), 0x03
(span) give three distinct deterministic digests from one
workload key, so a CSPRNG outage during an outlier storm no
longer collapses every fallback into one downstream row.
Watchdog clock alignment on CLOCK_BOOTTIME.
watchdog_is_stale in bpf/common.bpf.h now uses
bpf_ktime_get_boot_ns() (kernel 5.7+) instead of
bpf_ktime_get_ns() (CLOCK_MONOTONIC), aligning probe-side
with producer-side clock_gettime(CLOCK_BOOTTIME). Without
this, after any host suspend the BOOTTIME-MONOTONIC delta
underflowed, watchdog_is_stale returned 1 forever, and
cudaMalloc / cudaFree probes dropped all events for the
rest of the host's uptime. Practical risk was higher on WSL2 /
laptop dev rigs than on never-suspended GPU servers.
BPF struct field renames: pid is now the userspace PID
everywhere. memfrag_ioctl_event and kernel_launch_event
used kernel-naming (pid = kernel TID, tgid = userspace PID).
Now match ingero_event_hdr (pid = userspace PID, tid =
kernel TID). Wire bytes unchanged; only labels swapped.
Without this, gpu_kernel_launch_count{pid="..."} on a
multi-threaded CUDA workload (DDP, NCCL helper threads,
DataLoader workers) fragmented one Python rank across many
per-TID series.
CUDA graph correlation fix. cudaStreamEndCapture and
cudaGraphInstantiate uretprobes now bpf_probe_read_user the
output handle from the saved user pointer (PARM2 / PARM1).
Previously every End_Capture event carried graph_handle=0
and every Instantiate carried exec_handle=0, breaking the
Begin -> End -> Instantiate -> Launch correlation chain at two
of the four hops.
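
The salted fallback for IDs can be sketched with the standard library's FNV-128a. The function name and parameter list come from the notes above; the parameter types, the byte layout fed to the hash, and the 16-byte return value are assumptions made for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

// Salts named in the notes: one per ID kind.
const (
	SaltEvent byte = 0x01
	SaltTrace byte = 0x02
	SaltSpan  byte = 0x03
)

// DeterministicID hashes the workload key plus a salt with FNV-128a.
// It is a fallback for when the CSPRNG fails, not a source of randomness.
func DeterministicID(pid uint32, streamHandle uint64, tsNanos int64, salt byte) [16]byte {
	var buf [21]byte
	binary.BigEndian.PutUint32(buf[0:4], pid)
	binary.BigEndian.PutUint64(buf[4:12], streamHandle)
	binary.BigEndian.PutUint64(buf[12:20], uint64(tsNanos))
	buf[20] = salt

	h := fnv.New128a()
	h.Write(buf[:])

	var out [16]byte
	copy(out[:], h.Sum(nil))
	return out
}

func main() {
	pid, stream, ts := uint32(4242), uint64(0x55aa), int64(1_700_000_000_000_000_000)
	fmt.Printf("event %x\n", DeterministicID(pid, stream, ts, SaltEvent))
	fmt.Printf("trace %x\n", DeterministicID(pid, stream, ts, SaltTrace))
	fmt.Printf("span  %x\n", DeterministicID(pid, stream, ts, SaltSpan))
}
```

Including the timestamp in the input is what keeps successive outliers from the same workload apart; the salt keeps the three ID kinds apart for a single outlier.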

CLI / daemon-mode wire-up fixes

runJSONMode was missing the inference-engine hook entirely;
--inference defaults to --json, so phase classification,
baselines, and outlier detection were silently inert on the
umbrella path. Mirrored the integration block in both the TUI
loop and the JSON loop; the engine now sees the expected
sync events.
runJSONMode was also missing the correlate.Engine wiring, so
the chain-of-evidence pipeline produced zero rows on the
production-shape path. Now accepts corrPID +
*correlate.Engine and mirrors the snap-tick drains.
emitInferOutlier / emitInferSamplerDegraded now pass the
resolved nodeIdentity and the operator-supplied
traceCluster so the UDS NodeID + ClusterID fields match
the OTLP resource attrs.
SGLang's launch flag is --model-path; older builds use
--model. extractModel now tries the canonical key first
and falls back to the legacy form.
Outlier log line gained the phase field. Operators can now
tell from the agent log whether an outlier was a slow prefill
or a slow decode without joining log streams.
One routeInferEvent(evt) free function, called from both
runTableMode and runJSONMode. The duplicated switch block
is gone; future event-type additions land in one place.
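
The SGLang flag fallback mentioned above amounts to trying one key before another when scanning a process command line. A minimal sketch: the function name matches the notes, but the body, the argument-slice signature, and the support for both `--key value` and `--key=value` forms are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// extractModel looks for the canonical flag first and falls back to the
// legacy form. It returns "" when neither flag is present.
func extractModel(args []string) string {
	for _, key := range []string{"--model-path", "--model"} {
		for i, a := range args {
			// Space-separated form: --model-path <value>
			if a == key && i+1 < len(args) {
				return args[i+1]
			}
			// Equals form: --model-path=<value>
			if strings.HasPrefix(a, key+"=") {
				return a[len(key)+1:]
			}
		}
	}
	return ""
}

func main() {
	fmt.Println(extractModel([]string{"server", "--model-path", "org/model-a"}))
	fmt.Println(extractModel([]string{"server", "--model=org/model-b"}))
	fmt.Println(extractModel([]string{"server", "--model", "old", "--model-path", "new"}))
}
```

Matching on `key` exactly or on `key + "="` keeps `--model` from accidentally matching `--model-path`.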

Hygiene cluster

Smaller items folded into the same cut: zero throttle bitmap
after 3 consecutive NVML failures; open rolled DBs via
NewReadOnly (no migration churn); symmetric trim in --max-db
is-set check; collapse fleet_push poller duplication; swap
explain --last From/To bounds (was inverted); implement
/proc/<pid>/status UID filter for --user; merge.go
propagates Commit errors and emits per-row Exec warnings;
align nccl_barrier docstring with the collective-skew
computation; 32 MiB cap on fleet response body at 3 sites;
slog.Debug on update-cache writer error paths; drop unused
at param from OnFree / OnFreeEvent; clarify scrape parser
early-return + escape-sequence limits; pin fire-and-forget
convention with explanatory comments; --slow-decode-us flag
parameterizes the infer_phases.cu literal.

`ingero demo --json incident` nil-panic fix

runJSONMode called attachSysSnapshot(snap, sysColl) on every
snap-tick (mirroring runTableMode), but the pre-existing
conditional init only fired when onSnapshot or eventStore
was non-nil. The ingero demo --json incident path passes both
nil, so sysColl stayed nil and attachSysSnapshot crashed
dereferencing it. The init is now unconditional in runJSONMode
to match runTableMode.

New flags in v0.16.0

ingero trace:

--inference (umbrella; bundles the five defaults above)
--inference-warmup (default 30, healthy steps before
classification activates per workload)
--inference-outlier-ratio (default 3.0, multiplier on
baseline-p95 for the largest bucket)
--inference-pause-on-severity (default HIGH, lowest causal-
chain severity that pauses baseline updates)
--inference-sampler-degrade-on (default 3x, smallest bucket
that bumps the sampler to 100%; valid values
{1.5x, 2x, 3x, off})
--inference-phase-memfrag-decode-min (default 1)
--inference-fingerprint-key (default off)
--inference-kvcache-lineage (default off)
--inference-scrape-redetect-interval (default 30 s)
--cluster (operator-supplied cluster identity; becomes
ingero.cluster.id on OTLP resource attrs)
--db-rollover-size (default 1g, file rollover threshold)
--db-rollover-keep (default 6, rolled files retained on disk)
--slow-decode-us (parameter for the infer_phases.cu
regression workload)

New YAML block under inference: mirroring the otlp: /
alerter: shape; CLI flags override file values via the
resolveInferenceConfig pattern.

New metrics in v0.16.0

Five inference metrics, two sampler-degraded metrics, one thermal
counter, one KV-cache histogram:

ingero.infer.step_duration_ns (histogram)
ingero.infer.baseline_mean_ns (gauge)
ingero.infer.baseline_p95_ns (gauge)
ingero.infer.outlier_total (counter, label bucket)
ingero.infer.workloads_tracked (gauge)
ingero.infer.sampler.degraded (gauge)
ingero.infer.sampler.degradations_total (counter, label
cause)
ingero.infer.throttle_active_total (counter)
ingero.infer.kvcache.alloc_age_ms (histogram)

New attributes: ingero.infer.stream_handle,
ingero.infer.outlier_bucket,
ingero.infer.kernel_fingerprint,
ingero.infer.step_duration_ns,
ingero.infer.baseline_p95_ns,
ingero.infer.baseline_mean_ns,
ingero.infer.memfrag_events_in_step,
ingero.infer.throttle_reasons,
ingero.infer.kvcache.oldest_alloc_age_ms. New resource attrs:
ingero.node.id, ingero.cluster.id, ingero.provider. New
per-data-point GenAI attrs: gen_ai.request.model,
gen_ai.system.

AttrMemcpyDirection is now "ingero.memcpy.direction" (was
bare "direction"); the contract package documented the
namespaced form but the wire emitted the bare key. Aligned.

Known limitations carried into v0.16.0

- Per-stream attribution for memcpy and memfrag is not done
  yet; the BPF probe arg slots are full or the signal is
  process-wide. Multi-stream workloads still get separate
  baselines via the per-stream WorkloadKey on syncs; only the
  per-step observables are PID-aggregated for those two event
  types.
- KV-cache lineage tracker is opt-in
  (--inference-kvcache-lineage); ~64 KB per inference PID when on.
- Kernel fingerprint workload-key dimension is opt-in
  (--inference-fingerprint-key); off-by-default keeps the
  per-workload LRU footprint at v0.16.4 levels.
- The agent self-heartbeat for the watchdog gate only runs when
  --inference-kvcache-lineage is set. Without it the
  cudaMalloc / cudaFree uprobes still attach but their 50 ms
  staleness gate silently drops events unless an external
  consumer is supplying the heartbeat.

Compatibility

Wire format: All inferenceOutlierMessage field additions
(memfrag_events_in_step, throttle_reasons,
min_sm_clock_mhz, kv_cache_top_alloc_ages_ms) are
backwards-compatible optional fields on the JSON wire. Pre-
v0.16.0 UDS consumers ignore unknown fields.
OTLP / Prometheus: Existing v0.15 attribute sets are
unchanged when v0.16.0 features are not engaged. The
default-off case (no fingerprint key, no detected engine, no
--cluster) keeps dashboards built against v0.15 working.
SQLite schema: Rollover re-applies the same schema sequence
via the extracted schema_setup.go, so a rolled file is
bit-identical to a fresh file from a schema perspective.
CLI flags: All new flags are additive; existing flag
defaults are unchanged outside the explicit --inference
umbrella set.

What also landed between v0.10.0 and v0.15.0

The sixteen releases between v0.10.0 and v0.15.0 shipped
without human-readable notes (just commit lists). This rollup
catalogs the user-visible changes for anyone upgrading from a
pre-v0.16 binary. Grouped by surface area, not by release
tag. Version-in-parens tags at the end of each entry point at
the release that actually shipped that piece.

Added

NCCL collective tracing. 16 uprobes against libnccl.so (or
libtorch_cuda.so for statically-linked PyTorch):
ncclCommInitRank, ncclCommDestroy, ncclAllReduce,
ncclAllGather, ncclReduceScatter, ncclBcast, ncclSend,
ncclRecv (each with uprobe + uretprobe). Auto-discovers
libnccl via /proc/<pid>/maps; falls back to libtorch_cuda.so
or libtorch_global_deps.so. Per-collective metrics
nccl.collective.duration_ms, barrier_wait_ms, bytes
emitted with op_type, comm_id_hash, rank, nranks,
datatype, reduce_op. Agent-side barrier correlator joins
NCCL uretprobes with the next cudaStreamSynchronize on the
same (pid, stream) and emits nccl.collective.barrier_wait_ms
(5-minute correlation window with GC). Per-tenant PID filter
via Tracer.SetTargetPID / ClearTargetPIDs. ncclSend /
ncclRecv peer-rank attribute (nccl.peer_rank). --nccl
privilege gate surfaces a clear "requires root or CAP_BPF +
CAP_PERFMON" warning instead of the libbpf attach error deep
in the stack. Contract docs explain world_size (resource-
level, job-wide) vs nranks (per-comm); same for single-comm
jobs, diverge for FSDP+TP / MoE. (v0.12.1, v0.12.2)

Runtime libnccl uprobe attachment. The agent now attaches
NCCL collective uprobes against libnccl paths the discovery
scanner finds at runtime, in addition to the systemwide
attachment that ran at startup. PyTorch + pip workloads ship
libnccl in a venv that startup-time attach could never see, so
the previous behavior captured zero NCCL events on bare cloud
GPU VMs. Tracer API exposes Prepare() + AttachAt(libPath)
for the runtime-attach flow; Attach(filterPIDs) is a
compatibility shim. (v0.15.0)

Prometheus running counters for NCCL collectives.
gpu_nccl_collective_count{op_type},
gpu_nccl_collective_bytes_total{op_type},
gpu_nccl_collective_barrier_events{op_type}. OTLP keeps the
per-event gauge view as the canonical channel; the counters
are the pull-friendly slice. (v0.15.0)

libnccl process discovery. Periodic /proc/PID/maps
scanner emits gpu.nccl.process_loaded (gauge=1 per discovered
PID, labelled with pid, comm, libnccl_path,
libnccl_version) and gpu.nccl.processes_total. Flag
--libnccl-discovery-interval (default 10s; 0 disables).
(v0.14.0)

CUDA memcpy tracing. Uprobes for cudaMemcpy2D,
cudaMemcpy2DAsync, cudaMemcpyPeer, cudaMemcpyPeerAsync.
Per-direction labels (h2h, h2d, d2h, d2d, default,
unknown). New gpu.memcpy.bytes_total{direction} (cumulative
counter) and gpu.memcpy.duration_ms{direction}. 2D variants
encode direction=unknown because the kind argument lives
where libbpf's PT_REGS_PARMn macros cannot read it. v0.15.0
promoted duration_ms to a per-event histogram with buckets
(ms) at 0.1, 0.25, 0.5, 1, 2.5, 5, 10, 25, 50, 100, 250, 500,
1000. (v0.14.0, v0.15.0)

NVML clock-throttle reason metrics (W2-poller). Polls
nvmlDeviceGetCurrentClocksThrottleReasons every 5 s by
default (override via --throttle-poll-interval; 0
disables). Four OTel gauges per device labelled with
gpu.uuid: gpu.throttle.power_active, .thermal_active,
.sw_active, .hw_active (umbrella for any HW reason). Value
is 1 when the bucket is active and 0 otherwise. Poll
interval is the bursting floor: throttle events shorter than
the interval are missed by design. Consumer GPUs returning
[Not Supported] are skipped per device, logged once per GPU
at info level. (v0.12.10)

W1 NVML-poll memfrag heuristic.
gpu.memory.fragmentation_estimate (gauge 0..1, derived from
nvidia-smi memory.{used,free,total}) and per-PID
gpu.memory.process.allocated_bytes. New flag
--memfrag-poll-interval (default 10s; 0 disables). Polling-
based; the v0.15 experimental IOCTL kprobe variant superseded
this on allowlisted hosts. (v0.14.0)

Per-cgroup CUDA + CPU-stall metric variants alongside the
existing per-process series. New
internal/health/cgroup_cache.go builds the cgroup-path-hash
identity. Schema additive; older readers unaffected. (v0.14.0)

OTLP Histogram and ExponentialHistogram encoders.
Direct JSON, no OTel-SDK dependency. (v0.15.0)

OTLP traces emission for detection events. Detection-edge
spans on the same OTLP endpoint as metrics; one pipeline for
two signals. (v0.14.0)

Cost-of-problem gauges. ingero.node.info (descriptor
labels) and ingero.node.world_size (rank-count gauge per
training job). (v0.11.0)

ingero-alerter sidecar. Slack + PagerDuty backends for
per-node straggler alerts. Reads NDJSON events from the
agent's --remediate UDS; renders configurable templates.
(v0.11.0)

ingero check --support-bundle flag. Writes a diagnostic
tarball (config, recent log, BPF program names, kernel/driver
info) for support cases. (v0.11.0)

ingero migrate subcommand + framework. First-class schema
migrations for the trace DB; versioned migration registry so
v0.11+ schema changes apply forward without manual operator
action. (v0.11.0)

ingero rates update CLI subcommand. Live update with
embedded Markdown fallback for GPU hourly cost lookups.
(v0.14.0)

pagerduty_trigger MCP tool. PagerDuty Events v2 trigger
from the agent's detection-event shape. Off by default; opt in
via SetPagerDutyMCPEnabled(true) after pairing the agent MCP
with bearer auth. v0.15.0 auto-registers it when running on
stdio (loopback by definition) or on HTTP with
--mcp-bearer-token set. (v0.14.0, v0.15.0)

--mcp-bearer-token on ingero mcp. Requires
Authorization: Bearer <token> for HTTPS MCP requests when
set; empty disables auth (loopback-only deployments).
Constant-time compare via SHA-256 padding. (v0.15.0)
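
Hashing both tokens before comparing is a common way to get a fixed-length, constant-time comparison. The sketch below shows that general technique with Go's crypto/subtle; it is not taken from the agent's internal/auth package, and the function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"strings"
)

// bearerFromHeader extracts the token from an Authorization header value.
func bearerFromHeader(header string) (string, bool) {
	const prefix = "Bearer "
	if !strings.HasPrefix(header, prefix) {
		return "", false
	}
	return header[len(prefix):], true
}

// tokenMatches hashes both sides to 32 bytes so the comparison takes the
// same time regardless of token length or where the first mismatch occurs.
func tokenMatches(presented, expected string) bool {
	a := sha256.Sum256([]byte(presented))
	b := sha256.Sum256([]byte(expected))
	return subtle.ConstantTimeCompare(a[:], b[:]) == 1
}

func main() {
	tok, ok := bearerFromHeader("Bearer s3cret-token")
	fmt.Println(ok && tokenMatches(tok, "s3cret-token"))
	fmt.Println(tokenMatches("short", "s3cret-token"))
}
```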

/investigate MCP prompt aware of Echo federation. Opens
with a one-paragraph note when reached via the
ingero-fleet
Echo-federated cluster-investigate flow: Echo has already
ranked the node, so this prompt's job is the per-node "WHY"
half (root cause + recommendation) while Echo handles the
cluster-side "WHERE" half. (v0.12.9)

YAML config loader. internal/config/config.go; CLI flags
override file values. (v0.14.0)

Trace-sampling policy (internal/sampling/sampler.go).
v0.14.0 shipped this as the fleet-push primitive; v0.16.0
wires it to the trace path's SQLite store. (v0.14.0)

internal/auth package. Bearer parsing + constant-time
compare + hardened TLS keypair loader. Wires every TLS load
site (MCP HTTPS, dashboard, fleet client, healthee poller).
Refuses world-readable keys unless
INGERO_TLS_ALLOW_LOOSE_KEY_PERMS=1; rejects directories.
(v0.15.0)

Six new metric-name constants in pkg/contract plus the
ingero.cgroup.path_hash attribute constant. (v0.14.0)

docs/troubleshooting.md — patterns Ingero detects,
operational cheat sheet, advanced configuration. README
trimmed in turn. (v0.14.0)

docs/otlp.md trust-model section. BPF-derived register
attributes (memcpy direction, NCCL op-type, IOCTL command
numbers, kernel grid/block dims) cannot be trusted as security
signals on multi-tenant hosts; cross-tenant correlation should
rely on cgroup attribution. Includes the v0.12.10 W2-poller
bit-to-bucket mapping for the four gpu.throttle.*_active
series. (v0.14.0, v0.15.0)

Architecture diagram in README. (v0.11.0)

Added (experimental; gated on `--enable-experimental-kprobes`)

--enable-experimental-kprobes flag (default off). The
agent reads /proc/driver/nvidia/version + uname -r and
only loads experimental probes if the pair is on the
internal/kprobe.DefaultAllowlist. Off-allowlist hosts get a
startup warning and the probes do NOT load. Allowlist seeded
with {driver 535.*, kernel 5.15.*} (Lambda A10),
{driver 535.*, kernel 6.5.*} (Lambda GH200), and
{driver 570.*, kernel 6.8.*} (Lambda current Ubuntu 22.04).
(v0.15.0)

Memfrag IOCTL kprobe. Hooks nvidia_unlocked_ioctl on the
closed NVIDIA driver. Records the IOCTL cmd field;
argument-buffer decode is a v0.16.x follow-up. Per-cmd counter
gpu.memfrag.ioctl_event_total{cmd}. New
fleet.cluster.memfrag_hotspots
MCP tool consumes it. (v0.15.0)

Throttle event counters via edge detection.
gpu.throttle.{power,thermal,sw,hw}.event_total. Layered on
the existing nvidia-smi throttle poller; each rising edge per
(gpu_uuid, bucket) increments once. (v0.15.0)

CUDA kernel launch dims uprobe. Hooks cuLaunchKernel in
libcuda.so and captures grid (X, Y, Z) + block (X, Y) dims per
launch. Block Z requires PARM7 access and is a follow-up. New
metrics gpu.kernel.launch.count, .threads_per_block
(histogram), .grid_blocks (histogram). The new
fleet.cluster.kernel_launch_summary MCP tool consumes it. (v0.15.0)
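
The throttle event counters in this section increment once per rising edge per (gpu_uuid, bucket). A minimal Go sketch of that edge detection follows. The edgeCounter type and its methods are hypothetical illustrations, not the agent's poller code, and the sketch assumes the poller delivers one boolean "active" sample per GPU and bucket on each poll.

```go
package main

import "fmt"

// key identifies one throttle series: a GPU and a reason bucket
// (power, thermal, sw, hw).
type key struct {
	gpuUUID string
	bucket  string
}

// edgeCounter turns a polled active/inactive gauge into an event counter.
type edgeCounter struct {
	last   map[key]bool
	events map[key]uint64
}

func newEdgeCounter() *edgeCounter {
	return &edgeCounter{last: map[key]bool{}, events: map[key]uint64{}}
}

// observe records one poll sample and counts an event only on an
// inactive -> active transition, so a throttle that stays active
// across many polls is counted once. A series that is already active
// on its first sample counts as one edge.
func (c *edgeCounter) observe(gpuUUID, bucket string, active bool) {
	k := key{gpuUUID, bucket}
	if active && !c.last[k] {
		c.events[k]++
	}
	c.last[k] = active
}

// total returns the number of rising edges seen for one series.
func (c *edgeCounter) total(gpuUUID, bucket string) uint64 {
	return c.events[key{gpuUUID, bucket}]
}

func main() {
	c := newEdgeCounter()
	for _, active := range []bool{false, true, true, false, true} {
		c.observe("GPU-0", "thermal", active)
	}
	fmt.Println(c.total("GPU-0", "thermal")) // 2: two rising edges
}
```

Counting edges rather than active samples keeps the counter independent of the poll interval: a ten-second throttle registers as one event whether the poller samples it twice or twenty times.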

Changed

AWS provider detection hardened. Instance-id shape
validation added; the metadata URL is now injectable for
tests. (v0.12.4)

internal/store schema extended with per-cgroup attribution
columns. Additive; older readers unaffected. (v0.14.0)

PagerDuty client response body bounded by
io.LimitReader(64KiB) before close. (v0.14.0)

Sampler doc note. Intentionally not cryptographically
secure; load-shedder, not a security control. (v0.15.0)

In-tandem release records for v0.12.6, v0.12.7, v0.12.8.
v0.12.6 republished v0.12.5's scope after the v0.12.5 release
broke and produced no GitHub Release or GHCR image (v0.12.5
artifacts do not exist; users on that tag should upgrade to
v0.12.6 for the same code with working binaries). v0.12.7 was
a doc-sync release synchronizing version pins to v0.12.6.
v0.12.8 was an in-tandem release with ingero-fleet v0.12.8 (no
agent code changes; tag maintained for sync per the in-tandem
release rule; fleet shipped the healthee OTel Collector
extension and the quanthealth library).

Fixed

arm64 release binaries now load BPF probes correctly. The
v0.10.0 release archives shipped a single set of pre-compiled
BPF objects built for x86_64 and embedded that set into both
the linux_amd64 and linux_arm64 binaries. On any aarch64
kernel the pt_regs CO-RE relocations resolved against the
wrong struct layout and the verifier rejected the load with
bad CO-RE relocation: invalid func unknown. bpf2go is now
invoked with -target amd64,arm64; per-arch .bpf.o files
are committed under
internal/ebpf/<pkg>/<name>_<arch>_bpfel.{go,o} and
selected at Go build time via build constraints. Reported by
@saiyam1814 against DGX Spark (issue #35); affected every
arm64 release-binary user, not only Spark. (v0.10.1)

MCP server reports build-time version. ingero mcp now
reports the version embedded at build time (e.g. v0.10.0)
in the MCP Implementation.Version field, instead of the
previously hardcoded "0.9.0" string. Affects how MCP
clients (Claude Desktop, IDE plugins, glama.ai) identify
the server. (v0.10.1)

NCCL ringbuf decode panic under Go 1.26. The Event
struct had unexported _pad / _pad2 ABI padding fields
that binary.Read panicked on under Go 1.26's stricter
unexported-field reflect rules. Renamed to Pad1 / Pad2;
EventSize=104 and field offsets stay identical. (v0.15.0)

NCCL discovery dispatcher panic when running without
--nccl. A typed-nil *ncclprobe.Tracer widened to a
non-nil ncclAttacher interface bypassed the naive
att == nil check and segfaulted on the next method call.
Triggered whenever the agent ran with --prometheus but
without --nccl (the typical inference deployment shape).
(v0.15.0)

NCCL probe attach-vs-PID-filter race. Closed the window
where target PIDs could change between Attach() and
SetTargetPIDs(). ClearTargetPIDs errors now propagate
to the caller instead of being silently swallowed. (v0.12.2)

cudaMemcpy2D / cudaMemcpy2DAsync BPF probes encode
direction=5 (unknown) instead of direction=0 (h2h).
Earlier code relied on a userspace convention to interpret
direction=h2h on a 2D op as unknown; the fix makes the metric
label honest and prevents silent mixing of true H2H copies
with 2D-with-unknown-direction. (v0.14.0)

Prometheus /metrics exporter now includes the v0.12.10 +
v0.14.0 metric set (gpu_throttle_*,
gpu_nccl_process_loaded, gpu_nccl_processes_total,
gpu_memory_{used,free,total}_bytes,
gpu_memory_fragmentation_estimate,
gpu_memory_process_allocated_bytes,
gpu_memcpy_bytes_total{direction},
gpu_memcpy_duration_ms{direction}). Operators scraping
Prometheus directly previously saw only the pre-v0.12.10
metric subset. (v0.14.1)

Reliability gaps from automated audit. Hardening pass on
agent paths flagged in v0.11.0's testing review. (v0.12.1)

Upgrading from v0.10.x

From v0.10.x: Re-pull ingero binaries (arm64 users
especially: v0.10.0 arm64 BPF objects never worked). Existing
flags, OTLP, and Prometheus output are forward-compatible.
Operators on multi-tenant hosts should review the trust-model
section added to docs/otlp.md in v0.15.0 before treating
BPF-derived register attributes as security signals.

From v0.11.x or v0.12.x: Schema migrations apply
automatically via the v0.11 ingero migrate framework on
next startup.

From v0.13.x (private): v0.14.0 is the public successor;
see the v0.14.0 section above for the full set of additions.

From v0.14.x or v0.15.x: New inference: YAML block is
additive. Existing OTLP / Prometheus dashboards keep working;
gen_ai.request.model + gen_ai.system data-point attrs and
the ingero.node.id / ingero.cluster.id /
ingero.provider resource attrs only appear when the
inference engine is detected and --cluster is set.

MCP clients: Reauthorize with --mcp-bearer-token on the
agent if you ran v0.14.x with HTTP MCP exposed; v0.15.0
defaults to no bearer (loopback-only) but the
pagerduty_trigger tool is now auto-registered when bearer
auth is configured, so HTTP-exposed MCPs without a bearer
will no longer see that tool.

Security Fixes

Go upgraded to 1.26.3; closes three HIGH CVEs (html/template escaper bypass, net.Dial NUL‑byte panic on Windows, x/net HTTP/2 infinite loop).


About ingero-io/ingero

eBPF-based GPU causal observability agent with MCP server. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency.

ingero-io/ingero

Summary

TL;DR

What's new in v0.16.0

`ingero trace --inference` umbrella

Phase-aware baselines

`ingero.infer.*` on OTLP and Prometheus

Per-outlier OTLP `/v1/traces` span emission

KV-cache lineage tracker

Continuous engine re-detection

Memfrag IOCTL fan-in to the inference engine

Thermal context on outliers

cudaLaunchKernel stream-handle capture

Optional kernel-fingerprint workload key

Hardening and security

CLI / daemon-mode wire-up fixes

Hygiene cluster

`ingero demo --json incident` nil-panic fix

New flags in v0.16.0

New metrics in v0.16.0

Known limitations carried into v0.16.0

Compatibility

What also landed between v0.10.0 and v0.15.0

Added

Added (experimental; gated on `--enable-experimental-kprobes`)

Changed

Fixed

Upgrading from v0.10.x

Security Fixes

Related context

Related tools

Earlier breaking changes
