ingero-io/ingero

v0.9.0 Feature

This release adds three notable features: CUDA Graph lifecycle tracing, a remediation API, and straggler detection.

Published 2mo MCP Data & Storage
✓ No known CVEs patched

Topics

causal-tracing cuda cuda-graphs ebpf gpu gpu-monitoring
gpu-observability incident-response kubernetes machine-learning mcp model-context-protocol nvidia observability pytorch sre distributed-tracing

Summary

Ingero adds full CUDA Graph lifecycle tracing, a remediation API via Unix socket, and straggler detection.

Full changelog

Ingero can now trace the full CUDA Graph lifecycle (capture, instantiate, launch) via eBPF uprobes on libcudart.so.
It requires no application modification and no CUPTI dependency, and its overhead is low enough for production use.

CUDA Graph Observability

  • eBPF probes for cudaStreamBeginCapture, cudaStreamEndCapture, cudaGraphInstantiate, and cudaGraphLaunch — covers the stream capture path used by PyTorch torch.compile, vLLM, and TensorRT-LLM
  • Causal correlation connects graph events to system state: OOM during graph capture, CPU scheduling interference delaying graph dispatch, graph launch frequency anomalies (pool exhaustion), and captured-but-never-launched graphs wasting VRAM
  • MCP tools: graph_lifecycle (timeline of all graph events for a PID) and graph_frequency (per-executable launch rates, hot/cold graph classification, pool saturation detection)
  • ingero explain now includes graph context in causal chains when graph events are relevant
  • Graceful degradation — if graph API symbols are absent (older CUDA), Ingero skips graph probes silently and continues normally
  • Validated at 5,000+ GraphLaunch/sec on an EC2 g4dn.xlarge with torch.compile(mode="reduce-overhead"); overhead stayed within the <2% budget
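
The hot/cold classification and the captured-but-never-launched check described above can be sketched roughly as follows. This is an illustration only: the GraphEvent record, the exec_id field, and the hot_rate threshold are hypothetical names invented for this example, not Ingero's actual schema or defaults.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class GraphEvent:
    # Hypothetical event record; Ingero's real event schema may differ.
    ts: float      # seconds since the start of the trace window
    kind: str      # "instantiate" or "launch"
    exec_id: int   # assumed identifier of an instantiated graph executable


def summarize(events, window_s, hot_rate=100.0):
    """Return (launch rate per executable, never-launched ids, hot ids).

    hot_rate is an illustrative threshold in launches per second.
    """
    instantiated = {e.exec_id for e in events if e.kind == "instantiate"}
    launches = Counter(e.exec_id for e in events if e.kind == "launch")
    # Counter returns 0 for executables that were never launched.
    rates = {gid: launches[gid] / window_s for gid in instantiated}
    never_launched = sorted(gid for gid in instantiated if launches[gid] == 0)
    hot = sorted(gid for gid, rate in rates.items() if rate >= hot_rate)
    return rates, never_launched, hot


# Usage: executable 1 is launched 500 times in one second, executable 2 never.
events = [GraphEvent(0.0, "instantiate", 1), GraphEvent(0.1, "instantiate", 2)]
events += [GraphEvent(0.2 + i * 0.001, "launch", 1) for i in range(500)]
rates, never_launched, hot = summarize(events, window_s=1.0)
```

In this sketch an executable that appears in never_launched is a candidate for the "wasting VRAM" case, since it was instantiated but produced no launch events in the window.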

Remediation API

Ingero now exposes an optional remediation API over a Unix domain socket (/tmp/ingero-remediate.sock) using type-discriminated NDJSON: each line is a JSON object whose "type" field identifies the signal. External tools can consume real-time {"type":"memory"} and {"type":"straggle"} signals to build custom remediation workflows. Enable it by passing --remediate to ingero trace. See docs/remediation-protocol.md for integration details.
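
A consumer of this socket might look roughly like the sketch below. It assumes a stream-oriented socket (SOCK_STREAM) with one JSON object per line; the release notes do not state the socket type, and only the "type" field is taken from the notes. The handler functions are hypothetical, and docs/remediation-protocol.md remains the authority on the actual message fields.

```python
import json
import socket

SOCKET_PATH = "/tmp/ingero-remediate.sock"  # path given in the release notes


def dispatch(line, handlers):
    """Parse one NDJSON line and route it by its "type" field."""
    line = line.strip()
    if not line:
        return None
    msg = json.loads(line)
    handler = handlers.get(msg.get("type"))
    # Unknown types are ignored so new signal types do not break the consumer.
    return handler(msg) if handler else None


def consume(path, handlers):
    """Connect to the socket and dispatch each newline-delimited message.

    Assumes SOCK_STREAM; adjust if the protocol document says otherwise.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        with sock.makefile("r", encoding="utf-8") as stream:
            for line in stream:
                dispatch(line, handlers)


# Hypothetical handlers: a real workflow would act on the message contents.
HANDLERS = {
    "memory": lambda msg: "handle memory pressure",
    "straggle": lambda msg: "handle straggler",
}
```

With ingero trace --remediate running, calling consume(SOCKET_PATH, HANDLERS) would block and route signals as they arrive. Keeping dispatch separate from the socket loop makes the routing logic easy to test without a live agent.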

Straggler Detection

  • New internal/straggler package: tracks a per-PID exponential moving average (EMA) throughput baseline and counts sched_switch events as a measure of scheduling contention
  • Correlated detection — both throughput drop and scheduling contention must fire to avoid false positives
  • Sustained signal re-emission for downstream consumers that need periodic updates
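
The correlated-detection idea in the list above can be illustrated with a small sketch. It is not the internal/straggler implementation: the class name, the smoothing factor, and every threshold below are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StragglerDetector:
    # All values are illustrative assumptions, not Ingero defaults.
    alpha: float = 0.2         # EMA smoothing factor
    drop_ratio: float = 0.7    # below 70% of baseline counts as a throughput drop
    min_switches: int = 50     # sched_switch count treated as contention
    reemit_every: int = 3      # re-emit every N consecutive straggling intervals
    baseline: Optional[float] = None
    streak: int = 0

    def observe(self, throughput, switches):
        """Feed one interval for one PID; return True if a signal should be emitted."""
        if self.baseline is None:
            self.baseline = throughput  # first sample seeds the baseline
            return False
        dropped = throughput < self.drop_ratio * self.baseline
        contended = switches >= self.min_switches
        if dropped and contended:
            # Both conditions must hold, which filters out single-cause noise.
            self.streak += 1
            # Emit on first detection, then periodically while it persists.
            return (self.streak - 1) % self.reemit_every == 0
        self.streak = 0
        # The baseline is updated only when no straggle is flagged.
        self.baseline = self.alpha * throughput + (1 - self.alpha) * self.baseline
        return False
```

For per-PID tracking, one detector per process could be kept in a dictionary keyed by PID. Freezing the baseline while a straggle is flagged is a design choice of this sketch: it keeps a long stall from dragging the baseline down until the stall looks normal.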

About ingero-io/ingero

eBPF-based GPU causal observability agent with MCP server. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency.

Related context

Earlier breaking changes

  • v0.17.0 Dropped 'annotate --socket' option from CLI.
