
ingero-io/ingero

v0.9.1 Breaking

This release includes breaking changes for platform teams planning a safe upgrade.

Published 2mo MCP Data & Storage
✓ No known CVEs patched

Topics

causal-tracing cuda cuda-graphs ebpf gpu gpu-monitoring gpu-observability incident-response kubernetes machine-learning mcp model-context-protocol nvidia observability pytorch sre distributed-tracing

Affected surfaces

auth breaking_upgrade

Summary

AI summary

Ingero v0.9.1 adds multi‑node distributed GPU tracing, fan‑out queries, offline DB merge and Perfetto export.

Full changelog

Note: The multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU
investigation. A dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon.

One command. Every node. Full causal chain.

Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose
which rank stalled and why, then view the whole sequence in a Perfetto timeline.

What's New

Node Identity & Rank Awareness

Every traced event is now tagged with its node identity and distributed training rank.

  • sudo ingero trace --node gpu-node-07 tags all events with the node name
  • Rank auto-detection from torchrun / torch.distributed.launch environment variables (RANK, LOCAL_RANK, WORLD_SIZE)
  • Event IDs are node-namespaced (gpu-node-07:1, gpu-node-07:2) — merge-safe by design
  • Schema v0.9 with backward-compatible migration for existing databases
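
The rank detection and ID namespacing described above can be sketched as follows. This is an illustrative Python sketch, not Ingero's implementation (Ingero itself is written in Go). The helper names `detect_rank` and `event_id_generator` are hypothetical; only the environment variable names and the `node:sequence` ID shape come from the notes above.

```python
import os
from itertools import count


def detect_rank(env=None):
    """Read torchrun-style variables; missing or non-integer values become None."""
    env = os.environ if env is None else env

    def as_int(name):
        try:
            return int(env[name])
        except (KeyError, ValueError):
            return None

    return {
        "rank": as_int("RANK"),
        "local_rank": as_int("LOCAL_RANK"),
        "world_size": as_int("WORLD_SIZE"),
    }


def event_id_generator(node, start=1):
    """Yield node-namespaced IDs such as 'gpu-node-07:1', 'gpu-node-07:2', ..."""
    for seq in count(start):
        yield f"{node}:{seq}"
```

Because the node name is part of every ID, two nodes can both be at sequence number 1 without their IDs colliding after a merge.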

Fleet Fan-Out Queries

Query your entire GPU cluster from one command:

# SQL fan-out across 3 nodes
ingero query --nodes node-1:8080,node-2:8080,node-3:8080 \
  "SELECT node, source, count(*) FROM events GROUP BY node, source"

# Cross-node causal chains sorted by severity
ingero explain --nodes node-1:8080,node-2:8080,node-3:8080

  • Results concatenated with a node column prepended
  • Partial failure handling — unreachable nodes produce warnings, not errors
  • Configure default nodes in ingero.yaml under fleet.nodes
  • --no-tls serves the dashboard over plain HTTP for fleet queries on trusted networks (VPC, VPN)
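
The concatenation and partial-failure behavior listed above can be illustrated with a small Python sketch. It assumes a caller-supplied `fetch(node)` function that returns a list of row tuples or raises on failure. `fan_out`, `fetch`, and the warning wording are hypothetical and do not describe Ingero's actual client code.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(nodes, fetch):
    """Call fetch(node) for every node in parallel.

    Returns (rows, warnings): each row has the node name prepended, and a
    node that fails contributes a warning instead of aborting the query.
    """
    rows, warnings = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        futures = [(node, pool.submit(fetch, node)) for node in nodes]
        for node, future in futures:
            try:
                for row in future.result():
                    rows.append((node, *row))
            except Exception as exc:  # unreachable node: warn and continue
                warnings.append(f"WARNING: {node} unreachable: {exc}")
    return rows, warnings
```

Results are collected in the order the nodes were given, so the output is deterministic even though the requests run concurrently.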

MCP Fleet Tool

AI agents can now investigate entire clusters in one tool call:

query_fleet(action="chains")  →  merged causal chains from all nodes
query_fleet(action="sql", query="SELECT node, count(*) FROM events GROUP BY node")

Actions: chains, ops, overview, sql. Includes clock skew warnings.

Offline Database Merge

For air-gapped environments or offline analysis:

ingero merge node-1.db node-2.db node-3.db -o cluster.db
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains

  • Node-namespaced IDs ensure zero collisions
  • Stack traces deduplicated by hash
  • --force-node assigns identity to pre-v0.9 databases
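
Deduplicating stack traces by hash, as mentioned above, might look like the following Python sketch. The release notes do not say which hash function Ingero uses or how stacks are serialized. SHA-256 over a newline-joined frame string is an assumption made for illustration, and `merge_stacks` is a hypothetical name.

```python
import hashlib


def merge_stacks(*node_stacks):
    """Merge per-node lists of stack-trace strings, one copy per content hash.

    Returns a dict mapping hex digest to stack trace. SHA-256 is an
    assumption here, not necessarily the hash Ingero uses.
    """
    merged = {}
    for stacks in node_stacks:
        for frames in stacks:
            key = hashlib.sha256(frames.encode("utf-8")).hexdigest()
            merged.setdefault(key, frames)  # first occurrence wins
    return merged
```

Identical stacks captured on different nodes hash to the same key, so the merged database would store them once.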

Perfetto Timeline Export

Export multi-node traces for visual timeline analysis:

ingero export --format perfetto -d cluster.db -o trace.json

Open in https://ui.perfetto.dev — one process track per node/rank, CUDA events as duration spans, causal chains as severity-colored markers. Immediately
spot which rank stalled while others waited.
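
For readers unfamiliar with the Chrome Trace Event Format that Perfetto loads, here is a minimal Python sketch of the "one process track per node/rank, events as duration spans" layout. The input field names (`node`, `rank`, `name`, `start_us`, `dur_us`) and the function `to_trace_events` are hypothetical; the exact JSON Ingero emits may differ.

```python
import json


def to_trace_events(events):
    """Convert event dicts into a Chrome Trace Event Format document.

    Each distinct (node, rank) pair gets its own pid, named through a
    'process_name' metadata event. Each event becomes a complete event
    (ph='X') with a start timestamp and duration in microseconds.
    """
    pids = {}
    out = []
    for ev in events:
        key = (ev["node"], ev["rank"])
        if key not in pids:
            pids[key] = len(pids) + 1
            out.append({
                "name": "process_name", "ph": "M", "pid": pids[key],
                "args": {"name": f"{ev['node']} rank {ev['rank']}"},
            })
        out.append({
            "name": ev["name"], "ph": "X", "pid": pids[key], "tid": 0,
            "ts": ev["start_us"], "dur": ev["dur_us"],
        })
    return {"traceEvents": out}


def dump_trace(events, path):
    """Write the trace document to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(to_trace_events(events), fh)
```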

Clock Skew Detection

Automatic NTP-style clock offset estimation across nodes:

WARNING: node-2 is ~47ms ahead of node-1 (RTT: 2ms)

  • Live estimation during fan-out queries (3 samples, median, ~1ms precision on LAN)
  • Offline heuristic during ingero merge (session timestamp comparison)
  • Configurable threshold: --clock-skew-threshold 10ms
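
The NTP-style estimate can be sketched in Python as below. Each sample records the local send time, the remote server timestamp, and the local receive time. The offset is the server time minus the midpoint of the round trip, and the median over the samples damps outliers. The function names, the "behind" wording, and the 10 ms default threshold are assumptions for illustration; only the warning format for the "ahead" case is taken from the example above.

```python
import statistics


def estimate_offset(samples):
    """samples: (t_send, t_server, t_recv) triples, all in milliseconds.

    Returns (median offset, median round-trip time). A positive offset
    means the remote clock is ahead of the local clock.
    """
    offsets = [t_server - (t_send + t_recv) / 2 for t_send, t_server, t_recv in samples]
    rtts = [t_recv - t_send for t_send, _, t_recv in samples]
    return statistics.median(offsets), statistics.median(rtts)


def skew_warning(remote, local, offset_ms, rtt_ms, threshold_ms=10):
    """Return a warning string when |offset| reaches the threshold, else None."""
    if abs(offset_ms) < threshold_ms:
        return None
    direction = "ahead of" if offset_ms > 0 else "behind"
    return (f"WARNING: {remote} is ~{abs(offset_ms):.0f}ms {direction} "
            f"{local} (RTT: {rtt_ms:.0f}ms)")
```

The midpoint assumption (equal network delay in each direction) is what limits precision, which is consistent with the roughly 1 ms figure quoted for a LAN.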

New Commands

| Command | Description |
|---------|-------------|
| ingero merge | Merge SQLite databases from multiple nodes |
| ingero export --format perfetto | Export to Chrome Trace Event Format |

New Flags

| Flag | Commands | Description |
|------|----------|-------------|
| --node | trace | Tag events with node identity |
| --nodes | query, explain, export | Fan-out to multiple nodes |
| --json | check | Output system readiness results as JSON |
| --no-tls | dashboard | Plain HTTP for fleet queries on trusted networks |
| --force-node | merge | Assign node identity to legacy databases |
| --clock-skew-threshold | query, explain, merge | Clock skew warning threshold |
| --timeout | query, explain, export | Per-node timeout for fleet queries |
| --ca-cert, --client-cert, --client-key | query, explain, export | Optional mTLS for fleet queries |

New API Endpoints

| Endpoint | Description |
|----------|-------------|
| POST /api/v1/query | Execute read-only SQL (used by fleet fan-out) |
| GET /api/v1/time | Server timestamp for clock skew estimation |

New MCP Tool

| Tool | Description |
|------|-------------|
| query_fleet | Fan-out query across multiple nodes (chains, ops, overview, sql) |

Sample Data

Multi-node sample databases are included in investigations/ — 3 node databases (180-252 KB each), a merged cluster database, and a Perfetto timeline. Try them:

ingero explain --db investigations/sample-cluster.db --chains
ingero export --format perfetto --db investigations/sample-cluster.db -o trace.json

Validated On

  • 3 x AWS g4dn.xlarge (Tesla T4, 15 GB VRAM), Kernel 6.17.0-1007-aws, NVIDIA 580.126.09
  • Fan-out query, explain, merge, export, clock skew, partial failure (1 node down), single-node backward compatibility
  • Mixed binary + Docker deployment across fleet nodes
  • All existing tests pass + 80 new tests across 6 packages
  • Full Ingero-EE orchestrator validation: OOM deflection, straggler remediation, watchdog, NCCL suspend/resume, fault injection, recovery persistence

Upgrade Notes

  • Schema migration: Existing databases are automatically migrated to v0.9 on first open (adds node, rank, local_rank, world_size columns). Migration is
    non-destructive — existing data is preserved.
  • Backward compatible: All single-node workflows are unchanged. The --node flag defaults to os.Hostname() when not specified.
  • No new dependencies: Pure Go, no CGO, no new external libraries.
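
A non-destructive column-adding migration of the kind described above can be sketched with Python's built-in sqlite3 module. The `events` table name appears in the query examples in these notes, but the column types below and the `migrate` helper are assumptions; Ingero's real migration code is not shown here.

```python
import sqlite3

# Column names come from the upgrade note; the SQL types are assumed.
NEW_COLUMNS = {
    "node": "TEXT",
    "rank": "INTEGER",
    "local_rank": "INTEGER",
    "world_size": "INTEGER",
}


def migrate(conn, table="events"):
    """Add any missing columns to the table; existing rows are left untouched."""
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f'ALTER TABLE {table} ADD COLUMN "{name}" {sql_type}')
    conn.commit()
```

Checking `PRAGMA table_info` first makes the migration idempotent, so opening an already-migrated database is a no-op. Rows written before the migration read back with NULL in the new columns.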

Bug Fixes

  • Fixed demo --no-gpu nil pointer panic when GPU is present on the machine (nil RankCache in synthetic mode)
  • Fixed --nodes "[host:port,...]" bracket format including brackets in hostnames — now strips surrounding brackets before parsing
  • Fixed MCP query_fleet sql action rejecting the query parameter — now accepts both query and sql fields
  • Added --ca-cert, --client-cert, --client-key mTLS flags to export command (query and explain had them, export did not)
  • Added --json flag to check command for structured JSON output
  • Fixed gpu-test.sh using wrong Python path — now auto-detects /opt/pytorch/bin/python3 when system Python lacks PyTorch
  • Fixed staticcheck S1001 lint: removed unnecessary copy loop in /api/v1/query handler
  • Fixed eventSeq not being seeded from the DB on restart, which could cause ID collisions across trace sessions
  • Fixed nil pointer panic in merge batch commit loop on disk-full conditions
  • Fixed race condition in MCP fleet client initialization (now uses sync.Once)
  • Fixed silent I/O error swallowing in Perfetto export writer
  • Fixed duplicate IDs when merging multiple legacy DBs with same --force-node
  • Fixed path alias bypass in merge output-source collision check
  • Fixed URL injection via unsanitized since parameter in fleet client
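
The bracket-stripping fix for --nodes can be illustrated with a short Python sketch. `parse_nodes` is a hypothetical helper that shows the parsing idea only and is not the actual Go code.

```python
def parse_nodes(value):
    """Split a --nodes value into host:port entries.

    Accepts both 'a:8080,b:8080' and the bracketed '[a:8080,b:8080]' form;
    surrounding brackets are stripped before splitting on commas.
    """
    value = value.strip()
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    return [part.strip() for part in value.split(",") if part.strip()]
```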

About ingero-io/ingero

eBPF-based GPU causal observability agent with MCP server. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency.


Related context

Earlier breaking changes

  • v0.17.0 Dropped 'annotate --socket' option from CLI.
