This release includes breaking changes for platform teams planning a safe upgrade.
✓ No known CVEs patched in this version
Summary
AI summary: Ingero v0.9.1 adds multi-node distributed GPU tracing, fan-out queries, offline DB merge, and Perfetto export.
Full changelog
Note: The multi-node features in this release (fan-out queries, offline merge, Perfetto export) are interim solutions for cross-node GPU
investigation. A dedicated cluster-level observability and diagnostics tool with native multi-node support is coming soon.
One command. Every node. Full causal chain.
Ingero can now investigate distributed GPU workloads across multiple nodes from a single CLI command, MCP tool call, or offline database merge. Diagnose
which rank stalled and why, then view the whole sequence in a Perfetto timeline.
What's New
Node Identity & Rank Awareness
Every traced event is now tagged with its node identity and distributed training rank.
- `sudo ingero trace --node gpu-node-07` tags all events with the node name
- Rank auto-detection from `torchrun`/`torch.distributed.launch` environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`)
- Event IDs are node-namespaced (`gpu-node-07:1`, `gpu-node-07:2`), merge-safe by design
- Schema v0.9 with backward-compatible migration for existing databases
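To make the tagging scheme concrete, here is a minimal Python sketch of rank detection and node-namespaced ID generation. It is an illustration only, not Ingero's Go implementation: the function names `detect_rank` and `event_ids` are hypothetical, and the only details taken from the notes above are the three environment variable names, the `node:sequence` ID shape, and the hostname default.

```python
import os
import socket
from itertools import count


def detect_rank(env=os.environ):
    """Read torchrun-style rank variables; missing or malformed values become None."""
    def as_int(name):
        value = env.get(name)
        try:
            return int(value) if value is not None else None
        except ValueError:
            return None

    return {
        "rank": as_int("RANK"),
        "local_rank": as_int("LOCAL_RANK"),
        "world_size": as_int("WORLD_SIZE"),
    }


def event_ids(node=None, start=1):
    """Yield node-namespaced IDs such as 'gpu-node-07:1', 'gpu-node-07:2'."""
    # Hypothetical default: fall back to the local hostname when no node is given.
    node = node or socket.gethostname()
    for seq in count(start):
        yield f"{node}:{seq}"
```

Because the node name is part of every ID, two nodes can both count from 1 without their IDs ever colliding once the databases are combined.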
Fleet Fan-Out Queries
Query your entire GPU cluster from one command:
# SQL fan-out across 3 nodes
ingero query --nodes node-1:8080,node-2:8080,node-3:8080 \
"SELECT node, source, count(*) FROM events GROUP BY node, source"
# Cross-node causal chains sorted by severity
ingero explain --nodes node-1:8080,node-2:8080,node-3:8080
- Results concatenated with a node column prepended
- Partial failure handling — unreachable nodes produce warnings, not errors
- Configure default nodes in `ingero.yaml` under `fleet.nodes`
- `--no-tls` mode for dashboard on trusted networks (VPC, VPN)
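The fan-out behaviour described above (rows concatenated with a node column prepended, unreachable nodes reported as warnings) can be sketched in a few lines of Python. This is a hedged illustration rather than Ingero's actual client: `fan_out` and the `fetch` callback are hypothetical names, and `fetch` stands in for whatever HTTP call a real client would make, including its own timeout handling.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out(nodes, fetch):
    """Run fetch(node) on every node concurrently.

    Returns (rows, warnings). Each row gets the node name prepended as its
    first column; a node that raises contributes a warning instead of
    aborting the whole query.
    """
    rows, warnings = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        # Submit everything first so the requests overlap, then collect in order.
        futures = {node: pool.submit(fetch, node) for node in nodes}
        for node, future in futures.items():
            try:
                for row in future.result():
                    rows.append((node, *row))
            except Exception as exc:  # unreachable node, timeout, bad response
                warnings.append(f"WARNING: {node} skipped: {exc}")
    return rows, warnings
```

Collecting warnings separately from rows is what lets a query with one node down still return useful results from the others.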
MCP Fleet Tool
AI agents can now investigate entire clusters in one tool call:
query_fleet(action="chains") → merged causal chains from all nodes
query_fleet(action="sql", query="SELECT node, count(*) FROM events GROUP BY node")
Actions: chains, ops, overview, sql. Includes clock skew warnings.
Offline Database Merge
For air-gapped environments or offline analysis:
ingero merge node-1.db node-2.db node-3.db -o cluster.db
ingero query -d cluster.db --since 1h
ingero explain -d cluster.db --chains
- Node-namespaced IDs ensure zero collisions
- Stack traces deduplicated by hash
- `--force-node` assigns identity to pre-v0.9 databases
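The two merge properties listed above can be illustrated with Python's built-in `sqlite3` module. The sketch below is an assumption-laden toy: the `events` and `stacks` tables and their columns are hypothetical stand-ins, since these notes do not show Ingero's real schema, and `merge` is not the actual `ingero merge` code.

```python
import sqlite3

# Hypothetical two-table schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (id TEXT PRIMARY KEY, node TEXT, source TEXT, stack_hash TEXT);
CREATE TABLE IF NOT EXISTS stacks (hash TEXT PRIMARY KEY, frames TEXT);
"""


def merge(sources, dest):
    """Copy events and stacks from each source connection into dest."""
    dest.executescript(SCHEMA)
    for src in sources:
        # Node-namespaced IDs ("node-1:1") are unique across nodes, so a plain
        # INSERT is enough; a collision would raise IntegrityError loudly.
        dest.executemany(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            src.execute("SELECT id, node, source, stack_hash FROM events"),
        )
        # Identical stacks share a hash, so repeats are dropped on insert.
        dest.executemany(
            "INSERT OR IGNORE INTO stacks VALUES (?, ?)",
            src.execute("SELECT hash, frames FROM stacks"),
        )
    dest.commit()
```

Using the stack hash as a primary key with `INSERT OR IGNORE` is one simple way to deduplicate; the real tool may do this differently.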
Perfetto Timeline Export
Export multi-node traces for visual timeline analysis:
ingero export --format perfetto -d cluster.db -o trace.json
Open in https://ui.perfetto.dev — one process track per node/rank, CUDA events as duration spans, causal chains as severity-colored markers. Immediately
spot which rank stalled while others waited.
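For readers unfamiliar with the Chrome Trace Event Format that Perfetto reads, the following Python sketch shows the general shape of such a file: one metadata event naming each process track and one complete ("X") event per duration span. The input tuple layout and the function names are hypothetical, and the sketch leaves out the causal-chain markers; it is not a description of what `ingero export` actually emits.

```python
import json


def to_trace_events(events):
    """Convert (node, rank, name, start_us, duration_us) tuples into Chrome
    Trace Event Format: one process per node/rank, one complete event per span."""
    pids = {}
    out = []
    for node, rank, name, start_us, duration_us in events:
        key = (node, rank)
        if key not in pids:
            pids[key] = len(pids) + 1
            # An "M" metadata event named process_name labels the track in the UI.
            out.append({
                "ph": "M", "name": "process_name", "pid": pids[key], "tid": 0,
                "args": {"name": f"{node} (rank {rank})"},
            })
        # "X" is a complete event: start timestamp plus duration, in microseconds.
        out.append({
            "ph": "X", "name": name, "cat": "cuda",
            "pid": pids[key], "tid": 0, "ts": start_us, "dur": duration_us,
        })
    return {"traceEvents": out}


def dumps_trace(events):
    """Serialize the trace to a JSON string that a trace viewer can load."""
    return json.dumps(to_trace_events(events))
```

Giving each node/rank pair its own `pid` is what places ranks on separate tracks, so a rank with a long span next to idle peers stands out visually.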
Clock Skew Detection
Automatic NTP-style clock offset estimation across nodes:
WARNING: node-2 is ~47ms ahead of node-1 (RTT: 2ms)
- Live estimation during fan-out queries (3 samples, median, ~1ms precision on LAN)
- Offline heuristic during `ingero merge` (session timestamp comparison)
- Configurable threshold: `--clock-skew-threshold 10ms`
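The live estimation can be understood through the standard NTP-style calculation: read the local clock before and after asking the remote node for its time, assume the network delay is symmetric, and take the median over a few samples. The Python sketch below follows that textbook approach; `estimate_offset` and its parameters are hypothetical names, and Ingero's own estimator may differ in detail.

```python
import statistics
import time


def estimate_offset(remote_time, local_time=time.time, samples=3):
    """Return (offset_seconds, rtt_seconds) for a remote clock.

    A positive offset means the remote clock is ahead of the local one.
    """
    measurements = []
    for _ in range(samples):
        t0 = local_time()
        remote = remote_time()  # e.g. a timestamp fetched from GET /api/v1/time
        t1 = local_time()
        rtt = t1 - t0
        # Symmetric-delay assumption: the remote reading was taken at the
        # midpoint of the round trip.
        measurements.append((remote - (t0 + t1) / 2, rtt))
    offset = statistics.median(m[0] for m in measurements)
    rtt = statistics.median(m[1] for m in measurements)
    return offset, rtt
```

With a 2 ms round trip and a remote reading 48 ms past the request time, the estimate is 48 ms minus half the round trip, or 47 ms, which matches the shape of the warning shown above. Taking the median rather than the mean keeps a single slow sample from distorting the result.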
New Commands
| Command | Description |
|---------|-------------|
| ingero merge | Merge SQLite databases from multiple nodes |
| ingero export --format perfetto | Export to Chrome Trace Event Format |
New Flags
| Flag | Commands | Description |
|------|----------|-------------|
| --node | trace | Tag events with node identity |
| --nodes | query, explain, export | Fan-out to multiple nodes |
| --json | check | Output system readiness results as JSON |
| --no-tls | dashboard | Plain HTTP for fleet queries on trusted networks |
| --force-node | merge | Assign node identity to legacy databases |
| --clock-skew-threshold | query, explain, merge | Clock skew warning threshold |
| --timeout | query, explain, export | Per-node timeout for fleet queries |
| --ca-cert, --client-cert, --client-key | query, explain, export | Optional mTLS for fleet queries |
New API Endpoints
| Endpoint | Description |
|----------|-------------|
| POST /api/v1/query | Execute read-only SQL (used by fleet fan-out) |
| GET /api/v1/time | Server timestamp for clock skew estimation |
New MCP Tool
| Tool | Description |
|------|-------------|
| query_fleet | Fan-out query across multiple nodes (chains, ops, overview, sql) |
Sample Data
Multi-node sample databases are included in investigations/ — 3 node databases (180-252 KB each), a merged cluster database, and a Perfetto timeline. Try them:
ingero explain --db investigations/sample-cluster.db --chains
ingero export --format perfetto --db investigations/sample-cluster.db -o trace.json
Validated On
- 3 x AWS g4dn.xlarge (Tesla T4, 15 GB VRAM), Kernel 6.17.0-1007-aws, NVIDIA 580.126.09
- Fan-out query, explain, merge, export, clock skew, partial failure (1 node down), single-node backward compatibility
- Mixed binary + Docker deployment across fleet nodes
- All existing tests pass + 80 new tests across 6 packages
- Full Ingero-EE orchestrator validation: OOM deflection, straggler remediation, watchdog, NCCL suspend/resume, fault injection, recovery persistence
Upgrade Notes
- Schema migration: Existing databases are automatically migrated to v0.9 on first open (adds node, rank, local_rank, world_size columns). Migration is non-destructive: existing data is preserved.
- Backward compatible: All single-node workflows are unchanged. The `--node` flag defaults to `os.Hostname()` when not specified.
- No new dependencies: Pure Go, no CGO, no new external libraries.
Bug Fixes
- Fixed `demo --no-gpu` nil pointer panic when GPU is present on the machine (nil RankCache in synthetic mode)
- Fixed `--nodes "[host:port,...]"` bracket format including brackets in hostnames; now strips surrounding brackets before parsing
- Fixed MCP `query_fleet` sql action rejecting the `query` parameter; now accepts both `query` and `sql` fields
- Added `--ca-cert`, `--client-cert`, `--client-key` mTLS flags to `export` command (query and explain had them, export did not)
- Added `--json` flag to `check` command for structured JSON output
- Fixed `gpu-test.sh` using wrong Python path; now auto-detects `/opt/pytorch/bin/python3` when system Python lacks PyTorch
- Fixed staticcheck S1001 lint: removed unnecessary copy loop in `/api/v1/query` handler
- Fixed eventSeq not seeded from DB on restart (prevents ID collisions across trace sessions)
- Fixed nil pointer panic in merge batch commit loop on disk-full conditions
- Fixed race condition in MCP fleet client initialization (now uses `sync.Once`)
- Fixed silent I/O error swallowing in Perfetto export writer
- Fixed duplicate IDs when merging multiple legacy DBs with same `--force-node`
- Fixed path alias bypass in merge output-source collision check
- Fixed URL injection via unsanitized `since` parameter in fleet client
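As an illustration of the bracket-format fix for `--nodes`, the Python sketch below strips one pair of surrounding brackets before splitting on commas. The function name `parse_nodes` is hypothetical and the logic is a guess at the general idea, not a port of the Go fix.

```python
def parse_nodes(value):
    """Split a --nodes value into host:port entries, tolerating the '[a,b]' form."""
    value = value.strip()
    # Only strip when the brackets wrap the whole value; an entry such as
    # "[::1]:8080" starts with "[" but does not end with "]", so it is kept.
    if value.startswith("[") and value.endswith("]"):
        value = value[1:-1]
    return [item.strip() for item in value.split(",") if item.strip()]
```

One limitation of this simple version: a lone bracketed IPv6 literal with no port (for example `[::1]`) would lose its brackets, so a production parser would need a more careful rule.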
About ingero-io/ingero
eBPF-based GPU causal observability agent with MCP server. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency.