# ADR 0031: OpenTelemetry Observability for Tasks and Agents

## Status

Proposed

## Date

2026-04-07

## Context

KubeOpenCode runs AI agents as Kubernetes workloads. Understanding what happens inside a Task — how many tokens were consumed, which tools were called, why the task was slow or failed — is critical for enterprise adoption. Users currently have limited visibility into Task execution.

### Current State

**KubeOpenCode side:**
- Prometheus metrics exist (`internal/controller/metrics.go`): `kubeopencode_tasks_total`, `kubeopencode_task_duration_seconds`, `kubeopencode_agent_capacity`, `kubeopencode_agent_queue_length`
- No distributed tracing or OpenTelemetry integration
- Agent Pod labels support PodMonitor/ServiceMonitor discovery

**OpenCode side:**
- OpenCode already supports OpenTelemetry via the AI SDK's `experimental_telemetry` flag
- Configuration: `experimental.openTelemetry: true` in `opencode.json`
- Emits spans for every LLM call with metadata (userId, sessionId)
- Depends on `@opentelemetry/api@1.9.0` through the `ai` package
- Requires an external OTel Collector to receive spans (no built-in exporter)

**Kubernetes OTel ecosystem:**
- The [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/) supports auto-injection of collector sidecars via pod annotations
- The OTel Collector can export to Jaeger, Tempo, OTLP endpoints, etc.
- The K8s attributes processor automatically enriches spans with pod/namespace/node metadata

### Problem

1. Users cannot see what happened inside a Task (token usage, tool calls, LLM latency)
2. Failed Tasks are hard to debug — users only see a "Failed" status with no breakdown
3. Cost attribution per namespace/team/agent is not possible
4. CronTask patterns (recurring tasks) have no performance trend visibility

## Decision

Integrate OpenTelemetry in phases, leveraging the existing OpenCode OTel support and the Kubernetes OTel Operator ecosystem.

### Phase 1: Enable OpenCode OTel Spans (Low effort, high value)

**Goal**: Surface LLM-level observability (token usage, latency, model info) for every Task.

**Changes:**

1. **Auto-inject OTel config into OpenCode** — When observability is enabled, the controller injects `experimental.openTelemetry: true` into the OpenCode config (`/tools/opencode.json`) during Pod construction in `pod_builder.go`.

2. **Propagate Task identity via environment variables** — Set `OTEL_RESOURCE_ATTRIBUTES` on the worker container:
   ```
   OTEL_RESOURCE_ATTRIBUTES=kubeopencode.task.name=<task>,kubeopencode.task.namespace=<ns>,kubeopencode.agent.name=<agent>
   ```
   This enriches all spans with Task/Agent identity without modifying OpenCode.

3. **OTel Collector sidecar injection** — Users deploy the [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/) and create an `OpenTelemetryCollector` resource. KubeOpenCode adds the annotation `sidecar.opentelemetry.io/inject: "true"` to Task/Agent Pods when observability is enabled, triggering automatic collector sidecar injection by the OTel Operator.

4. **Configuration** — Add an `observability` field to `KubeOpenCodeConfig` (cluster-scoped):
   ```yaml
   apiVersion: kubeopencode.io/v1alpha1
   kind: KubeOpenCodeConfig
   metadata:
     name: cluster
   spec:
     observability:
       openTelemetry:
         enabled: true
         # Optional: override the OTel Collector sidecar injection annotation
         collectorAnnotation: "sidecar.opentelemetry.io/inject"
         collectorAnnotationValue: "true"
   ```

**What users get**: Every LLM call in every Task produces OTel spans visible in Jaeger/Tempo/Grafana, with Task/Agent/namespace identity attached.
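The Pod-construction side of changes 2 and 3 can be sketched in Go. This is a minimal illustration, not the actual `pod_builder.go` code: the helper names `buildOtelResourceAttributes` and `observabilityAnnotations` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// buildOtelResourceAttributes renders the OTEL_RESOURCE_ATTRIBUTES value
// that ties every span emitted by the worker back to its Task and Agent.
func buildOtelResourceAttributes(task, namespace, agent string) string {
	attrs := []string{
		"kubeopencode.task.name=" + task,
		"kubeopencode.task.namespace=" + namespace,
		"kubeopencode.agent.name=" + agent,
	}
	return strings.Join(attrs, ",")
}

// observabilityAnnotations returns the pod annotations that trigger
// collector sidecar injection by the OTel Operator when enabled.
func observabilityAnnotations(enabled bool) map[string]string {
	if !enabled {
		return nil
	}
	return map[string]string{"sidecar.opentelemetry.io/inject": "true"}
}

func main() {
	fmt.Println(buildOtelResourceAttributes("demo-task", "team-a", "coder"))
	fmt.Println(observabilityAnnotations(true))
}
```

Because both hooks are pure additions to Pod construction, disabling observability simply skips the env var and annotation, leaving Pods unchanged.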

### Phase 2: Controller-Side Tracing (Medium effort)

**Goal**: End-to-end trace from Task creation to completion.

**Changes:**

1. **Add OTel tracing to controller reconcile loops** — Use the Go OTel SDK (`go.opentelemetry.io/otel`) to create spans for key operations:
   - Task reconcile: `task.reconcile` (root span)
   - Pod creation: `task.pod.create`
   - Context resolution: `task.context.resolve` (git clone, configmap fetch, URL fetch)
   - Status updates: `task.status.update`

2. **Inject trace context into the Pod** — Pass the `traceparent` header as an environment variable (`TRACEPARENT`) so OpenCode spans are linked to the controller's trace. This creates a single trace spanning controller → init containers → OpenCode LLM calls.

3. **CronTask trace correlation** — CronTask-triggered Tasks carry a `kubeopencode.crontask.name` attribute, enabling queries like "show me all traces for CronTask X over the past week".

**What users get**: A single trace view showing the full lifecycle: reconcile → git-init → context-init → LLM call 1 → tool call → LLM call 2 → completion.
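The `TRACEPARENT` propagation in change 2 boils down to rendering a W3C `traceparent` header. The sketch below uses only the standard library so the format is visible; in the real controller the trace and span IDs would come from the active `go.opentelemetry.io/otel` span (via the `propagation.TraceContext` propagator), not from random bytes.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newTraceparent renders a W3C traceparent header:
// version "00", 16-byte trace ID, 8-byte parent span ID, sampled flag "01".
func newTraceparent(traceID [16]byte, spanID [8]byte) string {
	return fmt.Sprintf("00-%s-%s-01",
		hex.EncodeToString(traceID[:]),
		hex.EncodeToString(spanID[:]))
}

func main() {
	var tid [16]byte
	var sid [8]byte
	if _, err := rand.Read(tid[:]); err != nil {
		panic(err)
	}
	if _, err := rand.Read(sid[:]); err != nil {
		panic(err)
	}
	// The controller would set this as the TRACEPARENT env var on the
	// worker container so OpenCode's spans join the controller's trace.
	fmt.Println("TRACEPARENT=" + newTraceparent(tid, sid))
}
```

Since OpenCode's OTel support reads standard trace-context environment conventions through the collector pipeline, no OpenCode changes are needed to stitch the two halves of the trace together.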

### Phase 3: OTel Metrics (Complement Prometheus)

**Goal**: Export LLM-specific metrics via OTel for cost and performance dashboards.

This phase bridges the gap between existing Prometheus metrics (infrastructure-level) and LLM-specific metrics:

| Metric | Source | Purpose |
|--------|--------|---------|
| `kubeopencode.llm.tokens.input` | OTel spans | Cost attribution |
| `kubeopencode.llm.tokens.output` | OTel spans | Cost attribution |
| `kubeopencode.llm.duration` | OTel spans | Latency tracking |
| `kubeopencode.llm.calls` | OTel spans | Usage patterns |
| `kubeopencode.tool.calls` | OTel spans | Tool usage analysis |

These can be derived from spans using the OTel Collector's [Span Metrics Connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector), requiring no code changes — just collector configuration.

**What users get**: Grafana dashboards showing token costs by namespace/agent/user, LLM latency trends, and tool usage breakdown.
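A minimal collector configuration sketch for this phase, assuming the contrib distribution of the OTel Collector (which ships the `spanmetrics` connector); the dimension names are illustrative and mirror the resource attributes injected in Phase 1:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}
connectors:
  # Derives call counts and duration histograms from incoming spans.
  spanmetrics:
    dimensions:
      - name: kubeopencode.task.namespace
      - name: kubeopencode.agent.name
exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
```

The connector acts as an exporter in the traces pipeline and a receiver in the metrics pipeline, so span-derived metrics flow to Prometheus without touching KubeOpenCode or OpenCode code.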

### What We Explicitly Do NOT Build

- **Custom tracing UI** — Use existing tools (Jaeger, Grafana Tempo)
- **Built-in OTel Collector** — Users bring their own via the OTel Operator
- **Mandatory dependency** — OTel is opt-in; KubeOpenCode works without it
- **Log collection** — Out of scope; handled by standard K8s logging stacks

## Consequences

### Positive

- **Zero-dependency for basic use** — OTel is purely opt-in via `KubeOpenCodeConfig`
- **Leverages existing ecosystem** — OTel Operator + Collector + Grafana; no custom infrastructure
- **Enterprise value** — Cost visibility, debugging, and audit trails are top enterprise requirements
- **Incremental** — Each phase delivers standalone value; no big-bang rollout
- **OpenCode already supports it** — Phase 1 mostly wires up existing capabilities

### Negative

- **OTel Operator dependency for full value** — Users need to install the OTel Operator separately
- **Experimental flag risk** — OpenCode's `experimental_telemetry` could change or be removed upstream
- **Sidecar overhead** — The OTel Collector sidecar adds roughly 50 MB of memory per Pod (configurable)
- **Phase 2 adds Go dependencies** — `go.opentelemetry.io/otel` and related packages

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| OpenCode removes `experimental_telemetry` | The flag is moving toward stable in the AI SDK; if removed, we adapt the config injection |
| OTel Collector sidecar adds latency | The collector exports asynchronously (batch exporter); no impact on LLM call latency |
| Too much trace data for long Tasks | Configure sampling in the OTel Collector (head or tail sampling) |

## References

- [OpenTelemetry Operator for Kubernetes](https://opentelemetry.io/docs/kubernetes/operator/)
- [AI SDK Telemetry](https://sdk.vercel.ai/docs/ai-sdk-core/telemetry)
- [OTel Collector Span Metrics Connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector)
- ADR 0013: Defer Token Usage Tracking to Post-v0.1 (related — OTel spans provide token data)