Commit 1db2f40

xuezhaojun and claude committed
docs: add ADR 0031 for OpenTelemetry observability integration
Propose a phased approach to integrate OpenTelemetry into KubeOpenCode, leveraging OpenCode's existing experimental_telemetry support and the Kubernetes OTel Operator ecosystem for LLM call tracing, cost attribution, and end-to-end Task lifecycle observability.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: xuezhaojun <zxue@redhat.com>
1 parent f1e5e9b commit 1db2f40

1 file changed: 151 additions, 0 deletions
@@ -0,0 +1,151 @@
# ADR 0031: OpenTelemetry Observability for Tasks and Agents

## Status

Proposed

## Date

2026-04-07

## Context

KubeOpenCode runs AI agents as Kubernetes workloads. Understanding what happens inside a Task (how many tokens were consumed, which tools were called, why the Task was slow or failed) is critical for enterprise adoption. Users currently have limited visibility into Task execution.

### Current State

**KubeOpenCode side:**
- Prometheus metrics exist (`internal/controller/metrics.go`): `kubeopencode_tasks_total`, `kubeopencode_task_duration_seconds`, `kubeopencode_agent_capacity`, `kubeopencode_agent_queue_length`
- No distributed tracing or OpenTelemetry integration
- Agent Pod labels support PodMonitor/ServiceMonitor discovery

**OpenCode side:**
- OpenCode already supports OpenTelemetry via the AI SDK's `experimental_telemetry` option
- Configuration: `experimental.openTelemetry: true` in `opencode.json`
- Emits spans for every LLM call with metadata (userId, sessionId)
- Depends on `@opentelemetry/api@1.9.0` through the `ai` package
- Requires an external OTel Collector to receive spans (no built-in exporter)

**Kubernetes OTel ecosystem:**
- [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/) supports auto-injection of collector sidecars via pod annotations
- OTel Collector can export to Jaeger, Tempo, OTLP endpoints, etc.
- K8s attributes processor automatically enriches spans with pod/namespace/node metadata

### Problem

1. Users cannot see what happened inside a Task (token usage, tool calls, LLM latency)
2. Failed Tasks are hard to debug: users only see a "Failed" status with no breakdown
3. Cost attribution per namespace/team/agent is not possible
4. CronTask patterns (recurring tasks) have no performance trend visibility

## Decision

Integrate OpenTelemetry in phases, leveraging the existing OpenCode OTel support and the Kubernetes OTel Operator ecosystem.

### Phase 1: Enable OpenCode OTel Spans (Low effort, high value)

**Goal**: Surface LLM-level observability (token usage, latency, model info) for every Task.

**Changes:**

1. **Auto-inject OTel config into OpenCode**: When observability is enabled, the controller injects `experimental.openTelemetry: true` into the OpenCode config (`/tools/opencode.json`) during Pod construction in `pod_builder.go`.

2. **Propagate Task identity via environment variables**: Set `OTEL_RESOURCE_ATTRIBUTES` on the worker container:
   ```
   OTEL_RESOURCE_ATTRIBUTES=kubeopencode.task.name=<task>,kubeopencode.task.namespace=<ns>,kubeopencode.agent.name=<agent>
   ```
   This enriches all spans with Task/Agent identity without modifying OpenCode.

3. **OTel Collector sidecar injection**: Users deploy the [OpenTelemetry Operator](https://opentelemetry.io/docs/kubernetes/operator/) and create an `OpenTelemetryCollector` resource. KubeOpenCode adds the annotation `sidecar.opentelemetry.io/inject: "true"` to Task/Agent Pods when observability is enabled, triggering automatic collector sidecar injection by the OTel Operator.

4. **Configuration**: Add an `observability` field to `KubeOpenCodeConfig` (cluster-scoped):
   ```yaml
   apiVersion: kubeopencode.io/v1alpha1
   kind: KubeOpenCodeConfig
   metadata:
     name: cluster
   spec:
     observability:
       openTelemetry:
         enabled: true
         # Optional: override the OTel Collector sidecar injection annotation
         collectorAnnotation: "sidecar.opentelemetry.io/inject"
         collectorAnnotationValue: "true"
   ```
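The configuration in item 4 could map onto Go types roughly as follows. This is a minimal sketch, not the project's actual API: the type name `OpenTelemetryConfig`, its Go field names, and the helper `collectorAnnotations` are hypothetical, and the defaults simply mirror the values shown in the YAML.

```go
package main

import "fmt"

// OpenTelemetryConfig mirrors spec.observability.openTelemetry from the ADR.
// The Go type and field names are hypothetical, chosen for illustration.
type OpenTelemetryConfig struct {
	Enabled                  bool   `json:"enabled"`
	CollectorAnnotation      string `json:"collectorAnnotation,omitempty"`
	CollectorAnnotationValue string `json:"collectorAnnotationValue,omitempty"`
}

// Defaults taken from the YAML example in the ADR.
const (
	defaultCollectorAnnotation      = "sidecar.opentelemetry.io/inject"
	defaultCollectorAnnotationValue = "true"
)

// collectorAnnotations returns the annotations to add to a Task/Agent Pod.
// It returns nil when observability is not enabled, which keeps OTel opt-in.
func collectorAnnotations(cfg *OpenTelemetryConfig) map[string]string {
	if cfg == nil || !cfg.Enabled {
		return nil
	}
	key, value := cfg.CollectorAnnotation, cfg.CollectorAnnotationValue
	if key == "" {
		key = defaultCollectorAnnotation
	}
	if value == "" {
		value = defaultCollectorAnnotationValue
	}
	return map[string]string{key: value}
}

func main() {
	// Enabled with defaults: the standard sidecar annotation is returned.
	fmt.Println(collectorAnnotations(&OpenTelemetryConfig{Enabled: true}))
	// No observability config at all: no annotations are added.
	fmt.Println(collectorAnnotations(nil))
}
```

Keeping the annotation value overridable is plausibly useful because the OTel Operator documentation describes the value as either `true` or the name of a specific collector; that motivation is an inference, not something the ADR states.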

**What users get**: Every LLM call in every Task produces OTel spans visible in Jaeger/Tempo/Grafana, with Task/Agent/namespace identity attached.
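To make the first two Phase 1 changes concrete, the sketch below shows one way the controller might enable the flag in an existing `opencode.json` and build the `OTEL_RESOURCE_ATTRIBUTES` value. The function names `enableOpenTelemetry` and `resourceAttributes` are hypothetical, and the code assumes the config file is plain JSON without comments.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// enableOpenTelemetry sets experimental.openTelemetry to true in an
// opencode.json document while preserving every other key.
// Assumption: the document is plain JSON (no comments).
func enableOpenTelemetry(raw []byte) ([]byte, error) {
	cfg := map[string]any{}
	if len(raw) > 0 {
		if err := json.Unmarshal(raw, &cfg); err != nil {
			return nil, fmt.Errorf("parse opencode.json: %w", err)
		}
	}
	if cfg == nil { // the input was the JSON literal null
		cfg = map[string]any{}
	}
	experimental, ok := cfg["experimental"].(map[string]any)
	if !ok {
		experimental = map[string]any{}
	}
	experimental["openTelemetry"] = true
	cfg["experimental"] = experimental
	return json.Marshal(cfg)
}

// resourceAttributes builds the OTEL_RESOURCE_ATTRIBUTES value from the ADR.
// Kubernetes object names cannot contain "," or "=", so no escaping is done.
func resourceAttributes(task, namespace, agent string) string {
	return strings.Join([]string{
		"kubeopencode.task.name=" + task,
		"kubeopencode.task.namespace=" + namespace,
		"kubeopencode.agent.name=" + agent,
	}, ",")
}

func main() {
	out, err := enableOpenTelemetry([]byte(`{"model": "example-model"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
	fmt.Println("OTEL_RESOURCE_ATTRIBUTES=" + resourceAttributes("fix-bug-42", "team-a", "coder"))
}
```

Merging into the parsed document, rather than overwriting the file, keeps any user-supplied OpenCode settings intact.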

### Phase 2: Controller-Side Tracing (Medium effort)

**Goal**: End-to-end trace from Task creation to completion.

**Changes:**

1. **Add OTel tracing to controller reconcile loops**: Use the Go OTel SDK (`go.opentelemetry.io/otel`) to create spans for key operations:
   - Task reconcile: `task.reconcile` (root span)
   - Pod creation: `task.pod.create`
   - Context resolution: `task.context.resolve` (git clone, ConfigMap fetch, URL fetch)
   - Status updates: `task.status.update`

2. **Inject trace context into Pod**: Pass the W3C `traceparent` value as an environment variable (`TRACEPARENT`) so OpenCode spans are linked to the controller's trace. This creates a single trace spanning controller → init containers → OpenCode LLM calls.

3. **CronTask trace correlation**: CronTask-triggered Tasks carry a `kubeopencode.crontask.name` attribute, enabling queries like "show me all traces for CronTask X over the past week".

**What users get**: A single trace view showing the full lifecycle: reconcile → git-init → context-init → LLM call 1 → tool call → LLM call 2 → completion.
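The trace-context injection in Phase 2 relies on the W3C Trace Context `traceparent` format: `version-traceid-parentid-flags`, all lowercase hex. The sketch below formats that value by hand so it stays dependency-free; in a real controller the value would more likely come from the Go OTel SDK's trace-context propagator. `EnvVar` is a stand-in for the Kubernetes `corev1.EnvVar` type, and the function names are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

// EnvVar is a stand-in for corev1.EnvVar so the sketch has no dependencies.
type EnvVar struct {
	Name  string
	Value string
}

// formatTraceparent renders a version-00 W3C traceparent value from a
// 16-byte trace ID and an 8-byte span ID. All-zero IDs are invalid.
func formatTraceparent(traceID [16]byte, spanID [8]byte, sampled bool) (string, error) {
	if traceID == ([16]byte{}) || spanID == ([8]byte{}) {
		return "", errors.New("trace ID and span ID must be non-zero")
	}
	flags := "00"
	if sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s",
		hex.EncodeToString(traceID[:]), hex.EncodeToString(spanID[:]), flags), nil
}

// traceparentEnv wraps the value in the TRACEPARENT variable named in the ADR.
func traceparentEnv(traceID [16]byte, spanID [8]byte, sampled bool) (EnvVar, error) {
	value, err := formatTraceparent(traceID, spanID, sampled)
	if err != nil {
		return EnvVar{}, err
	}
	return EnvVar{Name: "TRACEPARENT", Value: value}, nil
}

func main() {
	// Example IDs taken from the W3C Trace Context specification.
	var traceID [16]byte
	var spanID [8]byte
	if _, err := hex.Decode(traceID[:], []byte("4bf92f3577b34da6a3ce929d0e0e4736")); err != nil {
		panic(err)
	}
	if _, err := hex.Decode(spanID[:], []byte("00f067aa0ba902b7")); err != nil {
		panic(err)
	}
	env, err := traceparentEnv(traceID, spanID, true)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s=%s\n", env.Name, env.Value)
}
```

The span ID passed here would be that of the controller span that should become the parent of the OpenCode spans, for example `task.pod.create`. Whether OpenCode reads `TRACEPARENT` from its environment without further wiring is not established by the ADR and would need to be verified.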

### Phase 3: OTel Metrics (Complement Prometheus)

**Goal**: Export LLM-specific metrics via OTel for cost and performance dashboards.

This phase bridges the gap between existing Prometheus metrics (infrastructure-level) and LLM-specific metrics:

| Metric | Source | Purpose |
|--------|--------|---------|
| `kubeopencode.llm.tokens.input` | OTel spans | Cost attribution |
| `kubeopencode.llm.tokens.output` | OTel spans | Cost attribution |
| `kubeopencode.llm.duration` | OTel spans | Latency tracking |
| `kubeopencode.llm.calls` | OTel spans | Usage patterns |
| `kubeopencode.tool.calls` | OTel spans | Tool usage analysis |

These can be derived from spans using the OTel Collector's [Span Metrics Connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector), requiring no code changes, only collector configuration.

**What users get**: Grafana dashboards showing token costs by namespace/agent/user, LLM latency trends, and tool usage breakdown.
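The per-namespace cost view amounts to grouping LLM-call spans by an identity attribute and summing their token counts. The sketch below illustrates only that aggregation. In the proposed design it would happen in the Collector and the dashboard backend, not in KubeOpenCode code, and `LLMSpan`, `Usage`, and `usageByNamespace` are hypothetical names rather than real span attribute keys.

```go
package main

import (
	"fmt"
	"sort"
)

// LLMSpan is a hypothetical, flattened view of one LLM-call span.
type LLMSpan struct {
	Namespace    string
	Agent        string
	InputTokens  int64
	OutputTokens int64
}

// Usage accumulates call and token counts for one namespace.
type Usage struct {
	Calls        int64
	InputTokens  int64
	OutputTokens int64
}

// usageByNamespace groups spans by namespace and sums their token counts.
func usageByNamespace(spans []LLMSpan) map[string]Usage {
	totals := make(map[string]Usage)
	for _, s := range spans {
		u := totals[s.Namespace]
		u.Calls++
		u.InputTokens += s.InputTokens
		u.OutputTokens += s.OutputTokens
		totals[s.Namespace] = u
	}
	return totals
}

func main() {
	spans := []LLMSpan{
		{Namespace: "team-a", Agent: "coder", InputTokens: 1200, OutputTokens: 300},
		{Namespace: "team-a", Agent: "reviewer", InputTokens: 800, OutputTokens: 150},
		{Namespace: "team-b", Agent: "coder", InputTokens: 500, OutputTokens: 90},
	}
	totals := usageByNamespace(spans)

	// Sort namespaces so the report order is stable.
	names := make([]string, 0, len(totals))
	for ns := range totals {
		names = append(names, ns)
	}
	sort.Strings(names)
	for _, ns := range names {
		u := totals[ns]
		fmt.Printf("%s calls=%d input=%d output=%d\n", ns, u.Calls, u.InputTokens, u.OutputTokens)
	}
}
```

Grouping by `Agent` or a user identifier instead of `Namespace` would give the other breakdowns mentioned above.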

### What We Explicitly Do NOT Build

- **Custom tracing UI**: Use existing tools (Jaeger, Grafana Tempo)
- **Built-in OTel Collector**: Users bring their own via the OTel Operator
- **Mandatory dependency**: OTel is opt-in; KubeOpenCode works without it
- **Log collection**: Out of scope; handled by standard K8s logging stacks

## Consequences

### Positive

- **Zero-dependency for basic use**: OTel is purely opt-in via `KubeOpenCodeConfig`
- **Leverages existing ecosystem**: OTel Operator + Collector + Grafana; no custom infrastructure
- **Enterprise value**: Cost visibility, debugging, and audit trails are top enterprise requirements
- **Incremental**: Each phase delivers standalone value; no big-bang rollout
- **OpenCode already supports it**: Phase 1 mostly wires up existing capabilities

### Negative

- **OTel Operator dependency for full value**: Users need to install the OTel Operator separately
- **Experimental flag risk**: The `experimental_telemetry` option that OpenCode relies on could change or be removed upstream
- **Sidecar overhead**: OTel Collector sidecar adds ~50MB memory per Pod (configurable)
- **Phase 2 adds Go dependencies**: `go.opentelemetry.io/otel` and related packages

### Risks and Mitigations

| Risk | Mitigation |
|------|-----------|
| OpenCode removes `experimental_telemetry` | It's moving toward stable in AI SDK; if removed, we adapt the config injection |
| OTel Collector sidecar adds latency | Collector is async (batch exporter); no impact on LLM call latency |
| Too much trace data for long Tasks | Configure sampling in OTel Collector (head or tail sampling) |

## References

- [OpenTelemetry Operator for Kubernetes](https://opentelemetry.io/docs/kubernetes/operator/)
- [AI SDK Telemetry](https://sdk.vercel.ai/docs/ai-sdk-core/telemetry)
- [OTel Collector Span Metrics Connector](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/connector/spanmetricsconnector)
- ADR 0013: Defer Token Usage Tracking to Post-v0.1 (related: OTel spans provide token data)
