feat: add Prometheus metric for agent first connection duration #24179
jscottmiller merged 5 commits into main
Conversation
All contributors have signed the CLA ✍️ ✅
…etric

Add coderd_agents_first_connection_seconds histogram metric that records the duration from agent creation to first connection. This fills an observability gap between provisioner job timings and agent startup script metrics, enabling accurate P90/P95 workspace build time calculations.

The metric is observed in the existing Agents() polling loop using data already fetched by GetWorkspaceAgentsForMetrics(). Each agent's connection time is recorded exactly once via a deduplication map that is pruned each cycle to prevent unbounded memory growth.

Labels: template_name, agent_name, username, workspace_name.
Buckets: 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h.

Closes coder/internal#724
Force-pushed 5fc1bf6 to 659819d
	Name: "first_connection_seconds",
	Help: "Duration from agent creation to first connection in seconds.",
	Buckets: []float64{1, 10, 30, 60, 120, 300, 600, 1800, 3600},
}, []string{agentmetrics.LabelTemplateName, agentmetrics.LabelAgentName, agentmetrics.LabelUsername, agentmetrics.LabelWorkspaceName})
The cardinality of LabelWorkspaceName was called out as a potential issue. I agree, though we already have metrics that use workspace name as a dimension so I'm inclined to keep it in and monitor.
Agreed. It seems reasonable that a consumer of these metrics would want to know what workspaces are being affected if the connection times aren't uniform.
// Record first connection duration exactly once per agent.
if agent.WorkspaceAgent.FirstConnectedAt.Valid {
	if _, alreadyObserved := observedFirstConnection[agent.WorkspaceAgent.ID]; !alreadyObserved {
		duration := agent.WorkspaceAgent.FirstConnectedAt.Time.Sub(agent.WorkspaceAgent.CreatedAt).Seconds()
I have recently seen (and fixed) metrics reporting negative values in tests when the clock skews or jumps between the first connection time and the creation time. After a PostgreSQL round-trip, timestamps lose their monotonic clock component, making the subtraction susceptible to wall clock adjustments. If these round-trip through the DB, it might be a good idea to be ready for a negative duration here.
Right - this implementation assumes that the clocks aren't too skewed between the coderd that set the workspace build's CreatedAt and the coderd process that set FirstConnectedAt. It looks like there is a similar assumption in our workspace_build_duration_seconds metrics. I'm going to update this call site to log a warning and discard the result if we encounter a negative value. That way, we can at least see if we're getting skew in one direction. We could also filter in the other direction, though I'm less inclined to do so since build times can get quite long and I don't want to bias the data with a poor assumption.
// Prune observed agents that are no longer in the
// current fetch to prevent unbounded memory growth.
{
It isn't. Normally I wouldn't do this (and in fact this was added by the coding agent), but given that the containing function is already so large I think that it's a good choice to isolate this behavior a bit.
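A minimal sketch of the pruning step being discussed, keyed by string for brevity (the real map is keyed by agent UUID, `map[uuid.UUID]struct{}`):

```go
package main

import "fmt"

// pruneObserved removes entries from observed that are absent from the
// current fetch, bounding the deduplication map's size to the set of agents
// currently returned by the metrics query.
func pruneObserved(observed, current map[string]struct{}) {
	for id := range observed {
		if _, ok := current[id]; !ok {
			delete(observed, id)
		}
	}
}

func main() {
	observed := map[string]struct{}{"a": {}, "b": {}, "c": {}}
	current := map[string]struct{}{"a": {}, "c": {}} // "b" was deleted
	pruneObserved(observed, current)
	fmt.Println(len(observed)) // 2
}
```

Deleting from a Go map while ranging over it is safe, which is what makes this single-pass prune work without building an intermediate list.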
Namespace: "coderd",
Subsystem: "agents",
Name: "first_connection_seconds",
Help: "Duration from agent creation to first connection in seconds.",
Maybe specify what the connection is being made to. I assume this is the control plane.
Negative durations indicate clock skew between the coderd replica that created the agent and the one that recorded first_connected_at. Log a warning and skip the histogram observation to avoid polluting the metric.
		agent.WorkspaceName,
	).Observe(duration)
if duration < 0 {
	logger.Warn(ctx, "negative agent first connection duration, possible clock skew",
micro nit: we could mention here what's happening as a result
negative agent first connection duration (possible clock skew); dropping sample
- Clarify metric help text to specify the connection is to the control plane.
- Update warn log to mention the sample is being dropped.
Summary
Add coderd_agents_first_connection_seconds, a histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap — provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus.
Closes #21282
Changes
coderd/prometheusmetrics/prometheusmetrics.go — Define and register a HistogramVec in the existing Agents() polling loop. Observe first_connected_at - created_at exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth.
coderd/prometheusmetrics/prometheusmetrics_test.go — Update TestAgents to set first_connected_at on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum.
docs/admin/integrations/prometheus.md, scripts/metricsdocgen/generated_metrics — Auto-generated documentation updates from make gen.

Metric details
coderd_agents_first_connection_seconds
Labels: template_name, agent_name, username, workspace_name

Example PromQL
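The original PromQL sample did not survive extraction; a typical query over this histogram (illustrative only, matching the P90/P95 goal stated in the summary) would look like:

```promql
# P90 agent first-connection time over the last 5 minutes,
# aggregated across all label dimensions.
histogram_quantile(
  0.90,
  sum by (le) (rate(coderd_agents_first_connection_seconds_bucket[5m]))
)
```

Dropping `sum by (le)` in favor of `sum by (le, template_name)` would break the percentile out per template.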
Implementation notes
Design decisions
- histogram_quantile() for percentile queries.
- Agents() polling loop: all required data is already fetched by GetWorkspaceAgentsForMetrics() — no new DB queries.
- map[uuid.UUID]struct{}: prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory.
- Bucket range matches coderd_provisionerd_workspace_build_timings_seconds (1s–1h).
Overhead at scale (100k active workspaces)

The deduplication map (observedFirstConnection) and per-tick pruning map (currentAgentIDs) are both map[[16]byte]struct{}. At 100k agents, both are negligible relative to the existing cost of the Agents() loop (the DB query, per-agent GetWorkspaceAppsByAgentID calls, and coordinator node lookups dominate).