feat: add Prometheus metric for agent first connection duration #24179
jscottmiller merged 5 commits into main
Conversation
All contributors have signed the CLA ✍️ ✅
…etric

Add coderd_agents_first_connection_seconds histogram metric that records the duration from agent creation to first connection. This fills an observability gap between provisioner job timings and agent startup script metrics, enabling accurate P90/P95 workspace build time calculations.

The metric is observed in the existing Agents() polling loop using data already fetched by GetWorkspaceAgentsForMetrics(). Each agent's connection time is recorded exactly once via a deduplication map that is pruned each cycle to prevent unbounded memory growth.

Labels: template_name, agent_name, username, workspace_name.
Buckets: 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h.

Closes coder/internal#724
Force-pushed 5fc1bf6 to 659819d
	Name: "first_connection_seconds",
	Help: "Duration from agent creation to first connection in seconds.",
	Buckets: []float64{1, 10, 30, 60, 120, 300, 600, 1800, 3600},
}, []string{agentmetrics.LabelTemplateName, agentmetrics.LabelAgentName, agentmetrics.LabelUsername, agentmetrics.LabelWorkspaceName})
The cardinality of LabelWorkspaceName was called out as a potential issue. I agree, though we already have metrics that use workspace name as a dimension so I'm inclined to keep it in and monitor.
Agreed. It seems reasonable that a consumer of these metrics would want to know what workspaces are being affected if the connection times aren't uniform.
// Record first connection duration exactly once per agent.
if agent.WorkspaceAgent.FirstConnectedAt.Valid {
	if _, alreadyObserved := observedFirstConnection[agent.WorkspaceAgent.ID]; !alreadyObserved {
		duration := agent.WorkspaceAgent.FirstConnectedAt.Time.Sub(agent.WorkspaceAgent.CreatedAt).Seconds()
I have recently seen (and fixed) metrics reporting negative values in tests when the clock skews or jumps between the first connection time and the creation time. After a PostgreSQL round-trip, timestamps lose their monotonic clock component, making the subtraction susceptible to wall clock adjustments. If these round-trip through the DB, it might be a good idea to be ready for a negative duration here.
Right - this implementation assumes that the clocks aren't too skewed between the coderd that set the workspace build's CreatedAt and the coderd process that set FirstConnectedAt. It looks like there is a similar assumption in our workspace_build_duration_seconds metrics. I'm going to update this call site to log a warning and discard the result if we encounter a negative value. That way, we can at least see if we're getting skew in one direction. We could also filter in the other direction, though I'm less inclined to do so since build times can get quite long and I don't want to bias the data with a poor assumption.
// Prune observed agents that are no longer in the
// current fetch to prevent unbounded memory growth.
{
It isn't. Normally I wouldn't do this (and in fact this was added by the coding agent), but given that the containing function is already so large I think that it's a good choice to isolate this behavior a bit.
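A minimal sketch of the pruning step being discussed, keyed by string for brevity (the real map is keyed by agent UUID, `map[uuid.UUID]struct{}`):

```go
package main

import "fmt"

// pruneObserved removes entries from observed that are absent from the
// current fetch, bounding the deduplication map's size to the set of agents
// currently returned by the metrics query.
func pruneObserved(observed, current map[string]struct{}) {
	for id := range observed {
		if _, ok := current[id]; !ok {
			delete(observed, id)
		}
	}
}

func main() {
	observed := map[string]struct{}{"a": {}, "b": {}, "c": {}}
	current := map[string]struct{}{"a": {}, "c": {}} // "b" was deleted
	pruneObserved(observed, current)
	fmt.Println(len(observed)) // 2
}
```

Deleting from a Go map while ranging over it is safe, which is what makes this single-pass prune work without building an intermediate list.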
Namespace: "coderd",
Subsystem: "agents",
Name: "first_connection_seconds",
Help: "Duration from agent creation to first connection in seconds.",
Maybe specify what the connection is being made to. I assume this is the control plane.
Negative durations indicate clock skew between the coderd replica that created the agent and the one that recorded first_connected_at. Log a warning and skip the histogram observation to avoid polluting the metric.
		agent.WorkspaceName,
	).Observe(duration)
if duration < 0 {
	logger.Warn(ctx, "negative agent first connection duration, possible clock skew",
micro nit: we could mention here what's happening as a result
negative agent first connection duration (possible clock skew); dropping sample
- Clarify metric help text to specify the connection is to the control plane.
- Update warn log to mention the sample is being dropped.
Summary
Add coderd_agents_first_connection_seconds, a histogram metric that records the duration from workspace agent creation to first connection. This fills an observability gap — provisioner job timings and startup script metrics exist, but the agent connection phase (which can take several minutes) was not exposed to Prometheus.
Closes #21282
Changes
coderd/prometheusmetrics/prometheusmetrics.go — Define and register a HistogramVec in the existing Agents() polling loop. Observe first_connected_at - created_at exactly once per agent via a deduplication map, pruned each tick to prevent unbounded memory growth.
coderd/prometheusmetrics/prometheusmetrics_test.go — Update TestAgents to set first_connected_at on the test agent and assert the histogram is collected with correct labels, sample count, and sample sum.
docs/admin/integrations/prometheus.md, scripts/metricsdocgen/generated_metrics — Auto-generated documentation updates from make gen.

Metric details
coderd_agents_first_connection_seconds
Labels: template_name, agent_name, username, workspace_name

Example PromQL
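The original PromQL sample did not survive extraction; a typical query over this histogram (illustrative only, matching the P90/P95 goal stated in the summary) would look like:

```promql
# P90 agent first-connection time over the last 5 minutes,
# aggregated across all label dimensions.
histogram_quantile(
  0.90,
  sum by (le) (rate(coderd_agents_first_connection_seconds_bucket[5m]))
)
```

Dropping `sum by (le)` in favor of `sum by (le, template_name)` would break the percentile out per template.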
Implementation notes
Design decisions
- histogram_quantile() for percentile queries.
- Agents() polling loop: all required data is already fetched by GetWorkspaceAgentsForMetrics() — no new DB queries.
- map[uuid.UUID]struct{}: prevents re-observing the same agent across polling ticks. Pruned each cycle to bound memory.
- Bucket range matches coderd_provisionerd_workspace_build_timings_seconds (1s–1h).
Overhead at scale (100k active workspaces)

The deduplication map (observedFirstConnection) and per-tick pruning map (currentAgentIDs) are both map[[16]byte]struct{}. At 100k agents, both are negligible relative to the existing cost of the Agents() loop (the DB query, per-agent GetWorkspaceAppsByAgentID calls, and coordinator node lookups dominate).