feat: add Prometheus metric for agent first connection duration #24179

Merged
jscottmiller merged 5 commits into main from eng-724/agent-first-connection-metric
Apr 14, 2026
Contributor

@jscottmiller jscottmiller commented Apr 8, 2026

Summary

Add coderd_agents_first_connection_seconds histogram metric that records the
duration from workspace agent creation to first connection. This fills an
observability gap — provisioner job timings and startup script metrics exist,
but the agent connection phase (which can take several minutes) was not exposed
to Prometheus.

Closes #21282

Changes

  • coderd/prometheusmetrics/prometheusmetrics.go — Define and register a
    HistogramVec in the existing Agents() polling loop. Observe
    first_connected_at - created_at exactly once per agent via a deduplication
    map, pruned each tick to prevent unbounded memory growth.
  • coderd/prometheusmetrics/prometheusmetrics_test.go — Update TestAgents
    to set first_connected_at on the test agent and assert the histogram is
    collected with correct labels, sample count, and sample sum.
  • docs/admin/integrations/prometheus.md, scripts/metricsdocgen/generated_metrics:
    auto-generated documentation updates from make gen.

Metric details

| Property | Value |
|----------|-------|
| Name | coderd_agents_first_connection_seconds |
| Type | histogram |
| Labels | template_name, agent_name, username, workspace_name |
| Buckets | 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h |
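
To make the bucket layout concrete, here is a minimal standalone sketch of how an observed duration maps onto these boundaries. It uses only the standard library rather than the Prometheus client, and the names bucketBounds and smallestBucket are hypothetical. Prometheus histogram buckets are cumulative, so an observation is also counted in every bucket above the one reported here.

```go
package main

import (
	"fmt"
	"sort"
)

// bucketBounds holds the upper bounds in seconds, copied from the Buckets row.
var bucketBounds = []float64{1, 10, 30, 60, 120, 300, 600, 1800, 3600}

// smallestBucket returns the index of the first bucket whose upper bound is
// greater than or equal to seconds. It returns len(bucketBounds) when only the
// implicit +Inf bucket matches.
func smallestBucket(seconds float64) int {
	return sort.SearchFloat64s(bucketBounds, seconds)
}

func main() {
	for _, s := range []float64{0.4, 10, 95, 4000} {
		i := smallestBucket(s)
		if i == len(bucketBounds) {
			fmt.Printf("%gs -> le=+Inf\n", s)
			continue
		}
		fmt.Printf("%gs -> le=%g\n", s, bucketBounds[i])
	}
}
```

A connection that takes longer than an hour lands only in +Inf, so histogram_quantile() cannot resolve anything beyond the 1h bound.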

Example PromQL

# P95 agent connection time by template
histogram_quantile(0.95,
  sum(rate(coderd_agents_first_connection_seconds_bucket[1h])) by (le, template_name)
)

Implementation notes

Design decisions

  • Histogram over gauge: Enables histogram_quantile() for percentile queries.
  • Observe in Agents() polling loop: All required data is already fetched by
    GetWorkspaceAgentsForMetrics() — no new DB queries.
  • Dedup via map[uuid.UUID]struct{}: Prevents re-observing the same agent
    across polling ticks. Pruned each cycle to bound memory.
  • Buckets: Aligned with coderd_provisionerd_workspace_build_timings_seconds
    range (1s–1h).
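
The dedup-and-prune decision above can be sketched in isolation. This is a simplified illustration and not the code from the PR: the tracker, agent, and tick names are hypothetical, and agentID stands in for uuid.UUID (also a 16-byte array) so the example needs no external packages.

```go
package main

import "fmt"

// agentID stands in for uuid.UUID, which is also a 16-byte array.
type agentID [16]byte

// agent is a hypothetical, reduced view of one row from the metrics query.
type agent struct {
	id        agentID
	connected bool // true once first_connected_at is set
}

// tracker remembers which agents have already been observed.
type tracker struct {
	observed map[agentID]struct{}
}

func newTracker() *tracker {
	return &tracker{observed: make(map[agentID]struct{})}
}

// tick handles one polling cycle. It returns the agents whose first connection
// should be observed now, then prunes remembered IDs that are missing from the
// current fetch so the map cannot grow without bound.
func (t *tracker) tick(agents []agent) []agentID {
	var toObserve []agentID
	current := make(map[agentID]struct{}, len(agents))
	for _, a := range agents {
		current[a.id] = struct{}{}
		if !a.connected {
			continue
		}
		if _, seen := t.observed[a.id]; seen {
			continue
		}
		t.observed[a.id] = struct{}{}
		toObserve = append(toObserve, a.id)
	}
	// Deleting from a map while ranging over it is safe in Go.
	for id := range t.observed {
		if _, ok := current[id]; !ok {
			delete(t.observed, id)
		}
	}
	return toObserve
}

func main() {
	tr := newTracker()
	a := agent{id: agentID{1}, connected: true}
	fmt.Println("first tick, to observe:", len(tr.tick([]agent{a})))
	fmt.Println("second tick, to observe:", len(tr.tick([]agent{a})))
	tr.tick(nil) // the agent is no longer returned by the fetch
	fmt.Println("remembered after prune:", len(tr.observed))
}
```

One consequence of pruning this way: if an agent drops out of the fetch and later reappears, it would be observed a second time.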

Overhead at scale (100k active workspaces)

The deduplication map (observedFirstConnection) and per-tick pruning map
(currentAgentIDs) are both map[[16]byte]struct{}. At 100k agents:

  • Memory: ~2.25 MB persistent + ~2.25 MB transient per tick = ~4.5 MB peak.
  • CPU: ~25 ms of map operations per tick (one tick per minute) = <0.05% of one core.

Both are negligible relative to the existing cost of the Agents() loop (the DB
query, per-agent GetWorkspaceAppsByAgentID calls, and coordinator node lookups
dominate).
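
The two headline figures can be checked with simple arithmetic. The sketch below only reproduces the numbers quoted above; it does not measure anything, it assumes decimal megabytes, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// coreFraction returns the share of one core used when work lasting busy runs
// once per interval.
func coreFraction(busy, interval time.Duration) float64 {
	return float64(busy) / float64(interval)
}

// bytesPerEntry divides a total map footprint by its entry count.
func bytesPerEntry(totalBytes, entries float64) float64 {
	return totalBytes / entries
}

func main() {
	cpu := coreFraction(25*time.Millisecond, time.Minute)
	fmt.Printf("CPU: %.4f%% of one core\n", cpu*100)

	// Assumes 1 MB = 1e6 bytes; the description does not say which unit it uses.
	fmt.Printf("Memory: %.1f bytes per map entry\n", bytesPerEntry(2.25e6, 100_000))
}
```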

🤖 Generated by Coder Agents

@github-actions github-actions bot added the community label Apr 8, 2026

github-actions bot commented Apr 8, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@jscottmiller jscottmiller changed the title from "feat(coderd/prometheusmetrics): add agent first connection duration metric" to "feat: add Prometheus metric for agent first connection duration" Apr 8, 2026
@matifali matifali removed the community label Apr 9, 2026
…etric

Add coderd_agents_first_connection_seconds histogram metric that
records the duration from agent creation to first connection. This
fills an observability gap between provisioner job timings and agent
startup script metrics, enabling accurate P90/P95 workspace build
time calculations.

The metric is observed in the existing Agents() polling loop using
data already fetched by GetWorkspaceAgentsForMetrics(). Each agent's
connection time is recorded exactly once via a deduplication map that
is pruned each cycle to prevent unbounded memory growth.

Labels: template_name, agent_name, username, workspace_name.
Buckets: 1s, 10s, 30s, 1m, 2m, 5m, 10m, 30m, 1h.

Closes coder/internal#724
@jscottmiller jscottmiller force-pushed the eng-724/agent-first-connection-metric branch from 5fc1bf6 to 659819d Compare April 10, 2026 15:52
Name: "first_connection_seconds",
Help: "Duration from agent creation to first connection in seconds.",
Buckets: []float64{1, 10, 30, 60, 120, 300, 600, 1800, 3600},
}, []string{agentmetrics.LabelTemplateName, agentmetrics.LabelAgentName, agentmetrics.LabelUsername, agentmetrics.LabelWorkspaceName})
Contributor Author

The cardinality of LabelWorkspaceName was called out as a potential issue. I agree, though we already have metrics that use workspace name as a dimension so I'm inclined to keep it in and monitor.

Contributor

Agreed. It seems reasonable that a consumer of these metrics would want to know what workspaces are being affected if the connection times aren't uniform.
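
For a rough sense of the cardinality being discussed, here is a back-of-the-envelope sketch. It assumes a classic Prometheus histogram, which exposes one _bucket series per explicit bound plus +Inf, along with _sum and _count, and it assumes every agent produces a distinct label combination. The function name is hypothetical.

```go
package main

import "fmt"

// seriesPerLabelSet counts the time series one label combination produces:
// one _bucket series per explicit bound, one for +Inf, plus _sum and _count.
func seriesPerLabelSet(explicitBuckets int) int {
	return explicitBuckets + 3
}

func main() {
	const explicitBuckets = 9 // 1s through 1h
	for _, labelSets := range []int{1_000, 100_000} {
		total := labelSets * seriesPerLabelSet(explicitBuckets)
		fmt.Printf("%d label sets -> %d series\n", labelSets, total)
	}
}
```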

@jscottmiller jscottmiller marked this pull request as ready for review April 10, 2026 17:36
// Record first connection duration exactly once per agent.
if agent.WorkspaceAgent.FirstConnectedAt.Valid {
	if _, alreadyObserved := observedFirstConnection[agent.WorkspaceAgent.ID]; !alreadyObserved {
		duration := agent.WorkspaceAgent.FirstConnectedAt.Time.Sub(agent.WorkspaceAgent.CreatedAt).Seconds()
Contributor

I have recently seen (and fixed) metrics reporting negative values in tests when the clock skews or jumps for whatever reason between the creation time and the first connection time. After a PostgreSQL round-trip, timestamps lose their monotonic clock component, making the subtraction susceptible to wall clock adjustments. If these round-trip through the DB, it might be a good idea to be ready for a negative duration here.
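
A small sketch of the failure mode described here, using only the standard library. Values built with time.Date carry no monotonic reading, which makes them behave like timestamps read back from a database. The helper names and the specific times are made up for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// at is a hypothetical helper that builds a wall-clock-only timestamp at the
// given second, with no monotonic reading attached.
func at(sec int) time.Time {
	return time.Date(2026, 4, 8, 12, 0, sec, 0, time.UTC)
}

// elapsedSeconds subtracts two timestamps. Without monotonic readings, Sub
// falls back to the wall clock values alone.
func elapsedSeconds(createdAt, firstConnectedAt time.Time) float64 {
	return firstConnectedAt.Sub(createdAt).Seconds()
}

func main() {
	// Suppose the host that recorded the connection has a wall clock running
	// behind the host that recorded creation, so the later event carries the
	// earlier timestamp.
	created := at(5)
	connected := at(3)
	fmt.Println(elapsedSeconds(created, connected)) // negative duration
}
```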

Contributor Author

Right - this implementation assumes that the clocks aren't too skewed between the coderd that set the workspace build's CreatedAt and the coderd process that set FirstConnectedAt. It looks like there is a similar assumption in our workspace_build_duration_seconds metrics. I'm going to update this call site to log a warning and discard the result if we encounter a negative value. That way, we can at least see if we're getting skew in one direction. We could also filter in the other direction, though I'm less inclined to do so since the build times can get quite long and I don't want to bias the data with a poor assumption.
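
The guard proposed here might look roughly like the following. This is a sketch under assumptions, not the merged code: it uses the standard library log package in place of the structured logger at the real call site, and firstConnectionSeconds and at are hypothetical names.

```go
package main

import (
	"log"
	"time"
)

// at is a hypothetical helper that builds a timestamp at the given second.
func at(sec int) time.Time {
	return time.Date(2026, 4, 8, 12, 0, sec, 0, time.UTC)
}

// firstConnectionSeconds returns the duration to observe and true. When the
// duration is negative it logs a warning and returns false so the caller can
// skip the histogram observation.
func firstConnectionSeconds(createdAt, firstConnectedAt time.Time) (float64, bool) {
	d := firstConnectedAt.Sub(createdAt).Seconds()
	if d < 0 {
		log.Printf("negative agent first connection duration (possible clock skew); dropping sample: %.3fs", d)
		return 0, false
	}
	return d, true
}

func main() {
	if d, ok := firstConnectionSeconds(at(0), at(42)); ok {
		log.Printf("would observe %.0fs", d)
	}
	// Skewed input: logs the warning and reports that the sample was dropped.
	if _, ok := firstConnectionSeconds(at(5), at(3)); !ok {
		log.Print("sample skipped")
	}
}
```

Only the negative side is filtered, matching the reasoning above that long but legitimate connection times should not be discarded.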


// Prune observed agents that are no longer in the
// current fetch to prevent unbounded memory growth.
{
Contributor

Is this scoping needed?

Contributor Author

It isn't. Normally I wouldn't do this (and in fact this was added by the coding agent), but given that the containing function is already so large I think that it's a good choice to isolate this behavior a bit.

Namespace: "coderd",
Subsystem: "agents",
Name: "first_connection_seconds",
Help: "Duration from agent creation to first connection in seconds.",
Contributor

Maybe specify what the connection is being made to. I assume this is the control plane.

Negative durations indicate clock skew between the coderd replica
that created the agent and the one that recorded first_connected_at.
Log a warning and skip the histogram observation to avoid polluting
the metric.
agent.WorkspaceName,
).Observe(duration)
if duration < 0 {
logger.Warn(ctx, "negative agent first connection duration, possible clock skew",
Contributor

micro nit: we could mention here what's happening as a result
negative agent first connection duration (possible clock skew); dropping sample

- Clarify metric help text to specify connection is to the control plane.
- Update warn log to mention the sample is being dropped.
@jscottmiller jscottmiller merged commit 20b953a into main Apr 14, 2026
28 checks passed
@jscottmiller jscottmiller deleted the eng-724/agent-first-connection-metric branch April 14, 2026 17:00
@github-actions github-actions bot locked and limited conversation to collaborators Apr 14, 2026
Development

Successfully merging this pull request may close these issues.

Add Prometheus metric for agent first connection duration

3 participants