feat: add key pool failover metrics to aibridge by ssncferreira · Pull Request #25901 · coder/coder

ssncferreira · 2026-06-01T13:24:44Z

Description

This PR adds Prometheus metrics for aibridge's API-key failover, giving visibility into key pool health and failover behavior per provider.

The following metrics are introduced:

key_pool_state (gauge): number of keys currently in each state (valid, temporary, permanent) per provider, sampled at scrape time.
key_pool_state_transitions_total (counter): key state transitions during failover, labeled by reason (rate_limited, unauthorized, forbidden).
key_pool_exhaustions_total (counter): times a pool ran out of usable keys, labeled by outcome (rate_limited, auth_failed).
key_pool_failover_attempts (histogram): keys attempted before success or exhaustion (per interception for bridged requests, per request for passthrough).

Changes

Moves MarkKeyOnStatus and key-pool error handling onto *keypool.Pool.
Attaches metrics to each provider's key pool at install time, on construction and on provider reload.
Adds a scrape-time state collector and a KeyPools() accessor on the bridge pool to feed it.
Tracks per-request key attempts in the bridged and passthrough failover paths.
Adds test coverage for the new metrics across the keypool unit tests, the bridged intercept failover tests, and the passthrough failover test.

Closes coder/internal#1447
Closes https://linear.app/codercom/issue/AIGOV-198/aibridge-key-failover-observability

Note

Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira

ssncferreira · 2026-06-01T13:24:58Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

ssncferreira · 2026-06-02T16:47:46Z

/coder-agents-review

coder-agents-review · 2026-06-02T16:47:50Z

Chat: Review posted | View chat
Requested: 2026-06-02 16:47 UTC by @ssncferreira
Spend: $54.22 / $100.00

Review history

R1 (2026-06-02): 19 reviewers, 4 Nit, 5 P3, COMMENT. Review

deep-review v0.6.1 | Round 1 | 05b8fb6..b2c1768

Last posted: Round 1, 9 findings (5 P3, 4 Nit), COMMENT. Review

Finding inventory

Finding Inventory

Law analysis

Effective LOC: +1243 -179 (40 files). Head SHA: b2c1768. Verdict: Don't split. Enforcement: N/A.

Findings

#	Sev	Status	Location	Summary	Round	Reviewer	Posted
CRF-1	P3	Open	aibridge/keypool/keypool.go:94	`WithMetrics` writes `p.metrics` without synchronization; safe today but no enforcement	R1	Hisoka P4, Meruem P3, Knov P3	Yes
CRF-2	P3	Open	aibridge/intercept/chatcompletions/blocking_internal_test.go:324	~30-line metric assertion block copy-pasted 12+ times across 7 files	R1	Robin P3, Bisky Nit, Meruem Note	Yes
CRF-3	P3	Open	aibridge/keypool/state_collector_test.go:55	`no_pools` test case passes vacuously; nil expectedStateCounts means no assertions	R1	Bisky	Yes
CRF-4	P3	Open	aibridge/metrics/metrics.go:159	`key_pool_failover_attempts` name misleading for bridged requests; counts all key hand-outs including agentic iterations	R1	Knov	Yes
CRF-5	P3	Open	aibridge/intercept/chatcompletions/blocking.go:91	Redundant comments restating code across 11 locations in 10 files	R1	Gon	Yes
CRF-6	Nit	Open	aibridge/intercept/chatcompletions/blocking.go:277	Return-value doc mostly restates type signature; only `int` meaning is non-obvious	R1	Gon	Yes
CRF-7	Nit	Open	aibridge/intercept/chatcompletions/blocking_internal_test.go:29	Positional `PromCounterHasValue` instead of name-based `promhelp.CounterValue`	R1	Robin	Yes
CRF-8	Nit	Open	aibridge/keypool/keypool.go:109	`New()` doc comment does not mention added `providerName` parameter	R1	Leorio	Yes
CRF-9	Nit	Open	aibridge/intercept/chatcompletions/streaming.go:186	Inconsistent exhaustion-path attempt accumulation across interceptor variants	R1	Knov	Yes
CRF-10	Note	Dropped by orchestrator (justified by architecture; logger dependency on KeyFailoverConfig)	aibridge/provider/provider.go:88	`KeyPool()` returns same value as `KeyFailoverConfig().Pool`	R1	Robin	No
CRF-11	Note	Dropped by orchestrator (enforced by config validation and DB uniqueness)	aibridge/keypool/state_collector.go:37	Collector assumes provider names are unique	R1	Hisoka, Meruem	No

Contested and acknowledged

(none)

Round log

Round 1

Panel. 18 reviewers (Bisky, Chopper, Ging-Go, Gon, Hisoka, Killua, Kite, Knov, Leorio, Mafu-san, Mafuuu, Melody, Meruem, Pariston, Razor, Robin, Ryosuke, Zoro). Netero: no findings. Law: don't split. 4 P3, 4 Nit, 2 Note (dropped). Reviewed against 05b8fb6..b2c1768.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer	Focus
Bisky	tests
Chopper	ops/errors
Churn-guard	change verification
Ging	language modernization
Gon	naming
Hisoka	edge cases
Killua	perf
Kite	change integrity
Knov	contracts
Knuckle	SQL
Kurapika	security
Law	decomposition
Leorio	docs
Luffy	product
Mafu-san	process
Mafuuu	contracts
Melody	dispatch/pairing
Meruem	structural
Nami	frontend
Netero	mechanical checks
Pariston	premise testing
Pen-botter	product gaps
Razor	verification
Robin	duplication
Ryosuke	Go arch
Takumi	concurrency
Zoro	shape

🤖 Managed by Coder Agents.

coder-agents-review

Well-structured PR. The four metrics (state gauge, transition counter, exhaustion counter, failover histogram) cover the key pool lifecycle end-to-end, and the scrape-time state collector correctly handles lazy cooldown recovery. Moving MarkKeyOnStatus from a free function to a *Pool method and giving the pool its provider identity at construction time eliminates a class of mismatch bugs. Test density is 74.6% with both positive and negative metric assertions across all 7 recording paths.

Mafu-san: "Nil-receiver handling on RecordAttempts was built in without being asked, preventing panics in BYOK paths." Pariston: "I tried to build a case against this PR and couldn't."

4 P3, 4 Nit.

🤖 This review was automatically generated with Coder Agents.

coder-agents-review · 2026-06-02T17:25:38Z

+
+// WithMetrics attaches metrics so failover events are recorded. It must be
+// called before the pool serves requests.
+func (p *Pool) WithMetrics(m *metrics.Metrics) {


P3 [CRF-1] WithMetrics writes p.metrics without synchronization. Today the call sites are correct: attachKeyPoolMetrics always runs before providers.Store() publishes the pool. But the method offers no compile-time or runtime enforcement of the ordering constraint. A future caller wiring metrics to a pool that is already serving requests would race on p.metrics reads in MarkKeyOnStatus, recordExhaustion, and RecordAttempts.

The structural fix would be accepting *metrics.Metrics in New() alongside providerName. Both reviewers acknowledge the architecture makes this impractical today (pools are built in provider constructors before metrics exist), so the two-phase init is justified. The temporal safety is real but relies on call-site discipline.

"The comment 'must be called before the pool serves requests' is the only guard." (Hisoka P4, Meruem P3, Knov P3)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

 			}
 			assert.Equal(t, tc.expectedCredentialHint, interceptor.Credential().Hint, "credential hint")
+
+			// A centralized interception records one failover-attempts


P3 [CRF-2] The ~30-line metric assertion block (histogram check + transitions-by-reason loop + exhaustions-by-outcome loop) is copy-pasted into 12+ test function bodies across 7 files. If the assertion logic is wrong, a label changes, or a new metric is added, the fix must be replicated in 12+ spots.

Siblings: streaming_internal_test.go:414, messages/blocking_internal_test.go:315, messages/streaming_internal_test.go:342, responses/blocking_internal_test.go:315, responses/streaming_internal_test.go:326, plus each file's _AgenticLoopFailover variant, plus passthrough_internal_test.go:607.

A small helper in aibridge/internal/testutil taking (t, reg, provider, expectedTransitions, expectedExhaustions) would collapse each copy to one call. The tests are in different internal packages, but internal/testutil is already imported by all of them.

(Robin P3, Bisky Nit, Meruem Note)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

+		pools               func(t *testing.T, clk quartz.Clock) []*keypool.Pool
+		expectedStateCounts []stateCount
+	}{
+		{


P3 [CRF-3] The no_pools test case sets expectedStateCounts: nil. The assertion loop iterates tc.expectedStateCounts, which is nil, so the loop body never executes. The test passes vacuously. If the collector had a bug that emitted spurious metrics for an empty pool list, this test would not catch it.

Fix: assert absence when no metrics are expected:

if tc.expectedStateCounts == nil { for _, family := range gathered { assert.NotEqual(t, "key_pool_state", family.GetName(), "no metrics expected for empty pool list") } return }

(Bisky)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

+		// Pessimistic cardinality: 2 providers, 6 buckets + 3 extra series = up to 18.
+		KeyPoolFailoverAttempts: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
+			Subsystem: "key_pool",
+			Name:      "failover_attempts",


P3 [CRF-4] The metric name key_pool_failover_attempts implies failure-triggered key rotation, but for bridged requests the observation value sums walker.Attempts() across all agentic-loop iterations. A single-key pool processing 5 agentic iterations records 5, which looks like 5 failover retries when 0 failovers occurred.

An operator alerting on histogram_quantile(0.95, key_pool_failover_attempts) > 3 would fire on deep agentic loops with a perfectly healthy key pool. The help text documents "per interception for bridged requests," but the metric name is the contract most operators read first.

Could we consider key_pool_key_attempts or key_pool_keys_tried to make the aggregation semantics visible in the name?

(Knov)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

@@ -88,6 +88,11 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req
 		logger.Warn(ctx, "failed to retrieve last user prompt", slog.Error(err))
 	}



P3 [CRF-5] This comment restates the two lines below it. var totalKeyAttempts int and defer func() { ... RecordAttempts(totalKeyAttempts) }() already communicate "accumulate key attempts and record once on completion." The comment adds nothing a Go developer wouldn't get from the code.

Same text appears in 5 more files: chatcompletions/streaming.go:131, messages/blocking.go:108, messages/streaming.go:156, responses/blocking.go:89, responses/streaming.go:109.

Two additional redundant comments follow the same pattern:

chatcompletions/streaming.go:185 / messages/streaming.go:216 / responses/streaming.go:184: "Count this iteration's key attempts." restates totalKeyAttempts += walker.Attempts().

keypool/failover.go:72: "One walker per request, so record its key failover attempts when the request completes." restates the defer call.

cli/aibridged.go:50: "Report current key pool state per provider at scrape time." restates MustRegister(NewStateCollector(...)).

11 comments across 10 files; all are code transliterations. Delete.

(Gon)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

@@ -267,12 +274,14 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req
 	return nil
 }



Nit [CRF-6] The appended return-value description mostly restates the type signature (*openai.ChatCompletion, int, error). Only the int meaning is non-obvious. Consider trimming to: "The second return counts key attempts for this call."

Siblings: messages/blocking.go:352, responses/blocking.go:155, chatcompletions/blocking.go:296, messages/blocking.go:372, responses/blocking.go:176.

(Gon)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

@@ -22,7 +23,10 @@ import (
 	"github.com/coder/coder/v2/aibridge/internal/testutil"
 	"github.com/coder/coder/v2/aibridge/keypool"
 	"github.com/coder/coder/v2/aibridge/mcp"


Nit [CRF-7] Tests import both codertestutil.PromCounterHasValue (positional label matching at testutil/prometheus.go:72) and promhelp.HistogramValue (name-based matching at coderd/coderdtest/promhelp/metrics.go:67). The positional API matches labels by position without checking names, which is more fragile than promhelp's maps.Equal approach. Using promhelp.CounterValue consistently would unify the assertion style and eliminate the second import.

Same pattern in 6 other test files.

(Robin)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

+	p.metrics.KeyPoolFailoverAttempts.WithLabelValues(p.providerName).Observe(float64(attempts))
 }

 // New creates a pool from the given keys. All keys start in


Nit [CRF-8] The doc comment for New does not mention the added providerName parameter, which controls metric labels (provider on every counter, histogram, and gauge this pool emits) and log fields. A developer calling New("", keys, clk) gets silently mislabeled metrics.

Suggestion:

// New creates a pool from the given keys, labeled by providerName in // metrics and logs. All keys start in the valid state. Returns // ErrNoKeys if keys is empty and ErrDuplicateKey if any key appears // more than once.

(Leorio)

🤖

coder-agents-review · 2026-06-02T17:25:38Z

 			)
 		}

+		// Count this iteration's key attempts.


Nit [CRF-9] When walker.Next() returns a keyPoolErr, this streaming interceptor breaks out of the outer loop without executing totalKeyAttempts += walker.Attempts(). The responses/streaming.go interceptor explicitly adds walker.Attempts() on the exhaustion path (line 148) before returning. Both are correct because walker.Attempts() returns 0 on exhaustion (no key was handed out), but the inconsistency between interceptor variants could confuse a future reader.

Adding the no-op totalKeyAttempts += walker.Attempts() before the break would make all three streaming implementations structurally identical.

Sibling: messages/streaming.go:216.

(Knov)

🤖

github-actions Bot assigned ssncferreira Jun 1, 2026

ssncferreira force-pushed the ssncf/keypool-failover-metrics branch 9 times, most recently from 84de421 to b2c1768 Compare June 2, 2026 16:43

coder-agents-review Bot reviewed Jun 2, 2026

View reviewed changes

ssncferreira force-pushed the ssncf/keypool-failover-metrics branch 2 times, most recently from 56d2a23 to 6b7f7ef Compare June 3, 2026 18:56

ssncferreira changed the base branch from main to graphite-base/25901 June 3, 2026 19:36

ssncferreira force-pushed the ssncf/keypool-failover-metrics branch from 6b7f7ef to 71313a0 Compare June 3, 2026 19:36

ssncferreira changed the base branch from graphite-base/25901 to ssncf/consolidate-failover-tests June 3, 2026 19:36

This was referenced Jun 3, 2026

refactor(aibridge): move shared mock helpers to testutil #25999

Draft

refactor(aibridge): consolidate key failover interceptor tests #26032

Draft

feat: add key pool failover metrics to aibridge

79895ba

ssncferreira force-pushed the ssncf/consolidate-failover-tests branch from 10db29f to e126bf3 Compare June 3, 2026 19:40

ssncferreira force-pushed the ssncf/keypool-failover-metrics branch from 71313a0 to 79895ba Compare June 3, 2026 19:40

		@@ -88,6 +88,11 @@ func (i BlockingInterception) ProcessRequest(w http.ResponseWriter, r http.Req
		logger.Warn(ctx, "failed to retrieve last user prompt", slog.Error(err))
		}

		@@ -267,12 +274,14 @@ func (i BlockingInterception) ProcessRequest(w http.ResponseWriter, r http.Req
		return nil
		}

Conversation

ssncferreira commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Uh oh!

ssncferreira commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ssncferreira commented Jun 2, 2026

Uh oh!

coder-agents-review Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Finding Inventory

Law analysis

Findings

Contested and acknowledged

Round log

Round 1

Uh oh!

coder-agents-review Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

coder-agents-review Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ssncferreira commented Jun 1, 2026 •

edited

Loading

ssncferreira commented Jun 1, 2026 •

edited

Loading

coder-agents-review Bot commented Jun 2, 2026 •

edited

Loading