feat: add key pool failover metrics to aibridge#25901
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
84de421 to
b2c1768
Compare
|
/coder-agents-review |
|
Chat: Review posted | View chat Review history
deep-review v0.6.1 | Round 1 | Last posted: Round 1, 9 findings (5 P3, 4 Nit), COMMENT. Review Finding inventoryFinding InventoryLaw analysisEffective LOC: +1243 -179 (40 files). Head SHA: b2c1768. Verdict: Don't split. Enforcement: N/A. Findings
Contested and acknowledged(none) Round logRound 1Panel. 18 reviewers (Bisky, Chopper, Ging-Go, Gon, Hisoka, Killua, Kite, Knov, Leorio, Mafu-san, Mafuuu, Melody, Meruem, Pariston, Razor, Robin, Ryosuke, Zoro). Netero: no findings. Law: don't split. 4 P3, 4 Nit, 2 Note (dropped). Reviewed against 05b8fb6..b2c1768. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
Well-structured PR. The four metrics (state gauge, transition counter, exhaustion counter, failover histogram) cover the key pool lifecycle end-to-end, and the scrape-time state collector correctly handles lazy cooldown recovery. Moving MarkKeyOnStatus from a free function to a *Pool method and giving the pool its provider identity at construction time eliminates a class of mismatch bugs. Test density is 74.6% with both positive and negative metric assertions across all 7 recording paths.
Mafu-san: "Nil-receiver handling on RecordAttempts was built in without being asked, preventing panics in BYOK paths." Pariston: "I tried to build a case against this PR and couldn't."
4 P3, 4 Nit.
🤖 This review was automatically generated with Coder Agents.
|
|
||
| // WithMetrics attaches metrics so failover events are recorded. It must be | ||
| // called before the pool serves requests. | ||
| func (p *Pool) WithMetrics(m *metrics.Metrics) { |
There was a problem hiding this comment.
P3 [CRF-1] WithMetrics writes p.metrics without synchronization. Today the call sites are correct: attachKeyPoolMetrics always runs before providers.Store() publishes the pool. But the method offers no compile-time or runtime enforcement of the ordering constraint. A future caller wiring metrics to a pool that is already serving requests would race on p.metrics reads in MarkKeyOnStatus, recordExhaustion, and RecordAttempts.
The structural fix would be accepting *metrics.Metrics in New() alongside providerName. Both reviewers acknowledge the architecture makes this impractical today (pools are built in provider constructors before metrics exist), so the two-phase init is justified. The temporal safety is real but relies on call-site discipline.
"The comment 'must be called before the pool serves requests' is the only guard." (Hisoka P4, Meruem P3, Knov P3)
🤖
| } | ||
| assert.Equal(t, tc.expectedCredentialHint, interceptor.Credential().Hint, "credential hint") | ||
|
|
||
| // A centralized interception records one failover-attempts |
There was a problem hiding this comment.
P3 [CRF-2] The ~30-line metric assertion block (histogram check + transitions-by-reason loop + exhaustions-by-outcome loop) is copy-pasted into 12+ test function bodies across 7 files. If the assertion logic is wrong, a label changes, or a new metric is added, the fix must be replicated in 12+ spots.
Siblings: streaming_internal_test.go:414, messages/blocking_internal_test.go:315, messages/streaming_internal_test.go:342, responses/blocking_internal_test.go:315, responses/streaming_internal_test.go:326, plus each file's _AgenticLoopFailover variant, plus passthrough_internal_test.go:607.
A small helper in aibridge/internal/testutil taking (t, reg, provider, expectedTransitions, expectedExhaustions) would collapse each copy to one call. The tests are in different internal packages, but internal/testutil is already imported by all of them.
(Robin P3, Bisky Nit, Meruem Note)
🤖
| pools func(t *testing.T, clk quartz.Clock) []*keypool.Pool | ||
| expectedStateCounts []stateCount | ||
| }{ | ||
| { |
There was a problem hiding this comment.
P3 [CRF-3] The no_pools test case sets expectedStateCounts: nil. The assertion loop iterates tc.expectedStateCounts, which is nil, so the loop body never executes. The test passes vacuously. If the collector had a bug that emitted spurious metrics for an empty pool list, this test would not catch it.
Fix: assert absence when no metrics are expected:
if tc.expectedStateCounts == nil {
for _, family := range gathered {
assert.NotEqual(t, "key_pool_state", family.GetName(), "no metrics expected for empty pool list")
}
return
}(Bisky)
🤖
| // Pessimistic cardinality: 2 providers, 6 buckets + 3 extra series = up to 18. | ||
| KeyPoolFailoverAttempts: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{ | ||
| Subsystem: "key_pool", | ||
| Name: "failover_attempts", |
There was a problem hiding this comment.
P3 [CRF-4] The metric name key_pool_failover_attempts implies failure-triggered key rotation, but for bridged requests the observation value sums walker.Attempts() across all agentic-loop iterations. A single-key pool processing 5 agentic iterations records 5, which looks like 5 failover retries when 0 failovers occurred.
An operator alerting on histogram_quantile(0.95, key_pool_failover_attempts) > 3 would fire on deep agentic loops with a perfectly healthy key pool. The help text documents "per interception for bridged requests," but the metric name is the contract most operators read first.
Could we consider key_pool_key_attempts or key_pool_keys_tried to make the aggregation semantics visible in the name?
(Knov)
🤖
| @@ -88,6 +88,11 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req | |||
| logger.Warn(ctx, "failed to retrieve last user prompt", slog.Error(err)) | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
P3 [CRF-5] This comment restates the two lines below it. var totalKeyAttempts int and defer func() { ... RecordAttempts(totalKeyAttempts) }() already communicate "accumulate key attempts and record once on completion." The comment adds nothing a Go developer wouldn't get from the code.
Same text appears in 5 more files: chatcompletions/streaming.go:131, messages/blocking.go:108, messages/streaming.go:156, responses/blocking.go:89, responses/streaming.go:109.
Two additional redundant comments follow the same pattern:
chatcompletions/streaming.go:185/messages/streaming.go:216/responses/streaming.go:184: "Count this iteration's key attempts." restatestotalKeyAttempts += walker.Attempts().keypool/failover.go:72: "One walker per request, so record its key failover attempts when the request completes." restates the defer call.cli/aibridged.go:50: "Report current key pool state per provider at scrape time." restatesMustRegister(NewStateCollector(...)).
11 comments across 10 files; all are code transliterations. Delete.
(Gon)
🤖
| @@ -267,12 +274,14 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req | |||
| return nil | |||
| } | |||
|
|
|||
There was a problem hiding this comment.
Nit [CRF-6] The appended return-value description mostly restates the type signature (*openai.ChatCompletion, int, error). Only the int meaning is non-obvious. Consider trimming to: "The second return counts key attempts for this call."
Siblings: messages/blocking.go:352, responses/blocking.go:155, chatcompletions/blocking.go:296, messages/blocking.go:372, responses/blocking.go:176.
(Gon)
🤖
| @@ -22,7 +23,10 @@ import ( | |||
| "github.com/coder/coder/v2/aibridge/internal/testutil" | |||
| "github.com/coder/coder/v2/aibridge/keypool" | |||
| "github.com/coder/coder/v2/aibridge/mcp" | |||
There was a problem hiding this comment.
Nit [CRF-7] Tests import both codertestutil.PromCounterHasValue (positional label matching at testutil/prometheus.go:72) and promhelp.HistogramValue (name-based matching at coderd/coderdtest/promhelp/metrics.go:67). The positional API matches labels by position without checking names, which is more fragile than promhelp's maps.Equal approach. Using promhelp.CounterValue consistently would unify the assertion style and eliminate the second import.
Same pattern in 6 other test files.
(Robin)
🤖
| p.metrics.KeyPoolFailoverAttempts.WithLabelValues(p.providerName).Observe(float64(attempts)) | ||
| } | ||
|
|
||
| // New creates a pool from the given keys. All keys start in |
There was a problem hiding this comment.
Nit [CRF-8] The doc comment for New does not mention the added providerName parameter, which controls metric labels (provider on every counter, histogram, and gauge this pool emits) and log fields. A developer calling New("", keys, clk) gets silently mislabeled metrics.
Suggestion:
// New creates a pool from the given keys, labeled by providerName in
// metrics and logs. All keys start in the valid state. Returns
// ErrNoKeys if keys is empty and ErrDuplicateKey if any key appears
// more than once.(Leorio)
🤖
| ) | ||
| } | ||
|
|
||
| // Count this iteration's key attempts. |
There was a problem hiding this comment.
Nit [CRF-9] When walker.Next() returns a keyPoolErr, this streaming interceptor breaks out of the outer loop without executing totalKeyAttempts += walker.Attempts(). The responses/streaming.go interceptor explicitly adds walker.Attempts() on the exhaustion path (line 148) before returning. Both are correct because walker.Attempts() returns 0 on exhaustion (no key was handed out), but the inconsistency between interceptor variants could confuse a future reader.
Adding the no-op totalKeyAttempts += walker.Attempts() before the break would make all three streaming implementations structurally identical.
Sibling: messages/streaming.go:216.
(Knov)
🤖
56d2a23 to
6b7f7ef
Compare
6b7f7ef to
71313a0
Compare
10db29f to
e126bf3
Compare
71313a0 to
79895ba
Compare

Description
This PR adds Prometheus metrics for aibridge's API-key failover, giving visibility into key pool health and failover behavior per provider.
The following metrics are introduced:
key_pool_state(gauge): number of keys currently in each state (valid,temporary,permanent) per provider, sampled at scrape time.key_pool_state_transitions_total(counter): key state transitions during failover, labeled byreason(rate_limited,unauthorized,forbidden).key_pool_exhaustions_total(counter): times a pool ran out of usable keys, labeled byoutcome(rate_limited,auth_failed).key_pool_failover_attempts(histogram): keys attempted before success or exhaustion (per interception for bridged requests, per request for passthrough).Changes
MarkKeyOnStatusand key-pool error handling onto*keypool.Pool.KeyPools()accessor on the bridge pool to feed it.Closes coder/internal#1447
Closes https://linear.app/codercom/issue/AIGOV-198/aibridge-key-failover-observability
Note
Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira