Skip to content

feat: add key pool failover metrics to aibridge#25901

Draft
ssncferreira wants to merge 1 commit into
ssncf/consolidate-failover-testsfrom
ssncf/keypool-failover-metrics
Draft

feat: add key pool failover metrics to aibridge#25901
ssncferreira wants to merge 1 commit into
ssncf/consolidate-failover-testsfrom
ssncf/keypool-failover-metrics

Conversation

@ssncferreira
Copy link
Copy Markdown
Contributor

@ssncferreira ssncferreira commented Jun 1, 2026

Description

This PR adds Prometheus metrics for aibridge's API-key failover, giving visibility into key pool health and failover behavior per provider.

The following metrics are introduced:

  • key_pool_state (gauge): number of keys currently in each state (valid, temporary, permanent) per provider, sampled at scrape time.
  • key_pool_state_transitions_total (counter): key state transitions during failover, labeled by reason (rate_limited, unauthorized, forbidden).
  • key_pool_exhaustions_total (counter): times a pool ran out of usable keys, labeled by outcome (rate_limited, auth_failed).
  • key_pool_failover_attempts (histogram): keys attempted before success or exhaustion (per interception for bridged requests, per request for passthrough).

Changes

  • Moves MarkKeyOnStatus and key-pool error handling onto *keypool.Pool.
  • Attaches metrics to each provider's key pool at install time, on construction and on provider reload.
  • Adds a scrape-time state collector and a KeyPools() accessor on the bridge pool to feed it.
  • Tracks per-request key attempts in the bridged and passthrough failover paths.
  • Adds test coverage for the new metrics across the keypool unit tests, the bridged intercept failover tests, and the passthrough failover test.

Closes coder/internal#1447
Closes https://linear.app/codercom/issue/AIGOV-198/aibridge-key-failover-observability

Note

Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira

Copy link
Copy Markdown
Contributor Author

ssncferreira commented Jun 1, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@ssncferreira ssncferreira force-pushed the ssncf/keypool-failover-metrics branch 9 times, most recently from 84de421 to b2c1768 Compare June 2, 2026 16:43
@ssncferreira
Copy link
Copy Markdown
Contributor Author

/coder-agents-review

@coder-agents-review
Copy link
Copy Markdown
Contributor

coder-agents-review Bot commented Jun 2, 2026

Chat: Review posted | View chat
Requested: 2026-06-02 16:47 UTC by @ssncferreira
Spend: $54.22 / $100.00

Review history
  • R1 (2026-06-02): 19 reviewers, 4 Nit, 5 P3, COMMENT. Review

deep-review v0.6.1 | Round 1 | 05b8fb6..b2c1768

Last posted: Round 1, 9 findings (5 P3, 4 Nit), COMMENT. Review

Finding inventory

Finding Inventory

Law analysis

Effective LOC: +1243 -179 (40 files). Head SHA: b2c1768. Verdict: Don't split. Enforcement: N/A.

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P3 Open aibridge/keypool/keypool.go:94 WithMetrics writes p.metrics without synchronization; safe today but no enforcement R1 Hisoka P4, Meruem P3, Knov P3 Yes
CRF-2 P3 Open aibridge/intercept/chatcompletions/blocking_internal_test.go:324 ~30-line metric assertion block copy-pasted 12+ times across 7 files R1 Robin P3, Bisky Nit, Meruem Note Yes
CRF-3 P3 Open aibridge/keypool/state_collector_test.go:55 no_pools test case passes vacuously; nil expectedStateCounts means no assertions R1 Bisky Yes
CRF-4 P3 Open aibridge/metrics/metrics.go:159 key_pool_failover_attempts name misleading for bridged requests; counts all key hand-outs including agentic iterations R1 Knov Yes
CRF-5 P3 Open aibridge/intercept/chatcompletions/blocking.go:91 Redundant comments restating code across 11 locations in 10 files R1 Gon Yes
CRF-6 Nit Open aibridge/intercept/chatcompletions/blocking.go:277 Return-value doc mostly restates type signature; only int meaning is non-obvious R1 Gon Yes
CRF-7 Nit Open aibridge/intercept/chatcompletions/blocking_internal_test.go:29 Positional PromCounterHasValue instead of name-based promhelp.CounterValue R1 Robin Yes
CRF-8 Nit Open aibridge/keypool/keypool.go:109 New() doc comment does not mention added providerName parameter R1 Leorio Yes
CRF-9 Nit Open aibridge/intercept/chatcompletions/streaming.go:186 Inconsistent exhaustion-path attempt accumulation across interceptor variants R1 Knov Yes
CRF-10 Note Dropped by orchestrator (justified by architecture; logger dependency on KeyFailoverConfig) aibridge/provider/provider.go:88 KeyPool() returns same value as KeyFailoverConfig().Pool R1 Robin No
CRF-11 Note Dropped by orchestrator (enforced by config validation and DB uniqueness) aibridge/keypool/state_collector.go:37 Collector assumes provider names are unique R1 Hisoka, Meruem No

Contested and acknowledged

(none)

Round log

Round 1

Panel. 18 reviewers (Bisky, Chopper, Ging-Go, Gon, Hisoka, Killua, Kite, Knov, Leorio, Mafu-san, Mafuuu, Melody, Meruem, Pariston, Razor, Robin, Ryosuke, Zoro). Netero: no findings. Law: don't split. 4 P3, 4 Nit, 2 Note (dropped). Reviewed against 05b8fb6..b2c1768.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Law decomposition
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-structured PR. The four metrics (state gauge, transition counter, exhaustion counter, failover histogram) cover the key pool lifecycle end-to-end, and the scrape-time state collector correctly handles lazy cooldown recovery. Moving MarkKeyOnStatus from a free function to a *Pool method and giving the pool its provider identity at construction time eliminates a class of mismatch bugs. Test density is 74.6% with both positive and negative metric assertions across all 7 recording paths.

Mafu-san: "Nil-receiver handling on RecordAttempts was built in without being asked, preventing panics in BYOK paths." Pariston: "I tried to build a case against this PR and couldn't."

4 P3, 4 Nit.

🤖 This review was automatically generated with Coder Agents.

Comment thread aibridge/keypool/keypool.go Outdated

// WithMetrics attaches metrics so failover events are recorded. It must be
// called before the pool serves requests.
func (p *Pool) WithMetrics(m *metrics.Metrics) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-1] WithMetrics writes p.metrics without synchronization. Today the call sites are correct: attachKeyPoolMetrics always runs before providers.Store() publishes the pool. But the method offers no compile-time or runtime enforcement of the ordering constraint. A future caller wiring metrics to a pool that is already serving requests would race on p.metrics reads in MarkKeyOnStatus, recordExhaustion, and RecordAttempts.

The structural fix would be accepting *metrics.Metrics in New() alongside providerName. Both reviewers acknowledge the architecture makes this impractical today (pools are built in provider constructors before metrics exist), so the two-phase init is justified. The temporal safety is real but relies on call-site discipline.

"The comment 'must be called before the pool serves requests' is the only guard." (Hisoka P4, Meruem P3, Knov P3)

🤖

}
assert.Equal(t, tc.expectedCredentialHint, interceptor.Credential().Hint, "credential hint")

// A centralized interception records one failover-attempts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-2] The ~30-line metric assertion block (histogram check + transitions-by-reason loop + exhaustions-by-outcome loop) is copy-pasted into 12+ test function bodies across 7 files. If the assertion logic is wrong, a label changes, or a new metric is added, the fix must be replicated in 12+ spots.

Siblings: streaming_internal_test.go:414, messages/blocking_internal_test.go:315, messages/streaming_internal_test.go:342, responses/blocking_internal_test.go:315, responses/streaming_internal_test.go:326, plus each file's _AgenticLoopFailover variant, plus passthrough_internal_test.go:607.

A small helper in aibridge/internal/testutil taking (t, reg, provider, expectedTransitions, expectedExhaustions) would collapse each copy to one call. The tests are in different internal packages, but internal/testutil is already imported by all of them.

(Robin P3, Bisky Nit, Meruem Note)

🤖

pools func(t *testing.T, clk quartz.Clock) []*keypool.Pool
expectedStateCounts []stateCount
}{
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-3] The no_pools test case sets expectedStateCounts: nil. The assertion loop iterates tc.expectedStateCounts, which is nil, so the loop body never executes. The test passes vacuously. If the collector had a bug that emitted spurious metrics for an empty pool list, this test would not catch it.

Fix: assert absence when no metrics are expected:

if tc.expectedStateCounts == nil {
    for _, family := range gathered {
        assert.NotEqual(t, "key_pool_state", family.GetName(), "no metrics expected for empty pool list")
    }
    return
}

(Bisky)

🤖

// Pessimistic cardinality: 2 providers, 6 buckets + 3 extra series = up to 18.
KeyPoolFailoverAttempts: promauto.With(reg).NewHistogramVec(prometheus.HistogramOpts{
Subsystem: "key_pool",
Name: "failover_attempts",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-4] The metric name key_pool_failover_attempts implies failure-triggered key rotation, but for bridged requests the observation value sums walker.Attempts() across all agentic-loop iterations. A single-key pool processing 5 agentic iterations records 5, which looks like 5 failover retries when 0 failovers occurred.

An operator alerting on histogram_quantile(0.95, key_pool_failover_attempts) > 3 would fire on deep agentic loops with a perfectly healthy key pool. The help text documents "per interception for bridged requests," but the metric name is the contract most operators read first.

Could we consider key_pool_key_attempts or key_pool_keys_tried to make the aggregation semantics visible in the name?

(Knov)

🤖

@@ -88,6 +88,11 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req
logger.Warn(ctx, "failed to retrieve last user prompt", slog.Error(err))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-5] This comment restates the two lines below it. var totalKeyAttempts int and defer func() { ... RecordAttempts(totalKeyAttempts) }() already communicate "accumulate key attempts and record once on completion." The comment adds nothing a Go developer wouldn't get from the code.

Same text appears in 5 more files: chatcompletions/streaming.go:131, messages/blocking.go:108, messages/streaming.go:156, responses/blocking.go:89, responses/streaming.go:109.

Two additional redundant comments follow the same pattern:

  • chatcompletions/streaming.go:185 / messages/streaming.go:216 / responses/streaming.go:184: "Count this iteration's key attempts." restates totalKeyAttempts += walker.Attempts().
  • keypool/failover.go:72: "One walker per request, so record its key failover attempts when the request completes." restates the defer call.
  • cli/aibridged.go:50: "Report current key pool state per provider at scrape time." restates MustRegister(NewStateCollector(...)).

11 comments across 10 files; all are code transliterations. Delete.

(Gon)

🤖

@@ -267,12 +274,14 @@ func (i *BlockingInterception) ProcessRequest(w http.ResponseWriter, r *http.Req
return nil
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-6] The appended return-value description mostly restates the type signature (*openai.ChatCompletion, int, error). Only the int meaning is non-obvious. Consider trimming to: "The second return counts key attempts for this call."

Siblings: messages/blocking.go:352, responses/blocking.go:155, chatcompletions/blocking.go:296, messages/blocking.go:372, responses/blocking.go:176.

(Gon)

🤖

@@ -22,7 +23,10 @@ import (
"github.com/coder/coder/v2/aibridge/internal/testutil"
"github.com/coder/coder/v2/aibridge/keypool"
"github.com/coder/coder/v2/aibridge/mcp"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-7] Tests import both codertestutil.PromCounterHasValue (positional label matching at testutil/prometheus.go:72) and promhelp.HistogramValue (name-based matching at coderd/coderdtest/promhelp/metrics.go:67). The positional API matches labels by position without checking names, which is more fragile than promhelp's maps.Equal approach. Using promhelp.CounterValue consistently would unify the assertion style and eliminate the second import.

Same pattern in 6 other test files.

(Robin)

🤖

Comment thread aibridge/keypool/keypool.go Outdated
p.metrics.KeyPoolFailoverAttempts.WithLabelValues(p.providerName).Observe(float64(attempts))
}

// New creates a pool from the given keys. All keys start in
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-8] The doc comment for New does not mention the added providerName parameter, which controls metric labels (provider on every counter, histogram, and gauge this pool emits) and log fields. A developer calling New("", keys, clk) gets silently mislabeled metrics.

Suggestion:

// New creates a pool from the given keys, labeled by providerName in
// metrics and logs. All keys start in the valid state. Returns
// ErrNoKeys if keys is empty and ErrDuplicateKey if any key appears
// more than once.

(Leorio)

🤖

)
}

// Count this iteration's key attempts.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [CRF-9] When walker.Next() returns a keyPoolErr, this streaming interceptor breaks out of the outer loop without executing totalKeyAttempts += walker.Attempts(). The responses/streaming.go interceptor explicitly adds walker.Attempts() on the exhaustion path (line 148) before returning. Both are correct because walker.Attempts() returns 0 on exhaustion (no key was handed out), but the inconsistency between interceptor variants could confuse a future reader.

Adding the no-op totalKeyAttempts += walker.Attempts() before the break would make all three streaming implementations structurally identical.

Sibling: messages/streaming.go:216.

(Knov)

🤖

@ssncferreira ssncferreira force-pushed the ssncf/keypool-failover-metrics branch 2 times, most recently from 56d2a23 to 6b7f7ef Compare June 3, 2026 18:56
@ssncferreira ssncferreira changed the base branch from main to graphite-base/25901 June 3, 2026 19:36
@ssncferreira ssncferreira force-pushed the ssncf/keypool-failover-metrics branch from 6b7f7ef to 71313a0 Compare June 3, 2026 19:36
@ssncferreira ssncferreira changed the base branch from graphite-base/25901 to ssncf/consolidate-failover-tests June 3, 2026 19:36
@ssncferreira ssncferreira force-pushed the ssncf/consolidate-failover-tests branch from 10db29f to e126bf3 Compare June 3, 2026 19:40
@ssncferreira ssncferreira force-pushed the ssncf/keypool-failover-metrics branch from 71313a0 to 79895ba Compare June 3, 2026 19:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

aibridge: key failover observability

1 participant