feat: add automatic key failover for AI Bridge Anthropic#24836
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
15e983c to
c486d89
Compare
c486d89 to
cb3d525
Compare
f13fb00 to
63d2574
Compare
|
/coder-agents-review |
There was a problem hiding this comment.
Solid key pool design. The Walker/Key/Pool decomposition is clean, state transitions are well-guarded with mutexes, and the exhaustion-error hierarchy (transient with cooldown vs permanent) gives clients actionable retry guidance. The integration test crossing request boundaries is excellent coverage. The blocking failover path is tight. The test suite is genuine: no tautologies, real assertions on pool state.
1 P2, 12 P3, 7 Nits, 4 Notes. The P2 is a correctness regression in the streaming pre-stream error path where non-Anthropic upstream failures (TCP reset, DNS, TLS) produce a silent 200. The P3s cluster around test gaps, a latent WriteHeader(0) panic, a TOCTOU in exhaustionError, and inconsistencies between the blocking and streaming error paths.
"Every caller expects either a well-formed SSE stream or a conventional HTTP error. The streaming path has no equivalent propagation." (Mafuuu)
aibridge/keypool/keypool.go:153
Nit [DEREM-16] MarkTemporary returns false when extending an active cooldown (key already in cooldown, new deadline is later). MarkKeyOnStatus gates logger.Warn on that boolean, so a deadline extension from 10s to 60s is applied silently. Operators lose visibility into cooldown escalation under sustained 429s.
Either log the extension unconditionally in
MarkKeyOnStatus, or add a second boolean return signaling "deadline was updated." (Meruem)
🤖
aibridge/intercept/messages/base.go:433
Nit [DEREM-18] antErr was named when every caller passed an Anthropic SDK error. This PR adds pool-exhaustion callers, making the prefix wrong for half the call sites. streaming.go already uses respErr for the same parameter.
(Gon)
🤖
🤖 This review was automatically generated with Coder Agents.
| switch { | ||
| case errors.As(err, &transient): | ||
| return newErrorResponse( | ||
| "all configured keys are rate-limited", |
There was a problem hiding this comment.
how about using .Error() instead of rephrasing the same message.
There was a problem hiding this comment.
You mean reuse the error message from keypool? We could, but we have two levels of error messages here: (1) the keypool error message describes pool state for logs and operators, and (2) the response error message is what's surfaced to developers integrating with the API. They serve different purposes, so I'd rather keep them separate.
There was a problem hiding this comment.
Then I think it would be better to define this error somewhere instead of using string values. It could be shared between providers.
4ae41e2 to
e9b9e65
Compare
| expectRetryAfter: "1", | ||
| }, | ||
| { | ||
| // 200ms rounds up to Retry-After: 1. |
There was a problem hiding this comment.
200ms and 500ms test cases look the same. I think 1 can be removed.
| // processing error into a relayable responseError. Returns nil | ||
| // when the error is unrecoverable, in which case nothing can be | ||
| // relayed back. | ||
| func (*StreamingInterception) mapStreamError(ctx context.Context, logger slog.Logger, streamErr, lastErr error) *responseError { |
There was a problem hiding this comment.
How about merging this function wtih mapExhaustionError? In both cases they do the same, convert some other error into responseError.
Maybe responseError could have constructor from error?
| data: {"type":"message_stop"} | ||
| ` | ||
|
|
||
| func TestStreamingInterception_KeyFailover(t *testing.T) { |
There was a problem hiding this comment.
This test looks the same as blocking. Could be merged.
I don't see test case with mid stream error?
|
|
||
| // stubToolCaller is a minimal mcp.ToolCaller that returns a fixed | ||
| // text result, so the agentic continuation can proceed. | ||
| type stubToolCaller struct{} |
There was a problem hiding this comment.
Could be part of testutil package.
Defined here but also used in blocking test. Similiar rateLimitBody is defined in blocking test but used here.
|
|
||
| StatusCode int `json:"-"` | ||
| StatusCode int `json:"-"` | ||
| RetryAfter time.Duration `json:"-"` |
There was a problem hiding this comment.
this is only used as .Seconds() could be stored as int
| pool := w.pool | ||
| if pool == nil { | ||
| return nil, ErrAllKeysExhausted | ||
| return nil, ErrPermanentKeyPool |
There was a problem hiding this comment.
Could be different kind of error to distingusing empty pool vs all keys permanent.
4bd785f to
2682942
Compare
Merge activity
|

Description
Adds automatic key failover for centralized Anthropic provider. When a key pool is configured, each upstream call walks the pool and tries keys in order until one succeeds or the pool is exhausted. Keys are marked temporary on 429 (with cooldown from
Retry-After) and permanent on 401/403. Errors that aren't key-specific don't trigger failover. Each agentic-loop iteration gets its own fresh walker, so a tool-call continuation can fail over independently of the initial request.BYOK is unchanged: BYOK requests run as a single attempt with no failover.
Changes
config.Anthropiccarries aKeyPool.Keyremains for BYOK X-Api-Key set per interception.keypoolerror types:TransientExhaustionError(carries soonest cooldown) andErrPermanentExhaustion. Replace the priorErrAllKeysExhausted."type": "error"field.Related Issues
Related to: coder/internal#1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes
Follow-up PRs
Note
Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira