Skip to content

feat: add automatic key failover for AI Bridge Anthropic#24836

Merged
ssncferreira merged 14 commits into
mainfrom
ssncf/aibridge-anthropic-key-failover
May 7, 2026
Merged

feat: add automatic key failover for AI Bridge Anthropic#24836
ssncferreira merged 14 commits into
mainfrom
ssncf/aibridge-anthropic-key-failover

Conversation

@ssncferreira
Copy link
Copy Markdown
Contributor

@ssncferreira ssncferreira commented Apr 30, 2026

Description

Adds automatic key failover for centralized Anthropic provider. When a key pool is configured, each upstream call walks the pool and tries keys in order until one succeeds or the pool is exhausted. Keys are marked temporary on 429 (with cooldown from Retry-After) and permanent on 401/403. Errors that aren't key-specific don't trigger failover. Each agentic-loop iteration gets its own fresh walker, so a tool-call continuation can fail over independently of the initial request.

BYOK is unchanged: BYOK requests run as a single attempt with no failover.

Changes

  • config.Anthropic carries a KeyPool. Key remains for BYOK X-Api-Key set per interception.
  • Blocking interceptor: walks the pool, marks keys on key-specific failures, returns on first success or non-failover error.
  • Streaming interceptor: per-iteration walker. Pre-stream failures fail over to the next key; mid-stream errors are relayed as SSE events.
  • New keypool error types: TransientExhaustionError (carries soonest cooldown) and ErrPermanentExhaustion. Replace the prior ErrAllKeysExhausted.
  • Error responses now consistently include the outer "type": "error" field.

Related Issues

Related to: coder/internal#1446
Related to: https://linear.app/codercom/issue/AIGOV-197/aibridge-automatic-key-failover-for-bridged-and-passthrough-routes

Follow-up PRs

  • Bedrock multi-key support.
  • Refactor provider vs interceptor config separation.
  • Record the actually-used key in the interception credential hint after failover.

Note

Initially generated by Claude Opus 4.7, modified and reviewed by @ssncferreira

Copy link
Copy Markdown
Contributor Author

ssncferreira commented Apr 30, 2026

@ssncferreira ssncferreira force-pushed the ssncf/aibridge-anthropic-key-failover branch 3 times, most recently from 15e983c to c486d89 Compare April 30, 2026 17:44
@ssncferreira ssncferreira force-pushed the ssncf/aibridge-anthropic-key-failover branch from c486d89 to cb3d525 Compare April 30, 2026 17:57
@ssncferreira ssncferreira force-pushed the ssncf/aibridge-anthropic-key-failover branch 10 times, most recently from f13fb00 to 63d2574 Compare May 4, 2026 16:08
@ssncferreira ssncferreira marked this pull request as ready for review May 4, 2026 16:10
@ssncferreira
Copy link
Copy Markdown
Contributor Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid key pool design. The Walker/Key/Pool decomposition is clean, state transitions are well-guarded with mutexes, and the exhaustion-error hierarchy (transient with cooldown vs permanent) gives clients actionable retry guidance. The integration test crossing request boundaries is excellent coverage. The blocking failover path is tight. The test suite is genuine: no tautologies, real assertions on pool state.

1 P2, 12 P3, 7 Nits, 4 Notes. The P2 is a correctness regression in the streaming pre-stream error path where non-Anthropic upstream failures (TCP reset, DNS, TLS) produce a silent 200. The P3s cluster around test gaps, a latent WriteHeader(0) panic, a TOCTOU in exhaustionError, and inconsistencies between the blocking and streaming error paths.

"Every caller expects either a well-formed SSE stream or a conventional HTTP error. The streaming path has no equivalent propagation." (Mafuuu)


aibridge/keypool/keypool.go:153

Nit [DEREM-16] MarkTemporary returns false when extending an active cooldown (key already in cooldown, new deadline is later). MarkKeyOnStatus gates logger.Warn on that boolean, so a deadline extension from 10s to 60s is applied silently. Operators lose visibility into cooldown escalation under sustained 429s.

Either log the extension unconditionally in MarkKeyOnStatus, or add a second boolean return signaling "deadline was updated." (Meruem)

🤖

aibridge/intercept/messages/base.go:433

Nit [DEREM-18] antErr was named when every caller passed an Anthropic SDK error. This PR adds pool-exhaustion callers, making the prefix wrong for half the call sites. streaming.go already uses respErr for the same parameter.

(Gon)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread aibridge/intercept/messages/streaming.go Outdated
Comment thread aibridge/intercept/messages/base.go
Comment thread aibridge/keypool/keypool.go Outdated
Comment thread aibridge/intercept/messages/blocking_test.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/intercept/messages/streaming.go
Comment thread aibridge/keypool/keypool.go
Comment thread aibridge/intercept/messages/streaming.go
Comment thread aibridge/intercept/messages/base.go
Comment thread aibridge/intercept/messages/base.go
switch {
case errors.As(err, &transient):
return newErrorResponse(
"all configured keys are rate-limited",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about using .Error() instead of rephrasing the same message.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean reuse the error message from keypool? We could, but we have two levels of error messages here: (1) the keypool error message describes pool state for logs and operators, and (2) the response error message is what's surfaced to developers integrating with the API. They serve different purposes, so I'd rather keep them separate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then I think it would be better to define this error somewhere instead of using string values. It could be shared between providers.

Comment thread aibridge/intercept/messages/base.go
Comment thread aibridge/intercept/messages/blocking.go Outdated
Comment thread aibridge/keypool/keymark.go Outdated
Comment thread aibridge/keypool/keypool.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/provider/anthropic.go
Comment thread aibridge/intercept/messages/base.go
@ssncferreira ssncferreira requested a review from pawbana May 6, 2026 10:44
@ssncferreira ssncferreira force-pushed the ssncf/aibridge-anthropic-key-failover branch from 4ae41e2 to e9b9e65 Compare May 7, 2026 07:51
expectRetryAfter: "1",
},
{
// 200ms rounds up to Retry-After: 1.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

200ms and 500ms test cases look the same. I think 1 can be removed.

Comment thread aibridge/intercept/messages/base.go Outdated
Comment thread aibridge/intercept/messages/blocking.go
// processing error into a relayable responseError. Returns nil
// when the error is unrecoverable, in which case nothing can be
// relayed back.
func (*StreamingInterception) mapStreamError(ctx context.Context, logger slog.Logger, streamErr, lastErr error) *responseError {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about merging this function wtih mapExhaustionError? In both cases they do the same, convert some other error into responseError.

Maybe responseError could have constructor from error?

data: {"type":"message_stop"}
`

func TestStreamingInterception_KeyFailover(t *testing.T) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks the same as blocking. Could be merged.

I don't see test case with mid stream error?


// stubToolCaller is a minimal mcp.ToolCaller that returns a fixed
// text result, so the agentic continuation can proceed.
type stubToolCaller struct{}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be part of testutil package.

Defined here but also used in blocking test. Similiar rateLimitBody is defined in blocking test but used here.


StatusCode int `json:"-"`
StatusCode int `json:"-"`
RetryAfter time.Duration `json:"-"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only used as .Seconds() could be stored as int

Comment thread aibridge/keypool/keymark.go Outdated
pool := w.pool
if pool == nil {
return nil, ErrAllKeysExhausted
return nil, ErrPermanentKeyPool
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be different kind of error to distingusing empty pool vs all keys permanent.

Comment thread aibridge/provider/anthropic.go
@ssncferreira ssncferreira force-pushed the ssncf/aibridge-anthropic-key-failover branch from 4bd785f to 2682942 Compare May 7, 2026 11:36
Copy link
Copy Markdown
Contributor Author

ssncferreira commented May 7, 2026

Merge activity

  • May 7, 1:57 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • May 7, 1:57 PM UTC: @ssncferreira merged this pull request with Graphite.

@ssncferreira ssncferreira merged commit f1155ac into main May 7, 2026
23 checks passed
@ssncferreira ssncferreira deleted the ssncf/aibridge-anthropic-key-failover branch May 7, 2026 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants