feat: model-level provider priority lists with runtime fallback#23649
feat: model-level provider priority lists with runtime fallback#23649ibetitsmike wants to merge 14 commits intomike/multi-provider-configs-stage1from
Conversation
|
@codex review |
Documentation CheckUpdates Needed
Automated review via Coder Tasks |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 08f847d066
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
coderd/database/migrations/000454_model_provider_priority.down.sql
Outdated
Show resolved
Hide resolved
2821e52 to
a475e54
Compare
f3ff651 to
185fe71
Compare
e44bf71 to
ceb2d46
Compare
1628633 to
c4adf90
Compare
5912186 to
dcbf8bc
Compare
…red provider attachments
… provider config order
…provider attachments
… provider attachment order
…e attachment stories
185fe71 to
a492f04
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a492f045d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| CompressionThreshold: compressionThreshold, | ||
| Options: modelConfigRaw, | ||
| ProviderConfigID: providerConfigID, | ||
| ProviderConfigID: uuid.NullUUID{}, |
There was a problem hiding this comment.
Preserve provider_config_id until bound-delete query is replaced
This update path always writes provider_config_id as NULL, but provider deletion still soft-deletes bound models via SoftDeleteBoundChatModelConfigsByProviderConfigID (which filters on chat_model_configs.provider_config_id). After any model edit, that row no longer matches the delete query, so deleting a provider leaves previously bound models undeleted and drops their attachment rows via FK cascade; chats then reference a model that now fails runtime resolution (no provider configs attached) instead of preserving historical binding.
Useful? React with 👍 / 👎.
| FROM | ||
| chat_model_configs cmc | ||
| JOIN | ||
| chat_providers cp ON cp.provider = cmc.provider AND cp.enabled = TRUE |
There was a problem hiding this comment.
Seed migration attachments from legacy bound config first
The backfill inserts attachments by joining every enabled provider in the same family, but it never reads the existing chat_model_configs.provider_config_id binding. On upgrade, models that were explicitly bound to a specific provider config can be reordered to a different first attachment (or attached to multiple configs they were never bound to), which changes runtime provider selection as soon as resolveLanguageModelForConfig starts using attachment priority.
Useful? React with 👍 / 👎.
| if trimmedAPIKey := strings.TrimSpace(dbAPIKey); trimmedAPIKey != "" { | ||
| keys.ByProvider[normalizedProvider] = trimmedAPIKey | ||
| switch normalizedProvider { | ||
| case fantasyopenai.Name: | ||
| keys.OpenAI = trimmedAPIKey |
There was a problem hiding this comment.
Clear fallback API key when attached config has no key
This helper only overwrites keys when dbAPIKey is non-empty, so attachment resolution can inherit a family-level fallback key from another config in baseKeys. In resolveLanguageModelForConfig, that inherited key makes resolvedKeys.APIKey(...) non-empty, so an attachment with no own key is treated as usable instead of being skipped, breaking per-config fallback ordering and key isolation for same-provider attachments.
Useful? React with 👍 / 👎.
|
@codex review |
mafredri
left a comment
There was a problem hiding this comment.
Multi-subsystem PR: schema migration with backfill, Go API+runtime, SDK types, and React frontend. The join-table design for model-to-provider priority lists is sound, and the validation in validateProviderConfigIDs is well-structured. The test for ordered provider config listing (ListIncludesOrderedProviderConfigs) is thorough.
Severity count: 3 P0, 3 P1, 5 P2, 2 P3, plus notes.
The three P0s must be resolved before merge:
P0 (1): Provider deletion is broken (coderd/exp_chats.go:4691). Two tests fail on this branch: DeleteProviderSoftDeletesBoundModelsPreservingChatHistory and ReassignsDefaultAfterDeletingProviderOwnedDefault. SoftDeleteBoundChatModelConfigsByProviderConfigID filters on chat_model_configs.provider_config_id, which is now always NULL (the handlers write ProviderConfigID: uuid.NullUUID{} on every create and update). The query matches zero rows. Then DeleteChatProviderByID triggers ON DELETE CASCADE on the join table, silently removing attachment rows. The model survives with zero attachments, stays enabled and visible, but resolveLanguageModelForConfig returns "no provider configs attached" for every chat using it. Fix: rewrite the query to join through chat_model_provider_configs. (Netero P0, Hisoka P1, Kite P0, Knov P0)
P1 (related): SoftDeleteUnboundChatModelConfigsByProvider now matches ALL models (coderd/exp_chats.go:4680). The query filters provider_config_id IS NULL AND provider = @provider. Since provider_config_id is always NULL, every model in the family matches. When shouldCleanUnbound is true (last provider in family deleted), this soft-deletes all models in the family, including those with valid join-table attachments to disabled providers. Same root cause: the "unbound" concept should use the join table. (Netero P2, Kite P0)
P0 (2): dbcrypt bypass (see inline comment on chatmodelproviderconfigs.sql:10).
P0 (3): Backfill ignores existing bindings (see inline comment on migration).
The P1 cluster is also severe: the new resolveLanguageModelForConfig ignores all three provider credential policy flags (see inline comment on chatd.go:5428). 436 lines of tests that covered this behavior were deleted with no replacement.
Additional notes (P3 and below, grouped to avoid comment noise):
Dead SQL: GetEnabledChatModelConfigs still filters on the now-always-NULL provider_config_id column. The bound/unbound SQL branches are dead code. (Hisoka, Kite, Knuckle, Meruem)
DB query per message: resolveLanguageModelForConfig calls p.db.GetModelProviderConfigs per chat message, bypassing configCache. Creates a split: cached for user keys, uncached for provider attachments. (Netero, Meruem, Kite)
Naming: ProviderConfigs field holds []ChatModelProviderAttachment. DB model is ChatModelProviderConfig. Domain vocabulary is split. (Gon)
maps.Clone: ProviderKeysFromConfig hand-rolls map copies where maps.Clone was already used. (Ging-Go)
Prior review: #comment-2992212772 (auto-attach nondeterminism) remains open, code unchanged, no response. #comment-2992212775 (rollback unique index) is resolved.
Meruem: "The
provider_config_idcolumn onchat_model_configsis semantically dead, but three queries still use it. All three produce wrong results."
🤖 This review was automatically generated with Coder Agents.
| cp.enabled AS provider_enabled, | ||
| cp.api_key AS provider_api_key, | ||
| cp.base_url AS provider_base_url | ||
| FROM |
There was a problem hiding this comment.
P0 JOIN queries bypass dbcrypt decryption, sending encrypted ciphertext as LLM API keys in enterprise deployments. (Knuckle P0)
GetModelProviderConfigsandGetModelProviderConfigsByModelIDsSELECTcp.api_keyvia a JOIN withchat_providers. Theenterprise/dbcryptlayer intercepts queries that returndatabase.ChatProviderand decryptsAPIKey. These new queries return custom row types with no dbcrypt wrapper.
Verified: enterprise/dbcrypt/dbcrypt.go has wrappers for GetChatProviderByID, GetEnabledChatProviders, InsertChatProvider, UpdateChatProvider, but none for GetModelProviderConfigs or GetModelProviderConfigsByModelIDs. The tests pass because dbtestutil.NewDB does not wrap with dbcrypt.
Two consequences: (1) resolveLanguageModelForConfig sends ciphertext as bearer tokens, every LLM call returns 401, chat is dead for enterprise+dbcrypt. (2) The ciphertext overwrites correctly resolved user keys in baseKeys.
Fix: either add dbcrypt wrappers for both queries (requires selecting cp.api_key_key_id in the JOIN), or stop selecting cp.api_key in the join query and load the decrypted key via the existing GetChatProviderByID when the runtime needs it.
🤖
| CREATE INDEX idx_chat_model_provider_configs_order | ||
| ON chat_model_provider_configs (model_config_id, priority); | ||
|
|
||
| INSERT INTO chat_model_provider_configs (model_config_id, provider_config_id, priority) |
There was a problem hiding this comment.
P1 Backfill ignores existing provider_config_id bindings, expanding all models to every enabled family provider. (Mafuuu P2, Knuckle P1, Knov P0)
The backfill joins
chat_model_configs cmc JOIN chat_providers cp ON cp.provider = cmc.provider AND cp.enabled = TRUE. This matches ALL enabled providers in the family, regardless of whethercmc.provider_config_idpointed to one specific provider.
Scenario: two OpenAI configs (A created Jan 1, B created Feb 1). Model was bound to B. Backfill creates A at priority 0, B at priority 1. Runtime picks A. The model silently switches providers, potentially using a different API key and base URL.
Also, models bound to disabled providers get zero attachments (the JOIN filters cp.enabled = TRUE), making them permanently unusable. The old code fell back to family defaults.
Fix: distinguish bound from unbound models in the backfill. Bound models should get a single priority-0 attachment for their explicit binding. Unbound models get all enabled family providers.
🤖
| p.providerAPIKeys, | ||
| database.ChatProvidersByFamilyPrecedence(providers), | ||
| userKeys, | ||
| resolvedKeys := chatprovider.ProviderKeysFromConfig( |
There was a problem hiding this comment.
P1 Provider credential policy flags (CentralApiKeyEnabled, AllowUserApiKey, AllowCentralApiKeyFallback) are ignored at runtime. (Mafuuu P1, Meruem P1, Hisoka P2, Kite P2, Knov P2, Knuckle P2)
Six reviewers independently flagged this. ProviderKeysFromConfig(baseKeys, attachment.Provider, attachment.ProviderApiKey, attachment.ProviderBaseUrl) unconditionally overlays the DB-stored API key. The join query does not even SELECT central_api_key_enabled, allow_user_api_key, or allow_central_api_key_fallback.
The deleted resolveProviderAPIKeysForModel routed every provider through ResolveUserProviderKeys, which evaluated all three flags. The deleted TestResolveProviderAPIKeysForModel (436 lines) covered these scenarios. No replacement tests exist.
Concrete failures:
CentralApiKeyEnabled=falsewith a non-empty DB key: the key is used anyway.AllowUserApiKey=truewith a user key: the DB key overwrites the user key inbaseKeys.AllowUserApiKey=false: user keys leak throughbaseKeysfrom family-level resolution.
Fix: add the three policy columns to GetModelProviderConfigs and evaluate them per-attachment, or call ResolveUserProviderKeys per attachment.
🤖
| @@ -5178,7 +5291,6 @@ func (api *API) updateChatModelConfig(rw http.ResponseWriter, r *http.Request) { | |||
| } | |||
|
|
|||
| provider := existing.Provider | |||
There was a problem hiding this comment.
P2 Provider family change without provider_config_ids silently creates cross-family orphan attachments. (Mafuuu P2, Meruem P1, Kite P2, Kurapika P2, Knov P1)
The old code tracked
providerChangedand auto-resolved a new provider binding. The new code removedproviderChangedtracking entirely.
PATCH {"provider": "anthropic"} on an openai model without provider_config_ids: existing openai attachments remain. At runtime, resolveLanguageModelForConfig uses attachment.Provider ("openai") to build the SDK client with modelConfig.Model (possibly an anthropic model name). The model is silently broken.
Tests ProviderChangePreservesExistingAttachments and ProviderChangePreservesAttachmentsEvenWithNoMatchingProvider assert this as intentional. But the runtime consequence is not tested.
Fix: either require provider_config_ids when provider changes (reject with 400), or auto-delete stale attachments and auto-attach the new family's provider.
🤖
| keys.ByProvider[normalizedProvider] = trimmedKey | ||
| if err != nil { | ||
| return nil, chatprovider.ProviderAPIKeys{}, xerrors.Errorf( | ||
| "create model: %w", |
There was a problem hiding this comment.
P2 ModelFromConfig error halts the fallback loop instead of continuing to the next attachment. (Netero P2, Chopper P2, Bisky P2)
The loop returns immediately when
chatprovider.ModelFromConfigfails for the first enabled+keyed attachment. A misconfiguration in the highest-priority attachment stops the entire fallback chain.
The error message carries no identification of which provider config was attempted. An operator receiving "create model: invalid base URL" cannot tell which attachment out of three has the problem.
No test exercises this path. FallbackToSecondProvider tests disabled and keyless skipping but not SDK-construction failures.
Fix: log the per-attachment error at debug level with the provider config ID, continue to the next attachment. Return hard only after all attachments are exhausted.
🤖
| req = { ...baseReq, provider_config_id: selectedProviderConfigID }; | ||
| } else { | ||
| req = baseReq; | ||
| const currentIds = values.providerConfigIds; |
There was a problem hiding this comment.
P2 Frontend silently swallows update when user removes all provider config attachments. (Chopper P2)
if (existingOrderedIds.length > 0 && currentIds.length === 0) {
return;
}The user can remove all attachments via the X buttons. When they click Save, this guard silently returns without sending a request, showing an error, or providing visual feedback. The backend would reject with 400 ("Provider config IDs cannot be empty"), but the request is never sent.
Fix: either disable the Save button when zero attachments remain, show a validation error, or let the request through so the backend's error message reaches the user.
🤖
| @@ -0,0 +1 @@ | |||
| DROP TABLE IF EXISTS chat_model_provider_configs; | |||
There was a problem hiding this comment.
P2 Down migration is lossy: rolling back leaves all models permanently unbound. (Meruem P2, Kite P3)
The up migration backfills the join table from existing data. The runtime code sets provider_config_id = uuid.NullUUID{} on every insert/update. The down migration only drops the join table. After rollback, every model has provider_config_id = NULL. The three queries that depend on this column (GetEnabledChatModelConfigs, SoftDeleteBound..., SoftDeleteUnbound...) will treat all models as unbound.
Fix: restore provider_config_id from the priority-0 row in chat_model_provider_configs before dropping the table.
🤖
| Priority: row.Priority, | ||
| DisplayName: row.ProviderDisplayName, | ||
| Enabled: row.ProviderEnabled, | ||
| HasAPIKey: row.ProviderApiKey != "", |
There was a problem hiding this comment.
P2 HasAPIKey checks only the DB-stored key, diverging from the provider-level effectiveChatProviderConfigHasAPIKey logic. (Chopper P2)
convertProviderConfigAttachments sets HasAPIKey: row.ProviderApiKey != "". The provider listing endpoint uses effectiveChatProviderConfigHasAPIKey, which merges deployment-level keys (e.g., CODER_AI_OPENAI_API_KEY env var). A provider with no DB key but an env var shows HasAPIKey: false in the attachment, misleading the admin.
Additionally, ProviderKeysFromConfig uses strings.TrimSpace(dbAPIKey), so a whitespace-only DB key shows HasAPIKey: true here but is treated as empty at runtime. (Knov Note)
Fix: use the same effectiveChatProviderConfigHasAPIKey computation, or add a TrimSpace before the empty check.
🤖
| } | ||
| return model, dbConfig, resolvedKeys, nil | ||
| } | ||
|
|
There was a problem hiding this comment.
P3 resolveLanguageModelForConfig queries the DB per chat message, bypassing the config cache. (Netero P3, Meruem P3, Kite Note)
The old code used p.configCache.EnabledProviders and p.configCache.EnabledProviderByID. The new code calls p.db.GetModelProviderConfigs(ctx, modelConfig.ID) directly. This adds a DB round-trip per message. The attachment data is stable between config changes and would benefit from the existing configCache pattern.
This also creates a split: cached for user keys (via resolveUserProviderAPIKeys), uncached for provider attachments. Stale cache entries and fresh DB reads can produce contradictory state within a single resolution.
🤖
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 32ca1dd080
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| CompressionThreshold: compressionThreshold, | ||
| Options: modelConfigRaw, | ||
| ProviderConfigID: providerConfigID, | ||
| ProviderConfigID: uuid.NullUUID{}, |
There was a problem hiding this comment.
Keep provider_config_id populated until delete cleanup is migrated
Writing ProviderConfigID as uuid.NullUUID{} here breaks provider-deletion cleanup, because deleteChatProvider in the same file still soft-deletes bound models through SoftDeleteBoundChatModelConfigsByProviderConfigID (legacy chat_model_configs.provider_config_id). In a multi-config family, deleting one provider skips unbound cleanup, cascades away attachment rows, and leaves the model enabled but unattached; runtime resolution then fails with no provider configs attached. Keep the legacy column in sync (or migrate delete cleanup to the attachment table) before clearing it on create/update.
Useful? React with 👍 / 👎.
| if len(attachments) == 0 { | ||
| return nil, chatprovider.ProviderAPIKeys{}, xerrors.New("no provider configs attached") |
There was a problem hiding this comment.
Allow env-key fallback when a model has no provider attachments
This hard failure regresses env-only setups: the commit permits models with zero attachments (e.g., createChatModelConfig leaves resolvedProviderConfigIDs empty when no enabled DB provider exists, and migration 000466 backfills only from chat_providers rows), but resolveLanguageModelForConfig now immediately errors instead of using existing deployment/user key fallback. That makes previously valid env-backed models unavailable with no provider configs attached.
Useful? React with 👍 / 👎.
Summary
Builds on #23647 (Stage 1). Adds model-level provider config priority lists so each model can reference multiple provider configs from the same provider family in fallback order. Disabled/unusable configs are skipped at selection time.
Database
chat_model_provider_configsjoin table withmodel_config_id,provider_config_id,priority, timestamps, and proper constraints/indexes.ROW_NUMBER()for safe multi-row handling).API
ChatModelConfigresponse now includesprovider_configs(ordered attachments with provider metadata, enabled/has_api_key status).CreateChatModelConfigRequestacceptsprovider_config_ids(ordered list). Omitting auto-attaches the enabled provider.UpdateChatModelConfigRequestacceptsprovider_config_ids(null= unchanged, non-null = full replacement).Runtime
resolveChatModelwalks ordered provider attachments instead of merging by provider family name.Frontend
providerConfigIdsadded to form values with proper initialization.Tests