Skip to content

feat: model-level provider priority lists with runtime fallback#23649

Open
ibetitsmike wants to merge 14 commits intomike/multi-provider-configs-stage1from
mike/multi-provider-configs-stage2
Open

feat: model-level provider priority lists with runtime fallback#23649
ibetitsmike wants to merge 14 commits intomike/multi-provider-configs-stage1from
mike/multi-provider-configs-stage2

Conversation

@ibetitsmike
Copy link
Copy Markdown
Collaborator

@ibetitsmike ibetitsmike commented Mar 26, 2026

Summary

Builds on #23647 (Stage 1). Adds model-level provider config priority lists so each model can reference multiple provider configs from the same provider family in fallback order. Disabled/unusable configs are skipped at selection time.

Mux working on behalf of Mike.

Database

  • Migration 000466: New chat_model_provider_configs join table with model_config_id, provider_config_id, priority, timestamps, and proper constraints/indexes.
  • Backfill: Existing models get a priority-0 attachment for their family's currently enabled provider (using ROW_NUMBER() for safe multi-row handling).

API

  • ChatModelConfig response now includes provider_configs (ordered attachments with provider metadata, enabled/has_api_key status).
  • CreateChatModelConfigRequest accepts provider_config_ids (ordered list). Omitting auto-attaches the enabled provider.
  • UpdateChatModelConfigRequest accepts provider_config_ids (null = unchanged, non-null = full replacement).
  • Cross-family validation rejects attachments from the wrong provider family.
  • Duplicate provider config IDs are rejected.

Runtime

  • resolveChatModel walks ordered provider attachments instead of merging by provider family name.
  • Skips disabled providers and providers without API keys.
  • First usable attachment wins. If none usable, model is unavailable.

Frontend

  • ModelForm: Provider config priority editor with up/down reorder buttons, status badges, and form submission wiring.
  • ModelsSection: Provider config count indicator in model list rows.
  • modelConfigFormLogic: providerConfigIds added to form values with proper initialization.
  • Stories: New stories for provider config editing, reordering, and list indicators.

Tests

  • New API tests for create with/without configs, cross-family rejection, duplicate IDs, reorder, nil update.
  • New runtime tests for attachment-based provider resolution with fallback.
  • New frontend unit tests for form value initialization and ordering.

@ibetitsmike
Copy link
Copy Markdown
Collaborator Author

@codex review

@ibetitsmike ibetitsmike changed the title feat(chatd): model-level provider priority lists with runtime fallback feat: model-level provider priority lists with runtime fallback Mar 26, 2026
@coder-tasks
Copy link
Copy Markdown
Contributor

coder-tasks bot commented Mar 26, 2026

Documentation Check

Updates Needed

  • docs/ai-coder/agents/models.md - The Add a model section describes selecting a single provider per model, but this PR introduces model-level provider config priority lists with fallback ordering. The section should document: (1) how to attach multiple provider configs to a model in priority order, (2) using the reorder (up/down) buttons in the new priority editor UI, (3) status badges (disabled / no API key) shown per provider config attachment, and (4) fallback runtime behavior — the model walks through ordered attachments, skips disabled providers or those without API keys, and becomes unavailable if none are usable. The models list view also now shows a provider config count indicator per model row, which should be noted.

Automated review via Coder Tasks

@ibetitsmike
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 08f847d066

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage1 branch from 2821e52 to a475e54 Compare March 26, 2026 09:18
@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage2 branch 5 times, most recently from f3ff651 to 185fe71 Compare March 26, 2026 13:44
@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage1 branch 3 times, most recently from e44bf71 to ceb2d46 Compare March 27, 2026 00:11
@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage1 branch 4 times, most recently from 1628633 to c4adf90 Compare April 3, 2026 00:57
@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage1 branch 9 times, most recently from 5912186 to dcbf8bc Compare April 8, 2026 19:00
@ibetitsmike ibetitsmike force-pushed the mike/multi-provider-configs-stage2 branch from 185fe71 to a492f04 Compare April 9, 2026 01:54
@ibetitsmike ibetitsmike requested a review from mafredri April 9, 2026 01:54
@ibetitsmike
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a492f045d9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

CompressionThreshold: compressionThreshold,
Options: modelConfigRaw,
ProviderConfigID: providerConfigID,
ProviderConfigID: uuid.NullUUID{},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve provider_config_id until bound-delete query is replaced

This update path always writes provider_config_id as NULL, but provider deletion still soft-deletes bound models via SoftDeleteBoundChatModelConfigsByProviderConfigID (which filters on chat_model_configs.provider_config_id). After any model edit, that row no longer matches the delete query, so deleting a provider leaves previously bound models undeleted and drops their attachment rows via FK cascade; chats then reference a model that now fails runtime resolution (no provider configs attached) instead of preserving historical binding.

Useful? React with 👍 / 👎.

FROM
chat_model_configs cmc
JOIN
chat_providers cp ON cp.provider = cmc.provider AND cp.enabled = TRUE
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Seed migration attachments from legacy bound config first

The backfill inserts attachments by joining every enabled provider in the same family, but it never reads the existing chat_model_configs.provider_config_id binding. On upgrade, models that were explicitly bound to a specific provider config can be reordered to a different first attachment (or attached to multiple configs they were never bound to), which changes runtime provider selection as soon as resolveLanguageModelForConfig starts using attachment priority.

Useful? React with 👍 / 👎.

Comment on lines +556 to +560
if trimmedAPIKey := strings.TrimSpace(dbAPIKey); trimmedAPIKey != "" {
keys.ByProvider[normalizedProvider] = trimmedAPIKey
switch normalizedProvider {
case fantasyopenai.Name:
keys.OpenAI = trimmedAPIKey
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Clear fallback API key when attached config has no key

This helper only overwrites keys when dbAPIKey is non-empty, so attachment resolution can inherit a family-level fallback key from another config in baseKeys. In resolveLanguageModelForConfig, that inherited key makes resolvedKeys.APIKey(...) non-empty, so an attachment with no own key is treated as usable instead of being skipped, breaking per-config fallback ordering and key isolation for same-provider attachments.

Useful? React with 👍 / 👎.

@ibetitsmike
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown
Member

@mafredri mafredri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Multi-subsystem PR: schema migration with backfill, Go API+runtime, SDK types, and React frontend. The join-table design for model-to-provider priority lists is sound, and the validation in validateProviderConfigIDs is well-structured. The test for ordered provider config listing (ListIncludesOrderedProviderConfigs) is thorough.

Severity count: 3 P0, 3 P1, 5 P2, 2 P3, plus notes.

The three P0s must be resolved before merge:

P0 (1): Provider deletion is broken (coderd/exp_chats.go:4691). Two tests fail on this branch: DeleteProviderSoftDeletesBoundModelsPreservingChatHistory and ReassignsDefaultAfterDeletingProviderOwnedDefault. SoftDeleteBoundChatModelConfigsByProviderConfigID filters on chat_model_configs.provider_config_id, which is now always NULL (the handlers write ProviderConfigID: uuid.NullUUID{} on every create and update). The query matches zero rows. Then DeleteChatProviderByID triggers ON DELETE CASCADE on the join table, silently removing attachment rows. The model survives with zero attachments, stays enabled and visible, but resolveLanguageModelForConfig returns "no provider configs attached" for every chat using it. Fix: rewrite the query to join through chat_model_provider_configs. (Netero P0, Hisoka P1, Kite P0, Knov P0)

P1 (related): SoftDeleteUnboundChatModelConfigsByProvider now matches ALL models (coderd/exp_chats.go:4680). The query filters provider_config_id IS NULL AND provider = @provider. Since provider_config_id is always NULL, every model in the family matches. When shouldCleanUnbound is true (last provider in family deleted), this soft-deletes all models in the family, including those with valid join-table attachments to disabled providers. Same root cause: the "unbound" concept should use the join table. (Netero P2, Kite P0)

P0 (2): dbcrypt bypass (see inline comment on chatmodelproviderconfigs.sql:10).

P0 (3): Backfill ignores existing bindings (see inline comment on migration).

The P1 cluster is also severe: the new resolveLanguageModelForConfig ignores all three provider credential policy flags (see inline comment on chatd.go:5428). 436 lines of tests that covered this behavior were deleted with no replacement.

Additional notes (P3 and below, grouped to avoid comment noise):

Dead SQL: GetEnabledChatModelConfigs still filters on the now-always-NULL provider_config_id column. The bound/unbound SQL branches are dead code. (Hisoka, Kite, Knuckle, Meruem)

DB query per message: resolveLanguageModelForConfig calls p.db.GetModelProviderConfigs per chat message, bypassing configCache. Creates a split: cached for user keys, uncached for provider attachments. (Netero, Meruem, Kite)

Naming: ProviderConfigs field holds []ChatModelProviderAttachment. DB model is ChatModelProviderConfig. Domain vocabulary is split. (Gon)

maps.Clone: ProviderKeysFromConfig hand-rolls map copies where maps.Clone was already used. (Ging-Go)

Prior review: #comment-2992212772 (auto-attach nondeterminism) remains open, code unchanged, no response. #comment-2992212775 (rollback unique index) is resolved.

Meruem: "The provider_config_id column on chat_model_configs is semantically dead, but three queries still use it. All three produce wrong results."

🤖 This review was automatically generated with Coder Agents.

cp.enabled AS provider_enabled,
cp.api_key AS provider_api_key,
cp.base_url AS provider_base_url
FROM
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 JOIN queries bypass dbcrypt decryption, sending encrypted ciphertext as LLM API keys in enterprise deployments. (Knuckle P0)

GetModelProviderConfigs and GetModelProviderConfigsByModelIDs SELECT cp.api_key via a JOIN with chat_providers. The enterprise/dbcrypt layer intercepts queries that return database.ChatProvider and decrypts APIKey. These new queries return custom row types with no dbcrypt wrapper.

Verified: enterprise/dbcrypt/dbcrypt.go has wrappers for GetChatProviderByID, GetEnabledChatProviders, InsertChatProvider, UpdateChatProvider, but none for GetModelProviderConfigs or GetModelProviderConfigsByModelIDs. The tests pass because dbtestutil.NewDB does not wrap with dbcrypt.

Two consequences: (1) resolveLanguageModelForConfig sends ciphertext as bearer tokens, every LLM call returns 401, chat is dead for enterprise+dbcrypt. (2) The ciphertext overwrites correctly resolved user keys in baseKeys.

Fix: either add dbcrypt wrappers for both queries (requires selecting cp.api_key_key_id in the JOIN), or stop selecting cp.api_key in the join query and load the decrypted key via the existing GetChatProviderByID when the runtime needs it.

🤖

CREATE INDEX idx_chat_model_provider_configs_order
ON chat_model_provider_configs (model_config_id, priority);

INSERT INTO chat_model_provider_configs (model_config_id, provider_config_id, priority)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Backfill ignores existing provider_config_id bindings, expanding all models to every enabled family provider. (Mafuuu P2, Knuckle P1, Knov P0)

The backfill joins chat_model_configs cmc JOIN chat_providers cp ON cp.provider = cmc.provider AND cp.enabled = TRUE. This matches ALL enabled providers in the family, regardless of whether cmc.provider_config_id pointed to one specific provider.

Scenario: two OpenAI configs (A created Jan 1, B created Feb 1). Model was bound to B. Backfill creates A at priority 0, B at priority 1. Runtime picks A. The model silently switches providers, potentially using a different API key and base URL.

Also, models bound to disabled providers get zero attachments (the JOIN filters cp.enabled = TRUE), making them permanently unusable. The old code fell back to family defaults.

Fix: distinguish bound from unbound models in the backfill. Bound models should get a single priority-0 attachment for their explicit binding. Unbound models get all enabled family providers.

🤖

p.providerAPIKeys,
database.ChatProvidersByFamilyPrecedence(providers),
userKeys,
resolvedKeys := chatprovider.ProviderKeysFromConfig(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Provider credential policy flags (CentralApiKeyEnabled, AllowUserApiKey, AllowCentralApiKeyFallback) are ignored at runtime. (Mafuuu P1, Meruem P1, Hisoka P2, Kite P2, Knov P2, Knuckle P2)

Six reviewers independently flagged this. ProviderKeysFromConfig(baseKeys, attachment.Provider, attachment.ProviderApiKey, attachment.ProviderBaseUrl) unconditionally overlays the DB-stored API key. The join query does not even SELECT central_api_key_enabled, allow_user_api_key, or allow_central_api_key_fallback.

The deleted resolveProviderAPIKeysForModel routed every provider through ResolveUserProviderKeys, which evaluated all three flags. The deleted TestResolveProviderAPIKeysForModel (436 lines) covered these scenarios. No replacement tests exist.

Concrete failures:

  • CentralApiKeyEnabled=false with a non-empty DB key: the key is used anyway.
  • AllowUserApiKey=true with a user key: the DB key overwrites the user key in baseKeys.
  • AllowUserApiKey=false: user keys leak through baseKeys from family-level resolution.

Fix: add the three policy columns to GetModelProviderConfigs and evaluate them per-attachment, or call ResolveUserProviderKeys per attachment.

🤖

@@ -5178,7 +5291,6 @@ func (api *API) updateChatModelConfig(rw http.ResponseWriter, r *http.Request) {
}

provider := existing.Provider
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Provider family change without provider_config_ids silently creates cross-family orphan attachments. (Mafuuu P2, Meruem P1, Kite P2, Kurapika P2, Knov P1)

The old code tracked providerChanged and auto-resolved a new provider binding. The new code removed providerChanged tracking entirely.

PATCH {"provider": "anthropic"} on an openai model without provider_config_ids: existing openai attachments remain. At runtime, resolveLanguageModelForConfig uses attachment.Provider ("openai") to build the SDK client with modelConfig.Model (possibly an anthropic model name). The model is silently broken.

Tests ProviderChangePreservesExistingAttachments and ProviderChangePreservesAttachmentsEvenWithNoMatchingProvider assert this as intentional. But the runtime consequence is not tested.

Fix: either require provider_config_ids when provider changes (reject with 400), or auto-delete stale attachments and auto-attach the new family's provider.

🤖

keys.ByProvider[normalizedProvider] = trimmedKey
if err != nil {
return nil, chatprovider.ProviderAPIKeys{}, xerrors.Errorf(
"create model: %w",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 ModelFromConfig error halts the fallback loop instead of continuing to the next attachment. (Netero P2, Chopper P2, Bisky P2)

The loop returns immediately when chatprovider.ModelFromConfig fails for the first enabled+keyed attachment. A misconfiguration in the highest-priority attachment stops the entire fallback chain.

The error message carries no identification of which provider config was attempted. An operator receiving "create model: invalid base URL" cannot tell which attachment out of three has the problem.

No test exercises this path. FallbackToSecondProvider tests disabled and keyless skipping but not SDK-construction failures.

Fix: log the per-attachment error at debug level with the provider config ID, continue to the next attachment. Return hard only after all attachments are exhausted.

🤖

req = { ...baseReq, provider_config_id: selectedProviderConfigID };
} else {
req = baseReq;
const currentIds = values.providerConfigIds;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Frontend silently swallows update when user removes all provider config attachments. (Chopper P2)

if (existingOrderedIds.length > 0 && currentIds.length === 0) {
    return;
}

The user can remove all attachments via the X buttons. When they click Save, this guard silently returns without sending a request, showing an error, or providing visual feedback. The backend would reject with 400 ("Provider config IDs cannot be empty"), but the request is never sent.

Fix: either disable the Save button when zero attachments remain, show a validation error, or let the request through so the backend's error message reaches the user.

🤖

@@ -0,0 +1 @@
DROP TABLE IF EXISTS chat_model_provider_configs;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Down migration is lossy: rolling back leaves all models permanently unbound. (Meruem P2, Kite P3)

The up migration backfills the join table from existing data. The runtime code sets provider_config_id = uuid.NullUUID{} on every insert/update. The down migration only drops the join table. After rollback, every model has provider_config_id = NULL. The three queries that depend on this column (GetEnabledChatModelConfigs, SoftDeleteBound..., SoftDeleteUnbound...) will treat all models as unbound.

Fix: restore provider_config_id from the priority-0 row in chat_model_provider_configs before dropping the table.

🤖

Priority: row.Priority,
DisplayName: row.ProviderDisplayName,
Enabled: row.ProviderEnabled,
HasAPIKey: row.ProviderApiKey != "",
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 HasAPIKey checks only the DB-stored key, diverging from the provider-level effectiveChatProviderConfigHasAPIKey logic. (Chopper P2)

convertProviderConfigAttachments sets HasAPIKey: row.ProviderApiKey != "". The provider listing endpoint uses effectiveChatProviderConfigHasAPIKey, which merges deployment-level keys (e.g., CODER_AI_OPENAI_API_KEY env var). A provider with no DB key but an env var shows HasAPIKey: false in the attachment, misleading the admin.

Additionally, ProviderKeysFromConfig uses strings.TrimSpace(dbAPIKey), so a whitespace-only DB key shows HasAPIKey: true here but is treated as empty at runtime. (Knov Note)

Fix: use the same effectiveChatProviderConfigHasAPIKey computation, or add a TrimSpace before the empty check.

🤖

}
return model, dbConfig, resolvedKeys, nil
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 resolveLanguageModelForConfig queries the DB per chat message, bypassing the config cache. (Netero P3, Meruem P3, Kite Note)

The old code used p.configCache.EnabledProviders and p.configCache.EnabledProviderByID. The new code calls p.db.GetModelProviderConfigs(ctx, modelConfig.ID) directly. This adds a DB round-trip per message. The attachment data is stable between config changes and would benefit from the existing configCache pattern.

This also creates a split: cached for user keys (via resolveUserProviderAPIKeys), uncached for provider attachments. Stale cache entries and fresh DB reads can produce contradictory state within a single resolution.

🤖

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 32ca1dd080

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

CompressionThreshold: compressionThreshold,
Options: modelConfigRaw,
ProviderConfigID: providerConfigID,
ProviderConfigID: uuid.NullUUID{},
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep provider_config_id populated until delete cleanup is migrated

Writing ProviderConfigID as uuid.NullUUID{} here breaks provider-deletion cleanup, because deleteChatProvider in the same file still soft-deletes bound models through SoftDeleteBoundChatModelConfigsByProviderConfigID (legacy chat_model_configs.provider_config_id). In a multi-config family, deleting one provider skips unbound cleanup, cascades away attachment rows, and leaves the model enabled but unattached; runtime resolution then fails with no provider configs attached. Keep the legacy column in sync (or migrate delete cleanup to the attachment table) before clearing it on create/update.

Useful? React with 👍 / 👎.

Comment on lines +5411 to +5412
if len(attachments) == 0 {
return nil, chatprovider.ProviderAPIKeys{}, xerrors.New("no provider configs attached")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow env-key fallback when a model has no provider attachments

This hard failure regresses env-only setups: the commit permits models with zero attachments (e.g., createChatModelConfig leaves resolvedProviderConfigIDs empty when no enabled DB provider exists, and migration 000466 backfills only from chat_providers rows), but resolveLanguageModelForConfig now immediately errors instead of using existing deployment/user key fallback. That makes previously valid env-backed models unavailable with no provider configs attached.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants