feat: prepare schema for multiple provider configs per family#24145
feat: prepare schema for multiple provider configs per family#24145ibetitsmike wants to merge 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Lines 4364 to 4368 in adfdefa
The new one-per-family guard is now a read-then-insert sequence, but the migration removes chat_providers_provider_key, so concurrent CreateChatProvider requests for the same family can both pass the count check and both insert rows. That violates the commit’s stated app-layer uniqueness guarantee and can lead to nondeterministic default-provider selection (oldest row wins) and inconsistent model binding behavior. This needs an atomic guard (e.g., transactional lock/advisory lock or a temporary DB uniqueness constraint) rather than a separate pre-check.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
johnstcn
left a comment
There was a problem hiding this comment.
This is looking way better! I think it just needs a merge or rebase as it's missing some changes on main.
| dbm.EXPECT().GetChatFileMetadataByChatID(gomock.Any(), file.ID).Return(rows, nil).AnyTimes() | ||
| check.Args(file.ID).Asserts(rbac.ResourceChat.WithOwner(file.OwnerID.String()).InOrg(file.OrganizationID).WithID(file.ID), policy.ActionRead).Returns(rows) | ||
| })) | ||
| s.Run("DeleteOldChatFiles", s.Mocked(func(dbm *dbmock.MockStore, _ *gofakeit.Faker, check *expects) { |
| for _, listed := range providers { | ||
| if listed.ID == siblingB.ID { | ||
| require.False(t, listed.HasEffectiveAPIKey, | ||
| "sibling with CentralAPIKeyEnabled=false must not leak effective key from keyed sibling") | ||
| return | ||
| } | ||
| } | ||
| t.Fatal("sibling B not found in listed providers") | ||
| }) |
There was a problem hiding this comment.
Might also be good to assert that the raw API key is never found in the response
| func TestChatProvidersByFamilyPrecedence(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| provider := func(id, family string, createdAt time.Time) ChatProvider { |
There was a problem hiding this comment.
nit, non-blocking: since you're just making IDs with all bytes, you could just pass in the raw integer, convert it to hex, and then fill with zeroes. Completely optional and non-blocking.
adfdefa to
64e4c6a
Compare
mafredri
left a comment
There was a problem hiding this comment.
Clean schema design for the multi-provider transition. The migration approach (nullable provider_config_id, backfill, soft-delete-before-hard-delete) is well-structured, and the three-branch GetEnabledChatModelConfigs query correctly handles bound, unbound-with-provider, and env-only model configs. The ChatProvidersByFamilyPrecedence helper with its deterministic sort and test coverage is solid. Good separation of new unit tests for pure functions into exp_chats_internal_test.go.
Two merge blockers: 9+ integration tests fail because the application-level uniqueness guard rejects the multi-provider scenario the tests exercise, and the dbauthz Accounting test fails with 5 missing method entries. Beyond those, a TOCTOU race in the uniqueness guard needs a DB-level backstop.
Severity count: 2 P0, 1 P1, 6 P2, 4 P3, plus notes.
P3 codersdk/chats.go:597-598 -- HasAPIKey and HasEffectiveAPIKey are undocumented and their difference is non-obvious. HasAPIKey = stored key on this config row. HasEffectiveAPIKey = usable key after merging deployment fallbacks, central-key policy, and stored key. Add field comments or rename to HasStoredAPIKey / HasEffectiveAPIKey. (Gon P3)
P3 coderd/database/queries/chatproviders.sql:9 -- GetChatProviderByProvider is :one without ORDER BY after the UNIQUE constraint was dropped. Multiple rows can now match; PostgreSQL returns whichever it finds first. Add ORDER BY created_at ASC, id ASC LIMIT 1 or remove the query. (Knuckle P3, Kite P3)
P3 coderd/exp_chats.go:4378 -- Dead IsUniqueViolation handler. The unique constraint on provider was dropped. This branch is unreachable and its message claims enforcement that no longer exists. (Razor P3)
Knov on the double-pointer API: "Go SDK consumers constructing this struct must reason about three states: nil, &(*uuid.UUID)(nil), and &&someUUID. The custom UnmarshalJSON correctly handles the JSON wire format, but programmatic Go callers face an unusual pattern."
🤖 This review was automatically generated with Coder Agents.
coderd/exp_chats_test.go
Outdated
| Provider: "openai", | ||
| DisplayName: "OpenAI No Central", | ||
| CentralAPIKeyEnabled: ptr.Ref(false), | ||
| AllowUserAPIKey: ptr.Ref(true), |
There was a problem hiding this comment.
P0 Tests that create multiple providers per family fail with 409 Conflict.
At least 9 new tests (CentralDisabledSiblingDoesNotLeakEffectiveKey, ReportsHasAPIKeyPerConfig, ReportsPerConfigEffectiveAPIKeyWhenLaterSiblingStoresKey, EnableGuardrail, FlipEnabled, ResponseReportsPerConfigEffectiveAPIKey, AutoBindsToOldestEnabledProvider, RebindProviderConfig, ProviderConfigIDDisabled) call client.CreateChatProvider twice for the same family. The application-level uniqueness guard at exp_chats.go:4357 returns 409 on the second call. Multiple reviewers verified by running them.
The tests that pass (like MixedProviderConfigsKeepProviderAvailableWhenAnyConfigIsRunnable) work because they insert the second provider via store.InsertChatProvider, bypassing the HTTP guard. The failing tests need the same pattern since multi-provider is a schema concern in this PR, not an API concern yet. (Knuckle P0, Meruem P0, Kurapika P0, Bisky P0, Razor P0)
🤖
| dbm.EXPECT().GetChatFileMetadataByChatID(gomock.Any(), file.ID).Return(rows, nil).AnyTimes() | ||
| check.Args(file.ID).Asserts(rbac.ResourceChat.WithOwner(file.OwnerID.String()).InOrg(file.OrganizationID).WithID(file.ID), policy.ActionRead).Returns(rows) | ||
| })) | ||
| s.Run("DeleteOldChatFiles", s.Mocked(func(dbm *dbmock.MockStore, _ *gofakeit.Faker, check *expects) { |
There was a problem hiding this comment.
P0 Removing these four test cases (DeleteOldChatFiles, DeleteOldChats, GetChatRetentionDays, UpsertChatRetentionDays) breaks TestMethodTestSuite/Accounting.
The dbauthz wrappers and querier interface methods still exist. The accounting framework requires every method to be tested. Additionally, GetChatProviderByProvider was renamed to GetEnabledChatProviderByProvider but the old wrapper at dbauthz.go:2701 still exists and now has no test coverage.
These removals are unrelated to the multi-provider schema change. Restore them. (Mafuuu P0, Bisky P0, Razor P0, Kurapika P2, Knov P2)
🤖
| // Enforce single provider per family. The unique DB constraint was | ||
| // dropped to prepare the schema for multi-provider support, but | ||
| // application-level uniqueness is preserved until that feature ships. | ||
| existingCount, err := api.Database.CountChatProvidersByProviderExcludingID(ctx, |
There was a problem hiding this comment.
P1 TOCTOU race: the uniqueness check and insert are separate non-transactional operations.
CountChatProvidersByProviderExcludingID (here) and InsertChatProvider (line 4364) execute as auto-commit statements. Two concurrent admin requests for the same family can both observe existingCount == 0 and both insert. The old UNIQUE(provider) constraint was the real serialization point; it's now dropped.
The consequence is duplicate providers in a family with no database backstop. Every downstream query (ChatProvidersByFamilyPrecedence, GetEnabledChatProviderByProvider, shouldCleanUnboundModelsAfterProviderDelete) silently picks one and ignores the other. The IsUniqueViolation handler at line 4378 is also dead code now.
Fix options:
- Wrap count + insert in
InTx - Add a partial unique index:
CREATE UNIQUE INDEX idx_chat_providers_one_per_family ON chat_providers (provider)to be dropped explicitly in Part 2
Option 2 is better because it uses the same mechanism as the old schema and doesn't depend on isolation level semantics. (Knuckle P1, Meruem P0, Kurapika P1, Kite P2, Mafuuu P1, Knov P1, Razor P2)
🤖
There was a problem hiding this comment.
Round 2 of silence. Count+insert at lines 4345/4364 still runs outside a transaction. No code change, no response, no ticket. The TOCTOU race remains: two concurrent admin requests for the same family can both pass the count check and both insert, producing duplicate providers.
🤖
| _, hasEffective := chatProviderConfigKeyBooleans(provider, defaultKeys) | ||
| return provider.AllowUserApiKey || hasEffective | ||
| } | ||
| return slices.DeleteFunc(configs, func(config database.ChatModelConfig) bool { |
There was a problem hiding this comment.
P2 filterUserVisibleChatModelConfigs mutates its input slice via slices.DeleteFunc.
slices.DeleteFunc modifies the backing array in place (shifts elements, zeroes tail). The caller at line 596 passes visibleModels, which aliases enabledModels. Currently safe because enabledModels is not read after the filter, but the function signature does not communicate the mutation. Any future caller retaining a reference to the original slice will get corrupted data.
Suggestion: clone the input inside the function (like ChatProvidersByFamilyPrecedence already does) or replace with a collect loop that builds a new slice. (Meruem P2, Knov P2, Kurapika Note, Kite Note, Razor Note)
🤖
There was a problem hiding this comment.
Round 2 of silence. slices.DeleteFunc at line 5577 still mutates the input slice in place. No code change, no response.
🤖
| CentralApiKeyEnabled: centralAPIKeyEnabled, | ||
| }, uuid.Nil), | ||
| ); err != nil { | ||
| if centralAPIKeyEnabled && strings.TrimSpace(req.APIKey) == "" { |
There was a problem hiding this comment.
P2 Behavior change: centralAPIKeyEnabled now requires a directly stored API key, rejecting configurations that previously worked with deployment fallback keys.
Old behavior: validateChatProviderCentralAPIKey called hasEffectiveCentralProviderAPIKey, which checked the stored key, other enabled providers, and deployment env keys. New behavior at this line: centralAPIKeyEnabled && strings.TrimSpace(req.APIKey) == "" rejects if no stored key exists on this row.
Consequence: an admin who configured a provider with central_api_key_enabled: true and no stored key (relying on OPENAI_API_KEY env var) can no longer create or update that provider without providing a key. Runtime chat still works (effective key resolution is unchanged), but the admin path is blocked. Same pattern at line 4513 for update. (Mafuuu P2)
🤖
There was a problem hiding this comment.
Round 2 of silence. Validation at line 4335 still rejects centralAPIKeyEnabled without a stored key, even when a deployment fallback key exists. No code change, no response.
🤖
| httpapi.Write(ctx, rw, http.StatusOK, response) | ||
| } | ||
|
|
||
| func resolveVisibleProviderAvailability( |
There was a problem hiding this comment.
P2 resolveVisibleProviderAvailability, mergeProviderAvailability, filterUserVisibleChatModelConfigs, and runtimeProvidersForVisibleModels are new pure functions with no unit tests.
These have non-trivial logic (switch cases, map lookups, multi-source filtering). They're exercised indirectly through integration tests, but many of those integration tests are currently failing (P0 above), so there's effectively zero verified coverage. Edge cases worth testing: bound config with disabled provider, unbound config with no providers, merge of two "fetch failed" availabilities. (Bisky P2)
🤖
There was a problem hiding this comment.
Round 2 of silence. 4 of 7 pure functions remain without unit tests (resolveVisibleProviderAvailability, mergeProviderAvailability, filterUserVisibleChatModelConfigs, runtimeProvidersForVisibleModels). 3 were covered in round 1. No further changes.
🤖
| "github.com/coder/coder/v2/coderd/x/chatd/chatprovider" | ||
| ) | ||
|
|
||
| func TestShouldCleanUnboundModelsAfterProviderDelete(t *testing.T) { |
There was a problem hiding this comment.
P2 TestShouldCleanUnboundModelsAfterProviderDelete is missing the "enabled sibling remains" case.
Three cases tested: last-enabled-no-fallback (true), fallback-exists (false), disabled-deleted (false). Missing: delete an enabled provider when another enabled sibling remains in the family. Expected: false. This is the normal operating scenario. Without it, a regression that always returns true would not be caught.
Sketch: remainingTotal=1, enabledProviders containing both the deleted and a sibling, empty fallback. Assert false. (Bisky P1)
🤖
There was a problem hiding this comment.
Round 2 of silence. The "enabled sibling remains" test case is still missing. No code change, no response.
🤖
| return q.db.SelectUsageEventsForPublishing(ctx, arg) | ||
| } | ||
|
|
||
| func (q *querier) SoftDeleteBoundChatModelConfigsByProviderConfigID(ctx context.Context, providerConfigID database.SoftDeleteBoundChatModelConfigsByProviderConfigIDParams) (int64, error) { |
There was a problem hiding this comment.
P3 Parameter providerConfigID is a params struct, not a UUID. Same issue at line 5725 where provider is a params struct. Call these arg like every other dbauthz wrapper in the file. (Gon P3)
🤖
| } | ||
|
|
||
| for _, tc := range cases { | ||
| tc := tc |
There was a problem hiding this comment.
P3 tc := tc loop variable shadow is unnecessary since Go 1.22. Same at line 181. (Ging-Go P3, Bisky Note)
🤖
| []chatprovider.ConfiguredProvider, 0, len(enabledProviders), | ||
| []chatprovider.ConfiguredProvider, 0, len(visibleProviders), | ||
| ) | ||
| enabledProviderNames := make(map[string]struct{}, len(enabledProviders)) |
There was a problem hiding this comment.
P3 enabledProviderNames is populated from the unfiltered enabledProviders list, not visibleProviders. For non-admin users, this map is broader than what they can see. The configuredProviders and configuredModels are already scoped to visible entries, so the broader map doesn't leak data in the normal path. However, in the fallback path (ListConfiguredProviderAvailability, when len(configuredModels) == 0), the broader map could surface providers the user shouldn't see. (Mafuuu P3)
🤖
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64e4c6a700
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| existingCount, err := api.Database.CountChatProvidersByProviderExcludingID(ctx, | ||
| database.CountChatProvidersByProviderExcludingIDParams{ | ||
| Provider: provider, | ||
| ID: uuid.Nil, | ||
| }) |
There was a problem hiding this comment.
Make provider-family uniqueness check atomic
The one-per-family guard is now vulnerable to races: this code does CountChatProvidersByProviderExcludingID and then inserts in separate statements without a transaction/lock, and the database unique constraint on chat_providers.provider was removed in the same change. Two concurrent CreateChatProvider requests for the same family can both observe existingCount == 0 and both insert, which breaks the stated single-config guarantee and can produce nondeterministic precedence behavior.
Useful? React with 👍 / 👎.
| return nil | ||
| } | ||
|
|
||
| // GetChatRetentionDays returns the configured chat retention period. | ||
| func (c *ExperimentalClient) GetChatRetentionDays(ctx context.Context) (ChatRetentionDaysResponse, error) { | ||
| res, err := c.Request(ctx, http.MethodGet, "/api/experimental/chats/config/retention-days", nil) | ||
| if err != nil { | ||
| return ChatRetentionDaysResponse{}, err | ||
| } | ||
| defer res.Body.Close() | ||
| if res.StatusCode != http.StatusOK { | ||
| return ChatRetentionDaysResponse{}, ReadBodyAsError(res) | ||
| } | ||
| var resp ChatRetentionDaysResponse | ||
| return resp, json.NewDecoder(res.Body).Decode(&resp) | ||
| } | ||
|
|
||
| // UpdateChatRetentionDays updates the chat retention period. | ||
| func (c *ExperimentalClient) UpdateChatRetentionDays(ctx context.Context, req UpdateChatRetentionDaysRequest) error { | ||
| res, err := c.Request(ctx, http.MethodPut, "/api/experimental/chats/config/retention-days", req) | ||
| if err != nil { | ||
| return err | ||
| } | ||
| defer res.Body.Close() | ||
| if res.StatusCode != http.StatusNoContent { | ||
| return ReadBodyAsError(res) | ||
| } | ||
| return nil | ||
| } | ||
|
|
||
| // GetChatTemplateAllowlist returns the deployment-wide chat template allowlist. | ||
| func (c *ExperimentalClient) GetChatTemplateAllowlist(ctx context.Context) (ChatTemplateAllowlist, error) { |
There was a problem hiding this comment.
Restore retention-days client methods
This hunk removes GetChatRetentionDays and UpdateChatRetentionDays from ExperimentalClient while the retention request/response types and server endpoints still exist. That is a silent SDK surface regression for downstream callers upgrading this commit: existing code that previously managed chat retention through the SDK will no longer compile or can no longer call those endpoints via typed helpers.
Useful? React with 👍 / 👎.
| requireSDKError(t, err, http.StatusBadRequest) | ||
| } | ||
|
|
||
| func TestChatRetentionDays(t *testing.T) { |
|
🤖 A couple of additional findings from a deep review: Missing "enabled sibling remains" test case for
|
mafredri
left a comment
There was a problem hiding this comment.
Round 2 fixes address the two P0 merge blockers: tests now use insertProviderConfig to bypass the API uniqueness guard, and all dbauthz test cases are restored. Good.
13 findings from round 1 remain open with no author response (round 2 of silence). No code changes, no comments, no tickets. The TOCTOU race (#3 P1) is the most significant: the unique DB constraint is dropped but the application-level replacement is non-transactional. Remaining P2s and P3s are unchanged.
3 new findings from round 2 reviewers, plus 1 new P3.
Severity count (open): 2 P1, 8 P2, 7 P3.
Hisoka on the dbcrypt trap: "GetChatProviderByProvider is now a function that returns an arbitrary row with an encrypted API key. A beautiful trap."
🤖 This review was automatically generated with Coder Agents.
| }) | ||
|
|
||
| t.Run("UpsertRejectsProviderWithoutUserKeys", func(t *testing.T) { | ||
| t.Run("CreateSecondEnabledDifferentFamily", func(t *testing.T) { |
There was a problem hiding this comment.
P1 Five subtests inside TestUserChatProviderConfigs are character-identical copies of subtests in TestCreateChatProvider/TestUpdateChatProvider and test admin operations, not user provider key behavior.
CreateSecondEnabledDifferentFamily (line 3105), MultipleProviderFamilies (line 3128), ResponseReportsPerConfigEffectiveAPIKey (line 3152), EnableGuardrail (line 3183), and FlipEnabled (line 3218) all use client.CreateChatProvider and client.UpdateChatProvider (admin APIs). They duplicate the same-named subtests under TestCreateChatProvider and TestUpdateChatProvider. They prove nothing new about user provider config behavior.
Four user-scoped tests were deleted without replacement: UpsertRejectsProviderWithoutUserKeys, UpsertRejectsEmptyAPIKey, DeleteRemovesKey, OtherUserDoesNotSeeKey. These covered rejection of user key upsert when the provider disallows it, empty key rejection, key deletion, and cross-user isolation. These behaviors are enforced by the API but no longer tested. (Bisky P1)
🤖
|
|
||
| func (db *dbCrypt) GetChatProviderByProvider(ctx context.Context, providerName string) (database.ChatProvider, error) { | ||
| provider, err := db.Store.GetChatProviderByProvider(ctx, providerName) | ||
| func (db *dbCrypt) GetEnabledChatProviderByProvider(ctx context.Context, providerName string) (database.ChatProvider, error) { |
There was a problem hiding this comment.
P2 GetChatProviderByProvider lost its dbcrypt decryption wrapper when this was renamed to GetEnabledChatProviderByProvider.
The SQL query GetChatProviderByProvider still exists in queries/chatproviders.sql:9 and the querier interface. Calls through a dbcrypt-wrapped store now bypass decryption and return the raw encrypted API key ciphertext. Combined with the non-deterministic :one without ORDER BY (prior #10), this query returns an arbitrary row with encrypted garbage.
Currently only tests call it. But the function is on the Store interface with no compile-time guard against production use. Either add a dbcrypt wrapper for the surviving query, or delete the query entirely. (Hisoka P2, Kite P2)
🤖
Drop the unique constraint on chat_providers.provider and add provider_config_id to chat_model_configs. This prepares the database for multiple provider configs per family while keeping one-per-family enforcement at the application layer. Schema changes: - Drop unique constraint on chat_providers.provider - Add provider_config_id column to chat_model_configs with backfill - Add indexes for provider_config_id and provider lookups Backend changes: - Support provider_config_id binding in model config CRUD - Add provider-enablement filtering to GetEnabledChatModelConfigs - Soft-delete model configs on provider deletion (with updated_by) - Add application-level uniqueness guard for provider creation - Add ensureDefaultChatModelConfig call to deleteChatProvider - Update dbauthz, dbmetrics, dbmock, dbcrypt wiring Runtime and frontend changes for full multi-provider support will follow in a separate PR.
64e4c6a to
c010be7
Compare
Summary
Prepare the database schema to support multiple provider configs per provider family while keeping one-per-family enforcement at the application layer. This is Part 1 of the multi-provider configs feature (split from #23647 per review feedback).
Changes
Schema:
chat_providers.providerprovider_config_idcolumn tochat_model_configswith backfillprovider_config_idand provider lookupsBackend:
provider_config_idbinding in model config CRUDGetEnabledChatModelConfigsupdated_byattribution)ensureDefaultChatModelConfigcall todeleteChatProviderhas_api_key/has_effective_api_keysemantics across endpointsPart 2 (runtime key resolution via
provider_config_idand frontend UI) will follow in a separate PR once this merges.Related: #23647