feat(enterprise): hot-reload aibridged provider pool from DB on pubsub #24897

Closed
dannykopping wants to merge 1 commit into graphite-base/24897 from dk/aibridge-providers-pool-reload

@dannykopping dannykopping commented May 1, 2026

Disclaimer: implemented by a Coder Agent using Claude Opus 4.7

Part of the implementation of RFC: Common AI Provider Configs (AIGOV-201).

What this PR does

Switches the in-memory aibridged daemon from a static, env-derived provider list to a database-backed list that hot-reloads via pubsub. After this PR:

  • aibridged loads providers from ai_providers at startup (system actor, dbauthz-gated).
  • The AI provider CRUD handlers publish on ai_providers_changed after every successful Insert/Update/SoftDelete on a provider or key.
  • Each replica subscribes to that channel and triggers aibridged.Server.Reload, which atomically swaps the providers slice on the pool and clears the cached RequestBridge instances.
  • In-flight requests continue against their existing RequestBridge until completion; the cache's OnEvict shutdown closes MCP connections in the background after the existing 5-second grace period.

The proxy daemon is intentionally NOT reloaded yet to keep this PR focused; it still receives the boot-time provider snapshot. A follow-up will introduce a Pooler interface for the proxy and mirror this pattern.

Pubsub channel

The channel name (ai_providers_changed, exported as coderd.AIProvidersChangedChannel) is provider-generic, not aibridge-specific: any future consumer of ai_providers rows can subscribe. Today aibridged is the only subscriber.

Pool changes

  • CachedBridgePool stores providers via atomic.Pointer[[]Provider] instead of a fixed slice.
  • New Reload(providers) method on the Pooler interface that atomically swaps the snapshot, calls cache.Clear, and then calls cache.Wait so buffered writes drain and a subsequent Acquire always sees the cleared state.
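
A minimal sketch of the swap-then-clear sequence, assuming hypothetical Provider, Bridge, and Pool types (not the real CachedBridgePool). A mutex-guarded map stands in for the ristretto cache, so the cache.Clear plus cache.Wait pair collapses into a single synchronous step here:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Provider and Bridge are hypothetical stand-ins for the real types.
type Provider struct{ Name string }

type Bridge struct{ Providers []Provider }

// Pool sketches the reloadable pool. The map below stands in for the buffered
// cache; it is an assumption made to keep the example self-contained.
type Pool struct {
	providers atomic.Pointer[[]Provider]
	closed    atomic.Bool

	mu      sync.Mutex
	bridges map[string]*Bridge
}

func NewPool(providers []Provider) *Pool {
	p := &Pool{bridges: map[string]*Bridge{}}
	snapshot := append([]Provider(nil), providers...) // defensive copy
	p.providers.Store(&snapshot)
	return p
}

// Reload publishes a new snapshot, then drops every cached bridge so the next
// Acquire rebuilds against it. It does nothing once the pool is shut down.
func (p *Pool) Reload(providers []Provider) {
	if p.closed.Load() {
		return
	}
	snapshot := append([]Provider(nil), providers...) // defensive copy
	p.providers.Store(&snapshot)

	p.mu.Lock()
	p.bridges = map[string]*Bridge{} // stand-in for cache.Clear followed by cache.Wait
	p.mu.Unlock()
}

// Acquire returns the cached bridge for key, building one from the current
// snapshot on a miss.
func (p *Pool) Acquire(key string) *Bridge {
	p.mu.Lock()
	defer p.mu.Unlock()
	if b, ok := p.bridges[key]; ok {
		return b
	}
	b := &Bridge{Providers: *p.providers.Load()}
	p.bridges[key] = b
	return b
}

func (p *Pool) Shutdown() { p.closed.Store(true) }

func main() {
	pool := NewPool([]Provider{{Name: "openai"}})
	inFlight := pool.Acquire("user-1")

	pool.Reload([]Provider{{Name: "anthropic"}})
	fresh := pool.Acquire("user-1")

	fmt.Println(inFlight.Providers[0].Name, fresh.Providers[0].Name)
}
```

Running it prints `openai anthropic`: the bridge acquired before Reload keeps its original snapshot, which mirrors how in-flight requests finish against their existing RequestBridge, while the next Acquire builds against the new set. The sketch does not try to order a Reload that races with Shutdown.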

Wire-up

  • enterprise/cli/server.go now calls loadProvidersFromDB(ctx, db, cfg) instead of buildProviders(cfg). The legacy env-driven buildProviders is preserved for the proxy daemon path until the proxy reload follow-up lands.
  • After the daemon is started, the CLI subscribes to coderd.AIProvidersChangedChannel. Each notification re-loads providers from the database and calls aibridgeDaemon.Reload(...).

Tests

  • TestPoolReload: prime the cache, call Reload, and observe that the next Acquire is a fresh build (KeysAdded=1, Misses=1 against the post-Reload zeroed metrics).
  • TestPoolReloadAfterShutdown: Reload is a safe no-op after Shutdown.
  • TestAIProvidersPubsubPublish / TestAIProviderKeysPubsubPublish: end-to-end checks that each handler mutation publishes a notification on the providers-changed channel.
  • All existing CRUD tests (TestAIProvidersCRUD, TestAIProviderKeysCRUD) still pass with the new publish hook in the handlers.

Decision log

  • The publish payload is empty by design: receivers re-query the database to get the new state. This keeps the channel agnostic to dbcrypt-key changes and avoids carrying secrets on the bus.
  • The channel is named ai_providers_changed rather than aibridge_providers_changed because the rows it announces are not aibridge-specific. Subscribers are currently aibridged only, but the contract does not assume that.
  • cache.Wait() after cache.Clear() is required because ristretto's set/clear operations are buffered. Without it, a Reload immediately followed by an Acquire could see the old (cached) bridge.
  • I considered having aibridged own the pubsub subscription, but the package currently has no database/pubsub dependencies. Wiring the subscription in enterprise/cli/server.go keeps the daemon's interface narrow (Reload(providers)) and matches how other daemon lifecycles are managed.
  • The mockgen-generated MockPooler was regenerated rather than hand-edited so future regen passes are deterministic.
  • The proxy daemon path is intentionally left on the boot-time snapshot in this PR. Threading the pool reload through the proxy adds another moving part and is best handled in its own PR.

@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 44db962 to ec333ba Compare May 1, 2026 18:34
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from e6263e1 to c0ca02a Compare May 1, 2026 18:34
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from ec333ba to 87d51ac Compare May 1, 2026 19:24
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from c0ca02a to 9d381bb Compare May 1, 2026 19:24
@dannykopping dannykopping changed the title feat(enterprise/aibridged): hot-reload provider pool from DB on pubsub feat(enterprise): hot-reload aibridged provider pool from DB on pubsub May 1, 2026
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 87d51ac to 847aded Compare May 13, 2026 15:10
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 9d381bb to ebe85d2 Compare May 13, 2026 15:19
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 847aded to 8cb5a08 Compare May 14, 2026 10:57
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from ebe85d2 to f50344a Compare May 14, 2026 10:57
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from f50344a to fe2d5e5 Compare May 14, 2026 11:23
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch 2 times, most recently from eb0f556 to 99f7de6 Compare May 14, 2026 11:41
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from fe2d5e5 to 42d6c00 Compare May 14, 2026 11:41
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 99f7de6 to 9085404 Compare May 14, 2026 13:20
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch 2 times, most recently from d57f04a to 5d9e052 Compare May 14, 2026 13:38
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch 2 times, most recently from 190628c to 2502c3c Compare May 14, 2026 14:13
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 5d9e052 to 59d051d Compare May 14, 2026 14:13
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 2502c3c to 9ac5fc7 Compare May 15, 2026 11:39
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 59d051d to 0268595 Compare May 15, 2026 11:39
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 9ac5fc7 to d5c5710 Compare May 15, 2026 13:16
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from c05f349 to 2c8d709 Compare May 19, 2026 11:39
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 2f0543c to 3fc3676 Compare May 19, 2026 12:52
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 2c8d709 to a229e16 Compare May 19, 2026 12:52
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch from 3fc3676 to 8f98de6 Compare May 19, 2026 13:03
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch 3 times, most recently from bbe52ac to f21ec02 Compare May 20, 2026 08:39
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch 2 times, most recently from d2c57f1 to be49026 Compare May 20, 2026 12:23
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from f21ec02 to 771c1ba Compare May 20, 2026 12:23
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch 2 times, most recently from d148e00 to 75b4a11 Compare May 20, 2026 12:45
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 771c1ba to 8c1140e Compare May 20, 2026 12:45
@dannykopping dannykopping changed the base branch from dk/aibridge-providers-env-migration to graphite-base/24897 May 20, 2026 13:15
@dannykopping dannykopping force-pushed the graphite-base/24897 branch from 75b4a11 to c3cde94 Compare May 20, 2026 15:05
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 8c1140e to 0b18288 Compare May 20, 2026 15:05
@dannykopping dannykopping changed the base branch from graphite-base/24897 to dk/aibridge-providers-env-migration May 20, 2026 15:06
@dannykopping dannykopping force-pushed the dk/aibridge-providers-env-migration branch 2 times, most recently from 7eac2f3 to d1abba1 Compare May 20, 2026 15:34
@dannykopping dannykopping force-pushed the dk/aibridge-providers-pool-reload branch from 0b18288 to 3d87f3f Compare May 20, 2026 15:34
… on pubsub

Switches the in-memory aibridged daemon from a static, env-derived
provider list to a database-backed list that hot-reloads via pubsub.
After this PR:

  - aibridged loads providers from ai_providers at startup (system
    actor, dbauthz-gated) and joins them with ai_provider_keys to
    pick the operator-preferred primary key (first by created_at).
  - Non-Bedrock providers with zero ai_provider_keys are skipped
    with a warning; Bedrock providers always have zero keys and
    authenticate via the encrypted settings blob (AWS access key +
    secret).
  - The CRUD handlers from the previous PR publish on
    'ai_providers_changed' after every successful Insert/Update/
    SoftDelete of a provider AND after every Insert/Delete of a
    key, because key changes alone affect the runtime pool.
  - Each replica subscribes to that channel and triggers
    aibridged.Server.Reload, which atomically swaps the providers
    slice on the pool and clears the cached RequestBridge instances.
  - In-flight requests continue against their existing
    RequestBridge until completion; the cache's OnEvict shutdown
    closes MCP connections in the background after a 5-second grace
    period.

The proxy daemon is intentionally NOT reloaded yet to keep this PR
focused; it still receives the boot-time provider snapshot. A
follow-up will introduce a Pooler interface for the proxy and mirror
this pattern.

Pool changes:
  - CachedBridgePool stores providers via atomic.Pointer[[]Provider]
    instead of a fixed slice.
  - New Reload(providers) method on the Pooler interface that
    atomically swaps the snapshot, calls cache.Clear, and waits for
    buffered writes to drain so a subsequent Acquire always sees the
    new set.

Tests:
  - TestPoolReload covers the happy path: build a pool, acquire a
    bridge, Reload, ensure the next Acquire targets the new provider
    set.
  - TestPoolReloadAfterShutdown ensures Reload is a no-op post-Close
    so a stale subscriber notification cannot resurrect a torn-down
    pool.
  - TestAIProvidersPubsubPublish exercises the producer side: each
    of Insert/Update/Delete on a provider emits a notification on
    AIBridgeProvidersChangedChannel.
  - TestAIProviderKeysPubsubPublish does the same for the keys
    sub-resource (Insert and Delete).