feat(coderd/aibridged): fetch providers over DRPC (GetAIProviders) #26650

Draft
dannykopping wants to merge 1 commit into
pawel/aigov-315-implement-basic-coder-aibridge-start-sub-command from
dk/provider-drpc
@dannykopping

AIGOV-455: Extend DRPC with provider fetch (GetAIProviders)

Context

We are splitting aibridged (the AI Gateway) into a standalone process that
must not access the database directly. coderd remains the source of truth:
it seeds the ai_providers / ai_provider_keys tables from the environment
(SeedAIProvidersFromEnv, holding LockIDAIProvidersEnvSeed). The standalone
gateway has no provider env vars and no DB access, so it must fetch provider
configuration from coderd over DRPC. While seeding is in flight, the gateway's
fetch must synchronize on the seed lock so it does not race and observe a partial
snapshot.

AIGOV-465 (publish a provider-seed completion signal so the gateway can refresh)
is a follow-up. This issue is initialization-only; refresh for the standalone
gateway is out of scope here.

Already done on the current branch (pawel/aigov-315-...)

  • Transport: websocket dialer (coderd/aibridged/dialer.go),
    /api/v2/aibridge/serve endpoint (enterprise/coderd/aibridgeserve.go),
    yamux + DRPC, AI Gateway key auth, proto version negotiation.
  • buildProvider(aiProviderSpec, cfg, metrics) is refactored to be DB-neutral
    (cli/aibridged.go) and is reusable as-is for an RPC response.
  • aibridged.Server.Client() blocks until connected - a natural hook for
    "fetch providers after connect".
  • The standalone gateway currently builds providers from its own env
    (BuildProvidersFromConfig / ReadAIProvidersFromEnv). This is temporary
    and is removed by this issue.

Decisions

#  | Decision                | Choice
---+-------------------------+--------------------------------------------------------------------------
1  | RPC payload             | Raw provider rows + decrypted keys; client builds aibridge.Provider
2  | Seed-race sync          | Handler opens read tx, AcquireLock(LockIDAIProvidersEnvSeed), then reads
3  | Embedded vs standalone  | Both init via RPC; embedded keeps pubsub as the refresh trigger
4  | Embedded wiring         | Create srv first, async initial reload, drop the boot-time pre-build
5  | Proto shape             | New ProviderConfigurator service, unary GetAIProviders
6  | Settings representation | Explicit typed AIProviderBedrock proto message
7  | Version                 | Additive change, bump CurrentMinor to 1
8  | Init robustness         | Retry until success; an empty result is valid
9  | Disabled providers      | Include disabled + enabled flag; keys only for enabled
10 | Lock test               | Real-Postgres concurrency test
11 | Dead code               | Remove BuildProvidersFromConfig + ProvidersFromConfig in this PR

Design

Proto (coderd/aibridged/proto/aibridged.proto, then make gen)

// ProviderConfigurator serves AI provider configuration to embedded and
// standalone AI Gateway daemons. The database is the single source of truth;
// coderd seeds it from the environment.
service ProviderConfigurator {
  // GetAIProviders returns the full provider set (enabled and disabled).
  // The server blocks on LockIDAIProvidersEnvSeed so the response is never a
  // partial, mid-seed snapshot.
  rpc GetAIProviders(GetAIProvidersRequest) returns (GetAIProvidersResponse);
  // AIGOV-465 later adds: rpc WatchAIProviders(...) returns (stream ...);
}

message GetAIProvidersRequest {}

message GetAIProvidersResponse {
  repeated AIProvider providers = 1;
}

message AIProvider {
  string name = 1;
  string type = 2;
  bool enabled = 3;
  string base_url = 4;
  // keys carries bearer API keys, populated only for enabled providers.
  repeated string keys = 5;
  // bedrock is set only when the provider authenticates via AWS Bedrock.
  AIProviderBedrock bedrock = 6;
}

message AIProviderBedrock {
  string region = 1;
  string access_key = 2;
  string access_key_secret = 3;
  string model = 4;
  string small_fast_model = 5;
}

version.go: bump CurrentMinor to 1; extend the version-history comment.
The version gate prevents a v1.1 gateway (which needs GetAIProviders) from
connecting to a v1.0 coderd that lacks it, while an old v1.0 gateway still works
against a new coderd.
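
The compatibility rule described above can be illustrated with a minimal sketch. It assumes the gate compares a major/minor pair and accepts a client only when the majors match and the client's minor does not exceed the server's. The apiVersion type and compatible function are hypothetical names for illustration, not the actual version.go code.

```go
package main

import "fmt"

// apiVersion is a hypothetical stand-in for the negotiated proto version.
type apiVersion struct {
	Major int
	Minor int
}

// compatible reports whether a server can serve a client: the majors must
// match, and the client must not need a newer minor than the server offers.
// This rule is an assumption about how the gate behaves.
func compatible(server, client apiVersion) bool {
	return server.Major == client.Major && client.Minor <= server.Minor
}

func main() {
	v10 := apiVersion{Major: 1, Minor: 0}
	v11 := apiVersion{Major: 1, Minor: 1}

	// A v1.1 gateway needs GetAIProviders, which a v1.0 coderd lacks.
	fmt.Println("v1.1 gateway, v1.0 coderd:", compatible(v10, v11))
	// An old v1.0 gateway never calls GetAIProviders, so a v1.1 coderd is fine.
	fmt.Println("v1.0 gateway, v1.1 coderd:", compatible(v11, v10))
}
```

Under this rule an additive change only needs a minor bump; removing or changing an RPC would call for a new major.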

Server (coderd/aibridgedserver)

  • Add the GetAIProviders handler:
    • Open a read-only InTx.
    • AcquireLock(LockIDAIProvidersEnvSeed) so the handler waits for any in-flight
      seed transaction to commit/rollback before reading.
    • GetAIProviders{IncludeDisabled: true} + GetAIProviderKeysByProviderIDs
      for the enabled provider IDs only.
    • Map rows to the proto messages (keys only for enabled; typed Bedrock).
    • Run under dbauthz.AsAIBridged (same subject/permissions that
      cli.BuildProviders uses today).
    • Do not log the response struct (it carries plaintext keys / Bedrock secrets).
  • Extend the server's narrow store interface with InTx, AcquireLock,
    GetAIProviders, GetAIProviderKeysByProviderIDs.
  • Register ProviderConfigurator in register.go.
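
The handler steps above can be sketched roughly as follows. This is an in-memory illustration only: memStore, providerRow, aiProvider, and the provider names are hypothetical, and a sync.Mutex stands in for the Postgres advisory lock LockIDAIProvidersEnvSeed, which the real handler would take inside a read-only transaction.

```go
package main

import (
	"fmt"
	"sync"
)

// providerRow and aiProvider are hypothetical stand-ins for the database
// row and the proto message.
type providerRow struct {
	ID      int
	Name    string
	Enabled bool
}

type aiProvider struct {
	Name    string
	Enabled bool
	Keys    []string
}

// memStore fakes the narrow store interface. seedLock stands in for the
// advisory lock; a seeder would hold it while writing rows and keys.
type memStore struct {
	seedLock sync.Mutex
	rows     []providerRow
	keys     map[int][]string
}

// inTx runs fn while holding the seed lock, mimicking "open tx, acquire
// lock, read". The lock is released when the simulated tx ends.
func (s *memStore) inTx(fn func() error) error {
	s.seedLock.Lock()
	defer s.seedLock.Unlock()
	return fn()
}

// getAIProviders returns every provider but attaches keys only to enabled ones.
func getAIProviders(s *memStore) ([]aiProvider, error) {
	var out []aiProvider
	err := s.inTx(func() error {
		for _, r := range s.rows {
			p := aiProvider{Name: r.Name, Enabled: r.Enabled}
			if r.Enabled {
				p.Keys = append([]string(nil), s.keys[r.ID]...)
			}
			out = append(out, p)
		}
		return nil
	})
	return out, err
}

func main() {
	s := &memStore{
		rows: []providerRow{
			{ID: 1, Name: "provider-a", Enabled: true},
			{ID: 2, Name: "provider-b", Enabled: false},
		},
		keys: map[int][]string{1: {"key-a"}, 2: {"key-b"}},
	}
	providers, err := getAIProviders(s)
	if err != nil {
		panic(err)
	}
	for _, p := range providers {
		// Print key counts only; the secrets themselves are never logged.
		fmt.Printf("%s enabled=%t keys=%d\n", p.Name, p.Enabled, len(p.Keys))
	}
}
```

Because the read happens entirely inside inTx, a seeder holding the same lock forces the handler to wait, which is the property the real advisory lock is meant to provide.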

Client plumbing

  • Add proto.DRPCProviderConfiguratorClient to the DRPCClient union
    (coderd/aibridged/client.go) and to the concrete Client struct.
  • Construct/register it in both dialer.go (standalone) and
    CreateInMemoryAIBridgeServer (embedded).

Provider building (cli layer)

  • Add a mapper: proto AIProvider -> aiProviderSpec -> buildProvider.
    Disabled entries -> NewDisabledProviderStub. Reuses the existing DB-neutral
    buildProvider, so construction is byte-identical across embedded and standalone.
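
As a rough illustration of the mapping step, the sketch below turns a list of proto-shaped entries into built providers, routing disabled entries to a stub. The protoAIProvider and builtProvider types and the buildAll function are hypothetical; the real code would go through aiProviderSpec, buildProvider, and NewDisabledProviderStub, whose signatures are not shown in this description.

```go
package main

import "fmt"

// protoAIProvider mirrors the fields of the AIProvider proto message
// (Bedrock settings omitted for brevity). It is a hypothetical stand-in
// for the generated type.
type protoAIProvider struct {
	Name    string
	Type    string
	Enabled bool
	BaseURL string
	Keys    []string
}

// builtProvider is a hypothetical stand-in for the constructed provider.
type builtProvider struct {
	Name     string
	Disabled bool
	KeyCount int
}

// buildAll maps each entry; disabled entries become stubs without keys.
func buildAll(in []protoAIProvider) []builtProvider {
	out := make([]builtProvider, 0, len(in))
	for _, p := range in {
		if !p.Enabled {
			out = append(out, builtProvider{Name: p.Name, Disabled: true})
			continue
		}
		out = append(out, builtProvider{Name: p.Name, KeyCount: len(p.Keys)})
	}
	return out
}

func main() {
	built := buildAll([]protoAIProvider{
		{Name: "provider-a", Enabled: true, Keys: []string{"key-a"}},
		{Name: "provider-b", Enabled: false},
	})
	for _, b := range built {
		fmt.Printf("%s disabled=%t keys=%d\n", b.Name, b.Disabled, b.KeyCount)
	}
}
```

Keeping one mapper for both modes is what lets embedded and standalone share the same construction path.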

Embedded (cli/aibridged.go, cli/server.go)

  • Reorder newAIBridgeDaemon: build an empty pool, create srv
    (aibridged.New), then subscribe a reloader whose Reload does
    srv.Client() -> GetAIProviders -> build -> pool.ReplaceProviders.
  • Make the initial Reload asynchronous so startup does not park on Client().
  • Remove the cli/server.go BuildProviders pre-build and the providers
    argument to newAIBridgeDaemon.
  • Pubsub stays the trigger (AIProvidersChangedChannel), so embedded keeps
    hot-reload; only the data path changes from direct-DB to the in-memory RPC.
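
The reordered embedded startup could look roughly like the sketch below. The pool and reloader types are hypothetical, and the connected channel stands in for srv.Client() blocking until the in-memory connection is ready. It shows that running the first Reload in a goroutine lets startup continue with an empty pool.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical provider pool that supports atomic replacement.
type pool struct {
	mu        sync.RWMutex
	providers []string
}

func (p *pool) ReplaceProviders(ps []string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.providers = ps
}

func (p *pool) Len() int {
	p.mu.RLock()
	defer p.mu.RUnlock()
	return len(p.providers)
}

// reloader fetches the provider set and swaps it into the pool. fetch
// stands in for srv.Client() -> GetAIProviders -> build.
type reloader struct {
	pool  *pool
	fetch func() ([]string, error)
}

func (r *reloader) Reload() error {
	ps, err := r.fetch()
	if err != nil {
		return err
	}
	r.pool.ReplaceProviders(ps)
	return nil
}

func main() {
	p := &pool{}
	connected := make(chan struct{})
	r := &reloader{pool: p, fetch: func() ([]string, error) {
		<-connected // blocks like Client() until the daemon is connected
		return []string{"provider-a"}, nil
	}}

	done := make(chan error, 1)
	go func() { done <- r.Reload() }() // async initial reload

	fmt.Println("providers at startup:", p.Len())
	close(connected)
	if err := <-done; err != nil {
		panic(err)
	}
	fmt.Println("providers after reload:", p.Len())
}
```

The same Reload method would then be invoked by the pubsub subscription, so the initial load and later hot-reloads share one code path.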

Standalone (enterprise/cli/aigatewaystart.go)

  • Remove ReadAIProvidersFromEnv + BuildProvidersFromConfig usage.
  • Build an empty pool, create srv + websocket dialer, then run a
    retry-until-success loop: srv.Client() -> GetAIProviders -> build ->
    ReplaceProviders -> serve. A successful empty list is valid and ends the loop.

Cleanup (this PR)

  • Delete cli.BuildProvidersFromConfig and coderd.ProvidersFromConfig.
  • Split the obsolete cli.BuildProviders: its DB-read half (rows + keys in a tx)
    moves server-side into the handler; its builder half is already buildProvider.
  • Keep ReadAIProvidersFromEnv (still used by the embedded env->DB seed at
    cli/server.go:934).

Tests

  • Real-Postgres concurrency test (CODER_PG_CONNECTION_URL, DB=ci):
    hold LockIDAIProvidersEnvSeed in one tx (simulating an in-flight seed), fire
    GetAIProviders concurrently, assert it blocks until release, then returns the
    seeded set. Advisory locks are Postgres-specific, so this cannot use the mock.
  • Round-trip mapping test: enabled / disabled / bedrock / copilot / keys.

Known tradeoffs / implications

  1. BuildProviders is split - DB-read half to the server handler, builder
    half stays as buildProvider on the client.
  2. Empty-pool window at startup for both modes (503s until the first fetch);
    brief and consistent across modes.
  3. Standalone has no refresh until AIGOV-465 - provider add/enable will not
    propagate to a running standalone gateway; embedded still hot-reloads.
  4. Every embedded reload serializes on the seed lock (advisory, fast) and
    depends on the in-memory Client() being available.
  5. Secrets cross the wire (plaintext keys + Bedrock creds). This is not a new
    posture: the Authorizer.IsAuthorized RPC already carries a full API token on
    the same channel. For standalone, TLS is governed by the client URL scheme;
    for embedded, the channel is in-memory, so transport security is not a concern.

Out of scope (AIGOV-465 follow-up)

  • Publishing a provider-seed completion signal.
  • A WatchAIProviders streaming RPC and standalone refresh-on-signal.

Split aibridged so the AI Gateway no longer reads the database directly:
coderd stays the source of truth and serves provider configuration over
a new ProviderConfigurator DRPC service (GetAIProviders, API v1.1).

The handler reads providers + keys under a read-only transaction that
acquires LockIDAIProvidersEnvSeed, so the response is never a partial,
mid-seed snapshot. Both embedded and standalone gateways init via this
RPC; embedded keeps pubsub as the hot-reload trigger, only the data path
moves from direct-DB to the in-memory RPC. Standalone retries until the
first fetch succeeds (an empty list is valid).

Removes the temporary env-based provider building for standalone
(BuildProvidersFromConfig / ProvidersFromConfig) and the obsolete
BuildProviders DB-read path; the DB read now lives server-side in the
handler and the builder half stays as buildProvider on the client.