Skip to content

feat: hot-reload aibridged and aibridgeproxyd providers on DB changes#25673

Merged
dannykopping merged 8 commits into
mainfrom
dk/aibridge-hot-reload
May 27, 2026
Merged

feat: hot-reload aibridged and aibridgeproxyd providers on DB changes#25673
dannykopping merged 8 commits into
mainfrom
dk/aibridge-hot-reload

Conversation

@dannykopping
Copy link
Copy Markdown
Contributor

@dannykopping dannykopping commented May 26, 2026

Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any ai_providers or ai_provider_keys mutation required a restart for either to pick it up.

Add an ai_providers_changed pubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe:

  • aibridged rebuilds its []aibridge.Provider snapshot via BuildProviders and swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row.
  • aibridgeproxyd rebuilds its routing snapshot from GetAIProviders and swaps the host→provider map atomically. The MITM listener picks up new providers without restart.

DB read for aibridgeproxyd uses the existing AsAIProviderMetadataReader subject for routing-only access.

@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from fbcdba1 to 94a086e Compare May 26, 2026 08:42
@dannykopping dannykopping force-pushed the dk/aibridge-providers-db-load branch from df99b3e to 0702534 Compare May 26, 2026 08:42
Copy link
Copy Markdown
Contributor Author

dannykopping commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

/coder-agents-review

@coder-agents-review
Copy link
Copy Markdown
Contributor

coder-agents-review Bot commented May 26, 2026

Review posted | Chat
Requested: 2026-05-27 07:50 UTC by @dannykopping
Spend: $105.89 / $100.00

Review history
  • R1 (2026-05-26), 1 Nit, 1 P2, 2 P3, COMMENT. Review
  • R2 (2026-05-26): 17 reviewers, 6 Nit, 2 P2, 11 P3, COMMENT. Review
  • R3 (2026-05-27), 6 Nit, 2 P2, 11 P3, COMMENT. Review
  • R4 (2026-05-27): 17 reviewers, 8 Nit, 2 P2, 15 P3, APPROVE. Review

deep-review v0.5.0 | Round 4 | e91bec8..0513e39

Last posted: Round 4, 25 findings (2 P2, 15 P3, 8 Nit), APPROVE. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P2 Author fixed (20bc28d) coderd/aibridged/pool.go:126 CachedBridgePool.Reload and Server.Reload have no direct unit tests R1 Netero Yes
CRF-2 P3 Author fixed (20bc28d) enterprise/aibridgeproxyd/aibridgeproxyd.go:695 Stale comment claims startup validation ensures domain-provider mappings; no longer true with hot-reload R1 Netero Yes
CRF-3 P3 Author fixed (20bc28d) coderd/aibridged/reload_test.go:95 TestSubscribeProviderReloadIgnoresEventError is vacuous; passes trivially without the error-handling branch R1 Netero Yes
CRF-4 Nit Author fixed (20bc28d) enterprise/aibridgeproxyd/aibridgeproxyd.go:451 Unrelated errors.As to errors.AsType refactor mixed into feature PR R1 Netero Yes
CRF-5 P2 Deferred (CODAGT-499) coderd/ai_providers.go:469 publishAIProvidersChanged does not publish ChatConfigEventProviders, leaving chatd provider cache permanently stale R2 Robin P2, Zoro P4 Yes
CRF-6 P3 Author fixed (951bd09) coderd/aibridged/pool.go:228 Singleflight in Acquire can cache a stale-provider bridge after Reload clears cache R2 Hisoka P3, Knov P3, Meruem P3, Takumi P3 Yes
CRF-7 P3 Author fixed (951bd09) coderd/aibridged/reload.go:53 Initial Reload runs before Subscribe, leaving a gap where a pubsub message is missed R2 Kite P3, Pariston P3, Ryosuke P3, Takumi P3 Yes
CRF-8 P3 Author fixed (951bd09) enterprise/cli/server.go:174 Enterprise shutdown closes proxy server before unsubscribing pubsub, opposite of AGPL path R2 Mafuuu P3, Meruem P3, Takumi P3, Ryosuke P3 Yes
CRF-9 P3 Author fixed (951bd09) enterprise/aibridgeproxyd/reload_internal_test.go:13 No test verifies refreshProviders failure preserves previous router snapshot R2 Bisky Yes
CRF-10 P3 Author fixed (951bd09) coderd/aibridged/pool.go:130 CachedBridgePool.Reload after Shutdown is documented as no-op but untested R2 Bisky P3, Chopper P3 Yes
CRF-11 P3 Author fixed (951bd09) enterprise/aibridgeproxyd/reload.go:78 buildProviderRouter silently drops providers with bad BaseURL; no diagnostic log R2 Chopper P3, Mafuuu P3 Yes
CRF-12 P3 Author fixed (951bd09) enterprise/aibridgeproxyd/aibridgeproxyd.go:699 Error log for expected reload race scenario should be Warn R2 Leorio Yes
CRF-13 P3 Author fixed (951bd09) coderd/aibridged/reload.go:22 Doc comment bloat pattern: multiple comments narrate implementation details as API contracts R2 Gon Yes
CRF-14 P3 Author fixed (951bd09) coderd/aibridged/pool.go:36 Reload name collision between Pooler.Reload (push) and ProviderReloader.Reload (pull) R2 Gon Yes
CRF-15 Nit Author fixed (951bd09) coderd/aibridged/pool.go:115 append([]T(nil), s...) is hand-rolled slices.Clone R2 Ging-Go Yes
CRF-16 Nit Author fixed (951bd09) coderd/aibridged/reload_test.go:25 Manual context.WithTimeout instead of testutil.Context used 3 times R2 Ging-Go Yes
CRF-17 Nit Author fixed (951bd09) enterprise/aibridgeproxyd/aibridgeproxyd.go:160 "whitelist" in a codebase that consistently uses "allowlist" R2 Gon, Leorio Yes
CRF-18 Nit Author fixed (951bd09) coderd/aibridged/reload.go:9 Confusing import alias: bare pubsub is database/pubsub, aliased coderpubsub is coderd/pubsub R2 Ryosuke, Zoro Yes
CRF-19 Nit Author fixed (951bd09) coderd/aibridged/reload_test.go:38 Eventually weakens the synchrony assertion for initial Reload R2 Bisky Yes

| CRF-20 | P3 | Open | coderd/aibridged/pool.go:126 | providerVersion uses time.Now().UnixNano() instead of monotonic counter; NTP/collision defeats staleness guard | R4 | Gon P2, Meruem P3, Ryosuke P3, Hisoka Note, Zoro Nit | Yes |
| CRF-21 | P3 | Open | coderd/aibridged/pool.go:222 | Version-mismatched bridge from singleflight is discarded but never Shutdown'd; MCP connections leak | R4 | Hisoka | Yes |
| CRF-22 | P3 | Open | enterprise/aibridgeproxyd/reload.go:53 | mitmHostsCondition does case-sensitive host matching; providerFromHost normalizes to lowercase | R4 | Meruem | Yes |
| CRF-23 | P3 | Open | coderd/pubsub/aiproviderschangedevent.go:5 | Doc comments enumerate subscribers by name; stale when CODAGT-499 lands | R4 | Gon | Yes |
| CRF-24 | Nit | Open | enterprise/aibridgeproxyd/reload_test.go:86 | make+copy is hand-rolled slices.Clone | R4 | Ging-Go | Yes |
| CRF-25 | Nit | Open | enterprise/aibridgeproxyd/reload_internal_test.go:17 | context.Background() in tests where t.Context() available (6 sites) | R4 | Ging-Go | Yes |

Contested and acknowledged

(none)

Round log

Round 1

Netero-only. 1 P2, 2 P3, 1 Nit. Reviewed against 5a884de..f03cc91.

Round 2

Panel. CRF-1 through CRF-4 addressed. 1 P2, 10 P3, 5 Nit new. Reviewed against 5a884de..20bc28d.

Round 3

BLOCKED. CRF-6 through CRF-19 addressed (951bd09). CRF-5 deferred without ticket. No review.

Round 4

Panel. CRF-5 deferred with ticket (CODAGT-499). 4 P3, 2 Nit new. Reviewed against e91bec8..0513e39.

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass review (Netero). This is a mechanical scan only; the full review panel has not yet reviewed this PR.

The hot-reload design is clean: pubsub notification on CRUD, atomic snapshot swap, inflight requests unaffected. The provider isolation (per-provider construction errors stay scoped) and the routing-only DB access via AsAIProviderMetadataReader are good choices.

Findings: 1 P2, 2 P3, 1 Nit. The P2 is a test coverage gap for the core mutation this PR adds. The panel will review after these are addressed.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/aibridged/pool.go Outdated
Comment thread enterprise/aibridgeproxyd/aibridgeproxyd.go Outdated
Comment thread coderd/aibridged/reload_test.go Outdated
Comment thread enterprise/aibridgeproxyd/aibridgeproxyd.go Outdated
Copy link
Copy Markdown
Contributor Author

/coder-agents-review

@dannykopping dannykopping changed the base branch from dk/aibridge-providers-db-load to graphite-base/25673 May 26, 2026 11:40
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from 20bc28d to 1f8832a Compare May 26, 2026 11:41
@dannykopping dannykopping force-pushed the graphite-base/25673 branch from 5a884de to 1f8832a Compare May 26, 2026 11:50
@dannykopping dannykopping changed the base branch from graphite-base/25673 to dk/aibridge-providers-db-load May 26, 2026 11:50
@dannykopping dannykopping changed the base branch from dk/aibridge-providers-db-load to graphite-base/25673 May 26, 2026 12:32
Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panel review (Round 2). All four R1 Netero findings verified as fixed in 20bc28d.

The hot-reload architecture is well-designed. Clean separation via ProviderReloader interface lets both daemons share SubscribeProviderReload without coupling. The atomic pointer swaps give consistent routing without locks. The OnEvict fix (capturing bridge before the background goroutine) closes a real ristretto reuse race. The errInjectingPubsub approach for CRF-3 is a proper tool for the problem.

Findings: 1 P2, 10 P3, 5 Nit. The P2 is a notification gap: this PR adds AIProvidersChangedChannel but chatd already wired (and never received) ChatConfigEventProviders for provider cache invalidation. The P3s cluster around three themes: concurrency edge cases (singleflight stale cache, reload-before-subscribe ordering), lifecycle consistency (enterprise shutdown ordering), and operational observability (silent provider drops, error-level log for expected race).

"The old pattern's safety argument died when the preconditions changed." (Pariston, on why boot-time-only validation comments needed updating)

One contradiction worth noting: Leorio praises the AIProvidersChangedChannel doc comment as "exactly what every pubsub channel constant should look like," while Gon flags the same comment for narrating subscriber behavior the constant cannot own. Both have a point; the overall structure is good, but the subscriber-specific details (what consumers do, idempotent design properties) belong with the subscribers.

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/ai_providers.go
Comment thread coderd/aibridged/pool.go
Comment thread coderd/aibridged/reload.go Outdated
Comment thread enterprise/cli/server.go Outdated
Comment thread enterprise/aibridgeproxyd/reload_internal_test.go
Comment thread coderd/aibridged/pool.go Outdated
Comment thread coderd/aibridged/reload_test.go Outdated
Comment thread enterprise/aibridgeproxyd/aibridgeproxyd.go Outdated
Comment thread coderd/aibridged/reload.go Outdated
Comment thread coderd/aibridged/reload_test.go Outdated
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from 82bfb6d to 620807c Compare May 26, 2026 13:36
@dannykopping dannykopping changed the base branch from graphite-base/25673 to dk/aibridge-providers-db-load May 26, 2026 13:36
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from 620807c to 97626f8 Compare May 26, 2026 13:40
@dannykopping dannykopping force-pushed the dk/aibridge-providers-db-load branch from 50ed5f8 to 10ef9c7 Compare May 26, 2026 13:40
@dannykopping dannykopping changed the base branch from dk/aibridge-providers-db-load to graphite-base/25673 May 26, 2026 13:56
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from 97626f8 to 3a35605 Compare May 26, 2026 13:57
@dannykopping dannykopping force-pushed the graphite-base/25673 branch from 10ef9c7 to 282ab7d Compare May 26, 2026 13:57
@graphite-app graphite-app Bot changed the base branch from graphite-base/25673 to main May 26, 2026 13:57
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch 2 times, most recently from fb9c725 to 951bd09 Compare May 26, 2026 19:39
Copy link
Copy Markdown
Contributor Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round 3 is blocked. 14 of 15 R2 findings addressed in 951bd09. One finding remains unresolved:

CRF-5 (P2, coderd/ai_providers.go) - publishAIProvidersChanged does not publish ChatConfigEventProviders, leaving chatd's provider cache permanently stale. Author responded: "This will be implemented as a follow-up." No linked ticket.

Further review is blocked until the author either:

  1. Files a ticket for the chatd notification gap and links it here, or
  2. Pushes a fix in this PR, or
  3. A human maintainer explicitly accepts the gap as out of scope for this PR.

This is not a decision the review panel can make. The dead subscriber predates this PR, but this PR adds the publishing infrastructure that could close it.

🤖 This review was automatically generated with Coder Agents.

Copy link
Copy Markdown
Contributor Author

/coder-agents-review

Copy link
Copy Markdown
Contributor

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Panel review (Round 4). All R1+R2 fixes verified, CRF-5 deferred with CODAGT-499. CI passing. 8 of 17 reviewers reported no findings.

The fix commits are thorough. The generation counter for CRF-6 (singleflight stale cache) is clean and tested by TestPoolReplaceProvidersDoesNotJoinStaleSingleflight. The subscribe-before-reload reorder (CRF-7) follows established codebase patterns. The new integration tests (TestProxy_HotReloadRoutingCRUD with 7 phases, TestAIBridgeProviderHotReload end-to-end) are real and well-structured. Test-to-code ratio is now ~2.4:1.

New findings: 4 P3, 2 Nit. The most interesting is a resource leak (CRF-21): when the version guard correctly skips caching a stale bridge, nobody calls Shutdown() on it, so MCP connections leak. Narrow window but worth closing. Five reviewers independently flagged providerVersion using wall-clock nanoseconds instead of a monotonic counter (CRF-20).

"At three users this is invisible. At three thousand users during a burst of provider CRUD, it's three thousand sets of orphaned MCP connections per reload event." (Hisoka, on the uncached bridge leak)

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/aibridged/pool.go
Comment thread coderd/aibridged/pool.go
Comment thread enterprise/aibridgeproxyd/reload.go Outdated
Comment thread coderd/pubsub/aiproviderschangedevent.go
Comment thread enterprise/aibridgeproxyd/reload_test.go Outdated
Comment thread enterprise/aibridgeproxyd/reload_internal_test.go Outdated
Previously the in-process aibridge daemon snapshotted the provider
list once at boot. Any ai_providers or ai_provider_keys mutation
required a restart for it to pick up.

Add an ai_providers_changed pubsub channel that the CRUD handlers
publish on after Create / Update / Delete. The aibridge daemon
subscribes and refreshes its snapshot from the database via
dbauthz.AsAIBridged (now granted ResourceAIProvider:Read).

Pool snapshots swap via atomic.Pointer so inflight requests keep
serving against the bridge they already acquired; new acquires build
against the new snapshot. Per-provider construction errors are
scoped to the offending row, so a single misconfigured provider
cannot poison the whole pool.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dannykopping dannykopping marked this pull request as ready for review May 27, 2026 08:49
@dannykopping dannykopping requested a review from johnstcn May 27, 2026 08:49
dannykopping and others added 6 commits May 27, 2026 10:49
Subscribe the enterprise aibridgeproxyd to the ai_providers_changed
pubsub channel from the parent commit so the proxy's MITM allowlist
and host -> provider-name routing track the database without a
restart.

The proxy stays database-unaware: the enterprise wrapper supplies a
RefreshProvidersFunc that reads ai_providers under
dbauthz.AsAIProviderMetadataReader (the existing least-privilege
routing-only subject), so proxyd does not import database or
dbauthz packages.

Server.Reload swaps the (mitmHosts, host->name) router pointer
atomically. Inflight requests captured the previous router pointer
on entry and continue against it until the connection closes, so
swaps never disrupt active CONNECTs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…codersdk

Boot coderd with the in-process aibridged daemon and two mock HTTP
upstreams, then drive a provider through codersdk Create / Update /
Disable / Enable / Add-second / Delete and assert routing after each
mutation by issuing GET /api/v2/aibridge/{name}/v1/models. require.
Eventually closes the gap between the CRUD return and the pubsub-
driven pool reload.

Also drop unused method receivers on errInjectingPubsub so revive's
unused-receiver check passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… reload

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CRF-22: lowercase req.URL.Host in mitmHostsCondition so CONNECT
  matching is case-insensitive, matching providerFromHost and the
  lowercase hosts stored by buildProviderRouter.
- CRF-24: use slices.Clone instead of make+copy in providerStore.refresh.
- CRF-25: use t.Context() instead of context.Background() in tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dannykopping dannykopping force-pushed the dk/aibridge-hot-reload branch from 0513e39 to 958c96b Compare May 27, 2026 08:57
Copy link
Copy Markdown
Member

@johnstcn johnstcn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have blocking comments.

Comment thread enterprise/coderd/aibridge_reload_test.go Outdated
Comment thread enterprise/cli/server.go Outdated
Comment thread enterprise/cli/aibridgeproxyd_internal_test.go Outdated
Comment on lines +31 to +55
// aibridgedRecorder captures the path of the last request received by
// the mock aibridged backend. Access is mutex-guarded so the test
// goroutine and the proxy's response goroutine can read/write safely.
type aibridgedRecorder struct {
mu sync.Mutex
path string
}

func (r *aibridgedRecorder) record(path string) {
r.mu.Lock()
defer r.mu.Unlock()
r.path = path
}

func (r *aibridgedRecorder) load() string {
r.mu.Lock()
defer r.mu.Unlock()
return r.path
}

func (r *aibridgedRecorder) reset() {
r.mu.Lock()
defer r.mu.Unlock()
r.path = ""
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a use-case for an auto-generated mock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Acknowledged. The hand-rolled mocks here are minimal (a single-method interface), but a generated mock via mockgen would work too. Leaving as-is for now unless you feel strongly.$\n\n> Disclaimer: implemented by a Coder Agent using Claude Opus 4.6

Comment thread enterprise/aibridgeproxyd/reload_internal_test.go Outdated
Comment on lines +364 to +396
type blockingMCPFactory struct {
calls atomic.Int32
firstStarted chan struct{}
releaseFirst chan struct{}
}

func newBlockingMCPFactory() *blockingMCPFactory {
return &blockingMCPFactory{
firstStarted: make(chan struct{}),
releaseFirst: make(chan struct{}),
}
}

func (m *blockingMCPFactory) firstBuildStarted() bool {
select {
case <-m.firstStarted:
return true
default:
return false
}
}

func (m *blockingMCPFactory) Build(ctx context.Context, _ aibridged.Request, _ trace.Tracer) (mcp.ServerProxier, error) {
if m.calls.Add(1) == 1 {
close(m.firstStarted)
select {
case <-m.releaseFirst:
case <-ctx.Done():
return nil, ctx.Err()
}
}
return nil, context.Canceled
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, would it make more sense to use an auto-generated mock here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above; the mockMCPFactory is small enough that a generated mock doesn't save much, but happy to switch if preferred.$\n\n> Disclaimer: implemented by a Coder Agent using Claude Opus 4.6

Comment thread coderd/ai_providers_pubsub_test.go Outdated
Comment thread coderd/ai_providers_pubsub_test.go Outdated
- Use assert.NoError instead of discarding encoder error
- Replace require.Eventually with testutil.Eventually
- Replace t.Context() with testutil.Context(t, testutil.WaitShort)
- Refactor newAIBridgeProxyDaemon to return io.Closer
- Remove funcCloser type, use aiBridgeProxyDaemon struct instead
@dannykopping dannykopping merged commit 79e007c into main May 27, 2026
28 checks passed
Copy link
Copy Markdown
Contributor Author

Merge activity

@dannykopping dannykopping deleted the dk/aibridge-hot-reload branch May 27, 2026 09:58
@github-actions github-actions Bot locked and limited conversation to collaborators May 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants