feat: hot-reload aibridged and aibridgeproxyd providers on DB changes#25673
Conversation
fbcdba1 to
94a086e
Compare
df99b3e to
0702534
Compare
This stack of pull requests is managed by Graphite. Learn more about stacking. |
94a086e to
f03cc91
Compare
0702534 to
5a884de
Compare
|
/coder-agents-review |
|
Review posted | Chat Review historydeep-review v0.5.0 | Round 4 | Last posted: Round 4, 25 findings (2 P2, 15 P3, 8 Nit), APPROVE. Review Finding inventoryFindings
| CRF-20 | P3 | Open | coderd/aibridged/pool.go:126 | providerVersion uses time.Now().UnixNano() instead of monotonic counter; NTP/collision defeats staleness guard | R4 | Gon P2, Meruem P3, Ryosuke P3, Hisoka Note, Zoro Nit | Yes | Contested and acknowledged(none) Round logRound 1Netero-only. 1 P2, 2 P3, 1 Nit. Reviewed against 5a884de..f03cc91. Round 2Panel. CRF-1 through CRF-4 addressed. 1 P2, 10 P3, 5 Nit new. Reviewed against 5a884de..20bc28d. Round 3BLOCKED. CRF-6 through CRF-19 addressed (951bd09). CRF-5 deferred without ticket. No review. Round 4Panel. CRF-5 deferred with ticket (CODAGT-499). 4 P3, 2 Nit new. Reviewed against e91bec8..0513e39. About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
First-pass review (Netero). This is a mechanical scan only; the full review panel has not yet reviewed this PR.
The hot-reload design is clean: pubsub notification on CRUD, atomic snapshot swap, inflight requests unaffected. The provider isolation (per-provider construction errors stay scoped) and the routing-only DB access via AsAIProviderMetadataReader are good choices.
Findings: 1 P2, 2 P3, 1 Nit. The P2 is a test coverage gap for the core mutation this PR adds. The panel will review after these are addressed.
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
20bc28d to
1f8832a
Compare
5a884de to
1f8832a
Compare
There was a problem hiding this comment.
Panel review (Round 2). All four R1 Netero findings verified as fixed in 20bc28d.
The hot-reload architecture is well-designed. Clean separation via ProviderReloader interface lets both daemons share SubscribeProviderReload without coupling. The atomic pointer swaps give consistent routing without locks. The OnEvict fix (capturing bridge before the background goroutine) closes a real ristretto reuse race. The errInjectingPubsub approach for CRF-3 is a proper tool for the problem.
Findings: 1 P2, 10 P3, 5 Nit. The P2 is a notification gap: this PR adds AIProvidersChangedChannel but chatd already wired (and never received) ChatConfigEventProviders for provider cache invalidation. The P3s cluster around three themes: concurrency edge cases (singleflight stale cache, reload-before-subscribe ordering), lifecycle consistency (enterprise shutdown ordering), and operational observability (silent provider drops, error-level log for expected race).
"The old pattern's safety argument died when the preconditions changed." (Pariston, on why boot-time-only validation comments needed updating)
One contradiction worth noting: Leorio praises the AIProvidersChangedChannel doc comment as "exactly what every pubsub channel constant should look like," while Gon flags the same comment for narrating subscriber behavior the constant cannot own. Both have a point; the overall structure is good, but the subscriber-specific details (what consumers do, idempotent design properties) belong with the subscribers.
🤖 This review was automatically generated with Coder Agents.
82bfb6d to
620807c
Compare
620807c to
97626f8
Compare
50ed5f8 to
10ef9c7
Compare
97626f8 to
3a35605
Compare
10ef9c7 to
282ab7d
Compare
fb9c725 to
951bd09
Compare
|
/coder-agents-review |
There was a problem hiding this comment.
Round 3 is blocked. 14 of 15 R2 findings addressed in 951bd09. One finding remains unresolved:
CRF-5 (P2, coderd/ai_providers.go) - publishAIProvidersChanged does not publish ChatConfigEventProviders, leaving chatd's provider cache permanently stale. Author responded: "This will be implemented as a follow-up." No linked ticket.
Further review is blocked until the author either:
- Files a ticket for the chatd notification gap and links it here, or
- Pushes a fix in this PR, or
- A human maintainer explicitly accepts the gap as out of scope for this PR.
This is not a decision the review panel can make. The dead subscriber predates this PR, but this PR adds the publishing infrastructure that could close it.
🤖 This review was automatically generated with Coder Agents.
|
/coder-agents-review |
There was a problem hiding this comment.
Panel review (Round 4). All R1+R2 fixes verified, CRF-5 deferred with CODAGT-499. CI passing. 8 of 17 reviewers reported no findings.
The fix commits are thorough. The generation counter for CRF-6 (singleflight stale cache) is clean and tested by TestPoolReplaceProvidersDoesNotJoinStaleSingleflight. The subscribe-before-reload reorder (CRF-7) follows established codebase patterns. The new integration tests (TestProxy_HotReloadRoutingCRUD with 7 phases, TestAIBridgeProviderHotReload end-to-end) are real and well-structured. Test-to-code ratio is now ~2.4:1.
New findings: 4 P3, 2 Nit. The most interesting is a resource leak (CRF-21): when the version guard correctly skips caching a stale bridge, nobody calls Shutdown() on it, so MCP connections leak. Narrow window but worth closing. Five reviewers independently flagged providerVersion using wall-clock nanoseconds instead of a monotonic counter (CRF-20).
"At three users this is invisible. At three thousand users during a burst of provider CRUD, it's three thousand sets of orphaned MCP connections per reload event." (Hisoka, on the uncached bridge leak)
🤖 This review was automatically generated with Coder Agents.
Previously the in-process aibridge daemon snapshotted the provider list once at boot. Any ai_providers or ai_provider_keys mutation required a restart for it to pick up. Add an ai_providers_changed pubsub channel that the CRUD handlers publish on after Create / Update / Delete. The aibridge daemon subscribes and refreshes its snapshot from the database via dbauthz.AsAIBridged (now granted ResourceAIProvider:Read). Pool snapshots swap via atomic.Pointer so inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors are scoped to the offending row, so a single misconfigured provider cannot poison the whole pool. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Subscribe the enterprise aibridgeproxyd to the ai_providers_changed pubsub channel from the parent commit so the proxy's MITM allowlist and host -> provider-name routing track the database without a restart. The proxy stays database-unaware: the enterprise wrapper supplies a RefreshProvidersFunc that reads ai_providers under dbauthz.AsAIProviderMetadataReader (the existing least-privilege routing-only subject), so proxyd does not import database or dbauthz packages. Server.Reload swaps the (mitmHosts, host->name) router pointer atomically. Inflight requests captured the previous router pointer on entry and continue against it until the connection closes, so swaps never disrupt active CONNECTs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…codersdk
Boot coderd with the in-process aibridged daemon and two mock HTTP
upstreams, then drive a provider through codersdk Create / Update /
Disable / Enable / Add-second / Delete and assert routing after each
mutation by issuing GET /api/v2/aibridge/{name}/v1/models. require.
Eventually closes the gap between the CRUD return and the pubsub-
driven pool reload.
Also drop unused method receivers on errInjectingPubsub so revive's
unused-receiver check passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… reload Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- CRF-22: lowercase req.URL.Host in mitmHostsCondition so CONNECT matching is case-insensitive, matching providerFromHost and the lowercase hosts stored by buildProviderRouter. - CRF-24: use slices.Clone instead of make+copy in providerStore.refresh. - CRF-25: use t.Context() instead of context.Background() in tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0513e39 to
958c96b
Compare
johnstcn
left a comment
There was a problem hiding this comment.
I don't have blocking comments.
| // aibridgedRecorder captures the path of the last request received by | ||
| // the mock aibridged backend. Access is mutex-guarded so the test | ||
| // goroutine and the proxy's response goroutine can read/write safely. | ||
| type aibridgedRecorder struct { | ||
| mu sync.Mutex | ||
| path string | ||
| } | ||
|
|
||
| func (r *aibridgedRecorder) record(path string) { | ||
| r.mu.Lock() | ||
| defer r.mu.Unlock() | ||
| r.path = path | ||
| } | ||
|
|
||
| func (r *aibridgedRecorder) load() string { | ||
| r.mu.Lock() | ||
| defer r.mu.Unlock() | ||
| return r.path | ||
| } | ||
|
|
||
| func (r *aibridgedRecorder) reset() { | ||
| r.mu.Lock() | ||
| defer r.mu.Unlock() | ||
| r.path = "" | ||
| } |
There was a problem hiding this comment.
This seems like a use-case for an auto-generated mock.
There was a problem hiding this comment.
Acknowledged. The hand-rolled mocks here are minimal (a single-method interface), but a generated mock via mockgen would work too. Leaving as-is for now unless you feel strongly.$\n\n> Disclaimer: implemented by a Coder Agent using Claude Opus 4.6
| type blockingMCPFactory struct { | ||
| calls atomic.Int32 | ||
| firstStarted chan struct{} | ||
| releaseFirst chan struct{} | ||
| } | ||
|
|
||
| func newBlockingMCPFactory() *blockingMCPFactory { | ||
| return &blockingMCPFactory{ | ||
| firstStarted: make(chan struct{}), | ||
| releaseFirst: make(chan struct{}), | ||
| } | ||
| } | ||
|
|
||
| func (m *blockingMCPFactory) firstBuildStarted() bool { | ||
| select { | ||
| case <-m.firstStarted: | ||
| return true | ||
| default: | ||
| return false | ||
| } | ||
| } | ||
|
|
||
| func (m *blockingMCPFactory) Build(ctx context.Context, _ aibridged.Request, _ trace.Tracer) (mcp.ServerProxier, error) { | ||
| if m.calls.Add(1) == 1 { | ||
| close(m.firstStarted) | ||
| select { | ||
| case <-m.releaseFirst: | ||
| case <-ctx.Done(): | ||
| return nil, ctx.Err() | ||
| } | ||
| } | ||
| return nil, context.Canceled | ||
| } |
There was a problem hiding this comment.
Again, would it make more sense to use an auto-generated mock here?
There was a problem hiding this comment.
Same as above; the mockMCPFactory is small enough that a generated mock doesn't save much, but happy to switch if preferred.$\n\n> Disclaimer: implemented by a Coder Agent using Claude Opus 4.6
- Use assert.NoError instead of discarding encoder error - Replace require.Eventually with testutil.Eventually - Replace t.Context() with testutil.Context(t, testutil.WaitShort) - Refactor newAIBridgeProxyDaemon to return io.Closer - Remove funcCloser type, use aiBridgeProxyDaemon struct instead
Merge activity
|

Previously the in-process aibridge daemon and the enterprise aibridgeproxy daemon both snapshotted their provider routing once at boot. Any
ai_providersorai_provider_keysmutation required a restart for either to pick it up.Add an
ai_providers_changedpubsub channel that the CRUD handlers publish on after Create / Update / Delete. Both daemons subscribe:[]aibridge.Providersnapshot viaBuildProvidersand swaps it into the pool atomically. Inflight requests keep serving against the bridge they already acquired; new acquires build against the new snapshot. Per-provider construction errors stay scoped to the offending row.GetAIProvidersand swaps the host→provider map atomically. The MITM listener picks up new providers without restart.DB read for aibridgeproxyd uses the existing
AsAIProviderMetadataReadersubject for routing-only access.