fix(coderd): cut DB fan-out on agent instance-identity auth (backport #24973)#24982
Merged
Conversation
Restores `v2.33.0-rc.2`-equivalent query cost for agent instance-identity auth on `v2.33.0-rc.3`, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233× `Internal error fetching provisioner job resource. fetch related workspace build: context canceled` 500s during a 50-minute incident window to this path. Backport to `release/2.33` will follow as a separate PR after this merges. [#24325](#24325) ("support multiple agents with shared instance-identity auth") rewrote `coderd/workspaceresourceauth.go::handleAuthInstanceID` to use the new `:many` agent lookup followed by a per-candidate filter loop. Each iteration synchronously calls `GetWorkspaceResourceByID` and `GetProvisionerJobByID`. Both go through `dbauthz`, and both fan out into the same `provisioner_job → workspace_build → workspace` cascade because `authorizeProvisionerJob` always re-authorizes the workspace via `GetWorkspaceBuildByJobID → GetWorkspaceByID`. The handler then re-fetches resource and job again for the surviving agent. Net effect on the agent-auth happy path: | | SQL | RBAC | |---|---|---| | rc.2 baseline | 13 | 5 | | rc.3 today, 1 agent | 19 | 7 | | rc.3 today, 2 agents | 26 | 9 | | **After this PR, 1 agent** | **6** | **3** | | **After this PR, 2 agents** | **7** | **3** | Under load, the rc.3 chain blocks on pool acquire and the request blows past the 30s HTTP write timeout. (`coderd/database/dbauthz/dbauthz.go`) Add an `AsSystemRestricted` early-return at the top of `authorizeProvisionerJob`. Instance-identity auth has already proven cloud identity before reaching the DB layer, so re-authorizing the workspace on every provisioner-job lookup is pure overhead. Existing `GetWorkspaceAgentsByInstanceID` already uses the same fast-path pattern. ```go if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil { return nil } ``` (`coderd/workspaceresourceauth.go`) Capture the provisioner job alongside each candidate during the filter loop so the survivor lookup does not re-fetch resource and job after selection. The previous code fired the resource→job→build→workspace cascade twice for the surviving agent. Adds `TestAuthorizeProvisionerJob_SystemFastPath` in `coderd/database/dbauthz/dbauthz_test.go` with two sub-tests: - `AsSystemRestricted/SkipsCascade` — strict mock fails the test if `GetWorkspaceBuildByJobID` or `GetWorkspaceByID` is called. - `NonSystemActor/StillCascades` — auditor (no `ResourceSystem`) still pays the cascade and produces a `NotAuthorized` error, proving the fast-path is gated correctly. Updates 12 existing dbauthz suite cases to expect the new `ResourceSystem.Read` check ahead of the workspace/template-version check, with `FailSystemObjectChecks()` to force the slow path. Existing integration coverage in `TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent, MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded, ...}` exercises Part 2 end-to-end and continues to pass. - 3 files changed, +166/-48 - No SQL changes - No `make gen` - No migrations - No audit-table updates - [x] `go test ./coderd/database/dbauthz/` — full suite, ~6s - [x] `go test -run TestPostWorkspaceAuth ./coderd/` — instance-identity handler tests - [x] `go test -run TestProvisionerJob ./coderd/` - [x] `go test -run TestWorkspaceAgent ./coderd/` - [x] `go test ./coderd/provisionerdserver/` - [x] `gofmt -l` clean - **SQL-side filter:** rewrite `GetWorkspaceAgentsByInstanceID` to join `workspace_resources`/`provisioner_jobs` and filter `job.type = 'workspace_build'` server-side, eliminating the filter loop entirely. Cleaner long-term, but changes generated SQL and is too much surface for a release-branch hotfix. Worth doing as a follow-up. - **Full revert of #24325:** removes the multi-agent feature outright; conflicts with downstream commits ([#24441](#24441), [#24438](#24438), [#24313](#24313)). Reserved as fallback if the surgical fix doesn't hold under load testing.
ibetitsmike
approved these changes
May 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Backport of #24973 to
release/2.33.Summary
Restores
v2.33.0-rc.2-equivalent query cost for agent instance-identity auth, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233xInternal error fetching provisioner job resource500s during a 50-minute incident window to this path.Changes
authorizeProvisionerJob(coderd/database/dbauthz/dbauthz.go): Short-circuits the per-job RBAC fan-out throughGetWorkspaceBuildByJobID->GetWorkspaceByIDforAsSystemRestrictedcallers.handleAuthInstanceID(coderd/workspaceresourceauth.go): Captures the provisioner job alongside each candidate during the filter loop so the post-selection code reads it directly instead of re-querying.Conflict resolution
One conflict in
coderd/database/dbauthz/dbauthz_test.go: theTestAsAutostarttest function (from an unrelated commit onmain) was brought in as surrounding context during the cherry-pick. It was removed since it tests functionality (ResourceUserSecret.Readfor the Autostart role) not present on the release branch.Tests
TestAuthorizeProvisionerJob_SystemFastPath(3 sub-tests): all passTestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/*(7 sub-tests): all pass