fix(coderd): cut DB fan-out on agent instance-identity auth#24973
Conversation
|
/coder-agents-review |
There was a problem hiding this comment.
Well-targeted hotfix. The cascade analysis is solid, the two cuts (system fast-path + survivor re-fetch elimination) are minimal and correct, and the strict-mock test design is excellent. Seven of eighteen reviewers independently converged on the same pre-existing double-fetch in the filter loop, confirming the SQL-side join alternative is the right follow-up.
1 P3, 5 Nits, 1 Note. No blocking issues.
The per-candidate double-fetch (GetWorkspaceResourceByID internally calls GetProvisionerJobByID for auth, then the handler calls it again explicitly) is pre-existing from #24325 and correctly scoped out. Worth noting: the PR's query table (6 SQL for 1 agent, 7 for 2) understates remaining per-candidate cost by 1 SQL per agent because of this double-fetch. Not a problem for the incident fix, but the SQL-side join follow-up would eliminate it entirely. Consider filing a tracking ticket so it doesn't get lost.
The type-switch bypass for system actors (DEREM-8) is consistent with how system subjects work across the codebase: they carry wildcard permissions, so the switch was never a meaningful gate for them.
"The absence of a registered expectation is the assertion." (Bisky, on the strict-mock design)
coderd/workspaceresourceauth.go:162-170
Note [DEREM-7] Seven reviewers independently flagged the same pre-existing inefficiency: GetWorkspaceResourceByID (dbauthz.go:4946) internally calls GetProvisionerJobByID for authorization, then the handler calls GetProvisionerJobByID again at line 170 to capture the job. Same row, two SQL trips per candidate. With the fast-path both are cheap PK lookups, so this is not a pool-saturation risk, but it does mean per-candidate SQL cost is 3 (resource + internal job + explicit job) rather than the 2 implied by the PR description. The // TODO: Optimize this at dbauthz.go:4940 already marks the location. (Meruem, Hisoka, Razor, Killua, Kite, Ryosuke, Pariston)
🤖
🤖 This review was automatically generated with Coder Agents.
22fb8ee to
387e06d
Compare
|
/coder-agents-review |
Restores rc.2-equivalent query cost for agent instance-identity auth on v2.33.0-rc.3, which currently saturates the pgx pool when multiple agents share an instance ID. Two changes: 1. Add a system fast-path to authorizeProvisionerJob in dbauthz so that AsSystemRestricted callers skip the per-job RBAC fan-out through GetWorkspaceBuildByJobID -> GetWorkspaceByID. Instance-identity auth has already proven cloud identity before reaching the database layer, so re-authorizing the workspace on every provisioner-job lookup is pure overhead. Existing GetWorkspaceAgentsByInstanceID already uses the same fast-path pattern. 2. Capture the provisioner job alongside each candidate during the filter loop in handleAuthInstanceID so the survivor lookup does not re-fetch the resource and job. The previous code fired the resource->job->build->workspace cascade twice for the surviving agent. Combined effect on the agent-auth happy path (1 candidate): rc.2 baseline: 13 SQL + 5 RBAC rc.3 today: 19 SQL + 7 RBAC After fix: 6 SQL + 3 RBAC Adds TestAuthorizeProvisionerJob_SystemFastPath covering both the system-restricted bypass and the non-system cascade path. Updates 12 existing dbauthz suite cases to expect the new ResourceSystem.Read check ahead of the workspace/template-version check, with FailSystemObjectChecks() to force the slow path.
387e06d to
3d31209
Compare
There was a problem hiding this comment.
All six R1 findings addressed cleanly in 387e06d. The invariant comment (DEREM-3 fix) is well-worded and goes beyond what was suggested. The new TemplateVersion/SkipsCascade sub-test (DEREM-6 fix) is a proper independent sub-test, not a patched-in assertion.
1 P4, 2 Nits new in R2. Nothing blocking.
CI shows test-go-pg and test-go-pg-17 failing. Four reviewers ran the dbauthz suite and handler tests locally without issues. The failures appear unrelated to this PR's changes, but worth confirming before merging to a release branch.
Nit [DEREM-12] The PR description says "two sub-tests" under Tests but the function now has three (after the DEREM-6 fix). The footprint stat (+166/-48) is also stale; actual is +200/-48. Minor housekeeping. (Mafu-san)
"If the re-fetch block were accidentally reintroduced, every functional test would pass while the two extra SQL queries silently returned." (Bisky, on the structural enforcement gap)
🤖 This review was automatically generated with Coder Agents.
| } | ||
| if job.Type == database.ProvisionerJobTypeWorkspaceBuild { | ||
| buildAgents = append(buildAgents, candidate) | ||
| buildCandidates = append(buildCandidates, instanceCandidate{ |
There was a problem hiding this comment.
P4 [DEREM-11] Part 1 (fast-path) is enforced by strict-mock tests that fail if the cascade runs. Part 2 (re-fetch elimination) has no equivalent enforcement. The instanceCandidate struct makes reintroduction structurally impractical (you'd need to deliberately bypass the captured selected.job), so the risk is low. A mock-based DB call counter would be fragile and couple to internals. Noting the gap rather than requesting a test. (Bisky P3, downgraded: structural design provides implicit enforcement)
🤖
| // That cascade adds 2 SQL queries + 1 RBAC eval per provisioner-job lookup | ||
| // and saturates the pgx pool when called repeatedly from agent | ||
| // instance-identity auth (see incident report against v2.33.0-rc.3). | ||
| func TestAuthorizeProvisionerJob_SystemFastPath(t *testing.T) { |
There was a problem hiding this comment.
Nit [DEREM-12] The PR description says "two sub-tests" under Tests but this function now has three (after DEREM-6 fix added TemplateVersion/SkipsCascade). The footprint stat (+166/-48) is also stale; actual is +200/-48. Consider updating the description. (Mafu-san)
🤖
|
|
||
| authorizer := rbac.NewAuthorizer(prometheus.NewRegistry()) | ||
|
|
||
| t.Run("AsSystemRestricted/SkipsCascade", func(t *testing.T) { |
There was a problem hiding this comment.
Nit [DEREM-13] "AsSystemRestricted/SkipsCascade" silently covers WorkspaceBuild while its sibling is "AsSystemRestricted/TemplateVersion/SkipsCascade". Renaming to "AsSystemRestricted/WorkspaceBuild/SkipsCascade" would make the pair symmetric and grep-friendly. (Gon)
🤖
Summary
Restores
v2.33.0-rc.2-equivalent query cost for agent instance-identity auth onv2.33.0-rc.3, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233×Internal error fetching provisioner job resource. fetch related workspace build: context canceled500s during a 50-minute incident window to this path.Backport to
release/2.33will follow as a separate PR after this merges.Root cause
#24325 ("support multiple agents with shared instance-identity auth") rewrote
coderd/workspaceresourceauth.go::handleAuthInstanceIDto use the new:manyagent lookup followed by a per-candidate filter loop. Each iteration synchronously callsGetWorkspaceResourceByIDandGetProvisionerJobByID. Both go throughdbauthz, and both fan out into the sameprovisioner_job → workspace_build → workspacecascade becauseauthorizeProvisionerJobalways re-authorizes the workspace viaGetWorkspaceBuildByJobID → GetWorkspaceByID. The handler then re-fetches resource and job again for the surviving agent.Net effect on the agent-auth happy path:
Under load, the rc.3 chain blocks on pool acquire and the request blows past the 30s HTTP write timeout.
Changes
1. System fast-path on
authorizeProvisionerJob(coderd/database/dbauthz/dbauthz.go)Add an
AsSystemRestrictedearly-return at the top ofauthorizeProvisionerJob. Instance-identity auth has already proven cloud identity before reaching the DB layer, so re-authorizing the workspace on every provisioner-job lookup is pure overhead. ExistingGetWorkspaceAgentsByInstanceIDalready uses the same fast-path pattern.2. Drop survivor re-fetch in
handleAuthInstanceID(coderd/workspaceresourceauth.go)Capture the provisioner job alongside each candidate during the filter loop so the survivor lookup does not re-fetch resource and job after selection. The previous code fired the resource→job→build→workspace cascade twice for the surviving agent.
Tests
Adds
TestAuthorizeProvisionerJob_SystemFastPathincoderd/database/dbauthz/dbauthz_test.gowith two sub-tests:AsSystemRestricted/SkipsCascade— strict mock fails the test ifGetWorkspaceBuildByJobIDorGetWorkspaceByIDis called.NonSystemActor/StillCascades— auditor (noResourceSystem) still pays the cascade and produces aNotAuthorizederror, proving the fast-path is gated correctly.Updates 12 existing dbauthz suite cases to expect the new
ResourceSystem.Readcheck ahead of the workspace/template-version check, withFailSystemObjectChecks()to force the slow path.Existing integration coverage in
TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent, MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded, ...}exercises Part 2 end-to-end and continues to pass.Footprint
make genValidation
go test ./coderd/database/dbauthz/— full suite, ~6sgo test -run TestPostWorkspaceAuth ./coderd/— instance-identity handler testsgo test -run TestProvisionerJob ./coderd/go test -run TestWorkspaceAgent ./coderd/go test ./coderd/provisionerdserver/gofmt -lcleanAlternatives considered
GetWorkspaceAgentsByInstanceIDto joinworkspace_resources/provisioner_jobsand filterjob.type = 'workspace_build'server-side, eliminating the filter loop entirely. Cleaner long-term, but changes generated SQL and is too much surface for a release-branch hotfix. Worth doing as a follow-up.