Skip to content

fix(coderd): cut DB fan-out on agent instance-identity auth#24973

Merged
f0ssel merged 1 commit into
mainfrom
fix/agent-auth-pool-starvation-main
May 5, 2026
Merged

fix(coderd): cut DB fan-out on agent instance-identity auth#24973
f0ssel merged 1 commit into
mainfrom
fix/agent-auth-pool-starvation-main

Conversation

@deansheather
Copy link
Copy Markdown
Member

Summary

Restores v2.33.0-rc.2-equivalent query cost for agent instance-identity auth on v2.33.0-rc.3, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233× Internal error fetching provisioner job resource. fetch related workspace build: context canceled 500s during a 50-minute incident window to this path.

Backport to release/2.33 will follow as a separate PR after this merges.

Root cause

#24325 ("support multiple agents with shared instance-identity auth") rewrote coderd/workspaceresourceauth.go::handleAuthInstanceID to use the new :many agent lookup followed by a per-candidate filter loop. Each iteration synchronously calls GetWorkspaceResourceByID and GetProvisionerJobByID. Both go through dbauthz, and both fan out into the same provisioner_job → workspace_build → workspace cascade because authorizeProvisionerJob always re-authorizes the workspace via GetWorkspaceBuildByJobID → GetWorkspaceByID. The handler then re-fetches resource and job again for the surviving agent.

Net effect on the agent-auth happy path:

SQL RBAC
rc.2 baseline 13 5
rc.3 today, 1 agent 19 7
rc.3 today, 2 agents 26 9
After this PR, 1 agent 6 3
After this PR, 2 agents 7 3

Under load, the rc.3 chain blocks on pool acquire and the request blows past the 30s HTTP write timeout.

Changes

1. System fast-path on authorizeProvisionerJob (coderd/database/dbauthz/dbauthz.go)

Add an AsSystemRestricted early-return at the top of authorizeProvisionerJob. Instance-identity auth has already proven cloud identity before reaching the DB layer, so re-authorizing the workspace on every provisioner-job lookup is pure overhead. Existing GetWorkspaceAgentsByInstanceID already uses the same fast-path pattern.

if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil {
    return nil
}

2. Drop survivor re-fetch in handleAuthInstanceID (coderd/workspaceresourceauth.go)

Capture the provisioner job alongside each candidate during the filter loop so the survivor lookup does not re-fetch resource and job after selection. The previous code fired the resource→job→build→workspace cascade twice for the surviving agent.

Tests

Adds TestAuthorizeProvisionerJob_SystemFastPath in coderd/database/dbauthz/dbauthz_test.go with two sub-tests:

  • AsSystemRestricted/SkipsCascade — strict mock fails the test if GetWorkspaceBuildByJobID or GetWorkspaceByID is called.
  • NonSystemActor/StillCascades — auditor (no ResourceSystem) still pays the cascade and produces a NotAuthorized error, proving the fast-path is gated correctly.

Updates 12 existing dbauthz suite cases to expect the new ResourceSystem.Read check ahead of the workspace/template-version check, with FailSystemObjectChecks() to force the slow path.

Existing integration coverage in TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent, MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded, ...} exercises Part 2 end-to-end and continues to pass.

Footprint

  • 3 files changed, +166/-48
  • No SQL changes
  • No make gen
  • No migrations
  • No audit-table updates

Validation

  • go test ./coderd/database/dbauthz/ — full suite, ~6s
  • go test -run TestPostWorkspaceAuth ./coderd/ — instance-identity handler tests
  • go test -run TestProvisionerJob ./coderd/
  • go test -run TestWorkspaceAgent ./coderd/
  • go test ./coderd/provisionerdserver/
  • gofmt -l clean

Alternatives considered

  • SQL-side filter: rewrite GetWorkspaceAgentsByInstanceID to join workspace_resources/provisioner_jobs and filter job.type = 'workspace_build' server-side, eliminating the filter loop entirely. Cleaner long-term, but changes generated SQL and is too much surface for a release-branch hotfix. Worth doing as a follow-up.
  • Full revert of feat: support multiple agents with shared instance-identity auth #24325: removes the multi-agent feature outright; conflicts with downstream commits (#24441, #24438, #24313). Reserved as fallback if the surgical fix doesn't hold under load testing.

Copy link
Copy Markdown
Member Author

/coder-agents-review

Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-targeted hotfix. The cascade analysis is solid, the two cuts (system fast-path + survivor re-fetch elimination) are minimal and correct, and the strict-mock test design is excellent. Seven of eighteen reviewers independently converged on the same pre-existing double-fetch in the filter loop, confirming the SQL-side join alternative is the right follow-up.

1 P3, 5 Nits, 1 Note. No blocking issues.

The per-candidate double-fetch (GetWorkspaceResourceByID internally calls GetProvisionerJobByID for auth, then the handler calls it again explicitly) is pre-existing from #24325 and correctly scoped out. Worth noting: the PR's query table (6 SQL for 1 agent, 7 for 2) understates remaining per-candidate cost by 1 SQL per agent because of this double-fetch. Not a problem for the incident fix, but the SQL-side join follow-up would eliminate it entirely. Consider filing a tracking ticket so it doesn't get lost.

The type-switch bypass for system actors (DEREM-8) is consistent with how system subjects work across the codebase: they carry wildcard permissions, so the switch was never a meaningful gate for them.

"The absence of a registered expectation is the assertion." (Bisky, on the strict-mock design)


coderd/workspaceresourceauth.go:162-170

Note [DEREM-7] Seven reviewers independently flagged the same pre-existing inefficiency: GetWorkspaceResourceByID (dbauthz.go:4946) internally calls GetProvisionerJobByID for authorization, then the handler calls GetProvisionerJobByID again at line 170 to capture the job. Same row, two SQL trips per candidate. With the fast-path both are cheap PK lookups, so this is not a pool-saturation risk, but it does mean per-candidate SQL cost is 3 (resource + internal job + explicit job) rather than the 2 implied by the PR description. The // TODO: Optimize this at dbauthz.go:4940 already marks the location. (Meruem, Hisoka, Razor, Killua, Kite, Ryosuke, Pariston)

🤖

🤖 This review was automatically generated with Coder Agents.

Comment thread coderd/database/dbauthz/dbauthz.go Outdated
Comment thread coderd/database/dbauthz/dbauthz_test.go Outdated
Comment thread coderd/database/dbauthz/dbauthz.go Outdated
Comment thread coderd/workspaceresourceauth.go Outdated
Comment thread coderd/workspaceresourceauth.go Outdated
Comment thread coderd/database/dbauthz/dbauthz_test.go
@deansheather deansheather force-pushed the fix/agent-auth-pool-starvation-main branch from 22fb8ee to 387e06d Compare May 5, 2026 17:09
@deansheather
Copy link
Copy Markdown
Member Author

/coder-agents-review

Restores rc.2-equivalent query cost for agent instance-identity auth on
v2.33.0-rc.3, which currently saturates the pgx pool when multiple
agents share an instance ID.

Two changes:

1. Add a system fast-path to authorizeProvisionerJob in dbauthz so that
   AsSystemRestricted callers skip the per-job RBAC fan-out through
   GetWorkspaceBuildByJobID -> GetWorkspaceByID. Instance-identity auth
   has already proven cloud identity before reaching the database layer,
   so re-authorizing the workspace on every provisioner-job lookup is
   pure overhead. Existing GetWorkspaceAgentsByInstanceID already uses
   the same fast-path pattern.

2. Capture the provisioner job alongside each candidate during the
   filter loop in handleAuthInstanceID so the survivor lookup does not
   re-fetch the resource and job. The previous code fired the
   resource->job->build->workspace cascade twice for the surviving
   agent.

Combined effect on the agent-auth happy path (1 candidate):
  rc.2 baseline:  13 SQL + 5 RBAC
  rc.3 today:     19 SQL + 7 RBAC
  After fix:       6 SQL + 3 RBAC

Adds TestAuthorizeProvisionerJob_SystemFastPath covering both the
system-restricted bypass and the non-system cascade path. Updates 12
existing dbauthz suite cases to expect the new ResourceSystem.Read
check ahead of the workspace/template-version check, with
FailSystemObjectChecks() to force the slow path.
@deansheather deansheather force-pushed the fix/agent-auth-pool-starvation-main branch from 387e06d to 3d31209 Compare May 5, 2026 17:31
Copy link
Copy Markdown

@coder-agents-review coder-agents-review Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All six R1 findings addressed cleanly in 387e06d. The invariant comment (DEREM-3 fix) is well-worded and goes beyond what was suggested. The new TemplateVersion/SkipsCascade sub-test (DEREM-6 fix) is a proper independent sub-test, not a patched-in assertion.

1 P4, 2 Nits new in R2. Nothing blocking.

CI shows test-go-pg and test-go-pg-17 failing. Four reviewers ran the dbauthz suite and handler tests locally without issues. The failures appear unrelated to this PR's changes, but worth confirming before merging to a release branch.

Nit [DEREM-12] The PR description says "two sub-tests" under Tests but the function now has three (after the DEREM-6 fix). The footprint stat (+166/-48) is also stale; actual is +200/-48. Minor housekeeping. (Mafu-san)

"If the re-fetch block were accidentally reintroduced, every functional test would pass while the two extra SQL queries silently returned." (Bisky, on the structural enforcement gap)

🤖 This review was automatically generated with Coder Agents.

}
if job.Type == database.ProvisionerJobTypeWorkspaceBuild {
buildAgents = append(buildAgents, candidate)
buildCandidates = append(buildCandidates, instanceCandidate{
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P4 [DEREM-11] Part 1 (fast-path) is enforced by strict-mock tests that fail if the cascade runs. Part 2 (re-fetch elimination) has no equivalent enforcement. The instanceCandidate struct makes reintroduction structurally impractical (you'd need to deliberately bypass the captured selected.job), so the risk is low. A mock-based DB call counter would be fragile and couple to internals. Noting the gap rather than requesting a test. (Bisky P3, downgraded: structural design provides implicit enforcement)

🤖

// That cascade adds 2 SQL queries + 1 RBAC eval per provisioner-job lookup
// and saturates the pgx pool when called repeatedly from agent
// instance-identity auth (see incident report against v2.33.0-rc.3).
func TestAuthorizeProvisionerJob_SystemFastPath(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [DEREM-12] The PR description says "two sub-tests" under Tests but this function now has three (after DEREM-6 fix added TemplateVersion/SkipsCascade). The footprint stat (+166/-48) is also stale; actual is +200/-48. Consider updating the description. (Mafu-san)

🤖


authorizer := rbac.NewAuthorizer(prometheus.NewRegistry())

t.Run("AsSystemRestricted/SkipsCascade", func(t *testing.T) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit [DEREM-13] "AsSystemRestricted/SkipsCascade" silently covers WorkspaceBuild while its sibling is "AsSystemRestricted/TemplateVersion/SkipsCascade". Renaming to "AsSystemRestricted/WorkspaceBuild/SkipsCascade" would make the pair symmetric and grep-friendly. (Gon)

🤖

@f0ssel f0ssel marked this pull request as ready for review May 5, 2026 19:14
@f0ssel f0ssel merged commit e48d121 into main May 5, 2026
31 checks passed
@f0ssel f0ssel deleted the fix/agent-auth-pool-starvation-main branch May 5, 2026 19:15
@github-actions github-actions Bot locked and limited conversation to collaborators May 5, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants