Skip to content

fix(coderd): cut DB fan-out on agent instance-identity auth (backport #24973)#24982

Merged
f0ssel merged 1 commit into
release/2.33from
backport/pr-24973-to-release-2.33
May 5, 2026
Merged

fix(coderd): cut DB fan-out on agent instance-identity auth (backport #24973)#24982
f0ssel merged 1 commit into
release/2.33from
backport/pr-24973-to-release-2.33

Conversation

@f0ssel
Copy link
Copy Markdown
Member

@f0ssel f0ssel commented May 5, 2026

Backport of #24973 to release/2.33.

Summary

Restores v2.33.0-rc.2-equivalent query cost for agent instance-identity auth, which currently saturates the pgx pool when multiple agents share an instance ID. Customer report against rc.3 traced 233x Internal error fetching provisioner job resource 500s during a 50-minute incident window to this path.

Changes

  1. System fast-path on authorizeProvisionerJob (coderd/database/dbauthz/dbauthz.go): Short-circuits the per-job RBAC fan-out through GetWorkspaceBuildByJobID -> GetWorkspaceByID for AsSystemRestricted callers.
  2. Drop survivor re-fetch in handleAuthInstanceID (coderd/workspaceresourceauth.go): Captures the provisioner job alongside each candidate during the filter loop so the post-selection code reads it directly instead of re-querying.

Conflict resolution

One conflict in coderd/database/dbauthz/dbauthz_test.go: the TestAsAutostart test function (from an unrelated commit on main) was brought in as surrounding context during the cherry-pick. It was removed since it tests functionality (ResourceUserSecret.Read for the Autostart role) not present on the release branch.

Tests

  • TestAuthorizeProvisionerJob_SystemFastPath (3 sub-tests): all pass
  • TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/* (7 sub-tests): all pass

Generated by Coder Agents

Restores `v2.33.0-rc.2`-equivalent query cost for agent
instance-identity auth on `v2.33.0-rc.3`, which currently saturates the
pgx pool when multiple agents share an instance ID. Customer report
against rc.3 traced 233× `Internal error fetching provisioner job
resource. fetch related workspace build: context canceled` 500s during a
50-minute incident window to this path.

Backport to `release/2.33` will follow as a separate PR after this
merges.

[#24325](#24325) ("support multiple
agents with shared instance-identity auth") rewrote
`coderd/workspaceresourceauth.go::handleAuthInstanceID` to use the new
`:many` agent lookup followed by a per-candidate filter loop. Each
iteration synchronously calls `GetWorkspaceResourceByID` and
`GetProvisionerJobByID`. Both go through `dbauthz`, and both fan out
into the same `provisioner_job → workspace_build → workspace` cascade
because `authorizeProvisionerJob` always re-authorizes the workspace via
`GetWorkspaceBuildByJobID → GetWorkspaceByID`. The handler then
re-fetches resource and job again for the surviving agent.

Net effect on the agent-auth happy path:

| | SQL | RBAC |
|---|---|---|
| rc.2 baseline | 13 | 5 |
| rc.3 today, 1 agent | 19 | 7 |
| rc.3 today, 2 agents | 26 | 9 |
| **After this PR, 1 agent** | **6** | **3** |
| **After this PR, 2 agents** | **7** | **3** |

Under load, the rc.3 chain blocks on pool acquire and the request blows
past the 30s HTTP write timeout.

(`coderd/database/dbauthz/dbauthz.go`)

Add an `AsSystemRestricted` early-return at the top of
`authorizeProvisionerJob`. Instance-identity auth has already proven
cloud identity before reaching the DB layer, so re-authorizing the
workspace on every provisioner-job lookup is pure overhead. Existing
`GetWorkspaceAgentsByInstanceID` already uses the same fast-path
pattern.

```go
if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err == nil {
    return nil
}
```

(`coderd/workspaceresourceauth.go`)

Capture the provisioner job alongside each candidate during the filter
loop so the survivor lookup does not re-fetch resource and job after
selection. The previous code fired the resource→job→build→workspace
cascade twice for the surviving agent.

Adds `TestAuthorizeProvisionerJob_SystemFastPath` in
`coderd/database/dbauthz/dbauthz_test.go` with two sub-tests:
- `AsSystemRestricted/SkipsCascade` — strict mock fails the test if
`GetWorkspaceBuildByJobID` or `GetWorkspaceByID` is called.
- `NonSystemActor/StillCascades` — auditor (no `ResourceSystem`) still
pays the cascade and produces a `NotAuthorized` error, proving the
fast-path is gated correctly.

Updates 12 existing dbauthz suite cases to expect the new
`ResourceSystem.Read` check ahead of the workspace/template-version
check, with `FailSystemObjectChecks()` to force the slow path.

Existing integration coverage in
`TestPostWorkspaceAuthAWSInstanceIdentity/Ambiguous/{SingleAgent,
MultipleAgentsWithSelector, MultipleAgentsNoSelector, SubAgentExcluded,
...}` exercises Part 2 end-to-end and continues to pass.

- 3 files changed, +166/-48
- No SQL changes
- No `make gen`
- No migrations
- No audit-table updates

- [x] `go test ./coderd/database/dbauthz/` — full suite, ~6s
- [x] `go test -run TestPostWorkspaceAuth ./coderd/` — instance-identity
handler tests
- [x] `go test -run TestProvisionerJob ./coderd/`
- [x] `go test -run TestWorkspaceAgent ./coderd/`
- [x] `go test ./coderd/provisionerdserver/`
- [x] `gofmt -l` clean

- **SQL-side filter:** rewrite `GetWorkspaceAgentsByInstanceID` to join
`workspace_resources`/`provisioner_jobs` and filter `job.type =
'workspace_build'` server-side, eliminating the filter loop entirely.
Cleaner long-term, but changes generated SQL and is too much surface for
a release-branch hotfix. Worth doing as a follow-up.
- **Full revert of #24325:** removes the multi-agent feature outright;
conflicts with downstream commits
([#24441](#24441),
[#24438](#24438),
[#24313](#24313)). Reserved as
fallback if the surgical fix doesn't hold under load testing.
@f0ssel f0ssel enabled auto-merge (squash) May 5, 2026 19:44
@ibetitsmike ibetitsmike self-requested a review May 5, 2026 19:53
@f0ssel f0ssel merged commit f009c17 into release/2.33 May 5, 2026
29 checks passed
@f0ssel f0ssel deleted the backport/pr-24973-to-release-2.33 branch May 5, 2026 19:54
@github-actions github-actions Bot locked and limited conversation to collaborators May 5, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants