Skip to content

fix(coderd): only send prebuild claim reinit for the claim build#26644

Open
rowansmithau wants to merge 1 commit into
release/2.34from
rowan/backport_26548_234
Open

fix(coderd): only send prebuild claim reinit for the claim build#26644
rowansmithau wants to merge 1 commit into
release/2.34from
rowan/backport_26548_234

Conversation

@rowansmithau

Copy link
Copy Markdown
Member

backport of #26548 for 2.34

Problem

#23108 made prebuild claim delivery durable: when an agent connects to /api/v2/workspaceagents/me/reinit?wait=true, the handler checks whether the workspace's first build was created by the prebuilds system user and whether its latest build succeeded, and if so pre-seeds a prebuild_claimed reinitialization event in case the original pubsub event was missed.

The check does not verify that the latest build is the claim build, so it keeps firing for the rest of the workspace's life. Any workspace that was claimed from a prebuild receives a spurious "prebuild claimed" reinit every time its agent (re)opens the /reinit connection: after every agent restart, every coderd deploy or replica restart, and every dropped SSE connection. Each one shuts the agent down and reinitializes it, killing SSH/IDE sessions and re-running startup scripts. In our deployment, where most workspaces are claimed from prebuilds, this caused fleet-wide "agent disconnected" blips whenever a coderd replica restarted, and a few workspaces whose container exits when the agent restarts went into a restart loop every 15-60 minutes. The agent-side dedup (lastOwnerID in cli/agent.go) only suppresses the second event within one agent process, so every new agent process takes at least one spurious restart.

Fix

Only seed the reinitialization event while the latest build is the claim build itself, determined from the build job's input (prebuilt_workspace_stage), the same signal provisionerdserver uses when publishing the claim event:

  • Latest build is the claim build: behavior unchanged (seed when the job succeeded, 409 when it failed permanently, wait on pubsub while it is in progress).
  • Latest build is still a prebuilds-initiated build (claim build not created yet): fall through to the pubsub subscription, which delivers the claim event when the claim build completes.
  • Latest build is any later user-initiated build: the claim was already handled, so return 409 and the agent stops polling, the same as a regular workspace.

dbfake gains a MarkPrebuiltWorkspaceClaim() builder option so tests can model claim builds' job input, and the existing TestReinit claim subtests now use it. A new subtest covers the long-claimed workspace case.

One deliberate behavior change worth calling out: if a claim build fails and the owner retries with another start build, the handler now returns 409 for that retry build rather than seeding a reinit. This matches the existing treatment of failed claim builds as terminal for the reinit poller.

Verification

  • go test ./coderd/ -run TestReinit against Postgres 17: all subtests pass, including the new workspace claimed in the past gets 409 case.
  • gofmt, go vet, and golangci-lint (v1.64.8) are clean on the touched packages.
  • The fix mirrors behavior validated by hand against an affected deployment: for a long-claimed workspace, /reinit?wait=true returned the seeded prebuild_claimed event on every connection before the change and a 409 afterwards.

Note: this branch was prepared in an environment without the full local toolchain, so the repo's pre-commit hook (make pre-commit) was not run locally; relying on CI for the full gen/fmt/lint suite. Opening as a draft mainly to report the issue and propose a fix; happy to rework it to the maintainers' preferred approach.

)

## Problem

#23108 made prebuild claim delivery durable: when an agent connects to
`/api/v2/workspaceagents/me/reinit?wait=true`, the handler checks
whether the workspace's first build was created by the prebuilds system
user and whether its latest build succeeded, and if so pre-seeds a
`prebuild_claimed` reinitialization event in case the original pubsub
event was missed.

The check does not verify that the latest build is the claim build, so
it keeps firing for the rest of the workspace's life. Any workspace that
was claimed from a prebuild receives a spurious "prebuild claimed"
reinit every time its agent (re)opens the `/reinit` connection: after
every agent restart, every coderd deploy or replica restart, and every
dropped SSE connection. Each one shuts the agent down and reinitializes
it, killing SSH/IDE sessions and re-running startup scripts. In our
deployment, where most workspaces are claimed from prebuilds, this
caused fleet-wide "agent disconnected" blips whenever a coderd replica
restarted, and a few workspaces whose container exits when the agent
restarts went into a restart loop every 15-60 minutes. The agent-side
dedup (`lastOwnerID` in `cli/agent.go`) only suppresses the second event
within one agent process, so every new agent process takes at least one
spurious restart.

## Fix

Only seed the reinitialization event while the latest build is the claim
build itself, determined from the build job's input
(`prebuilt_workspace_stage`), the same signal `provisionerdserver` uses
when publishing the claim event:

- Latest build is the claim build: behavior unchanged (seed when the job
succeeded, 409 when it failed permanently, wait on pubsub while it is in
progress).
- Latest build is still a prebuilds-initiated build (claim build not
created yet): fall through to the pubsub subscription, which delivers
the claim event when the claim build completes.
- Latest build is any later user-initiated build: the claim was already
handled, so return 409 and the agent stops polling, the same as a
regular workspace.

`dbfake` gains a `MarkPrebuiltWorkspaceClaim()` builder option so tests
can model claim builds' job input, and the existing `TestReinit` claim
subtests now use it. A new subtest covers the long-claimed workspace
case.

One deliberate behavior change worth calling out: if a claim build fails
and the owner retries with another start build, the handler now returns
409 for that retry build rather than seeding a reinit. This matches the
existing treatment of failed claim builds as terminal for the reinit
poller.

## Verification

- `go test ./coderd/ -run TestReinit` against Postgres 17: all subtests
pass, including the new `workspace claimed in the past gets 409` case.
- `gofmt`, `go vet`, and `golangci-lint` (v1.64.8) are clean on the
touched packages.
- The fix mirrors behavior validated by hand against an affected
deployment: for a long-claimed workspace, `/reinit?wait=true` returned
the seeded `prebuild_claimed` event on every connection before the
change and a 409 afterwards.

Note: this branch was prepared in an environment without the full local
toolchain, so the repo's pre-commit hook (`make pre-commit`) was not run
locally; relying on CI for the full gen/fmt/lint suite. Opening as a
draft mainly to report the issue and propose a fix; happy to rework it
to the maintainers' preferred approach.

---------

Co-authored-by: Sas Swart <sas.swart.cdk@gmail.com>
@rowansmithau rowansmithau self-assigned this Jun 24, 2026
@rowansmithau rowansmithau added the cherry-pick/v2.34 Cherry-pick PR targeting release/2.34 label Jun 24, 2026
@rowansmithau rowansmithau requested review from SasSwart and f0ssel June 24, 2026 00:30
@rowansmithau rowansmithau marked this pull request as ready for review June 24, 2026 00:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cherry-pick/v2.34 Cherry-pick PR targeting release/2.34

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants