fix: make prebuild claiming durable and idempotent by SasSwart · Pull Request #23108 · coder/coder

SasSwart · 2026-03-16T13:06:20Z

Problem

When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost — the workspace is stuck with no self-healing path.

Additionally, regular (non-prebuild) workspaces had no way to opt out of the /reinit polling loop — agents would reconnect indefinitely to an endpoint that would never send them anything useful.

Root Cause

workspaceAgentReinit fetches the workspace (with its current owner_id) via GetWorkspaceByAgentID, but never checked whether a claim already happened. It only subscribed to pubsub for future events. The database already has durable claim state (owner_id changes from PrebuildsSystemUserID to the real user), but no layer ever consulted it on reconnection.

Solution

Server-side durable check with first-build-initiator gating

TOCTOU-safe ordering: Subscribe to pubsub claim events before any durable checks, so a claim that fires during the check is buffered in the channel rather than lost.

First-build-initiator gating: When !workspace.IsPrebuild() (owner is no longer the system user), look up the first build's InitiatorID. The prebuild reconciler always uses PrebuildsSystemUserID as the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes.

Regular workspace (first build initiator ≠ system user) → 409 Conflict, agent stops reconnecting
Claimed prebuild, build completed → pre-seed channel with reinit event and close it, transmitter delivers one-shot then exits
Claimed prebuild, build in-progress → fall through to pubsub subscription, agent waits for completion event
Unclaimed prebuild → pubsub subscription (existing happy path)

Declarative reinit events (defense-in-depth)

Added UserID field to ReinitializationEvent with JSON tags
Switched pubsub serialization from raw string to JSON (with backward-compat fallback for rolling upgrades)
Populated UserID at both the publish site and the durable check

Agent SDK: 409 handling

WaitForReinitLoop detects 409 Conflict from the server and closes the reinitEvents channel, cleanly exiting the retry goroutine.

Agent CLI: fixed two bugs + added reinitCtx

Closed channel (!ok): now blocks on <-ctx.Done() instead of continue, keeping the current agent running. Previously this would leak agents by skipping agnt.Close() and re-entering the loop.
Duplicate owner reinit: cancels reinitCtx (stops the reinit goroutine), then blocks on <-ctx.Done(). Previously continue would skip cleanup and create a new agent on the next loop iteration.
reinitCtx: a cancellable child of ctx passed to WaitForReinitLoop, allowing the agent to stop the reinit HTTP polling after reinit completes.

Agent-side idempotency

Tracks lastOwnerID in the agent reinit loop — duplicate events for the same owner are skipped.

Testing

"unclaimed prebuild receives reinit via pubsub": prebuild owned by system user, pubsub event triggers reinit
"claimed prebuild receives one-shot reinit on reconnect": first build by system user, owner changed, build completed → immediate reinit (no pubsub needed)
"claimed prebuild waits during in-progress claim build": claimed but build still running → no reinit until build completes
"regular workspace gets 409": first build by real user → 409 Conflict, agent stops polling
Updated claim publisher/listener tests: verify UserID survives JSON round-trip + backward compat with raw string payloads
Updated SSE round-trip test: verify UserID survives transmit → receive cycle

Fixes #22359

Rolling upgrade note

During a rolling deploy where old coderd instances coexist with new ones, the pubsub ReinitializationEvent has a new workspace_id field (JSON key workspace_id). Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling in WorkspaceID from context. The only visible effect during the upgrade window is that WorkspaceID may be the zero UUID in agent-side logs — this is cosmetic and resolves once all instances are updated.

mafredri

🤖 Agent on behalf of mafredri

The subscribe-before-check design for TOCTOU prevention is the right approach, the 409 contract for non-prebuilds is clean, and the test structure covers the important state transitions well. The backward-compat fallback in the pubsub listener is thoughtful. Good fix for the missing return after InternalServerError on line 1480.

Two related P1s in the pre-seed-and-close block, two P2s, and a few minor items across 6 inline comments.

Seven reviewers, one channel, zero buffer headroom. Hyvää yötä!

mafredri

🤖 Generated by Coder deep-review. 10 Tier 1 + 3 Tier 2 reviewers.

The architecture is sound. The subscribe-before-check ordering eliminates the TOCTOU window correctly, the durable state check covers the agent-reconnect scenario that motivated this PR, and the backward-compat JSON/raw-string fallback in the pubsub listener is clean. Fixing the pre-existing missing return after GetWorkspaceByAgentID failure is a good catch.

The main gap: the "claim build in progress" fall-through is the only path where the TOCTOU-safe subscription is actually load-bearing for claimed prebuilds, and it has no test. The helper infrastructure is already in place (claimPrebuild(..., false)). One P1, two P2s, two P3s, one observation, and four nits across 9 inline comments.

🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio

mafredri

🤖 Generated by Coder deep-review v2. 10 Tier 1 + 3 Tier 2 reviewers.

Good progress since v1. The defer close, the 409 for failed claim builds, the structured logger fix, and the rolling-upgrade comment rewording all address previous findings cleanly.

New panel caught two issues the first round missed: a rolling-upgrade infinite reinit loop (old agent + new server), and a cancelled claim build that passes the success condition. 1 P1, 3 P2s, 2 P3s, and 4 nits across 9 inline comments.

🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio

mafredri

The subscribe-before-check TOCTOU pattern is well-designed, the ?wait rolling-upgrade gating is a clean solution to the old-agent infinite-reinit problem, and the channel lifecycle (defer close, !ok handling, nil-channel-in-select) is textbook correct Go concurrency. The test suite is genuine: four real subtests exercising distinct server-side paths with proper DB state setup.

One state machine gap to look at (P2), one minor API contract issue (P3), and three nits across 6 inline comments.

Meruem: "A completed build with no error is successful regardless of cancellation status. Change the success condition to eliminate the class of bug."

Ryosuke: "The handler inlines ~60 lines of claim status business logic. A ClaimStatus method on a prebuilds service would make the gap visible as a missing enum case."

🤖 Generated by Coder deep-review v3. 9 Tier 1 + 3 Tier 2 reviewers.

mafredri

Except for the question regarding reinitEvents safety (I'd like to see human confirmation on it), all my remaining requests are simple enough to address, so pre-approving! 🏂🏻

When an agent connects to the /reinit SSE endpoint, the server now checks if the workspace was already claimed (owner differs from the prebuilds system user). If the claim build completed successfully, a reinit event is sent immediately so the agent picks up the new owner even when the original pubsub notification was missed. Also fixes a missing return after the GetWorkspaceByAgentID error path that previously caused fallthrough, and buffers the reinitEvents channel (size 1) so the durable check can write without blocking.

The pubsub publisher now JSON-marshals the full ReinitializationEvent instead of sending only the reason string. This allows additional fields (such as UserID) to survive the pubsub round-trip. The listener attempts JSON unmarshal first and falls back to treating the payload as a raw reason string for backward compatibility with older publishers. The publish call site in provisionerdserver now populates UserID with workspace.OwnerID.

Add two new subtests to TestReinit: 1. "agent receives reinit on reconnect after missed event" - Verifies the durable claim check: when an agent reconnects to a workspace that is already claimed (owner != PrebuildsSystemUserID) and the latest build completed successfully, the handler sends a reinit event immediately without requiring a pubsub message. 2. "agent does not receive premature reinit during in-progress claim build" - Verifies that the durable claim check does NOT fire when the latest build's provisioner job has not completed yet. This prevents premature reinit during an in-progress claim build. The existing test is wrapped in a subtest ("agent receives reinit via pubsub") to maintain consistent structure.

Replace the durable claim check in workspaceAgentReinit with full gating logic that distinguishes post-claim/never-prebuild agents from pre-claim agents reconnecting after a missed pubsub event. The previous check sent reinit to ANY agent when !workspace.IsPrebuild() and the build had completed. This caused an infinite reinit loop: after reinit, the new agent reconnects, the durable check fires again, and so on. Now we compare the agent's build (via resource.JobID) against the latest build's JobID: - If they match, this is a post-claim or never-a-prebuild agent: return 409 Conflict immediately. - If different and the claim build completed successfully, send a one-shot reinit event via a closed channel, then return. - If different and the claim build is still in progress (or failed), fall through to the pubsub subscription so the agent is notified when it completes.

When the server returns HTTP 409 from /reinit, it means the workspace is not an unclaimed prebuild (either post-claim or never a prebuild). In this case, WaitForReinitLoop now stops retrying, closes the channel, and returns instead of retrying indefinitely.

When WaitForReinitLoop closes the reinitEvents channel (indicating the workspace is not an unclaimed prebuild), the agent now detects the closure via the two-value channel receive form and sets the channel to nil so it blocks forever in the select. This allows the agent to continue running normally without reinit capability.

Fix 3 existing subtests: - 'agent receives reinit via pubsub': use PrebuildsSystemUserID as owner so workspace.IsPrebuild() is true and handler goes to pubsub wait path - 'agent receives reinit on reconnect after missed event': create two-build prebuild claim pattern (Build 1 prebuild + Build 2 claim) - 'agent does not receive premature reinit during in-progress claim build': same two-build pattern with Starting() for in-progress Build 2 Add 2 new subtests: - 'post-claim agent gets 409': agent on latest build of non-prebuild workspace gets 409 Conflict - 'pre-claim agent receives one-shot reinit and stream closes': verify one-shot reinit and that calling WaitForReinit again also succeeds Note: Tests 2 and 5 (reconnect after missed event, one-shot reinit) are blocked by GetAuthenticatedWorkspaceAgentAndBuildByAuthToken not allowing pre-claim agents from older builds to authenticate when a newer completed START build exists. The auth query needs a 'prebuild claim' case added.

… claiming

The previous Phase 5 broadened the SQL auth query to let pre-claim agents authenticate, which was unnecessary (token reuse already handles this) and broke 3 CI tests. This replaces it with a simpler approach: - Check if the workspace's first build was initiated by the prebuilds system user (PrebuildsSystemUserID) to determine if it was originally a prebuild. - Regular workspaces get a 409 Conflict, stopping the agent's reinit retry loop. - Claimed prebuilds with a completed claim build get a one-shot reinit event (channel is pre-seeded and closed). - Claimed prebuilds with in-progress builds fall through to the pubsub subscription path. Also fixes two bugs in cli/agent.go: - Closed reinit channel (from 409) now blocks on ctx.Done() instead of continue, preventing agent leaks. - Duplicate reinit for same owner cancels the reinit goroutine and blocks on ctx.Done() instead of continue, preventing repeated agent creation.

Let the completed-claim-build path fall through to the shared transmitter + error-logging switch instead of creating its own transmitter and silently returning.

- Separate DB errors from non-prebuild check (500 vs 409) - Handle failed claim builds with 500 instead of hanging - Fix race conditions: cancelSub() + non-blocking send, no close - Rewrite fragile in-progress claim test with positive assertions - Add format:"uuid" tags to ReinitializationEvent fields

- Return 409 (not 500) for permanently failed claim builds so the agent's WaitForReinitLoop treats them as terminal instead of retrying every ~10s forever. - Unify channel close in WaitForReinitLoop with defer close() at the top of the goroutine, removing two explicit close() calls and covering the retrier-exit path that previously leaked. - Add workspace_id to the structured logger after GetWorkspaceByAgentID so all downstream log messages are greppable. - Reword backward-compat comment in claim.go to clarify this PR introduces JSON encoding (not that it existed before).

…d check - Add 'wait' query parameter to WaitForReinit; server only performs the durable claim check when this param is present. Old agents (pre-this-PR) lack the duplicate-reinit guard, so they would enter an infinite reinit loop if we pre-seeded the channel on every connection. - Guard the claim-build success path against cancelled jobs: job.CanceledAt.Valid is now an explicit condition so a cancelled build is not misclassified as a success. - Add swagger annotation for the new 'wait' query parameter. - Add doc comment to WaitForReinitLoop explaining channel-close semantics (ctx cancel or terminal 409). - Fix misleading 409 log message in WaitForReinitLoop. - Fix channel-close comment in cli/agent.go to cover both close reasons (terminal 409 and context expiry). - Document the latest-build assumption and the in-progress comment in workspaceAgentReinit.

Rename ReinitializationEvent.UserID to OwnerID to align the struct field with its semantic meaning (every usage site populates it from workspace.OwnerID) and with the tracking variable in cli/agent.go (lastOwnerID). Replace the omitempty JSON tag with omitzero. encoding/json does not consider array zero values empty, so omitempty never omits uuid.Nil. omitzero (Go 1.24+) correctly handles [16]byte zero values.

…vent handling - Updated the ListenForWorkspaceClaims method to return a receive-only channel for reinitialization events, removing the need for the caller to manage the channel. - Adjusted related tests to accommodate the new method signature and ensure proper event handling. - Enhanced comments for clarity on channel behavior and subscription management.

github-actions Bot assigned SasSwart Mar 16, 2026

SasSwart marked this pull request as ready for review March 20, 2026 13:28

SasSwart force-pushed the database-abstraction-cjvy branch from afdaaee to 2858d68 Compare March 20, 2026 13:29

SasSwart requested a review from mafredri March 20, 2026 14:45

mafredri requested changes Mar 20, 2026

View reviewed changes

SasSwart force-pushed the database-abstraction-cjvy branch 2 times, most recently from e848e80 to e89d110 Compare March 23, 2026 08:54

SasSwart requested a review from mafredri March 23, 2026 09:02

mafredri requested changes Mar 23, 2026

View reviewed changes

mafredri reviewed Mar 27, 2026

View reviewed changes

Comment thread coderd/workspaceagents.go Outdated

Comment thread coderd/workspaceagents.go

Comment thread cli/agent.go

Comment thread codersdk/agentsdk/agentsdk.go

Comment thread coderd/workspaceagents.go Outdated

Comment thread codersdk/agentsdk/agentsdk.go

mafredri approved these changes Mar 27, 2026

View reviewed changes

Comment thread coderd/prebuilds/claim.go

Comment thread cli/agent.go

Comment thread coderd/workspaceagents.go Outdated

Comment thread coderd/workspaceagents.go Outdated

Comment thread coderd/workspaceagents.go Outdated

Comment thread codersdk/agentsdk/agentsdk.go

SasSwart added 18 commits April 2, 2026 21:07

feat(agentsdk): add UserID field and JSON tags to ReinitializationEvent

ed8ff0c

feat(agent): skip redundant reinit events with same owner

66ac5d0

test(prebuilds): verify UserID in claim and SSE round-trip tests

bf81376

fix(database): allow pre-claim agents to authenticate during prebuild…

89654e6

… claiming

revert: remove broken Phase 5 (SQL change + resource.JobID gating)

886b8a0

refactor(coderd): remove duplicate transmitter in one-shot reinit path

91af405

Let the completed-claim-build path fall through to the shared transmitter + error-logging switch instead of creating its own transmitter and silently returning.

remove todo

d284ff6

SasSwart added 2 commits April 2, 2026 21:07

SasSwart force-pushed the database-abstraction-cjvy branch from 1195c77 to a0963d7 Compare April 2, 2026 21:35

SasSwart merged commit 5b6b771 into main Apr 2, 2026
29 checks passed

SasSwart deleted the database-abstraction-cjvy branch April 2, 2026 21:51

github-actions Bot locked and limited conversation to collaborators Apr 2, 2026

Conversation

SasSwart commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root Cause

Solution

Server-side durable check with first-build-initiator gating

Declarative reinit events (defense-in-depth)

Agent SDK: 409 handling

Agent CLI: fixed two bugs + added reinitCtx

Agent-side idempotency

Testing

Rolling upgrade note

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mafredri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SasSwart commented Mar 16, 2026 •

edited

Loading