fix: make prebuild claiming durable and idempotent#23108
Conversation
afdaaee to
2858d68
Compare
mafredri
left a comment
There was a problem hiding this comment.
🤖 Agent on behalf of mafredri
The subscribe-before-check design for TOCTOU prevention is the right approach, the 409 contract for non-prebuilds is clean, and the test structure covers the important state transitions well. The backward-compat fallback in the pubsub listener is thoughtful. Good fix for the missing return after InternalServerError on line 1480.
Two related P1s in the pre-seed-and-close block, two P2s, and a few minor items across 6 inline comments.
Seven reviewers, one channel, zero buffer headroom. Hyvää yötä!
e848e80 to
e89d110
Compare
mafredri
left a comment
There was a problem hiding this comment.
🤖 Generated by Coder deep-review. 10 Tier 1 + 3 Tier 2 reviewers.
The architecture is sound. The subscribe-before-check ordering eliminates the TOCTOU window correctly, the durable state check covers the agent-reconnect scenario that motivated this PR, and the backward-compat JSON/raw-string fallback in the pubsub listener is clean. Fixing the pre-existing missing return after GetWorkspaceByAgentID failure is a good catch.
The main gap: the "claim build in progress" fall-through is the only path where the TOCTOU-safe subscription is actually load-bearing for claimed prebuilds, and it has no test. The helper infrastructure is already in place (claimPrebuild(..., false)). One P1, two P2s, two P3s, one observation, and four nits across 9 inline comments.
🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio
mafredri
left a comment
There was a problem hiding this comment.
🤖 Generated by Coder deep-review v2. 10 Tier 1 + 3 Tier 2 reviewers.
Good progress since v1. The defer close, the 409 for failed claim builds, the structured logger fix, and the rolling-upgrade comment rewording all address previous findings cleanly.
New panel caught two issues the first round missed: a rolling-upgrade infinite reinit loop (old agent + new server), and a cancelled claim build that passes the success condition. 1 P1, 3 P2s, 2 P3s, and 4 nits across 9 inline comments.
🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio
mafredri
left a comment
There was a problem hiding this comment.
The subscribe-before-check TOCTOU pattern is well-designed, the ?wait rolling-upgrade gating is a clean solution to the old-agent infinite-reinit problem, and the channel lifecycle (defer close, !ok handling, nil-channel-in-select) is textbook correct Go concurrency. The test suite is genuine: four real subtests exercising distinct server-side paths with proper DB state setup.
One state machine gap to look at (P2), one minor API contract issue (P3), and three nits across 6 inline comments.
Meruem: "A completed build with no error is successful regardless of cancellation status. Change the success condition to eliminate the class of bug."
Ryosuke: "The handler inlines ~60 lines of claim status business logic. A
ClaimStatusmethod on a prebuilds service would make the gap visible as a missing enum case."
🤖 Generated by Coder deep-review v3. 9 Tier 1 + 3 Tier 2 reviewers.
mafredri
left a comment
There was a problem hiding this comment.
Except for the question regarding reinitEvents safety (I'd like to see human confirmation on it), all my remaining requests are simple enough to address, so pre-approving! 🏂🏻
When an agent connects to the /reinit SSE endpoint, the server now checks if the workspace was already claimed (owner differs from the prebuilds system user). If the claim build completed successfully, a reinit event is sent immediately so the agent picks up the new owner even when the original pubsub notification was missed. Also fixes a missing return after the GetWorkspaceByAgentID error path that previously caused fallthrough, and buffers the reinitEvents channel (size 1) so the durable check can write without blocking.
The pubsub publisher now JSON-marshals the full ReinitializationEvent instead of sending only the reason string. This allows additional fields (such as UserID) to survive the pubsub round-trip. The listener attempts JSON unmarshal first and falls back to treating the payload as a raw reason string for backward compatibility with older publishers. The publish call site in provisionerdserver now populates UserID with workspace.OwnerID.
Add two new subtests to TestReinit:
1. "agent receives reinit on reconnect after missed event" - Verifies
the durable claim check: when an agent reconnects to a workspace
that is already claimed (owner != PrebuildsSystemUserID) and the
latest build completed successfully, the handler sends a reinit
event immediately without requiring a pubsub message.
2. "agent does not receive premature reinit during in-progress claim
build" - Verifies that the durable claim check does NOT fire when
the latest build's provisioner job has not completed yet. This
prevents premature reinit during an in-progress claim build.
The existing test is wrapped in a subtest ("agent receives reinit via
pubsub") to maintain consistent structure.
Replace the durable claim check in workspaceAgentReinit with full gating logic that distinguishes post-claim/never-prebuild agents from pre-claim agents reconnecting after a missed pubsub event. The previous check sent reinit to ANY agent when !workspace.IsPrebuild() and the build had completed. This caused an infinite reinit loop: after reinit, the new agent reconnects, the durable check fires again, and so on. Now we compare the agent's build (via resource.JobID) against the latest build's JobID: - If they match, this is a post-claim or never-a-prebuild agent: return 409 Conflict immediately. - If different and the claim build completed successfully, send a one-shot reinit event via a closed channel, then return. - If different and the claim build is still in progress (or failed), fall through to the pubsub subscription so the agent is notified when it completes.
When the server returns HTTP 409 from /reinit, it means the workspace is not an unclaimed prebuild (either post-claim or never a prebuild). In this case, WaitForReinitLoop now stops retrying, closes the channel, and returns instead of retrying indefinitely.
When WaitForReinitLoop closes the reinitEvents channel (indicating the workspace is not an unclaimed prebuild), the agent now detects the closure via the two-value channel receive form and sets the channel to nil so it blocks forever in the select. This allows the agent to continue running normally without reinit capability.
Fix 3 existing subtests: - 'agent receives reinit via pubsub': use PrebuildsSystemUserID as owner so workspace.IsPrebuild() is true and handler goes to pubsub wait path - 'agent receives reinit on reconnect after missed event': create two-build prebuild claim pattern (Build 1 prebuild + Build 2 claim) - 'agent does not receive premature reinit during in-progress claim build': same two-build pattern with Starting() for in-progress Build 2 Add 2 new subtests: - 'post-claim agent gets 409': agent on latest build of non-prebuild workspace gets 409 Conflict - 'pre-claim agent receives one-shot reinit and stream closes': verify one-shot reinit and that calling WaitForReinit again also succeeds Note: Tests 2 and 5 (reconnect after missed event, one-shot reinit) are blocked by GetAuthenticatedWorkspaceAgentAndBuildByAuthToken not allowing pre-claim agents from older builds to authenticate when a newer completed START build exists. The auth query needs a 'prebuild claim' case added.
The previous Phase 5 broadened the SQL auth query to let pre-claim agents authenticate, which was unnecessary (token reuse already handles this) and broke 3 CI tests. This replaces it with a simpler approach: - Check if the workspace's first build was initiated by the prebuilds system user (PrebuildsSystemUserID) to determine if it was originally a prebuild. - Regular workspaces get a 409 Conflict, stopping the agent's reinit retry loop. - Claimed prebuilds with a completed claim build get a one-shot reinit event (channel is pre-seeded and closed). - Claimed prebuilds with in-progress builds fall through to the pubsub subscription path. Also fixes two bugs in cli/agent.go: - Closed reinit channel (from 409) now blocks on ctx.Done() instead of continue, preventing agent leaks. - Duplicate reinit for same owner cancels the reinit goroutine and blocks on ctx.Done() instead of continue, preventing repeated agent creation.
Let the completed-claim-build path fall through to the shared transmitter + error-logging switch instead of creating its own transmitter and silently returning.
- Separate DB errors from non-prebuild check (500 vs 409) - Handle failed claim builds with 500 instead of hanging - Fix race conditions: cancelSub() + non-blocking send, no close - Rewrite fragile in-progress claim test with positive assertions - Add format:"uuid" tags to ReinitializationEvent fields
- Return 409 (not 500) for permanently failed claim builds so the agent's WaitForReinitLoop treats them as terminal instead of retrying every ~10s forever. - Unify channel close in WaitForReinitLoop with defer close() at the top of the goroutine, removing two explicit close() calls and covering the retrier-exit path that previously leaked. - Add workspace_id to the structured logger after GetWorkspaceByAgentID so all downstream log messages are greppable. - Reword backward-compat comment in claim.go to clarify this PR introduces JSON encoding (not that it existed before).
…d check - Add 'wait' query parameter to WaitForReinit; server only performs the durable claim check when this param is present. Old agents (pre-this-PR) lack the duplicate-reinit guard, so they would enter an infinite reinit loop if we pre-seeded the channel on every connection. - Guard the claim-build success path against cancelled jobs: job.CanceledAt.Valid is now an explicit condition so a cancelled build is not misclassified as a success. - Add swagger annotation for the new 'wait' query parameter. - Add doc comment to WaitForReinitLoop explaining channel-close semantics (ctx cancel or terminal 409). - Fix misleading 409 log message in WaitForReinitLoop. - Fix channel-close comment in cli/agent.go to cover both close reasons (terminal 409 and context expiry). - Document the latest-build assumption and the in-progress comment in workspaceAgentReinit.
Rename ReinitializationEvent.UserID to OwnerID to align the struct field with its semantic meaning (every usage site populates it from workspace.OwnerID) and with the tracking variable in cli/agent.go (lastOwnerID). Replace the omitempty JSON tag with omitzero. encoding/json does not consider array zero values empty, so omitempty never omits uuid.Nil. omitzero (Go 1.24+) correctly handles [16]byte zero values.
…vent handling - Updated the ListenForWorkspaceClaims method to return a receive-only channel for reinitialization events, removing the need for the caller to manage the channel. - Adjusted related tests to accommodate the new method signature and ensure proper event handling. - Enhanced comments for clarity on channel behavior and subscription management.
1195c77 to
a0963d7
Compare
Problem
When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost — the workspace is stuck with no self-healing path.
Additionally, regular (non-prebuild) workspaces had no way to opt out of the
/reinitpolling loop — agents would reconnect indefinitely to an endpoint that would never send them anything useful.Root Cause
workspaceAgentReinitfetches the workspace (with its currentowner_id) viaGetWorkspaceByAgentID, but never checked whether a claim already happened. It only subscribed to pubsub for future events. The database already has durable claim state (owner_idchanges fromPrebuildsSystemUserIDto the real user), but no layer ever consulted it on reconnection.Solution
Server-side durable check with first-build-initiator gating
TOCTOU-safe ordering: Subscribe to pubsub claim events before any durable checks, so a claim that fires during the check is buffered in the channel rather than lost.
First-build-initiator gating: When
!workspace.IsPrebuild()(owner is no longer the system user), look up the first build'sInitiatorID. The prebuild reconciler always usesPrebuildsSystemUserIDas the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes.Declarative reinit events (defense-in-depth)
UserIDfield toReinitializationEventwith JSON tagsUserIDat both the publish site and the durable checkAgent SDK: 409 handling
WaitForReinitLoopdetects 409 Conflict from the server and closes thereinitEventschannel, cleanly exiting the retry goroutine.Agent CLI: fixed two bugs + added reinitCtx
!ok): now blocks on<-ctx.Done()instead ofcontinue, keeping the current agent running. Previously this would leak agents by skippingagnt.Close()and re-entering the loop.reinitCtx(stops the reinit goroutine), then blocks on<-ctx.Done(). Previouslycontinuewould skip cleanup and create a new agent on the next loop iteration.reinitCtx: a cancellable child ofctxpassed toWaitForReinitLoop, allowing the agent to stop the reinit HTTP polling after reinit completes.Agent-side idempotency
Tracks
lastOwnerIDin the agent reinit loop — duplicate events for the same owner are skipped.Testing
UserIDsurvives JSON round-trip + backward compat with raw string payloadsUserIDsurvives transmit → receive cycleFixes #22359
Rolling upgrade note
During a rolling deploy where old coderd instances coexist with new ones, the pubsub
ReinitializationEventhas a newworkspace_idfield (JSON keyworkspace_id). Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling inWorkspaceIDfrom context. The only visible effect during the upgrade window is thatWorkspaceIDmay be the zero UUID in agent-side logs — this is cosmetic and resolves once all instances are updated.