Skip to content

fix: make prebuild claiming durable and idempotent#23108

Merged
SasSwart merged 20 commits into
mainfrom
database-abstraction-cjvy
Apr 2, 2026
Merged

fix: make prebuild claiming durable and idempotent#23108
SasSwart merged 20 commits into
mainfrom
database-abstraction-cjvy

Conversation

@SasSwart

@SasSwart SasSwart commented Mar 16, 2026

Copy link
Copy Markdown
Contributor

Problem

When a prebuilt workspace is claimed, the agent reinitializes via a single fire-and-forget pubsub event over SSE. If the agent's SSE connection is interrupted at claim time, the event is permanently lost — the workspace is stuck with no self-healing path.

Additionally, regular (non-prebuild) workspaces had no way to opt out of the /reinit polling loop — agents would reconnect indefinitely to an endpoint that would never send them anything useful.

Root Cause

workspaceAgentReinit fetches the workspace (with its current owner_id) via GetWorkspaceByAgentID, but never checked whether a claim already happened. It only subscribed to pubsub for future events. The database already has durable claim state (owner_id changes from PrebuildsSystemUserID to the real user), but no layer ever consulted it on reconnection.

Solution

Server-side durable check with first-build-initiator gating

TOCTOU-safe ordering: Subscribe to pubsub claim events before any durable checks, so a claim that fires during the check is buffered in the channel rather than lost.

First-build-initiator gating: When !workspace.IsPrebuild() (owner is no longer the system user), look up the first build's InitiatorID. The prebuild reconciler always uses PrebuildsSystemUserID as the initiator. This distinguishes claimed prebuilds from regular workspaces without any SQL schema changes.

  • Regular workspace (first build initiator ≠ system user) → 409 Conflict, agent stops reconnecting
  • Claimed prebuild, build completed → pre-seed channel with reinit event and close it, transmitter delivers one-shot then exits
  • Claimed prebuild, build in-progress → fall through to pubsub subscription, agent waits for completion event
  • Unclaimed prebuild → pubsub subscription (existing happy path)

Declarative reinit events (defense-in-depth)

  • Added UserID field to ReinitializationEvent with JSON tags
  • Switched pubsub serialization from raw string to JSON (with backward-compat fallback for rolling upgrades)
  • Populated UserID at both the publish site and the durable check

Agent SDK: 409 handling

WaitForReinitLoop detects 409 Conflict from the server and closes the reinitEvents channel, cleanly exiting the retry goroutine.

Agent CLI: fixed two bugs + added reinitCtx

  • Closed channel (!ok): now blocks on <-ctx.Done() instead of continue, keeping the current agent running. Previously this would leak agents by skipping agnt.Close() and re-entering the loop.
  • Duplicate owner reinit: cancels reinitCtx (stops the reinit goroutine), then blocks on <-ctx.Done(). Previously continue would skip cleanup and create a new agent on the next loop iteration.
  • reinitCtx: a cancellable child of ctx passed to WaitForReinitLoop, allowing the agent to stop the reinit HTTP polling after reinit completes.

Agent-side idempotency

Tracks lastOwnerID in the agent reinit loop — duplicate events for the same owner are skipped.

Testing

  • "unclaimed prebuild receives reinit via pubsub": prebuild owned by system user, pubsub event triggers reinit
  • "claimed prebuild receives one-shot reinit on reconnect": first build by system user, owner changed, build completed → immediate reinit (no pubsub needed)
  • "claimed prebuild waits during in-progress claim build": claimed but build still running → no reinit until build completes
  • "regular workspace gets 409": first build by real user → 409 Conflict, agent stops polling
  • Updated claim publisher/listener tests: verify UserID survives JSON round-trip + backward compat with raw string payloads
  • Updated SSE round-trip test: verify UserID survives transmit → receive cycle

Fixes #22359

Rolling upgrade note

During a rolling deploy where old coderd instances coexist with new ones, the pubsub ReinitializationEvent has a new workspace_id field (JSON key workspace_id). Old publishers send a raw reason string instead of JSON; the new listener gracefully falls back by treating the entire payload as the reason and filling in WorkspaceID from context. The only visible effect during the upgrade window is that WorkspaceID may be the zero UUID in agent-side logs — this is cosmetic and resolves once all instances are updated.

@SasSwart SasSwart marked this pull request as ready for review March 20, 2026 13:28
@SasSwart SasSwart force-pushed the database-abstraction-cjvy branch from afdaaee to 2858d68 Compare March 20, 2026 13:29
@SasSwart SasSwart requested a review from mafredri March 20, 2026 14:45

@mafredri mafredri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Agent on behalf of mafredri

The subscribe-before-check design for TOCTOU prevention is the right approach, the 409 contract for non-prebuilds is clean, and the test structure covers the important state transitions well. The backward-compat fallback in the pubsub listener is thoughtful. Good fix for the missing return after InternalServerError on line 1480.

Two related P1s in the pre-seed-and-close block, two P2s, and a few minor items across 6 inline comments.

Seven reviewers, one channel, zero buffer headroom. Hyvää yötä!

Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents_test.go Outdated
Comment thread codersdk/agentsdk/agentsdk.go Outdated
Comment thread coderd/prebuilds/claim.go
@SasSwart SasSwart force-pushed the database-abstraction-cjvy branch 2 times, most recently from e848e80 to e89d110 Compare March 23, 2026 08:54
@SasSwart SasSwart requested a review from mafredri March 23, 2026 09:02

@mafredri mafredri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Generated by Coder deep-review. 10 Tier 1 + 3 Tier 2 reviewers.

The architecture is sound. The subscribe-before-check ordering eliminates the TOCTOU window correctly, the durable state check covers the agent-reconnect scenario that motivated this PR, and the backward-compat JSON/raw-string fallback in the pubsub listener is clean. Fixing the pre-existing missing return after GetWorkspaceByAgentID failure is a good catch.

The main gap: the "claim build in progress" fall-through is the only path where the TOCTOU-safe subscription is actually load-bearing for claimed prebuilds, and it has no test. The helper infrastructure is already in place (claimPrebuild(..., false)). One P1, two P2s, two P3s, one observation, and four nits across 9 inline comments.

🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio

Comment thread coderd/workspaceagents_test.go
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread codersdk/agentsdk/agentsdk.go Outdated
Comment thread coderd/workspaceagents.go
Comment thread coderd/workspaceagents.go
Comment thread codersdk/agentsdk/agentsdk.go Outdated
Comment thread coderd/prebuilds/claim.go Outdated
Comment thread coderd/workspaceagents.go

@mafredri mafredri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 Generated by Coder deep-review v2. 10 Tier 1 + 3 Tier 2 reviewers.

Good progress since v1. The defer close, the 409 for failed claim builds, the structured logger fix, and the rolling-upgrade comment rewording all address previous findings cleanly.

New panel caught two issues the first round missed: a rolling-upgrade infinite reinit loop (old agent + new server), and a cancelled claim build that passes the success condition. 1 P1, 3 P2s, 2 P3s, and 4 nits across 9 inline comments.

🤖 Reviewers: Bisky, Hisoka, Mafu-san, Mafuuu, Killua, Kurapika, Meruem, Ryosuke, Takumi, Robin, Ging (Go), Gon, Leorio

Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go
Comment thread coderd/workspaceagents_test.go
Comment thread coderd/workspaceagents.go
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread codersdk/agentsdk/agentsdk.go Outdated
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread cli/agent.go Outdated

@mafredri mafredri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The subscribe-before-check TOCTOU pattern is well-designed, the ?wait rolling-upgrade gating is a clean solution to the old-agent infinite-reinit problem, and the channel lifecycle (defer close, !ok handling, nil-channel-in-select) is textbook correct Go concurrency. The test suite is genuine: four real subtests exercising distinct server-side paths with proper DB state setup.

One state machine gap to look at (P2), one minor API contract issue (P3), and three nits across 6 inline comments.

Meruem: "A completed build with no error is successful regardless of cancellation status. Change the success condition to eliminate the class of bug."

Ryosuke: "The handler inlines ~60 lines of claim status business logic. A ClaimStatus method on a prebuilds service would make the gap visible as a missing enum case."

🤖 Generated by Coder deep-review v3. 9 Tier 1 + 3 Tier 2 reviewers.

Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go
Comment thread cli/agent.go
Comment thread codersdk/agentsdk/agentsdk.go
Comment thread coderd/workspaceagents.go Outdated
Comment thread codersdk/agentsdk/agentsdk.go

@mafredri mafredri left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for the question regarding reinitEvents safety (I'd like to see human confirmation on it), all my remaining requests are simple enough to address, so pre-approving! 🏂🏻

Comment thread coderd/prebuilds/claim.go
Comment thread cli/agent.go
Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go Outdated
Comment thread coderd/workspaceagents.go Outdated
Comment thread codersdk/agentsdk/agentsdk.go
SasSwart added 18 commits April 2, 2026 21:07
When an agent connects to the /reinit SSE endpoint, the server now
checks if the workspace was already claimed (owner differs from the
prebuilds system user). If the claim build completed successfully, a
reinit event is sent immediately so the agent picks up the new owner
even when the original pubsub notification was missed.

Also fixes a missing return after the GetWorkspaceByAgentID error
path that previously caused fallthrough, and buffers the reinitEvents
channel (size 1) so the durable check can write without blocking.
The pubsub publisher now JSON-marshals the full ReinitializationEvent
instead of sending only the reason string. This allows additional
fields (such as UserID) to survive the pubsub round-trip.

The listener attempts JSON unmarshal first and falls back to treating
the payload as a raw reason string for backward compatibility with
older publishers.

The publish call site in provisionerdserver now populates UserID with
workspace.OwnerID.
Add two new subtests to TestReinit:

1. "agent receives reinit on reconnect after missed event" - Verifies
   the durable claim check: when an agent reconnects to a workspace
   that is already claimed (owner != PrebuildsSystemUserID) and the
   latest build completed successfully, the handler sends a reinit
   event immediately without requiring a pubsub message.

2. "agent does not receive premature reinit during in-progress claim
   build" - Verifies that the durable claim check does NOT fire when
   the latest build's provisioner job has not completed yet. This
   prevents premature reinit during an in-progress claim build.

The existing test is wrapped in a subtest ("agent receives reinit via
pubsub") to maintain consistent structure.
Replace the durable claim check in workspaceAgentReinit with full gating
logic that distinguishes post-claim/never-prebuild agents from pre-claim
agents reconnecting after a missed pubsub event.

The previous check sent reinit to ANY agent when !workspace.IsPrebuild()
and the build had completed. This caused an infinite reinit loop: after
reinit, the new agent reconnects, the durable check fires again, and so
on.

Now we compare the agent's build (via resource.JobID) against the latest
build's JobID:

- If they match, this is a post-claim or never-a-prebuild agent: return
  409 Conflict immediately.
- If different and the claim build completed successfully, send a
  one-shot reinit event via a closed channel, then return.
- If different and the claim build is still in progress (or failed),
  fall through to the pubsub subscription so the agent is notified
  when it completes.
When the server returns HTTP 409 from /reinit, it means the workspace
is not an unclaimed prebuild (either post-claim or never a prebuild).
In this case, WaitForReinitLoop now stops retrying, closes the channel,
and returns instead of retrying indefinitely.
When WaitForReinitLoop closes the reinitEvents channel (indicating
the workspace is not an unclaimed prebuild), the agent now detects
the closure via the two-value channel receive form and sets the
channel to nil so it blocks forever in the select. This allows the
agent to continue running normally without reinit capability.
Fix 3 existing subtests:
- 'agent receives reinit via pubsub': use PrebuildsSystemUserID as owner
  so workspace.IsPrebuild() is true and handler goes to pubsub wait path
- 'agent receives reinit on reconnect after missed event': create
  two-build prebuild claim pattern (Build 1 prebuild + Build 2 claim)
- 'agent does not receive premature reinit during in-progress claim build':
  same two-build pattern with Starting() for in-progress Build 2

Add 2 new subtests:
- 'post-claim agent gets 409': agent on latest build of non-prebuild
  workspace gets 409 Conflict
- 'pre-claim agent receives one-shot reinit and stream closes': verify
  one-shot reinit and that calling WaitForReinit again also succeeds

Note: Tests 2 and 5 (reconnect after missed event, one-shot reinit) are
blocked by GetAuthenticatedWorkspaceAgentAndBuildByAuthToken not allowing
pre-claim agents from older builds to authenticate when a newer completed
START build exists. The auth query needs a 'prebuild claim' case added.
The previous Phase 5 broadened the SQL auth query to let pre-claim
agents authenticate, which was unnecessary (token reuse already
handles this) and broke 3 CI tests.

This replaces it with a simpler approach:
- Check if the workspace's first build was initiated by the prebuilds
  system user (PrebuildsSystemUserID) to determine if it was originally
  a prebuild.
- Regular workspaces get a 409 Conflict, stopping the agent's reinit
  retry loop.
- Claimed prebuilds with a completed claim build get a one-shot reinit
  event (channel is pre-seeded and closed).
- Claimed prebuilds with in-progress builds fall through to the pubsub
  subscription path.

Also fixes two bugs in cli/agent.go:
- Closed reinit channel (from 409) now blocks on ctx.Done() instead of
  continue, preventing agent leaks.
- Duplicate reinit for same owner cancels the reinit goroutine and
  blocks on ctx.Done() instead of continue, preventing repeated agent
  creation.
Let the completed-claim-build path fall through to the shared
transmitter + error-logging switch instead of creating its own
transmitter and silently returning.
- Separate DB errors from non-prebuild check (500 vs 409)
- Handle failed claim builds with 500 instead of hanging
- Fix race conditions: cancelSub() + non-blocking send, no close
- Rewrite fragile in-progress claim test with positive assertions
- Add format:"uuid" tags to ReinitializationEvent fields
- Return 409 (not 500) for permanently failed claim builds so the
  agent's WaitForReinitLoop treats them as terminal instead of
  retrying every ~10s forever.
- Unify channel close in WaitForReinitLoop with defer close() at
  the top of the goroutine, removing two explicit close() calls
  and covering the retrier-exit path that previously leaked.
- Add workspace_id to the structured logger after GetWorkspaceByAgentID
  so all downstream log messages are greppable.
- Reword backward-compat comment in claim.go to clarify this PR
  introduces JSON encoding (not that it existed before).
…d check

- Add 'wait' query parameter to WaitForReinit; server only performs
  the durable claim check when this param is present. Old agents
  (pre-this-PR) lack the duplicate-reinit guard, so they would enter
  an infinite reinit loop if we pre-seeded the channel on every
  connection.
- Guard the claim-build success path against cancelled jobs:
  job.CanceledAt.Valid is now an explicit condition so a cancelled
  build is not misclassified as a success.
- Add swagger annotation for the new 'wait' query parameter.
- Add doc comment to WaitForReinitLoop explaining channel-close
  semantics (ctx cancel or terminal 409).
- Fix misleading 409 log message in WaitForReinitLoop.
- Fix channel-close comment in cli/agent.go to cover both close
  reasons (terminal 409 and context expiry).
- Document the latest-build assumption and the in-progress comment
  in workspaceAgentReinit.
SasSwart added 2 commits April 2, 2026 21:07
Rename ReinitializationEvent.UserID to OwnerID to align the struct
field with its semantic meaning (every usage site populates it from
workspace.OwnerID) and with the tracking variable in cli/agent.go
(lastOwnerID).

Replace the omitempty JSON tag with omitzero. encoding/json does not
consider array zero values empty, so omitempty never omits uuid.Nil.
omitzero (Go 1.24+) correctly handles [16]byte zero values.
…vent handling

- Updated the ListenForWorkspaceClaims method to return a receive-only channel for reinitialization events, removing the need for the caller to manage the channel.
- Adjusted related tests to accommodate the new method signature and ensure proper event handling.
- Enhanced comments for clarity on channel behavior and subscription management.
@SasSwart SasSwart force-pushed the database-abstraction-cjvy branch from 1195c77 to a0963d7 Compare April 2, 2026 21:35
@SasSwart SasSwart merged commit 5b6b771 into main Apr 2, 2026
29 checks passed
@SasSwart SasSwart deleted the database-abstraction-cjvy branch April 2, 2026 21:51
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 2, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make prebuild claiming durable and idempotent

2 participants