fix(cli): eliminate race in PausedDuringWaitForReady test#25858

Draft
mafredri wants to merge 1 commit into main from fix/codagt-482-flake-test-task-send-paused-during-wait

@mafredri mafredri commented May 29, 2026

Problem

Test_TaskSend/PausedDuringWaitForReady flakes on macOS CI with:

task entered unknown state while waiting for it to become idle

Two prior fixes (PR #25648, PR #25811) added quartz mock clocks and
traps. The flake recurred on CI run 26634123539 (includes both fixes).

Root cause

The test released the quartz resetTrap immediately after it caught
ticker.Reset (line 174 of task_send.go), then closed it. This allowed
client.TaskByID (line 175, one line later) to race with the test's
subsequent DB mutation (pauseTask / PatchAppStatus).

CI log evidence (run 26634123539, post-both-prior-fixes):

11:25:29.643707  GET  /tasks/me/<id>   took=116ms  (poll's TaskByID)
11:25:29.643763  POST /tasks/.../pause took=29ms   (test's pauseTask)

The GET and POST started 56us apart. The poll's 116ms DB query
straddled the stop build's commit, so it saw (stop, pending), which the
tasks_with_status view maps to unknown.

WaitsForWorkingAppState has the same structural race (release trap,
then mutate, then advance), though its race is benign since an early
mutation still makes the command succeed.

Fix

Test-only. Both PausedDuringWaitForReady and sibling
WaitsForWorkingAppState are fixed with the same pattern:

  1. Keep the resetTrap open across both poll iterations (do not
    close it after the first poll).
  2. First poll: advance the clock. The trap catches ticker.Reset,
    freezing the goroutine; release it immediately. The goroutine
    proceeds to client.TaskByID, sees the initial state
    ("initializing" / "working"), and continues polling.
  3. Second poll: advance the clock. The trap catches ticker.Reset
    again, so the goroutine is frozen before client.TaskByID.
  4. Mutate state while the goroutine is frozen (pauseTask /
    PatchAppStatus). The DB mutation completes with no concurrent
    reader.
  5. Release the trap. The goroutine unfreezes and client.TaskByID
    deterministically sees the mutated state.
  6. Close the trap.

No race because the goroutine cannot execute client.TaskByID while
trapped at ticker.Reset.

Verification
$ go test -run 'Test_TaskSend/(PausedDuringWaitForReady|WaitsForWorkingAppState)' -v ./cli/
--- PASS: Test_TaskSend/PausedDuringWaitForReady (2.31s)
--- PASS: Test_TaskSend/WaitsForWorkingAppState (7.29s)
PASS

$ go test -run 'Test_TaskSend' -race -count=3 ./cli/ -timeout 300s
ok  cli  58.070s

dlv session confirming the ordering gap (from investigation)

Breakpoints at ticker.Reset (line 174) and client.TaskByID (line
175) confirm the trap fires between them:

> [Breakpoint 1] waitForTaskIdle() ./cli/task_send.go:174
=> 174:   ticker.Reset(pollInterval, "task_send", "poll")
   175:   task, err := client.TaskByID(ctx, task.ID)
(dlv) task.Status = "initializing"

    Mock Clock - Ticker.Reset(5s, [task_send poll]) call, matched 1 traps

> [Breakpoint 2] waitForTaskIdle() ./cli/task_send.go:175
=> 175:   task, err := client.TaskByID(ctx, task.ID)
(dlv) task.Status = "initializing"

After continuing, server logs show the concurrent requests that cause
the flake when the trap is released too early:

POST /tasks/.../pause  start="12:31:30.830"  took=111ms
GET  /tasks/me/<id>    start="12:31:30.832"  took=118ms

The fix eliminates this by holding the goroutine frozen at
ticker.Reset while the mutation runs.

Closes CODAGT-482

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

@mafredri mafredri force-pushed the fix/codagt-482-flake-test-task-send-paused-during-wait branch from 7c82faf to a4219b9 Compare May 29, 2026 15:31
The PausedDuringWaitForReady and WaitsForWorkingAppState tests flaked
because the quartz resetTrap was released immediately after catching
ticker.Reset (line 174), allowing client.TaskByID (line 175) to race
with the subsequent DB mutation (pauseTask / PatchAppStatus).

Fix: keep the resetTrap open across both poll iterations. On the first
poll, release the trap so the goroutine sees the initial state and
continues. On the second poll, hold the goroutine frozen at
ticker.Reset while mutating state. Then release; client.TaskByID
deterministically sees the mutated state. No race because the
goroutine cannot execute client.TaskByID while trapped.

Closes CODAGT-482
@mafredri mafredri force-pushed the fix/codagt-482-flake-test-task-send-paused-during-wait branch from a4219b9 to ec5a110 Compare May 29, 2026 15:55