test(provisionerd): deflake TestProvisionerd/CloseCancelsJob#26386
test(provisionerd): deflake TestProvisionerd/CloseCancelsJob#26386jscottmiller wants to merge 1 commit into
Conversation
The test closes the daemon inside the provisioner parse/init callback, which cancels the in-flight AcquireJobWithCancel RPC immediately after the job is delivered. dRPC can report context.Canceled from the server-side stream.Send even after the message was successfully sent, causing a spurious assert failure. Tolerate context.Canceled from Send in the acquire handler, mirroring the existing acquireOne test helper. If the job was never delivered the test still fails waiting on completeChan. Fixes PLAT-172.
|
/coder-agents-review |
|
Chat: Review posted | View chat Review history
deep-review v0.7.1 | Round 1 | Last posted: Round 1, 1 findings (1 P3), COMMENT. Review Finding inventoryFindings
Contested and acknowledged(none) Round logRound 1Panel. 1 P3, 1 Nit (dropped). Reviewed against e18c863..4eb861c. Netero: no findings. 12-reviewer panel: Bisky, Hisoka, Mafu-san, Mafuuu, Pariston, Gon, Leorio, Takumi, Ging-Go, Meruem, Chopper, Razor. 10 reviewers found no issues. Meruem raised P3 on sibling handlers (converged with 3 Notes from Mafu-san, Mafuuu, Leorio). Gon raised P2 on comment wordiness, downgraded to Nit and dropped (Leorio contradicts, praising the comment). About deep-reviewCRF = Coder Review Finding (P0-P4, Nit, Note)
|
There was a problem hiding this comment.
Clean change. The dRPC race is well-understood, the fix mirrors established prior art (acquireOne.acquireWithCancel), the comment documents both the mechanism and the safety net, and the completeChan guard preserves the test's ability to catch genuine delivery failures. Investigation and validation in the PR description are thorough.
One P3 on sibling handlers that share the same latent race.
"I tried to build a case against this and couldn't. The problem is correctly understood, the solution is proportional, and the fix is at the right causal level." (Pariston)
🤖 This review was automatically generated with Coder Agents.
| // from Send even after the job was successfully delivered, so | ||
| // tolerate it. If the job was never delivered, the test fails | ||
| // waiting on completeChan instead. | ||
| if !xerrors.Is(err, context.Canceled) { |
There was a problem hiding this comment.
P3 [CRF-1] Six sibling inline acquireJobWithCancel handlers use bare assert.NoError(t, err) after stream.Send without the context.Canceled tolerance:
RunningPeriodicUpdate(line 266)Shutdown(line 744)ShutdownFromJob(line 827)ReconnectAndFail(line 930)ReconnectAndComplete(line 1025)UpdatesBeforeComplete(line 1124)
"The dRPC race is inherent to the transport layer. The
acquireOnehelper's own comment (line 1404) says it can happen 'in unit tests that complete,' not only when the daemon is closed immediately.CloseCancelsJobmaximizes the race window by closing inside the parse/init callback, but any test whose cleanup cancels the stream while dRPC is flushing the Send response can hit it."
The fix correctly targets the one test with a documented flake. The other six have a much smaller race window because they don't close the daemon during acquisition. Shutdown and ShutdownFromJob are next-most-likely since they explicitly shut down during the test.
A small helper like assertSendNoError(t, err) that encapsulates the tolerance would eliminate the class rather than waiting for individual flake reports, but that is a separate scope decision. (Meruem P3, Mafu-san Note, Mafuuu Note, Leorio Note)
🤖
TestProvisionerd/CloseCancelsJobcloses the daemon inside the provisioner parse/init callback, which cancels the in-flightAcquireJobWithCancelRPC immediately after the job is delivered. dRPC can reportcontext.Canceledfrom the server-sidestream.Sendeven after the message was successfully sent, causing a spuriousassert.NoErrorfailure.Tolerate
context.CanceledfromSendin the acquire handler, mirroring the existingacquireOnetest helper. If the job was never delivered the test still fails waiting oncompleteChan.This is the same dRPC race that #17448 fixed for the shared
acquireOnehelper (tracked in coder/internal#584,flake: TestProvisionerd/MaliciousTar). That fix did not cover the bespoke inline handler inCloseCancelsJob, which is the gap this PR closes.Fixes PLAT-172. Refs coder/internal#1478
Investigation and validation
assert.NoErroratprovisionerd_test.go:112(commit 615be17) receivingcontext canceled. The same dRPC behavior is already documented and handled inacquireOne.acquireWithCancel(which swallowscontext.CanceledfromSend) and inretryable()inprovisionerd.go. TheCloseCancelsJobinline handler never adopted that tolerance.acquireOne.acquireWithCancelto fix flake: TestProvisionerd/MaliciousTar internal#584, which failed with the same signature (Received unexpected error: context canceledat theassert.NoErrorfollowingstream.Send).Sendgoroutine to be descheduled past cancel propagation, which only manifests under CI load.context.Canceledonce the stream context cancelled after a successfulSend. With the original bareassert.NoError, this reproduced the exact CI failure (context canceledat line 112); with the fix applied, the test passed. The instrumentation was reverted; only the tolerance change ships.