Skip to content

test(provisionerd): deflake TestProvisionerd/CloseCancelsJob#26386

Draft
jscottmiller wants to merge 1 commit into
mainfrom
plat-172-deflake-close-cancels-job
Draft

test(provisionerd): deflake TestProvisionerd/CloseCancelsJob#26386
jscottmiller wants to merge 1 commit into
mainfrom
plat-172-deflake-close-cancels-job

Conversation

@jscottmiller

@jscottmiller jscottmiller commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

TestProvisionerd/CloseCancelsJob closes the daemon inside the provisioner parse/init callback, which cancels the in-flight AcquireJobWithCancel RPC immediately after the job is delivered. dRPC can report context.Canceled from the server-side stream.Send even after the message was successfully sent, causing a spurious assert.NoError failure.

Tolerate context.Canceled from Send in the acquire handler, mirroring the existing acquireOne test helper. If the job was never delivered the test still fails waiting on completeChan.

This is the same dRPC race that #17448 fixed for the shared acquireOne helper (tracked in coder/internal#584, flake: TestProvisionerd/MaliciousTar). That fix did not cover the bespoke inline handler in CloseCancelsJob, which is the gap this PR closes.

Fixes PLAT-172. Refs coder/internal#1478

Investigation and validation
  • The reported failure was assert.NoError at provisionerd_test.go:112 (commit 615be17) receiving context canceled. The same dRPC behavior is already documented and handled in acquireOne.acquireWithCancel (which swallows context.Canceled from Send) and in retryable() in provisionerd.go. The CloseCancelsJob inline handler never adopted that tolerance.
  • Prior art: PR test: ignore context.Canceled in acquireWithCancel #17448 (merged 2025-04-17) added the identical tolerance to acquireOne.acquireWithCancel to fix flake: TestProvisionerd/MaliciousTar internal#584, which failed with the same signature (Received unexpected error: context canceled at the assert.NoError following stream.Send).
  • The natural timing race did not reproduce locally (~96,000 iterations, including single-threaded and race-detector runs). It requires the server Send goroutine to be descheduled past cancel propagation, which only manifests under CI load.
  • To validate the fix logic deterministically, the handler was temporarily instrumented to surface context.Canceled once the stream context cancelled after a successful Send. With the original bare assert.NoError, this reproduced the exact CI failure (context canceled at line 112); with the fix applied, the test passed. The instrumentation was reverted; only the tolerance change ships.

Generated by Coder Agents on behalf of @jscottmiller.

The test closes the daemon inside the provisioner parse/init callback,
which cancels the in-flight AcquireJobWithCancel RPC immediately after
the job is delivered. dRPC can report context.Canceled from the
server-side stream.Send even after the message was successfully sent,
causing a spurious assert failure.

Tolerate context.Canceled from Send in the acquire handler, mirroring
the existing acquireOne test helper. If the job was never delivered the
test still fails waiting on completeChan.

Fixes PLAT-172.
@linear-code

linear-code Bot commented Jun 15, 2026

Copy link
Copy Markdown

PLAT-172

@jscottmiller

Copy link
Copy Markdown
Contributor Author

/coder-agents-review

@coder-agents-review

coder-agents-review Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Chat: Review posted | View chat
Requested: 2026-06-15 21:37 UTC by @jscottmiller
Spend: $10.30 / $100.00

Review history
  • R1 (2026-06-15): 12 reviewers, 1 P3, COMMENT. Review

deep-review v0.7.1 | Round 1 | e18c863..4eb861c

Last posted: Round 1, 1 findings (1 P3), COMMENT. Review

Finding inventory

Findings

# Sev Status Location Summary Round Reviewer Posted
CRF-1 P3 Open provisionerd_test.go:118 Six sibling inline acquireJobWithCancel handlers carry the same latent dRPC context.Canceled race on stream.Send R1 Meruem P3, Mafu-san Note, Mafuuu Note, Leorio Note Yes
CRF-2 Nit Dropped by orchestrator (marginal wordiness; Leorio praises the comment as better than prior art) provisionerd_test.go:113 Comment has minor restatement of code behavior R1 Gon No

Contested and acknowledged

(none)

Round log

Round 1

Panel. 1 P3, 1 Nit (dropped). Reviewed against e18c863..4eb861c. Netero: no findings. 12-reviewer panel: Bisky, Hisoka, Mafu-san, Mafuuu, Pariston, Gon, Leorio, Takumi, Ging-Go, Meruem, Chopper, Razor. 10 reviewers found no issues. Meruem raised P3 on sibling handlers (converged with 3 Notes from Mafu-san, Mafuuu, Leorio). Gon raised P2 on comment wordiness, downgraded to Nit and dropped (Leorio contradicts, praising the comment).

About deep-review

CRF = Coder Review Finding (P0-P4, Nit, Note)

Reviewer Focus
Bisky tests
Chopper ops/errors
Churn-guard change verification
Ging language modernization
Gon naming
Hisoka edge cases
Killua perf
Kite change integrity
Knov contracts
Knuckle SQL
Kurapika security
Law decomposition
Leorio docs
Luffy product
Mafu-san process
Mafuuu contracts
Melody dispatch/pairing
Meruem structural
Nami frontend
Netero mechanical checks
Pariston premise testing
Pen-botter product gaps
Razor verification
Robin duplication
Ryosuke Go arch
Takumi concurrency
Zoro shape

🤖 Managed by Coder Agents.

@coder-agents-review coder-agents-review Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean change. The dRPC race is well-understood, the fix mirrors established prior art (acquireOne.acquireWithCancel), the comment documents both the mechanism and the safety net, and the completeChan guard preserves the test's ability to catch genuine delivery failures. Investigation and validation in the PR description are thorough.

One P3 on sibling handlers that share the same latent race.

"I tried to build a case against this and couldn't. The problem is correctly understood, the solution is proportional, and the fix is at the right causal level." (Pariston)

🤖 This review was automatically generated with Coder Agents.

// from Send even after the job was successfully delivered, so
// tolerate it. If the job was never delivered, the test fails
// waiting on completeChan instead.
if !xerrors.Is(err, context.Canceled) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3 [CRF-1] Six sibling inline acquireJobWithCancel handlers use bare assert.NoError(t, err) after stream.Send without the context.Canceled tolerance:

  • RunningPeriodicUpdate (line 266)
  • Shutdown (line 744)
  • ShutdownFromJob (line 827)
  • ReconnectAndFail (line 930)
  • ReconnectAndComplete (line 1025)
  • UpdatesBeforeComplete (line 1124)

"The dRPC race is inherent to the transport layer. The acquireOne helper's own comment (line 1404) says it can happen 'in unit tests that complete,' not only when the daemon is closed immediately. CloseCancelsJob maximizes the race window by closing inside the parse/init callback, but any test whose cleanup cancels the stream while dRPC is flushing the Send response can hit it."

The fix correctly targets the one test with a documented flake. The other six have a much smaller race window because they don't close the daemon during acquisition. Shutdown and ShutdownFromJob are next-most-likely since they explicitly shut down during the test.

A small helper like assertSendNoError(t, err) that encapsulates the tolerance would eliminate the class rather than waiting for individual flake reports, but that is a separate scope decision. (Meruem P3, Mafu-san Note, Mafuuu Note, Leorio Note)

🤖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flake: TestProvisionerd/MaliciousTar

1 participant