test(testutil): retry terraform provider cache population on transient failures#26196
Draft
jscottmiller wants to merge 3 commits into
Draft
test(testutil): retry terraform provider cache population on transient failures#26196jscottmiller wants to merge 3 commits into
jscottmiller wants to merge 3 commits into
Conversation
…t failures The provider cache population path in DownloadTFProviders shells out to terraform init and terraform providers mirror against the live registry on a cache miss. These intermittently fail with transient registry/GitHub 5xx errors, failing the whole test. Retry runCmd up to 3 times with exponential backoff on any non-zero exit. Re-running the commands in the same working directory is safe because both are idempotent.
Use 5 attempts with retry.New(5s, 30s) (~1 minute) instead of 3 attempts over ~4s. Registry/GitHub incidents typically last seconds to minutes, so a short window only survives an isolated blip. The wait is only incurred on a cache miss, so the cost is negligible against the per-package test timeout.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Tests that run real Terraform (e.g.
enterprise/coderd.TestWorkspaceTemplateParamsChange,provisioner/terraform.TestProvision) intermittently fail when populating the shared provider cache. On a cache miss,DownloadTFProvidersshells out toterraform initandterraform providers mirroragainst the live registry, which periodically returns transient 5xx errors from the registry/GitHub (504, 500). The cache-population helper had no application-level retry, so a single transient failure failed the whole test. Terraform's own registry client only retries each request once ("the request failed after 2 attempts"), which is insufficient for these bursts.Fix
runCmdnow retries on any non-zero exit usinggithub.com/coder/retry, logging each failed attempt and preserving the original failure message format. Retry-all is safe here becauseterraform initandterraform providers mirrorare idempotent: each run reconciles the existing state in the working directory.The backoff window is deliberately wide: 5 attempts over roughly a minute (
retry.New(5s, 30s)). Registry/GitHub incidents typically last seconds to minutes rather than a single unlucky request, so a narrow window would only survive an isolated blip. This is affordable because the network path runs only on a cache miss, not on every test: a populated cache short-circuits viaos.Statand is reused within and across runs (persisted by.github/actions/test-cache). The wait is therefore rarely incurred and is negligible against the 20m per-package test timeout. The only downside is a slightly slower failure on a genuinely doomed run.This only affects the test provider-cache helper. Production provisioner code, the Windows no-op path, and the CI cache strategy are unchanged.
Refs coder/internal#1201
Investigation notes
The CI cache (
~/.cache/coderv2-test, via.github/actions/test-cache) is persisted across runs and keyed by a hash of a stable caller-supplied label + template file contents, so cache hits avoid the network entirely. The flake only surfaces on a cache miss (provider version bump, monthly cache reset, or new label/template), where the populatingterraform initwas the sole unprotected network call. This change closes that gap without weakening the "use real Terraform" intent of the tests.🤖 Generated with Coder Agents on behalf of @jscottmiller.