Skip to content

test(testutil): retry terraform provider cache population on transient failures#26196

Draft
jscottmiller wants to merge 3 commits into
mainfrom
test/terraform-cache-init-retry
Draft

test(testutil): retry terraform provider cache population on transient failures#26196
jscottmiller wants to merge 3 commits into
mainfrom
test/terraform-cache-init-retry

Conversation

@jscottmiller

@jscottmiller jscottmiller commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Problem

Tests that run real Terraform (e.g. enterprise/coderd.TestWorkspaceTemplateParamsChange, provisioner/terraform.TestProvision) intermittently fail when populating the shared provider cache. On a cache miss, DownloadTFProviders shells out to terraform init and terraform providers mirror against the live registry, which periodically returns transient 5xx errors from the registry/GitHub (504, 500). The cache-population helper had no application-level retry, so a single transient failure failed the whole test. Terraform's own registry client only retries each request once ("the request failed after 2 attempts"), which is insufficient for these bursts.

Fix

runCmd now retries on any non-zero exit using github.com/coder/retry, logging each failed attempt and preserving the original failure message format. Retry-all is safe here because terraform init and terraform providers mirror are idempotent: each run reconciles the existing state in the working directory.

The backoff window is deliberately wide: 5 attempts over roughly a minute (retry.New(5s, 30s)). Registry/GitHub incidents typically last seconds to minutes rather than a single unlucky request, so a narrow window would only survive an isolated blip. This is affordable because the network path runs only on a cache miss, not on every test: a populated cache short-circuits via os.Stat and is reused within and across runs (persisted by .github/actions/test-cache). The wait is therefore rarely incurred and is negligible against the 20m per-package test timeout. The only downside is a slightly slower failure on a genuinely doomed run.

This only affects the test provider-cache helper. Production provisioner code, the Windows no-op path, and the CI cache strategy are unchanged.

Refs coder/internal#1201

Investigation notes

The CI cache (~/.cache/coderv2-test, via .github/actions/test-cache) is persisted across runs and keyed by a hash of a stable caller-supplied label + template file contents, so cache hits avoid the network entirely. The flake only surfaces on a cache miss (provider version bump, monthly cache reset, or new label/template), where the populating terraform init was the sole unprotected network call. This change closes that gap without weakening the "use real Terraform" intent of the tests.


🤖 Generated with Coder Agents on behalf of @jscottmiller.

…t failures

The provider cache population path in DownloadTFProviders shells out to
terraform init and terraform providers mirror against the live registry on
a cache miss. These intermittently fail with transient registry/GitHub 5xx
errors, failing the whole test.

Retry runCmd up to 3 times with exponential backoff on any non-zero exit.
Re-running the commands in the same working directory is safe because both
are idempotent.
Use 5 attempts with retry.New(5s, 30s) (~1 minute) instead of 3 attempts
over ~4s. Registry/GitHub incidents typically last seconds to minutes, so a
short window only survives an isolated blip. The wait is only incurred on a
cache miss, so the cost is negligible against the per-package test timeout.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant