test(testutil): retry terraform provider cache population on transient failures by jscottmiller · Pull Request #26196 · coder/coder

jscottmiller · 2026-06-09T20:18:50Z

Problem

Tests that run real Terraform (e.g. enterprise/coderd.TestWorkspaceTemplateParamsChange, provisioner/terraform.TestProvision) intermittently fail while populating the shared provider cache. On a cache miss, DownloadTFProviders shells out to terraform init and terraform providers mirror against the live registry, and the registry or GitHub periodically returns transient 5xx errors (504, 500). The cache-population helper had no application-level retry, so a single transient failure failed the whole test. Terraform's own registry client gives up after two attempts per request ("the request failed after 2 attempts"), which is not enough to ride out these bursts.

Fix

runCmd now retries on any non-zero exit using github.com/coder/retry, logging each failed attempt and preserving the original failure message format. Retry-all is safe here because terraform init and terraform providers mirror are idempotent: each run reconciles the existing state in the working directory.

The backoff window is deliberately wide: 5 attempts over roughly a minute (retry.New(5s, 30s)). Registry/GitHub incidents typically last seconds to minutes rather than affecting a single unlucky request, so a narrow window would only survive an isolated blip. This is affordable because the network path runs only on a cache miss, not on every test: a populated cache short-circuits via os.Stat and is reused within and across runs (persisted by .github/actions/test-cache). The wait is therefore rarely incurred and, when it is, is negligible against the 20m per-package test timeout. The only downside is a slightly slower failure on a genuinely doomed run.

This only affects the test provider-cache helper. Production provisioner code, the Windows no-op path, and the CI cache strategy are unchanged.

Refs coder/internal#1201

Investigation notes

The CI cache (~/.cache/coderv2-test, via .github/actions/test-cache) is persisted across runs and keyed by a hash of a stable caller-supplied label + template file contents, so cache hits avoid the network entirely. The flake only surfaces on a cache miss (provider version bump, monthly cache reset, or new label/template), where the populating terraform init was the sole unprotected network call. This change closes that gap without weakening the "use real Terraform" intent of the tests.

🤖 Generated with Coder Agents on behalf of @jscottmiller.

…t failures The provider cache population path in DownloadTFProviders shells out to terraform init and terraform providers mirror against the live registry on a cache miss. These intermittently fail with transient registry/GitHub 5xx errors, failing the whole test. Retry runCmd up to 3 times with exponential backoff on any non-zero exit. Re-running the commands in the same working directory is safe because both are idempotent.

Use 5 attempts with retry.New(5s, 30s) (~1 minute) instead of 3 attempts over ~4s. Registry/GitHub incidents typically last seconds to minutes, so a short window only survives an isolated blip. The wait is only incurred on a cache miss, so the cost is negligible against the per-package test timeout.

github-actions Bot assigned jscottmiller Jun 9, 2026

jscottmiller added 2 commits June 9, 2026 20:38

test(testutil): tighten cache retry comment

b5582fc

jscottmiller wants to merge 3 commits into main from test/terraform-cache-init-retry
