Skip to content

fix(scaletest/prebuilds): fix Runner.Cleanup() to delete workspaces#23627

Merged
cstyan merged 3 commits into
mainfrom
callum/prebuild-scaletest-cleanup
Apr 22, 2026
Merged

fix(scaletest/prebuilds): fix Runner.Cleanup() to delete workspaces#23627
cstyan merged 3 commits into
mainfrom
callum/prebuild-scaletest-cleanup

Conversation

@cstyan
Copy link
Copy Markdown
Contributor

@cstyan cstyan commented Mar 25, 2026

This PR makes the Cleanup function for prebuild scaletests more reliable. Currently, Cleanup just attempts to delete the prebuilds templates, but templates cannot be deleted if there are any workspaces created from them remaining.

This is problematic because we don't currently generate unique template names on each new prebuilds scaletest run, so this leaves the cluster in a state where another prebuild scaletest cannot be executed without some manual cleanup first.

So, in this PR we're modifying Cleanup to first check for workspaces for the given runners template. It then attempts to delete those workspaces, retrying up to 3x, before finally moving on to the template deletion.

This also means we're properly deleting workspaces for a template if we fail during a call to Run.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Mar 25, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@github-actions github-actions Bot added the community Pull Requests and issues created by the community. label Mar 25, 2026
@matifali matifali removed the community Pull Requests and issues created by the community. label Mar 26, 2026
@github-actions github-actions Bot added the stale This issue is like stale bread. label Apr 6, 2026
@github-actions github-actions Bot closed this Apr 10, 2026
@cstyan cstyan reopened this Apr 14, 2026
@cstyan cstyan force-pushed the callum/prebuild-scaletest-cleanup branch from edf99bb to 860891d Compare April 14, 2026 21:58
cstyan and others added 2 commits April 15, 2026 19:36
…efore template

The API rejects template deletion with a 400 while workspaces still exist.
Cleanup() now lists all workspaces for the template, deletes each via
workspacebuild.NewCleanupRunner with up to 3 retries, then deletes the
template. If Run() fails with a 409 conflict (template already exists from a
prior failed run), the existing template is looked up so Cleanup() can remove
it rather than silently skipping it.
… deletion retries

Log every deletion attempt (not just retries) and return an error if
workspaces remain after all attempts, rather than logging an error and
proceeding to template deletion which would fail anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cstyan cstyan force-pushed the callum/prebuild-scaletest-cleanup branch from 860891d to 3a55622 Compare April 15, 2026 19:36
@cstyan cstyan requested review from spikecurtis and sreya April 15, 2026 20:22
slog.F("template_name", r.template.Name),
)
var failed []codersdk.Workspace
for _, ws := range remaining {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In your experience, are there a lot of these? Do we need to think about running in parallel?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per runner there is at most 50 workspaces that would need to be deleted, because we create a runner per template and all of our prebuilds scenarios create 5 presets per template and 10 workspaces per preset. I don't think a max of 50 is too much to delete serially per runner.

We're already running each runner in a separate go routine so we're already running num_templates (defined in each prebuild scenario) go routines.

If we switch up our scenario sizing or something else in the future that would result in potentially many more workspaces per runner that would need to be cleaned up we can make changes here at that point.

// Retry failed workspace deletions up to maxDeletionAttempts times to
// handle transient errors (e.g. a delete build that fails due to a
// provisioner hiccup).
const maxDeletionAttempts = 3
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be worrying about the case where the template is left in a state with a positive number of prebuilds, and thus this routine and the reconciler get into a fight --- it creating more as these are deleted?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I never ran into this case when running the prebuilds tests myself, but I think it is possible.

Previously we would only ever push a new template version with no presets in the Run function after being able to successfully configure and run some prebuilds first. In that case we also had, via the CLI code, already set the global prebuilds reconciler flag to block new prebuilds anyways.

In the edge case(s) this PR is trying to resolve, where old templates still exist during a subsequent run of the prebuilds load generator we also need to push a new version of the template in that error case. I've updated the relevant block in Run in the latest commit.

…all it on orphaned templates

Extracts the empty-template-version logic into a reusable helper and calls it when an orphaned template is detected, preventing the reconciler from spawning new workspaces while Cleanup() drains them.

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@cstyan cstyan force-pushed the callum/prebuild-scaletest-cleanup branch from 2e84556 to 7563426 Compare April 21, 2026 21:49
@cstyan cstyan merged commit 7d044fa into main Apr 22, 2026
26 checks passed
@cstyan cstyan deleted the callum/prebuild-scaletest-cleanup branch April 22, 2026 20:11
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 22, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

stale This issue is like stale bread.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants