fix(scaletest/prebuilds): fix Runner.Cleanup() to delete workspaces#23627
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
edf99bb to
860891d
Compare
…efore template The API rejects template deletion with a 400 while workspaces still exist. Cleanup() now lists all workspaces for the template, deletes each via workspacebuild.NewCleanupRunner with up to 3 retries, then deletes the template. If Run() fails with a 409 conflict (template already exists from a prior failed run), the existing template is looked up so Cleanup() can remove it rather than silently skipping it.
… deletion retries Log every deletion attempt (not just retries) and return an error if workspaces remain after all attempts, rather than logging an error and proceeding to template deletion which would fail anyway. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
860891d to
3a55622
Compare
| slog.F("template_name", r.template.Name), | ||
| ) | ||
| var failed []codersdk.Workspace | ||
| for _, ws := range remaining { |
There was a problem hiding this comment.
In your experience, are there a lot of these? Do we need to think about running in parallel?
There was a problem hiding this comment.
Per runner there is at most 50 workspaces that would need to be deleted, because we create a runner per template and all of our prebuilds scenarios create 5 presets per template and 10 workspaces per preset. I don't think a max of 50 is too much to delete serially per runner.
We're already running each runner in a separate go routine so we're already running num_templates (defined in each prebuild scenario) go routines.
If we switch up our scenario sizing or something else in the future that would result in potentially many more workspaces per runner that would need to be cleaned up we can make changes here at that point.
| // Retry failed workspace deletions up to maxDeletionAttempts times to | ||
| // handle transient errors (e.g. a delete build that fails due to a | ||
| // provisioner hiccup). | ||
| const maxDeletionAttempts = 3 |
There was a problem hiding this comment.
Should we be worrying about the case where the template is left in a state with a positive number of prebuilds, and thus this routine and the reconciler get into a fight --- it creating more as these are deleted?
There was a problem hiding this comment.
Good question. I never ran into this case when running the prebuilds tests myself, but I think it is possible.
Previously we would only ever push a new template version with no presets in the Run function after being able to successfully configure and run some prebuilds first. In that case we also had, via the CLI code, already set the global prebuilds reconciler flag to block new prebuilds anyways.
In the edge case(s) this PR is trying to resolve, where old templates still exist during a subsequent run of the prebuilds load generator we also need to push a new version of the template in that error case. I've updated the relevant block in Run in the latest commit.
…all it on orphaned templates Extracts the empty-template-version logic into a reusable helper and calls it when an orphaned template is detected, preventing the reconciler from spawning new workspaces while Cleanup() drains them. Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2e84556 to
7563426
Compare
This PR makes the
Cleanupfunction for prebuild scaletests more reliable. Currently,Cleanupjust attempts to delete the prebuilds templates, but templates cannot be deleted if there are any workspaces created from them remaining.This is problematic because we don't currently generate unique template names on each new prebuilds scaletest run, so this leaves the cluster in a state where another prebuild scaletest cannot be executed without some manual cleanup first.
So, in this PR we're modifying
Cleanupto first check for workspaces for the given runners template. It then attempts to delete those workspaces, retrying up to 3x, before finally moving on to the template deletion.This also means we're properly deleting workspaces for a template if we fail during a call to
Run.