fix(scaletest/prebuilds): make measureDeletion more reliable/less brittle#23614

Merged
cstyan merged 1 commit into main from callum/prebuild-scaletest-measure-deletion
Apr 22, 2026

@cstyan
Contributor

@cstyan cstyan commented Mar 25, 2026

Currently measureDeletion uses the exact targetNumWorkspaces. However, because of the way the prebuild reconciler works, it may have created a few more or fewer workspaces than the exact targetNumWorkspaces. The changes in this PR are:

  • capture the exact workspace count for the given runner's template
  • retry the deletions of each workspace up to 3 times
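
A minimal sketch of the two changes listed above, not the code from this PR. Every name here (`workspace`, `countForTemplate`, `deleteWithRetries`, `errBuildFailed`, `maxDeleteRetries`) is hypothetical, and it assumes "up to 3 times" means three retries after the first attempt.

```go
package main

import (
	"errors"
	"fmt"
)

// maxDeleteRetries assumes "up to 3 times" means 3 retries after the first try.
const maxDeleteRetries = 3

// errBuildFailed is a hypothetical sentinel for a failed delete build.
var errBuildFailed = errors.New("delete build failed")

// workspace is a simplified, hypothetical stand-in for the real workspace type.
type workspace struct {
	ID         string
	TemplateID string
}

// countForTemplate returns how many workspaces actually belong to the
// template, which may differ from the configured target.
func countForTemplate(all []workspace, templateID string) int {
	n := 0
	for _, w := range all {
		if w.TemplateID == templateID {
			n++
		}
	}
	return n
}

// deleteWithRetries calls del until it succeeds or the retries run out.
// It returns the number of attempts made and the last error, if any.
func deleteWithRetries(id string, del func(id string) error) (int, error) {
	var err error
	for attempt := 1; attempt <= 1+maxDeleteRetries; attempt++ {
		if err = del(id); err == nil {
			return attempt, nil
		}
	}
	return 1 + maxDeleteRetries, fmt.Errorf("delete workspace %s: %w", id, err)
}

func main() {
	all := []workspace{
		{ID: "ws-1", TemplateID: "tpl-a"},
		{ID: "ws-2", TemplateID: "tpl-a"},
		{ID: "ws-3", TemplateID: "tpl-b"},
	}
	// Use the observed count, not the configured target, as the denominator.
	total := countForTemplate(all, "tpl-a")

	// Simulated delete that fails once per workspace before succeeding.
	seen := map[string]int{}
	flaky := func(id string) error {
		seen[id]++
		if seen[id] < 2 {
			return errBuildFailed
		}
		return nil
	}

	deleted := 0
	for _, w := range all {
		if w.TemplateID != "tpl-a" {
			continue
		}
		if _, err := deleteWithRetries(w.ID, flaky); err == nil {
			deleted++
		}
	}
	fmt.Printf("deleted %d of %d workspaces\n", deleted, total)
}
```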

The scaletest isn't a measurement of whether the workspace deletion/template deletion is successful, so retries are reasonable here IMO. Retrying also lowers the chance of leaving the cluster in a transient state, which would make it impossible to re-run a prebuild scaletest. This is due to a few more cleanup procedures that will be improved in other PRs.

@github-actions github-actions Bot added the community Pull Requests and issues created by the community. label Mar 25, 2026
@github-actions

github-actions Bot commented Mar 25, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@matifali matifali removed the community Pull Requests and issues created by the community. label Mar 26, 2026
@github-actions github-actions Bot added the stale This issue is like stale bread. label Apr 6, 2026
@github-actions github-actions Bot closed this Apr 10, 2026
@cstyan cstyan reopened this Apr 14, 2026
@cstyan cstyan force-pushed the callum/prebuild-scaletest-measure-deletion branch from 56d8ab9 to 8581202 Compare April 14, 2026 21:59
…n tracking

Retries failed or canceled delete builds up to 3 times per workspace and
exits early if all remaining workspaces exhaust retries. Snapshots the
initial workspace count at deletion start as the completion denominator,
since the reconciler may create more workspaces than the configured target.
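
The tracking described in this commit message could look roughly like the following. This is a sketch under assumptions, not the merged code: `tracker`, `observe`, `finished`, and the `buildStatus` values are hypothetical names, and the real scaletest may observe builds differently.

```go
package main

import "fmt"

// buildStatus is a hypothetical, reduced set of delete build outcomes.
type buildStatus string

const (
	statusSucceeded buildStatus = "succeeded"
	statusFailed    buildStatus = "failed"
	statusCanceled  buildStatus = "canceled"
)

// maxRetries follows the "up to 3 times per workspace" limit.
const maxRetries = 3

// tracker follows deletion progress against a count snapshotted at start.
type tracker struct {
	initial   int             // workspace count when deletion began
	deleted   map[string]bool // workspaces whose delete build succeeded
	exhausted map[string]bool // workspaces that used up all retries
	retries   map[string]int  // retries submitted so far, per workspace
}

func newTracker(initial int) *tracker {
	return &tracker{
		initial:   initial,
		deleted:   map[string]bool{},
		exhausted: map[string]bool{},
		retries:   map[string]int{},
	}
}

// observe records the latest delete build status for a workspace and reports
// whether the caller should submit another delete build for it.
func (t *tracker) observe(id string, s buildStatus) bool {
	if t.deleted[id] || t.exhausted[id] {
		return false
	}
	switch s {
	case statusSucceeded:
		t.deleted[id] = true
	case statusFailed, statusCanceled:
		if t.retries[id] >= maxRetries {
			t.exhausted[id] = true
			return false
		}
		t.retries[id]++
		return true
	}
	return false
}

// finished is true once every workspace in the snapshot is either deleted or
// out of retries, which lets the caller exit early instead of waiting.
func (t *tracker) finished() bool {
	return len(t.deleted)+len(t.exhausted) >= t.initial
}

func main() {
	t := newTracker(2)
	t.observe("ws-1", statusSucceeded)
	for !t.finished() {
		// ws-2 keeps failing; each true result would trigger a new delete build.
		retry := t.observe("ws-2", statusFailed)
		fmt.Println("retry ws-2:", retry)
	}
	fmt.Printf("deleted %d of %d, exhausted %d\n", len(t.deleted), t.initial, len(t.exhausted))
}
```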
@cstyan cstyan force-pushed the callum/prebuild-scaletest-measure-deletion branch from 8581202 to dd16016 Compare April 15, 2026 19:41
@cstyan cstyan requested review from spikecurtis and sreya April 15, 2026 20:20
Contributor

Prebuilds should retry the deletion after a backoff, right? Don't we want to allow the system to behave as it normally would? Us injecting additional builds into the system is sort of exactly the thing that you don't want to do when you are overwhelming some upstream system with requests it can't fulfill (K8s in our case, presumably).

@cstyan
Contributor Author

cstyan commented Apr 21, 2026

> Prebuilds should retry the deletion after a backoff, right? Don't we want to allow the system to behave as it normally would? Us injecting additional builds into the system is sort of exactly the thing that you don't want to do when you are overwhelming some upstream system with requests it can't fulfill (K8s in our case, presumably).

My understanding is that the existing backoff in the reconciler is only for creation, not for deletion, and the reconciler does not retry failed deletes. This is a result of it using queries that grab only the latest build for a given workspace. So we do need our own retry mechanism here, but it could still be that we're overloading something (infra or Coder-related code), and so a backoff on the retries is warranted 👍

Or we could just not retry at all.
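
If retries are kept, a backoff between them might be sketched as below. This is only an illustration: `retryDelay`, `baseDelay`, and `maxDelay` are hypothetical names, the values are made up, and nothing here is taken from the reconciler or the scaletest code.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values chosen for illustration only.
const (
	baseDelay = 5 * time.Second
	maxDelay  = 30 * time.Second
)

// retryDelay returns how long to wait before the given retry (1-based),
// doubling from base on each retry and never exceeding max.
func retryDelay(retry int, base, max time.Duration) time.Duration {
	if retry < 1 {
		return 0
	}
	d := base
	for i := 1; i < retry; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	for retry := 1; retry <= 3; retry++ {
		fmt.Printf("retry %d: wait %s\n", retry, retryDelay(retry, baseDelay, maxDelay))
	}
}
```

Spacing the delete retries out this way keeps the scaletest from resubmitting builds immediately into a system that may already be overloaded.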

@spikecurtis
Contributor

Yeah, I see now that the reconciler only looks at failed starts for retry. Us retrying deletes is reasonable here.

The prebuilds discussions are coming back to me: we wanted to notify operators of failures but not clean up, so that they have a chance to investigate and fix.

@cstyan cstyan merged commit b714fe8 into main Apr 22, 2026
25 of 26 checks passed
@cstyan cstyan deleted the callum/prebuild-scaletest-measure-deletion branch April 22, 2026 20:10
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 22, 2026