fix(scaletest/prebuilds): make measureDeletion more reliable/less brittle #23614
…n tracking Retries failed or canceled delete builds up to 3 times per workspace and exits early if all remaining workspaces exhaust retries. Snapshots the initial workspace count at deletion start as the completion denominator, since the reconciler may create more workspaces than the configured target.
Prebuilds should retry the deletion after a backoff, right? Don't we want to allow the system to behave as it normally would? Us injecting additional builds into the system is sort of exactly the thing that you don't want to do when you are overwhelming some upstream system with requests it can't fulfill (K8s in our case, presumably).
My understanding is that the existing backoff in the reconciler is only for creation, not for deletion, and the reconciler does not retry failed deletes. This is a result of it using queries that grab only the latest build for a given workspace. So we do need our own retry mechanism here, but it could still be that we're overloading something (infra or Coder-related code), so a backoff on the retries is warranted 👍 Or we could just not retry at all.
Yeah, I see now that the reconciler only looks at failed starts for retry. Us retrying deletes is reasonable here. Prebuilds discussions are coming back to me and we wanted to notify operators of failures, but not clean up so that they have a chance to investigate and fix.
Currently measureDeletion uses the exact targetNumWorkspaces; however, due to the way the prebuild reconciler works, we may have created a few more or fewer workspaces than the exact targetNumWorkspaces. The changes in this PR are:

The scaletest isn't a measurement of whether the workspace deletion/template deletion is successful, so retries are reasonable here IMO. Also, trying a few times means we have a lower chance of leaving the cluster in a transient state, which would make re-running a prebuild scaletest impossible. This is due to a few more cleanup procedures that will be improved in other PRs.