feat: cancel pending prebuilds from non-active template versions#20387
Conversation
f1a8e32 to
9cfb23a
Compare
9cfb23a to
29cc88e
Compare
73cf135 to
51fa3ef
Compare
johnstcn
left a comment
There was a problem hiding this comment.
Mostly LGTM, but I'm not 100% convinced about the row locking?
| func (q *querier) UpdatePrebuildProvisionerJobWithCancel(ctx context.Context, arg database.UpdatePrebuildProvisionerJobWithCancelParams) ([]uuid.UUID, error) { | ||
| // This is a system-only operation for canceling pending prebuild-related jobs | ||
| // when a new template version is promoted. | ||
| if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err != nil { |
There was a problem hiding this comment.
note(non-blocking): just want to note rbac.ResourceSystem usage here
There was a problem hiding this comment.
If we give subjectPrebuildsOrchestrator the ability to read/update provisioner jobs, we should be able to use rbac.ResourceProvisionerJob here.
There was a problem hiding this comment.
Yeah, permissions are a bit tricky here.
There is a ResourceProvisionerJobs that we could use here, but that would mean that all users with this permissions would be abel to execute it.
The way it is done for UpdateProvisionerJobWithCancelByID, for instance, is to check the Update permissions on the workspace. Since it is a prebuild, we could also do a
if err := q.authorizeContext(ctx, policy.ActionUpdate, rbac.ResourcePrebuiltWorkspace); err != nil { Maybe this one makes the most sense 🤔
There was a problem hiding this comment.
That's not a bad approximation actually 👍
There was a problem hiding this comment.
Added the rbac.ResourcePrebuiltWorkspace update check in dcb6de1
|
Note: Moving this PR back to draft. After internal discussion, we've decided to move the logic to cancel pending prebuilds to the prebuilds reconciliation loop rather than handling it in the This PR will be updated to reflect these changes:
The PR will be marked ready for review once the new implementation is in place. |
…cel-pending-prebuilds
…cel-pending-prebuilds
| func (q *querier) UpdatePrebuildProvisionerJobWithCancel(ctx context.Context, arg database.UpdatePrebuildProvisionerJobWithCancelParams) ([]uuid.UUID, error) { | ||
| // This is a system-only operation for canceling pending prebuild-related jobs | ||
| // when a new template version is promoted. | ||
| if err := q.authorizeContext(ctx, policy.ActionRead, rbac.ResourceSystem); err != nil { |
There was a problem hiding this comment.
If we give subjectPrebuildsOrchestrator the ability to read/update provisioner jobs, we should be able to use rbac.ResourceProvisionerJob here.
| -- Cancels all pending provisioner jobs for prebuilt workspaces on a specific preset from an | ||
| -- inactive template version. |
There was a problem hiding this comment.
nit: This comment is slightly misleading, the query doesn't check that the preset relates to an inactive template version. It's currently the responsibility of the caller to verify that @preset_id is not related to the current active version ID.
There was a problem hiding this comment.
That is a good catch! I've updated the query to filter out active template versions: 0497f8b
| func TestCancelPendingPrebuilds(t *testing.T) { | ||
| t.Parallel() | ||
|
|
||
| for _, tt := range []struct { |
There was a problem hiding this comment.
nit, non-blocking: Since these tests all boil down to "only these jobs should be canceled and nothing else", I think we could collapse this to a single test case with all of the "should not cancel" data in one. The rationale for this is twofold:
- It reduces the number of test databases we have to create
- It gives us better assurance that only the jobs we care about are being canceled in the presence of more antagonist data.
SasSwart
left a comment
There was a problem hiding this comment.
One nit/question. Otherwise, looks good!
Description
This PR introduces an optimization to automatically cancel pending prebuild-related jobs from non-active template versions in the reconciliation loop.
Problem
Currently, when a template is configured with more prebuild instances than available provisioners, the provisioner queue can become flooded with pending prebuild jobs. This issue is worsened when provisioning/deprovisioning operations take a long time.
When the prebuild reconciliation loop generates jobs faster than provisioners can process them, pending jobs accumulate in the queue. Since prebuilt workspaces should always run the latest active template version, pending prebuild jobs from non-active versions become obsolete once a new version is promoted.
Solution
The reconciliation loop cancels pending prebuild-related jobs from non-active template versions that match the following criteria:
pendingworker_idisNULL)startThis prevents the queue from being cluttered with stale prebuild jobs that would provision workspaces on an outdated template version that would consequently need to be deprovisioned.
Changes
CountPendingNonActivePrebuildsto identify presets with pending jobs from non-active versionsUpdatePrebuildProvisionerJobWithCancelto cancel jobs for a specific presetActionTypeCancelPendinghandles the cancellation logicFollow-up PR
Canceling pending prebuild jobs leaves workspaces in a Canceled state. While no Terraform resources need to be destroyed (since jobs were canceled before provisioning started), these database records should still be cleaned up. This will be addressed in a follow-up PR.
Closes: #20242