provisionerd sends failed or complete last#2732
Conversation
Signed-off-by: Spike Curtis <spike@coder.com>
| return false | ||
| } | ||
|
|
||
| func ForkReap(opt ...Option) error { |
There was a problem hiding this comment.
This trips up the linter on my box, so I fixed. Missed by CI because CI linting happens on linux.
| for _, log := range logs { | ||
| select { | ||
| case bufferedLogs <- log: | ||
| api.Logger.Debug(r.Context(), "subscribe buffered log", slog.F("job_id", job.ID), slog.F("stage", log.Stage)) |
There was a problem hiding this comment.
Is this going to be noisy in our logs?
| type messageSender interface { | ||
| updateJob(ctx context.Context, in *proto.UpdateJobRequest) (*proto.UpdateJobResponse, error) | ||
| failJob(ctx context.Context, in *proto.FailedJob) error | ||
| completeJob(ctx context.Context, in *proto.CompletedJob) error | ||
| } |
There was a problem hiding this comment.
Could this just use the DRPCProvisionerDaemonServer interface instead? Provisionerd seems to be wrapping those, which is fine, but that's just additional functionality and not required to fulfill the interface. This could reduce misdirection.
There was a problem hiding this comment.
That interface is larger, giving access to the Conn and to the ability to acquire a new job. Don't want the runner doing that.
There was a problem hiding this comment.
The provisionerd.Server already has an acquireJob() method and it doesn't have this signature, nor should it.
Also, DRPCProvisionerDaemonServer has these as methods visible outside the package and we don't want that.
There was a problem hiding this comment.
I'm not sure creating a single interface that's used in a single place improves clarity around that. I feel like a comment explaining why acquireJob shouldn't be passed through would do just as well, because a consumer would just add it if they really wanted to call it anyways.
There was a problem hiding this comment.
So, with the new structure of Runner in a separate package, we definitely need an interface here: *Runner can't use the *Server type because it would cause an import loop.
I'm pretty strongly against reusing the DRPCProvisionerDaemonServer interface for this purpose because it requires implementing and exposing the AcquireJob() method.
"Keep interfaces small" is generally considered a best practice.
More importantly in this context, though, is that we don't want the Runner acquiring jobs. That's not the contract: Server acquires the job, Runner updates/fails/completes it.
Moreover, the Server is presently designed with the assumption that there is only one acquired job at a time. If the Server exposes an external method to allow other things to acquire jobs, that assumption could easily be violated. We should not leave guns like that lying around for our future selves to get shot in the foot.
mafredri
left a comment
There was a problem hiding this comment.
Some questions and minor stylistic comments, but in general this looks good to me! I'm deferring approval to someone more familiar with the previous logic, though.
Side-note: Being somewhat unfamiliar with provisionerd, it'd have been nice to have this as two PRs, one breaking out the functionality from one file, and another implementing the fix so it's clearer what's new and old code.
| // are canceled but complete asynchronously (although they are prevented from further updating the job to the coder | ||
| // server). The provided context sets how long to keep trying to send the FailJob. | ||
| func (r *runner) fail(ctx context.Context, f *proto.FailedJob) error { | ||
| f.JobId = r.job.JobId |
There was a problem hiding this comment.
Should this be protected by mutex? Also set in setFail.
There was a problem hiding this comment.
r.job is never changed, and so doesn't need to be protected.
There was a problem hiding this comment.
I was referring to f.JobId which is re-assigned.
There was a problem hiding this comment.
Ah, I just noticed that f won't actually be re-used between function invocations, so not a problem. Curious why this uses r.job.JobId and the other r.job.GetJobId()?
There was a problem hiding this comment.
that's on the passed in proto, and isn't protected by any mutex
There was a problem hiding this comment.
this uses r.job.JobId and the other r.job.GetJobId()
that's just me being inconsistent for no reason.
| case <-r.forceStopContext.Done(): | ||
| return | ||
| case <-r.gracefulContext.Done(): | ||
| _ = stream.Send(&sdkproto.Provision_Request{ |
There was a problem hiding this comment.
I'm not sure what's supposed to happen here. Since we're closing the stream on function exit (above), we may do stream.Send on a closed stream here. I'm guessing this should be OK, but a comment explaining it may be good.
There was a problem hiding this comment.
@spikecurtis did you see this comment? (Might've gotten lost in the noise, won't object if you feel a comment is not needed.)
There was a problem hiding this comment.
Sorry, I kinda mentally skipped over it because that's existing code, other than the names of the contexts.
But yeah, I think your analysis is right, the stream might be closed at the point this case fires, but sending the cancel is best effort anyway.
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
Signed-off-by: Spike Curtis <spike@coder.com>
mafredri
left a comment
There was a problem hiding this comment.
Did a second pass on this and it looks good! I like the refactor into a separate package, turned out well. 👍🏻
(Approving, although I still think approval from someone more familiar with provisionerd might be good.)
| case <-r.forceStopContext.Done(): | ||
| return | ||
| case <-r.gracefulContext.Done(): | ||
| _ = stream.Send(&sdkproto.Provision_Request{ |
There was a problem hiding this comment.
@spikecurtis did you see this comment? (Might've gotten lost in the noise, won't object if you feel a comment is not needed.)
kylecarbs
left a comment
There was a problem hiding this comment.
I'm nervous about the lack of tests on the runner package. What do you think about moving those over from provisionerd_test.go?
| CompleteJob(ctx context.Context, in *proto.CompletedJob) error | ||
| } | ||
|
|
||
| func NewRunner( |
There was a problem hiding this comment.
This could be called New to eliminate redundancy in the naming.
@kylecarbs From what I looked, there's relatively high coverage via existing |
|
@mafredri I dislike us separating tests from the package at hand. We have a unique case with codersdk, but I think that should stay isolated. |
|
We can't just move the tests in It's a good idea, and I'm happy to do it, but I'm not convinced it's the best use of my (or anyone's, really) time right now. WDYT @kylecarbs ? |
|
Seems fine to me. I use VS Code's test gutters to check what is/isn't covered, and it works pretty well, so I'm sad this won't appear there. I think we make an issue for it, then this looks good to me! |
|
Issue is #2780 |

Substantially narrows the race window on #2603 but doesn't close it.
This PR refactors provisionerd to include a job runner that is designed to ensure that we get at most 1 FailedJob or CompletedJob message per job, and that we cease any updates for jobs once this is sent.
It uses unexported interfaces to distinguish between methods that can be called by things like the main provisionerd goroutines (things like canceling the current job), and methods internal to the runner. Similarly, we abstract the provisionerd server from the runner so that we ensure that sending messages about jobs happens with consistent code.