The balancer used to trigger a rebalancing on a restart from a checkpoint. This was a bad idea in hindsight. Not only did it slow down failover, it also caused the initial step, which is executed before the pipeline transitions to the RUNNING state, to potentially take a very long time. During this time none of the monitoring features are available, making this a potentially very long and awkward silence.

This commit disables this behavior. It also introduces a new DBSP-level API that disables automatic rebalancing altogether. The plan is to expose this API externally, allowing the user to control when rebalancings are allowed to happen, e.g., a latency-sensitive workload may disable rebalancing after initial backfill. This will be combined with an API to request a rebalancing on demand.

For now this API is only used to disable rebalancing during the initial step, making sure that a regular (non-forced) rebalancing doesn't trigger accidentally at that point.

This commit additionally skips the initial step when the pipeline is bootstrapping. The reason is similar to the above: bootstrapping steps can be expensive, so we don't want to perform them while the pipeline is initializing. This does mean that the pipeline won't initialize output snapshots until bootstrapping completes, but that was already the case previously.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
blp left a comment:
Can skipping rebalancing on the initial step make that step take much longer?
The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as a complete backfill in the worst case; (2) bootstrapping kicks in, which is less dramatic but can still be bad. This commit should eliminate both of these situations. With this change the initial step should be fast regardless of skew or any other inefficiencies in the pipeline, because it doesn't do any actual work. Rebalancing can still kick in during the following step, but at least the pipeline will be in the RUNNING state and display some metrics. In a follow-up fix I want to convert all steps into transactions, so we can at least monitor the progress of a slow step (e.g., if rebalancing kicks in), which today looks like the pipeline is stuck without any indication of what's going on. I also want to make rebalancing events more controllable, so the user can block rebalancing or run it on demand.
PS. I need to add an integration test before marking this PR ready.
Oh, I get it now, I think: without rebalancing, there's no work to do because there's no input; with rebalancing, the entire integral up to this point ends up being processed. Is that right?
correct!
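To make the point above concrete, here is a minimal toy model of the cost of a step with and without rebalancing. It is only a sketch: `Balancer`, `set_auto_rebalance`, `step`, and `initial_step` are hypothetical names invented for illustration and do not reflect the actual DBSP API or the code in this PR.

```rust
// Hypothetical model; none of these names come from the DBSP codebase.

/// Toy balancer that tracks the size of the accumulated state (the integral)
/// and whether automatic rebalancing is currently allowed.
pub struct Balancer {
    integral_len: usize,
    auto_rebalance_enabled: bool,
    skewed: bool,
}

impl Balancer {
    pub fn new(integral_len: usize, skewed: bool) -> Self {
        Balancer {
            integral_len,
            auto_rebalance_enabled: true,
            skewed,
        }
    }

    /// Stand-in for the switch that disables automatic rebalancing.
    pub fn set_auto_rebalance(&mut self, enabled: bool) {
        self.auto_rebalance_enabled = enabled;
    }

    /// Runs one step and returns the number of records it had to process.
    pub fn step(&mut self, input_len: usize) -> usize {
        let mut work = input_len;
        if self.auto_rebalance_enabled && self.skewed {
            // Rebalancing redistributes the whole integral, so its cost
            // scales with the accumulated state, not with the input.
            work += self.integral_len;
            self.skewed = false;
        }
        self.integral_len += input_len;
        work
    }
}

/// Initial step after a restart from a checkpoint: it receives no input,
/// and automatic rebalancing is blocked for its duration.
pub fn initial_step(balancer: &mut Balancer) -> usize {
    balancer.set_auto_rebalance(false);
    let work = balancer.step(0);
    balancer.set_auto_rebalance(true);
    work
}
```

In this model, a skewed balancer holding 1,000,000 records does zero work in `initial_step`, and the rebalancing cost is deferred to the next regular `step`, by which time the pipeline would already be in the RUNNING state and reporting metrics.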