[dbsp] Don't rebalance during initial step. #6013

Draft
ryzhyk wants to merge 1 commit into main from init_step

@ryzhyk ryzhyk commented Apr 8, 2026

The balancer used to trigger a rebalancing on a restart from a checkpoint. This was a bad idea in hindsight. Not only did it slow down failover, it also caused the initial step, which is executed before the pipeline transitions to the RUNNING state, to potentially take a very long time. During this time none of the monitoring features are available, making this a potentially very long and awkward silence.

This commit disables this behavior. It also introduces a new DBSP-level API that disables automatic rebalancing altogether. The plan is to expose this API externally, allowing the user to control when rebalancings are allowed to happen, e.g., a latency-sensitive workload may disable rebalancing after initial backfill. This will be combined with an API to request a rebalancing on demand.

For now this API is only used to disable rebalancing during the initial step, making sure that a regular (non-forced) rebalancing doesn't trigger accidentally at that point.
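
To make the intended behavior concrete, here is a minimal sketch of a balancer policy with an on/off switch for automatic rebalancing. All names (`BalancerPolicy`, `set_auto_rebalance`, `should_rebalance`, `with_auto_rebalance_disabled`) and the skew threshold are hypothetical and chosen for illustration; this is not the API added by this commit. It also assumes that a forced rebalancing bypasses the switch, which the description only implies.

```rust
/// Hypothetical stand-in for the balancer's policy state; not the real DBSP type.
#[derive(Debug)]
pub struct BalancerPolicy {
    /// When false, only explicitly forced rebalancings may run.
    auto_rebalance: bool,
    /// Illustrative skew threshold above which an automatic rebalancing triggers.
    skew_threshold: f64,
}

impl BalancerPolicy {
    /// Automatic rebalancing starts out enabled in this sketch.
    pub fn new(skew_threshold: f64) -> Self {
        Self { auto_rebalance: true, skew_threshold }
    }

    /// Enable or disable automatic (non-forced) rebalancing.
    pub fn set_auto_rebalance(&mut self, enabled: bool) {
        self.auto_rebalance = enabled;
    }

    /// Assumption: a forced rebalancing always runs; an automatic one runs
    /// only if the switch is on and the observed skew exceeds the threshold.
    pub fn should_rebalance(&self, observed_skew: f64, forced: bool) -> bool {
        forced || (self.auto_rebalance && observed_skew > self.skew_threshold)
    }
}

/// Run `step` with automatic rebalancing disabled, then restore the previous
/// setting. This mirrors how the initial step is protected in this sketch.
pub fn with_auto_rebalance_disabled<T>(
    policy: &mut BalancerPolicy,
    step: impl FnOnce(&BalancerPolicy) -> T,
) -> T {
    let previous = policy.auto_rebalance;
    policy.auto_rebalance = false;
    let result = step(&*policy);
    policy.auto_rebalance = previous;
    result
}
```

In this model, wrapping the initial step in `with_auto_rebalance_disabled` guarantees that high skew alone cannot start a rebalancing during that step, while the setting in effect before the step is restored afterwards.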

This commit additionally skips the initial step when the pipeline is bootstrapping. The reason is similar to the above: bootstrapping steps can be expensive, so we don't want to perform them while the pipeline is initializing. This does mean that the pipeline won't initialize output snapshots until bootstrapping completes, but that was already the case previously.
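
The startup decision described above can be summarized as a small sketch. `StartupPlan` and `plan_startup` are hypothetical names, not the controller's actual code, and the sketch assumes that output snapshots are initialized by the initial step whenever that step runs.

```rust
/// Hypothetical summary of what happens before the pipeline reaches RUNNING.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct StartupPlan {
    /// Whether the input-free initial step is executed during initialization.
    pub run_initial_step: bool,
    /// Whether automatic rebalancing may trigger during the initial step.
    pub auto_rebalance_during_initial_step: bool,
    /// Whether output snapshots are initialized before RUNNING (assumption:
    /// they are initialized by the initial step when it runs).
    pub snapshots_ready_at_startup: bool,
}

/// Decide the startup behavior from the bootstrapping state.
pub fn plan_startup(bootstrapping: bool) -> StartupPlan {
    StartupPlan {
        // Bootstrapping steps can be expensive, so the initial step is skipped.
        run_initial_step: !bootstrapping,
        // Automatic rebalancing is never allowed during the initial step.
        auto_rebalance_during_initial_step: false,
        // While bootstrapping, snapshots wait until bootstrapping completes.
        snapshots_ready_at_startup: !bootstrapping,
    }
}
```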

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk requested a review from blp April 8, 2026 22:03
@ryzhyk ryzhyk added the DBSP core Related to the core DBSP library label Apr 8, 2026

@mythical-fred mythical-fred left a comment

LGTM

@blp blp left a comment

Can not rebalancing on the initial step make that step take much longer?

ryzhyk commented Apr 9, 2026

> Can not rebalancing on the initial step make that step take much longer?

The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as a complete backfill in the worst case, (2) bootstrapping kicks in, which is less dramatic but can still be bad.

This commit should eliminate both of these situations. With this change, the initial step should be fast regardless of skew or any other inefficiencies in the pipeline, because it doesn't do any actual work.

Rebalancing can still kick in during the following step. But at least the pipeline will be in the RUNNING state and display some metrics.

In a follow-up fix I want to convert all steps into transactions, so we can at least monitor the progress of a slow step (e.g., if rebalancing kicks in), which today looks like the pipeline is stuck without any indication of what's going on.

I also want to make rebalancing events more controllable, so the user can block rebalancing or run it on demand.

ryzhyk commented Apr 9, 2026

PS. I need to add an integration test before marking this PR ready.

blp commented Apr 9, 2026

> > Can not rebalancing on the initial step make that step take much longer?
>
> The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as a complete backfill in the worst case, (2) bootstrapping kicks in, which is less dramatic but can still be bad.

Oh, I get it now, I think: without rebalancing, there's no work to do because there's no input; with rebalancing, the entire integral up to this point ends up being processed. Is that right?

ryzhyk commented Apr 9, 2026

> > > Can not rebalancing on the initial step make that step take much longer?
> >
> > The initial step doesn't get any input and should normally complete instantaneously. We've identified two cases where this was not true: (1) rebalancing kicks in, which can be as bad as a complete backfill in the worst case, (2) bootstrapping kicks in, which is less dramatic but can still be bad.
>
> Oh, I get it now, I think: without rebalancing, there's no work to do because there's no input; with rebalancing, the entire integral up to this point ends up being processed. Is that right?

correct!
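
The point confirmed in this exchange can be illustrated with a toy cost model. `step_work` is a hypothetical function, not part of DBSP, and it assumes the worst case mentioned earlier in the thread, where a rebalancing reprocesses the entire integral.

```rust
/// Toy model of how many records a single step has to process.
///
/// `input_records` is the new input consumed by the step, and
/// `integral_records` is the size of the state accumulated so far.
/// Worst-case assumption: a rebalancing reprocesses the whole integral.
pub fn step_work(input_records: u64, integral_records: u64, rebalancing: bool) -> u64 {
    let rebalance_work = if rebalancing { integral_records } else { 0 };
    input_records + rebalance_work
}
```

Under this model, the initial step has `input_records == 0`, so without rebalancing it does no work regardless of how large the integral is, while with rebalancing its cost grows with everything the pipeline has ingested so far.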
