[dbsp, adapters] Transactional bootstrapping#6223
94add9f to d33c656
mihaibudiu
left a comment
Can't say I spotted the bug that is being fixed...
self.last_checkpoint_sync(),
self.replaying(),
self.circuit.bootstrap_in_progress(),
self.controller.status.bootstrap_in_progress(),
Any reason not to reuse `self.bootstrapping`?
If there is one, maybe this needs an explanation.
Added a comment explaining this.
mythical-fred
left a comment
Approving — solid set of fixes to bootstrapping, well-tested.
The four threads are easy to follow individually and each has matching test coverage: the PK regression test in `replay_tests.rs::regression1`, the new `test_integrate_trace_bootstrap_is_transactional` / `test_accumulate_trace_bootstrap_is_transactional` cases for splitter-style replay, the `is_commit_complete()` tightening in `CircuitHandle`, and the `previously_non_empty_views` assertion in the TPC-H segment test for the #6091 snapshot race.
A few small nits, none blocking; please clean up when convenient:

- `crates/adapters/src/controller.rs`: in the new `bootstrapping` handling, the `else` branch calls `self.circuit.bootstrap_in_progress()` again instead of reusing the local `bootstrapping` you just captured a few lines above. Minor, but the duplication invites future drift.
- `crates/dbsp/src/circuit/circuit_builder.rs`: the comment in `register_replay_stream` calls the change a "workaround" for the input-integral-plus-downstream-integral case. Worth filing a tracking issue to actually de-duplicate the integrals so we don't rely on registration order forever (and so a future refactor that reorders calls doesn't silently break replay).
- `crates/dbsp/src/circuit/replay_tests.rs::regression1`: two commented-out lines (`//let input_stream = input_stream.apply_owned(...)` and `//let output = output_handle2.take_from_all()...`) should be deleted before merge.
- `python/tests/workloads/test_tpch.py`: the 4-line `# log("Waiting for outputs to flush")` block near the end is commented out; remove it or add a comment explaining why it's parked.
- `crates/dbsp/src/circuit/schedule/dynamic_scheduler.rs`: initializing `transaction_phase` to `CommitComplete` works, but the name now reads oddly for "no transaction has ever started." Consider renaming the variant to something neutral (e.g. `Inactive`) in a follow-up.
- Commit body of "Fix bootstrapping of tables with a PK": typo, `simle` -> `simple`.
Doc/changelog boxes are unchecked, but I don't see anything user-visible here (the removed `set_replay_step_size`/`get_replay_step_size` are internal Rust APIs on `DBSPHandle`), so I'd leave that to your judgment.
The "TODO: run the test as part of QA" in the PR description — please file that as a follow-up issue so it doesn't get forgotten.
mythical-fred
left a comment
Adding the nits inline as I should have done in the first review — same content, properly anchored.
Box::new(replay_stream.clone()),
);
// If a replay source already exists, don't overwrite it. This normally shouldn't
// happen as we should not have more than one integral for each stream. One situation
The comment correctly calls this a workaround — could you file a follow-up issue to actually de-duplicate the integrals? Relying on registration order is fragile; a future refactor that swaps the order of input_upsert vs. the downstream join/aggregate construction would silently break replay again, and the symptom ("not materialized") is misleading.
handles: JoinSet::new(),
waiting: false,
transaction_phase: TransactionPhase::Idle,
transaction_phase: TransactionPhase::CommitComplete,
Initializing to CommitComplete reads as "a commit just finished" when in fact no transaction has ever started. Consider renaming the variant to something neutral like Inactive (or NoTransaction) in a follow-up — semantics are fine, just the name.
pipeline.pause()

# log("Waiting for outputs to flush")
Four-line commented-out wait_for_completion block — either delete or add a comment explaining why it's parked, otherwise the next person to touch this file won't know whether to revive it.
blp
left a comment
Thank you!
I'm going to have to try the TPC-H checkpoint mode. I didn't know about that.
41d9eb6 to 7148c81
This fixes an issue when bootstrapping a table with a PK when there is a downstream operator attached to it that creates an integral of the same table. We ended up with two integrals that can both be used to replay the same stream, one with and one without an accumulator.

We used to replay from the last registered replay source, which meant that if the second integral was added in the modified version of the program, it was empty and replay failed, despite the fact that the input integral could be used for replay. To make things worse, we report this error as the input table not being materialized, which is simply wrong.

This commit adds a simle workaround that uses the first registered replay source (by refusing to register new replay sources for the same stream) and a regression test for this.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Fixes #4736

Bootstrapping used to be performed in a sequence of transactions. Depending on the program, this could be inefficient due to redundant recomputation. In addition, this produced multiple small batches of potentially mutually canceling outputs. We now have all the infra needed to change this.

This commit changes the Z1Trace and AccumulateZ1Trace operators to behave as splitters, i.e., they replay their entire contents across multiple steps within the same transaction.

We also get rid of the `replay_step_size` knob. Instead we use the existing `splitter_chunk_size` setting, which controls the number of records produced by splitters per step.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Improve TPC-H test in `checkpoint` mode, which can be used to torture-test bootstrapping:

- Configurable number and size of test segments. With these new options we can scale the test up and down using the same dataset (typically TPC-H).
- Check that views are initialized after bootstrapping.

Example: split views into 2 segments, 10M records each:

```
uv run test_tpch.py --s3-bucket feldera-qa-data --s3-prefix tpc-h-100 --s3-region us-west-1 --mode checkpoint --num-segments 2 --segment-size 10000000
```

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Remove redundant `TransactionPhase::Idle` status. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Bootstrapping is considered complete when all replay sources are complete and the bootstrapping transaction has committed. The latter condition was missing. This did not cause any issues because we only called this function between transactions.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Fix #6091.

Views that don't participate in bootstrapping don't update their snapshots until the first post-bootstrap transaction. As a result, a client making ad hoc queries right after bootstrapping completes could observe empty views.

We fix this by:

1. Forcing an extra transaction after bootstrapping completes (this was already the case).
2. Maintaining `bootstrap_in_progress` status until the extra transaction commits.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
Don't update the snapshot until bootstrapping is complete (including the additional post-bootstrap transaction). This guarantees that:

1. Ad hoc queries observe a consistent snapshot of all views.
2. Connectors configured with `send_snapshot=true` don't receive empty snapshots.

Extended the Python output snapshot test with an extra step that fails without this fix.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
7148c81 to dffc5b1
Fix #6091, #4736
Several important improvements to the bootstrapping mechanism:
See individual commits.