Skip to content

[adapters] add multihost S3 checkpoint sync coordination endpoints#6195

Open
abhizer wants to merge 2 commits into
mainfrom
multihost-s3-sync
Open

[adapters] add multihost S3 checkpoint sync coordination endpoints#6195
abhizer wants to merge 2 commits into
mainfrom
multihost-s3-sync

Conversation

@abhizer
Copy link
Copy Markdown
Contributor

@abhizer abhizer commented May 6, 2026

  • Add POST /coordination/checkpoint/push for the coordinator to direct each pod to sync its checkpoint to S3.
  • Add POST /coordination/checkpoint/pull (non-blocking, 202 Accepted) and GET /coordination/checkpoint/pull_status so the coordinator can trigger and poll checkpoint pulls without blocking.
  • Guard POST /checkpoint/sync against multihost pipelines; all multihost sync requests must go through /coordination/checkpoint/push.
  • Simplify solo-only RunningCheckpointSync to use host_info: None.
  • Refactor test_checkpoint_sync into focused per-scenario test methods with shared helper utilities.

Describe Manual Test Plan

  • Compile with the multihost supported checkpoint-sync crate. Start a pipeline with sync config, with fault tolerance (checkpointing every 5s) and automated push every 5s. Verify that the checkpoints are getting pushed to the S3 bucket.
  • Stop this pipeline and restart it from the latest S3 checkpoint.
  • Stop this pipeline and restart it in standby mode.
  • Run the S3 sync tests.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

  • The standby field in sync config in runtime config is now deprecated. To start a pipeline in standby mode, one must the /start api with initial=standby.

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM test

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: GitHub's inline-comment API is rejecting all inline comments on this PR right now (pull_request_review_thread.base — An internal error occurred), reproduced on a different PR in the same repo. Falling back to a body-only review; happy to repost as inline threads once the API recovers.

Blockers

1. crates/feldera-types/src/config.rs line 533 — silently-ignored standby field.
Deprecating to a no-op is a behavior change with no user-visible feedback. Pipelines that today set standby: true will silently not enter standby. The previous validate() rejection of standby + start_from_checkpoint=None is also gone (line ~609), so misconfigurations now pass quietly. Two requests:

  • Emit tracing::warn! when standby is set, telling the user to migrate to initial=standby.
  • Add a test that pins both the deprecated-but-set path and the now-passing config that used to be rejected.

Also: the feldera-types breaking-changes box is unchecked despite this being a published-config change. Please flip it and document the migration in the PR description.

2. crates/adapters/src/server.rs lines 2607–2660 — coordination_checkpoint_pull has no concurrency guard.
Each call unconditionally writes pull_state = InProgress and spawns a fresh background pull. Two concurrent invocations both write to the same local storage and clobber each other's pull_state, so the polled status no longer reflects the latest request. Even if the coordinator promises serial calls, defending here is cheap: check the current state, return 409 (or short-circuit) if already InProgress. Same applies to coordination_checkpoint_push (line 1979) — async_sync_checkpoint spawned twice will race on the shared sync_checkpoint_state.

3. crates/adapters/src/server.rs line 2596 — coordinator-supplied host_info is taken at face value.
No cross-check against the pod's own layout. A misconfigured coordinator can direct pod 0 to pull from host3/; the pod will comply, and the mismatch will only surface (badly) at activation. If controller.layout().host_info() is available here, assert/compare and 400 on mismatch. If not (pre-init), at least log the supplied host_info and the eventual layout decision.

4. Manual Test Plan is TODO.
Integration tests cover the existing sync paths well, but the new /coordination/checkpoint/{push,pull,pull_status} endpoints are not exercised by the Python suite — those still drive sync_checkpoint/start_standby against the manager. Either describe the multihost end-to-end run you actually performed, or add a test that hits the coordinator endpoints directly.

Soft findings

crates/adapters/src/server.rs line 2598 — standby defaults to false.
The trait doc says standby=true bypasses the local cache and treats a missing remote checkpoint as an error — i.e. the safer mode for a coordinator-initiated pull. A coordinator that omits the field gets the less-safe default. Either default to true, or require it explicitly.

crates/adapters/src/server.rs line 2659 — CheckpointPullStatus::Error { error: String }.
This is the only forensic surface the coordinator gets. Please confirm S3 errors percolating up include bucket, key/prefix, and HTTP status; if they don't today, this is the place to add that context.

crates/feldera-types/src/checkpoint.rs line 146 — HostInfo::prefix() doesn't validate host_idx < n_hosts.
A bogus HostInfo { host_idx: 42, n_hosts: 2 } quietly produces "host42". The struct has both fields because the relationship matters — enforce in a constructor or debug_assert! in prefix().

crates/adapters/src/controller/sync.rs line 224 — removed if !sync.standby { return; } early-return in continuous_pull.
Functionally fine because continuous_pull is only invoked from the standby path, but the loop now relies on that external invariant. A short comment at the top would be cheap insurance against a future caller.

Test refactor: the _wait_for_automated_checkpoint helper sums processed_records across pods grouped by steps. That assumes per-pod processed_records is partitioned (sums to the global total) rather than replicated — worth a one-line comment explaining the assumption so it doesn't silently drift.

@abhizer abhizer force-pushed the multihost-s3-sync branch 3 times, most recently from fd508ab to 69eee38 Compare May 7, 2026 09:19
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review on 69eee38. Two prior blockers fixed; two still open. Holding RC.

Resolved

  • standby deprecation: SyncConfig::validate() now returns a hard error when standby=true (config.rs:608–612), and the field carries #[deprecated]. No more silent no-op — good. (Note: this is now a hard rejection of previously-accepted config; the PR description still has the Breaking Changes box unchecked under Adapters/Configuration. Please tick it and add a one-liner to the Unreleased changelog so operators carrying old config see the migration step.)
  • coordination_checkpoint_pull (server.rs ~2671): now guards with check-and-set under pull_state mutex — if InProgress { return 200 } else *pull_state = InProgress atomically. Race resolved.

Still blocking

  • Manual Test Plan still says TODO and the new coordinator endpoints (/coordination/checkpoint/push, /coordination/checkpoint/pull, /coordination/checkpoint/pull_status) have zero Python integration coverage. python/tests/platform/test_checkpoint_sync.py greps clean for coordination, multihost, host_info, /push, /pull — every new test is solo-mode. Please add at least one multihost test that exercises coordinator-driven push + pull + pull_status (and the new 400 guard on the solo /checkpoint/sync path), and fill in the Manual Test Plan section.

Open concern (downgraded — design question)

  • coordination_checkpoint_pull (server.rs ~2624) still trusts the coordinator-supplied host_info verbatim. I see this runs pre-InitializationComplete, so the pod doesn't have a controller.layout() yet — fair. But the pod does know its identity from startup args (the same source that later feeds Layout::new_multihost). At minimum, please cross-check host_info.n_hosts against whatever the pod knows about its multihost configuration, and fail loudly on mismatch. A coordinator bug that swaps two pods' host_idx will currently cause a silent mis-pull. If you really want to defer this, please leave a comment in the handler stating the trust model explicitly.

Soft findings (not blocking)

  • coordination_checkpoint_push (server.rs ~1983) has no in-handler guard; it relies on Controller::running_checkpoint_sync being a single-slot to serialize the actual S3 sync. That's fine for correctness, but two concurrent pushes for different UUIDs will last-write-wins on sync_checkpoint_state.completed() — please add a brief comment noting this and that the coordinator is expected not to issue overlapping pushes.
  • HostInfo::prefix() (checkpoint.rs:147) now log::warn!s on host_idx >= n_hosts instead of asserting. Better than silent, but a malformed prefix (host42 for a 2-host pipeline) will still hit S3 and produce a confusing 404. Prefer debug_assert! + Result return, or at least a hard error.
  • continuous_pull (controller/sync.rs ~250): the if !sync.standby { return Ok(()) } early-return is now gone, so this loops forever unless the caller stops invoking it. The new caller-side invariant (only call from standby-mode contexts) isn't documented on the function. Please add a # Panics / # Caller invariants rustdoc paragraph.
  • CheckpointPullStatus::Error { error: String } is still stringly. For programmatic clients (e.g., the coordinator deciding whether to retry), an enum or at least an error_kind discriminant would be friendlier. Non-blocking.

@abhizer abhizer force-pushed the multihost-s3-sync branch from 69eee38 to 3fffb3e Compare May 7, 2026 12:01
- Add POST /coordination/checkpoint/push for the coordinator to direct
  each pod to sync its checkpoint to S3.
- Add POST /coordination/checkpoint/pull (non-blocking, 202 Accepted)
  and GET /coordination/checkpoint/pull_status so the coordinator can
  trigger and poll checkpoint pulls without blocking.
- Guard POST /checkpoint/sync against multihost pipelines; all multihost
  sync requests must go through /coordination/checkpoint/push.
- Refactor test_checkpoint_sync into focused per-scenario test methods
  with shared helper utilities.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@abhizer abhizer force-pushed the multihost-s3-sync branch from 3fffb3e to 388b21e Compare May 7, 2026 12:53
@abhizer abhizer requested a review from mythical-fred May 7, 2026 15:04
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$Both prior blockers resolved on 388b21ec:\n\n1. Manual Test Plan now describes the multihost run, restart from latest, standby restart, and the S3 sync test sweep.\n2. @single_host_only removed from the bulk of test_checkpoint_sync.py (test_checkpoint_sync, test_without_clearing_storage, test_automated_checkpoint, test_automated_checkpoint_sync{,1}, test_autherr{,_fail}, test_standby_activation, test_local_checkpoint_priority, test_read_bucket{,_standby}, test_bucket_preferred_over_read_bucket, test_standby_bucket_takes_over_from_read_bucket, test_local_priority_over_read_bucket, test_read_bucket_strict_fail, test_automated_sync_auth_error). The new /coordination/checkpoint/{push,pull,pull_status} endpoints now have integration coverage via the standard multihost path.\n\nDesign concern about coordination_checkpoint_pull trusting coordinator-supplied host_info and the soft items on coordination_checkpoint_push last-write-wins, HostInfo::prefix() malformed-prefix warning, and CheckpointPullStatus::Error being stringly remain worth tracking but not blockers.\n\nApproving.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@abhizer abhizer requested a review from blp May 11, 2026 09:18
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New commit adds multihost S3 sync documentation to checkpoint-sync.md. This was one of the open items from my prior review. Prior approval stands.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants