[adapters] Redesign /checkpoint API to tolerate slow checkpoints #4022

@blp

Description

The pipeline /checkpoint API is synchronous: it triggers a checkpoint and returns only when the checkpoint succeeds or fails. Because a checkpoint can take a long time, it would be better for the call to return as soon as the checkpoint has been initiated and to report completion through an asynchronous mechanism.

Design proposal

Serial numbers can work OK for this kind of API. For example, "/checkpoint" can return the next serial number. Then we'd add a few items to status reports:

  • last_succeeded, the serial number of the last checkpoint operation that succeeded.
  • last_failed, the serial number of the last checkpoint operation that failed, along with the error message associated with that failure.

This serves the usual goal of a caller, which is to find out whether a checkpoint has been successfully written since the time it was requested. If seq is the serial number that /checkpoint returned, the answer is yes exactly when last_succeeded >= seq.
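
To make the proposal concrete, here is a minimal Rust sketch of the bookkeeping. Only last_succeeded, last_failed, and the last_succeeded >= seq test come from the proposal above. The type and function names (CheckpointStatus, CheckpointTracker, Outcome, outcome) are hypothetical, and so is the rule that treats last_failed >= seq as a failure. The sketch also assumes that checkpoints complete in serial-number order.

```rust
/// Hypothetical checkpoint portion of a status report.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct CheckpointStatus {
    /// Serial number of the last checkpoint that succeeded, if any.
    pub last_succeeded: Option<u64>,
    /// Serial number and error message of the last checkpoint that failed, if any.
    pub last_failed: Option<(u64, String)>,
}

/// Hypothetical server-side state behind `/checkpoint`.
#[derive(Debug, Default)]
pub struct CheckpointTracker {
    next_serial: u64,
    status: CheckpointStatus,
}

impl CheckpointTracker {
    /// What `/checkpoint` would do: assign the next serial number and
    /// return it right away, without waiting for the checkpoint to finish.
    pub fn request(&mut self) -> u64 {
        self.next_serial += 1;
        self.next_serial
    }

    /// Record the result of checkpoint `seq` once it finishes.
    /// Assumes checkpoints complete in serial-number order.
    pub fn complete(&mut self, seq: u64, result: Result<(), String>) {
        match result {
            Ok(()) => self.status.last_succeeded = Some(seq),
            Err(msg) => self.status.last_failed = Some((seq, msg)),
        }
    }

    /// Snapshot that would be included in a status report.
    pub fn status(&self) -> CheckpointStatus {
        self.status.clone()
    }
}

/// What a caller can conclude about the checkpoint it requested.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Pending,
    Succeeded,
    Failed(String),
}

/// Interpret a status report for the serial number `seq` returned by `/checkpoint`.
pub fn outcome(status: &CheckpointStatus, seq: u64) -> Outcome {
    // The test from the proposal: a checkpoint at or after `seq` succeeded.
    if status.last_succeeded.map_or(false, |s| s >= seq) {
        return Outcome::Succeeded;
    }
    // Assumption: a failure at or after `seq`, with no success, counts as failed.
    match &status.last_failed {
        Some((s, msg)) if *s >= seq => Outcome::Failed(msg.clone()),
        _ => Outcome::Pending,
    }
}
```

A caller would keep the serial number it received and call outcome on each status report until the result is no longer Pending. Success is checked first, so a later successful checkpoint satisfies a caller whose own request failed, which matches the goal of knowing that some checkpoint has been written since the request.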

Metadata

Labels

  • connectors: Issues related to the adapters/connectors crate
  • ft: Fault tolerant, distributed, and scale-out implementation
  • rust: Pull requests that update Rust code
