Skip to content

[adapters] Delta output: enable checkpoints.#6006

Merged
gz merged 1 commit intomainfrom
delta_checkpoints
Apr 8, 2026
Merged

[adapters] Delta output: enable checkpoints.#6006
gz merged 1 commit intomainfrom
delta_checkpoints

Conversation

@ryzhyk
Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk commented Apr 7, 2026

The Delta output connector did not create periodic checkpoints.

While this is in itself problematic, it also meant that the connector became slow over time, due to this delta-rs bug, which causes the update_incremental function to scan the entire transaction log on every commit:
delta-io/delta-kernel-rs#2103.

This commit:

  • Introduces the checkpoint_interval option, which tells the connector to configure checkpoint interval when creating the table.
  • Creates a CommitBuilder that is actually setup to create checkpoints.

Describe Manual Test Plan

Tested with a pipeline that writes a stream of small delta transactions. Without this fix the time to create a trivial delta commit increases from 1.5s to 6s after ~1000 commits. With the fix it remains constant at ~2s.

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

Describe Incompatible Changes

@ryzhyk ryzhyk added the connectors Issues related to the adapters/connectors crate label Apr 7, 2026
@ryzhyk ryzhyk requested a review from swanandx April 7, 2026 22:26
Copy link
Copy Markdown
Member

@blp blp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a very straightforward fix to me.

#[serde(default)]
pub mode: DeltaTableWriteMode,

/// Checkpoint interval configured when creating the Delta table.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should mention the unit (seconds? minutes?).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the number of versions between checkpoints. The comment already states that. Can probably be made clearer.

Note, the name checkpoint interval is standard in the delta world.

Copy link
Copy Markdown
Contributor

@lalithsuresh lalithsuresh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this checkpointing option enabled by default? That seems to be the case in other systems' delta connectors: see mdrakiburrahman/stream-processing-benchmark#9

///
/// Default: 10.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub checkpoint_interval: Option<u32>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put the unit here (e.g. _ms, _seconds).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the number of versions between checkpoints. The comment already states that. Can probably be made clearer.

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tests for checkpoint creation/checkpoint_interval (behavior change).


loop {
let checkpoint_interval = match inner.config.checkpoint_interval {
Some(0) => None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkpoint_interval = 0 is documented as "no checkpoints", but here it maps to None, which leaves the table property unset (default 10). Should this pass Some("0") so we actually disable checkpoints?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified that when checkpoint_interval = 0 no checkpoints are created

@ryzhyk
Copy link
Copy Markdown
Contributor Author

ryzhyk commented Apr 7, 2026

Is this checkpointing option enabled by default? That seems to be the case in other systems' delta connectors: see mdrakiburrahman/stream-processing-benchmark#9

yes, it's enabled by default

@ryzhyk ryzhyk force-pushed the delta_checkpoints branch 2 times, most recently from ae55fce to 9c1f715 Compare April 7, 2026 23:29
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still missing tests for checkpoint creation/checkpoint_interval (behavior change).


loop {
let checkpoint_interval = match inner.config.checkpoint_interval {
Some(0) => None,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still mapping checkpoint_interval = 0 to None (default 10). Docs say 0 disables checkpoints, so this should probably pass Some("0") to the table property or otherwise explicitly disable checkpoints.

Copy link
Copy Markdown
Contributor

@swanandx swanandx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, questions:

  • previously no checkpoint was created, so all logs ( versions ) were retained, now that we crate checkpoint ( by using default commit properties ), logRetention would kick in as well right? is that fine?

.to_string()
.ends_with(".checkpoint.parquet")
{
println!("parquet_file: {:?}", parquet_file.display());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: why are we printing file names in this test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will remove it, thanks!

// post-commit hook entirely and never writes `_last_checkpoint` / `*.checkpoint.parquet`,
// regardless of `delta.checkpointInterval`. Use default commit properties so checkpoint
// creation runs when `(version + 1) % checkpoint_interval == 0`.
CommitBuilder::from(CommitProperties::default())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do using default commit properties lead to any other effects apart from calling post-commit hook?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't seem like it.

The Delta output connector did not create periodic checkpoints.

While this is in itself problematic, it also meant that the connector
became slow over time, due to this delta-rs bug, which causes the
`update_incremental` function to scan the entire transaction log on
every commit:
delta-io/delta-kernel-rs#2103.

This commit:
- Introduces the `checkpoint_interval` option, which tells
  the connector to configure checkpoint interval when creating
  the table.
- Creates a CommitBuilder that is actually setup to create
  checkpoints.

Without this fix the time to create a trivial delta commit increases from
1.5s to 6s after ~1000 commits. With the fix it remains constant at
~2s.

Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
@ryzhyk ryzhyk force-pushed the delta_checkpoints branch from 9c1f715 to e20b783 Compare April 8, 2026 15:15
@ryzhyk ryzhyk enabled auto-merge April 8, 2026 15:15
@ryzhyk ryzhyk added this pull request to the merge queue Apr 8, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 8, 2026
@gz gz added this pull request to the merge queue Apr 8, 2026
Merged via the queue into main with commit d2699ce Apr 8, 2026
1 check passed
@gz gz deleted the delta_checkpoints branch April 8, 2026 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors Issues related to the adapters/connectors crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants