Conversation
blp
left a comment
There was a problem hiding this comment.
Seems like a very straightforward fix to me.
| #[serde(default)] | ||
| pub mode: DeltaTableWriteMode, | ||
|
|
||
| /// Checkpoint interval configured when creating the Delta table. |
There was a problem hiding this comment.
This should mention the unit (seconds? minutes?).
There was a problem hiding this comment.
it's the number of versions between checkpoints. The comment already states that. Can probably be made clearer.
Note, the name checkpoint interval is standard in the delta world.
There was a problem hiding this comment.
Is this checkpointing option enabled by default? That seems to be the case in other systems' delta connectors: see mdrakiburrahman/stream-processing-benchmark#9
| /// | ||
| /// Default: 10. | ||
| #[serde(default, skip_serializing_if = "Option::is_none")] | ||
| pub checkpoint_interval: Option<u32>, |
There was a problem hiding this comment.
Let's put the unit here (e.g. _ms, _seconds).
There was a problem hiding this comment.
it's the number of versions between checkpoints. The comment already states that. Can probably be made clearer.
mythical-fred
left a comment
There was a problem hiding this comment.
Missing tests for checkpoint creation/checkpoint_interval (behavior change).
|
|
||
| loop { | ||
| let checkpoint_interval = match inner.config.checkpoint_interval { | ||
| Some(0) => None, |
There was a problem hiding this comment.
checkpoint_interval = 0 is documented as "no checkpoints", but here it maps to None, which leaves the table property unset (default 10). Should this pass Some("0") so we actually disable checkpoints?
There was a problem hiding this comment.
I verified that when checkpoint_interval = 0 no checkpoints are created
yes, it's enabled by default |
ae55fce to
9c1f715
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Still missing tests for checkpoint creation/checkpoint_interval (behavior change).
|
|
||
| loop { | ||
| let checkpoint_interval = match inner.config.checkpoint_interval { | ||
| Some(0) => None, |
There was a problem hiding this comment.
Still mapping checkpoint_interval = 0 to None (default 10). Docs say 0 disables checkpoints, so this should probably pass Some("0") to the table property or otherwise explicitly disable checkpoints.
swanandx
left a comment
There was a problem hiding this comment.
LGTM, questions:
- previously no checkpoint was created, so all logs ( versions ) were retained, now that we crate checkpoint ( by using default commit properties ), logRetention would kick in as well right? is that fine?
| .to_string() | ||
| .ends_with(".checkpoint.parquet") | ||
| { | ||
| println!("parquet_file: {:?}", parquet_file.display()); |
There was a problem hiding this comment.
nit: why are we printing file names in this test?
There was a problem hiding this comment.
will remove it, thanks!
| // post-commit hook entirely and never writes `_last_checkpoint` / `*.checkpoint.parquet`, | ||
| // regardless of `delta.checkpointInterval`. Use default commit properties so checkpoint | ||
| // creation runs when `(version + 1) % checkpoint_interval == 0`. | ||
| CommitBuilder::from(CommitProperties::default()) |
There was a problem hiding this comment.
do using default commit properties lead to any other effects apart from calling post-commit hook?
There was a problem hiding this comment.
doesn't seem like it.
The Delta output connector did not create periodic checkpoints. While this is in itself problematic, it also meant that the connector became slow over time, due to this delta-rs bug, which causes the `update_incremental` function to scan the entire transaction log on every commit: delta-io/delta-kernel-rs#2103. This commit: - Introduces the `checkpoint_interval` option, which tells the connector to configure checkpoint interval when creating the table. - Creates a CommitBuilder that is actually setup to create checkpoints. Without this fix the time to create a trivial delta commit increases from 1.5s to 6s after ~1000 commits. With the fix it remains constant at ~2s. Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>
9c1f715 to
e20b783
Compare
The Delta output connector did not create periodic checkpoints.
While this is in itself problematic, it also meant that the connector became slow over time, due to this delta-rs bug, which causes the
update_incrementalfunction to scan the entire transaction log on every commit:delta-io/delta-kernel-rs#2103.
This commit:
checkpoint_intervaloption, which tells the connector to configure checkpoint interval when creating the table.Describe Manual Test Plan
Tested with a pipeline that writes a stream of small delta transactions. Without this fix the time to create a trivial delta commit increases from 1.5s to 6s after ~1000 commits. With the fix it remains constant at ~2s.
Checklist
Breaking Changes?
Mark if you think the answer is yes for any of these components:
Describe Incompatible Changes