Skip to content

adapters: parallel delta output encoder#5869

Merged
swanandx merged 1 commit intomainfrom
parallel-delta-output
Apr 1, 2026
Merged

adapters: parallel delta output encoder#5869
swanandx merged 1 commit intomainfrom
parallel-delta-output

Conversation

@swanandx
Copy link
Copy Markdown
Contributor

@swanandx swanandx commented Mar 19, 2026

  • Parallelize Delta table output encoding by splitting batches across configurable worker threads, each writing independent Parquet files concurrently
  • Add threads config option (default: 1) to DeltaTableWriterConfig to control the number of parallel encoding threads

Describe Manual Test Plan

Added unit tests and ran locally

for write retry, I changed local permissions for local path to simulate write failure

chmod 000 /tmp/feldera-delta-profiling

and to restore

chmod 755 /tmp/feldera-delta-profiling

& commit retries were tried by renaming the directory

what can be tested further:

  • on fatal error do we want to stop whole pipeline, just stall it or do something else?
  • try this out with IRSA and expiring credentials?

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

we're adding new config options for delta connector but should not be a breaking change.

@swanandx swanandx force-pushed the parallel-delta-output branch 3 times, most recently from 56f4f84 to 52065ab Compare March 23, 2026 13:06
@swanandx swanandx requested a review from mythical-fred March 23, 2026 13:10
@swanandx swanandx marked this pull request as ready for review March 23, 2026 13:10
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three hard blocks, one correctness concern.

Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two code bugs (inline) plus missing docs — blocking on the docs.

threads and fatal_on_max_retries are new user-visible config options but neither appears in docs.feldera.com/docs/connectors/sinks/delta.md. The Documentation checkbox is unchecked. Please add entries to the config table for both options.

@ryzhyk ryzhyk self-requested a review March 23, 2026 19:04
@swanandx swanandx force-pushed the parallel-delta-output branch from 6738817 to ac1dfbd Compare March 24, 2026 06:31
@swanandx swanandx requested a review from mythical-fred March 24, 2026 06:32
@swanandx swanandx force-pushed the parallel-delta-output branch from ac1dfbd to adf3e2c Compare March 24, 2026 06:52
Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we don't do anything meaningful with fatal errors today, let's not bother adding that option. We'll do another pass to rework connector error handling later.

}

// prints timing for manual comparison, and asserts output correctness
#[test]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's turn this into a proper benchmark under adapters/bench.
What benchmark results did you get?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$cargo bench --package dbsp_adapters --bench delta_encoder --features with-deltalake

     Running benches/delta_encoder.rs (target/release/deps/delta_encoder-836871c7c7c0c9cd)
Gnuplot not found, using plotters backend
delta_indexed_encode/workers/1
                        time:   [35.170 ms 35.201 ms 35.233 ms]
                        thrpt:  [2.8382 Melem/s 2.8408 Melem/s 2.8433 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
delta_indexed_encode/workers/2
                        time:   [24.756 ms 25.310 ms 25.873 ms]
                        thrpt:  [3.8650 Melem/s 3.9511 Melem/s 4.0394 Melem/s]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
delta_indexed_encode/workers/4
                        time:   [15.615 ms 15.899 ms 16.199 ms]
                        thrpt:  [6.1734 Melem/s 6.2897 Melem/s 6.4042 Melem/s]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
delta_indexed_encode/workers/8
                        time:   [9.1359 ms 9.2460 ms 9.3869 ms]
                        thrpt:  [10.653 Melem/s 10.816 Melem/s 10.946 Melem/s]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  1 (1.00%) high severe

@swanandx swanandx requested review from mythical-fred and ryzhyk and removed request for mythical-fred March 25, 2026 16:47
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous two blockers (schema minimum = 1 and docs for threads) are resolved — thank you. One remaining issue:

Copy link
Copy Markdown
Contributor

@ryzhyk ryzhyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all comments! I only have a couple of questions left (inline)

@swanandx swanandx force-pushed the parallel-delta-output branch from 47b7efc to 74d37da Compare March 30, 2026 08:49
@swanandx swanandx requested a review from mythical-fred March 30, 2026 08:56
Copy link
Copy Markdown

@mythical-fred mythical-fred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My previous blockers are resolved: schema validation, docs, and commit message are all fixed. Upgrading to approve.

Copy link
Copy Markdown
Contributor

@snkas snkas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying on @ryzhyk for the adapters side. The control-plane additional configuration parameter looks good.

@swanandx swanandx force-pushed the parallel-delta-output branch from 74d37da to c7172d7 Compare April 1, 2026 12:00
@swanandx swanandx added this pull request to the merge queue Apr 1, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 1, 2026
Use SplitCursor to split the batch and distribute it across tasks,
each task retries encoding and writing to delta lake and then returns
Add actions which main task retries to commit to delta lake

Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
@swanandx swanandx force-pushed the parallel-delta-output branch from c7172d7 to b29e7b7 Compare April 1, 2026 17:30
@swanandx swanandx enabled auto-merge April 1, 2026 17:30
@swanandx swanandx disabled auto-merge April 1, 2026 17:33
@swanandx swanandx enabled auto-merge April 1, 2026 17:34
@swanandx swanandx added this pull request to the merge queue Apr 1, 2026
Merged via the queue into main with commit 714a79c Apr 1, 2026
1 check passed
@swanandx swanandx deleted the parallel-delta-output branch April 1, 2026 18:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants