[adapters] Auto-tune step size for the number of workers.
Step size is the number of records pushed to the circuit from each connector.
Our previous default of 10,000 records was selected before we introduced
splitters and accumulators, which break up large outputs across multiple steps.
Back then, a large input could easily explode inside the circuit, causing
performance and OOM issues.
Nowadays, there is no real reason to keep input steps small. A reasonable
default is to ingest 10K records per worker thread, which approximates how we
split up the work within the circuit.
This commit keeps the old `max_batch_size` setting for backward compatibility.
When not specified, the new `max_worker_batch_size` setting is used to compute
the max batch size as `max_worker_batch_size * num_workers`. The default value
is 10,000, meaning that by default a pipeline with 8 workers will ingest 80K
records per connector per step.
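The fallback logic above can be sketched as follows (function and constant
names are illustrative, not the actual adapter code; only the settings
`max_batch_size` and `max_worker_batch_size` come from this commit):

```rust
/// Default per-worker step size when neither setting is specified.
const DEFAULT_MAX_WORKER_BATCH_SIZE: u64 = 10_000;

/// Compute the effective per-connector step size.
///
/// An explicitly configured `max_batch_size` wins, for backward
/// compatibility; otherwise the cap scales with the number of workers.
fn effective_max_batch_size(
    max_batch_size: Option<u64>,
    max_worker_batch_size: Option<u64>,
    num_workers: u64,
) -> u64 {
    match max_batch_size {
        Some(n) => n,
        None => {
            max_worker_batch_size.unwrap_or(DEFAULT_MAX_WORKER_BATCH_SIZE) * num_workers
        }
    }
}

fn main() {
    // Default pipeline with 8 workers: 10,000 * 8 = 80,000 records per step.
    assert_eq!(effective_max_batch_size(None, None, 8), 80_000);
    // Legacy `max_batch_size` takes precedence when set.
    assert_eq!(effective_max_batch_size(Some(5_000), None, 8), 5_000);
}
```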
Why not remove the input step cap altogether and ingest all buffered data at
once (after all, it's already kept in memory anyway)?
- The InputUpsert operator is not yet implemented as a splitter and processes
the entire input in one step, leading to potentially large output batches
(expensive to sort!)
- Very large batches can increase input/output latency, leading to the sawtooth
  throughput pattern, which users don't like.
The current solution is not ideal. We probably want to cap batch size in bytes,
not records. We may also want to cap input size across all connectors attached
to a table, not per connector. Those improvements will require more work.
Empirically, this commit improves ingestion speed 2x for pipelines with many
delta connectors.
Signed-off-by: Leonid Ryzhyk <ryzhyk@gmail.com>