Some connectors (delta, iceberg) use datafusion to access the underlying table. Each connector creates its own datafusion session, which is probably the right thing to do, as they operate on completely disjoint resources. However today all these sessions are configured to evaluate queries in memory. This is usually ok, since most queries issued by connectors don't need much intermediate state. One exception is the ORDER BY query used by the delta connector in CDC mode. This ends up sorting a potentially very large array of all updates in a transaction log entry, in memory.
The solution is to allow datafusion to use storage, at long as storage is configured for the pipeline, similar to the datafusion session used to run ad hoc queries: https://github.com/feldera/feldera/blob/main/crates/adapters/src/adhoc.rs#L34
This will require the connector to access pipeline config to determine storage settings.
Some connectors (delta, iceberg) use datafusion to access the underlying table. Each connector creates its own datafusion session, which is probably the right thing to do, as they operate on completely disjoint resources. However today all these sessions are configured to evaluate queries in memory. This is usually ok, since most queries issued by connectors don't need much intermediate state. One exception is the ORDER BY query used by the delta connector in CDC mode. This ends up sorting a potentially very large array of all updates in a transaction log entry, in memory.
The solution is to allow datafusion to use storage, at long as storage is configured for the pipeline, similar to the datafusion session used to run ad hoc queries: https://github.com/feldera/feldera/blob/main/crates/adapters/src/adhoc.rs#L34
This will require the connector to access pipeline config to determine storage settings.