Configure storage for datafusion instances used by connectors. #6153

@ryzhyk

Some connectors (delta, iceberg) use datafusion to access the underlying table. Each connector creates its own datafusion session, which is probably the right thing to do, since the sessions operate on completely disjoint resources. However, today all of these sessions are configured to evaluate queries in memory. This is usually OK, since most queries issued by connectors don't need much intermediate state. One exception is the ORDER BY query used by the delta connector in CDC mode, which ends up sorting a potentially very large array of all updates in a transaction log entry, entirely in memory.

The solution is to allow datafusion to use storage, as long as storage is configured for the pipeline, similar to the datafusion session used to run ad hoc queries: https://github.com/feldera/feldera/blob/main/crates/adapters/src/adhoc.rs#L34

This will require the connector to access pipeline config to determine storage settings.
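
To make the proposal concrete, here is a minimal sketch of the decision each connector would make once it can see the pipeline config. Everything in it is hypothetical and only for illustration: `PipelineStorage`, `SpillPolicy`, `spill_policy`, and the `datafusion-tmp` directory name are not existing Feldera or datafusion names. It uses only the standard library; mapping the result onto datafusion's runtime settings (disk manager and memory pool) is deliberately left out.

```rust
use std::path::PathBuf;

/// Hypothetical stand-in for the storage settings found in the pipeline config.
#[derive(Debug, Clone, PartialEq)]
pub struct PipelineStorage {
    /// Root directory of the pipeline's storage.
    pub root: PathBuf,
}

/// How a connector's query session should hold intermediate state.
#[derive(Debug, Clone, PartialEq)]
pub enum SpillPolicy {
    /// No pipeline storage configured: keep today's in-memory behavior.
    InMemory,
    /// Pipeline storage configured: cap memory and spill the rest to disk.
    SpillToDisk {
        temp_dir: PathBuf,
        memory_limit_bytes: usize,
    },
}

/// Chooses the policy for one connector.
///
/// Each connector gets its own subdirectory, so the sessions keep
/// operating on disjoint resources. The directory layout is an assumption.
pub fn spill_policy(
    storage: Option<&PipelineStorage>,
    connector_name: &str,
    memory_limit_bytes: usize,
) -> SpillPolicy {
    match storage {
        None => SpillPolicy::InMemory,
        Some(s) => SpillPolicy::SpillToDisk {
            temp_dir: s.root.join("datafusion-tmp").join(connector_name),
            memory_limit_bytes,
        },
    }
}
```

With this shape, a large ORDER BY in CDC mode would only spill when the pipeline actually has storage; pipelines without storage behave exactly as they do today.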

Metadata

Labels

connectors: Issues related to the adapters/connectors crate
high priority: Task should be tackled first, added in the current sprint if necessary
user-reported: Reported by a user or customer
