Is your feature request related to a problem? Please describe.
When running `store.materialize_incremental()` with a Spark-based offline store, Feast currently converts Spark DataFrames to Arrow tables using `toPandas()` or `collect()`.
This loads the entire dataset into driver memory, causing `OutOfMemoryError` or `spark.driver.maxResultSize exceeded` failures on large datasets.
Although Feast already supports a staging configuration for exporting data, it is not yet used during materialization. As a result, large-scale jobs cannot leverage staging to spill data to disk or remote storage.
Describe the solution you'd like
Enable the existing staging configuration to be used for materialization in the Spark offline store.
When enabled, Feast should write intermediate Spark DataFrames to the staging location (e.g. local disk or S3) before reading them back as Arrow tables for online ingestion.
Example configuration:

```yaml
offline_store:
  type: spark
  staging_location: s3://bucket/tmp/feast_arrow
  staging_allow_materialize: true
```
This would allow Feast to handle large datasets safely without driver OOM, by spilling intermediate data to the configured staging location.
Describe alternatives you've considered
- Increasing driver memory (`spark.driver.memory`, `spark.driver.maxResultSize`): only delays the problem.
- Using `toLocalIterator()`: too slow, and still limited by driver memory.
Additional context
During `feast materialize` with large Spark datasets, jobs fail with driver OOM even though executors have available resources.
Example error:

```
Total size of serialized results (2.1 GiB) is bigger than spark.driver.maxResultSize (2.0 GiB)
Caused by: java.lang.OutOfMemoryError: Java heap space
```
Allowing materialization to use the staging location would make Feast's Spark integration far more scalable and production-ready.