Is your feature request related to a problem? Please describe.
The current Spark implementation scans all parquet files. This can be made faster and more efficient by specifying a date_partition_column. During execution, this column would be used to filter the data at the file level, so only files whose partition date falls within the requested range would be scanned.
Describe the solution you'd like
Add date_partition_column to SparkSource. A similar implementation already exists for the AthenaSource.
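As a rough sketch of the idea (the helper name and signature below are hypothetical, not the actual SparkSource API), the date range could be turned into a predicate string that Spark pushes down so partitions outside the range are never read:

```python
from datetime import datetime


def build_partition_filter(date_partition_column: str,
                           start_date: datetime,
                           end_date: datetime) -> str:
    # Hypothetical helper: builds a pushdown predicate so Spark only reads
    # parquet partitions whose date falls in [start_date, end_date].
    start = start_date.strftime("%Y-%m-%d")
    end = end_date.strftime("%Y-%m-%d")
    return f"{date_partition_column} >= '{start}' AND {date_partition_column} <= '{end}'"


print(build_partition_filter("event_date",
                             datetime(2023, 1, 1),
                             datetime(2023, 1, 31)))
# → event_date >= '2023-01-01' AND event_date <= '2023-01-31'
```

In practice this string would be applied via something like `df.filter(...)` before the feature retrieval query runs, letting Spark's partition pruning skip files entirely.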
Describe alternatives you've considered
None
I have implemented this locally and it works. I'm happy to open a PR.