Partitioned (100-file) Parquet dataset
Follow the instructions in the datafusion directory.
- DataFusion follows the SQL standard with case-sensitive identifiers, so all column names in
  queries.sql use double-quoted literals (e.g. EventTime -> "EventTime").
- You must set the ('binary_as_string' 'true') option due to an incorrect logical type annotation in the partitioned files. See Issue #7.
- Install/build datafusion-cli.
- Download the parquet files:
seq 0 99 | xargs -P100 -I{} bash -c 'wget --directory-prefix partitioned --continue --progress=dot:giga https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_{}.parquet'
- Run the queries:
  datafusion-cli -f create.sql -f queries.sql
  or
  PATH="$(pwd)/arrow-datafusion/target/release:$PATH" ./run.sh
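
Because the download step fetches 100 files in parallel, it can be worth confirming that every file arrived before running the queries. The following is a minimal sketch, not part of the benchmark scripts: the function name check_partitions is hypothetical, and it assumes the files live in ./partitioned under the names hits_0.parquet through hits_99.parquet, as in the wget command above. It only checks that each file exists and is non-empty, so it will not detect a partially downloaded file.

```shell
#!/usr/bin/env bash
# Sketch (hypothetical helper, not part of the benchmark scripts):
# report any of the 100 expected partition files that are missing or empty.
check_partitions() {
  # Directory to inspect; defaults to ./partitioned (assumed download target).
  local dir="${1:-partitioned}"
  local missing=0
  local i
  for i in $(seq 0 99); do
    # -s is true only if the file exists and has a size greater than zero.
    if [ ! -s "$dir/hits_$i.parquet" ]; then
      echo "missing or empty: $dir/hits_$i.parquet"
      missing=$((missing + 1))
    fi
  done
  echo "missing=$missing"
  # Succeed only when nothing is missing, so callers can chain with &&.
  [ "$missing" -eq 0 ]
}
```

Calling check_partitions partitioned before the query step prints any absent files and returns a non-zero status if at least one is missing; re-running the wget command with --continue should then fetch the remainder.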