This initialization action installs Alluxio (https://www.alluxio.io/) on a Google Cloud Dataproc cluster. The master Cloud Dataproc node will be the Alluxio master and all Cloud Dataproc workers will be Alluxio workers.
You can use this initialization action to create a new Dataproc cluster with Alluxio installed:
- Using the gcloud command to create a new cluster with this initialization action:

      REGION=<region>
      CLUSTER=<cluster_name>
      gcloud dataproc clusters create ${CLUSTER} \
          --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
          --metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS>
You can find more information about using initialization actions with Dataproc in the Dataproc documentation.
To run a Spark application that reads data from Alluxio, refer to the file path as alluxio://<cluster_name>-m:19998/<path_to_file>, where <cluster_name>-m is the Dataproc master hostname. Refer to the Alluxio on Spark documentation for additional getting-started resources.
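The path convention above can be sketched as a small shell helper. This is an illustration only: the function name alluxio_uri is hypothetical and not part of the initialization action, while the -m master suffix and port 19998 are taken from the text above.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not provided by the initialization action):
# builds the Alluxio URI of a file from the Dataproc cluster name and a path.
alluxio_uri() {
  local cluster="$1"
  local path="${2#/}"   # drop one leading slash so the URI contains exactly one
  printf 'alluxio://%s-m:19998/%s\n' "${cluster}" "${path}"
}
```

The resulting string can be used wherever a Spark job expects an input or output path, for example as an application argument passed to spark-submit.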
If you install the optional Presto component, Presto must be installed before Alluxio. Initialization actions are executed sequentially, so the Presto action must precede the Alluxio action.
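A minimal sketch of the ordering requirement, assuming Presto is installed through its own initialization action: the comma-separated value passed to --initialization-actions lists the Presto script first. The helper name join_actions, the example region, and the presto/presto.sh location are assumptions made for illustration; check the actual path of the Presto action before using it.

```shell
#!/usr/bin/env bash

# Hypothetical helper: joins initialization-action URIs with commas,
# preserving the order given, because actions are executed in that order.
join_actions() {
  local IFS=','
  printf '%s\n' "$*"
}

REGION="us-central1"   # example value, replace with your region

# ASSUMPTION: the Presto action is published at presto/presto.sh in the same bucket.
ACTIONS="$(join_actions \
  "gs://goog-dataproc-initialization-actions-${REGION}/presto/presto.sh" \
  "gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh")"

# Presto is listed first, so it runs before Alluxio.
echo "${ACTIONS}"
```

The value of ACTIONS could then be supplied as --initialization-actions "${ACTIONS}" in the gcloud dataproc clusters create command.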
- This script must be updated to specify the Alluxio version to install.
- alluxio_version is an optional parameter to override the default Alluxio version to install.
- alluxio_root_ufs_uri is a required parameter to specify the root under storage location for Alluxio.
- Additional properties can be specified using the metadata key alluxio_site_properties, delimited using semicolons (;):

      REGION=<region>
      CLUSTER=<cluster_name>
      gcloud dataproc clusters create ${CLUSTER} \
          --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
          --metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
          --metadata alluxio_site_properties="alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>;alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>"
- Additional files can be downloaded into /opt/alluxio/conf using the metadata key alluxio_download_files_list by specifying http(s) or gs URIs, delimited using semicolons (;):

      REGION=<region>
      CLUSTER=<cluster_name>
      gcloud dataproc clusters create ${CLUSTER} \
          --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/alluxio/alluxio.sh \
          --metadata alluxio_root_ufs_uri=<UNDERSTORAGE_ADDRESS> \
          --metadata alluxio_download_files_list="gs://goog-dataproc-initialization-actions-${REGION}/$my_file;https://$server/$file"
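The semicolon-delimited metadata values described in the list above can be assembled with a small helper instead of being typed by hand. This is a sketch only: join_semicolon and is_supported_uri are hypothetical names, not functions provided by the initialization action, and the scheme check simply mirrors the http(s) and gs schemes named in the text.

```shell
#!/usr/bin/env bash

# Hypothetical helper: joins its arguments with ';', the delimiter used by the
# alluxio_site_properties and alluxio_download_files_list metadata keys.
join_semicolon() {
  local IFS=';'
  printf '%s\n' "$*"
}

# Hypothetical check: succeeds only for http://, https:// or gs:// URIs.
is_supported_uri() {
  case "$1" in
    http://*|https://*|gs://*) return 0 ;;
    *) return 1 ;;
  esac
}

# Build the alluxio_site_properties value from individual key=value pairs.
SITE_PROPERTIES="$(join_semicolon \
  "alluxio.master.mount.table.root.option.fs.gcs.accessKeyId=<GCS_ACCESS_KEY_ID>" \
  "alluxio.master.mount.table.root.option.fs.gcs.secretAccessKey=<GCS_SECRET_ACCESS_KEY>")"
echo "${SITE_PROPERTIES}"
```

The resulting string could be passed as --metadata alluxio_site_properties="${SITE_PROPERTIES}". The same join works for alluxio_download_files_list, with is_supported_uri applied to each entry before joining.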