The Jupyter Component is the best way to use Jupyter with Cloud Dataproc. To learn more about Dataproc components, see here.
This initialization action downloads and runs a Google Cloud Datalab Docker container on a Dataproc cluster. You will need to connect to Datalab through an SSH tunnel.
- Use the `gcloud` command to create a new cluster with this initialization action:

  ```bash
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region ${REGION} \
      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh \
      --scopes cloud-platform
  ```
- Once the cluster has been created, Datalab is configured to run on port `8080` on the master node of the Dataproc cluster. To connect to the Datalab web interface, you will need to create an SSH tunnel and use a SOCKS5 proxy, as described in the Dataproc web interfaces documentation.
- Once you bring up a notebook, you should have the normal PySpark environment configured, with `sc`, `sqlContext`, and `spark` predefined.
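The tunnel setup can be sketched as follows. This follows the general pattern from the Dataproc web interfaces documentation; the master node name `${CLUSTER_NAME}-m`, the `<zone>` placeholder, and the local port `1080` are assumptions you may need to adjust.

```shell
# Open an SSH tunnel to the master node that acts as a local SOCKS5 proxy.
# "-D 1080" enables dynamic port forwarding on localhost:1080; "-N" runs no remote command.
gcloud compute ssh ${CLUSTER_NAME}-m \
    --zone=<zone> \
    -- -D 1080 -N

# In a second terminal, launch a browser configured to use the proxy
# (a fresh user-data-dir avoids interfering with your normal profile),
# then browse to http://${CLUSTER_NAME}-m:8080 to reach Datalab.
google-chrome \
    --proxy-server="socks5://localhost:1080" \
    --user-data-dir=/tmp/${CLUSTER_NAME}-m
```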
You can find more information about using initialization actions with Dataproc in the Dataproc documentation.
Datalab (and the Spark driver) can run with Python 2 or Python 3. However,
workers (executors) are configured to use Python 2. To change the worker Python
version, use the Conda init action.
Note that the driver (PYSPARK_DRIVER_PYTHON) and executors (PYSPARK_PYTHON)
must be at the same minor version. Currently, Datalab uses Python 3.5. Here is
how to set up Python 3.5 on workers:
```bash
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --metadata 'CONDA_PACKAGES="python==3.5"' \
    --scopes cloud-platform \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/conda/bootstrap-conda.sh,gs://goog-dataproc-initialization-actions-${REGION}/conda/install-conda-env.sh,gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh
```

In effect, this means that a particular Datalab-on-Dataproc cluster can only run Python 2 or Python 3 kernels, but not both.
- PySpark's `DataFrame.toPandas()` method is useful for integrating with Datalab APIs. Remember that pandas DataFrames must fit in memory on the master node, whereas Spark DataFrames can fill a cluster.
- Datalab has a number of notebooks documenting its pandas integrations.
- This script requires that Datalab run on port `:8080`. If you normally run another server on that port (e.g. Zeppelin), consider moving it. Note that running multiple Spark sessions can consume a lot of cluster resources and can cause problems on moderately small clusters.
- If you build your own Datalab images, you can specify `--metadata docker-image=gcr.io/<PROJECT>/<IMAGE>` to point to your image.
- If you normally only run Datalab kernels on VMs and connect to them with a local Docker frontend, set the flag `--metadata docker-image=gcr.io/cloud-datalab/datalab-gateway` and then set `GATEWAY_VM` to your cluster's master node in your local `docker` command as described here.
- You can pass Spark packages as a comma-separated list with `--metadata spark-packages=<PACKAGES>`, e.g. `--metadata '^#^spark-packages=com.databricks:spark-avro_2.11:3.2.0,graphframes:graphframes:0.3.0-spark2.0-s_2.11'`.
- This init action runs Datalab in Docker and installs Docker via the Docker init action. To run this script with a modified Docker init action, pass `--metadata "INIT_ACTIONS_REPO=gs://my-forked-dataproc-initialization-actions"` or `--metadata "INIT_ACTIONS_REPO=https://github.com/myfork/dataproc-initialization-actions"` and `--metadata "INIT_ACTIONS_BRANCH=branch-on-my-fork"`.
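As an illustrative sketch of how these metadata flags combine in a single cluster-create command: the project, image, fork URL, and branch below are placeholders from the examples above, not real resources, and the metadata keys are combined into one comma-separated `--metadata` list.

```shell
REGION=<region>
CLUSTER_NAME=<cluster_name>
# Create a cluster that runs Datalab from a custom image, with the
# Docker init action fetched from a forked init-actions repository.
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --scopes cloud-platform \
    --metadata "docker-image=gcr.io/<PROJECT>/<IMAGE>,INIT_ACTIONS_REPO=https://github.com/myfork/dataproc-initialization-actions,INIT_ACTIONS_BRANCH=branch-on-my-fork" \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/datalab/datalab.sh
```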