This folder contains the initialization action `sparkmonitor.sh` to quickly set up and launch Jupyter Notebook with SparkMonitor, which shows the Spark UI inside your notebook.
Note: This init action uses Conda and Python 3.
**Prerequisites:** This initialization action uses the Jupyter Optional Component, which requires Cloud Dataproc image version 1.4 or later. The Jupyter Optional Component's web interface can be accessed via Component Gateway without SSH tunnels. You will also need the Anaconda Optional Component.
You can use this initialization action to create a new Dataproc cluster with SparkMonitor installed:
- Use the `gcloud` command to create a new cluster with this initialization action:

  ```bash
  # Jupyter will run on port 8123 of your master node.
  REGION=<region>
  CLUSTER_NAME=<cluster_name>
  gcloud dataproc clusters create ${CLUSTER_NAME} \
      --region ${REGION} \
      --optional-components ANACONDA,JUPYTER \
      --enable-component-gateway \
      --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter_sparkmonitor/sparkmonitor.sh
  ```
- To access the Jupyter web interface, use the Component Gateway links on the Dataproc cluster page in the GCP Console. Alternatively, follow the instructions in connecting to cluster web interfaces.
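As a sketch, you can also list the Component Gateway URLs (including Jupyter's) from the command line with `gcloud dataproc clusters describe`; the `REGION` and `CLUSTER_NAME` variables below are assumed to be the same ones used when creating the cluster:

```shell
# Print the Component Gateway URLs for the cluster, including the Jupyter link.
gcloud dataproc clusters describe ${CLUSTER_NAME} \
    --region ${REGION} \
    --format='value(config.endpointConfig.httpPorts)'
```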
- After opening a Jupyter notebook, make sure you select the `Python 3` kernel instead of the `PySpark` kernel.
- Inside your notebook, you can get a `SparkContext` and `SparkSession` using the following code:

  ```python
  from pyspark import SparkContext
  from pyspark.sql import SparkSession

  sc = SparkContext.getOrCreate()
  spark = SparkSession.builder.appName('YOUR APP NAME').getOrCreate()
  ```
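Once `sc` exists, any Spark action you run in a cell produces jobs that SparkMonitor displays inline below the cell. For example, a small hypothetical job (run inside the notebook, where `sc` is already defined as above):

```python
# Any RDD or DataFrame action triggers Spark jobs that SparkMonitor displays
# in its progress panel below the cell.
rdd = sc.parallelize(range(1000), numSlices=8)
print(rdd.map(lambda x: x * x).sum())
```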
`sparkmonitor.sh` handles installing SparkMonitor and configuring and running Jupyter on the Dataproc master node by doing the following:
- Checks that the Conda and Jupyter Optional Components are installed; fails if not.
- Installs SparkMonitor using `pip`.
- Enables SparkMonitor as a Jupyter extension and configures the IPython kernel to load the extension.
- Refreshes the Jupyter config and restarts the Jupyter service.
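The steps above roughly correspond to the following commands, a sketch based on SparkMonitor's documented setup rather than the literal contents of `sparkmonitor.sh`; the `--sys-prefix` flags and the `jupyter` systemd service name are assumptions:

```shell
# Install SparkMonitor into the cluster's Conda environment.
pip install sparkmonitor

# Enable the notebook frontend extension and the server extension.
jupyter nbextension install sparkmonitor --py --sys-prefix --symlink
jupyter nbextension enable sparkmonitor --py --sys-prefix
jupyter serverextension enable --py --sys-prefix sparkmonitor

# Configure the IPython kernel to load the monitoring extension.
ipython profile create
echo "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')" \
    >> "$(ipython profile locate default)/ipython_kernel_config.py"

# Restart the Jupyter service so the new configuration takes effect.
systemctl restart jupyter
```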
- This initialization action requires that you launch the Dataproc cluster with the Jupyter and Anaconda Optional Components. Cluster creation will fail if they are not found.