jupyter_sparkmonitor

Jupyter Notebook with SparkMonitor

This folder contains the initialization action sparkmonitor.sh, which sets up and launches Jupyter Notebook with SparkMonitor so that the Spark UI is shown inside your notebook.

Note: This init action uses Conda and Python 3.

Prerequisites: This initialization action uses the Jupyter Optional Component, which requires Cloud Dataproc image version 1.4 or later. The Jupyter Optional Component's web interface can be accessed via Component Gateway without SSH tunnels. You also need to enable Anaconda as a second Optional Component.

Using this initialization action

⚠️ NOTICE: See best practices of using initialization actions in production.

You can use this initialization action to create a new Dataproc cluster with SparkMonitor installed. Use the gcloud command below, replacing <region> and <cluster_name> with your own values:

# Jupyter will run on port 8123 of your master node.
REGION=<region>
CLUSTER_NAME=<cluster_name>
gcloud dataproc clusters create ${CLUSTER_NAME} \
    --region ${REGION} \
    --optional-components ANACONDA,JUPYTER \
    --enable-component-gateway \
    --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/jupyter_sparkmonitor/sparkmonitor.sh
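If you script cluster creation from Python rather than a shell, the same command can be assembled as an argument list. The following is a minimal sketch, not part of this initialization action: the helper name `build_create_command` and the example region and cluster name are hypothetical, while the flags and the bucket path are taken from the gcloud command above.

```python
import shlex
from typing import List


def build_create_command(region: str, cluster_name: str) -> List[str]:
    """Return the gcloud argument list for a cluster with SparkMonitor.

    Hypothetical helper: it only mirrors the flags shown in the README.
    """
    init_action = (
        f"gs://goog-dataproc-initialization-actions-{region}"
        "/jupyter_sparkmonitor/sparkmonitor.sh"
    )
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--region", region,
        "--optional-components", "ANACONDA,JUPYTER",
        "--enable-component-gateway",
        "--initialization-actions", init_action,
    ]


if __name__ == "__main__":
    # Example values only; substitute your own region and cluster name.
    cmd = build_create_command("us-central1", "sparkmonitor-demo")
    # Print the command for review instead of running it.
    print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would run the command; printing it first makes it easy to review before a cluster is actually created.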

To access the Jupyter web interface, use the Component Gateway on the cluster's page in the GCP Dataproc console. Alternatively, follow the instructions in connecting to cluster web interfaces.
After opening a Jupyter notebook, make sure you select the Python 3 kernel instead of the PySpark kernel.

Inside your notebook, you can get a SparkContext and a SparkSession using the following code:

from pyspark import SparkContext
from pyspark.sql import SparkSession

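# `conf` is not defined in this snippet. The SparkMonitor kernel extension is
# documented as placing a SparkConf object named `conf` in the notebook namespace.
sc = SparkContext.getOrCreate(conf=conf)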
spark = SparkSession.builder.appName('YOUR APP NAME').getOrCreate()
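To check that monitoring works, it helps to trigger a small Spark job from a cell. The sketch below is only an illustration: `run_sample_job` is a hypothetical helper, and the element count and partition count are arbitrary. It takes the `sc` object created above, so it needs no extra imports.

```python
def run_sample_job(sc, n=1000, partitions=8):
    """Sum the squares of 0..n-1 as a small Spark job.

    `sc` is expected to be the SparkContext created in the notebook.
    """
    # Distribute the numbers across several partitions so the job has
    # more than one task to display.
    rdd = sc.parallelize(range(n), partitions)
    # map() is lazy; sum() is the action that actually launches the job.
    return rdd.map(lambda x: x * x).sum()
```

Calling `run_sample_job(sc)` in a notebook cell launches one job, which should then appear in the SparkMonitor display for that cell.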

Internal details

sparkmonitor.sh

sparkmonitor.sh installs SparkMonitor, then configures and runs Jupyter on the Dataproc master node by doing the following:

Check that the Conda and Jupyter Optional Components are installed, and fail if they are not.
Install SparkMonitor using pip.
Enable SparkMonitor as a Jupyter extension and configure the IPython kernel to load the extension.
Refresh the Jupyter config and restart the Jupyter service.
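The kernel configuration step above can be sketched in Python. This is an assumption-laden illustration, not the contents of sparkmonitor.sh: the module name `sparkmonitor.kernelextension` follows SparkMonitor's published installation instructions, while the helper `enable_kernel_extension` and the config path passed to it are hypothetical.

```python
from pathlib import Path

# Line that tells the IPython kernel to load the SparkMonitor extension.
EXTENSION_LINE = (
    "c.InteractiveShellApp.extensions.append('sparkmonitor.kernelextension')"
)


def enable_kernel_extension(config_path):
    """Append EXTENSION_LINE to the kernel config unless already present.

    Returns True if the file was modified, False if it already had the line.
    """
    path = Path(config_path)
    existing = path.read_text() if path.exists() else ""
    if EXTENSION_LINE in existing:
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        # Keep the new line separate from any unterminated last line.
        if existing and not existing.endswith("\n"):
            f.write("\n")
        f.write(EXTENSION_LINE + "\n")
    return True
```

Making the append idempotent means that rerunning the setup does not add the same line twice.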

Important notes

This initialization action requires that you launch the Dataproc cluster with the Jupyter and Anaconda (Conda) optional components. Cluster creation will fail if the script does not find them.
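Because a missing component only shows up as a failure during cluster creation, a wrapper script could check the requested components up front. The sketch below assumes the components are passed as the comma-separated value of `--optional-components`; the helper name `missing_components` is hypothetical.

```python
# Components this initialization action needs, as named on the gcloud flag.
REQUIRED_COMPONENTS = ("ANACONDA", "JUPYTER")


def missing_components(optional_components):
    """Return the required components absent from a comma-separated string."""
    requested = {
        name.strip().upper()
        for name in optional_components.split(",")
        if name.strip()
    }
    return [name for name in REQUIRED_COMPONENTS if name not in requested]
```

An empty result means both required components were requested; any other result lists what to add before running gcloud.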
