This script installs Prometheus on Dataproc clusters, performs the necessary configuration, and pulls metrics from Hadoop, Spark, and Kafka if they are installed. Prometheus is a time series database that lets you query and visualize metrics gathered from different cluster components during job execution.
You can use this initialization action to create a new Dataproc cluster with Prometheus installed on every node:
-   Use the `gcloud` command to create a new cluster with this initialization action:

    ```bash
    REGION=<region>
    CLUSTER_NAME=<cluster_name>
    gcloud dataproc clusters create ${CLUSTER_NAME} \
        --region ${REGION} \
        --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/prometheus/prometheus.sh
    ```
-   The Prometheus UI on the master node can be accessed after connecting with the command:

    ```bash
    gcloud compute ssh <CLUSTER_NAME>-m -- -L 9090:<CLUSTER_NAME>-m:9090
    ```

    Then open a browser and go to `localhost:9090`.
-   The Prometheus UI on a worker node can be accessed similarly; just substitute the `-m` suffix with `-w-0`, `-w-1`, etc., depending on the worker you would like to access. You can also set up an SSH tunnel and configure a SOCKS proxy to access all master and worker nodes by their internal hostnames.
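As a sketch, the SSH tunnel plus SOCKS proxy setup could look like the following; the local port `1080` and the Chrome invocation are illustrative choices, not something this initialization action configures:

```bash
# Open a SOCKS proxy through the cluster's master node
# (1080 is an arbitrary local port; -N means no remote command):
gcloud compute ssh <CLUSTER_NAME>-m -- -D 1080 -N

# In another terminal, start a browser that routes all traffic
# through the proxy, so internal hostnames like <CLUSTER_NAME>-w-0
# resolve inside the cluster's network:
google-chrome --proxy-server="socks5://localhost:1080"
```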
Prometheus uses StatsD to retrieve metrics from Hadoop and Spark. The StatsD sink for metrics publishing is only available in Apache Spark 2.3.0 and later, so aggregating Spark metrics on clusters with an image version older than 1.3 will result in an error. The StatsD sink for Apache Hadoop metrics is available in Hadoop 2.8 and later, so it works on clusters with image version 1.2 and later.
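For reference, enabling Spark's StatsD sink is done through `metrics.properties`; a minimal fragment could look like the one below. The host, port, and prefix values are illustrative assumptions, and the configuration actually written by this initialization action may differ:

```properties
# Illustrative metrics.properties fragment for Spark's StatsD sink
# (org.apache.spark.metrics.sink.StatsdSink ships with Spark 2.3.0+).
*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink
*.sink.statsd.host=localhost
*.sink.statsd.port=8125
*.sink.statsd.prefix=spark
```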
Prometheus uses the JMX exporter to retrieve metrics from Kafka.
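For context, the JMX exporter is driven by a YAML rule file that maps JMX MBean names onto Prometheus metric names. A minimal illustrative fragment for Kafka broker metrics is sketched below; the pattern and naming are assumptions, not the exact rules shipped by this initialization action:

```yaml
# Illustrative JMX exporter rules for Kafka broker MBeans;
# the rules actually installed on the cluster may differ.
lowercaseOutputName: true
rules:
  - pattern: kafka.server<type=(.+), name=(.+)><>Value
    name: kafka_server_$1_$2
    type: GAUGE
```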