This guide builds upon the general Getting Started with C++ guide. It deploys the GCS indexing application to GKE (Google Kubernetes Engine) instead of Cloud Run, taking advantage of the long-running servers in GKE to improve throughput.
The steps in this guide are self-contained: it is not necessary to complete the Getting Started with C++ guide before following them, though doing so may make the motivation and the main components easier to understand. Note that some commands below may create resources (such as the Cloud Spanner instance and database) that were already created in the previous guide.
A common technique to improve throughput in Cloud Spanner is to aggregate multiple changes into a single transaction, minimizing synchronization and networking overhead. However, applications deployed to Cloud Run cannot assume they will remain running after they respond to a request. This makes it difficult to aggregate work from multiple Pub/Sub messages.
In this guide we will modify the application to:
- Run in GKE, where applications are long-lived and can assume they remain active after handling a message.
- Connect to Cloud Pub/Sub using [pull subscriptions], which have lower overhead and implement a more fine-grained flow control mechanism.
- Use background threads to aggregate the results from multiple Cloud Pub/Sub messages into a single Cloud Spanner transaction.
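The last two points can be illustrated with a minimal, self-contained sketch using only the C++ standard library. The `Batcher` class below, its names, and the `flush` callback are illustrative assumptions, not the application's actual code: in the real application each flush would commit a single Cloud Spanner transaction, the items would come from Pub/Sub pull-subscription callbacks, and the handler would only acknowledge a message after its item is flushed.

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <iterator>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Aggregates items produced by many callbacks (e.g. Pub/Sub message
// handlers) and hands them to `flush` in batches of at most `max_batch`.
class Batcher {
 public:
  Batcher(std::size_t max_batch,
          std::function<void(std::vector<std::string>)> flush)
      : max_batch_(max_batch),
        flush_(std::move(flush)),
        worker_([this] { Run(); }) {}

  // Flush any remaining items and stop the background thread.
  ~Batcher() {
    {
      std::lock_guard<std::mutex> lk(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
    worker_.join();
  }

  // Called from each message handler; cheap, it just buffers the item.
  void Add(std::string item) {
    std::lock_guard<std::mutex> lk(mu_);
    buffer_.push_back(std::move(item));
    if (buffer_.size() >= max_batch_) cv_.notify_all();
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lk(mu_);
    while (true) {
      cv_.wait(lk, [this] {
        return shutdown_ || buffer_.size() >= max_batch_;
      });
      // The predicate guarantees an empty buffer only happens at shutdown.
      if (buffer_.empty()) return;
      // Move out at most `max_batch_` items and flush them outside the
      // lock, so new items can be buffered while the batch "commits".
      auto const n = static_cast<std::ptrdiff_t>(
          std::min(buffer_.size(), max_batch_));
      std::vector<std::string> batch(
          std::make_move_iterator(buffer_.begin()),
          std::make_move_iterator(buffer_.begin() + n));
      buffer_.erase(buffer_.begin(), buffer_.begin() + n);
      lk.unlock();
      flush_(std::move(batch));  // one "transaction" per batch
      lk.lock();
    }
  }

  std::size_t const max_batch_;
  std::function<void(std::vector<std::string>)> flush_;
  std::mutex mu_;
  std::condition_variable cv_;
  std::vector<std::string> buffer_;
  bool shutdown_ = false;
  std::thread worker_;  // must be declared last: it uses the members above
};
```

A production version would also flush partially filled batches on a timer, and bound the buffer size to apply flow control back to the subscription.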
At a high level, our plan is to replace "Cloud Run" with "Kubernetes Engine" in the Getting Started with C++ application.
For completeness, the following instructions duplicate some of the steps in the previous guide. We will need to issue a number of commands to create the GKE cluster, the Cloud Pub/Sub topic and subscription, as well as the Cloud Spanner instance and database. For this application we will also need to create a service account (sometimes called a "robot" account) to run the application, and grant this service account the necessary permissions.
This example assumes that you have an existing GCP (Google Cloud Platform) project. The project must have billing enabled, as some of the services used in this example require it. If needed, consult:
- the GCP quickstarts to set up a GCP project
- the [GKE quickstart][cloud-gke-quickstart] to set up GKE in your project
Use your workstation, a GCE instance, or the Cloud Shell to get a command-line prompt. If needed, log in to GCP using:

```sh
gcloud auth login
```

Throughout the example we will use `GOOGLE_CLOUD_PROJECT` as an environment variable containing the name of the project:

```sh
export GOOGLE_CLOUD_PROJECT=[PROJECT ID]
```
⚠️ This guide uses Cloud Spanner and GKE. These services are billed by the hour even if you stop using them. The charges can reach hundreds or even thousands of dollars per month if you configure a large Cloud Spanner instance or a large GKE cluster. Consult the Pricing Calculator for details, and remember to delete any Cloud Spanner and GKE resources once you no longer need them.
We will issue a number of commands using the [Google Cloud SDK], a command-line tool to interact with Google Cloud services. Adding `--project=$GOOGLE_CLOUD_PROJECT` to each invocation of this tool quickly becomes tedious, so we start by configuring the default project:

```sh
gcloud config set project $GOOGLE_CLOUD_PROJECT
# Output: Updated property [core/project].
```

Some services are not enabled by default when you create a Google Cloud project, so we start by enabling all the services we will need:
```sh
gcloud services enable cloudbuild.googleapis.com
gcloud services enable containerregistry.googleapis.com
gcloud services enable container.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable spanner.googleapis.com
# Output: nothing if the services are already enabled.
#   for services that are not enabled, something like this:
#   Operation "operations/...." finished successfully.
```

So far, we have not created any C++ code. It is time to compile and deploy our application, as we will need the name and URL of the deployment to wire the remaining resources. First obtain the code:
```sh
git clone https://github.com/GoogleCloudPlatform/cpp-samples
# Output: Cloning into 'cpp-samples'...
#   additional informational messages
```

Change your working directory to this new workspace, then submit the build. We use `--async` because the build takes a while (about 15 minutes), and we can perform the remaining steps while it completes:

```sh
cd cpp-samples/getting-started
gcloud builds submit \
    --async \
    --machine-type=e2-highcpu-32 \
    --config=gke/cloudbuild.yaml
# Output:
#   Creating temporary tarball archive of ... file(s) totalling ... KiB before compression.
#   Uploading tarball of [.] to [gs://....tgz]
#   Created [https://cloudbuild.googleapis.com/v1/projects/....].
#   Logs are available at [...].
```

As mentioned above, this guide uses Cloud Spanner to store the data. We create the smallest possible instance; if needed we will scale it up later, but this size is economical and sufficient for small jobs.
⚠️ Creating the Cloud Spanner instance incurs immediate billing costs, even if the instance is not used.

```sh
gcloud beta spanner instances create getting-started-cpp \
    --config=regional-us-central1 \
    --processing-units=100 \
    --description="Getting Started with C++"
# Output: Creating instance...done.
```

A Cloud Spanner instance is just the allocation of compute resources for your databases. Think of it as a virtual set of database servers dedicated to your databases. Initially these servers have no databases or tables associated with them. We need to create a database and a table to host the data for this demo:
```sh
gcloud spanner databases create gcs-index \
    --ddl-file=gcs_objects.sql \
    --instance=getting-started-cpp
# Output: Creating database...done.
```

Publishers send messages to Cloud Pub/Sub using a topic. Topics are named, persistent resources, and we need to create one to configure the application:
```sh
gcloud pubsub topics create gke-gcs-indexing
# Output: Created topic [projects/.../topics/gke-gcs-indexing].
```

Subscribers receive messages from Cloud Pub/Sub using a subscription. Subscriptions are also named, persistent resources, and we need to create one to configure the application:
```sh
gcloud pubsub subscriptions create --topic=gke-gcs-indexing gke-gcs-indexing
# Output: Created subscription [projects/.../subscriptions/gke-gcs-indexing].
```

Next we create the GKE cluster. We use preemptible nodes (the `--preemptible` flag) because they have lower cost and the application can safely restart. We also configure the cluster to grow as needed; the maximum number of nodes (in this case 64) should be set based on your available quota or budget. Note that we enable workload identity, the recommended way for GKE-based applications to consume services in Google Cloud:
```sh
gcloud container clusters create cpp-samples \
    --region="us-central1" \
    --preemptible \
    --min-nodes=0 \
    --max-nodes=64 \
    --enable-autoscaling \
    --workload-pool="$GOOGLE_CLOUD_PROJECT.svc.id.goog"
# Output: ...
#   Creating cluster cpp-samples in us-central1...done
#   Created [https://container.googleapis.com/v1/projects/$GOOGLE_CLOUD_PROJECT/zones/us-central1/clusters/cpp-samples].
#   To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1/cpp-samples?project=$GOOGLE_CLOUD_PROJECT
#   kubeconfig entry generated for cpp-samples.
#   NAME         LOCATION     MASTER_VERSION  MASTER_IP  MACHINE_TYPE  NODE_VERSION  NUM_NODES  STATUS
#   cpp-samples  us-central1  ..............  .....      ............  ....          ..         RUNNING
```

Once created, we configure the `kubectl` credentials to use this cluster:
```sh
gcloud container clusters --region="us-central1" get-credentials cpp-samples
# Output: Fetching cluster endpoint and auth data.
#   kubeconfig entry generated for cpp-samples.
```

GKE recommends configuring a different workload identity for each GKE workload, and using this identity to access GCP services. To follow these guidelines we start by creating a service account in the Kubernetes cluster. Note that Kubernetes service accounts are distinct from GCP service accounts, but can be mapped to them (as we do below):
```sh
kubectl create serviceaccount worker
# Output: serviceaccount/worker created
```

```sh
PROJECT_NUMBER=$(gcloud projects list \
    --filter="project_id=$GOOGLE_CLOUD_PROJECT" \
    --format="value(project_number)" \
    --limit=1)
gcloud iam service-accounts add-iam-policy-binding \
    "--role=roles/iam.workloadIdentityUser" \
    "--member=serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/worker]" \
    "$PROJECT_NUMBER-compute@developer.gserviceaccount.com"
# Output: <IAM policy list>
```

```sh
kubectl annotate serviceaccount worker \
    iam.gke.io/gcp-service-account="$PROJECT_NUMBER-compute@developer.gserviceaccount.com"
# Output: serviceaccount/worker annotated
```

Look at the status of your build using:
```sh
gcloud builds list --ongoing
# Output: the list of running jobs
```

If your build has completed, the list will be empty. If you need to wait for the build to complete (it should take about 15 minutes), use:

```sh
gcloud builds log --stream $(gcloud builds list --ongoing --format="value(id)")
# Output: the output from the build, streamed.
```

We can now create a deployment in GKE. GKE requires its configuration files to be plain YAML, without variables or any other expansion, so we use a small script to generate this file:

```sh
gke/print-deployment.py --project=$GOOGLE_CLOUD_PROJECT | kubectl apply -f -
# Output: deployment.apps/worker created
```

Next we request indexing of some public data. This prefix contains fewer than 100 objects:
```sh
gcloud pubsub topics publish gke-gcs-indexing \
    --attribute=bucket=gcp-public-data-landsat,prefix=LC08/01/006/001
# Output: messageIds:
#   - '....'
```

The data should start appearing in the Cloud Spanner database. We can use the gcloud tool to query it:

```sh
gcloud spanner databases execute-sql gcs-index --instance=getting-started-cpp \
    --sql="select * from gcs_objects where name like '%.txt' order by size desc limit 10"
# Output: metadata for the 10 largest objects with names ending in `.txt`
```
⚠️ The following steps will incur significant billing costs. Use the Pricing Calculator to estimate them. If you are uncertain about these costs, skip to the Cleanup section.

To scan a larger prefix we will need to scale up the GKE deployment:

```sh
kubectl scale deployment/worker --replicas=128
# Output: deployment.apps/worker scaled
```

GKE has detailed tutorials on how to use Cloud Monitoring metrics, such as the length of the work queue, to autoscale a deployment.
We also need to scale up the Cloud Spanner instance. We use a gcloud command for this:

```sh
gcloud beta spanner instances update getting-started-cpp --processing-units=3000
# Output: Updating instance...done.
```

We can now index a prefix with a few million objects. In our tests this completed in a little over an hour:

```sh
gcloud pubsub topics publish gke-gcs-indexing \
    --attribute=bucket=gcp-public-data-landsat,prefix=LC08/01
# Output: messageIds:
#   - '....'
```

You can monitor the work queue using the console:
```sh
google-chrome "https://console.cloud.google.com/cloudpubsub/subscription/detail/gke-gcs-indexing?project=$GOOGLE_CLOUD_PROJECT"
```

Or count the number of indexed objects:
```sh
gcloud spanner databases execute-sql gcs-index --instance=getting-started-cpp \
    --sql="select count(*) from gcs_objects"
# Output:
#   (Unspecified)  --> the count(*) column name
#   49027797       --> the number of rows in the `gcs_objects` table (the actual number may differ)
```
⚠️ Do not forget to clean up your billable resources after going through this "Getting Started" guide.

```sh
gcloud container clusters --region=us-central1 delete cpp-samples --quiet
# Output: Deleting cluster cpp-samples...done.
#   Deleted [https://container.googleapis.com/v1/projects/$GOOGLE_CLOUD_PROJECT/zones/us-central1/clusters/cpp-samples].
```

```sh
gcloud spanner databases delete gcs-index --instance=getting-started-cpp --quiet
# Output: none
gcloud spanner instances delete getting-started-cpp --quiet
# Output: none
```

```sh
gcloud pubsub subscriptions delete gke-gcs-indexing --quiet
# Output: Deleted subscription [projects/$GOOGLE_CLOUD_PROJECT/subscriptions/gke-gcs-indexing].
gcloud pubsub topics delete gke-gcs-indexing --quiet
# Output: Deleted topic [projects/$GOOGLE_CLOUD_PROJECT/topics/gke-gcs-indexing].
```

```sh
gcloud container images delete gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/gke:latest --quiet
# Output: Deleted [gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/gke:latest]
#   Deleted [gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/gke@sha256:....]
for tag in $(gcloud container images list-tags gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/ci/cache --format="value(tags)"); do
    gcloud container images delete "gcr.io/$GOOGLE_CLOUD_PROJECT/getting-started-cpp/ci/cache:${tag}" --quiet
done
# Output: the output from each delete command
```

The GKE workload will need a GCP service account to access GCP resources. Pick a name and create the account:
```sh
readonly SA_ID="gcs-index-worker-sa"
readonly SA_NAME="$SA_ID@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com"
gcloud iam service-accounts create "$SA_ID" \
    --description="C++ Samples Service Account"
# Output: Created service account [gcs-index-worker-sa].
```

Grant this service account the permissions it needs to use Cloud Pub/Sub, Cloud Storage, and the Cloud Spanner database:

```sh
gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
    --member="serviceAccount:$SA_NAME" \
    --role="roles/pubsub.subscriber"
# Output: <IAM policy list (can be very long)>
gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
    --member="serviceAccount:$SA_NAME" \
    --role="roles/pubsub.publisher"
# Output: <IAM policy list (can be very long)>
gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
    --member="serviceAccount:$SA_NAME" \
    --role="roles/storage.objectViewer"
# Output: <IAM policy list (can be very long)>
gcloud spanner databases add-iam-policy-binding gcs-index \
    --instance="getting-started-cpp" \
    "--member=serviceAccount:$SA_NAME" \
    "--role=roles/spanner.databaseUser"
# Output: <IAM policy list (can be very long)>
```

Then allow the Kubernetes service account to impersonate this GCP service account, and annotate it so GKE uses the mapping:

```sh
gcloud iam service-accounts add-iam-policy-binding \
    "--role=roles/iam.workloadIdentityUser" \
    "--member=serviceAccount:$GOOGLE_CLOUD_PROJECT.svc.id.goog[default/worker]" \
    "$SA_NAME"
# Output: <IAM policy list>
```

```sh
kubectl annotate serviceaccount worker \
    iam.gke.io/gcp-service-account=$SA_NAME
# Output: serviceaccount/worker annotated
```