Samples showing how to create and run an Apache Beam template on Google Cloud Dataflow.
- Install the Cloud SDK.
- Enable the APIs: Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, and Cloud Resource Manager.
- Set up the Cloud SDK for your GCP project.

  gcloud init
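The APIs from the list above can also be enabled from the shell. This is a sketch, assuming the Cloud SDK is initialized against your project; the service IDs below are the standard ones for the products named above, not taken from this sample:

```shell
# Enable the APIs required by the sample (a sketch; assumes gcloud is
# installed and initialized against the right project).
gcloud services enable \
    dataflow.googleapis.com \
    compute.googleapis.com \
    logging.googleapis.com \
    storage-component.googleapis.com \
    storage-api.googleapis.com \
    bigquery.googleapis.com \
    pubsub.googleapis.com \
    datastore.googleapis.com \
    cloudresourcemanager.googleapis.com
```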
- Create a service account key as a JSON file. For more information, see Creating and managing service accounts.
  - From the Service account list, select New service account.
  - In the Service account name field, enter a name.
  - From the Role list, select Project > Owner.

    Note: The Role field authorizes your service account to access resources. You can view and change this field later by using the GCP Console IAM page. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see Granting roles to service accounts.
  - Click Create. A JSON file that contains your key downloads to your computer.
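If you want the more granular permissions the Note recommends, roles can also be granted from the shell. This is a hypothetical sketch: the service account name dataflow-sa and the choice of roles/dataflow.admin are examples, not taken from this sample:

```shell
# Hypothetical: grant a narrower role than Project > Owner to a
# service account ("dataflow-sa" and the role choice are examples).
PROJECT=$(gcloud config get-value project)
gcloud projects add-iam-policy-binding $PROJECT \
    --member "serviceAccount:dataflow-sa@$PROJECT.iam.gserviceaccount.com" \
    --role roles/dataflow.admin
```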
- Set your GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key file.

  export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json

- Create a Cloud Storage bucket.

  gsutil mb gs://your-gcs-bucket
The following instructions will help you prepare your development environment.
- Download and install the Java Development Kit (JDK). Verify that the JAVA_HOME environment variable is set and points to your JDK installation.
- Download and install Apache Maven by following the Maven installation guide for your operating system.
- Clone the java-docs-samples repository.

  git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
- Navigate to the sample code directory.

  cd java-docs-samples/dataflow/templates
First, select the project and template location.

PROJECT=$(gcloud config get-value project)
BUCKET=your-gcs-bucket
TEMPLATE_LOCATION=gs://$BUCKET/dataflow/templates/WordCount

Then, create the template in the desired Cloud Storage location.
# Create the template.
mvn compile exec:java \
-Dexec.mainClass=com.example.dataflow.templates.WordCount \
-Dexec.args="\
--isCaseSensitive=false \
--project=$PROJECT \
--templateLocation=$TEMPLATE_LOCATION \
--runner=DataflowRunner"
# Upload the metadata file.
gsutil cp WordCount_metadata "$TEMPLATE_LOCATION"_metadata

For more information, see Creating templates.
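The WordCount_metadata file uploaded above describes the template's runtime parameters to the Dataflow UI. The following is a minimal sketch of its shape, assuming the standard Dataflow template metadata format; the labels, help text, and regex are illustrative, not copied from the sample (and the sketch writes to a scratch filename so it cannot clobber the sample's real metadata file):

```shell
# Sketch of a template metadata file (field names follow the Dataflow
# template metadata format; values here are illustrative). Written to
# a scratch path so the sample's real WordCount_metadata is untouched.
cat > WordCount_metadata.example <<'EOF'
{
  "name": "WordCount",
  "description": "Counts words in a text file.",
  "parameters": [
    {
      "name": "inputFile",
      "label": "Input file",
      "helpText": "Cloud Storage path to the input file.",
      "regexes": ["^gs://.+$"],
      "isOptional": false
    },
    {
      "name": "outputBucket",
      "label": "Output bucket",
      "helpText": "Cloud Storage bucket for the output.",
      "isOptional": false
    }
  ]
}
EOF
```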
Finally, you can run the template via gcloud or through the GCP Console's create Dataflow job page.
JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
INPUT=gs://apache-beam-samples/shakespeare/kinglear.txt
gcloud dataflow jobs run $JOB_NAME \
--gcs-location $TEMPLATE_LOCATION \
--parameters inputFile=$INPUT,outputBucket=$BUCKET

For more information, see Executing templates.
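The timestamp suffix in JOB_NAME above keeps job names unique across runs. As a quick local sanity check, a job name can be validated against Dataflow's naming pattern as we understand it (a lowercase letter first, then only lowercase letters, digits, or hyphens); this rule is our reading of the naming requirements, not stated in this sample:

```shell
# Build a job name like the one above and check it against the naming
# pattern (lowercase letter first, then lowercase letters, digits, or
# hyphens) -- the rule as we understand it.
JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
if echo "$JOB_NAME" | grep -Eq '^[a-z][a-z0-9-]*$'; then
    echo "ok: $JOB_NAME"
fi
```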
You can check your submitted jobs in the GCP Console Dataflow page.
To avoid incurring charges to your GCP account for the resources used:
# Delete only the files created by this sample.
gsutil -m rm -rf \
"gs://$BUCKET/dataflow/templates/WordCount*" \
"gs://$BUCKET/dataflow/wordcount/"
# [optional] Remove the entire dataflow Cloud Storage directory.
gsutil -m rm -rf gs://$BUCKET/dataflow
# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET