Cloud Dataflow Templates

Samples showing how to create and run Apache Beam templates on Google Cloud Dataflow.

Before you begin

  1. Install the Cloud SDK.

  2. Create a new project.

  3. Enable billing.

  4. Enable the APIs: Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, BigQuery, Pub/Sub, Datastore, and Cloud Resource Manager.

  5. Set up the Cloud SDK for your GCP project.

    gcloud init
  6. Create a service account key as a JSON file. For more information, see Creating and managing service accounts.

    • From the Service account list, select New service account.

    • In the Service account name field, enter a name.

    • From the Role list, select Project > Owner.

      Note: The Role field authorizes your service account to access resources. You can view and change this field later by using the GCP Console IAM page. If you are developing a production app, specify more granular permissions than Project > Owner. For more information, see Granting roles to service accounts.

    • Click Create. A JSON file that contains your key downloads to your computer.

  7. Set your GOOGLE_APPLICATION_CREDENTIALS environment variable to point to your service account key file.

    export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
  8. Create a Cloud Storage bucket.

    gsutil mb gs://your-gcs-bucket
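
The APIs from step 4 can also be enabled from the command line rather than the Console. A minimal sketch; the service IDs below are assumptions based on common Google Cloud naming, so verify them with `gcloud services list --available` before running:

    # Enable the APIs the samples depend on (service IDs are assumed;
    # confirm each against `gcloud services list --available`).
    gcloud services enable \
      dataflow.googleapis.com \
      compute.googleapis.com \
      logging.googleapis.com \
      storage-component.googleapis.com \
      storage-api.googleapis.com \
      bigquery.googleapis.com \
      pubsub.googleapis.com \
      datastore.googleapis.com \
      cloudresourcemanager.googleapis.com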

Setup

The following instructions will help you prepare your development environment.

  1. Download and install the Java Development Kit (JDK). Verify that the JAVA_HOME environment variable is set and points to your JDK installation.

  2. Download and install Apache Maven by following the Maven installation guide for your specific operating system.

  3. Clone the java-docs-samples repository.

    git clone https://github.com/GoogleCloudPlatform/java-docs-samples.git
  4. Navigate to the sample code directory.

    cd java-docs-samples/dataflow/templates
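
Before building, it can help to confirm the tools installed in the steps above are actually on your PATH. A quick sketch in plain shell, with no project-specific assumptions:

    # Check that the JDK, Maven, and git from the setup steps are visible
    # on PATH; prints one "found:" or "missing:" line per tool.
    for tool in java mvn git; do
      if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool"
      else
        echo "missing: $tool"
      fi
    done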

Templates

WordCount

First, select the project and template location.

PROJECT=$(gcloud config get-value project)
BUCKET=your-gcs-bucket
TEMPLATE_LOCATION=gs://$BUCKET/dataflow/templates/WordCount

Then, create the template in the desired Cloud Storage location.

# Create the template.
mvn compile exec:java \
  -Dexec.mainClass=com.example.dataflow.templates.WordCount \
  -Dexec.args="\
    --isCaseSensitive=false \
    --project=$PROJECT \
    --templateLocation=$TEMPLATE_LOCATION \
    --runner=DataflowRunner"

# Upload the metadata file.
gsutil cp WordCount_metadata "$TEMPLATE_LOCATION"_metadata

For more information, see Creating templates.
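
For reference, the WordCount_metadata file uploaded above follows the Dataflow template metadata format: a JSON document describing the template's runtime parameters. A sketch built from this template's own parameter names (the labels, help text, and regex here are illustrative, not the file's actual contents):

    {
      "name": "WordCount",
      "description": "Counts words in a text file.",
      "parameters": [
        {
          "name": "inputFile",
          "label": "Input file",
          "helpText": "Cloud Storage path of the text file to read.",
          "regexes": ["^gs:\\/\\/[^\\n\\r]+$"],
          "isOptional": false
        },
        {
          "name": "outputBucket",
          "label": "Output bucket",
          "helpText": "Cloud Storage bucket to write results to."
        }
      ]
    }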

Finally, you can run the template via gcloud or through the Create job from template page in the GCP Console.

JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
INPUT=gs://apache-beam-samples/shakespeare/kinglear.txt

gcloud dataflow jobs run $JOB_NAME \
  --gcs-location $TEMPLATE_LOCATION \
  --parameters inputFile=$INPUT,outputBucket=$BUCKET
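
The $(date ...) suffix in JOB_NAME above keeps job names unique across runs. Dataflow job names are also constrained to lowercase letters, digits, and hyphens, starting with a letter; assuming that rule, a quick local sanity check before submitting:

    # Build a job name as above and verify it against the assumed
    # [a-z]([-a-z0-9]*[a-z0-9])? naming pattern.
    JOB_NAME=wordcount-$(date +'%Y%m%d-%H%M%S')
    if echo "$JOB_NAME" | grep -Eq '^[a-z]([-a-z0-9]*[a-z0-9])?$'; then
      echo "valid job name: $JOB_NAME"
    else
      echo "invalid job name: $JOB_NAME"
    fi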

For more information, see Executing templates.

You can check your submitted jobs in the GCP Console Dataflow page.
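
Jobs can also be listed from the command line. A sketch using gcloud, whose --status flag accepts active, terminated, or all:

    # List currently running Dataflow jobs in the configured project.
    gcloud dataflow jobs list --status=active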

Cleanup

To avoid incurring charges to your GCP account for the resources used:

# Delete only the files created by this sample.
gsutil -m rm -rf \
  "gs://$BUCKET/dataflow/templates/WordCount*" \
  "gs://$BUCKET/dataflow/wordcount/"

# [optional] Remove the entire dataflow Cloud Storage directory.
gsutil -m rm -rf gs://$BUCKET/dataflow

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET