Commit 3168389

Merge branch 'main' into python-texttospeech-migration

2 parents f831b88 + 4827801

23 files changed: 1669 additions & 1 deletion

.github/CODEOWNERS

Lines changed: 2 additions & 1 deletion

```diff
@@ -75,5 +75,6 @@
 /talent/**/*    @GoogleCloudPlatform/python-samples-reviewers
 /vision/**/*    @GoogleCloudPlatform/python-samples-reviewers
 /workflows/**/* @GoogleCloudPlatform/python-samples-reviewers
-/datacatalog/**/* @GoogleCloudPlatform/python-samples-reviewers
+/datacatalog/**/* @GoogleCloudPlatform/python-samples-reviewers
 /kms/**/**      @GoogleCloudPlatform/dee-infra @GoogleCloudPlatform/python-samples-reviewers
+/dataproc/**/** @GoogleCloudPlatform/cloud-dpes @GoogleCloudPlatform/python-samples-reviewers
```

.github/blunderbuss.yml

Lines changed: 4 additions & 0 deletions

```diff
@@ -176,6 +176,10 @@ assign_prs_by:
     - 'api: cloudtasks'
   to:
     - GoogleCloudPlatform/infra-db-dpes
+- labels:
+    - 'api: dataproc'
+  to:
+    - GoogleCloudPlatform/cloud-dpes
 
 assign_issues:
   - GoogleCloudPlatform/python-samples-owners
```

dataproc/snippets/README.md

Lines changed: 84 additions & 0 deletions

# Cloud Dataproc API Examples

[![Open in Cloud Shell][shell_img]][shell_link]

[shell_img]: http://gstatic.com/cloudssh/images/open-btn.png
[shell_link]: https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataproc/README.md

Sample command-line programs for interacting with the Cloud Dataproc API.

See [the tutorial on using the Dataproc API with the Python client
library](https://cloud.google.com/dataproc/docs/tutorials/python-library-example)
for a walkthrough you can run to try out the Cloud Dataproc API sample code.

Note that while this sample demonstrates interacting with Dataproc via the API, the same functionality could also be accomplished using the Cloud Console or the gcloud CLI.

`list_clusters.py` is a simple command-line program that demonstrates connecting to the Cloud Dataproc API and listing the clusters in a region.

`submit_job_to_cluster.py` demonstrates how to create a cluster, submit the
`pyspark_sort.py` job, download the output from Google Cloud Storage, and print the result.

`single_job_workflow.py` uses the Cloud Dataproc InstantiateInlineWorkflowTemplate API to create an ephemeral cluster, run a job, then delete the cluster with one API request.

`pyspark_sort.py_gcs` is the same as `pyspark_sort.py` but demonstrates
reading from a GCS bucket.

## Prerequisites to run locally:

* [pip](https://pypi.python.org/pypi/pip)

Go to the [Google Cloud Console](https://console.cloud.google.com).

Under API Manager, search for the Google Cloud Dataproc API and enable it.

## Set Up Your Local Dev Environment

To install, run the following commands. If you want to use [virtualenv](https://virtualenv.readthedocs.org/en/latest/)
(recommended), run the commands within a virtualenv.

    pip install -r requirements.txt

## Authentication

Please see the [Google Cloud authentication guide](https://cloud.google.com/docs/authentication/).
The recommended approach for running these samples is a service account with a JSON key.

## Environment Variables

Set the following environment variables:

    GOOGLE_CLOUD_PROJECT=your-project-id
    REGION=us-central1 # or your region
    CLUSTER_NAME=waprin-spark7
    ZONE=us-central1-b

## Running the samples

To run list_clusters.py:

    python list_clusters.py $GOOGLE_CLOUD_PROJECT --region=$REGION

`submit_job_to_cluster.py` can create the Dataproc cluster or use an existing cluster. To create a cluster before running the code, you can use the [Cloud Console](https://console.cloud.google.com) or run:

    gcloud dataproc clusters create your-cluster-name

To run submit_job_to_cluster.py, first create a GCS bucket (used by Cloud Dataproc to stage files) from the Cloud Console or with gsutil:

    gsutil mb gs://<your-staging-bucket-name>

Next, set the following environment variables:

    BUCKET=your-staging-bucket
    CLUSTER=your-cluster-name

Then, if you want to use an existing cluster, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET

Alternatively, to create a new cluster, which will be deleted at the end of the job, run:

    python submit_job_to_cluster.py --project_id=$GOOGLE_CLOUD_PROJECT --zone=us-central1-b --cluster_name=$CLUSTER --gcs_bucket=$BUCKET --create_new_cluster

The script will set up a cluster, upload the PySpark file, submit the job, print the result, then, if it created the cluster, delete the cluster.

Optionally, you can add the `--pyspark_file` argument to change from the default `pyspark_sort.py` included in this script to a new script.
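The README describes `list_clusters.py`, but that file's code is not part of this diff. A minimal sketch of what such a listing program might look like (hypothetical, not the repository's actual implementation; it assumes the same `google-cloud-dataproc` client library used by the other files in this commit, with the regional endpoint broken out into a small helper):

```python
def regional_endpoint(region):
    """Dataproc is a regional service; the client must target the
    endpoint that matches the cluster's region."""
    return f"{region}-dataproc.googleapis.com:443"


def list_clusters(project_id, region):
    # Imported here so the endpoint helper above stays dependency-free.
    from google.cloud import dataproc_v1 as dataproc

    client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": regional_endpoint(region)}
    )
    # list_clusters returns a pager; iterating yields Cluster messages.
    for cluster in client.list_clusters(
        request={"project_id": project_id, "region": region}
    ):
        print(f"{cluster.cluster_name}: {cluster.status.state.name}")
```

The same endpoint pattern appears in `create_cluster.py` and the workflow sample below.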
dataproc/snippets/create_cluster.py

Lines changed: 73 additions & 0 deletions

```python
#!/usr/bin/env python

# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This sample walks a user through creating a Cloud Dataproc cluster using
# the Python client library.
#
# This script can be run on its own:
#   python create_cluster.py ${PROJECT_ID} ${REGION} ${CLUSTER_NAME}


import sys

# [START dataproc_create_cluster]
from google.cloud import dataproc_v1 as dataproc


def create_cluster(project_id, region, cluster_name):
    """This sample walks a user through creating a Cloud Dataproc cluster
    using the Python client library.

    Args:
        project_id (string): Project to use for creating resources.
        region (string): Region where the resources should live.
        cluster_name (string): Name to use for creating a cluster.
    """

    # Create a client with the endpoint set to the desired cluster region.
    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    # Create the cluster config.
    cluster = {
        "project_id": project_id,
        "cluster_name": cluster_name,
        "config": {
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
        },
    }

    # Create the cluster.
    operation = cluster_client.create_cluster(
        request={"project_id": project_id, "region": region, "cluster": cluster}
    )
    result = operation.result()

    # Output a success message.
    print(f"Cluster created successfully: {result.cluster_name}")
    # [END dataproc_create_cluster]


if __name__ == "__main__":
    if len(sys.argv) < 4:
        sys.exit("python create_cluster.py project_id region cluster_name")

    project_id = sys.argv[1]
    region = sys.argv[2]
    cluster_name = sys.argv[3]
    create_cluster(project_id, region, cluster_name)
```
dataproc/snippets/create_cluster_test.py

Lines changed: 57 additions & 0 deletions

```python
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import uuid

from google.api_core.exceptions import NotFound
from google.cloud import dataproc_v1 as dataproc
import pytest

import create_cluster


PROJECT_ID = os.environ["GOOGLE_CLOUD_PROJECT"]
REGION = "us-central1"
CLUSTER_NAME = "py-cc-test-{}".format(str(uuid.uuid4()))


@pytest.fixture(autouse=True)
def teardown():
    yield

    cluster_client = dataproc.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )
    # Client library function
    try:
        operation = cluster_client.delete_cluster(
            request={
                "project_id": PROJECT_ID,
                "region": REGION,
                "cluster_name": CLUSTER_NAME,
            }
        )
        # Wait for cluster to delete
        operation.result()
    except NotFound:
        print("Cluster already deleted")


def test_cluster_create(capsys):
    # Wrapper function for client library function
    create_cluster.create_cluster(PROJECT_ID, REGION, CLUSTER_NAME)

    out, _ = capsys.readouterr()
    assert CLUSTER_NAME in out
```
Lines changed: 35 additions & 0 deletions

```python
#!/usr/bin/env python

# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Integration tests for Dataproc samples.

Creates a Dataproc cluster, uploads a pyspark file to Google Cloud Storage,
submits a job to Dataproc that runs the pyspark file, then downloads
the output logs from Cloud Storage and verifies the expected output."""

import os

import submit_job_to_cluster

PROJECT = os.environ["GOOGLE_CLOUD_PROJECT"]
BUCKET = os.environ["CLOUD_STORAGE_BUCKET"]
CLUSTER_NAME = "testcluster3"
ZONE = "us-central1-b"


def test_e2e():
    output = submit_job_to_cluster.main(PROJECT, ZONE, CLUSTER_NAME, BUCKET)
    assert b"['Hello,', 'dog', 'elephant', 'panther', 'world!']" in output
```
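The bytes asserted above are simply the sorted words of the job's sample input. As a quick sanity check of that expectation (the exact input text is an assumption inferred from the assertion; `pyspark_sort.py` itself is not part of this diff), note that Python's default code-point ordering sorts the capitalized `'Hello,'` ahead of the lowercase words:

```python
# Hypothetical input for pyspark_sort.py, reconstructed from the test's
# expected output; uppercase letters sort before lowercase ones in
# Python's default string ordering.
text = "Hello, world! dog elephant panther"
print(sorted(text.split()))
# ['Hello,', 'dog', 'elephant', 'panther', 'world!']
```

This is why the e2e test can assert on an exact, deterministic byte string even though the job runs on a distributed cluster.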
dataproc/snippets/instantiate_inline_workflow_template.py

Lines changed: 97 additions & 0 deletions

```python
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# This sample walks a user through instantiating an inline
# workflow for Cloud Dataproc using the Python client library.
#
# This script can be run on its own:
#   python instantiate_inline_workflow_template.py ${PROJECT_ID} ${REGION}


import sys

# [START dataproc_instantiate_inline_workflow_template]
from google.cloud import dataproc_v1 as dataproc


def instantiate_inline_workflow_template(project_id, region):
    """This sample walks a user through submitting a workflow
    for a Cloud Dataproc using the Python client library.

    Args:
        project_id (string): Project to use for running the workflow.
        region (string): Region where the workflow resources should live.
    """

    # Create a client with the endpoint set to the desired region.
    workflow_template_client = dataproc.WorkflowTemplateServiceClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )

    parent = "projects/{}/regions/{}".format(project_id, region)

    template = {
        "jobs": [
            {
                "hadoop_job": {
                    "main_jar_file_uri": "file:///usr/lib/hadoop-mapreduce/"
                    "hadoop-mapreduce-examples.jar",
                    "args": ["teragen", "1000", "hdfs:///gen/"],
                },
                "step_id": "teragen",
            },
            {
                "hadoop_job": {
                    "main_jar_file_uri": "file:///usr/lib/hadoop-mapreduce/"
                    "hadoop-mapreduce-examples.jar",
                    "args": ["terasort", "hdfs:///gen/", "hdfs:///sort/"],
                },
                "step_id": "terasort",
                "prerequisite_step_ids": ["teragen"],
            },
        ],
        "placement": {
            "managed_cluster": {
                "cluster_name": "my-managed-cluster",
                "config": {
                    "gce_cluster_config": {
                        # Leave 'zone_uri' empty for 'Auto Zone Placement'
                        # 'zone_uri': ''
                        "zone_uri": "us-central1-a"
                    }
                },
            }
        },
    }

    # Submit the request to instantiate the workflow from an inline template.
    operation = workflow_template_client.instantiate_inline_workflow_template(
        request={"parent": parent, "template": template}
    )
    operation.result()

    # Output a success message.
    print("Workflow ran successfully.")
    # [END dataproc_instantiate_inline_workflow_template]


if __name__ == "__main__":
    if len(sys.argv) < 3:
        sys.exit(
            "python instantiate_inline_workflow_template.py " + "project_id region"
        )

    project_id = sys.argv[1]
    region = sys.argv[2]
    instantiate_inline_workflow_template(project_id, region)
```
