Commit 0c7b336

feat: modernize Dataflow pipeline_with_dependencies (GoogleCloudPlatform#12458)
* cleanup: structure dependencies using src/

  Modernize the example by using a src/ folder layout instead of a flat
  folder structure. This is more common in today's Python best practices.
  See CL 663542630 for more details.

* feat: add dataflow example with pyproject.toml

  The current pipeline_with_dependencies example uses setup.py. A more
  up-to-date approach is pyproject.toml. This folder contains a copy of the
  original pipeline_with_dependencies, but leverages pyproject.toml for the
  package setup. See CL 663542630 for additional context.

* fix: move _test files to root

  For the tests to be recognized, they must be in the same folder as the
  nox configuration.

* cleanup: remove pipeline_with_dependencies_toml
* feat: use toml file for packaging
* cleanup: regenerate requirements.txt
* cleanup: remove .egg-info
* fix: ignore egg-info
* nit: order files alphabetically
* docs: update README for requirements.txt
* review: address review feedback GoogleCloudPlatform#12458
* cleanup: move main.py to root folder
* fix: fix path to main.py
* cleanup: add build-system to pyproject.toml
1 parent 4a96281 commit 0c7b336

12 files changed

Lines changed: 138 additions & 78 deletions

dataflow/flex-templates/pipeline_with_dependencies/.gitignore

Lines changed: 1 addition & 0 deletions

@@ -0,0 +1 @@
+**/*.egg-info

dataflow/flex-templates/pipeline_with_dependencies/Dockerfile

Lines changed: 4 additions & 3 deletions
@@ -41,10 +41,11 @@ COPY --from=gcr.io/dataflow-templates-base/python311-template-launcher-base:2023
 ARG WORKDIR=/template
 WORKDIR ${WORKDIR}
 
+COPY main.py .
+COPY pyproject.toml .
 COPY requirements.txt .
 COPY setup.py .
-COPY main.py .
-COPY my_package my_package
+COPY src src
 
 # Installing exhaustive list of dependencies from a requirements.txt
 # helps to ensure that every time Docker container image is built,
@@ -60,7 +61,7 @@ RUN pip install -e .
 # For more information, see: https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates
 ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/main.py"
 
-# Because this image will be used as custom sdk container image, and it already
+# Because this image will be used as custom sdk container image, and it already
 # installs the dependencies from the requirements.txt, we can omit
 # the FLEX_TEMPLATE_PYTHON_REQUIREMENTS_FILE directive here
 # to reduce pipeline submission time.
dataflow/flex-templates/pipeline_with_dependencies/README.md

Lines changed: 98 additions & 54 deletions

@@ -1,47 +1,70 @@
-# Dataflow Flex Template: a pipeline with dependencies and a custom container image.
+# Dataflow Flex Template: a pipeline with dependencies and a custom container image
 
-[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataflow/flex-templates/streaming_beam/README.md)
+[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataflow/flex-templates/pipeline_with_dependencies/README.md)
 
 This project illustrates the following Dataflow Python pipeline setup:
-- The pipeline is a package that consists of [multiple files](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).
-- The pipeline has at least one dependency that is not provided in the default Dataflow runtime environment.
-- The workflow uses a [custom container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers) to preinstall dependencies and to define the pipeline runtime environment.
-- The workflow uses a [Dataflow Flex Template](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates) to control the pipeline submission environment.
-- The runtime and submission environment use the same set of Python dependencies and can be created in a reproducible manner.
+
+- The pipeline is a package that consists of
+  [multiple files](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).
+
+- The pipeline has at least one dependency that is not provided in the default
+  Dataflow runtime environment.
+
+- The workflow uses a
+  [custom container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers)
+  to preinstall dependencies and to define the pipeline runtime environment.
+
+- The workflow uses a
+  [Dataflow Flex Template](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates)
+  to control the pipeline submission environment.
+
+- The runtime and submission environment use the same set of Python dependencies
+  and can be created in a reproducible manner.
 
 To illustrate this setup, we use a pipeline that does the following:
 
-1. Finds the longest word in an input file
-2. Creates a [FIGLet text banner](https://en.wikipedia.org/wiki/FIGlet) from it using [pyfiglet](https://pypi.org/project/pyfiglet/)
-3. Outputs the text banner in another file
+1. Finds the longest word in an input file.
+
+1. Creates a [FIGLet text banner](https://en.wikipedia.org/wiki/FIGlet) from
+   it using [pyfiglet](https://pypi.org/project/pyfiglet/).
 
+1. Outputs the text banner in another file.
 
 ## The structure of the example
 
-The pipeline package is comprised of the `my_package` directory and the `setup.py` file. The package defines the pipeline, the pipeline dependencies, and the input parameters. You can define multiple pipelines in the same package. The `my_package.launcher` module is used to submit the pipeline to a runner.
+The pipeline package is comprised of the `src/my_package` directory, the
+`pyproject.toml` file and the `setup.py` file. The package defines the pipeline,
+the pipeline dependencies, and the input parameters. You can define multiple
+pipelines in the same package. The `my_package.launcher` module is used to
+submit the pipeline to a runner.
 
-The `main.py` file provides a top-level entrypoint to trigger the pipeline launcher from a
-launch environment.
+The `main.py` file provides a top-level entrypoint to trigger the pipeline
+launcher from a launch environment.
 
-The `Dockerfile` defines the runtime environment for the pipeline. It also configures the Flex Template, which lets you reuse the runtime image to build the Flex Template.
+The `Dockerfile` defines the runtime environment for the pipeline. It also
+configures the Flex Template, which lets you reuse the runtime image to build
+the Flex Template.
 
-The `requirements.txt` file defines all Python packages in the dependency chain of the pipeline package. Use it to create reproducible Python environments in the Docker image.
+The `requirements.txt` file defines all Python packages in the dependency chain
+of the pipeline package. Use it to create reproducible Python environments in
+the Docker image.
 
-The `metadata.json` file defines Flex Template parameters and their validation rules. It is optional.
+The `metadata.json` file defines Flex Template parameters and their validation
+rules. It is optional.
 
 ## Before you begin
 
-1. Follow the
-[Dataflow setup instructions](../../README.md).
+1. Follow the
+   [Dataflow setup instructions](../../README.md).
 
-1. [Enable the Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com).
+1. [Enable the Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com).
 
-1. Clone the [`python-docs-samples` repository](https://github.com/GoogleCloudPlatform/python-docs-samples)
-and navigate to the code sample.
+1. Clone the [`python-docs-samples` repository](https://github.com/GoogleCloudPlatform/python-docs-samples)
+   and navigate to the code sample.
 
 ```sh
 git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
-cd python-docs-samples/dataflow/flex-templates/streaming_beam
+cd python-docs-samples/dataflow/flex-templates/pipeline_with_dependencies
 ```
 
 ## Create a Cloud Storage bucket
@@ -71,40 +94,51 @@ gcloud auth configure-docker $REGION-docker.pkg.dev
 
 ## Build a Docker image for the pipeline runtime environment
 
-Using a [custom SDK container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers)
+Using a
+[custom SDK container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers)
 allows flexible customizations of the runtime environment.
 
-This example uses the custom container image both to preinstall all of the pipeline dependencies before job submission and to create a reproducible runtime environment.
+This example uses the custom container image both to preinstall all of the
+pipeline dependencies before job submission and to create a reproducible runtime
+environment.
 
-To illustrate customizations, a [custom base image](https://cloud.google.com/dataflow/docs/guides/build-container-image#use_a_custom_base_image) is used to build the SDK container image.
+To illustrate customizations, a
+[custom base image](https://cloud.google.com/dataflow/docs/guides/build-container-image#use_a_custom_base_image)
+is used to build the SDK container image.
 
-The Flex Template launcher is included in the SDK container image, which makes it possible to [use the SDK container image to build a Flex Template](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
+The Flex Template launcher is included in the SDK container image, which makes
+it possible to
+[use the SDK container image to build a Flex Template](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
 
 ```sh
 # Use a unique tag to version the artifacts that are built.
 export TAG=`date +%Y%m%d-%H%M%S`
 export SDK_CONTAINER_IMAGE="$REGION-docker.pkg.dev/$PROJECT/$REPOSITORY/my_base_image:$TAG"
 
-gcloud builds submit . --tag $SDK_CONTAINER_IMAGE --project $PROJECT
+gcloud builds submit . --tag $SDK_CONTAINER_IMAGE --project $PROJECT
 ```
 
 ## Optional: Inspect the Docker image
 
-If you have a local installation of Docker, you can inspect the image and run the pipeline by using the Direct Runner:
-```
+If you have a local installation of Docker, you can inspect the image and run
+the pipeline by using the Direct Runner:
+
+```bash
 docker run --rm -it --entrypoint=/bin/bash $SDK_CONTAINER_IMAGE
 
 # Once the container is created, run:
-pip list
-python main.py --input requirements.txt --output=/tmp/output
+python3 -m pip list
+python3 ./main.py --input ./requirements.txt --output=/tmp/output
 cat /tmp/output*
 ```
 
 ## Build the Flex Template
 
-Build the Flex Template [from the SDK container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
-Using the runtime image as the Flex Template image reduces the number of Docker images that need to be maintained.
-It also ensures that the pipeline uses the same dependencies at submission and at runtime.
+Build the Flex Template
+[from the SDK container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
+Using the runtime image as the Flex Template image reduces the number of Docker
+images that need to be maintained. It also ensures that the pipeline uses the
+same dependencies at submission and at runtime.
 
 ```sh
 export TEMPLATE_FILE=gs://$BUCKET/longest-word-$TAG.json
@@ -133,40 +167,50 @@ gcloud dataflow flex-template run "flex-`date +%Y%m%d-%H%M%S`" \
 ```
 
 After the pipeline finishes, use the following command to inspect the output:
-```
+
+```bash
 gsutil cat gs://$BUCKET/output*
 ```
 
 ## Optional: Update the dependencies in the requirements file and rebuild the Docker images
 
-The top-level pipeline dependencies are defined in the `install_requires` section of the `setup.py` file.
+The top-level pipeline dependencies are defined in the `dependencies` section of
+the `pyproject.toml` file.
 
-The `requirements.txt` file pins all Python dependencies that must be installed in the Docker container image, including the transitive dependencies. Listing all packages produces reproducible Python environments every time the image is built.
-Version control the `requirements.txt` file together with the rest of pipeline code.
+The `requirements.txt` file pins all Python dependencies that must be installed
+in the Docker container image, including the transitive dependencies. Listing
+all packages produces reproducible Python environments every time the image is
+built. Version control the `requirements.txt` file together with the rest of
+the pipeline code.
 
-When the dependencies of your pipeline change or when you want to use the latest available versions of packages in the pipeline's dependency chain, regenerate the `requirements.txt` file:
+When the dependencies of your pipeline change or when you want to use the latest
+available versions of packages in the pipeline's dependency chain, regenerate
+the `requirements.txt` file:
 
-```
-python3.11 -m pip install pip-tools  # Use a consistent minor version of Python throughout the project.
-pip-compile ./setup.py
-```
+```bash
+python3 -m pip install pip-tools
+python3 -m piptools compile -o requirements.txt pyproject.toml
+```
 
-If you base your custom container image on the standard Apache Beam base image, to reduce the image size and to give preference to the versions already installed in the Apache Beam base image, use a constraints file:
+If you base your custom container image on the standard Apache Beam base image,
+to reduce the image size and to give preference to the versions already
+installed in the Apache Beam base image, use a constraints file:
 
-```
-wget https://raw.githubusercontent.com/apache/beam/release-2.54.0/sdks/python/container/py311/base_image_requirements.txt
-pip-compile --constraint=base_image_requirements.txt ./setup.py
+```bash
+wget https://raw.githubusercontent.com/apache/beam/release-2.54.0/sdks/python/container/py311/base_image_requirements.txt
+python3 -m piptools compile --constraint=base_image_requirements.txt ./pyproject.toml
 ```
 
 Alternatively, take the following steps:
 
-1. Use an empty `requirements.txt` file.
-1. Build the SDK container Docker image from the Docker file.
-1. Collect the output of `pip freeze` at the last stage of the Docker build.
-1. Seed the `requirements.txt` file with that content.
-
-For more information, see the Apache Beam [reproducible environments](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#create-reproducible-environments) documentation.
+1. Use an empty `requirements.txt` file.
+1. Build the SDK container Docker image from the Docker file.
+1. Collect the output of `pip freeze` at the last stage of the Docker build.
+1. Seed the `requirements.txt` file with that content.
 
+For more information, see the Apache Beam
+[reproducible environments](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#create-reproducible-environments)
+documentation.
 
 ## What's next?
 
@@ -177,4 +221,4 @@ For more information about building and using custom containers, see
 📝 [Use custom containers in Dataflow](https://cloud.google.com/dataflow/docs/guides/using-custom-containers).
 
 To reduce Docker image build time, see:
-📝 [Using Kaniko Cache](https://cloud.google.com/build/docs/optimize-builds/kaniko-cache).
+📝 [Using Kaniko Cache](https://cloud.google.com/build/docs/optimize-builds/kaniko-cache).
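The pipeline that the README describes (find the longest word, render it with pyfiglet, write it out) reduces to a very small piece of logic. As an illustrative sketch only, here is step 1 in plain Python, without Beam and with the FIGlet rendering and file output steps left out:

```python
def longest_word(text: str) -> str:
    """Return the longest whitespace-delimited word; first one wins on ties."""
    return max(text.split(), key=len, default="")

# Step 1 of the pipeline. Steps 2 and 3 (pyfiglet rendering and writing
# the banner to an output file) are omitted in this sketch.
print(longest_word("build a dataflow flex template"))  # → dataflow
```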
dataflow/flex-templates/pipeline_with_dependencies/pyproject.toml

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+# Copyright 2024 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+[build-system]
+requires = ["setuptools"]
+build-backend = "setuptools.build_meta"
+
+[project]
+name = "my_package"
+version = "0.1.0"
+dependencies = [
+    "apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile`.
+    "pyfiglet",  # This is the only non-Beam dependency of this pipeline.
+]

dataflow/flex-templates/pipeline_with_dependencies/requirements.txt

Lines changed: 3 additions & 10 deletions
@@ -11,21 +11,14 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
-
-# This requirements.txt file lists all Python dependencies that are installed
-# in the Docker image in this example. Listing all dependencies allows
-# creating reproducible Python environments. For more information,
-# see the "Update the dependencies in the requirements file" section in README.md
-
 #
 # This file is autogenerated by pip-compile with Python 3.11
 # by the following command:
 #
-#    pip-compile ./setup.py
+#    pip-compile --output-file=requirements.txt pyproject.toml
 #
 apache-beam[gcp]==2.54.0
-    # via my_package (setup.py)
+    # via my_package (pyproject.toml)
 attrs==23.2.0
     # via
     #   jsonschema
@@ -261,7 +254,7 @@
 pydot==1.4.2
     # via apache-beam
 pyfiglet==1.0.2
-    # via my_package (setup.py)
+    # via my_package (pyproject.toml)
 pyjsparser==2.7.1
     # via js2py
 pymongo==4.6.3
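Every line in the regenerated requirements.txt is an exact pin like `apache-beam[gcp]==2.54.0`, which is what makes the image build reproducible. The sketch below splits such a pin into name, extras, and version; it is purely illustrative (pip and pip-tools have their own, much more complete requirement parsers):

```python
import re

def parse_pin(line: str) -> tuple[str, str, str]:
    """Split an exact pin like 'apache-beam[gcp]==2.54.0' into (name, extras, version)."""
    m = re.fullmatch(r"([A-Za-z0-9_.-]+)(\[[^\]]+\])?==(\S+)", line.strip())
    if not m:
        raise ValueError(f"not an exact pin: {line!r}")
    name, extras, version = m.groups()
    return name, (extras or "").strip("[]"), version

print(parse_pin("apache-beam[gcp]==2.54.0"))  # → ('apache-beam', 'gcp', '2.54.0')
print(parse_pin("pyfiglet==1.0.2"))           # → ('pyfiglet', '', '1.0.2')
```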

dataflow/flex-templates/pipeline_with_dependencies/setup.py

Lines changed: 7 additions & 11 deletions
@@ -12,16 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-"""Defines a Python package for an Apache Beam pipeline."""
+"""A setuptools configuration file stub for the pipeline package.
 
-import setuptools
+Note that the package is completely defined by pyproject.toml.
+This file is optional. It is only necessary if you must use the --setup_file
+pipeline option or the FLEX_TEMPLATE_PYTHON_SETUP_FILE configuration option.
+"""
 
-setuptools.setup(
-    name="my_package",
-    version="0.1.0",
-    install_requires=[
-        "apache-beam[gcp]==2.54.0",  # Must match the version in `Dockerfile`.
-        "pyfiglet",  # This is the only non-Beam dependency of this pipeline.
-    ],
-    packages=setuptools.find_packages(),
-)
+import setuptools
+setuptools.setup()

dataflow/flex-templates/pipeline_with_dependencies/my_package/__init__.py renamed to dataflow/flex-templates/pipeline_with_dependencies/src/my_package/__init__.py

File renamed without changes.

dataflow/flex-templates/pipeline_with_dependencies/my_package/launcher.py renamed to dataflow/flex-templates/pipeline_with_dependencies/src/my_package/launcher.py

File renamed without changes.

dataflow/flex-templates/pipeline_with_dependencies/my_package/my_pipeline.py renamed to dataflow/flex-templates/pipeline_with_dependencies/src/my_package/my_pipeline.py

File renamed without changes.

dataflow/flex-templates/pipeline_with_dependencies/my_package/my_transforms.py renamed to dataflow/flex-templates/pipeline_with_dependencies/src/my_package/my_transforms.py

File renamed without changes.
