* cleanup: structure dependencies using src/
Modernize the example by using an src/ folder
instead of a flat folder structure. This is more
aligned with today's Python best practices.
See CL 663542630 for more details.
* feat: add dataflow example with pyproject.toml
The current pipeline_with_dependencies example
uses setup.py. A more up-to-date approach is to
use pyproject.toml. This folder contains a
copy of the original python_with_dependencies, but
leverages pyproject.toml for the package setup.
See CL 663542630 for additional context.
* fix: move _test files to root
For the tests to be recognized, they must be in the same
folder as the nox configurations.
* cleanup: remove pipeline_with_dependencies_toml
* feat: use toml file for packaging
* cleanup: regenerate requirements.txt
* cleanup: remove .egg-info
* fix: ignore egg-info
* nit: order files alphabetically
* docs: update README for requirements.txt
* review: address review feedback GoogleCloudPlatform#12458
* cleanup: move main.py to root folder
* fix: fix path to main.py
* cleanup: add build-system to pyproject.toml
# Dataflow Flex Template: a pipeline with dependencies and a custom container image

[](https://console.cloud.google.com/cloudshell/open?git_repo=https://github.com/GoogleCloudPlatform/python-docs-samples&page=editor&open_in_editor=dataflow/flex-templates/pipeline_with_dependencies/README.md)
This project illustrates the following Dataflow Python pipeline setup:

- The pipeline is a package that consists of [multiple files](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#multiple-file-dependencies).

- The pipeline has at least one dependency that is not provided in the default Dataflow runtime environment.

- The workflow uses a [custom container image](https://cloud.google.com/dataflow/docs/guides/using-custom-containers) to preinstall dependencies and to define the pipeline runtime environment.

- The workflow uses a [Dataflow Flex Template](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates) to control the pipeline submission environment.

- The runtime and submission environments use the same set of Python dependencies and can be created in a reproducible manner.
To illustrate this setup, we use a pipeline that does the following:

1. Finds the longest word in an input file.

1. Creates a [FIGLet text banner](https://en.wikipedia.org/wiki/FIGlet) from it using [pyfiglet](https://pypi.org/project/pyfiglet/).

1. Outputs the text banner in another file.
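The first step above is a simple reduction. A minimal sketch of that logic in plain Python (an illustration only, not the sample's actual code; the real pipeline expresses this as Beam transforms and renders the banner with `pyfiglet`):

```python
def longest_word(text: str) -> str:
    """Return the longest whitespace-separated word in the text."""
    # max() with key=len returns the first longest word on ties.
    return max(text.split(), key=len, default="")

print(longest_word("a Dataflow pipeline with dependencies"))  # → dependencies
```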
## The structure of the example
The pipeline package comprises the `src/my_package` directory, the
`pyproject.toml` file, and the `setup.py` file. The package defines the
pipeline, the pipeline dependencies, and the input parameters. You can define
multiple pipelines in the same package. The `my_package.launcher` module is used
to submit the pipeline to a runner.

The `main.py` file provides a top-level entrypoint to trigger the pipeline
launcher from a launch environment.

The `Dockerfile` defines the runtime environment for the pipeline. It also
configures the Flex Template, which lets you reuse the runtime image to build
the Flex Template.

The `requirements.txt` file defines all Python packages in the dependency chain
of the pipeline package. Use it to create reproducible Python environments in
the Docker image.

The `metadata.json` file defines Flex Template parameters and their validation
rules. It is optional.
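A `pyproject.toml` for such a package might look like the following minimal sketch. The package name, version, and dependency list here are illustrative assumptions, not the sample's actual file:

```toml
# Illustrative sketch only; names and versions are assumptions.
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "my_package"
version = "0.1.0"
# Top-level pipeline dependencies; transitive pins live in requirements.txt.
dependencies = [
  "apache-beam[gcp]",
  "pyfiglet",
]

[tool.setuptools.packages.find]
where = ["src"]
```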
## Before you begin
1. Follow the [Dataflow setup instructions](../../README.md).

1. [Enable the Cloud Build API](https://console.cloud.google.com/flows/enableapi?apiid=cloudbuild.googleapis.com).

1. Clone the [`python-docs-samples` repository](https://github.com/GoogleCloudPlatform/python-docs-samples) and navigate to the code sample.
Using a custom container image allows flexible customizations of the runtime environment.
This example uses the custom container image both to preinstall all of the
pipeline dependencies before job submission and to create a reproducible runtime
environment.

To illustrate customizations, a
[custom base image](https://cloud.google.com/dataflow/docs/guides/build-container-image#use_a_custom_base_image)
is used to build the SDK container image.

The Flex Template launcher is included in the SDK container image, which makes
it possible to
[use the SDK container image to build a Flex Template](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
```sh
# Use a unique tag to version the artifacts that are built.
```
Build the Flex Template
[from the SDK container image](https://cloud.google.com/dataflow/docs/guides/templates/configuring-flex-templates#use_custom_container_images).
Using the runtime image as the Flex Template image reduces the number of Docker
images that need to be maintained. It also ensures that the pipeline uses the
same dependencies at submission and at runtime.
After the pipeline finishes, use the following command to inspect the output:

```bash
gsutil cat gs://$BUCKET/output*
```
## Optional: Update the dependencies in the requirements file and rebuild the Docker images
The top-level pipeline dependencies are defined in the `dependencies` section of
the `pyproject.toml` file.
The `requirements.txt` file pins all Python dependencies that must be installed
in the Docker container image, including the transitive dependencies. Listing
all packages produces reproducible Python environments every time the image is
built. Version control the `requirements.txt` file together with the rest of the
pipeline code.
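A pinned `requirements.txt` lists every package at an exact version, for example (the versions below are purely illustrative, not the sample's actual pins):

```
# Illustrative pins only; regenerate with pip-tools for real values.
apache-beam[gcp]==2.57.0
pyfiglet==1.0.2
```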
When the dependencies of your pipeline change or when you want to use the latest
available versions of packages in the pipeline's dependency chain, regenerate
the `requirements.txt` file:
```sh
python3.11 -m pip install pip-tools # Use a consistent minor version of Python throughout the project.
```
If you base your custom container image on the standard Apache Beam base image,
use a constraints file to reduce the image size and to give preference to the
versions that are already installed in the Apache Beam base image.
1. Use an empty `requirements.txt` file.

1. Build the SDK container Docker image from the Docker file.

1. Collect the output of `pip freeze` at the last stage of the Docker build.

1. Seed the `requirements.txt` file with that content.

For more information, see the Apache Beam [reproducible environments](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#create-reproducible-environments) documentation.