Skip to content

Commit 34b8eea

Browse files
authored
Introduce Python Wordcount with Rust Wrapped Functions (#37234)
* Introduce Python Wordcount with Rust Wrapped Functions * apache licenses * one more license * formatting and whitespace
1 parent 716ce6a commit 34b8eea

7 files changed

Lines changed: 488 additions & 0 deletions

File tree

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
<!--
2+
Licensed to the Apache Software Foundation (ASF) under one
3+
or more contributor license agreements. See the NOTICE file
4+
distributed with this work for additional information
5+
regarding copyright ownership. The ASF licenses this file
6+
to you under the Apache License, Version 2.0 (the
7+
"License"); you may not use this file except in compliance
8+
with the License. You may obtain a copy of the License at
9+
10+
http://www.apache.org/licenses/LICENSE-2.0
11+
12+
Unless required by applicable law or agreed to in writing,
13+
software distributed under the License is distributed on an
14+
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
15+
KIND, either express or implied. See the License for the
16+
specific language governing permissions and limitations
17+
under the License.
18+
-->
19+
20+
This directory contains an example of a Python pipeline that uses Rust DoFns to perform some of the string processing in wordcount. This is performed using [PyO3](https://pyo3.rs/v0.27.2/) to produce bindings for the Rust code, managed using the [maturin](https://github.com/PyO3/maturin) python package.
21+
22+
This example should be built and run in a Python virtual environment with Apache Beam and maturin installed. The `requirements.txt` file in this directory can be used to install the version of maturin used when the example was created.
23+
24+
To build the Rust code, run the following from the wordcount_rust directory:
25+
26+
```bash
27+
cd ./word_processing
28+
maturin develop
29+
```
30+
31+
This will compile the Rust code and build a Python package linked to it in the current environment. The resulting package can be imported as a Python module called `word_processing`.
32+
33+
To execute wordcount locally using the direct runner, execute the following from the wordcount_rust directory within the same virtual environment:
34+
35+
```bash
36+
python wordcount.py --runner DirectRunner --input * --output counts.txt
37+
```
38+
39+
To execute wordcount using the Dataflow runner, the tarball of the PyO3 Rust package must be provided to GCP. This is done by building the tarball then providing it as an `extra_package` argument. The tarball can be built using the following command from the wordcount_rust directory:
40+
41+
```bash
42+
cd ./word_processing
43+
python -m build --sdist
44+
```
45+
This places the tarball in `./word_processing/dist` as `word_processing-0.1.0.tar.gz`. Job submission to Dataflow from the `wordcount_rust` directory then looks like the following:
46+
47+
```bash
48+
python wordcount.py --runner DataflowRunner --input gs://apache-beam-samples/shakespeare/*.txt --output gs://<YOUR_BUCKET>/wordcount_rust/counts.txt --project <YOUR_PROJECT> --region <YOUR_REGION> --extra_package ./word_processing/dist/word_processing-0.1.0.tar.gz
49+
```
50+
51+
The job will then execute on Dataflow, installing the Rust package during worker setup. Wordcount will then execute and produce a counts.txt file in the specified output bucket.
Lines changed: 19 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,19 @@
1+
#
2+
# Licensed to the Apache Software Foundation (ASF) under one or more
3+
# contributor license agreements. See the NOTICE file distributed with
4+
# this work for additional information regarding copyright ownership.
5+
# The ASF licenses this file to You under the Apache License, Version 2.0
6+
# (the "License"); you may not use this file except in compliance with
7+
# the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
#
17+
18+
build=1.3.0
19+
maturin==1.11.2

sdks/python/apache_beam/examples/wordcount_rust/word_processing/Cargo.lock

Lines changed: 234 additions & 0 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
#
2+
# Licensed to the Apache Software Foundation (ASF) under one or more
3+
# contributor license agreements. See the NOTICE file distributed with
4+
# this work for additional information regarding copyright ownership.
5+
# The ASF licenses this file to You under the Apache License, Version 2.0
6+
# (the "License"); you may not use this file except in compliance with
7+
# the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
#
17+
18+
[package]
19+
name = "word_processing"
20+
version = "0.1.0"
21+
edition = "2024"
22+
23+
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
24+
[lib]
25+
name = "word_processing"
26+
crate-type = ["cdylib"]
27+
28+
[dependencies]
29+
pyo3 = "0.27.0"
30+
regex = "1.12.2"
Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
#
2+
# Licensed to the Apache Software Foundation (ASF) under one or more
3+
# contributor license agreements. See the NOTICE file distributed with
4+
# this work for additional information regarding copyright ownership.
5+
# The ASF licenses this file to You under the Apache License, Version 2.0
6+
# (the "License"); you may not use this file except in compliance with
7+
# the License. You may obtain a copy of the License at
8+
#
9+
# http://www.apache.org/licenses/LICENSE-2.0
10+
#
11+
# Unless required by applicable law or agreed to in writing, software
12+
# distributed under the License is distributed on an "AS IS" BASIS,
13+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
# See the License for the specific language governing permissions and
15+
# limitations under the License.
16+
#
17+
18+
[build-system]
19+
requires = ["maturin>=1.11,<2.0"]
20+
build-backend = "maturin"
21+
22+
[project]
23+
name = "word_processing"
24+
requires-python = ">=3.8"
25+
classifiers = [
26+
"Programming Language :: Rust",
27+
"Programming Language :: Python :: Implementation :: CPython",
28+
"Programming Language :: Python :: Implementation :: PyPy",
29+
]
30+
dynamic = ["version"]

0 commit comments

Comments
 (0)