What happened?
We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to protobuf==4.x.x. We rootcaused this leak to protocolbuffers/protobuf#14571 and it has been remediated in Beam 2.52.0.
[update: 2023-12-19]: Due to another issue related to protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0 or switch to Beam 2.53.0 once available.
Mitigation
Until Beam 2.52.0 is released, consider any of the following workarounds:
-
Use apache-beam==2.46.0 or below.
-
Install protobuf 3.x in the submission and runtime environment. For example, you can use a --requirements_file pipeline option with a file that includes:
protobuf==3.20.3
grpcio-status==1.48.2
For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
-
Use a python implementation of protobuf by setting a PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python environment variable in the runtime environment. This might degrade the performance since python implementation is less efficient. For example, you could create a custom Beam SDK container from a Dockerfile that looks like the following:
FROM apache/beam_python3.10_sdk:2.47.0
ENV PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
For more information, see: https://beam.apache.org/documentation/runtime/environments/
-
Install protobuf==4.25.0 or newer in the submission and runtime environment.
Users of Beam 2.50.0 SDK should additionally follow mitigation options for #28318.
Additional details
The leak can be reproduced by a pipeline:
with beam.Pipeline(options=pipeline_options) as p:
# duplicate reads to increase throughput
inputs = []
for i in range(32):
inputs.append(
p | f"Read pubsub{i}" >> ReadFromPubSub(topic='projects/pubsub-public-data/topics/taxirides-realtime', with_attributes=True)
)
inputs | beam.Flatten()
Dataflow pipeline options for the above pipeline: --max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32
The leak was triggered by Beam switching default protobuf package version from 3.19.x to 4.22.x in #24599. The new versions of protobuf also switched the default protobuf implemetation to a upb implementation. The upb implementation had two known leaks that have since been mitigated by protobuf team in: protocolbuffers/protobuf#10088, https://github.com/protocolbuffers/upb/issues/1243 . The latest available protobuf==4.24.4 does not yet have the fix, but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components
What happened?
We have identified a memory leak that affects Beam Python SDK versions 2.47.0 and above. The leak was triggered by an upgrade to
protobuf==4.x.x. We rootcaused this leak to protocolbuffers/protobuf#14571 and it has been remediated in Beam 2.52.0.[update: 2023-12-19]: Due to another issue related to protobuf upgrade, Python streaming users should continue to apply the mitigation steps below with Beam 2.52.0 or switch to Beam 2.53.0 once available.
Mitigation
Until Beam 2.52.0 is released, consider any of the following workarounds:
Use
apache-beam==2.46.0or below.Install protobuf 3.x in the submission and runtime environment. For example, you can use a
--requirements_filepipeline option with a file that includes:For more information, see: https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/
Use a
pythonimplementation of protobuf by setting aPROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=pythonenvironment variable in the runtime environment. This might degrade the performance since python implementation is less efficient. For example, you could create a custom Beam SDK container from aDockerfilethat looks like the following:For more information, see: https://beam.apache.org/documentation/runtime/environments/
Install protobuf==4.25.0 or newer in the submission and runtime environment.
Users of Beam 2.50.0 SDK should additionally follow mitigation options for #28318.
Additional details
The leak can be reproduced by a pipeline:
Dataflow pipeline options for the above pipeline:
--max_num_workers=1 --autoscaling_algorithm=NONE --worker_machine_type=n2-standard-32The leak was triggered by Beam switching default
protobufpackage version from 3.19.x to 4.22.x in #24599. The new versions ofprotobufalso switched the default protobuf implemetation to aupbimplementation. Theupbimplementation had two known leaks that have since been mitigated by protobuf team in: protocolbuffers/protobuf#10088, https://github.com/protocolbuffers/upb/issues/1243 . The latest availableprotobuf==4.24.4does not yet have the fix,but we have confirmed that using a patched version built in https://github.com/protocolbuffers/upb/actions/runs/6028136812 fixes the leak.Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components