Skip to content

Expose GRPC_OPENSSL_CLEANUP_TIMEOUT to control shutdown grace period#39297

Closed
EngHabu wants to merge 10 commits intogrpc:masterfrom
unionai-oss:enghabu/shutdown-timeout
Closed

Expose GRPC_OPENSSL_CLEANUP_TIMEOUT to control shutdown grace period#39297
EngHabu wants to merge 10 commits intogrpc:masterfrom
unionai-oss:enghabu/shutdown-timeout

Conversation

@EngHabu
Copy link
Copy Markdown
Contributor

@EngHabu EngHabu commented Apr 17, 2025

This addresses this fairly annoying issue at shutdown of any python process that uses aio.

The issue and some workarounds are explained collectively in the following issues:
#38679
#33342
#22365

label: Python

Let me dig deeper into the repro and root cause and why this PR.

1. The repro

# Initialize an async channel
# Close the app

There are two issues you see here:

  1. These empty error messages:
Error in sys.excepthook:

Original exception was:
  1. It takes 2 seconds (exactly 2 seconds) to actually shutdown the process afterwards. This may not be a problem if you are running a server but in our case these are ephemeral batch containers that in some cases take < 1 sec to run and finish. Adding 2 seconds to shutdown is simply not an option.

If you read the issues above, you are told you should initialize aio by adding:

from grpc.experimental.aio import init_grpc_aio
init_grpc_aio()

Early on the main thread.

This, correctly, addresses the first issue (empty error messages) but leaves the second issue still there. If you notice in the exported methods, there is also shutdown_grpc_aio so we attempted to do so, unsuccessfully, in various places:
a. Just in the main thread before main() exits
b. As an atexit handler

from grpc.experimental.aio shutdown_grpc_aio
import atexit

@atexit.register
def graceful_shutdown():
    shutdown_grpc_aio()

2. What's going on

To debug the issue, we recompiled grpc with tracing and additional logs:

pip install --force --no-binary :all: --global-option build --global-option --debug --no-use-pep517 .
export GRPC_TRACE=shutdown,timer,polling,executor,api,tsi
export GRPC_VERBOSITY=debug
export PYTHONFAULTHANDLER=1
export PYTHONUNBUFFERED=1
python my_test.py

and noticed the following:

I0000 00:00:1744926877.325494 18646209 init.cc:229] grpc_wait_for_shutdown_with_timeout()
E0000 00:00:1744926879.330106 18646209 init.cc:236] grpc_wait_for_shutdown_with_timeout() timed out.

The delta between these two logs are exactly 2 seconds. This is triggered in the following function:

std::atexit([]() { grpc_wait_for_shutdown_with_timeout(absl::Seconds(2)); });

It waits (for 2 seconds) for g_intializations to get to 0 before timing out.
while (g_initializations != 0) {

void grpc_shutdown(void) { is the only place that reasonably decrements it. This gets called if we call shutdown_grpc_aio, so why isn't it working?!

Looking at the code, it seems reasonable to suggest that if you are going to call init_grpc_aio then you must make sure shutdown_grpc_aio is also called.

3. What's this PR

Regardless of the proper way to handle shutdown, I think enforcing a constant 2seconds may not always be the desired grace period for all applications. This PR allows that grace period to be configurable and continues to default to 2 seconds.

After the PR:

export GRPC_OPENSSL_CLEANUP_TIMEOUT=0.1
I0000 00:00:1744928231.454168 18740582 init.cc:237] grpc_wait_for_shutdown_with_timeout()
E0000 00:00:1744928231.454170 18740582 init.cc:244] grpc_wait_for_shutdown_with_timeout() timed out.

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 21, 2025

@drfloob mind taking a look?

@gtcooke94
Copy link
Copy Markdown
Contributor

Does this 2 seconds timeout occur if shutdown_grpc_aio is called? Or does it work properly?
This seems a generous timeout, but I think shortening it seems like working around the problem of not properly shutting down grpc, rather than fixing the actual problem?

@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 22, 2025

@gtcooke94

Does this 2 seconds timeout occur if shutdown_grpc_aio is called? Or does it work properly?

With shutdown_grpc_aio() 2seconds still happens + it shows those empty excepthook messages.

seems like working around the problem

I agree, however, in a situation where the process is shutting down (the whole container is shutting down), I'm less concerned about leaking resources than I'm about delaying the shutdown.
Note that I'm also not universally shortening the timeout.. It just makes it configurable.

Copy link
Copy Markdown
Contributor

@gtcooke94 gtcooke94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this from a security perspective, let's get other C++ reviewers to check how they want the env var - definitely at least define a constant for this and add some documentation

EngHabu added 2 commits April 23, 2025 13:53
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 23, 2025

Thank you, @gtcooke94!
I think that means you, @XuanWang-Amos :-)

@XuanWang-Amos
Copy link
Copy Markdown
Contributor

I'm actually from Python team, since this is Core change, @yashykt can you help review or assign?

@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 24, 2025

@XuanWang-Amos can you assign it to @yashykt to make sure they get pinged? or have it pop up in their dashboard or something :-)

Comment thread src/core/tsi/ssl_transport_security.cc Outdated
// Name of the environment variable controlling OpenSSL cleanup timeout.
// This variable allows users to specify the timeout (in seconds) for OpenSSL
// resource cleanup during gRPC shutdown. If not set, a default timeout is used.
#define GRPC_OPENSSL_CLEANUP_TIMEOUT_ENV "GRPC_OPENSSL_CLEANUP_TIMEOUT"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#define GRPC_OPENSSL_CLEANUP_TIMEOUT_ENV "GRPC_OPENSSL_CLEANUP_TIMEOUT"
#define GRPC_ARG_OPENSSL_CLEANUP_TIMEOUT_ENV "grpc.openssl_cleanup_timeout"

Comment thread src/core/tsi/ssl_transport_security.cc Outdated
// Retrieve the OpenSSL cleanup timeout from the environment variable.
// This allows users to override the default cleanup timeout for OpenSSL
// resource deallocation during gRPC shutdown.
const char* env = std::getenv(GRPC_OPENSSL_CLEANUP_TIMEOUT_ENV);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use grpc_core::GetEnv here

Comment thread src/core/tsi/ssl_transport_security.cc Outdated
const char* env = std::getenv(GRPC_OPENSSL_CLEANUP_TIMEOUT_ENV);
int timeout_sec = 2;
if (env != nullptr) {
timeout_sec = std::stoi(env);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use absl::SimpleAtoi instead

EngHabu added 3 commits April 25, 2025 09:47
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 25, 2025

@yashykt thank you for the review. Please see the updated commits.

@EngHabu EngHabu requested review from gtcooke94 and yashykt April 25, 2025 17:37
@XuanWang-Amos
Copy link
Copy Markdown
Contributor

Looks like our Bazel build failed, we might need change some BUILD files:

[5,568 / 5,672] Compiling src/core/ext/transport/chttp2/transport/chttp2_transport.cc; 12s remote-cache, processwrapper-sandbox ... (16 actions, 15 running)
ERROR: /var/local/jenkins/grpc/BUILD:4209:16: Compiling src/core/tsi/ssl_transport_security.cc failed: (Exit 1): clang-15 failed: error executing CppCompile command (from target //:tsi_ssl_credentials) 
  (cd /root/.cache/bazel/_bazel_root/954bb7512d44d20015390af6e76121c6/sandbox/processwrapper-sandbox/2728/execroot/com_github_grpc_grpc && \
  exec env - \
    FUZZINTRO_OUTDIR=/src \
    GRPC_BAZEL_RUNTIME=1 \
    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/out \
    PWD=/proc/self/cwd \
  /usr/local/bin/clang-15 -U_FORTIFY_SOURCE -fstack-protector -Wall -Wthread-safety -Wself-assign -Wunused-but-set-parameter -Wno-free-nonheap-object -fcolor-diagnostics -fno-omit-frame-pointer '-std=c++17' -MD -MF bazel-out/k8-fastbuild/bin/_objs/tsi_ssl_credentials/ssl_transport_security.pic.d '-frandom-seed=bazel-out/k8-fastbuild/bin/_objs/tsi_ssl_credentials/ssl_transport_security.pic.o' -fPIC -DCARES_STATICLIB -iquote . -iquote bazel-out/k8-fastbuild/bin -iquote external/com_google_absl -iquote bazel-out/k8-fastbuild/bin/external/com_google_absl -iquote external/com_google_googleapis -iquote bazel-out/k8-fastbuild/bin/external/com_google_googleapis -iquote external/com_google_protobuf -iquote bazel-out/k8-fastbuild/bin/external/com_google_protobuf -iquote external/com_github_cares_cares -iquote bazel-out/k8-fastbuild/bin/external/com_github_cares_cares -iquote external/zlib -iquote bazel-out/k8-fastbuild/bin/external/zlib -iquote external/boringssl -iquote bazel-out/k8-fastbuild/bin/external/boringssl -Ibazel-out/k8-fastbuild/bin/external/com_google_googleapis -Ibazel-out/k8-fastbuild/bin/external/com_google_protobuf/src/google/protobuf/_virtual_imports/any_proto -Ibazel-out/k8-fastbuild/bin/external/com_google_protobuf/third_party/utf8_range/_virtual_includes/utf8_range -Ibazel-out/k8-fastbuild/bin -isystem include -isystem bazel-out/k8-fastbuild/bin/include -isystem src/core/ext/upb-gen -isystem bazel-out/k8-fastbuild/bin/src/core/ext/upb-gen -isystem src/core/ext/upbdefs-gen -isystem bazel-out/k8-fastbuild/bin/src/core/ext/upbdefs-gen -isystem third_party/address_sorting/include -isystem bazel-out/k8-fastbuild/bin/third_party/address_sorting/include -isystem external/com_github_cares_cares/include -isystem bazel-out/k8-fastbuild/bin/external/com_github_cares_cares/include -isystem external/com_github_cares_cares -isystem bazel-out/k8-fastbuild/bin/external/com_github_cares_cares -isystem external/zlib -isystem bazel-out/k8-fastbuild/bin/external/zlib -isystem external/boringssl/src/include -isystem bazel-out/k8-fastbuild/bin/external/boringssl/src/include -Wno-typedef-redefinition -DGRPC_BAZEL_BUILD '-std=c++17' -c src/core/tsi/ssl_transport_security.cc -o bazel-out/k8-fastbuild/bin/_objs/tsi_ssl_credentials/ssl_transport_security.pic.o -no-canonical-prefixes -Wno-builtin-macro-redefined '-D__DATE__="redacted"' '-D__TIMESTAMP__="redacted"' '-D__TIME__="redacted"' '-fmodule-name=//:tsi_ssl_credentials' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/tsi_ssl_credentials.cppmap' -Xclang -fno-cxx-modules -fmodules-strict-decluse -Wprivate-header '-fmodule-map-file=external/local_config_cc/module.modulemap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/channel_arg_names.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/config_vars.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/gpr.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/grpc_base.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/grpc_core_credentials_header.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/grpc_credentials_util.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/grpc_public_hdrs.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/grpc_security_base.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/ref_counted_ptr.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/transport_auth_context.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/tsi_base.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/tsi_ssl_session_cache.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/channel_args.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/error.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/grpc_crl_provider.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/grpc_transport_chttp2_alpn.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/load_file.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/ref_counted.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/slice.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/sync.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/tsi_ssl_types.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/src/core/useful.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/base/core_headers.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/log/check.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/log/log.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/status/status.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/status/statusor.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/com_google_absl/absl/strings/strings.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/boringssl/src/crypto.cppmap' '-fmodule-map-file=bazel-out/k8-fastbuild/bin/external/boringssl/src/ssl.cppmap')
# Configuration: 7108c862523f59b4e742a7c9e976566182409b50b4e555417e1ded7aa0ed47dc
# Execution platform: @@internal_platforms_do_not_use//host:host

Use --sandbox_debug to see verbose messages from the sandbox and retain the sandbox build root for debugging
src/core/tsi/ssl_transport_security.cc:71:10: error: module //:tsi_ssl_credentials does not depend on a module exporting 'src/core/util/env.h'
#include "src/core/util/env.h"
         ^
1 error generated.
[5,585 / 5,672] checking cached actions
INFO: Elapsed time: 358.554s, Critical Path: 55.70s
INFO: 5585 processes: 2863 internal, 2722 processwrapper-sandbox.
ERROR: Build did NOT complete successfully

EngHabu added 2 commits April 28, 2025 22:03
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 29, 2025

@yashykt fixed the build issue. Mind helping me run CI again and merge the change?

@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 29, 2025

@yashykt I see it failed again...
I tried one of the failed targets:

bazel build --cpu=darwin_x86_64 --macos_cpus=x86_64 --host_cpu=darwin_x86_64 //src/python/grpcio/grpc/experimental:experimental
clang++: error: unsupported option '-maes' for target 'arm64-apple-macosx15.4'
clang++: error: unsupported option '-msse4.1' for target 'arm64-apple-macosx15.4'
Error in child process '/usr/bin/xcrun'. 1
Target //src/python/grpcio/grpc/experimental:experimental failed to build
Use --verbose_failures to see the command lines of failed build steps.
INFO: Elapsed time: 1.311s, Critical Path: 1.15s
INFO: 34 processes: 16 internal, 18 darwin-sandbox.
ERROR: Build did NOT complete successfully

Mind sharing the new errors?

@gtcooke94
Copy link
Copy Markdown
Contributor

I think we're still getting a similar build error - is the normal bazel build working locally for you?

src/core/tsi/ssl_transport_security.cc:71:10: error: module //:tsi_ssl_credentials does not depend on a module exporting 'src/core/util/env.h'
   71 | #include "src/core/util/env.h"
      |          ^
1 error generated.

Comment thread BUILD Outdated
"//src/core:tsi/ssl/key_logging/ssl_key_logging.h",
"//src/core:tsi/ssl_transport_security.h",
"//src/core:tsi/ssl_transport_security_utils.h",
"//src/core:env",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this dependency needs to get added to the deps section, not the hdrs section

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦🏽‍♂️ thank you... updated..
locally running this succeeds

>bazel build //:tsi_ssl_credentials                                                                                                                             

INFO: Running bazel wrapper (see //tools/bazel for details), bazel version 8.0.1 will be used instead of system-wide bazel installation.
Starting local Bazel server (8.0.1) and connecting to it...
WARNING: WORKSPACE support will be removed in Bazel 9 (late 2025), please migrate to Bzlmod, see https://bazel.build/external/migration.
WARNING: /private/var/tmp/_bazel_haytham/99b9644c7936a3395988ff63192eef6e/external/io_opentelemetry_cpp/api/BUILD:62:10: target '@@io_opentelemetry_cpp//api:with_abseil' is deprecated: The value of this flag is ignored. Bazel builds always depend on Abseil for its pre-adopted `std::` types. You should remove this flag from your build command.
INFO: Analyzed target //:tsi_ssl_credentials (146 packages loaded, 4184 targets configured).
INFO: Found 1 target...
Target //:tsi_ssl_credentials up-to-date:
  bazel-bin/libtsi_ssl_credentials.a
INFO: Elapsed time: 96.622s, Critical Path: 18.44s
INFO: 1505 processes: 250 action cache hit, 6 internal, 1499 darwin-sandbox.
INFO: Build completed successfully, 1505 total actions

EngHabu added 2 commits April 30, 2025 09:04
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu EngHabu requested a review from yashykt April 30, 2025 16:35
@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented Apr 30, 2025

@yashykt mind rerunning CI

@EngHabu
Copy link
Copy Markdown
Contributor Author

EngHabu commented May 1, 2025

@yashykt I assume this is some internal process you can help me drive :-)

@yashykt
Copy link
Copy Markdown
Member

yashykt commented May 1, 2025

Thanks for fixing all the errors, I'll get this merged

@yashykt yashykt requested a review from yousukseung May 1, 2025 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants