Fix three file descriptor leaks in kernel connection lifecycle (#1506) #1619
Merged
The cleanup done-callback called iopub_channel.stop_on_recv() before
closing the transient shell/control sockets it owns. When iopub was
closed by a concurrent websocket disconnect (common during tab close
or reload mid-nudge), stop_on_recv raised OSError("Stream is closed")
inside the callback, aborting cleanup and leaking the shell and
control ZMQStreams (4 FDs per occurrence). Reorder cleanup to close
the sockets we own first, then stop_on_recv only if iopub is still
open.
Refs: jupyter-server#1506
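A minimal sketch of the reordering described in this commit message. `FakeStream`, `cleanup_before`, and `cleanup_after` are hypothetical names used for illustration; they only mimic the `closed()`, `close()`, and `stop_on_recv()` behaviour described above and are not the actual code in `channels.py`.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQStream (illustration only)."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True

    def stop_on_recv(self):
        # Mirrors the failure described above for an already-closed stream.
        if self._closed:
            raise OSError("Stream is closed")


def cleanup_before(iopub, shell, control):
    # Pre-fix order: if iopub is already closed this raises,
    # and the two closes below never run.
    iopub.stop_on_recv()
    for stream in (shell, control):
        if not stream.closed():
            stream.close()


def cleanup_after(iopub, shell, control):
    # Post-fix order: release the sockets we own first...
    for stream in (shell, control):
        if not stream.closed():
            stream.close()
    # ...then touch iopub only if it is still open.
    if not iopub.closed():
        iopub.stop_on_recv()
```

With an already-closed `iopub`, `cleanup_before` raises and leaves `shell` and `control` open, while `cleanup_after` closes both and returns normally.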
When a kernel restart changes ZMQ ports, connect() takes the ports_changed branch and calls create_stream() to install fresh channels, but never closes the previously-buffered set retrieved from get_buffer(). Those ZMQStreams stay registered with the Tornado IOLoop and never get reclaimed, leaking 4 streams (8 FDs) per reconnect-after-restart. Close the stale buffered channels before calling create_stream(). Refs: jupyter-server#1506
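A rough sketch of the close-before-replace pattern from this commit message. `FakeStream`, `close_stale_channels`, and `reconnect` are hypothetical helpers, and the `create_stream` argument is an assumed zero-argument factory standing in for the real method; the real `connect()` is more involved.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQStream (illustration only)."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True


def close_stale_channels(channels):
    """Close every still-open stream in a {name: stream} mapping."""
    for stream in channels.values():
        if not stream.closed():
            stream.close()


def reconnect(buffered_channels, ports_changed, create_stream):
    """Return the channels to use after a reconnect.

    `create_stream` is a hypothetical zero-argument factory standing in
    for the real create_stream() method.
    """
    if ports_changed:
        # The buffered set points at the old ports: close it before
        # replacing it, otherwise nothing ever releases those streams.
        close_stale_channels(buffered_channels)
        return create_stream()
    # Ports unchanged: the buffered channels are still valid, so reuse them.
    return buffered_channels
```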
When a kernel fails to reply to kernel_info_request (hung/rogue), _handle_kernel_info_reply never fires and kernel_info_channel stays open. disconnect() did not close it, so it leaked on every disconnect that followed a failed nudge (2 FDs per occurrence). Close the channel at the top of disconnect(), above the start_buffering early-return, so all control paths - including the common single-tab close that triggers buffering - release it. Refs: jupyter-server#1506
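A simplified sketch of hoisting the close above the early return. `FakeConnection`, its `should_buffer` flag, and `channels_closed` are hypothetical simplifications of the real connection class, used only to show the control flow.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQStream (illustration only)."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True


class FakeConnection:
    """Hypothetical, heavily simplified model of the connection object."""

    def __init__(self, kernel_info_channel, should_buffer):
        self.kernel_info_channel = kernel_info_channel
        self.should_buffer = should_buffer
        self.channels_closed = False

    def disconnect(self):
        # Release the nudge channel first, so every path below frees it.
        channel = self.kernel_info_channel
        if channel is not None and not channel.closed():
            channel.close()
        self.kernel_info_channel = None

        if self.should_buffer:
            # Stand-in for the start_buffering early return.
            return

        # Stand-in for the non-buffering teardown of the remaining channels.
        self.channels_closed = True
```

Because the close sits above the early return, the buffering path and the full-teardown path both release the channel.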
bfb21fb to 4db29b2
I'm rebasing on main as I added CI for Python 3.13 and 3.14, but as CI is passing and there are tests, I'm voting to merge this. If it fails on more recent Python versions (in particular 3.14t, because it's finicky), we can just mark those tests as xfail on the relevant platform and move on. Thanks a lot for your work on this.
Carreau
approved these changes
Apr 30, 2026
Fixes #1506.
Three independent FD leaks in `ZMQChannelsWebsocketConnection`:

1. `nudge()` cleanup race (`channels.py:207`). The cleanup done-callback called `iopub_channel.stop_on_recv()` before closing the transient shell/control sockets it owns. If iopub was closed by a concurrent websocket disconnect, `stop_on_recv` raised `OSError("Stream is closed")` inside the callback, aborting cleanup and leaking 4 FDs. Triggered by routine tab close or reload during the nudge window; surfaces in server logs as an unraisable `OSError` traceback pointing at `nudge.<locals>.cleanup`.
2. Stale buffered channels on reconnect with port change (`channels.py:357`). When a kernel restart changes ZMQ ports, `connect()` replaces `self.channels` via `create_stream()` without closing the previously-buffered set. Those streams stay registered with the Tornado IOLoop and never get reclaimed. 8 FDs per reconnect-after-restart; triggered by kernels that auto-restart under load.
3. Orphaned `kernel_info_channel` on disconnect (`channels.py:423`). If the kernel never replies to `kernel_info_request` (hung/rogue), `_handle_kernel_info_reply` never fires to close the channel, and `disconnect()` didn't close it either. Hoisted the close above the `start_buffering` early-return so the common single-tab disconnect path releases it. 2 FDs per disconnect-after-failed-nudge.

Likely primary root cause

Leak 1 is literally on the `nudge()` code path the reporter narrowed to, is triggered by common UI actions (tab close or reload during connection establishment), and leaves a log signature that matches a "narrow by grep" investigation methodology. Leak 2 is a strong secondary given the higher per-event cost under rogue-kernel workloads that cause frequent restarts. Leak 3 is narrower but real and worth fixing while we're here.

Tests

One commit per leak. Each fix ships with a regression test that fails on the pre-fix code and passes after. Verified locally: `pytest tests/services/kernels/` → 47 passed.
Contributed by @tonyx93, opus, hickory, chestnut.