Fix three file descriptor leaks in kernel connection lifecycle (#1506) #1619

Merged
Carreau merged 3 commits into jupyter-server:main from tonyx93:fix/fd-leak-issue-1506
May 1, 2026

Conversation

@tonyx93
Contributor

@tonyx93 tonyx93 commented Apr 18, 2026

Fixes #1506.

Three independent FD leaks in ZMQChannelsWebsocketConnection:

  1. nudge() cleanup race (channels.py:207). The cleanup done-callback called iopub_channel.stop_on_recv() before closing the transient shell/control sockets it owns. If iopub was closed by a concurrent websocket disconnect, stop_on_recv raised OSError("Stream is closed") inside the callback, aborting cleanup and leaking 4 FDs. Triggered by routine tab close or reload during the nudge window; surfaces in server logs as an unraisable OSError traceback pointing at nudge.<locals>.cleanup.

  2. Stale buffered channels on reconnect with port change (channels.py:357). When a kernel restart changes ZMQ ports, connect() replaces self.channels via create_stream() without closing the previously-buffered set. Those streams stay registered with the Tornado IOLoop and never get reclaimed. 8 FDs per reconnect-after-restart; triggered by kernels that auto-restart under load.

  3. Orphaned kernel_info_channel on disconnect (channels.py:423). If the kernel never replies to kernel_info_request (hung/rogue), _handle_kernel_info_reply never fires to close the channel, and disconnect() did not close it either. The fix hoists the close above the start_buffering early-return so that the common single-tab disconnect path releases it. 2 FDs per disconnect-after-failed-nudge.
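To make the ordering problem in leak 1 concrete, here is a minimal sketch that does not use the real classes. `FakeStream`, `cleanup_before`, `cleanup_after`, and `run` are hypothetical names invented for illustration; the only behavior borrowed from the description above is that `stop_on_recv()` raises `OSError("Stream is closed")` on a closed stream.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQ stream; not the real pyzmq class."""

    def __init__(self, name):
        self.name = name
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True

    def stop_on_recv(self):
        # Mirrors the failure described above for a closed stream.
        if self._closed:
            raise OSError("Stream is closed")


def cleanup_before(iopub, transient):
    # Pre-fix ordering: touch iopub first, then close what we own.
    iopub.stop_on_recv()
    for stream in transient:
        stream.close()


def cleanup_after(iopub, transient):
    # Post-fix ordering: close what we own first, then touch iopub
    # only if it is still open.
    for stream in transient:
        if not stream.closed():
            stream.close()
    if not iopub.closed():
        iopub.stop_on_recv()


def run(cleanup):
    iopub = FakeStream("iopub")
    shell, control = FakeStream("shell"), FakeStream("control")
    iopub.close()  # simulates the concurrent websocket disconnect
    try:
        cleanup(iopub, [shell, control])
    except OSError:
        pass  # in the real callback this surfaces as an unraisable error
    return [s.name for s in (shell, control) if not s.closed()]


leaked_before = run(cleanup_before)
leaked_after = run(cleanup_after)
```

With the pre-fix ordering both transient streams are left open; with the reordered cleanup none are.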

Likely primary root cause

Leak 1 sits on the nudge() code path the reporter narrowed the problem down to, is triggered by common UI actions (tab close or reload during connection establishment), and leaves a log signature consistent with a "narrow by grep" investigation. Leak 2 is a strong secondary candidate given its higher per-event cost under rogue-kernel workloads that cause frequent restarts. Leak 3 is narrower but real and worth fixing at the same time.

Tests

One commit per leak. Each fix ships with a regression test that fails on the pre-fix code and passes after. Verified locally: `pytest tests/services/kernels/` → 47 passed.
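For readers who want to observe FD leaks of this kind directly, one possible approach is to compare the process's open-descriptor count before and after an action. This is a generic sketch, not the method used by the regression tests in this PR; `count_open_fds`, `leaked_fds`, `leaky`, and `tidy` are hypothetical helpers, and the `limit=1024` probe range is an assumption.

```python
import os


def count_open_fds(limit=1024):
    """Count descriptors in [0, limit) that the OS reports as open."""
    open_count = 0
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not an open descriptor
        open_count += 1
    return open_count


def leaked_fds(action):
    """Return how many more descriptors are open after action() than before."""
    before = count_open_fds()
    action()
    return count_open_fds() - before


_held = []  # keeps the leaked descriptors so they can be closed afterwards


def leaky():
    # Opens a pipe (two descriptors) and never closes it.
    _held.extend(os.pipe())


def tidy():
    # Opens a pipe and closes both ends again.
    read_fd, write_fd = os.pipe()
    os.close(read_fd)
    os.close(write_fd)


leak_delta = leaked_fds(leaky)
tidy_delta = leaked_fds(tidy)

for fd in _held:
    os.close(fd)
```

A test built on this idea would assert that the delta is zero after the connect/disconnect sequence under test.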


Contributed by @tonyx93, opus, hickory, chestnut.

@krassowski krassowski added the bug label Apr 18, 2026
@Carreau Carreau added this to the 2.18 milestone Apr 30, 2026
tonyx93 added 3 commits April 30, 2026 14:33
The cleanup done-callback called iopub_channel.stop_on_recv() before
closing the transient shell/control sockets it owns. When iopub was
closed by a concurrent websocket disconnect (common during tab close
or reload mid-nudge), stop_on_recv raised OSError("Stream is closed")
inside the callback, aborting cleanup and leaking the shell and
control ZMQStreams (4 FDs per occurrence). Reorder cleanup to close
the sockets we own first, then stop_on_recv only if iopub is still
open.

Refs: jupyter-server#1506

When a kernel restart changes ZMQ ports, connect() takes the
ports_changed branch and calls create_stream() to install fresh
channels, but never closes the previously-buffered set retrieved
from get_buffer(). Those ZMQStreams stay registered with the Tornado
IOLoop and never get reclaimed, leaking 4 streams (8 FDs) per
reconnect-after-restart. Close the stale buffered channels before
calling create_stream().

Refs: jupyter-server#1506
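The stale-channel fix in the commit above can be sketched with stand-in objects. Everything here is hypothetical: `FakeStream`, `make_channels`, `reconnect`, and the `close_stale` flag exist only to contrast the two behaviors, and the four channel names are an assumption about which streams make up the buffered set.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQ stream; not the real pyzmq class."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True


# Assumed channel names for the buffered set of four streams.
CHANNEL_NAMES = ("shell", "control", "iopub", "stdin")


def make_channels():
    return {name: FakeStream() for name in CHANNEL_NAMES}


def reconnect(buffered, ports_changed, close_stale):
    """Return the channels to use after a reconnect.

    close_stale=False models the pre-fix behavior, True the post-fix one.
    """
    if not ports_changed:
        return buffered  # ports unchanged: reuse the buffered streams
    if close_stale:
        for stream in buffered.values():
            if not stream.closed():
                stream.close()
    return make_channels()  # fresh streams for the new ports


def still_open(channels):
    return sum(1 for s in channels.values() if not s.closed())


old = make_channels()
reconnect(old, ports_changed=True, close_stale=False)
leaked_without_fix = still_open(old)

old = make_channels()
new = reconnect(old, ports_changed=True, close_stale=True)
leaked_with_fix = still_open(old)
```

Without the close, all four old streams remain open after the replacement; with it, only the fresh set is open.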
When a kernel fails to reply to kernel_info_request (hung/rogue),
_handle_kernel_info_reply never fires and kernel_info_channel stays
open. disconnect() did not close it, so it leaked on every disconnect
that followed a failed nudge (2 FDs per occurrence). Close the channel
at the top of disconnect(), above the start_buffering early-return, so
all control paths - including the common single-tab close that triggers
buffering - release it.

Refs: jupyter-server#1506
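The effect of hoisting the close above the early return can be illustrated as follows. `FakeStream` and `FakeConnection` are hypothetical stand-ins, and the `buffering` and `close_kernel_info` flags are invented to compare the pre-fix and post-fix paths; this is a sketch of the control flow, not the actual `disconnect()` implementation.

```python
class FakeStream:
    """Hypothetical stand-in for a ZMQ stream; not the real pyzmq class."""

    def __init__(self):
        self._closed = False

    def closed(self):
        return self._closed

    def close(self):
        self._closed = True


class FakeConnection:
    """Hypothetical model of the disconnect control flow."""

    def __init__(self, buffering, close_kernel_info):
        self.kernel_info_channel = FakeStream()
        self.channels = {"shell": FakeStream(), "iopub": FakeStream()}
        self.buffering = buffering
        # False models the pre-fix code, True the post-fix code.
        self.close_kernel_info = close_kernel_info

    def disconnect(self):
        # Post-fix: release the kernel_info channel before any early return.
        if self.close_kernel_info:
            channel = self.kernel_info_channel
            if channel is not None and not channel.closed():
                channel.close()
        if self.buffering:
            # Early return: the main channels are kept open for buffering.
            return
        for stream in self.channels.values():
            stream.close()


pre_fix = FakeConnection(buffering=True, close_kernel_info=False)
pre_fix.disconnect()

post_fix = FakeConnection(buffering=True, close_kernel_info=True)
post_fix.disconnect()

pre_fix_leaked = not pre_fix.kernel_info_channel.closed()
post_fix_leaked = not post_fix.kernel_info_channel.closed()
```

In the post-fix model the buffered channels stay open, as intended, while the kernel_info channel is released on every path.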
@Carreau Carreau force-pushed the fix/fd-leak-issue-1506 branch from bfb21fb to 4db29b2 Compare April 30, 2026 12:33
@Carreau
Contributor

Carreau commented Apr 30, 2026

I'm rebasing on main as I added CI for Python 3.13 and 3.14, but as CI is passing and there are tests, I'm in favor of merging this. If it fails on more recent Python (in particular 3.14t, because it's finicky), we can just mark those tests as xfail on the relevant platform and move on.

Thanks a lot for your work on this.

@Carreau Carreau merged commit 21e1104 into jupyter-server:main May 1, 2026
77 of 78 checks passed
@tonyx93 tonyx93 deleted the fix/fd-leak-issue-1506 branch May 4, 2026 19:41
Development

Successfully merging this pull request may close these issues.

Nudge function leaking FDs, leading to all kernels stopping due to "Too many open files"

3 participants