Skip to content

gh-148318: Fix multiprocessing exitcode=None after join() in multithreaded programs#150360

Closed
jlaportebot wants to merge 1 commit into
python:mainfrom
jlaportebot:fix/multiprocessing-waitpid-race
Closed

gh-148318: Fix multiprocessing exitcode=None after join() in multithreaded programs#150360
jlaportebot wants to merge 1 commit into
python:mainfrom
jlaportebot:fix/multiprocessing-waitpid-race

Conversation

@jlaportebot
Copy link
Copy Markdown

@jlaportebot jlaportebot commented May 24, 2026

Summary

When multiple threads concurrently join() fork-based child processes, os.waitpid() can be called by one thread for a PID that another thread is also waiting on. Since waitpid() reaps the child process-wide, the second thread's call raises OSError(ECHILD) (errno=10), leaving Process.exitcode stuck at None.

Fixes #148318.

Root Cause

os.waitpid() is a process-level operation — it reaps any matching child, regardless of which thread called it. When two threads simultaneously call join() on different processes, the following race can occur:

  1. Thread A calls poll()os.waitpid(pid_A) → succeeds, reaps child A
  2. Thread B calls poll()os.waitpid(pid_B) → but the kernel may return the status for child A (already reaped by Thread A) → raises OSError(ECHILD)
  3. Thread B's poll() returns None (from the except OSError branch), so self.returncode stays None
  4. join() returns, but exitcode remains None

Fix

Add a class-level threading.Lock around the os.waitpid() call in Popen.poll(). Under the lock, we double-check self.returncode in case another thread already set it while we waited for the lock.

The lock is placed only around the waitpid call and the immediate returncode check, not around the wait() method's sentinel-wait, so it does not add significant contention.

Reproducer

import multiprocessing as mp
import sys
import threading

sys.setswitchinterval(1e-6)
ctx = mp.get_context("fork")
failures = []

def lifecycle(n_iters=100, n_procs=4):
    for _ in range(n_iters):
        ps = [ctx.Process(target=int, daemon=True) for _ in range(n_procs)]
        for p in ps:
            p.start()
        for p in ps:
            p.join()
            if p.exitcode is None:
                failures.append(p.pid)

ts = [threading.Thread(target=lifecycle) for _ in range(2)]
for t in ts:
    t.start()
for t in ts:
    t.join()
print(f"{len(failures)} had exitcode=None after join()")
# Before fix: ~10-40 failures per run
# After fix: 0 failures

Impact

Only affects popen_fork.Popen (the "fork" start method on Unix). The "spawn" methods use different wait mechanisms and are not affected.

The lock is a class attribute, so all Popen instances share it — this ensures cross-instance synchronization. The lock is cheap (uncontended fast path is ~20ns on Linux) and only held for the duration of a single waitpid syscall.

…pythongh-148318)

When multiple threads join() fork-based child processes concurrently,
os.waitpid() can be called by one thread for a PID that another thread
is also waiting on. Since waitpid() reaps the child process-wide, the
second thread's call raises OSError(ECHILD), leaving Process.exitcode
stuck at None.

Add a class-level threading.Lock around the os.waitpid() call in
Popen.poll() so that only one thread performs waitpid at a time, with
a double-check of self.returncode inside the lock to avoid missing
a result set by another thread.

Fixes pythongh-148318.
@jlaportebot jlaportebot requested a review from gpshead as a code owner May 24, 2026 17:05
@bedevere-app
Copy link
Copy Markdown

bedevere-app Bot commented May 24, 2026

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

@python-cla-bot
Copy link
Copy Markdown

python-cla-bot Bot commented May 24, 2026

All commit authors signed the Contributor License Agreement.

CLA signed

@picnixz
Copy link
Copy Markdown
Member

picnixz commented May 24, 2026

We don't accept PRs by automations. According to your profile this is such case: "AI Agent for code contributions and automation"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Multiprocessing exitcode=None after join

2 participants