
multiprocessing.Event.set() can deadlock when called from a signal handler or trace function #126434

Open
ivarref opened this issue Nov 5, 2024 · 11 comments
Labels
stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error

Comments

@ivarref

ivarref commented Nov 5, 2024

Bug report

Bug description:

multiprocessing.Event.set() acquires a lock when setting the internal flag, and multiprocessing.Event.is_set() acquires the same lock when checking it. Thus, if a signal handler calls .set() while .is_set() is holding the lock in the same thread, the process deadlocks.

multiprocessing.Event uses a regular, non-reentrant lock. This should be changed to a reentrant lock. Please see the pull request.

Thanks for all the work on the Python programming language. I appreciate all your efforts highly.

Kind regards.

Below is an example program that (sometimes) deadlocks. On my machine I typically need fewer than 10 runs before a deadlock occurs. A sample stacktrace is also included in the code block.

import faulthandler
import multiprocessing
import os
import signal


def run_buggy():
    shutdown_event = multiprocessing.Event()

    def sigterm_handler(_signo, _stack_frame):
        try:
            print('sigterm_handler running')
            shutdown_event.set()
        finally:
            print('sigterm_handler done')

    signal.signal(signal.SIGTERM, sigterm_handler)
    signal.signal(signal.SIGINT, sigterm_handler)
    faulthandler.register(signal.SIGUSR1)

    print(f'Running process with PID {os.getpid()}')
    print('Dump the stack by executing:')
    print(f'kill -SIGUSR1 {os.getpid()}')
    print(f'Try to kill this process with CTRL-C or kill {os.getpid()}')
    while not shutdown_event.is_set():
        pass

if __name__ == '__main__':
    run_buggy()
# Sample stacktrace:
# Current thread 0x00000001ed74cf40 (most recent call first):
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 95 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 237 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 342 in set
#   File "/Users/ire/code/cpython/./bug.py", line 13 in sigterm_handler
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 95 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 237 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 335 in is_set
#   File "/Users/ire/code/cpython/./bug.py", line 25 in run_buggy
#   File "/Users/ire/code/cpython/./bug.py", line 29 in <module>

CPython versions tested on:

3.11, 3.12, CPython main branch

Operating systems tested on:

Linux, macOS

Linked PRs

@ivarref ivarref added the type-bug An unexpected behavior, bug, or error label Nov 5, 2024
@Zheaoli
Contributor

Zheaoli commented Nov 5, 2024

Bug confirmed on main branch

@Zheaoli
Contributor

Zheaoli commented Nov 5, 2024

@Eclips4 We need a tag for this issue; I think topic-multiprocessing would be OK.

BTW @ivarref are you already working on a PR?

@ivarref
Author

ivarref commented Nov 5, 2024

Hi @Zheaoli

Thanks for the quick response. I've pushed a pull request now. What do you think?

Thanks and kind regards.

@ivarref
Author

ivarref commented Nov 5, 2024

Update: I've added a blurb entry.

@colesbury
Contributor

I commented on the PR, but I don't think multiprocessing.Event is safe to call from a Python signal handler, even with an RLock. I suspect that's true of a lot of Python's concurrency primitives.

The general problem is that Python can execute signal handling code asynchronously at many places that don't expect to support reentrancy.

RLocks don't solve this problem because the things they protect are often not safe to call reentrantly. In other words, calling them reentrantly breaks invariants that a (non-reentrant) lock would otherwise protect.
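To make that concrete, here is a toy illustration (a hypothetical `Pair` class; the reentrant call is simulated directly rather than triggered by a real signal):

```python
import threading

class Pair:
    # Intended invariant: self.a == self.b whenever update() is not running.
    def __init__(self):
        self._lock = threading.RLock()
        self.a = 0
        self.b = 0

    def update(self, reenter=None):
        with self._lock:
            self.a += 1
            # Imagine a signal handler firing here and calling snapshot() on
            # the same object: because RLock is reentrant and we are on the
            # same thread, the acquire succeeds and the handler observes the
            # broken invariant a != b. A non-reentrant lock would deadlock
            # instead, but it would never expose this half-finished state.
            if reenter is not None:
                reenter()
            self.b += 1

    def snapshot(self):
        with self._lock:
            return (self.a, self.b)

p = Pair()
seen = []
p.update(reenter=lambda: seen.append(p.snapshot()))
print(seen)          # [(1, 0)]: the reentrant reader saw a half-finished update
print(p.snapshot())  # (1, 1)
```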

@ZeroIntensity ZeroIntensity added the stdlib Python modules in the Lib dir label Nov 5, 2024
@gpshead gpshead assigned gpshead and unassigned gpshead Nov 7, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 11, 2024
…to avoid deadlock when there is reentrant usage of `set` from `is_set`, e.g. when handling system signals
ivarref added a commit to ivarref/cpython that referenced this issue Nov 12, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 13, 2024
…it()-ing. Raise an exception if that is the case. Fix race condition in multiprocessing.Event.wait() as described in python#95826
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
…e called from a signal handler in any combination.
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
…set() call must be made to reach race condition possibility.
@gpshead gpshead changed the title multiprocessing.Event.set() can be deadlocked by .is_set when called by a sigterm handler multiprocessing.Event.set() can deadlock when called from a signal handler or trace function Nov 14, 2024
@gpshead
Member

gpshead commented Nov 14, 2024

If people want workarounds for existing code: #85772 (comment) is a good starting point. As much as I dislike any recommendation of spawning a thread to do the Event.set() from the Python interrupting context, it works.
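For concreteness, a minimal sketch of that thread-spawning workaround (the names here are illustrative, not taken from the linked comment):

```python
import multiprocessing
import signal
import threading

shutdown_event = multiprocessing.Event()

def sigterm_handler(_signo, _stack_frame):
    # Hand the set() off to a short-lived thread. Even if this handler has
    # interrupted main-thread code that currently holds the Event's internal
    # lock, the handler itself returns immediately; the helper thread simply
    # waits for the lock and sets the flag once it is released.
    threading.Thread(target=shutdown_event.set).start()

signal.signal(signal.SIGTERM, sigterm_handler)
```

The main loop can then poll shutdown_event.is_set() as in the reproducer above without risk of this deadlock.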

Something I believe could be done to Event instead is:

  • Know if the thread its methods (the ones that do `with self._cond:` today) are being called from is one in which Python signal handlers will be invoked. (This means "we're the main thread"; I'm being overly abstract.)
  • If so, punt the method call to a different thread and wait on the result to come back from that thread. If not, continue in the existing thread.

I'd rather we weren't in the business of spawning threads from multiprocessing at all, but I believe this will work, and work transparently. It is a new feature; we'd not backport it as a bug fix. But it also lines up nicely with 3.14's existing change of the default multiprocessing start method away from the threading-unsafe "fork".

We could even adopt this conditionally based on the multiprocessing ctx.get_start_method() in order to avoid the fork multithreaded process issues. Meaning the deadlock would still be possible for anyone using the "fork" start method. Not a huge deal given that "fork" and threading can already lead to other deadlocks? That'd just need documenting.

(we can ignore the sys.settrace function use case I put in the title, that is an intentional footgun for anyone using it - I put it in the title for completeness as the issue is ultimately caused by anything diverting to execute other Python code at arbitrary points within one thread)

@colesbury
Contributor

colesbury commented Nov 14, 2024

I think the idea of calling Event.set() from a different thread makes sense, but making that decision inside Event seems like playing whack-a-mole with asynchronous deadlock issues. This is an instance of a general problem: it's the same issue that comes up with GC finalizers running in the application thread and it's solved by running them in a dedicated thread 1.

I think we should do something similar with Python signal handling: instead of handling signals on the main thread, Python should handle signals on a dedicated thread, perhaps as a configurable option to signal.signal.

Footnotes

  1. The mainstream programming runtimes that support threads and use tracing GC (Java, Go, .NET) do this, with the exception of CPython.

@gpshead
Member

gpshead commented Nov 14, 2024

Those thoughts are in part why I was being abstract about where things are handled. Conversations about the signal handler being done in another thread have been slowly recurring over the years. Good to know that other languages actually do this.

A challenge if we adopt this approach is that people expect, as an API, that exceptions escaping a signal handler are injected into the main thread. That seems solvable even if handler execution happens on another thread: keep the existing pending-signal check logic and turn it into a "pending exception from a signal handler to reraise" check.

In this world, which thread signals are handled in and which thread is interrupted by exceptions escaping a signal handler could be independent, potentially configurable concepts (though I'd prefer to avoid letting people do unusual things with such process-global settings, for consistency of expectations).

@colesbury
Contributor

Good to know that other languages actually do this.

My comment may not have been clear: I meant that other languages run finalizers in a dedicated thread, although .NET also appears to run signal handlers in a different thread. Go apparently does signal handling via channels, which allows the user to control the thread. As far as I can tell, the Java standard library does not provide a way for applications to handle UNIX signals.

... exceptions escaping a signal handler are injected into the main thread...

This would reintroduce the deadlocks and corruption issues that using a separate thread avoids. Python code isn't robust to exceptions being raised at arbitrary places. For example, it can corrupt the state of threading.Condition or anything else that relies on try/finally or context managers with __exit__. (If the exception is asynchronously raised at the start of a finally block, then the remainder of the finally block won't run.)

@ivarref
Author

ivarref commented Nov 15, 2024

@gpshead

If so, punt the method call to a different thread and wait on the result to come back from that thread. If not, continue in the existing thread.

I don't think that would quite work. Imagine the following:

sighandler_executor = ThreadPoolExecutor(max_workers=1)

main-thread condition.wait()
main-thread SIGINT
main-thread fut = sighandler_executor.submit(...)
main-thread fut.result()

sighandler_executor-thread condition.notify_all()
sighandler_executor-thread condition.notify: self._woken_count.acquire() # wait for a sleeper to wake

# the sleeper (main-thread) will never wake (execute `self._woken_count.release()`) 
# because it is waiting for the sighandler_executor-thread result.
# => deadlock

Thus, you would need to fire and forget (not wait for the result). That would potentially cause other problems.


How about adding the following argument to signal.signal in 3.14:

signal.signal(signalnum, handler, dedicated_thread=True)

If signal.signal is never called, what should the default behaviour be?

Java has a method called addShutdownHook. Its input parameter is a not-yet-started thread, which the JVM starts and runs when it is about to exit, for example upon receiving SIGINT. Thus SIGINT is not processed inside/on top of already-running code.

Java's Thread.sleep will throw an InterruptedException if the thread is, well, interrupted. This will happen on SIGINT as far as I know.

Thus I suppose that Python's time.sleep and similar could also raise an exception if SIGINT is received and there is no signal handler. I suppose the default signal handler could set some sort of interrupt flag and/or have functions like time.sleep raise exceptions. I don't see any fields for it in e.g. Thread though (unlike in Java).
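(As a point of comparison, Python's default SIGINT handler already interrupts time.sleep in the main thread by raising KeyboardInterrupt there; a quick demonstration, assuming a POSIX system:)

```python
import os
import signal
import threading
import time

# With the default SIGINT handler installed, time.sleep() in the main thread
# behaves loosely like Java's InterruptedException: the sleep is cut short
# and KeyboardInterrupt is raised in the main thread.
signal.signal(signal.SIGINT, signal.default_int_handler)
threading.Timer(0.2, lambda: os.kill(os.getpid(), signal.SIGINT)).start()

start = time.monotonic()
try:
    time.sleep(10)
    interrupted = False
except KeyboardInterrupt:
    interrupted = True
elapsed = time.monotonic() - start
```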


A dedicated signal thread that can only process one signal at a time seems like the right approach to me. But what is the possible breakage here, given that, as @gpshead writes, people "expect as an API that exceptions escaping a signal handler are injected into the main thread"?

How about this for the default behaviour in 3.14: signal handling runs on a dedicated thread, and SIGINT etc. cause thread(s) to shut down if no signal handler is installed.

Is Python's current default behaviour, with no signal handlers installed, to interrupt the main thread upon receiving SIGINT and let the other non-daemon threads finish as they please?

Perhaps there is not that much to worry about: if existing code does no signal handling, or raises exceptions from its signal handler (which I suppose amounts to roughly the same thing), is it fine to shut down the program (or only the main thread?) in that case? The only difference is that the shutdown initiation comes from a different thread.

I don't know much about Python's shutdown, signals and threads, so I'm uncertain how helpful my comments are, and how much work it would be to implement the suggestions described here. Thanks and kind regards even so :-)

ivarref added a commit to ivarref/cpython that referenced this issue Nov 28, 2024
@ivarref
Author

ivarref commented Nov 28, 2024

Hi again @colesbury and @gpshead

I've done a new take on this. It effectively runs signal handlers on a dedicated thread*.

The new signal.py code is highlighted here.

  • It is backwards compatible if the argument use_dedicated_thread=False is passed to signal.signal. The default is True.

  • It will bubble an uncaught exception in a signal handler up to the main thread. (Do you think this makes sense?)

  • The change is only in Python code. The original C code from signalmodule.c is left untouched.

* It uses queue.SimpleQueue to achieve this, as queue.SimpleQueue.put is reentrant and can thus be used from actual signal handlers.
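For readers who don't want to follow the link, a rough sketch of the mechanism (hypothetical helper names, not the code in the branch): the handler installed via signal.signal is only a reentrant-safe trampoline that enqueues the signal, and a dedicated daemon thread runs the user's handler.

```python
import queue
import signal
import threading

_pending = queue.SimpleQueue()  # SimpleQueue.put is documented as reentrant
_user_handlers = {}

def _dispatch_loop():
    # Runs user signal handlers one at a time, on this dedicated thread,
    # so they never interrupt main-thread code mid-critical-section.
    while True:
        signum, frame = _pending.get()
        handler = _user_handlers.get(signum)
        if handler is not None:
            handler(signum, frame)

threading.Thread(target=_dispatch_loop, daemon=True).start()

def install(signum, handler):
    # Hypothetical stand-in for signal.signal(signum, handler): the actual
    # Python-level handler only does a SimpleQueue.put, which is safe to
    # call from an asynchronous signal context.
    _user_handlers[signum] = handler
    signal.signal(signum, lambda s, f: _pending.put((s, f)))
```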

Here is an example program that uses the new signal handler:

import os
import signal
import threading
import time
import multiprocessing

def sigint_self():
    time.sleep(1)
    print(f'{threading.current_thread().name}: Stopping PID {os.getpid()} with SIGINT')
    os.kill(os.getpid(), signal.SIGINT)

def sigkill_self():
    time.sleep(5)
    print(f'{threading.current_thread().name}: Killing PID {os.getpid()} with SIGKILL')
    os.kill(os.getpid(), signal.SIGKILL)

def run_signal_handler_dedicated_thread():
    event = multiprocessing.Event()
    def sigint_handler(_signo, _stack_frame):
        try:
            # x = 1 / 0
            # ^^ If uncommented, the uncaught exception will be bubbled to the main thread.
            print(f'{threading.current_thread().name}: sigint_handler is setting event')
            event.set() # This would deadlock using the old signal handler code
        finally:
            print(f'{threading.current_thread().name}: sigint_handler is done')

    def sigterm_handler(_signo, _stack_frame):
        print(f'{threading.current_thread().name}: sigterm_handler is running')

    signal.signal(signal.SIGTERM, sigterm_handler)
    signal.signal(signal.SIGINT, sigint_handler)

    threading.Thread(target=sigint_self, daemon=True).start()
    threading.Thread(target=sigkill_self, daemon=True).start() # Used for debugging only.

    print(f'{threading.current_thread().name}: Waiting on event. PID = {os.getpid()}')
    event.wait()
    print(f'{threading.current_thread().name}: Waiting is done')

if __name__ == '__main__':
    try:
        run_signal_handler_dedicated_thread()
    finally:
        print(f'{threading.current_thread().name}: Exiting')

The committed code could certainly be cleaned up, and no tests currently exist; I'd be happy to add them. But first I'd like to ask:

Is this a viable path forward, using only Python to implement this and leaving the old C code as-is**? Are there other considerations that must be addressed?

** I don't think I'm competent enough with C and signals to write new C code.

Thanks and kind regards.

ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024