
multiprocessing.Event.set() can deadlock when called from a signal handler or trace function #126434

Open
ivarref opened this issue Nov 5, 2024 · 11 comments
Labels
stdlib Python modules in the Lib dir topic-multiprocessing type-bug An unexpected behavior, bug, or error

Comments

@ivarref

ivarref commented Nov 5, 2024

Bug report

Bug description:

multiprocessing.Event.set() acquires a lock when setting the internal flag, and multiprocessing.Event.is_set() acquires the same lock when checking it. Thus, if a signal handler calls .set() while .is_set() is holding the lock in the same thread, the process deadlocks.

multiprocessing.Event uses a regular, non-reentrant lock. This should be changed to a reentrant lock. Please see the pull request.

Thanks for all the work on the Python programming language. I appreciate all your efforts highly.

Kind regards.

Below is an example program that (sometimes) deadlocks. On my machine I typically need fewer than 10 runs before a deadlock occurs. A sample stacktrace is also included in the code block.

import faulthandler
import multiprocessing
import os
import signal


def run_buggy():
    shutdown_event = multiprocessing.Event()

    def sigterm_handler(_signo, _stack_frame):
        try:
            print('sigterm_handler running')
            shutdown_event.set()
        finally:
            print('sigterm_handler done')

    signal.signal(signal.SIGTERM, sigterm_handler)
    signal.signal(signal.SIGINT, sigterm_handler)
    faulthandler.register(signal.SIGUSR1)

    print(f'Running process with PID {os.getpid()}')
    print('Dump the stack by executing:')
    print(f'kill -SIGUSR1 {os.getpid()}')
    print(f'Try to kill this process with CTRL-C or kill {os.getpid()}')
    while not shutdown_event.is_set():
        pass

if __name__ == '__main__':
    run_buggy()
# Sample stacktrace:
# Current thread 0x00000001ed74cf40 (most recent call first):
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 95 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 237 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 342 in set
#   File "/Users/ire/code/cpython/./bug.py", line 13 in sigterm_handler
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 95 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 237 in __enter__
#   File "/Users/ire/code/cpython/Lib/multiprocessing/synchronize.py", line 335 in is_set
#   File "/Users/ire/code/cpython/./bug.py", line 25 in run_buggy
#   File "/Users/ire/code/cpython/./bug.py", line 29 in <module>

CPython versions tested on:

3.11, 3.12, CPython main branch

Operating systems tested on:

Linux, macOS

Linked PRs

@ivarref ivarref added the type-bug An unexpected behavior, bug, or error label Nov 5, 2024
@Zheaoli
Contributor

Zheaoli commented Nov 5, 2024

Bug confirmed on main branch

@Zheaoli
Contributor

Zheaoli commented Nov 5, 2024

@Eclips4 We need a tag for this issue; I think topic-multiprocessing would be OK.

BTW @ivarref are you already working on a PR?

@ivarref
Author

ivarref commented Nov 5, 2024

Hi @Zheaoli

Thanks for the quick response. I've pushed a pull request now. What do you think?

Thanks and kind regards.

@ivarref
Author

ivarref commented Nov 5, 2024

Update: I've added a blurb entry.

@colesbury
Contributor

I commented on the PR, but I don't think multiprocessing.Event is safe to call from a Python signal handler, even with an RLock. I suspect that's true of a lot of Python's concurrency primitives.

The general problem is that Python can execute signal handling code asynchronously at many places that don't expect to support reentrancy.

RLocks don't solve this problem because the things they protect are often not safe to call reentrantly. In other words, calling them reentrantly breaks invariants that a (non-reentrant) lock would otherwise protect.
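To make that concrete, here is a toy illustration (a hypothetical `Pair` class; the reentrant call is simulated directly rather than triggered by a real signal):

```python
import threading

class Pair:
    # Intended invariant: self.a == self.b whenever update() is not running.
    def __init__(self):
        self._lock = threading.RLock()
        self.a = 0
        self.b = 0

    def update(self, reenter=None):
        with self._lock:
            self.a += 1
            # Imagine a signal handler firing here and calling snapshot() on
            # the same object: because RLock is reentrant and we are on the
            # same thread, the acquire succeeds and the handler observes the
            # broken invariant a != b. A non-reentrant lock would deadlock
            # instead, but it would never expose this half-finished state.
            if reenter is not None:
                reenter()
            self.b += 1

    def snapshot(self):
        with self._lock:
            return (self.a, self.b)

p = Pair()
seen = []
p.update(reenter=lambda: seen.append(p.snapshot()))
print(seen)          # [(1, 0)]: the reentrant reader saw a half-finished update
print(p.snapshot())  # (1, 1)
```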

@ZeroIntensity ZeroIntensity added the stdlib Python modules in the Lib dir label Nov 5, 2024
@gpshead gpshead assigned gpshead and unassigned gpshead Nov 7, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 11, 2024
…to avoid deadlock when there is reentrant usage of `set` from `is_set`, e.g. when handling system signals
ivarref added a commit to ivarref/cpython that referenced this issue Nov 12, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 13, 2024
…it()-ing. Raise an exception if that is the case. Fix race condition in multiprocessing.Event.wait() as described in python#95826
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
…e called from a signal handler in any combination.
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Nov 14, 2024
…set() call must be made to reach race condition possibility.
@gpshead gpshead changed the title multiprocessing.Event.set() can be deadlocked by .is_set when called by a sigterm handler multiprocessing.Event.set() can deadlock when called from a signal handler or trace function Nov 14, 2024
@gpshead
Member

gpshead commented Nov 14, 2024

If people want workarounds for existing code: #85772 (comment) is a good starting point. As much as I dislike any recommendation of spawning a thread to do the Event.set() from the Python interrupting context, it works.
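For concreteness, a minimal sketch of that thread-spawning workaround (the names here are illustrative, not taken from the linked comment):

```python
import multiprocessing
import signal
import threading

shutdown_event = multiprocessing.Event()

def sigterm_handler(_signo, _stack_frame):
    # Hand the set() off to a short-lived thread. Even if this handler has
    # interrupted main-thread code that currently holds the Event's internal
    # lock, the handler itself returns immediately; the helper thread simply
    # waits for the lock and sets the flag once it is released.
    threading.Thread(target=shutdown_event.set).start()

signal.signal(signal.SIGTERM, sigterm_handler)
```

The main loop can then poll shutdown_event.is_set() as in the reproducer above without risk of this deadlock.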

Something I believe could be done to Event instead is:

  • Know if the thread its methods (the ones that do `with self._cond:` today) are being called from is one in which Python signal handlers will be invoked. (This means "we're the main thread"; I'm being overly abstract.)
  • If so, punt the method call to a different thread and wait on the result to come back from that thread. If not, continue in the existing thread.

I'd rather we weren't in the business of spawning threads from multiprocessing at all, but I believe this will work, and work transparently. It is a new feature; we'd not backport it as a bug fix. But it also lines up nicely with 3.14's existing change of the default multiprocessing start method away from the threading-unsafe "fork".

We could even adopt this conditionally based on the multiprocessing ctx.get_start_method() in order to avoid the fork multithreaded process issues. Meaning the deadlock would still be possible for anyone using the "fork" start method. Not a huge deal given that "fork" and threading can already lead to other deadlocks? That'd just need documenting.

(we can ignore the sys.settrace function use case I put in the title, that is an intentional footgun for anyone using it - I put it in the title for completeness as the issue is ultimately caused by anything diverting to execute other Python code at arbitrary points within one thread)

@colesbury
Contributor

colesbury commented Nov 14, 2024

I think the idea of calling Event.set() from a different thread makes sense, but making that decision inside Event seems like playing whack-a-mole with asynchronous deadlock issues. This is an instance of a general problem: it's the same issue that comes up with GC finalizers running in the application thread and it's solved by running them in a dedicated thread 1.

I think we should do something similar with Python signal handling: instead of handling signals on the main thread, Python should handle signals on a dedicated thread, perhaps as a configurable option to signal.signal.

Footnotes

  1. The mainstream programming runtimes that support threads and use tracing GC (Java, Go, .NET) do this, with the exception of CPython.

@gpshead
Member

gpshead commented Nov 14, 2024

Those thoughts are in part why I was being abstract about where things are handled. Conversations about the signal handler being done in another thread have been slowly recurring over the years. Good to know that other languages actually do this.

A challenge if we adopt this approach is that people expect, as an API, that exceptions escaping a signal handler are injected into the main thread. That seems solvable even if handler execution happens on another thread: keep the existing pending-signal check logic and turn it into a "pending exception from a signal handler to reraise" check.

In this world, which thread signals are handled in and which thread is interrupted by exceptions escaping a signal handler could be independent, potentially configurable concepts (though I'd prefer to avoid letting people do unusual things with such process-global settings, for consistency of expectations).

@colesbury
Contributor

Good to know that other languages actually do this.

My comment may not have been clear: I meant that other languages run finalizers in a dedicated thread, although .NET also appears to run signal handlers in a different thread. Go apparently does signal handling via channels, which allows the user to control the thread. As far as I can tell, the Java standard library does not provide a way for applications to handle UNIX signals.

... exceptions escaping a signal handler are injected into the main thread...

This would reintroduce the deadlocks and corruption issues that using a separate thread avoids. Python code isn't robust to exceptions being raised at arbitrary places. For example, it can corrupt the state of threading.Condition or anything else that relies on try/finally or context managers with __exit__. (If the exception is asynchronously raised at the start of a finally block, then the remainder of the finally block won't run.)

@ivarref
Author

ivarref commented Nov 15, 2024

@gpshead

If so, punt the method call to a different thread and wait on the result to come back from that thread. If not, continue in the existing thread.

I don't think that would quite work. Imagine the following:

sighandler_executor = ThreadPoolExecutor(max_workers=1)

main-thread condition.wait()
main-thread SIGINT
main-thread fut = sighandler_executor.submit(...)
main-thread fut.result()

sighandler_executor-thread condition.notify_all()
sighandler_executor-thread condition.notify: self._woken_count.acquire() # wait for a sleeper to wake

# the sleeper (main-thread) will never wake (execute `self._woken_count.release()`) 
# because it is waiting for the sighandler_executor-thread result.
# => deadlock

Thus, you would need to fire and forget (not wait for the result). That would potentially cause other problems.


How about adding the following argument to signal.signal in 3.14:

signal.signal(signalnum, handler, dedicated_thread=True)

If signal.signal is never called, what should the default behaviour be?

Java has a method called addShutdownHook. Its input parameter is a not-yet-started thread, which the JVM starts and runs when it is about to exit, for example upon receiving SIGINT. Thus SIGINT is not processed inside/on top of already-running code.

Java's Thread.sleep will throw an InterruptedException if the thread is, well, interrupted. This will happen on SIGINT as far as I know.

Thus I suppose that Python's time.sleep and similar could also raise an exception if SIGINT is received and there is no signal handler. I suppose the default signal handler could set some sort of interrupt flag and/or have functions like time.sleep raise exceptions. I don't see any fields for it in e.g. Thread though (unlike in Java).
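(As a point of comparison, Python's default SIGINT handler already interrupts time.sleep in the main thread by raising KeyboardInterrupt there; a quick demonstration, assuming a POSIX system:)

```python
import os
import signal
import threading
import time

# With the default SIGINT handler installed, time.sleep() in the main thread
# behaves loosely like Java's InterruptedException: the sleep is cut short
# and KeyboardInterrupt is raised in the main thread.
signal.signal(signal.SIGINT, signal.default_int_handler)
threading.Timer(0.2, lambda: os.kill(os.getpid(), signal.SIGINT)).start()

start = time.monotonic()
try:
    time.sleep(10)
    interrupted = False
except KeyboardInterrupt:
    interrupted = True
elapsed = time.monotonic() - start
```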


A dedicated signal thread that can only process one signal at a time seems like the right approach to me. But what is the possible breakage here, given that, as @gpshead writes, people "expect as an API that exceptions escaping a signal handler are injected into the main thread"?

How about this for the default behaviour in 3.14: signal handling runs on a dedicated thread, and SIGINT etc. cause thread(s) to shut down if no signal handler is installed.

Is Python's current default behaviour, with no signal handlers installed, to interrupt the main thread upon receiving SIGINT and let the other non-daemon threads finish as they please?

Perhaps there is not that much to worry about: if existing code does no signal handling, or raises exceptions from its signal handler (which I suppose amounts to roughly the same thing), is it fine to shut down the program (or only the main thread?) in that case? The only difference is that the shutdown initiation comes from a different thread.

I don't know much about Python's shutdown, signals and threads, so I'm uncertain how helpful my comments are, and how much work it would be to implement the suggestions described here. Thanks and kind regards even so :-)

ivarref added a commit to ivarref/cpython that referenced this issue Nov 28, 2024
@ivarref
Author

ivarref commented Nov 28, 2024

Hi again @colesbury and @gpshead

I've done a new take on this. It effectively runs signal handlers on a dedicated thread*.

The new signal.py code is highlighted here.

  • It is backwards compatible if the argument use_dedicated_thread=False is passed to signal.signal. The default is True.

  • It will bubble an uncaught exception in a signal handler up to the main thread. (Do you think this makes sense?)

  • The change is only in Python code. The original C code from signalmodule.c is left untouched.

* It uses queue.SimpleQueue to achieve this, as queue.SimpleQueue.put is reentrant and can thus be used from actual signal handlers.
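For readers who don't want to follow the link, a rough sketch of the mechanism (hypothetical helper names, not the code in the branch): the handler installed via signal.signal is only a reentrant-safe trampoline that enqueues the signal, and a dedicated daemon thread runs the user's handler.

```python
import queue
import signal
import threading

_pending = queue.SimpleQueue()  # SimpleQueue.put is documented as reentrant
_user_handlers = {}

def _dispatch_loop():
    # Runs user signal handlers one at a time, on this dedicated thread,
    # so they never interrupt main-thread code mid-critical-section.
    while True:
        signum, frame = _pending.get()
        handler = _user_handlers.get(signum)
        if handler is not None:
            handler(signum, frame)

threading.Thread(target=_dispatch_loop, daemon=True).start()

def install(signum, handler):
    # Hypothetical stand-in for signal.signal(signum, handler): the actual
    # Python-level handler only does a SimpleQueue.put, which is safe to
    # call from an asynchronous signal context.
    _user_handlers[signum] = handler
    signal.signal(signum, lambda s, f: _pending.put((s, f)))
```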

Here is an example program that uses the new signal handler:

import os
import signal
import threading
import time
import multiprocessing

def sigint_self():
    time.sleep(1)
    print(f'{threading.current_thread().name}: Stopping PID {os.getpid()} with SIGINT')
    os.kill(os.getpid(), signal.SIGINT)

def sigkill_self():
    time.sleep(5)
    print(f'{threading.current_thread().name}: Killing PID {os.getpid()} with SIGKILL')
    os.kill(os.getpid(), signal.SIGKILL)

def run_signal_handler_dedicated_thread():
    event = multiprocessing.Event()
    def sigint_handler(_signo, _stack_frame):
        try:
            # x = 1 / 0
            # ^^ If uncommented, the uncaught exception will be bubbled to the main thread.
            print(f'{threading.current_thread().name}: sigint_handler is setting event')
            event.set() # This would deadlock using the old signal handler code
        finally:
            print(f'{threading.current_thread().name}: sigint_handler is done')

    def sigterm_handler(_signo, _stack_frame):
        print(f'{threading.current_thread().name}: sigterm_handler is running')

    signal.signal(signal.SIGTERM, sigterm_handler)
    signal.signal(signal.SIGINT, sigint_handler)

    threading.Thread(target=sigint_self, daemon=True).start()
    threading.Thread(target=sigkill_self, daemon=True).start() # Used for debugging only.

    print(f'{threading.current_thread().name}: Waiting on event. PID = {os.getpid()}')
    event.wait()
    print(f'{threading.current_thread().name}: Waiting is done')

if __name__ == '__main__':
    try:
        run_signal_handler_dedicated_thread()
    finally:
        print(f'{threading.current_thread().name}: Exiting')

The committed code could certainly be cleaned up, and no tests currently exist; I'd be happy to add them. But first I'd like to ask:

Is this a viable path forward, using only Python to implement this and leaving the old C code as-is**? Are there other considerations that must be addressed?

** I don't think I'm competent enough with C and signals to write new C code.

Thanks and kind regards.

ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024
ivarref added a commit to ivarref/cpython that referenced this issue Dec 3, 2024