
gh-126434: Use multiprocessing.Value for multiprocessing.Event to avoid deadlock when there is reentrant usage of set from is_set, e.g. when handling system signals #126437

Draft
wants to merge 14 commits into base: main

Conversation


@ivarref ivarref commented Nov 5, 2024

Please see issue #126434 for details about the bug and code to reproduce the deadlock.

Kind regards.

@ivarref ivarref requested a review from gpshead as a code owner November 5, 2024 11:03

cpython-cla-bot bot commented Nov 5, 2024

All commit authors signed the Contributor License Agreement.
CLA signed


bedevere-app bot commented Nov 5, 2024

Most changes to Python require a NEWS entry. Add one using the blurb_it web app or the blurb command-line tool.

If this change has little impact on Python users, wait for a maintainer to apply the skip news label instead.

Contributor

@Zheaoli Zheaoli left a comment


Mostly LGTM

BTW, this PR title needs to be changed.

@ivarref ivarref changed the title gh-126434: Fix multiprocessing.Event to use reentrant lock to support .set()… gh-126434: Use RLock for multiprocessing Event to avoid deadlock when handle system signal (#126437) Nov 5, 2024
@ivarref ivarref changed the title gh-126434: Use RLock for multiprocessing Event to avoid deadlock when handle system signal (#126437) gh-126434: Use RLock for multiprocessing Event to avoid deadlock when handle system signal Nov 5, 2024
Author

ivarref commented Nov 5, 2024

Mostly LGTM

BTW, this PR title needs to be changed.

I've changed the title @Zheaoli . What do you think? Thanks again.

Contributor

@Zheaoli Zheaoli left a comment


LGTM

Contributor

Zheaoli commented Nov 5, 2024

This patch fails the free-threading TSAN test; we need to figure out the root cause.

Member

tomasr8 commented Nov 5, 2024

This test seems to fail in other PRs as well; I restarted the job to see whether it fails consistently.

Contributor

@colesbury colesbury left a comment


The Event implementation is not correct with an RLock because that allows reentrancy (from the same thread) during operations that are not written to be reentrant. For example:

    def is_set(self):
        with self._cond:
            if self._flag.acquire(False):
                self._flag.release()
                return True
            return False

Between the self._flag.acquire(False) and the self._flag.release(), reentrant is_set() calls will incorrectly report False.

Event is not written to be safely called from a Python signal handler.


bedevere-app bot commented Nov 5, 2024

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

Contributor

Zheaoli commented Nov 5, 2024

Event is not written to be safely called from a Python signal handler.

If this is the original design behavior, should we update the docs?

Member

@ZeroIntensity ZeroIntensity left a comment


This needs a test, anyway.

Member


This needs rewriting to describe the current approach. Also, please update the PR title and description.

Author


Thanks. I've updated the description now.

What do you think @gpshead ?

@ivarref ivarref changed the title gh-126434: Use RLock for multiprocessing Event to avoid deadlock when handle system signal gh-126434: Use Value for multiprocessing Event to avoid deadlock when handling system signals Nov 11, 2024
Author

ivarref commented Nov 11, 2024

Thanks for the feedback from all of you: @colesbury, @gpshead, @Zheaoli, @tomasr8 and @ZeroIntensity.

I agree with the criticism that my initial suggestion for is_set had a bug.

I've written a new implementation. I believe this new approach solves the deadlock described in the original issue: a reentrant use of set() in combination with is_set().

I also added some asserts: instead of a deadlock, an assertion error will now be raised. What do you think about this? Personally I'd rather have an exception than a deadlock.

@ZeroIntensity How would you write a test for such a concurrency issue? I can only think of using some sort of metaprogramming, but at the moment I don't know enough Python to do it.

…to avoid deadlock when there is reentrant usage of `set` from `is_set`, e.g. when handling system signals
@ivarref ivarref changed the title gh-126434: Use Value for multiprocessing Event to avoid deadlock when handling system signals gh-126434: Use multiprocessing.Value for multiprocessing.Event to avoid deadlock when there is reentrant usage of set from is_set, e.g. when handling system signals Nov 11, 2024
Member

@ZeroIntensity How would you write a test for such a concurrency issue? I can only think of using some sort of metaprogramming, but at the moment I don't know enough Python to do it.

Here, you would run the reproducer from gh-126434 in a subprocess and check the return code + stdout/stderr. You can do this pretty easily with subprocess.
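
A minimal sketch of that suggestion (the file name and layout are hypothetical, and this uses plain subprocess rather than the test.support helpers suggested further down):

    import subprocess
    import sys
    import unittest

    class TestEventReproducer(unittest.TestCase):
        def test_reproducer_exits_cleanly(self):
            # Hypothetical: the reproducer from gh-126434 saved next to the test.
            proc = subprocess.run(
                [sys.executable, "is_set_set.py"],
                capture_output=True, text=True, timeout=60,
            )
            # A deadlock surfaces as a TimeoutExpired; a crash as a nonzero exit.
            self.assertEqual(proc.returncode, 0, proc.stderr)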

As a side note, don't force push--we squash everything down to one commit at the end anyway. Force pushing just makes it harder to keep track of what's going on.

Author

ivarref commented Nov 13, 2024

The key word here is stable. I'm not sure switching to Value is worth it for an edge case. If we do end up taking the change, I'm against backporting it for now.

What is the cost of not switching to Value (and this implementation)? Is it better to keep a deadlock just so that the code can keep using semaphores? I don't think keeping a deadlock is a good trade-off for keeping the internal implementation semaphore-only.

The newly pushed code will detect the situation and raise an exception when .set() is called from a thread that is already .wait()-ing. It uses a thread-local lock to accomplish this. I know that is not very "pretty", but it's better (IMO) than a deadlock, and also the only way to solve this (short of spawning a new thread).
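
A minimal sketch of that detection idea, using threading primitives for brevity (illustrative, not the PR's actual code):

    import threading

    class Event:
        def __init__(self):
            self._cond = threading.Condition()
            self._flag = False
            self._local = threading.local()  # per-thread "am I waiting?" marker

        def wait(self, timeout=None):
            self._local.waiting = True
            try:
                with self._cond:
                    return self._cond.wait_for(lambda: self._flag, timeout)
            finally:
                self._local.waiting = False

        def set(self):
            # Raise instead of deadlocking when called (e.g. from a signal
            # handler) on a thread that is currently inside wait().
            if getattr(self._local, "waiting", False):
                raise RuntimeError("set() called from a wait()-ing thread")
            with self._cond:
                self._flag = True
                self._cond.notify_all()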

Here is an issue describing this wait->set problem: #85772. The issue is from 2020 and links to a Stack Overflow question from 2014, so I think it's safe to say that these deadlocks have been around and affecting developers and programs for some time.

The newly pushed code also solves #95826. The root cause there was that multiprocessing.Event.wait() re-checked the internal flag after returning from the condition wait, incorrectly ignoring the return value of Condition.wait. If another thread executed multiprocessing.Event.clear() after the try block in Condition.wait had completed, but before its finally block re-acquired the locks, Event.wait() would re-check the internal flag and incorrectly return False, indicating that a timeout had occurred.
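
Sketched in threading terms (illustrative, not the actual multiprocessing code), the buggy pattern versus the fix, as drop-in variants of wait() for the Event sketch above:

    # Buggy (gh-95826): ignores the result of Condition.wait and re-checks
    # the flag, racing with a concurrent clear() that runs after the notify
    # but before the lock is re-acquired.
    def wait_buggy(self, timeout=None):
        with self._cond:
            if not self._flag:
                self._cond.wait(timeout)
            return self._flag  # may be False again -> bogus "timeout"

    # Fixed: report what actually happened during the wait.
    def wait_fixed(self, timeout=None):
        with self._cond:
            if self._flag:
                return True
            return self._cond.wait(timeout)  # True iff notified in time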

I've added tests so that these three issues are covered.

CC'ing users from the referenced issues, should they still be interested in this topic: @salvatoreingala @marcos4758 @mcclurem @mdboom

Thanks and kind regards.

Member

We're breaking users relying on the private API (which is their fault for doing so, but it's not exactly encouraging...), and this implementation doesn't seem correct, because multiprocessing.Value is not thread-safe. You'll have to put a lock over it, but won't that result in the same deadlock that we see currently?

Author

ivarref commented Nov 14, 2024

Thanks again for your time and comments @ZeroIntensity .

We're breaking users relying on the private API (which is their fault for doing, but it's not exactly encouraging...)

Right, that is fair enough.

My understanding was that plain writes and reads are atomic, while operations involving both a read and a write, e.g. an increment, are not; please correct me if I'm wrong (source for this claim). Anyhow: the locks are now taken explicitly.
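
For reference, a minimal example of explicit locking around multiprocessing.Value (a read-modify-write such as += is not atomic, while the per-Value lock from get_lock() makes it safe):

    from multiprocessing import Process, Value

    def bump(counter, n):
        for _ in range(n):
            with counter.get_lock():  # bare "counter.value += 1" can lose updates
                counter.value += 1

    if __name__ == "__main__":
        counter = Value("i", 0)
        procs = [Process(target=bump, args=(counter, 10_000)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(counter.value)  # 40000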

I rewrote the implementation now. My claim is that the new implementation is completely reentrant, i.e. it is safe to call from signal handlers. It should thus fix #85772 and #95826, as well as this issue (#126434). No exceptions are raised and no deadlocks should occur. There is a highly theoretical chance of a race condition that is described inside the code. I believe the chance is so low that it is safe to ignore. My guess is that an out of memory exception would occur long before the race condition could occur.

The big change is that instead of waiting on a condition, it now busy-waits on a flag and a set_id. Using a set_id avoids a race condition between set and clear: set will also work when invoked from a signal handler, and the wait()-ing thread will continue correctly. See the implementation for details.

The implementation sleeps for 10 milliseconds when busy-waiting. On my machine, a MacBook Air (15-inch, M3, 2024), CPU usage while wait()-ing is at most 0.5%. (Use top -p <pid> on Linux and top -pid <pid> on OS X to show CPU usage.) The increased CPU usage is the downside of this implementation (I haven't actually measured the CPU usage of the old implementation, but I assume it is much lower). But somewhat increased CPU usage is much better than a deadlock or an exception, in my opinion.
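
A condensed sketch of the busy-wait described above (illustrative, not the PR's exact code; flag and set_id stand for the shared Values):

    import time

    SPIN_SLEEP = 0.010  # the 10 ms sleep mentioned above

    def wait(flag, set_id, timeout=None):
        # Remember the "set generation" we started in: a set() followed by a
        # quick clear() still bumps set_id, so the wake-up is not lost.
        start_id = set_id.value
        deadline = None if timeout is None else time.monotonic() + timeout
        while True:
            if flag.value or set_id.value != start_id:
                return True
            if deadline is not None and time.monotonic() >= deadline:
                return False
            time.sleep(SPIN_SLEEP)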

What do you think?

Update: I've also updated the blurb.

Thanks and kind regards.


Hello, @ivarref!

A while ago, I decided to be happy and stopped using wait() 😁

Looking at your strategy I noticed that it is similar to what I did:

[screenshot of code omitted]

Before that, I tried using a wait on an event, which cost me hours of debugging and drained my patience.

I know it's not elegant, but it works.

Member

I'll look closer at this later today.

…set() call must be made to reach race condition possibility.
@ZeroIntensity ZeroIntensity added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Nov 14, 2024

🤖 New build scheduled with the buildbot fleet by @ZeroIntensity for commit bec9070 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Nov 14, 2024
Member

ZeroIntensity commented Nov 14, 2024

Yeah, the failing buildbots for test_multiprocessing_event make me very uneasy about the stability of the change.

script = support.findfile("is_set_set.py", subdir="multiprocessingdata")
for x in range(10):
    try:
        assert 0 == subprocess.call([sys.executable, script], timeout=60)
Member


Use test.support.assert_python_ok or assert_python_failure rather than launching sys.executable yourself; child interpreters may need flags passed, and those helpers take care of that.
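
A sketch of that suggestion, reusing the script variable from the snippet above (assert_python_ok lives in test.support.script_helper):

    from test.support.script_helper import assert_python_ok

    # Runs `script` in a fresh interpreter with the appropriate flags and
    # fails unless it exits with status 0.
    rc, out, err = assert_python_ok(script)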

@support.requires_subprocess()
class TestEventSignalHandling(unittest.TestCase):
    def test_no_race_for_is_set_set(self):
        import subprocess
Member


Put this import at the top; the module is always importable regardless of whether it actually works on the platform, which is what requires_subprocess tells you.

elif timeout is not None and (time.monotonic() - start_time) > timeout:
    return False
else:
    # Fixes https://github.com/python/cpython/issues/85772 by spinning and sleeping.
Member


Iff (let's not assume we should go this route - see my comment on the issue) we're going to abandon use of the native OS platform APIs which properly implement timeouts on inter-process semaphores without a busy loop via _multiprocessing.SemLock in Modules/_multiprocessing/semaphore.c, sleeping in the loop should be done in an exponential back-off fashion, as was the case in threading itself before we started using proper OS APIs there. See https://github.com/python/cpython/blob/v3.1.3/Lib/threading.py#L227 for the old exponential back-off delay example.
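
A sketch of that back-off pattern, modeled on the pre-3.2 threading.py loop linked above (illustrative only):

    import time

    def wait_with_backoff(predicate, timeout=None):
        delay = 0.0005  # start at 0.5 ms, as the old threading.py did
        endtime = None if timeout is None else time.monotonic() + timeout
        while not predicate():
            if endtime is not None:
                remaining = endtime - time.monotonic()
                if remaining <= 0:
                    return False
                delay = min(delay * 2, remaining, 0.05)  # cap at 50 ms
            else:
                delay = min(delay * 2, 0.05)
            time.sleep(delay)
        return True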

Doing a busy loop with sleeps as a low-level primitive in 2024 feels very wrong to me. Such loops have always been really unfriendly to both latency (unnecessary delays) and power usage (frequent unnecessary wakes).

I suggested an alternate idea in #126434.

History

Prior to CPython 3.2 threading.Condition was implemented with a back-off in a similar manner. 3.2 improved on that old hack by using the OS APIs for lock timeouts in 7c3e577

Author


Thanks @gpshead and @ZeroIntensity

I will read your comments and suggestions thoroughly tomorrow, and come back to you.

Thanks and kind regards.

@gpshead gpshead marked this pull request as draft November 14, 2024 16:52
Member

gpshead commented Nov 14, 2024

I marked this as Draft just to make it clear that the direction to go has not been decided upon yet.

Member

If we do go with this, might it be better to put it under a new feature rather than Event (perhaps something like multiprocessing.SignalSafeEvent)?

Author

ivarref commented Nov 28, 2024

I rewrote the approach to use a dedicated thread for signal handling. The initial signal handling is still done on the main thread, but it merely pushes the signal to a dedicated thread. This is done using queue.SimpleQueue, whose .put is reentrant. See #126434 (comment) for an example program that uses the new signal handler.
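
A condensed sketch of that mechanism (names are hypothetical; handle_signal stands in for the user-specified handler):

    import queue
    import signal
    import threading

    _signal_queue = queue.SimpleQueue()

    def _trampoline(signum, frame):
        # Runs in the main thread's signal context; SimpleQueue.put is
        # reentrant, so this is safe even if it interrupts other code.
        _signal_queue.put(signum)

    def _worker():
        while True:
            signum = _signal_queue.get()
            handle_signal(signum)  # hypothetical user-specified handler

    threading.Thread(target=_worker, daemon=True).start()
    signal.signal(signal.SIGINT, _trampoline)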

As asked in the issue comment above: is this a viable path forward, implementing this in Python alone and leaving the current C code as is? Are there other considerations that must be taken into account?

Author

ivarref commented Dec 3, 2024

The current approach is now:

  • Run the signal handler on a dedicated daemon thread.
  • Send signals from the main thread to the dedicated signal thread using SimpleQueue.put, which supports reentrant usage.
  • If an unexpected exception occurs inside the signal handler, it will be logged using traceback.print_exc().

Example flow:

sequenceDiagram
    participant OS
    participant MainThread
    participant SignalHandlerThread
    OS->>MainThread: SIGINT
    MainThread->>SignalHandlerThread: send SIGINT using SimpleQueue.put
    SignalHandlerThread->>SignalHandlerThread: invoke user specified handler
    SignalHandlerThread->>SignalHandlerThread: handler calls event.set()

Discussion: what should be done in the event of a signal handler calling sys.exit()? If nothing is done when sys.exit is invoked from a signal thread, I suppose this will break a decent amount of code?

Current fix: send an exit event to the main thread and send the same signal to interrupt the main thread again, i.e. the reverse flow of the diagram above:

sequenceDiagram
    participant OS
    participant MainThread
    participant SignalHandlerThread
    SignalHandlerThread->>SignalHandlerThread: handler calls sys.exit(0)
    SignalHandlerThread->>MainThread: send SYSEXIT event using SimpleQueue.put
    SignalHandlerThread->>SignalHandlerThread: raise_signal(SIGINT)
    OS->>MainThread: SIGINT
    MainThread->>MainThread: checks SYSEXIT queue
    MainThread->>MainThread: Calls sys.exit(0)
    MainThread->>MainThread: `finally` blocks will be executed.

This fix is the best I could think of. What do you think of this solution?

Not all tests pass at the moment.

CC @ZeroIntensity @gpshead Thanks for your time and patience.

Member

21 files changed

First of all, I'm impressed by your dedication!

Unfortunately, and I hate to say this, but judging by the complexity of this change, and by the fact that we need whole diagrams to explain it, I think we need a PEP. There's a lot that we need to think about here, and we just can't get enough community input from a PR.

Author

ivarref commented Dec 4, 2024

Thanks for your kind words @ZeroIntensity

Re 21 files: most of that is adding a line for the signal thread shutdown for a bunch of tests.

A PEP: I am not sure what to say about that. In my opinion this is not a big/huge change, and the diagram was just to illustrate how it works. I'm sure a ton of work has already been put into the signal module and into handling signals (as of now) only on the main thread. A ton of work must also have been done with respect to the different platforms (I read about how to do os.kill on Windows, for example; it was not easy). This makes me sceptical of changing any of that.

To me it seems much easier to build this feature on top of the existing Python infrastructure, i.e. in a way that does not involve changing the underlying "mechanics" in C, the OS layer, etc. That makes it a moderate change IMO. And since the old behavior is kept, it is possible to opt out if needed.

Perhaps it could break something in unexpected ways. I did not think about (or know any details of) sys.exit before trying to fix a test. Are there more cases that would require special handling, or cause breakage for users?

What do you think, @gpshead and @colesbury: does this warrant a PEP?

If you all "vote" for a PEP, that is fine by me, and I could try to draft something.

Thanks and kind regards.

Member

FWIW, we'll likely need changes to the C code anyway. Global state in modules that affect the whole process doesn't play well with subinterpreters, so I suspect that this will have to be implemented on the runtime structure.

Author

ivarref commented Dec 5, 2024

Thanks @ZeroIntensity

Unlike sub-interpreters, the main interpreter has unique process-global responsibilities like signal handling.

from https://docs.python.org/3/c-api/init.html#sub-interpreter-support

I'm reading this as: the main interpreter will be the only interpreter responsible for signal handling. The subinterpreters should not need to access the signal module, nor the global state in that module. I suppose the signal functions can assert that they are being called on the main interpreter (or perhaps they already do so at the C level?).

That said: I know next to nothing about subinterpreters. Is there some other way that the global state can be corrupted? Or: how does global module state corrupt subinterpreters? (Is there existing global module state? I suppose there is?)

Thanks and kind regards.

Member

Ah, by that it means that the main interpreter is responsible for dealing with signals sent to the Python process, so SIGINT. The signal module can still be imported and used in a subinterpreter (it's re-initialized for every interpreter). This will cause an extra, useless thread for every new interpreter, and that won't scale well. (It's not an impossible problem to solve--you can check the interpreter via _interpreters.get_current--but thinking about these kinds of things is exactly why I'm in favor of PEPing this.)

I suppose the signal functions can assert that they are being called on the main interpreter (or perhaps does so already in C-level?).

I don't think so, but that's probably a bug with subinterpreters themselves. I'll look into that separately.

Anyways, the experimental, buggy subinterpreters aside, I'm not sure that the implementation even works, because CI is failing across the board. I'd recommend getting tests green before discussing the philosophical side of it.
