Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-114271: Fix race in Thread.join() #114839

Merged
merged 34 commits into from
Mar 16, 2024

Conversation

mpage
Copy link
Contributor

@mpage mpage commented Feb 1, 2024

This PR makes thread joining thread-safe in free-threaded builds and fixes gh-116372, which affects both default and free-threaded builds.

The combination of _tstate_lock and _ThreadHandle is replaced by _ThreadHandle and the supporting infrastructure for _tstate_lock is removed. Joining a thread now joins its _ThreadHandle. The _ThreadHandles for all threads that we want to join on shutdown (non-daemon threads that were created by the threading module) are tracked in a linked-list attached to the _thread module. At shutdown we iterate over this list and join the handles.

There is one wrinkle with this approach: the _MainThread instance. The _MainThread instance is joinable. Previously, it represented whatever thread imported the threading module. Typically this is the main thread of the main interpreter, but in some rare circumstances (embedding scenarios?) it might not be. The previous approach allowed the MainThread instance to be joined even if it wasn't the main thread of the main interpreter, but this isn't possible with the current approach. To work around this we change the _MainThread to always represent the main thread of the runtime, also fixing gh-83223.

There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:

1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
   out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
   out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
   However, C holds it, so the assertion fails.

The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.

There are two main parts to this PR:

1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
   The event is set by the runtime prior to the thread being cleared (in the same
   place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
   event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
   non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
   `PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
   now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
   `_tstate_lock`s.

[^1]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1201
[^2]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1115
[^3]: 8194653
@mpage
Copy link
Contributor Author

mpage commented Feb 1, 2024

Just a heads up - please don't look at this until it's marked as ready for review. There are a couple of failing CI jobs that need to be addressed.

When we have `typedef struct _PyEventRc _PyEventRc` in `pstate.h`
the C analyzer can't decide between that and the definition in `pycore_lock.h`.
@mpage mpage marked this pull request as ready for review February 5, 2024 22:49
@mpage
Copy link
Contributor Author

mpage commented Feb 6, 2024

@colesbury - This is good to go. Would you please have a look?

@mpage mpage marked this pull request as draft February 6, 2024 19:52
@mpage
Copy link
Contributor Author

mpage commented Feb 6, 2024

Putting this back in draft. Going to see if I can move the join logic into ThreadHandle.

@mpage mpage marked this pull request as ready for review March 5, 2024 20:30
@mpage
Copy link
Contributor Author

mpage commented Mar 6, 2024

@colesbury - Could you take a look at this when you get a chance, please?

Copy link
Contributor

@colesbury colesbury left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, the approach looks nice!

I'm still looking this over, but I have some questions about thread._handle. Currently, thread._handle starts out as None and is then set to a ThreadHandle after the thread is started. This None -> ThreadHandle transition seems like it adds complexity and I think there may be a few bugs related to it (see comments below).

Would it be possible to always have a non-None self._handle by pushing a bit more state into ThreadHandle?

Lib/threading.py Show resolved Hide resolved
Python/pystate.c Outdated Show resolved Hide resolved
@ericsnowcurrently
Copy link
Member

Yeah, you've captured it well. FYI, I was looking at the _MainThread stuff separately already. 😄

@ericsnowcurrently
Copy link
Member

I'd be glad to follow-through on the changes I posted in gh-116776, if that would help you.

@mpage
Copy link
Contributor Author

mpage commented Mar 14, 2024

I'd be glad to follow-through on the changes I posted in #116776, if that would help you.

@ericsnowcurrently - Thanks! Is that something you think you can get done quickly (i.e. in the next few days)? If so, I'd be happy to stack on top of those changes once you're done.

@ericsnowcurrently
Copy link
Member

Yeah, I'll aim to wrap it up tomorrow.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. I really like the simplification here!

Modules/_threadmodule.c Show resolved Hide resolved
Modules/_threadmodule.c Outdated Show resolved Hide resolved
Modules/_threadmodule.c Outdated Show resolved Hide resolved
@ericsnowcurrently
Copy link
Member

I'll defer to @pitrou's assessment on this. I'd rather this land sooner than later and don't want my pace to hold things up. Again, thanks for working on this.

(I'll follow up separately on any cleanup, about the "main" thread, and about our approach to managing thread lifetimes. I actually appreciate not feeling so rushed to sort those out. 😄)

@ericsnowcurrently
Copy link
Member

FWIW, this change feels a bit closer to what I had in mind last year: #110848 (comment).

@mpage
Copy link
Contributor Author

mpage commented Mar 15, 2024

The CIFuzz (memory) failure is unrelated to this PR.

@pitrou
Copy link
Member

pitrou commented Mar 15, 2024

Do we have a Thread Sanitizer CI build somewhere? In my experience it's quite good at finding synchronization bugs.
@gpshead

Edit: oh, I see this is being proposed as #116872. Thank you @corona10 !

@pitrou
Copy link
Member

pitrou commented Mar 15, 2024

@gpshead Would you like to take a look at this?

@pitrou
Copy link
Member

pitrou commented Mar 16, 2024

Ok, TSAN tests passed after merging from main (though there are too few of them currently).

@pitrou
Copy link
Member

pitrou commented Mar 16, 2024

Updated again to get test_threading in the set of TSAN tests.

@pitrou pitrou merged commit 33da0e8 into python:main Mar 16, 2024
35 of 36 checks passed
vstinner pushed a commit to vstinner/cpython that referenced this pull request Mar 20, 2024
There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:

1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
   out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
   out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
   However, C holds it, so the assertion fails.

The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.

There are two main parts to this PR:

1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
   The event is set by the runtime prior to the thread being cleared (in the same
   place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
   event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
   non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
   `PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
   now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
   `_tstate_lock`s.

[^1]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1201
[^2]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1115
[^3]: mpage@8194653

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
adorilson pushed a commit to adorilson/cpython that referenced this pull request Mar 25, 2024
There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:

1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
   out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
   out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
   However, C holds it, so the assertion fails.

The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.

There are two main parts to this PR:

1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
   The event is set by the runtime prior to the thread being cleared (in the same
   place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
   event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
   non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
   `PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
   now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
   `_tstate_lock`s.

[^1]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1201
[^2]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1115
[^3]: mpage@8194653

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Apr 17, 2024
There is a race between when `Thread._tstate_lock` is released[^1] in `Thread._wait_for_tstate_lock()`
and when `Thread._stop()` asserts[^2] that it is unlocked. Consider the following execution
involving threads A, B, and C:

1. A starts.
2. B joins A, blocking on its `_tstate_lock`.
3. C joins A, blocking on its `_tstate_lock`.
4. A finishes and releases its `_tstate_lock`.
5. B acquires A's `_tstate_lock` in `_wait_for_tstate_lock()`, releases it, but is swapped
   out before calling `_stop()`.
6. C is scheduled, acquires A's `_tstate_lock` in `_wait_for_tstate_lock()` but is swapped
   out before releasing it.
7. B is scheduled, calls `_stop()`, which asserts that A's `_tstate_lock` is not held.
   However, C holds it, so the assertion fails.

The race can be reproduced[^3] by inserting sleeps at the appropriate points in
the threading code. To do so, run the `repro_join_race.py` from the linked repo.

There are two main parts to this PR:

1. `_tstate_lock` is replaced with an event that is attached to `PyThreadState`.
   The event is set by the runtime prior to the thread being cleared (in the same
   place that `_tstate_lock` was released). `Thread.join()` blocks waiting for the
   event to be set.
2. `_PyInterpreterState_WaitForThreads()` provides the ability to wait for all
   non-daemon threads to exit. To do so, an `is_daemon` predicate was added to
   `PyThreadState`. This field is set each time a thread is created. `threading._shutdown()`
   now calls into `_PyInterpreterState_WaitForThreads()` instead of waiting on
   `_tstate_lock`s.

[^1]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1201
[^2]: https://github.com/python/cpython/blob/441affc9e7f419ef0b68f734505fa2f79fe653c7/Lib/threading.py#L1115
[^3]: mpage@8194653

---------

Co-authored-by: blurb-it[bot] <43283697+blurb-it[bot]@users.noreply.github.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
intellij-monorepo-bot pushed a commit to JetBrains/intellij-community that referenced this pull request Jun 21, 2024
…hon 3.13

`Thread._is_stopped` was removed in python/cpython#114839 in Python 3.13

(cherry picked from commit d568bc288aed00268ffeef137b9b901f480964ef)

IJ-MR-137093

GitOrigin-RevId: aa8b27b09db85f69b44d6fbbd333f8ec32489d33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race in Thread.join()
5 participants