Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-108987: Fix _thread.start_new_thread() race condition #109135

Merged
merged 1 commit into from
Sep 11, 2023

Conversation

vstinner
Copy link
Member

@vstinner vstinner commented Sep 8, 2023

Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time.

Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.

Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().
@vstinner vstinner force-pushed the start_new_thread branch 2 times, most recently from 9ecc4f7 to a50f9b6 Compare September 8, 2023 10:53
@vstinner vstinner marked this pull request as ready for review September 8, 2023 10:53
@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

I'm now trying to confirm manually that my change fix #108987 : this bug is not easy to reproduce in a reliable way. It seems more likely on Windows 32-bit for some reasons.

@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

I'm trying to make the crash more likely by adding a sleep in thread_run(). So far, I failed to trigger the crash on Linux:

diff --git a/Modules/_threadmodule.c b/Modules/_threadmodule.c
index 05bb49756c..39ee06f2f7 100644
--- a/Modules/_threadmodule.c
+++ b/Modules/_threadmodule.c
@@ -1076,6 +1076,13 @@ thread_run(void *boot_raw)
     struct bootstate *boot = (struct bootstate *) boot_raw;
     PyThreadState *tstate = boot->tstate;
 
+    {
+        struct timeval tv;
+        tv.tv_sec = 0;
+        tv.tv_usec = 100 * 1000;
+        select(0, NULL, NULL, NULL, &tv);
+    }
+
     // gh-104690: If Python is being finalized and PyInterpreterState_Delete()
     // was called, tstate becomes a dangling pointer.
     assert(_PyThreadState_CheckConsistency(tstate));

On Linux, I'm running these two commands in two different terminals:

  • ./python -m test -v test_threading -m test_default_timeout --forever -j200
  • ./python -m test -j8 -r --forever

Also, to fix test_default_timeout() test, I wrote this local fix:

diff --git a/Lib/test/lock_tests.py b/Lib/test/lock_tests.py
index a4f52cb20a..1949991fe6 100644
--- a/Lib/test/lock_tests.py
+++ b/Lib/test/lock_tests.py
@@ -1016,13 +1016,17 @@ def test_default_timeout(self):
         """
         # create a barrier with a low default timeout
         barrier = self.barriertype(self.N, timeout=0.3)
-        def f():
-            i = barrier.wait()
+        def thread_run():
+            try:
+                i = barrier.wait()
+            except threading.BrokenBarrierError:
+                return
+
             if i == self.N // 2:
                 # One thread is later than the default timeout of 0.3s.
                 time.sleep(1.0)
             self.assertRaises(threading.BrokenBarrierError, barrier.wait)
-        self.run_threads(f)
+        self.run_threads(thread_run)
 
     def test_single_thread(self):
         b = self.barriertype(1)

@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

Additional patch to attempt to make the crash more likely:

diff --git a/Python/pystate.c b/Python/pystate.c
index 09c3538ad7..788c4b2a62 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -1034,6 +1034,11 @@ PyInterpreterState_Delete(PyInterpreterState *interp)
     _PyObject_FiniState(interp);
 
     free_interpreter(interp);
+
+    struct timeval tv;
+    tv.tv_sec = 0;
+    tv.tv_usec = 100 * 1000;
+    select(0, NULL, NULL, NULL, &tv);
 }
 
 

@serhiy-storchaka serhiy-storchaka self-requested a review September 8, 2023 11:32
@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

Ok, I managed to reproduce the crash on Linux in a reliable way.

Apply this patch:

diff --git a/Modules/_threadmodule.c b/Modules/_threadmodule.c
index 05bb49756c..be40fb4bbf 100644
--- a/Modules/_threadmodule.c
+++ b/Modules/_threadmodule.c
@@ -1076,6 +1076,15 @@ thread_run(void *boot_raw)
     struct bootstate *boot = (struct bootstate *) boot_raw;
     PyThreadState *tstate = boot->tstate;
 
+    {
+        fprintf(stderr, "-- sleep 100 ms before PyEval_AcquireThread() --\n");
+
+        struct timeval tv;
+        tv.tv_sec = 0;
+        tv.tv_usec = 100 * 1000;
+        select(0, NULL, NULL, NULL, &tv);
+    }
+
     // gh-104690: If Python is being finalized and PyInterpreterState_Delete()
     // was called, tstate becomes a dangling pointer.
     assert(_PyThreadState_CheckConsistency(tstate));
diff --git a/Python/pystate.c b/Python/pystate.c
index 09c3538ad7..6392164f17 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -1034,6 +1034,13 @@ PyInterpreterState_Delete(PyInterpreterState *interp)
     _PyObject_FiniState(interp);
 
     free_interpreter(interp);
+
+    fprintf(stderr, "-- sleep 1 sec after free_interpreter() --\n");
+
+    struct timeval tv;
+    tv.tv_sec = 1;
+    tv.tv_usec = 0;
+    select(0, NULL, NULL, NULL, &tv);
 }
 
 

Run this script bug.py:

import _thread
import os
import sys

NTHREAD = 250

def create_thread():
    _thread.start_new_thread(os.write, (1, b'.'))

threads = [create_thread()
           for _ in range(NTHREAD)]
print("exit")

Output on the main branch with Python built in debug mode (./configure --with-pydebug CFLAGS="-O0"):

$ ./python x.py 
exit
python: Python/pystate.c:2910: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

This race condition is now quite obvious to me. I'm not sure why nobody found it earlier. Maybe in the past, threads exited while trying to acquire the GIL, in PyEval_AcquireThread()?

@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

I can reproduce the crash on Python 3.12 using my select() patch above.

For this test, I used PR #109133 (not merged into 3.12 yet) to get a more accurate error message. I added the sleep after _PyThreadState_CheckConsistency() in thread_run().

$ ./python ../main/bug.py 
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
(...)
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
exit
-- sleep 1 sec after free_interpreter() --
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

@vstinner
Copy link
Member Author

vstinner commented Sep 8, 2023

Now I can also reproduce the bug on Linux with Python 3.11 using the sleep() patch:

$ ./python ../main/bug.py 
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
(...)
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
exit
-- sleep 1 sec after free_interpreter() --
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

The important part is to add the sleep at the right place:

  • thread_run(): before _PyThreadState_CheckConsistency().
  • PyInterpreterState_Delete(): after free_interpreter().

@vstinner
Copy link
Member Author

vstinner commented Sep 9, 2023

@serhiy-storchaka @gpshead: Would you mind to review this fix for an old race condition in _thread.start_new_thread()?

I'm not sure about leaking references on purpose :-( Maybe I can add a basic sync primitive between the PyThread_start_new_thread() call on thread_run(): the "parent" function would wait until thread_run() is called.

The caller must hold the GIL to call _thread.start_new_thread(). If this function waits until thread_run() is called, we are sure that Python cannot be finalized in the meanwhile and we are good.

Avoiding reference leaks with threads... is hard. This change uses _PyThreadState_MustExit() which was previously called must_exit(). In ceval_gil.c, there are already many code paths quitting a thread if must_exit() is true. I'm not sure that we always avoid reference leaks in such code. Morever, I know that @gpshead dislikes when ceval_gil.c calls PyThread_exit_thread(): C++ dislikes it as well :-)

Maybe we can start with this "simple" approach, and later, if it becomes a real issue, consider designing an "even better" approach. I suppose that right now, the most important thing is to prevent Python to leak. Leaking a few references sound "less bad" than crashing the whole process.

Python has the old habit of cleaning resources of a thread from another thread: the thread calling Py_Finalize() is responsible to clear all resources of all other threads. I'm thinking about pystate.h zapthreads() function.

In a perfect world, a thread should clear its own resources. But well. There are technical challenges. We may want to be able to complete Py_Finalize() as soon as possible, while a daemon thread can be waiting for something unrelated to Python. The current semantics is that daemon threads survive to Python finalization, but "must exit" if they attempt to acquire the GIL after Python finalization.

I noticed a similar problem with sub-interpreters, it seems like the main interpreter calling Py_Finalize() does clean resources of other interpreters. Resources ownership is not strictly of each interpreter. The main interpreter seems to like to clean up resources of other interpreters. It makes me feel unconfortable. But again, they are technical challenges.

Put {sub-interpreters, nogil, multithreaded applications, signals, processes, multiprocessing, concurrent.futures} in a shaker with ice cubes, shake it for 2 min, serve in a glass, take a sip, and then cry. LOL.

cc @colesbury @ericsnowcurrently who may love such crazy bugs as well.

@vstinner vstinner merged commit 517cd82 into python:main Sep 11, 2023
@vstinner vstinner deleted the start_new_thread branch September 11, 2023 15:27
@vstinner vstinner added needs backport to 3.11 only security fixes needs backport to 3.12 bug and security fixes labels Sep 11, 2023
@miss-islington
Copy link
Contributor

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

@miss-islington
Copy link
Contributor

Sorry, @vstinner, I could not cleanly backport this to 3.11 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 517cd82ea7d01b344804413ef05610934a43a241 3.11

@miss-islington
Copy link
Contributor

Sorry, @vstinner, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 517cd82ea7d01b344804413ef05610934a43a241 3.12

@vstinner
Copy link
Member Author

Oh, the backport to 3.12 should wait for #109133 to be merged.

vstinner added a commit to vstinner/cpython that referenced this pull request Sep 11, 2023
…n#109135)

Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.

Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().

(cherry picked from commit 517cd82)
@bedevere-app
Copy link

bedevere-app bot commented Sep 11, 2023

GH-109272 is a backport of this pull request to the 3.11 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.11 only security fixes label Sep 11, 2023
vstinner added a commit that referenced this pull request Sep 11, 2023
) (#109272)

gh-108987: Fix _thread.start_new_thread() race condition (#109135)

Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.

Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().

(cherry picked from commit 517cd82)
vstinner added a commit to vstinner/cpython that referenced this pull request Oct 4, 2023
…n#109135)

Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.

Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().

(cherry picked from commit 517cd82)
@bedevere-app
Copy link

bedevere-app bot commented Oct 4, 2023

GH-110342 is a backport of this pull request to the 3.12 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Oct 4, 2023
vstinner added a commit that referenced this pull request Oct 4, 2023
) (#110342)

* gh-108987: Fix _thread.start_new_thread() race condition (#109135)

Fix _thread.start_new_thread() race condition. If a thread is created
during Python finalization, the newly spawned thread now exits
immediately instead of trying to access freed memory and lead to a
crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread
must exit. The problem was that tstate was dereferenced earlier in
_PyThreadState_Bind() which leads to a crash most of the time.

Move _PyThreadState_CheckConsistency() from thread_run() to
_PyThreadState_Bind().

(cherry picked from commit 517cd82)

* gh-109795: `_thread.start_new_thread`: allocate thread bootstate using raw memory allocator (#109808)

(cherry picked from commit 1b8f236)

---------

Co-authored-by: Radislav Chugunov <52372310+chgnrdv@users.noreply.github.com>
@gpshead
Copy link
Member

gpshead commented Oct 4, 2023

thanks for this, at first glance I believe it to be correct. (hard to ever say for sure on this kind of thing =)
at least there was an assertion in the code before this change. turning that into an actual non crashing bail out as this PR does seems good.

@vstinner
Copy link
Member Author

vstinner commented Oct 5, 2023

Sadly, this design "check before use" still has a "short" race condition: #110052 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants