gh-108987: Fix _thread.start_new_thread() race condition #109135

vstinner · 2023-09-08T10:49:38Z

Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash.

thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time.

Issue: test_threading fails and crashes when rerun #108987

Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash. thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time. Move _PyThreadState_CheckConsistency() from thread_run() to _PyThreadState_Bind().

vstinner · 2023-09-08T10:54:31Z

I'm now trying to confirm manually that my change fix #108987 : this bug is not easy to reproduce in a reliable way. It seems more likely on Windows 32-bit for some reasons.

vstinner · 2023-09-08T11:01:18Z

I'm trying to make the crash more likely by adding a sleep in thread_run(). So far, I failed to trigger the crash on Linux:

diff --git a/Modules/_threadmodule.c b/Modules/_threadmodule.c
index 05bb49756c..39ee06f2f7 100644
--- a/Modules/_threadmodule.c
+++ b/Modules/_threadmodule.c
@@ -1076,6 +1076,13 @@ thread_run(void *boot_raw)
     struct bootstate *boot = (struct bootstate *) boot_raw;
     PyThreadState *tstate = boot->tstate;
 
+    {
+        struct timeval tv;
+        tv.tv_sec = 0;
+        tv.tv_usec = 100 * 1000;
+        select(0, NULL, NULL, NULL, &tv);
+    }
+
     // gh-104690: If Python is being finalized and PyInterpreterState_Delete()
     // was called, tstate becomes a dangling pointer.
     assert(_PyThreadState_CheckConsistency(tstate));

On Linux, I'm running these two commands in two different terminals:

./python -m test -v test_threading -m test_default_timeout --forever -j200
./python -m test -j8 -r --forever

Also, to fix test_default_timeout() test, I wrote this local fix:

diff --git a/Lib/test/lock_tests.py b/Lib/test/lock_tests.py
index a4f52cb20a..1949991fe6 100644
--- a/Lib/test/lock_tests.py
+++ b/Lib/test/lock_tests.py
@@ -1016,13 +1016,17 @@ def test_default_timeout(self):
         """
         # create a barrier with a low default timeout
         barrier = self.barriertype(self.N, timeout=0.3)
-        def f():
-            i = barrier.wait()
+        def thread_run():
+            try:
+                i = barrier.wait()
+            except threading.BrokenBarrierError:
+                return
+
             if i == self.N // 2:
                 # One thread is later than the default timeout of 0.3s.
                 time.sleep(1.0)
             self.assertRaises(threading.BrokenBarrierError, barrier.wait)
-        self.run_threads(f)
+        self.run_threads(thread_run)
 
     def test_single_thread(self):
         b = self.barriertype(1)

vstinner · 2023-09-08T11:10:55Z

Additional patch to attempt to make the crash more likely:

diff --git a/Python/pystate.c b/Python/pystate.c
index 09c3538ad7..788c4b2a62 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -1034,6 +1034,11 @@ PyInterpreterState_Delete(PyInterpreterState *interp)
     _PyObject_FiniState(interp);
 
     free_interpreter(interp);
+
+    struct timeval tv;
+    tv.tv_sec = 0;
+    tv.tv_usec = 100 * 1000;
+    select(0, NULL, NULL, NULL, &tv);
 }

vstinner · 2023-09-08T11:45:50Z

Ok, I managed to reproduce the crash on Linux in a reliable way.

Apply this patch:

diff --git a/Modules/_threadmodule.c b/Modules/_threadmodule.c
index 05bb49756c..be40fb4bbf 100644
--- a/Modules/_threadmodule.c
+++ b/Modules/_threadmodule.c
@@ -1076,6 +1076,15 @@ thread_run(void *boot_raw)
     struct bootstate *boot = (struct bootstate *) boot_raw;
     PyThreadState *tstate = boot->tstate;
 
+    {
+        fprintf(stderr, "-- sleep 100 ms before PyEval_AcquireThread() --\n");
+
+        struct timeval tv;
+        tv.tv_sec = 0;
+        tv.tv_usec = 100 * 1000;
+        select(0, NULL, NULL, NULL, &tv);
+    }
+
     // gh-104690: If Python is being finalized and PyInterpreterState_Delete()
     // was called, tstate becomes a dangling pointer.
     assert(_PyThreadState_CheckConsistency(tstate));
diff --git a/Python/pystate.c b/Python/pystate.c
index 09c3538ad7..6392164f17 100644
--- a/Python/pystate.c
+++ b/Python/pystate.c
@@ -1034,6 +1034,13 @@ PyInterpreterState_Delete(PyInterpreterState *interp)
     _PyObject_FiniState(interp);
 
     free_interpreter(interp);
+
+    fprintf(stderr, "-- sleep 1 sec after free_interpreter() --\n");
+
+    struct timeval tv;
+    tv.tv_sec = 1;
+    tv.tv_usec = 0;
+    select(0, NULL, NULL, NULL, &tv);
 }

Run this script bug.py:

import _thread
import os
import sys

NTHREAD = 250

def create_thread():
    _thread.start_new_thread(os.write, (1, b'.'))

threads = [create_thread()
           for _ in range(NTHREAD)]
print("exit")

Output on the main branch with Python built in debug mode (./configure --with-pydebug CFLAGS="-O0"):

$ ./python x.py 
exit
python: Python/pystate.c:2910: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

This race condition is now quite obvious to me. I'm not sure why nobody found it earlier. Maybe in the past, threads exited while trying to acquire the GIL, in PyEval_AcquireThread()?

vstinner · 2023-09-08T11:52:59Z

I can reproduce the crash on Python 3.12 using my select() patch above.

For this test, I used PR #109133 (not merged into 3.12 yet) to get a more accurate error message. I added the sleep after _PyThreadState_CheckConsistency() in thread_run().

$ ./python ../main/bug.py 
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
(...)
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
exit
-- sleep 1 sec after free_interpreter() --
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2870: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

vstinner · 2023-09-08T11:58:17Z

Now I can also reproduce the bug on Linux with Python 3.11 using the sleep() patch:

$ ./python ../main/bug.py 
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
(...)
-- sleep 100 ms before PyEval_AcquireThread() --
-- sleep 100 ms before PyEval_AcquireThread() --
exit
-- sleep 1 sec after free_interpreter() --
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
python: Python/pystate.c:2259: _PyThreadState_CheckConsistency: Assertion `!_PyMem_IsPtrFreed(tstate->interp)' failed.
Abandon (core dumped)

The important part is to add the sleep at the right place:

thread_run(): before _PyThreadState_CheckConsistency().
PyInterpreterState_Delete(): after free_interpreter().

vstinner · 2023-09-09T09:15:44Z

@serhiy-storchaka @gpshead: Would you mind to review this fix for an old race condition in _thread.start_new_thread()?

I'm not sure about leaking references on purpose :-( Maybe I can add a basic sync primitive between the PyThread_start_new_thread() call on thread_run(): the "parent" function would wait until thread_run() is called.

The caller must hold the GIL to call _thread.start_new_thread(). If this function waits until thread_run() is called, we are sure that Python cannot be finalized in the meanwhile and we are good.

Avoiding reference leaks with threads... is hard. This change uses _PyThreadState_MustExit() which was previously called must_exit(). In ceval_gil.c, there are already many code paths quitting a thread if must_exit() is true. I'm not sure that we always avoid reference leaks in such code. Morever, I know that @gpshead dislikes when ceval_gil.c calls PyThread_exit_thread(): C++ dislikes it as well :-)

Maybe we can start with this "simple" approach, and later, if it becomes a real issue, consider designing an "even better" approach. I suppose that right now, the most important thing is to prevent Python to leak. Leaking a few references sound "less bad" than crashing the whole process.

Python has the old habit of cleaning resources of a thread from another thread: the thread calling Py_Finalize() is responsible to clear all resources of all other threads. I'm thinking about pystate.h zapthreads() function.

In a perfect world, a thread should clear its own resources. But well. There are technical challenges. We may want to be able to complete Py_Finalize() as soon as possible, while a daemon thread can be waiting for something unrelated to Python. The current semantics is that daemon threads survive to Python finalization, but "must exit" if they attempt to acquire the GIL after Python finalization.

I noticed a similar problem with sub-interpreters, it seems like the main interpreter calling Py_Finalize() does clean resources of other interpreters. Resources ownership is not strictly of each interpreter. The main interpreter seems to like to clean up resources of other interpreters. It makes me feel unconfortable. But again, they are technical challenges.

Put {sub-interpreters, nogil, multithreaded applications, signals, processes, multiprocessing, concurrent.futures} in a shaker with ice cubes, shake it for 2 min, serve in a glass, take a sip, and then cry. LOL.

cc @colesbury @ericsnowcurrently who may love such crazy bugs as well.

miss-islington · 2023-09-11T15:28:03Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.11.
🐍🍒⛏🤖

miss-islington · 2023-09-11T15:28:03Z

Thanks @vstinner for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington · 2023-09-11T15:28:06Z

Sorry, @vstinner, I could not cleanly backport this to 3.11 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 517cd82ea7d01b344804413ef05610934a43a241 3.11

miss-islington · 2023-09-11T15:28:09Z

Sorry, @vstinner, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.
cherry_picker 517cd82ea7d01b344804413ef05610934a43a241 3.12

vstinner · 2023-09-11T15:48:24Z

Oh, the backport to 3.12 should wait for #109133 to be merged.

…n#109135) Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash. thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time. Move _PyThreadState_CheckConsistency() from thread_run() to _PyThreadState_Bind(). (cherry picked from commit 517cd82)

bedevere-app · 2023-09-11T16:00:01Z

GH-109272 is a backport of this pull request to the 3.11 branch.

) (#109272) gh-108987: Fix _thread.start_new_thread() race condition (#109135) Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash. thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time. Move _PyThreadState_CheckConsistency() from thread_run() to _PyThreadState_Bind(). (cherry picked from commit 517cd82)

…n#109135) Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash. thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time. Move _PyThreadState_CheckConsistency() from thread_run() to _PyThreadState_Bind(). (cherry picked from commit 517cd82)

bedevere-app · 2023-10-04T10:26:22Z

GH-110342 is a backport of this pull request to the 3.12 branch.

) (#110342) * gh-108987: Fix _thread.start_new_thread() race condition (#109135) Fix _thread.start_new_thread() race condition. If a thread is created during Python finalization, the newly spawned thread now exits immediately instead of trying to access freed memory and lead to a crash. thread_run() calls PyEval_AcquireThread() which checks if the thread must exit. The problem was that tstate was dereferenced earlier in _PyThreadState_Bind() which leads to a crash most of the time. Move _PyThreadState_CheckConsistency() from thread_run() to _PyThreadState_Bind(). (cherry picked from commit 517cd82) * gh-109795: `_thread.start_new_thread`: allocate thread bootstate using raw memory allocator (#109808) (cherry picked from commit 1b8f236) --------- Co-authored-by: Radislav Chugunov <52372310+chgnrdv@users.noreply.github.com>

gpshead · 2023-10-04T23:35:06Z

thanks for this, at first glance I believe it to be correct. (hard to ever say for sure on this kind of thing =)
at least there was an assertion in the code before this change. turning that into an actual non crashing bail out as this PR does seems good.

vstinner · 2023-10-05T08:43:50Z

Sadly, this design "check before use" still has a "short" race condition: #110052 (comment)

bedevere-bot added the awaiting core review label Sep 8, 2023

bedevere-bot mentioned this pull request Sep 8, 2023

test_threading fails and crashes when rerun #108987

Closed

vstinner marked this pull request as draft September 8, 2023 10:50

bedevere-bot removed the awaiting core review label Sep 8, 2023

vstinner force-pushed the start_new_thread branch 2 times, most recently from 9ecc4f7 to a50f9b6 Compare September 8, 2023 10:53

vstinner marked this pull request as ready for review September 8, 2023 10:53

bedevere-bot added the awaiting core review label Sep 8, 2023

serhiy-storchaka self-requested a review September 8, 2023 11:32

serhiy-storchaka approved these changes Sep 11, 2023

View reviewed changes

bedevere-bot added awaiting merge and removed awaiting core review labels Sep 11, 2023

vstinner merged commit 517cd82 into python:main Sep 11, 2023

vstinner deleted the start_new_thread branch September 11, 2023 15:27

bedevere-app bot removed the awaiting merge label Sep 11, 2023

vstinner added needs backport to 3.11 only security fixes needs backport to 3.12 bug and security fixes labels Sep 11, 2023

miss-islington assigned vstinner Sep 11, 2023

bedevere-app bot removed the needs backport to 3.11 only security fixes label Sep 11, 2023

This was referenced Sep 11, 2023

[3.11] gh-108987: Fix _thread.start_new_thread() race condition (#109135) #109272

Merged

gh-109276: libregrtest: use separated file for JSON #109277

Merged

vstinner mentioned this pull request Sep 14, 2023

[3.12] gh-104690: thread_run() checks for tstate dangling pointer (#109056) #109133

Merged

chgnrdv mentioned this pull request Sep 23, 2023

Starting new thread during finalization leads to call to PyMem_Free without holding the GIL #109795

Closed

vstinner mentioned this pull request Sep 25, 2023

gh-109795: _thread.start_new_thread: allocate thread bootstate using raw memory allocator #109808

Merged

bedevere-app bot removed the needs backport to 3.12 bug and security fixes label Oct 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

gh-108987: Fix _thread.start_new_thread() race condition #109135

gh-108987: Fix _thread.start_new_thread() race condition #109135

vstinner commented Sep 8, 2023 •

edited by bedevere-bot

Loading

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023 •

edited

Loading

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 9, 2023 •

edited

Loading

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

vstinner commented Sep 11, 2023

bedevere-app bot commented Sep 11, 2023

bedevere-app bot commented Oct 4, 2023

gpshead commented Oct 4, 2023

vstinner commented Oct 5, 2023 •

edited

Loading

gh-108987: Fix _thread.start_new_thread() race condition #109135

gh-108987: Fix _thread.start_new_thread() race condition #109135

Conversation

vstinner commented Sep 8, 2023 • edited by bedevere-bot Loading

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023 • edited Loading

vstinner commented Sep 8, 2023

vstinner commented Sep 8, 2023

vstinner commented Sep 9, 2023 • edited Loading

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

miss-islington commented Sep 11, 2023

vstinner commented Sep 11, 2023

bedevere-app bot commented Sep 11, 2023

bedevere-app bot commented Oct 4, 2023

gpshead commented Oct 4, 2023

vstinner commented Oct 5, 2023 • edited Loading

vstinner commented Sep 8, 2023 •

edited by bedevere-bot

Loading

vstinner commented Sep 8, 2023 •

edited

Loading

vstinner commented Sep 9, 2023 •

edited

Loading

vstinner commented Oct 5, 2023 •

edited

Loading