gh-124878: Fix race conditions during interpreter finalization #130649

Merged: 2 commits merged into python:main from the gh-124878-tstate-refcount branch on Mar 6, 2025

Contributor

@colesbury colesbury commented Feb 27, 2025

`PyThreadState` gains a reference count field so that a `PyThreadState` pointer can no longer dangle and point to freed memory. The refcount starts at two: one reference is owned by the interpreter's linked list of thread states and one by the OS thread. The reference count is decremented when the thread state is removed from the interpreter's linked list and before the OS thread calls `PyThread_hang_thread()`. Whichever thread decrements it to zero frees the `PyThreadState` memory.
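
The two-owner scheme described above can be sketched in isolation. This is a toy illustration, not the code from this PR: `toy_tstate`, `toy_tstate_new()`, and `toy_tstate_decref()` are hypothetical names, and C11 `<stdatomic.h>` is assumed in place of CPython's internal atomics.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for PyThreadState; only the refcount matters here. */
typedef struct {
    atomic_int refcount;  /* starts at 2: interpreter list + OS thread */
    int id;
} toy_tstate;

/* Allocate a thread state that already has both owners accounted for. */
toy_tstate *
toy_tstate_new(int id)
{
    toy_tstate *ts = malloc(sizeof(*ts));
    if (ts == NULL) {
        return NULL;
    }
    atomic_init(&ts->refcount, 2);
    ts->id = id;
    return ts;
}

/* Drop one reference. Returns 1 if this call freed the struct, else 0. */
int
toy_tstate_decref(toy_tstate *ts)
{
    /* atomic_fetch_sub returns the previous value, so a previous value
       of 1 means this caller took the count to zero and must free. */
    if (atomic_fetch_sub(&ts->refcount, 1) == 1) {
        free(ts);
        return 1;
    }
    return 0;
}
```

In this sketch either owner may release first; only the second release frees the memory, so neither side can be left holding a pointer to a freed struct.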

The `holds_gil` field is moved out of the `_status` bit field to avoid a data race in which one thread calls `PyThreadState_Clear()`, modifying the `_status` bit field, while the OS thread reads `holds_gil` when attempting to acquire the GIL.
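
As background for why the move helps: in C11, adjacent non-zero-width bit fields share a single memory location, so writing one of them is a read-modify-write of storage that also holds the others. A separate non-bit-field member is its own memory location. The sketch below uses hypothetical struct and field names (`toy_status_packed`, `toy_tstate_split`) and does not reproduce the real `PyThreadState` layout.

```c
#include <stddef.h>

/* Before (toy layout): all flags are adjacent bit fields. Setting
   `cleared` rewrites the storage unit that also contains `holds_gil`,
   so a concurrent reader of `holds_gil` races with that write. */
typedef struct {
    unsigned int initialized : 1;
    unsigned int cleared : 1;
    unsigned int holds_gil : 1;
} toy_status_packed;

/* After (toy layout): `holds_gil` is an ordinary member, which is a
   separate memory location from the remaining bit fields. */
typedef struct {
    struct {
        unsigned int initialized : 1;
        unsigned int cleared : 1;
    } status;
    int holds_gil;
} toy_tstate_split;

/* Models the clearing thread: touches only the bit field storage. */
void
toy_clear(toy_tstate_split *ts)
{
    ts->status.cleared = 1;
}

/* Models the OS thread: reads only the separate member. */
int
toy_holds_gil(const toy_tstate_split *ts)
{
    return ts->holds_gil;
}
```

With the split layout, `toy_clear()` and `toy_holds_gil()` access different memory locations, so running them on different threads is not a data race on that account.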

The `PyThreadState.state` field now has `_Py_THREAD_SHUTTING_DOWN` as a possible value. This corresponds to the `_PyThreadState_MustExit()` check and avoids race conditions in the free-threaded build when checking `_PyThreadState_MustExit()`.
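
One way to picture the benefit of folding the "must exit" condition into the state field: if attaching is a compare-and-swap from a detached state, a thread whose state has been set to shutting down can never attach, and the check and the transition happen in one atomic step. The sketch below is an assumption-laden toy, not CPython's implementation: the `TOY_*` constants, `toy_thread`, and all three functions are hypothetical, and C11 `<stdatomic.h>` is assumed.

```c
#include <stdatomic.h>

/* Hypothetical state values; the real constants may differ. */
enum { TOY_DETACHED = 0, TOY_ATTACHED = 1, TOY_SHUTTING_DOWN = 2 };

typedef struct {
    atomic_int state;
} toy_thread;

/* Try to move DETACHED -> ATTACHED. Returns 1 on success, 0 if the
   thread was in any other state (including TOY_SHUTTING_DOWN). */
int
toy_try_attach(toy_thread *t)
{
    int expected = TOY_DETACHED;
    return atomic_compare_exchange_strong(&t->state, &expected, TOY_ATTACHED);
}

/* Models finalization marking a thread as shutting down. */
void
toy_begin_shutdown(toy_thread *t)
{
    atomic_store(&t->state, TOY_SHUTTING_DOWN);
}

/* Models the "must exit" check as a single atomic load of the state. */
int
toy_must_exit(toy_thread *t)
{
    return atomic_load(&t->state) == TOY_SHUTTING_DOWN;
}
```

Because the exit condition lives in the same atomic word as the attach state, there is no window in this toy between "checked that I may run" and "attached".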

Member

@vstinner vstinner left a comment

LGTM.

Avoiding dangling pointers closes a whole category of bugs. Using a reference count is a nice solution for that. I also like the fact that Py_Finalize() now sets the thread states to "shutting down": it's more explicit like that.

@vstinner
Member

By the way, this change may also fix the old crash #110052 (test_4_daemon_threads).

@colesbury
Contributor Author

Yeah, I think it should fix test_4_daemon_threads as well. I'm using the repro from that issue:

./python -m test test_threading -m test_4_daemon_threads -j50 -F --fail-env-changed

And also a small patch to ceval_gil.c to make the crash more likely to occur on my machine.

On main (with the patch to ceval_gil.c), I see a crash pretty quickly after ~100 iterations on both the default and free threaded builds.

With this PR, I haven't seen any failures in 25,000 iterations (~10 minutes).

Patch to ceval_gil.c
diff --git a/Python/ceval_gil.c b/Python/ceval_gil.c
index 2c1cc17b2ff..e14f1a8afa2 100644
--- a/Python/ceval_gil.c
+++ b/Python/ceval_gil.c
@@ -306,6 +306,8 @@ take_gil(PyThreadState *tstate)
         _PyThreadState_HangThread(tstate);
     }

+    usleep(10);
+
     assert(_PyThreadState_CheckConsistency(tstate));
     PyInterpreterState *interp = tstate->interp;
     struct _gil_runtime_state *gil = interp->ceval.gil;

@colesbury colesbury merged commit 052cb71 into python:main Mar 6, 2025
42 checks passed
@colesbury colesbury deleted the gh-124878-tstate-refcount branch March 6, 2025 15:38
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL8 3.x (tier-3) has failed when building commit 052cb71.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/#/builders/509/builds/8563) and take a look at the build logs.
  4. Check if the failure is related to this commit (052cb71) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/#/builders/509/builds/8563

Failed tests:

  • test.test_multiprocessing_spawn.test_manager

Traceback logs:
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 266, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 260, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '3ff9859f260'
---------------------------------------------------------------------------
Timeout (0:20:00)!
Thread 0x000003ffb74f7270 (most recent call first):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/popen_fork.py", line 44 in wait
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 149 in join
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 623 in _callCleanup
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 697 in doCleanups
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 664 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 716 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/runner.py", line 259 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 84 in _run_suite
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 42 in run_unittest
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 162 in test_func
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 118 in regrtest_runner
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 165 in _load_run_test
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 210 in _runtest_env_changed_exc
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 319 in _runtest
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 348 in run_single_test
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 92 in worker_process
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 127 in main
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 131 in <module>
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/runpy.py", line 88 in _run_code
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/runpy.py", line 198 in _run_module_as_main


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 266, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 260, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '3ff985fb340'
---------------------------------------------------------------------------
k


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/_test_multiprocessing.py", line 1625, in f
    woken.release()
    ~~~~~~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 1067, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 848, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 264, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '3ff89835aa0'


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 266, in serve_client
    raise ke
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 260, in serve_client
    obj, exposed, gettypeid = id_to_obj[ident]
                              ~~~~~~~~~^^^^^^^
KeyError: '3ff89835aa0'
---------------------------------------------------------------------------
Timeout (0:20:00)!
Thread 0x000003ff9f377270 (most recent call first):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/popen_fork.py", line 28 in poll
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/popen_fork.py", line 44 in wait
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 149 in join
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 623 in _callCleanup
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 697 in doCleanups
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 664 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/case.py", line 716 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 122 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/suite.py", line 84 in __call__
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/unittest/runner.py", line 259 in run
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 84 in _run_suite
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 42 in run_unittest
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 162 in test_func
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 118 in regrtest_runner
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 165 in _load_run_test
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 210 in _runtest_env_changed_exc
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 319 in _runtest
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/single.py", line 348 in run_single_test
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 92 in worker_process
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 127 in main
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/libregrtest/worker.py", line 131 in <module>
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/runpy.py", line 88 in _run_code
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/runpy.py", line 198 in _run_module_as_main


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/_test_multiprocessing.py", line 1625, in f
    woken.release()
    ~~~~~~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 1067, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 848, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 264, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '3ff985fb340'


Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 313, in _bootstrap
    self.run()
    ~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/test/_test_multiprocessing.py", line 1625, in f
    woken.release()
    ~~~~~~~~~~~~~^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 1067, in release
    return self._callmethod('release')
           ~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 848, in _callmethod
    raise convert_to_error(kind, result)
multiprocessing.managers.RemoteError: 
---------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/buildbot/buildarea/3.x.cstratak-rhel8-s390x/build/Lib/multiprocessing/managers.py", line 264, in serve_client
    self.id_to_local_proxy_obj[ident]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^
KeyError: '3ff9859f260'
