Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_interpreters is not thread safe on the free-threaded build #126644

Open
devdanzin opened this issue Nov 10, 2024 · 5 comments
Open

_interpreters is not thread safe on the free-threaded build #126644

devdanzin opened this issue Nov 10, 2024 · 5 comments
Assignees
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes extension-modules C modules in the Modules dir topic-free-threading topic-subinterpreters type-crash A hard crash of the interpreter, possibly with a core dump

Comments

@devdanzin
Copy link
Contributor

devdanzin commented Nov 10, 2024

Crash report

What happened?

First off, sorry for not being able to offer code that is more reduced and certain to trigger a repro.

The code below non-deterministically triggers python: Python/index_pool.c:92: heap_pop: Assertion 'heap->size > 0' failed. in a free-threading build with PYTHON_GIL=0.

import _interpreters
from threading import Thread

def f(): pass

ints = []
for x in range(300):
    ints.append(_interpreters.create())

threads = []
for interpr in ints:
    threads.append(Thread(target=_interpreters.run_string, args=(interpr, "f()",)))
    threads.append(Thread(target=_interpreters.get_current, args=()))
    threads.append(Thread(target=_interpreters.destroy, args=(interpr,)))
    threads.append(Thread(target=_interpreters.list_all, args=()))
    threads.append(Thread(target=_interpreters.destroy, args=(interpr,)))
    threads.append(Thread(target=_interpreters.get_current, args=()))
    threads.append(Thread(target=_interpreters.get_main, args=()))
    threads.append(Thread(target=_interpreters.destroy, args=(interpr,)))
    threads.append(Thread(target=_interpreters.run_string, args=(interpr, "f()",)))

for thread in threads:
    try:
        print("START", thread)
        thread.start()
    except Exception:
        pass

for thread in threads:
    try:
        print("JOIN", thread)
        thread.join()
    except Exception:
        pass

Backtrace:

#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140729519150656) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140729519150656) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=140729519150656, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007ffff7ce0476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7cc67f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x00007ffff7cc671b in __assert_fail_base (fmt=0x7ffff7e7b130 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
    assertion=0x555555aceaf4 "heap->size > 0", file=0x555555aceae0 "Python/index_pool.c", line=92, function=<optimized out>)
    at ./assert/assert.c:92
#6  0x00007ffff7cd7e96 in __GI___assert_fail (assertion=assertion@entry=0x555555aceaf4 "heap->size > 0",
    file=file@entry=0x555555aceae0 "Python/index_pool.c", line=line@entry=92,
    function=function@entry=0x555555aceb40 <__PRETTY_FUNCTION__.1> "heap_pop") at ./assert/assert.c:101
#7  0x00005555558caa1c in heap_pop (heap=<optimized out>) at Python/index_pool.c:92
#8  0x00005555558cac33 in _PyIndexPool_AllocIndex (pool=pool@entry=0x7ffff780f498) at Python/index_pool.c:173
#9  0x000055555568d9f0 in _Py_ReserveTLBCIndex (interp=interp@entry=0x7ffff780b020) at Objects/codeobject.c:2752
#10 0x0000555555951c33 in new_threadstate (interp=interp@entry=0x7ffff780b020, whence=whence@entry=5) at Python/pystate.c:1516
#11 0x0000555555953994 in _PyThreadState_NewBound (interp=interp@entry=0x7ffff780b020, whence=whence@entry=5)
    at Python/pystate.c:1578
#12 0x0000555555896ee9 in _enter_session (session=session@entry=0x7ffe24ff8740, interp=interp@entry=0x7ffff780b020)
    at Python/crossinterp.c:1548
#13 0x000055555589b9fb in _PyXI_Enter (session=session@entry=0x7ffe24ff8740, interp=interp@entry=0x7ffff780b020,
    nsupdates=nsupdates@entry=0x0) at Python/crossinterp.c:1711
#14 0x00007ffff7c3d217 in _run_in_interpreter (interp=interp@entry=0x7ffff780b020, codestr=0x200006182c8 "f()", codestrlen=3,
    shareables=shareables@entry=0x0, flags=1, p_excinfo=p_excinfo@entry=0x7ffe24ff8870) at ./Modules/_interpretersmodule.c:461
#15 0x00007ffff7c3d3ef in _interp_exec (self=self@entry=<module at remote 0x20000778b30>, interp=interp@entry=0x7ffff780b020,
    code_arg=<optimized out>, shared_arg=0x0, p_excinfo=p_excinfo@entry=0x7ffe24ff8870) at ./Modules/_interpretersmodule.c:950
#16 0x00007ffff7c3dc6b in interp_run_string (self=<module at remote 0x20000778b30>, args=args@entry=(9, 'f()'), kwds=kwds@entry={})
    at ./Modules/_interpretersmodule.c:1110
#17 0x000055555570145a in cfunction_call (func=func@entry=<built-in method run_string of module object at remote 0x20000778b30>,
    args=args@entry=(9, 'f()'), kwargs=kwargs@entry={}) at Objects/methodobject.c:551
#18 0x000055555567f69a in _PyObject_Call (tstate=0x5555560c09d0,
    callable=callable@entry=<built-in method run_string of module object at remote 0x20000778b30>, args=args@entry=(9, 'f()'),
    kwargs=kwargs@entry={}) at Objects/call.c:361

This code most usually results in one of the following errors, from most common to rarest:

  1. tstate_delete_common assertion, which seems to indicate a thread creation failed, see Crash at finalization after fail to start new thread #109746.
python: Python/pystate.c:1739: tstate_delete_common: Assertion `tstate->_status.cleared && !tstate->_status.finalized' failed.
  1. Fatal Python error, seems to be Crash Due to Exception in threading._shutdown() #113148.
Fatal Python error: Py_EndInterpreter: not the last thread
Python runtime state: initialized

Current thread 0x00007f09af37e640 (most recent call first):
  <no Python frame>

Thread 0x00007f09b0380640 (most recent call first):
  <no Python frame>
Aborted

Given that, running it multiple times seems necessary to trigger the correct abort in heap_pop. The affected code seems to have been included in #123926, so cc @mpage.

Found using fusil by @vstinner.

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

Python 3.14.0a1+ experimental free-threading build (heads/main-dirty:54c63a32d0, Nov 8 2024, 20:16:36) [GCC 11.4.0]

Linked PRs

@devdanzin devdanzin added the type-crash A hard crash of the interpreter, possibly with a core dump label Nov 10, 2024
@ZeroIntensity ZeroIntensity added extension-modules C modules in the Modules dir 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Nov 10, 2024
@ZeroIntensity ZeroIntensity self-assigned this Nov 10, 2024
@ZeroIntensity
Copy link
Member

Confirmed on main. I'll try and deal with this today.

@picnixz picnixz changed the title Failed assertion in index_pool.c: heap_pop with interpreters and threads aborts free-threading build Failed assertion in index_pool.c: heap_pop with interpreters and threads aborts on free-threading build Nov 10, 2024
@ZeroIntensity
Copy link
Member

There are two bugs here. First, the initial check for if the subinterpreter is running in _interpreters.destroy doesn't account for the fact that another thread could start running it on the free-threaded build. That's a relatively simple fix, though--just set up some synchronization and prevent interpreters from starting once destroy has started.

The much more difficult issue is how we capture exceptions. Because we need to send the object to the caller, we have to detach from the subinterpreter, and then access it from the calling interpreter, which isn't thread safe, even on the GIL-ful build. The moment we detach from the thread state it's fair game for another thread to try and attach to it, and overwrite the exception that we want. There are a few possible fixes, and I'm not sure which one I want to do yet:

  • Add a big ugly lock around interpreter switching to any interpreter (will contend very badly).
  • Add an equally ugly lock to _interpreters (as in, marking it as needing the GIL).
  • Add a per-interpreter lock for switching. This is what I'm leaning towards, but it hurts the C API, because exception capturing happens outside of _PyXI_Enter and _PyXI_Exit, so those APIs will need a lock to call them, foiling my plans to make them public.
  • Capture exceptions before detaching the thread, serialize them and store them on the crossinterpreter session, and then deserialize once we're back in the calling interpreter (will be a PITA to implement).

@ZeroIntensity ZeroIntensity changed the title Failed assertion in index_pool.c: heap_pop with interpreters and threads aborts on free-threading build Interpreter exception capturing is not thread safe Nov 10, 2024
@ZeroIntensity ZeroIntensity changed the title Interpreter exception capturing is not thread safe Subinterpreter exception capturing is not thread safe Nov 10, 2024
@ZeroIntensity
Copy link
Member

OK, it looks like the exception-capturing problem is less severe than I thought. We only rely on the state of the interpreter for _PyXI_ERR_ALREADY_RUNNING, so the easy fix is to just manually raise the error there. The much bigger problem is that it seems subinterpreters just isn't meant to run on the free-threaded build. I guess I'll do what I can to help deal with what.

@ZeroIntensity ZeroIntensity changed the title Subinterpreter exception capturing is not thread safe _interpreters is not thread safe on the free-threaded build Nov 11, 2024
@colesbury
Copy link
Contributor

Because we need to send the object to the caller, we have to detach from the subinterpreter, and then access it from the calling interpreter, which isn't thread safe, even on the GIL-ful build

Can you expand on this? Accessing Python objects from another interpreter is not safe in the free-threaded build, even with a lock. The GC and some other code paths assume that interpreter's and their objects are isolated.

@ZeroIntensity
Copy link
Member

I was wrong with my analysis. I thought originally that we just moved the exception between the interpreter, but instead it gets stored and serialized. The issue right now is that _PyXI_ERR_ALREADY_RUNNING uses _PyInterpreterState_FailIfRunningMain, which isn't right because another thread could start and stop the interpreter in the meantime. I just fixed it on my end by manually raising the interpreter error and not using that function.

The much bigger concern I have is that it seems _PyInterpreterState_SetRunningMain and _PyInterpreterState_SetNotRunningMain aren't thread-safe. On the free-threaded build, I'm assuming there's no waiting for _PyThreadState_Attach, because there's no GIL? If that's the case, then a few of the interpreter state functions need to get some locking. (For example, _PyInterpreterState_SetRunningMain checks if the main thread state has already been set, and then sets the current thread state, but on the free-threaded build it's possible for it to start running after that check, and then overwrite the old thread state, which causes a fatal error upon finalization.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3.13 bugs and security fixes 3.14 new features, bugs and security fixes extension-modules C modules in the Modules dir topic-free-threading topic-subinterpreters type-crash A hard crash of the interpreter, possibly with a core dump
Projects
Status: Todo
Development

No branches or pull requests

4 participants