Make _Py_TryIncref public as an unstable API as PyUnstable_TryIncref() #128844
Comments
Since Py_INCREF() and Py_NewRef() omit "PyObject" in their name, I suggest
Would you mind elaborating on this function? What is its API? What is its usage?
One thing is not clear to me. I suspect there's a mechanism to prevent that in free-threaded builds, but I'd like to know what it is and how it applies.
When called on an object, The API is a bit of a "wart", but I don't see a way to avoid it.

    // Enable subsequent calls to `PyUnstable_TryIncref` on `op`
    void PyUnstable_Object_EnableTryIncRef(PyObject *op);
In the asyncio and pybind11 use cases, there are already locks during deallocation and around the Here's an example in pseudo-code based on pybind11's use case. Note that when The

    // map from C++ pointers to Python wrapper objects
    std::unordered_map<void *, PyObject *> registered_instances;
    std::mutex mutex;

    // Find an existing Python wrapper for a C++ pointer
    PyObject *
    get_wrapper(void *ptr)
    {
        std::unique_lock<std::mutex> lk(mutex);
        if (auto it = registered_instances.find(ptr); it != registered_instances.end()) {
            PyObject *wrapper = it->second;
            if (_Py_TryIncref(wrapper)) {  // NOTE: `Py_INCREF()` here would not be safe in the free threading build!
                return wrapper;
            }
        }
        // No wrapper registered, or the registered wrapper is already being deallocated
        return nullptr;
    }

    void
    wrapper_dealloc(PyObject *wrapper)
    {
        // Same mutex as in get_wrapper(), so the erase cannot interleave with a lookup
        std::unique_lock<std::mutex> lk(mutex);
        void *ptr = get_cpp_ptr_from_wrapper(wrapper);
        registered_instances.erase(ptr);
        ...
    }
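To make the "try" part of the example above concrete, here is a minimal, self-contained sketch of a try-increment on an atomic counter. It is an illustration only and does not reproduce CPython's actual reference-count layout or implementation; `FakeObject`, `try_incref`, and `decref` are hypothetical names invented for this sketch. The point it shows is that the increment is refused once the count has reached zero, which is the property a plain increment lacks.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an object header; NOT CPython's real layout.
struct FakeObject {
    std::atomic<std::int64_t> refcount{1};
};

// Take a new strong reference only if the object is still alive.
// Returns true on success, false if the count had already reached zero.
inline bool try_incref(FakeObject &obj) {
    std::int64_t seen = obj.refcount.load(std::memory_order_relaxed);
    while (seen > 0) {
        // On failure, compare_exchange_weak stores the current value in `seen`,
        // so the loop condition re-checks for zero before retrying.
        if (obj.refcount.compare_exchange_weak(seen, seen + 1,
                                               std::memory_order_acquire,
                                               std::memory_order_relaxed)) {
            return true;
        }
    }
    return false;
}

// Drop a strong reference. Returns true if this was the last one, meaning the
// caller is now responsible for running the destructor.
inline bool decref(FakeObject &obj) {
    return obj.refcount.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

In this sketch a caller that gets `false` from `try_incref` treats the object as already gone, which mirrors how `get_wrapper` above falls through when `_Py_TryIncref` fails.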
Ok, I'm fine with adding these two PyUnstable functions.
Title changed from "Make _Py_TryIncref public as an unstable API as PyUnstable_Object_TryIncref()" to "Make _Py_TryIncref public as an unstable API as PyUnstable_TryIncref()"
Thank you for the example! It makes sense to me as well. +1.
This exposes `_Py_TryIncref` as `PyUnstable_TryIncref()` and the helper function `_PyObject_SetMaybeWeakref` as `PyUnstable_EnableTryIncRef`. These are helpers for dealing with unowned references in a safe way, particularly in the free threading build. Co-authored-by: Petr Viktorin <encukou@gmail.com>
Joint work with @vfdev-5 We found the following TSAN race report in JAX's CI: jax-ml/jax#28551 ``` WARNING: ThreadSanitizer: data race (pid=35893) Read of size 1 at 0x7fffca320cb9 by thread T57 (mutexes: read M0): #0 mlir::python::PyOperation::checkValid() const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1300:8 (libjax_common.so+0x41e8b1d) (BuildId: 55242ad732cdae54) #1 mlir::python::populateIRCore(nanobind::module_&)::$_57::operator()(mlir::python::PyOperationBase&) const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:3221:40 (libjax_common.so+0x41e8b1d) llvm#2 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:275:24 (libjax_common.so+0x41e8b1d) llvm#3 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, 
nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:219:14 (libjax_common.so+0x41e8b1d) ... Previous write of size 1 at 0x7fffca320cb9 by thread T56 (mutexes: read M0): #0 mlir::python::PyOperation::setInvalid() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRModule.h:729:29 (libjax_common.so+0x419f012) (BuildId: 55242ad732cdae54) #1 mlir::python::PyMlirContext::clearOperation(MlirOperation) /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:741:10 (libjax_common.so+0x419f012) llvm#2 mlir::python::PyOperation::~PyOperation() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1213:19 (libjax_common.so+0x41a414b) (BuildId: 55242ad732cdae54) llvm#3 void nanobind::detail::wrap_destruct<mlir::python::PyOperation>(void*) /proc/self/cwd/external/nanobind/include/nanobind/nb_class.h:245:21 (libjax_common.so+0x41ecf21) (BuildId: 55242ad732cdae54) llvm#4 nanobind::detail::inst_dealloc(_object*) /proc/self/cwd/external/nanobind/src/nb_type.cpp:255:13 (libjax_common.so+0x3284136) (BuildId: 55242ad732cdae54) llvm#5 _Py_Dealloc /project/cpython/Objects/object.c:3025:5 (python3.14+0x2a2422) (BuildId: 6051e096a967bdf49efb15da94a67d8eff710a9b) llvm#6 _Py_MergeZeroLocalRefcount /project/cpython/Objects/object.c (python3.14+0x2a2422) llvm#7 Py_DECREF(_object*) /proc/self/cwd/external/python_x86_64-unknown-linux-gnu-freethreaded/include/python3.14t/refcount.h:387:13 (libjax_common.so+0x41aaadc) (BuildId: 55242ad732cdae54) ... ``` At the simplest level, the `valid` field of a PyOperation must be protected by a lock, because it may be concurrently accessed from multiple threads. Much more interesting, however is how we get into the situation described by the two stack traces above in the first place. The scenario that triggers this is the following: * thread T56 holds the last Python reference on a PyOperation, and decides to release it. 
* After T56 starts to release its reference, but before T56 removes the PyOperation from the liveOperations map, a second thread T57 comes along and looks up the same MlirOperation in the liveOperations map.
* Finding the operation to be present, thread T57 increments the reference count of that PyOperation and returns it to the caller. This is illegal! Python is in the process of calling the destructor of that object, and once an object is in that state it cannot be safely revived.

To fix this, whenever we increment the reference count of a PyOperation that we found via the liveOperations map and to which we only hold a non-owning reference, we must use the Python 3.14+ API `PyUnstable_TryIncRef`, which exists precisely for this scenario (python/cpython#128844). That API does not exist under Python 3.13, so we need a backport of it in that case, for which we use the backport that both nanobind and pybind11 also use.

Fixes jax-ml/jax#28551
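The interleaving described in this commit message can be replayed deterministically in a small model. The sketch below is an assumption-laden illustration, not the MLIR or nanobind code: `Wrapper`, `LiveMap`, and `try_incref` are hypothetical names, and a refcount of zero stands in for "Python has started running the destructor". It shows that a lookup which uses a try-increment returns nothing for a dying entry instead of reviving it.

```cpp
#include <cassert>
#include <mutex>
#include <unordered_map>

// Hypothetical wrapper object; `refcount == 0` models "deallocation has begun".
struct Wrapper {
    long refcount = 1;
};

// Models the try-incref contract: succeed only while the object is alive.
// A plain `long` is enough here because the replay is single-threaded.
inline bool try_incref(Wrapper *w) {
    if (w->refcount == 0) {
        return false;
    }
    ++w->refcount;
    return true;
}

// Hypothetical map of live wrappers holding non-owning pointers.
class LiveMap {
public:
    void add(const void *key, Wrapper *w) {
        std::lock_guard<std::mutex> lk(mutex_);
        live_[key] = w;
    }

    // Called from the wrapper's destructor path.
    void remove(const void *key) {
        std::lock_guard<std::mutex> lk(mutex_);
        live_.erase(key);
    }

    // Returns a new strong reference, or nullptr if the entry is absent or
    // its wrapper is already being destroyed.
    Wrapper *lookup(const void *key) {
        std::lock_guard<std::mutex> lk(mutex_);
        auto it = live_.find(key);
        if (it == live_.end()) {
            return nullptr;
        }
        return try_incref(it->second) ? it->second : nullptr;
    }

private:
    std::mutex mutex_;
    std::unordered_map<const void *, Wrapper *> live_;
};
```

A caller that receives `nullptr` would create a fresh wrapper, exactly as it would for a key that was never registered. Replaying the T56/T57 scenario means dropping the count to zero before calling `remove`, then checking that `lookup` refuses the entry.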
Joint work with @vfdev-5 We found the following TSAN race report in JAX's CI: jax-ml/jax#28551 ``` WARNING: ThreadSanitizer: data race (pid=35893) Read of size 1 at 0x7fffca320cb9 by thread T57 (mutexes: read M0): #0 mlir::python::PyOperation::checkValid() const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1300:8 (libjax_common.so+0x41e8b1d) (BuildId: 55242ad732cdae54) #1 mlir::python::populateIRCore(nanobind::module_&)::$_57::operator()(mlir::python::PyOperationBase&) const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:3221:40 (libjax_common.so+0x41e8b1d) llvm#2 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:275:24 (libjax_common.so+0x41e8b1d) llvm#3 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, 
nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:219:14 (libjax_common.so+0x41e8b1d) ... Previous write of size 1 at 0x7fffca320cb9 by thread T56 (mutexes: read M0): #0 mlir::python::PyOperation::setInvalid() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRModule.h:729:29 (libjax_common.so+0x419f012) (BuildId: 55242ad732cdae54) #1 mlir::python::PyMlirContext::clearOperation(MlirOperation) /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:741:10 (libjax_common.so+0x419f012) llvm#2 mlir::python::PyOperation::~PyOperation() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1213:19 (libjax_common.so+0x41a414b) (BuildId: 55242ad732cdae54) llvm#3 void nanobind::detail::wrap_destruct<mlir::python::PyOperation>(void*) /proc/self/cwd/external/nanobind/include/nanobind/nb_class.h:245:21 (libjax_common.so+0x41ecf21) (BuildId: 55242ad732cdae54) llvm#4 nanobind::detail::inst_dealloc(_object*) /proc/self/cwd/external/nanobind/src/nb_type.cpp:255:13 (libjax_common.so+0x3284136) (BuildId: 55242ad732cdae54) llvm#5 _Py_Dealloc /project/cpython/Objects/object.c:3025:5 (python3.14+0x2a2422) (BuildId: 6051e096a967bdf49efb15da94a67d8eff710a9b) llvm#6 _Py_MergeZeroLocalRefcount /project/cpython/Objects/object.c (python3.14+0x2a2422) llvm#7 Py_DECREF(_object*) /proc/self/cwd/external/python_x86_64-unknown-linux-gnu-freethreaded/include/python3.14t/refcount.h:387:13 (libjax_common.so+0x41aaadc) (BuildId: 55242ad732cdae54) ... ``` At the simplest level, the `valid` field of a PyOperation must be protected by a lock, because it may be concurrently accessed from multiple threads. Much more interesting, however is how we get into the situation described by the two stack traces above in the first place. The scenario that triggers this is the following: * thread T56 holds the last Python reference on a PyOperation, and decides to release it. 
* After T56 starts to release its reference, but before T56 removes the PyOperation from the liveOperations map a second thread T57 comes along and looks up the same MlirOperation in the liveOperations map. * Finding the operation to be present, thread T57 increments the reference count of that PyOperation and returns it to the caller. This is illegal! Python is in the process of calling the destructor of that object, and once an object is in that state it cannot be safely revived. To fix this, whenever we increment the reference count of a PyOperation that we found via the liveOperations map and to which we only hold a non-owning reference, we must use the Python 3.14+ API `PyUnstable_TryIncRef`, which exists precisely for this scenario (python/cpython#128844). That API does not exist under Python 3.13, so we need a backport of it in that case, for which we the backport that both nanobind and pybind11 also use. Fixes jax-ml/jax#28551
Joint work with @vfdev-5 We found the following TSAN race report in JAX's CI: jax-ml/jax#28551 ``` WARNING: ThreadSanitizer: data race (pid=35893) Read of size 1 at 0x7fffca320cb9 by thread T57 (mutexes: read M0): #0 mlir::python::PyOperation::checkValid() const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1300:8 (libjax_common.so+0x41e8b1d) (BuildId: 55242ad732cdae54) #1 mlir::python::populateIRCore(nanobind::module_&)::$_57::operator()(mlir::python::PyOperationBase&) const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:3221:40 (libjax_common.so+0x41e8b1d) llvm#2 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:275:24 (libjax_common.so+0x41e8b1d) llvm#3 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, 
nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:219:14 (libjax_common.so+0x41e8b1d) ... Previous write of size 1 at 0x7fffca320cb9 by thread T56 (mutexes: read M0): #0 mlir::python::PyOperation::setInvalid() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRModule.h:729:29 (libjax_common.so+0x419f012) (BuildId: 55242ad732cdae54) #1 mlir::python::PyMlirContext::clearOperation(MlirOperation) /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:741:10 (libjax_common.so+0x419f012) llvm#2 mlir::python::PyOperation::~PyOperation() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1213:19 (libjax_common.so+0x41a414b) (BuildId: 55242ad732cdae54) llvm#3 void nanobind::detail::wrap_destruct<mlir::python::PyOperation>(void*) /proc/self/cwd/external/nanobind/include/nanobind/nb_class.h:245:21 (libjax_common.so+0x41ecf21) (BuildId: 55242ad732cdae54) llvm#4 nanobind::detail::inst_dealloc(_object*) /proc/self/cwd/external/nanobind/src/nb_type.cpp:255:13 (libjax_common.so+0x3284136) (BuildId: 55242ad732cdae54) llvm#5 _Py_Dealloc /project/cpython/Objects/object.c:3025:5 (python3.14+0x2a2422) (BuildId: 6051e096a967bdf49efb15da94a67d8eff710a9b) llvm#6 _Py_MergeZeroLocalRefcount /project/cpython/Objects/object.c (python3.14+0x2a2422) llvm#7 Py_DECREF(_object*) /proc/self/cwd/external/python_x86_64-unknown-linux-gnu-freethreaded/include/python3.14t/refcount.h:387:13 (libjax_common.so+0x41aaadc) (BuildId: 55242ad732cdae54) ... ``` At the simplest level, the `valid` field of a PyOperation must be protected by a lock, because it may be concurrently accessed from multiple threads. Much more interesting, however is how we get into the situation described by the two stack traces above in the first place. The scenario that triggers this is the following: * thread T56 holds the last Python reference on a PyOperation, and decides to release it. 
* After T56 starts to release its reference, but before T56 removes the PyOperation from the liveOperations map a second thread T57 comes along and looks up the same MlirOperation in the liveOperations map. * Finding the operation to be present, thread T57 increments the reference count of that PyOperation and returns it to the caller. This is illegal! Python is in the process of calling the destructor of that object, and once an object is in that state it cannot be safely revived. To fix this, whenever we increment the reference count of a PyOperation that we found via the liveOperations map and to which we only hold a non-owning reference, we must use the Python 3.14+ API `PyUnstable_TryIncRef`, which exists precisely for this scenario (python/cpython#128844). That API does not exist under Python 3.13, so we need a backport of it in that case, for which we the backport that both nanobind and pybind11 also use. Fixes jax-ml/jax#28551
Joint work with @vfdev-5 We found the following TSAN race report in JAX's CI: jax-ml/jax#28551 ``` WARNING: ThreadSanitizer: data race (pid=35893) Read of size 1 at 0x7fffca320cb9 by thread T57 (mutexes: read M0): #0 mlir::python::PyOperation::checkValid() const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1300:8 (libjax_common.so+0x41e8b1d) (BuildId: 55242ad732cdae54) #1 mlir::python::populateIRCore(nanobind::module_&)::$_57::operator()(mlir::python::PyOperationBase&) const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:3221:40 (libjax_common.so+0x41e8b1d) llvm#2 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:275:24 (libjax_common.so+0x41e8b1d) llvm#3 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, 
nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:219:14 (libjax_common.so+0x41e8b1d) ... Previous write of size 1 at 0x7fffca320cb9 by thread T56 (mutexes: read M0): #0 mlir::python::PyOperation::setInvalid() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRModule.h:729:29 (libjax_common.so+0x419f012) (BuildId: 55242ad732cdae54) #1 mlir::python::PyMlirContext::clearOperation(MlirOperation) /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:741:10 (libjax_common.so+0x419f012) llvm#2 mlir::python::PyOperation::~PyOperation() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1213:19 (libjax_common.so+0x41a414b) (BuildId: 55242ad732cdae54) llvm#3 void nanobind::detail::wrap_destruct<mlir::python::PyOperation>(void*) /proc/self/cwd/external/nanobind/include/nanobind/nb_class.h:245:21 (libjax_common.so+0x41ecf21) (BuildId: 55242ad732cdae54) llvm#4 nanobind::detail::inst_dealloc(_object*) /proc/self/cwd/external/nanobind/src/nb_type.cpp:255:13 (libjax_common.so+0x3284136) (BuildId: 55242ad732cdae54) llvm#5 _Py_Dealloc /project/cpython/Objects/object.c:3025:5 (python3.14+0x2a2422) (BuildId: 6051e096a967bdf49efb15da94a67d8eff710a9b) llvm#6 _Py_MergeZeroLocalRefcount /project/cpython/Objects/object.c (python3.14+0x2a2422) llvm#7 Py_DECREF(_object*) /proc/self/cwd/external/python_x86_64-unknown-linux-gnu-freethreaded/include/python3.14t/refcount.h:387:13 (libjax_common.so+0x41aaadc) (BuildId: 55242ad732cdae54) ... ``` At the simplest level, the `valid` field of a PyOperation must be protected by a lock, because it may be concurrently accessed from multiple threads. Much more interesting, however is how we get into the situation described by the two stack traces above in the first place. The scenario that triggers this is the following: * thread T56 holds the last Python reference on a PyOperation, and decides to release it. 
* After T56 starts to release its reference, but before T56 removes the PyOperation from the liveOperations map a second thread T57 comes along and looks up the same MlirOperation in the liveOperations map. * Finding the operation to be present, thread T57 increments the reference count of that PyOperation and returns it to the caller. This is illegal! Python is in the process of calling the destructor of that object, and once an object is in that state it cannot be safely revived. To fix this, whenever we increment the reference count of a PyOperation that we found via the liveOperations map and to which we only hold a non-owning reference, we must use the Python 3.14+ API `PyUnstable_TryIncRef`, which exists precisely for this scenario (python/cpython#128844). That API does not exist under Python 3.13, so we need a backport of it in that case, for which we the backport that both nanobind and pybind11 also use. Fixes jax-ml/jax#28551
Joint work with @vfdev-5 We found the following TSAN race report in JAX's CI: jax-ml/jax#28551 ``` WARNING: ThreadSanitizer: data race (pid=35893) Read of size 1 at 0x7fffca320cb9 by thread T57 (mutexes: read M0): #0 mlir::python::PyOperation::checkValid() const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1300:8 (libjax_common.so+0x41e8b1d) (BuildId: 55242ad732cdae54) #1 mlir::python::populateIRCore(nanobind::module_&)::$_57::operator()(mlir::python::PyOperationBase&) const /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:3221:40 (libjax_common.so+0x41e8b1d) llvm#2 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::operator()(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*) const /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:275:24 (libjax_common.so+0x41e8b1d) llvm#3 _object* nanobind::detail::func_create<true, true, mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef, mlir::python::PyOperationBase&, 0ul, nanobind::is_method, nanobind::is_getter, nanobind::rv_policy>(mlir::python::populateIRCore(nanobind::module_&)::$_57&, MlirStringRef (*)(mlir::python::PyOperationBase&), std::integer_sequence<unsigned long, 0ul>, nanobind::is_method const&, nanobind::is_getter const&, nanobind::rv_policy const&)::'lambda'(void*, _object**, unsigned char*, nanobind::rv_policy, nanobind::detail::cleanup_list*)::__invoke(void*, _object**, unsigned char*, 
nanobind::rv_policy, nanobind::detail::cleanup_list*) /proc/self/cwd/external/nanobind/include/nanobind/nb_func.h:219:14 (libjax_common.so+0x41e8b1d) ... Previous write of size 1 at 0x7fffca320cb9 by thread T56 (mutexes: read M0): #0 mlir::python::PyOperation::setInvalid() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRModule.h:729:29 (libjax_common.so+0x419f012) (BuildId: 55242ad732cdae54) #1 mlir::python::PyMlirContext::clearOperation(MlirOperation) /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:741:10 (libjax_common.so+0x419f012) llvm#2 mlir::python::PyOperation::~PyOperation() /proc/self/cwd/external/llvm-project/mlir/lib/Bindings/Python/IRCore.cpp:1213:19 (libjax_common.so+0x41a414b) (BuildId: 55242ad732cdae54) llvm#3 void nanobind::detail::wrap_destruct<mlir::python::PyOperation>(void*) /proc/self/cwd/external/nanobind/include/nanobind/nb_class.h:245:21 (libjax_common.so+0x41ecf21) (BuildId: 55242ad732cdae54) llvm#4 nanobind::detail::inst_dealloc(_object*) /proc/self/cwd/external/nanobind/src/nb_type.cpp:255:13 (libjax_common.so+0x3284136) (BuildId: 55242ad732cdae54) llvm#5 _Py_Dealloc /project/cpython/Objects/object.c:3025:5 (python3.14+0x2a2422) (BuildId: 6051e096a967bdf49efb15da94a67d8eff710a9b) llvm#6 _Py_MergeZeroLocalRefcount /project/cpython/Objects/object.c (python3.14+0x2a2422) llvm#7 Py_DECREF(_object*) /proc/self/cwd/external/python_x86_64-unknown-linux-gnu-freethreaded/include/python3.14t/refcount.h:387:13 (libjax_common.so+0x41aaadc) (BuildId: 55242ad732cdae54) ... ``` At the simplest level, the `valid` field of a PyOperation must be protected by a lock, because it may be concurrently accessed from multiple threads. Much more interesting, however is how we get into the situation described by the two stack traces above in the first place. The scenario that triggers this is the following: * thread T56 holds the last Python reference on a PyOperation, and decides to release it. 
* After T56 starts to release its reference, but before T56 removes the PyOperation from the liveOperations map a second thread T57 comes along and looks up the same MlirOperation in the liveOperations map. * Finding the operation to be present, thread T57 increments the reference count of that PyOperation and returns it to the caller. This is illegal! Python is in the process of calling the destructor of that object, and once an object is in that state it cannot be safely revived. To fix this, whenever we increment the reference count of a PyOperation that we found via the liveOperations map and to which we only hold a non-owning reference, we must use the Python 3.14+ API `PyUnstable_TryIncRef`, which exists precisely for this scenario (python/cpython#128844). That API does not exist under Python 3.13, so we need a backport of it in that case, for which we the backport that both nanobind and pybind11 also use. Fixes jax-ml/jax#28551
Feature or enhancement
We should make `_Py_TryIncref` public as a function with the following signature:
EDIT: Renamed to `PyUnstable_TryIncref` in accordance with Victor's suggestion.
The function increments the reference count if it's not zero in a thread-safe way. It's logically equivalent to the following snippet and in the default (GIL-enabled) build it's implemented as such:
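A minimal sketch of these semantics, assuming the function takes an object pointer and reports success as a boolean. `ToyObject` and both function names are hypothetical, and the atomic version is a generic compare-exchange loop, not CPython's biased reference counting implementation.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for an object header; only the reference count matters here.
struct ToyObject {
    std::atomic<std::int64_t> refcnt{1};
};

// GIL-style form: check, then increment. Only correct when no other thread
// can change the count between the check and the increment.
inline bool try_incref_gil(ToyObject &op) {
    if (op.refcnt.load(std::memory_order_relaxed) > 0) {
        op.refcnt.fetch_add(1, std::memory_order_relaxed);
        return true;
    }
    return false;
}

// Thread-safe form: a compare-exchange loop that never moves the count off zero.
inline bool try_incref_atomic(ToyObject &op) {
    std::int64_t cur = op.refcnt.load(std::memory_order_relaxed);
    while (cur > 0) {
        // On failure, `cur` is refreshed with the current value and re-checked.
        if (op.refcnt.compare_exchange_weak(cur, cur + 1,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed)) {
            return true;
        }
    }
    return false;
}
```

Both forms agree when only one thread touches the object. They differ when another thread drops the last reference between the check and the increment: the first form may then revive an object that is already being destroyed, while the second fails cleanly.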
Additionally, we should make `_PyObject_SetMaybeWeakref` public as `PyUnstable_Object_EnableTryIncRef`. This function has no equivalent in the GIL-enabled build (it's a no-op), but it's important for making `TryIncref` work reliably with our biased reference counting implementation.
Motivation
The `TryIncref` primitive is a building block for handling borrowed and unowned references. It addresses an issue that generally cannot be solved by adding extra synchronization like mutexes because it handles the race between the reference count reaching zero (which is outside developers' control) and the `TryIncref`.
We use it internally in three subsystems:
`PyObject *` entries.
Recently, we discovered a thread safety bug in pybind11 related to the use of borrowed/unowned references. Using `_Py_TryIncref` in place of `Py_INCREF` would fix the bug. I think nanobind probably has a similar issue.
Alternatives
`PyWeakRef` objects increases the overhead of pybind11 bindings by 30% in some simple tests.
`_Py_TryIncref` in extensions. I think this is much worse than making the function public as an unstable API because it requires direct access to the reference count fields -- the implementation is tied to the implementation of biased reference counting -- and I'd like to avoid extensions depending directly on those details.
See also
`Py_INCREF` for the free-threaded build #113920
Linked PRs
`_Py_TryIncref` public as an unstable API. #128926