Timeout generic callbacks used in destructors and inflight request cancelation #123
Destructors may be called from Python's garbage collector, which prevents us from releasing the GIL for that operation. When that happens, a deadlock may occur because other threads compete for the GIL, such as the progress thread when a Python listener callback is invoked and executed in that thread.
`ucxx::Endpoint` and `ucxx::Listener` destructors
Some minor cleanups (to avoid excessive looping when `maxAttempts > 1`) and a suggestion of how to implement a timeout for the spinlock case.
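One way a timeout for a spinning wait could look is sketched below. This is not the code suggested in the review, which is not shown here; `spinWaitFor`, its nanosecond unit, and the convention that a timeout of 0 means "spin forever" are assumptions made for illustration only.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper, not part of UCXX: spin until `flag` becomes true or
// `timeoutNs` nanoseconds have elapsed. A timeout of 0 is assumed to mean
// "spin forever". Returns true if the flag was observed set, false on timeout.
inline bool spinWaitFor(const std::atomic<bool>& flag, uint64_t timeoutNs)
{
  const auto deadline =
    std::chrono::steady_clock::now() + std::chrono::nanoseconds(timeoutNs);

  while (!flag.load(std::memory_order_acquire)) {
    // Only check the deadline when a finite timeout was requested.
    if (timeoutNs > 0 && std::chrono::steady_clock::now() >= deadline) return false;
    // Give other threads (for example, the one that sets the flag) a chance to run.
    std::this_thread::yield();
  }
  return true;
}
```

Returning a `bool` instead of blocking lets the caller decide what to do when the flag is never set, such as retrying a bounded number of times or giving up.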
I was trying to think about how we could drop the GIL (say) when destructing in Python, but I am not sure if it is possible.
I wonder if it is also a good idea to explicitly release the GIL when dropping the C++ `shared_ptr` attributes of UCXX Python cdef classes. I think something like:
(For example) Makes sure that (if this is the last reference to the
WDYT? I think this change is worthwhile anyway, so I can propose it in a separate PR.
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Sorry, I missed this reply earlier. I think this is a great idea, it might actually achieve what we were attempting, releasing the GIL at destruction time, and
Thanks Peter
/merge
Thanks for the review @wence-!
#123 introduced timeouts to the generic callbacks, preventing failures to acquire the lock due to GIL competition. However, those were not exposed to Python, and that is at least one of the reasons it still times out. Notice how the default `period=0` (never unblock) is used:

```cpp
Thread 1 (Thread 0x7f36d675f740 (LWP 155586) "pytest"):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fff45058e58) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x7fff45058e08, cond=0x7fff45058e30) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x7fff45058e30, mutex=0x7fff45058e08) at pthread_cond_wait.c:647
#3 0x00007f36d43634d4 in std::condition_variable::wait<ucxx::utils::CallbackNotifier::wait(uint64_t)::<lambda()> > (__p=..., __lock=..., this=0x7fff45058e30) at /opt/conda/envs/test/x86_64-conda-linux-gnu/include/c++/11.4.0/condition_variable:103
#4 ucxx::utils::CallbackNotifier::wait (this=this@entry=0x7fff45058e00, period=period@entry=0) at /datasets/pentschev/src/ucxx-deadlock/cpp/src/utils/callback_notifier.cpp:66
#5 0x00007f36d43470e1 in ucxx::Endpoint::close (this=0x7f369c701a90, period=0, maxAttempts=1) at /datasets/pentschev/src/ucxx-deadlock/cpp/src/endpoint.cpp:171
#6 0x00007f36d4753381 in __pyx_pw_4ucxx_4_lib_7libucxx_11UCXEndpoint_9close(_object*, _object* const*, long, _object*) () from /opt/conda/envs/test/lib/python3.10/site-packages/ucxx/_lib/libucxx.cpython-310-x86_64-linux-gnu.so
```

This PR exposes those arguments to Python and specifies a default for the Python async API `Endpoint.abort()` to prevent such deadlocks from occurring.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: #136
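To illustrate why `period=0` blocks forever while a non-zero period does not, here is a minimal sketch of a condition-variable wait with an optional timeout and a bounded retry loop. It is not the UCXX implementation: `TimedNotifier`, `waitWithAttempts`, and the nanosecond unit are hypothetical and only mirror the `period` and `maxAttempts` arguments seen in the backtrace.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a callback notifier; not the UCXX class.
class TimedNotifier {
 public:
  // Called by the thread that ran the callback once it has finished.
  void set()
  {
    {
      std::lock_guard<std::mutex> lock(_mutex);
      _done = true;
    }
    _cv.notify_all();
  }

  // Wait until set() has been called. `periodNs == 0` blocks indefinitely,
  // as in the backtrace; otherwise the wait gives up after `periodNs`
  // nanoseconds (the unit is an assumption). Returns true if set() was seen.
  bool wait(uint64_t periodNs)
  {
    std::unique_lock<std::mutex> lock(_mutex);
    if (periodNs == 0) {
      _cv.wait(lock, [this] { return _done; });
      return true;
    }
    return _cv.wait_for(lock, std::chrono::nanoseconds(periodNs), [this] { return _done; });
  }

 private:
  std::mutex _mutex;
  std::condition_variable _cv;
  bool _done{false};
};

// Retry the timed wait up to `maxAttempts` times, stopping at the first success.
inline bool waitWithAttempts(TimedNotifier& notifier, uint64_t periodNs, uint64_t maxAttempts)
{
  for (uint64_t attempt = 0; attempt < maxAttempts; ++attempt) {
    if (notifier.wait(periodNs)) return true;
  }
  return false;
}
```

With a non-zero period, a caller holding the GIL gets control back after at most `periodNs * maxAttempts` nanoseconds even if the callback never runs, which is the property a Python-side default would rely on.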