-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add non-blocking ucxx::Endpoint::close()
request
#192
Add non-blocking ucxx::Endpoint::close()
request
#192
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General logic but not tests or python changes yet...
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
// _worker->progress(); | ||
// return closeRequest->isCompleted(); | ||
// }; | ||
// loopWithTimeout(std::chrono::milliseconds(5000), f); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just uncommenting this and commenting out the code on line 280 didn't let me see the problem you describe...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should definitely see it if you have a UCX debug build. Just tried it now with UCX 1.15 debug and I'm able to reproduce it:
[dgx13:3842411:0:3842411] ucp_worker.c:2888 Assertion `worker->inprogress++ == 0' failed
/src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
...
2885 UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
2886
2887 /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==> 2888 ucs_assert(worker->inprogress++ == 0);
2889 count = uct_worker_progress(worker->uct);
2890 ucs_async_check_miss(&worker->async);
2891
==== backtrace (tid:3842411) ====
0 0x000000000005fd79 ucp_worker_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2888
1 0x000000000005bebe ucxx::Worker::progressOnce() /src/ucxx/cpp/src/worker.cpp:254
2 0x000000000005beec ucxx::Worker::progressPending() /src/ucxx/cpp/src/worker.cpp:260
3 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:268
4 0x000000000005ca07 std::mutex::lock() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
5 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
6 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:272
7 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke() /src/ucxx/cpp/tests/listener.cpp:318
8 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
9 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
10 0x000000000004e1c1 operator()() /src/ucxx/cpp/tests/listener.cpp:319
11 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
12 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
13 0x000000000004e1c1 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
14 0x000000000003ef43 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
15 0x000000000003ef43 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
16 0x000000000003ef43 std::shared_ptr<void>::~shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
17 0x000000000003ef43 operator()() /src/ucxx/cpp/src/endpoint.cpp:162
18 0x000000000003ef43 __invoke_impl<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
19 0x000000000003ef43 __invoke_r<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:111
20 0x000000000003ef43 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
21 0x0000000000048c72 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
22 0x0000000000048c72 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
23 0x0000000000048c72 std::shared_ptr<void>::~shared_ptr() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
24 0x0000000000048c72 ucxx::Request::setStatus() /src/ucxx/cpp/src/request.cpp:230
25 0x0000000000049209 ucxx::Request::callback() /src/ucxx/cpp/src/request.cpp:149
26 0x0000000000040014 ucp_request_complete_send() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_request.inl:249
27 0x0000000000040014 ucp_ep_local_disconnect_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_ep.c:1625
28 0x000000000005e7f9 ucs_callbackq_slow_proxy() /src/ucx-1.15/build/src/ucs/../../../src/ucs/datastruct/callbackq.c:404
29 0x000000000005fc4a ucs_callbackq_dispatch() /src/ucx-1.15/build/../src/ucs/datastruct/callbackq.h:211
30 0x000000000005fc4a uct_worker_progress() /src/ucx-1.15/build/../src/uct/api/uct.h:2777
31 0x000000000005fc4a ucp_worker_progress() /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2889
32 0x000000000005bebe ucxx::Worker::progressOnce() /src/ucxx/cpp/src/worker.cpp:254
33 0x000000000005beec ucxx::Worker::progressPending() /src/ucxx/cpp/src/worker.cpp:260
34 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:268
35 0x000000000005ca07 std::mutex::lock() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
36 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
37 0x000000000005ca07 ucxx::Worker::progress() /src/ucxx/cpp/src/worker.cpp:272
38 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke() /src/ucxx/cpp/tests/listener.cpp:318
39 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
40 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
41 0x000000000004e1c1 operator()() /src/ucxx/cpp/tests/listener.cpp:319
42 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
43 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
44 0x000000000004e1c1 _M_invoke() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
45 0x000000000007820d std::function<bool ()>::operator()() /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
46 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody() /src/ucxx/cpp/tests/listener.cpp:321
47 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody() /src/ucxx/cpp/tests/listener.cpp:321
48 0x000000000004d40e testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>() ???:0
49 0x000000000004d6a1 testing::Test::Run() ???:0
50 0x000000000004da2f testing::TestInfo::Run() ???:0
51 0x000000000004deff testing::TestSuite::Run() ???:0
52 0x0000000000059423 testing::internal::UnitTestImpl::RunAllTests() ???:0
53 0x000000000004dfdd testing::UnitTest::Run() ???:0
54 0x0000000000001072 main() ???:0
55 0x0000000000024083 __libc_start_main() /build/glibc-wuryBv/glibc-2.31/csu/../csu/libc-start.c:308
56 0x000000000002971e _start() ???:0
=================================
Aborted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, ok, I didn't build in debug
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake approval.
A race condition may occur if `ucxx::Endpoint::setCloseCallback()` occurs while `ucxx::Endpoint::errorCallback()` is being executed, for example during Python `remove_close_callback()` call, causing `errorCallback` to attempt executing the callback which is not available anymore.
/merge |
Due to the primary use with Python, endpoint closing was initially implemented in blocking-mode only. With C++ usage that is not always desired, thus exposing a non-blocking close option that returns a
ucxx::Request
is beneficial to C++ users.