Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add non-blocking ucxx::Endpoint::close() request #192

Merged
merged 27 commits into from
Mar 15, 2024

Conversation

pentschev
Copy link
Member

@pentschev pentschev commented Feb 16, 2024

Due to the primary use with Python, endpoint closing was initially implemented in blocking-mode only. With C++ usage that is not always desired, thus exposing a non-blocking close option that returns a ucxx::Request is beneficial to C++ users.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General logic but not tests or python changes yet...

cpp/include/ucxx/endpoint.h Outdated Show resolved Hide resolved
cpp/include/ucxx/request_endpoint_close.h Outdated Show resolved Hide resolved
cpp/src/endpoint.cpp Outdated Show resolved Hide resolved
cpp/src/endpoint.cpp Show resolved Hide resolved
cpp/src/request_endpoint_close.cpp Outdated Show resolved Hide resolved
cpp/src/request_endpoint_close.cpp Show resolved Hide resolved
pentschev and others added 2 commits March 14, 2024 14:39
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
Co-authored-by: Lawrence Mitchell <wence@gmx.li>
cpp/tests/listener.cpp Outdated Show resolved Hide resolved
cpp/tests/listener.cpp Outdated Show resolved Hide resolved
// _worker->progress();
// return closeRequest->isCompleted();
// };
// loopWithTimeout(std::chrono::milliseconds(5000), f);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just uncommenting this and commenting out the code on line 280 didn't let me see the problem you describe...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should definitely see it if you have a UCX debug build. Just tried it now with UCX 1.15 debug and I'm able to reproduce it:

[dgx13:3842411:0:3842411]  ucp_worker.c:2888 Assertion `worker->inprogress++ == 0' failed

/src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c: [ ucp_worker_progress() ]
      ...
     2885     UCP_WORKER_THREAD_CS_ENTER_CONDITIONAL(worker);
     2886
     2887     /* check that ucp_worker_progress is not called from within ucp_worker_progress */
==>  2888     ucs_assert(worker->inprogress++ == 0);
     2889     count = uct_worker_progress(worker->uct);
     2890     ucs_async_check_miss(&worker->async);
     2891

==== backtrace (tid:3842411) ====
 0 0x000000000005fd79 ucp_worker_progress()  /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2888
 1 0x000000000005bebe ucxx::Worker::progressOnce()  /src/ucxx/cpp/src/worker.cpp:254
 2 0x000000000005beec ucxx::Worker::progressPending()  /src/ucxx/cpp/src/worker.cpp:260
 3 0x000000000005ca07 ucxx::Worker::progress()  /src/ucxx/cpp/src/worker.cpp:268
 4 0x000000000005ca07 std::mutex::lock()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
 5 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
 6 0x000000000005ca07 ucxx::Worker::progress()  /src/ucxx/cpp/src/worker.cpp:272
 7 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke()  /src/ucxx/cpp/tests/listener.cpp:318
 8 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
 9 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
10 0x000000000004e1c1 operator()()  /src/ucxx/cpp/tests/listener.cpp:319
11 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
12 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
13 0x000000000004e1c1 _M_invoke()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
14 0x000000000003ef43 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
15 0x000000000003ef43 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
16 0x000000000003ef43 std::shared_ptr<void>::~shared_ptr()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
17 0x000000000003ef43 operator()()  /src/ucxx/cpp/src/endpoint.cpp:162
18 0x000000000003ef43 __invoke_impl<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
19 0x000000000003ef43 __invoke_r<void, ucxx::Endpoint::close(bool, ucxx::EndpointCloseCallbackUserFunction, ucxx::EndpointCloseCallbackUserData)::<lambda(ucs_status_t, ucxx::EndpointCloseCallbackUserData)>&, ucs_status_t, std::shared_ptr<void> >()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:111
20 0x000000000003ef43 _M_invoke()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
21 0x0000000000048c72 std::function<void (ucs_status_t, std::shared_ptr<void>)>::operator()()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
22 0x0000000000048c72 std::__shared_ptr<void, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:1154
23 0x0000000000048c72 std::shared_ptr<void>::~shared_ptr()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr.h:122
24 0x0000000000048c72 ucxx::Request::setStatus()  /src/ucxx/cpp/src/request.cpp:230
25 0x0000000000049209 ucxx::Request::callback()  /src/ucxx/cpp/src/request.cpp:149
26 0x0000000000040014 ucp_request_complete_send()  /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_request.inl:249
27 0x0000000000040014 ucp_ep_local_disconnect_progress()  /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_ep.c:1625
28 0x000000000005e7f9 ucs_callbackq_slow_proxy()  /src/ucx-1.15/build/src/ucs/../../../src/ucs/datastruct/callbackq.c:404
29 0x000000000005fc4a ucs_callbackq_dispatch()  /src/ucx-1.15/build/../src/ucs/datastruct/callbackq.h:211
30 0x000000000005fc4a uct_worker_progress()  /src/ucx-1.15/build/../src/uct/api/uct.h:2777
31 0x000000000005fc4a ucp_worker_progress()  /src/ucx-1.15/build/src/ucp/../../../src/ucp/core/ucp_worker.c:2889
32 0x000000000005bebe ucxx::Worker::progressOnce()  /src/ucxx/cpp/src/worker.cpp:254
33 0x000000000005beec ucxx::Worker::progressPending()  /src/ucxx/cpp/src/worker.cpp:260
34 0x000000000005ca07 ucxx::Worker::progress()  /src/ucxx/cpp/src/worker.cpp:268
35 0x000000000005ca07 std::mutex::lock()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:100
36 0x000000000005ca07 std::lock_guard<std::mutex>::lock_guard()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_mutex.h:229
37 0x000000000005ca07 ucxx::Worker::progress()  /src/ucxx/cpp/src/worker.cpp:272
38 0x000000000004e1c1 std::_Function_handler<bool (), (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::{lambda()#2}>::_M_invoke()  /src/ucxx/cpp/tests/listener.cpp:318
39 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::_M_get()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:993
40 0x000000000004e1c1 std::__shared_ptr_access<ucxx::Request, (__gnu_cxx::_Lock_policy)2, false, false>::operator->()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/shared_ptr_base.h:987
41 0x000000000004e1c1 operator()()  /src/ucxx/cpp/tests/listener.cpp:319
42 0x000000000004e1c1 __invoke_impl<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:61
43 0x000000000004e1c1 __invoke_r<bool, (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()::<lambda()>&>()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/invoke.h:114
44 0x000000000004e1c1 _M_invoke()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:290
45 0x000000000007820d std::function<bool ()>::operator()()  /miniconda3/envs/rn-240313/x86_64-conda-linux-gnu/include/c++/11.4.0/bits/std_function.h:590
46 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()  /src/ucxx/cpp/tests/listener.cpp:321
47 0x000000000005456a (anonymous namespace)::ListenerTest_EndpointNonBlockingCloseWithCallbacks_Test::TestBody()  /src/ucxx/cpp/tests/listener.cpp:321
48 0x000000000004d40e testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>()  ???:0
49 0x000000000004d6a1 testing::Test::Run()  ???:0
50 0x000000000004da2f testing::TestInfo::Run()  ???:0
51 0x000000000004deff testing::TestSuite::Run()  ???:0
52 0x0000000000059423 testing::internal::UnitTestImpl::RunAllTests()  ???:0
53 0x000000000004dfdd testing::UnitTest::Run()  ???:0
54 0x0000000000001072 main()  ???:0
55 0x0000000000024083 __libc_start_main()  /build/glibc-wuryBv/glibc-2.31/csu/../csu/libc-start.c:308
56 0x000000000002971e _start()  ???:0
=================================
Aborted

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, ok, I didn't build in debug

@pentschev pentschev marked this pull request as ready for review March 14, 2024 16:09
@pentschev pentschev requested review from a team as code owners March 14, 2024 16:09
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake approval.

A race condition may occur if `ucxx::Endpoint::setCloseCallback()`
occurs while `ucxx::Endpoint::errorCallback()` is being executed, for
example during Python `remove_close_callback()` call, causing
`errorCallback` to attempt executing the callback which is not available
anymore.
@pentschev
Copy link
Member Author

eb8e708 should fix a race condition I didn't catch during #206 .

@pentschev
Copy link
Member Author

Thanks @wence- and @bdice for reviews and approvals.

@pentschev
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit 0204af1 into rapidsai:branch-0.37 Mar 15, 2024
57 checks passed
@pentschev pentschev deleted the request-endpoint-close branch March 15, 2024 14:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants