DAOS: rxm crash in rxm_conn_close() on the server when client exits during rdma transfer #6665
Could you please provide information on the setup used and the steps (command + parameters) to recreate this issue? We have done the daos and ior setup on our cluster.
Hi, unfortunately we don't have a simple reproduction scenario right now, and it requires a few more components to be set up/installed before you can hit the issue. I am currently in the process of trying to reduce the reproduction to a cart-level test (which would eliminate the need for the daos server and other components), but at this point I have not been able to hit the problem with simpler samples.

In order to recreate this you will need 3 nodes: 2 servers and 1 client. Beyond the daos stack, this requires the following:

HDF5 library to be built from here:
configure line to use:

Next is to compile vol-daos with hdf5. Download from: https://github.com/HDFGroup/vol-daos
Once you update PATH and LD_LIBRARY_PATH to point to the hdf5 prefix location, the vol ccmake will pick that up automatically. The steps to build vol-daos are specified in:
In short, you should be able to just do the following:

Once everything is compiled, you should have the 'h5_partest_t_shapesame' test available, which reproduces the problem.

Next step is to start the daos server:
mkdir /tmp/daos_server
configure the daos_server.yml file (example below)
Mohamad was able to reproduce this problem running ior and waiting 2 seconds. Example of command run:
DAOS_CONT and DAOS_POOL can be set using the helper script above once the daos server has started.
I am still having issues with the make of vol-daos.
I don't know what ior is, but if we use that, can we reproduce the problem more easily than by following that 2-page recipe?
According to Mohamad, yes: the ior run can also be used to reproduce this without having to set up daos-vol/hdf5.
I've just now been able to reproduce the problem much more easily using a cart-level server and client.

Step 1: Start the server using the script below. (All runs assume you are starting from the top of the daos/ directory.) Modify the 'HOST' environment variable to your hostname; also change INTERFACE_1/INTERFACE_2.

```
export CRT_PHY_ADDR_STR="ofi+tcp;ofi_rxm"
SERVER_APP="./install/bin/crt_launch -e install/lib/daos/TESTING/tests/test_group_np_srv --name selftest_srv_grp --cfg_path=."
set -x
```

In a separate terminal launch self_test:
Wait a few seconds and ctrl+c out of it.
Just running the servers and the clients: one of the servers seems to be crashing because it cannot create a mount point at /mnt/daos. The client and the other server report that they are listening.
[nnanal@cst-icx3 ~]$ daos_agent -i -s /tmp/daos_agent/ -o ~/daos_agent.yml
@nikhilnanal please try with just the cart-level reproducers, as they require significantly less setup. In your case you might need to first mkdir /mnt/daos and make sure it is chmod-ed/chowned to the same user who is launching daos.
I tried to run the script using cart; however, it cannot find crt_launch.
orterun was unable to launch the specified application as it could not access
Executable: ./install/bin/crt_launch
while attempting to start process rank 0. 4 total processes failed to start.
Is there a specific version of daos I should build? I'm building daos v1.1.3.
daos compiles some of the samples/tests optionally, based on whether MPI is found on your system or not. After that:
If everything is correct you should get crt_launch in your install/bin/ directory.
@nikhilnanal
I was able to build crt_launch, and it did run the test once with the output above, but the second time and every time afterwards that I've tried to run it, it gives these errors:
By default, for Open MPI 4.0 and later, infiniband ports on a device
Local host: cst-icx1
WARNING: There was an error initializing an OpenFabrics device.
Local host: cst-icx1
The OpenIB errors were present in the first trial as well, so I am not sure what's causing the HG errors.
Make sure there are no runaway processes from the first run; you might need to kill test_group_np_srv and test_group_np_cli manually.
Okay, that seems to work, thank you.
You can run it on the same node; I only use 1 node in my own reproduction.
Okay, now it is showing:
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_core_send_output_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261916/262082] hg WARN src/cart/crt_hg.c:1153 crt_hg_reply_send_cb() hg_cbinfo->ret: 22, opc: 0xff030007.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261919/262079] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261915/262084] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
hg_bulk_transfer_cb(): NA callback returned error (NA_PROTOCOL_ERROR)
04/15-11:15:26.70 cst-icx1 CaRT[261918/262077] hg ERR src/cart/crt_hg.c:1456 crt_hg_bulk_transfer_cb() crt_hg_bulk_transfer_cb,hg_cbinfo->ret: 12.
The expectation is for the servers to continue working when the client is ctrl-c-ed out.
Bulk transfer errors are expected, as we are terminating a bulk transfer mid-transaction; however, the subsequent server crash is not.
OK, thank you. I'll try to debug from here.
I'm checking the source code and suspect that rxm_conn_handle_notify() can leave a freed handle in rxm_cmap::handles_av. Could you confirm whether this change makes sense, or whether I have misunderstood the code? Thanks.
I've tried @gnailzenh's patch locally and it didn't seem to fix the issue.
It's possible for handle->fi_addr == FI_ADDR_NOTAVAIL. The peer's address does not need to be in the AV. The initialization of handle only sets either fi_addr or peer, but it's not obvious to me if that requirement is always maintained. The if-else suggests we could add assert(handle->fi_addr == FI_ADDR_NOTAVAIL) in the if case. If that gets hit, then you've at least found one issue.
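To make that suggestion concrete, below is a minimal sketch of the if/else shape being discussed, with the proposed assert in the if case. The structures and the function name are simplified stand-ins, not the actual rxm source:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FI_ADDR_NOTAVAIL ((uint64_t) -1)   /* same value as fi_addr_t -1 in rdma/fabric.h */

/* simplified stand-ins for the rxm cmap / connection handle structures */
struct peer_entry {
	int refcnt;
};

struct conn_handle {
	uint64_t fi_addr;          /* index into handles_av, or FI_ADDR_NOTAVAIL */
	struct peer_entry *peer;   /* set only when the address is not in the AV */
};

struct cmap {
	struct conn_handle *handles_av[64];
};

static void del_handle(struct cmap *cmap, struct conn_handle *handle)
{
	if (handle->peer) {
		/*
		 * If the "either fi_addr or peer" invariant really holds,
		 * this assert can never fire.  If it does fire, the handle
		 * is also referenced from handles_av, and freeing it below
		 * would leave a dangling pointer in the AV.
		 */
		assert(handle->fi_addr == FI_ADDR_NOTAVAIL);
		free(handle->peer);
		handle->peer = NULL;
	} else if (handle->fi_addr != FI_ADDR_NOTAVAIL) {
		cmap->handles_av[handle->fi_addr] = NULL;
	}
	free(handle);
}
```

If the assert fires while running the reproducer, that would point to the scenario described above, where a handle freed on notify is still reachable through handles_av.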
@frostedcmos - did the segfault occur while running a debug version of libfabric?
No, the retry just tonight was using the release version:
I suspect the completion error may be closer to the actual problem. Thanks. I'll analyze the tcp and rxm error reporting, particularly around the handling of internal messages.
I believe in the case of a loopback connection (endpoint connects to itself) you will find that handle->fi_addr is valid and the peer exists as well. Other cases should be one or the other, since the peer is moved if the AV entry is later added. So it may still be a good idea to check both independently.
PR #6707 attempts to handle completion errors better. It's hard to test those changes since it requires generating completion errors, though. But it might help with the segfault in rxm_handle_comp_error() that was hit. @nikhilnanal - if you're at the point where you can reproduce the crash, testing the changes in that PR would be useful, at least to ensure that it didn't make things worse. @swelch - do you know where in the code path that occurs? I don't mind checking both independently to be safe, but I'd like to understand if there is a real issue that we could be hitting.
@shefty - I believe it occurs when the local side has initiated a connect request to itself (hence the address is in its AV). When processing the connect request in rxm_cmap_process_connreq(), the AV handle state will show as RXM_CMAP_CONNREQ_SENT; since the local and requester addresses are the same, a peer handle is allocated and its fi_addr set to the associated AV entry. The peer handle is used to create a new message endpoint and accept the connection. It seems like the notify will close the peer but leave the handle in the AV valid. However, the more I think about it, we may get a notify for both the AV handle and the peer handle in this case; if that is true, it may ultimately close both the peer and the AV handle (but the connection is unusable when the first is received).
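Continuing the simplified model from the earlier sketch (still illustrative stand-in types, not the actual rxm code), a cleanup that checks both fields independently, as suggested above, might look like this:

```c
/*
 * Hypothetical cleanup that tolerates the loopback case in which both
 * handle->peer and handle->fi_addr can be valid at the same time
 * (types reused from the sketch above).
 */
static void del_handle_loopback_safe(struct cmap *cmap, struct conn_handle *handle)
{
	if (handle->peer) {
		free(handle->peer);
		handle->peer = NULL;
	}
	if (handle->fi_addr != FI_ADDR_NOTAVAIL &&
	    cmap->handles_av[handle->fi_addr] == handle) {
		/* clear the AV slot only if it still points at this handle */
		cmap->handles_av[handle->fi_addr] = NULL;
	}
	free(handle);
}
```

Whichever form is used, the point being made in this thread is that a handle must not be freed while handles_av still references it.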
@swelch -- Thanks, I see that path now. So it is possible for both of those to be set. I'll add a fix to my open PR to address the problem pointed out by @gnailzenh.
#6707 - updated with fi_addr fix.
Thanks. I also noticed there is a "TODO" in rxm_eq_sread():
Because we are using auto-progress, is this a race that can happen?
Hmm... that sounds like it's describing a real race. I don't know for certain without doing a deep dive through the code to see how the cleanup occurs.
I did find issues in the tcp provider where it could report completions for transfers that were NOT initiated by the upper-level user, e.g. an internal ack. There are fixes for this in master.
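As a rough illustration of that bug class (not the actual tcp provider code; all names below are made up), the point is that completions for provider-internal transfers such as an ack have to be consumed internally rather than surfaced on the user-visible CQ:

```c
#include <stdbool.h>
#include <stdio.h>

/* simplified stand-in for a posted transfer */
struct xfer {
	void *user_context;   /* context supplied by the application, if any */
	bool internal;        /* true when the provider posted this transfer itself */
};

static void report_completion(struct xfer *xfer, int status)
{
	if (xfer->internal) {
		/*
		 * e.g. an internal ack: consume it quietly and never surface
		 * the completion (or a completion error) on the user CQ.
		 */
		return;
	}
	printf("user completion: ctx=%p status=%d\n", xfer->user_context, status);
}
```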
Hi, is there a tag for these fixes, or could you provide commit hashes of those patches?
There's not a tag, but there will be a v1.13 release within about 3 weeks.
Btw, I've rewritten the connection management code in rxm, which I hope will start us down a path of fixing all of the DAOS connection-related issues. See #6778. The code is still under testing, and I'm hesitant to pull it into v1.13 without broader testing.
Has there been any update on this? Is it planned to be merged anytime soon, post v1.13?
Update: With tcp;ofi_rxm: Reproducer still runs with sockets and verbs;ofi_rxm. |
#7110 resolved the issue in local testing, when added on top of other fixes in main. |
In a test we have multiple servers and clients attempting to rdma to/from those servers.
If during rdma we kill the client via CTRL+C, the server side code crashes with the following trace:
OFI: 1.12.0
Provider: tcp;ofi_rxm
(gdb) bt
#0 0x00007f57edaefcb3 in rxm_conn_close () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#1 0x00007f57edaf152d in rxm_conn_handle_event () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#2 0x00007f57edaf277b in rxm_msg_eq_progress () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#3 0x00007f57edaf290d in rxm_cmap_connect () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#4 0x00007f57edaf2d61 in rxm_get_conn () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#5 0x00007f57edaf7fc2 in rxm_ep_tsend () from /home/mschaara/install/daos/bin/../lib64/../prereq/dev/mercury/lib/../../ofi/lib/libfabric.so.1
#6 0x00007f57f2fa27ca in fi_tsend (context=0x7f5720174dc8, tag=&lt;optimized out&gt;, dest_addr=&lt;optimized out&gt;, desc=&lt;optimized out&gt;, len=&lt;optimized out&gt;, buf=&lt;optimized out&gt;, ep=&lt;optimized out&gt;)
at /home/mschaara/install/daos/prereq/dev/ofi/include/rdma/fi_tagged.h:114
#7 na_ofi_cq_process_retries (context=0x7f5720043b10) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:3380
#8 na_ofi_progress (na_class=0x7f572002c0f0, context=0x7f5720043b10, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na_ofi.c:5161
#9 0x00007f57f2f99c21 in NA_Progress (na_class=na_class@entry=0x7f572002c0f0, context=context@entry=0x7f5720043b10, timeout=timeout@entry=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/na/na.c:1168
#10 0x00007f57f31c3370 in hg_core_progress_na (na_class=0x7f572002c0f0, na_context=0x7f5720043b10, timeout=0, progressed_ptr=progressed_ptr@entry=0x2dc28c0 "")
at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3896
#11 0x00007f57f31c51a4 in hg_core_poll (progressed_ptr=&lt;optimized out&gt;, timeout=&lt;optimized out&gt;, context=0x7f572002c4d0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3838
#12 hg_core_progress (context=0x7f572002c4d0, timeout=0) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:3693
#13 0x00007f57f31ca38b in HG_Core_progress (context=&lt;optimized out&gt;, timeout=&lt;optimized out&gt;) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury_core.c:5056
#14 0x00007f57f31bcd52 in HG_Progress (context=context@entry=0x7f572002c120, timeout=&lt;optimized out&gt;) at /home/mschaara/source/deps_daos/daos/build/external/dev/mercury/src/mercury.c:2020
#15 0x00007f57f5c6ce41 in crt_hg_progress (hg_ctx=hg_ctx@entry=0x7f5720026dc8, timeout=timeout@entry=0) at src/cart/crt_hg.c:1233
#16 0x00007f57f5c2faa5 in crt_progress (crt_ctx=0x7f5720026db0, timeout=0) at src/cart/crt_context.c:1394
#17 0x0000000000422225 in dss_srv_handler (arg=0x2cd1410) at src/engine/srv.c:470
#18 0x00007f57f4bbc7ea in ABTD_ythread_func_wrapper () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#19 0x00007f57f4bbc991 in make_fcontext () from /home/mschaara/install/daos/bin/../prereq/dev/argobots/lib/libabt.so.0
#20 0x0000000000000000 in ?? ()
(gdb) q