segfault in rxm_open_conn on master branch (NULL provider name) #7300
Comments
We don't see this issue in our CI. The fabric_attr is only partially initialized. The uninitialized fields (prov_name, api_version) are usually set by the libfabric core within fi_getinfo(). It looks like that step was somehow skipped, but I haven't yet identified how that could have occurred. The referenced msg_info is an fi_info structure that rxm maintains internally.
I believe I see the problem. It requires using a shared rx context, which our CI doesn't set (that's a gap). I'm working on a fix.
PR #7308 should address this issue. I was able to reproduce the error with a modified version of fabtests (fi_msg).
@shefty Thanks for the fix. I verified that PR #7308 fixes the segfault.
With the latest libfabric master branch (bb8bcc7), the test hits a segfault:
(gdb) bt
#0 0x00007f1a10f4ffe5 in __strcasestr_sse42 () from /lib64/libc.so.6
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
#2 0x00007f1a0b4b144e in rxm_process_connreq (ep=0x7f19ec117cf0, cm_entry=0x211a760) at prov/rxm/src/rxm_conn.c:677
#3 0x00007f1a0b4b18d2 in rxm_handle_event (ep=0x7f19ec117cf0, event=1, cm_entry=0x211a760, len=40) at prov/rxm/src/rxm_conn.c:758
#4 0x00007f1a0b4b19fe in rxm_conn_progress (ep=0x7f19ec117cf0) at prov/rxm/src/rxm_conn.c:784
#5 0x00007f1a0b4c592b in rxm_ep_do_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1941
#6 0x00007f1a0b4c5a0f in rxm_ep_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1961
#7 0x00007f1a0b43f008 in ofi_cq_progress (cq=0x7f19ec103450) at prov/util/src/util_cq.c:567
#8 0x00007f1a0b43e0e0 in ofi_cq_readfrom (cq_fid=0x7f19ec103450, buf=0x211aab0, count=16, src_addr=0x211aa30) at prov/util/src/util_cq.c:230
#9 0x00007f1a0fa8e45f in fi_cq_readfrom (src_addr=0x211aa30, count=16, buf=0x211aab0, cq=0x7f19ec103450) at /home/xliu9/src/daos_m/install/prereq/debug/ofi/include/rdma/fi_eq.h:400
#10 na_ofi_cq_read (max_count=16, actual_count=, src_err_addrlen=, src_err_addr=, src_addrs=0x211aa30, cq_events=0x211aab0,
context=0x7f19ec1067f0) at /home/xliu9/src/daos_m/build/external/debug/mercury/src/na/na_ofi.c:3235
...
(gdb) f 1
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
203 if (!strcasestr(msg_info->fabric_attr->prov_name, "tcp")) {
(gdb) p msg_info->fabric_attr
$1 = (struct fi_fabric_attr *) 0x7f19ec555530
(gdb) p *msg_info->fabric_attr
$2 = {fabric = 0x0, name = 0x7f19ec555510 "IB-0xfe8", '0' <repeats 13 times>, prov_name = 0x0, prov_version = 7471104, api_version = 0}
(gdb) p msg_info->fabric_attr->prov_name
$3 = 0x0
(gdb) p msg_info->fabric_attr->prov_name[0]
Cannot access memory at address 0x0
As @shefty mentioned on another ticket, the problem may be due to the verbs provider not formatting the fi_info correctly.
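For illustration, the crash in frame #1 comes from passing a NULL prov_name straight to strcasestr(). Below is a minimal sketch of a NULL-safe version of that check. It is not the change made in PR #7308: `demo_fabric_attr` and `demo_prov_is_tcp` are hypothetical names, and the struct is a stand-in for the two string fields of struct fi_fabric_attr so that the snippet builds without the libfabric headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* strcasestr() is a GNU extension */
#endif
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the string fields of struct fi_fabric_attr. */
struct demo_fabric_attr {
	const char *name;
	const char *prov_name;
};

/*
 * Return 1 if prov_name is set and contains "tcp" (case-insensitive),
 * 0 otherwise. An unset prov_name is treated as "not tcp" instead of
 * being handed to strcasestr(), which would dereference the NULL pointer.
 */
int demo_prov_is_tcp(const struct demo_fabric_attr *attr)
{
	if (!attr || !attr->prov_name)
		return 0;

	return strcasestr(attr->prov_name, "tcp") != NULL;
}
```

Called with attributes like the ones in the gdb output above (prov_name = 0x0), the helper returns 0 rather than faulting. A guard like this only hides the symptom; the fi_info would still be missing the fields that fi_getinfo() normally fills in.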