segfault in rxm_open_conn on master branch (NULL provider name) #7300

Closed
liuxuezhao opened this issue Dec 9, 2021 · 4 comments

@liuxuezhao
Contributor

With the latest libfabric master branch (bb8bcc7), the test segfaults:
(gdb) bt
#0 0x00007f1a10f4ffe5 in __strcasestr_sse42 () from /lib64/libc.so.6
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
#2 0x00007f1a0b4b144e in rxm_process_connreq (ep=0x7f19ec117cf0, cm_entry=0x211a760) at prov/rxm/src/rxm_conn.c:677
#3 0x00007f1a0b4b18d2 in rxm_handle_event (ep=0x7f19ec117cf0, event=1, cm_entry=0x211a760, len=40) at prov/rxm/src/rxm_conn.c:758
#4 0x00007f1a0b4b19fe in rxm_conn_progress (ep=0x7f19ec117cf0) at prov/rxm/src/rxm_conn.c:784
#5 0x00007f1a0b4c592b in rxm_ep_do_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1941
#6 0x00007f1a0b4c5a0f in rxm_ep_progress (util_ep=0x7f19ec117cf0) at prov/rxm/src/rxm_cq.c:1961
#7 0x00007f1a0b43f008 in ofi_cq_progress (cq=0x7f19ec103450) at prov/util/src/util_cq.c:567
#8 0x00007f1a0b43e0e0 in ofi_cq_readfrom (cq_fid=0x7f19ec103450, buf=0x211aab0, count=16, src_addr=0x211aa30) at prov/util/src/util_cq.c:230
#9 0x00007f1a0fa8e45f in fi_cq_readfrom (src_addr=0x211aa30, count=16, buf=0x211aab0, cq=0x7f19ec103450) at /home/xliu9/src/daos_m/install/prereq/debug/ofi/include/rdma/fi_eq.h:400
#10 na_ofi_cq_read (max_count=16, actual_count=<optimized out>, src_err_addrlen=<optimized out>, src_err_addr=<optimized out>, src_addrs=0x211aa30, cq_events=0x211aab0,
context=0x7f19ec1067f0) at /home/xliu9/src/daos_m/build/external/debug/mercury/src/na/na_ofi.c:3235
...
(gdb) f 1
#1 0x00007f1a0b4af9e8 in rxm_open_conn (conn=0x7f19ec555bb0, msg_info=0x7f19ec555110) at prov/rxm/src/rxm_conn.c:203
203 if (!strcasestr(msg_info->fabric_attr->prov_name, "tcp")) {
(gdb) p msg_info->fabric_attr
$1 = (struct fi_fabric_attr *) 0x7f19ec555530
(gdb) p *msg_info->fabric_attr
$2 = {fabric = 0x0, name = 0x7f19ec555510 "IB-0xfe8", '0' <repeats 13 times>, prov_name = 0x0, prov_version = 7471104, api_version = 0}
(gdb) p msg_info->fabric_attr->prov_name
$3 = 0x0
(gdb) p msg_info->fabric_attr->prov_name[0]
Cannot access memory at address 0x0

As @shefty mentioned on another ticket, the problem may be due to the verbs provider not formatting the fi_info correctly.
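
For readers following along, the failure mode in frame #1 can be reproduced outside libfabric: strcasestr() is handed a NULL haystack because prov_name was never set. The sketch below is only an illustration of that condition and of a defensive check. The struct and the function names (demo_fabric_attr, demo_prov_is_tcp, demo_self_check) are hypothetical stand-ins, not libfabric types or APIs, and this is not the change made in the eventual fix.

```c
#define _GNU_SOURCE          /* strcasestr() is a GNU extension */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two fields of struct fi_fabric_attr
 * that matter here; not the real libfabric definition. */
struct demo_fabric_attr {
    const char *name;
    const char *prov_name;
};

/* Returns true only when prov_name is set and contains "tcp"
 * (case-insensitive). A NULL attr or NULL prov_name yields false
 * instead of passing NULL to strcasestr(). */
bool demo_prov_is_tcp(const struct demo_fabric_attr *attr)
{
    if (!attr || !attr->prov_name)
        return false;
    return strcasestr(attr->prov_name, "tcp") != NULL;
}

/* Self-check mirroring the state printed in the gdb session:
 * name is populated, prov_name is NULL. */
void demo_self_check(void)
{
    struct demo_fabric_attr broken = { "IB-0xfe8", NULL };
    struct demo_fabric_attr tcp = { "demo-fabric", "tcp;ofi_rxm" };

    assert(!demo_prov_is_tcp(&broken));
    assert(demo_prov_is_tcp(&tcp));
}
```

A guard like this would only hide the symptom. The underlying question in this issue is why prov_name was left unset in the first place.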

@shefty
Member

shefty commented Dec 9, 2021

We don't see this issue in our CI. The fabric_attr is only partially initialized. The uninitialized fields (prov_name, api_version) are normally set by the libfabric core within the fi_getinfo() function. It looks like that step was somehow skipped, but I haven't yet identified how that could have occurred. The referenced msg_info is an fi_info structure that rxm maintains internally.
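
To make the mechanism described above concrete, here is a simplified model of a post-processing step that fills in prov_name and api_version on every entry of an info list. It is an assumption-laden sketch: the types and the function demo_finalize_info are hypothetical, and the real libfabric core code is organized differently. It only shows why an info structure that bypasses such a step ends up with prov_name = 0x0.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced stand-ins for fi_fabric_attr and fi_info. */
struct demo_fabric_attr {
    const char *name;
    char *prov_name;        /* owned copy, NULL until finalized */
    unsigned api_version;   /* 0 until finalized */
};

struct demo_info {
    struct demo_info *next;
    struct demo_fabric_attr *fabric_attr;
};

/* Walks the list and fills in the fields a provider left unset.
 * Returns 0 on success, -1 if a string copy cannot be allocated. */
int demo_finalize_info(struct demo_info *list, const char *prov_name,
                       unsigned api_version)
{
    for (struct demo_info *cur = list; cur; cur = cur->next) {
        if (!cur->fabric_attr)
            continue;
        if (!cur->fabric_attr->prov_name) {
            size_t len = strlen(prov_name) + 1;
            char *copy = malloc(len);
            if (!copy)
                return -1;
            memcpy(copy, prov_name, len);
            cur->fabric_attr->prov_name = copy;
        }
        cur->fabric_attr->api_version = api_version;
    }
    return 0;
}
```

In this model, any info structure created on a path that never reaches demo_finalize_info keeps a NULL prov_name and a zero api_version, which matches what the gdb output shows for msg_info->fabric_attr.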

@shefty
Member

shefty commented Dec 10, 2021

I believe I see the problem (requires using shared rx context, which our CI doesn't set -- that's a gap). I'm working on a fix.

@shefty
Member

shefty commented Dec 10, 2021

PR #7308 should address this issue. I was able to reproduce the error with a modified version of fabtests (fi_msg).

@liuxuezhao
Contributor Author

@shefty Thanks for the fix. I verified that PR #7308 resolves the segfault.

@shefty shefty closed this as completed Dec 11, 2021