verbs+rxm - segfault in fi_ibv_wc_2_wce #3355
Comments
It seems that in fi_ibv_wc_2_wce,
should be
The original memset just wipes out the pointer that was allocated. As this is only called on the error path, it explains why it wasn't seen before.
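To illustrate the pattern described above, here is a minimal sketch in plain C. It is not the libfabric source: the `demo_*` types and functions are hypothetical stand-ins, and the exact form of the original `memset` call is an assumption based on the description (zeroing the pointer variable rather than the entry it points to).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the provider's types, not the real definitions. */
struct demo_wc { int status; unsigned int byte_len; };
struct demo_wce { struct demo_wc wc; struct demo_wce *next; };

/* Assumed shape of the bug: memset targets the caller's pointer variable. */
static int demo_fill_buggy(const struct demo_wc *wc, struct demo_wce **wce)
{
    struct demo_wce *entry = malloc(sizeof(*entry));
    if (!entry)
        return -1;
    *wce = entry;
    memset(wce, 0, sizeof(*wce));   /* clears *wce itself, so it is now NULL */
    if (!*wce) {
        /* The real code would dereference here and fault; the sketch
         * reports the condition instead and frees the entry. */
        free(entry);
        return -2;
    }
    (*wce)->wc = *wc;
    return 0;
}

/* Assumed shape of the fix: memset targets the allocated entry. */
static int demo_fill_fixed(const struct demo_wc *wc, struct demo_wce **wce)
{
    *wce = malloc(sizeof(**wce));
    if (!*wce)
        return -1;
    memset(*wce, 0, sizeof(**wce)); /* zeroes the entry, pointer stays valid */
    (*wce)->wc = *wc;
    return 0;
}

/* Small self-check exercising both variants. */
static int demo_selfcheck(void)
{
    struct demo_wc wc = { 7, 64 };
    struct demo_wce *wce = NULL;

    assert(demo_fill_buggy(&wc, &wce) == -2 && wce == NULL);
    assert(demo_fill_fixed(&wc, &wce) == 0 && wce != NULL);
    assert(wce->wc.status == 7 && wce->next == NULL);
    free(wce);
    return 0;
}
```

In the buggy variant the pointer is NULL by the time the assignment `(*wce)->wc = *wc` would run, which matches the `p *wce` output in the gdb session.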
Signed-off-by: Dmitry Gladkov <dmitry.gladkov@intel.com>
Thanks for reporting this. Yes, you're right.
should be there instead. I've checked this solution (#3355) on the system (iWarp verbs device) where the crash is reproducible almost every time.
Test:
MPIR_CVAR_OFI_USE_PROVIDER="verbs;ofi_rxm" srun --kill-on-bad-exit -N 32 -n 64 sh -c "ulimit -c unlimited ; ./mpi/collective/osu_alltoall -m 8"
gdb analysis:
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff6de3156 in fi_ibv_wc_2_wce (cq=0x69a260, wc=0x7ffff3a11cf0, wce=0x7ffff3a11d20) at prov/verbs/src/verbs_cq.c:278
278 (*wce)->wc = *wc;
(gdb) bt
#0 0x00007ffff6de3156 in fi_ibv_wc_2_wce (cq=0x69a260, wc=0x7ffff3a11cf0, wce=0x7ffff3a11d20) at prov/verbs/src/verbs_cq.c:278
#1 0x00007ffff6de325b in fi_ibv_poll_outstanding_cq (ep=0xcad1b0, cq=0x69a260) at prov/verbs/src/verbs_cq.c:305
#2 0x00007ffff6de345c in fi_ibv_cleanup_cq (ep=0xcad1b0) at prov/verbs/src/verbs_cq.c:363
#3 0x00007ffff6deaae2 in fi_ibv_msg_ep_close (fid=0xcad1b0) at prov/verbs/src/verbs_msg_ep.c:120
#4 0x00007ffff6e16b21 in fi_close (fid=0xcad1b0) at ./include/rdma/fabric.h:466
#5 0x00007ffff6e17108 in rxm_conn_close (handle=0xc0e128) at prov/rxm/src/rxm_conn.c:105
#6 0x00007ffff6da86d2 in ofi_cmap_process_connreq (cmap=0x6b55d0, addr=0x7fffec0008d0, handle_ret=0x7ffff3a11e68) at prov/util/src/util_av.c:1344
#7 0x00007ffff6e1724e in rxm_msg_process_connreq (rxm_ep=0x69a9d0, msg_info=0x7fffec486000, data=0x7fffec0008d0) at prov/rxm/src/rxm_conn.c:133
#8 0x00007ffff6e179b0 in rxm_conn_event_handler (arg=0x69a9d0) at prov/rxm/src/rxm_conn.c:237
#9 0x00007ffff7bc6dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff715c28d in lseek64 () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) p *wce
$5 = (struct fi_ibv_wce *) 0x0