verbs+rxm - segfault in fi_ibv_wc_2_wce #3355

Closed
fzago-cray opened this issue Sep 29, 2017 · 2 comments

fzago-cray commented Sep 29, 2017

Test:
MPIR_CVAR_OFI_USE_PROVIDER="verbs;ofi_rxm" srun --kill-on-bad-exit -N 32 -n 64 sh -c "ulimit -c unlimited ; ./mpi/collective/osu_alltoall -m 8"

gdb analysis:
Program terminated with signal 11, Segmentation fault.
#0 0x00007ffff6de3156 in fi_ibv_wc_2_wce (cq=0x69a260, wc=0x7ffff3a11cf0, wce=0x7ffff3a11d20) at prov/verbs/src/verbs_cq.c:278
278 (*wce)->wc = *wc;
(gdb) bt
#0 0x00007ffff6de3156 in fi_ibv_wc_2_wce (cq=0x69a260, wc=0x7ffff3a11cf0, wce=0x7ffff3a11d20) at prov/verbs/src/verbs_cq.c:278
#1 0x00007ffff6de325b in fi_ibv_poll_outstanding_cq (ep=0xcad1b0, cq=0x69a260) at prov/verbs/src/verbs_cq.c:305
#2 0x00007ffff6de345c in fi_ibv_cleanup_cq (ep=0xcad1b0) at prov/verbs/src/verbs_cq.c:363
#3 0x00007ffff6deaae2 in fi_ibv_msg_ep_close (fid=0xcad1b0) at prov/verbs/src/verbs_msg_ep.c:120
#4 0x00007ffff6e16b21 in fi_close (fid=0xcad1b0) at ./include/rdma/fabric.h:466
#5 0x00007ffff6e17108 in rxm_conn_close (handle=0xc0e128) at prov/rxm/src/rxm_conn.c:105
#6 0x00007ffff6da86d2 in ofi_cmap_process_connreq (cmap=0x6b55d0, addr=0x7fffec0008d0, handle_ret=0x7ffff3a11e68) at prov/util/src/util_av.c:1344
#7 0x00007ffff6e1724e in rxm_msg_process_connreq (rxm_ep=0x69a9d0, msg_info=0x7fffec486000, data=0x7fffec0008d0) at prov/rxm/src/rxm_conn.c:133
#8 0x00007ffff6e179b0 in rxm_conn_event_handler (arg=0x69a9d0) at prov/rxm/src/rxm_conn.c:237
#9 0x00007ffff7bc6dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff715c28d in lseek64 () from /lib64/libc.so.6
#11 0x0000000000000000 in ?? ()
(gdb) p *wce
$5 = (struct fi_ibv_wce *) 0x0

fzago-cray commented

It seems that in fi_ibv_wc_2_wce,

memset(wce, 0, sizeof(*wce));

should be

memset(*wce, 0, sizeof(struct fi_ibv_wce));

The original memset just wipes out the pointer that was allocated.

Since this is only called on an error path, that explains why it wasn't seen before.
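
The difference between the two memset calls can be reproduced in isolation. The following is a minimal sketch, not the libfabric source: `demo_wc`, `demo_wce`, and both `demo_wc_2_wce_*` functions are hypothetical stand-ins, and plain `malloc` is assumed in place of whatever allocator the provider actually uses. It also assumes a platform where a null pointer is represented as all-zero bytes, which holds on mainstream systems.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the real types; the field layout is assumed. */
struct demo_wc  { int status; unsigned int byte_len; };
struct demo_wce { struct demo_wc wc; struct demo_wce *next; };

/* Buggy pattern: memset is aimed at the caller's pointer variable. */
int demo_wc_2_wce_buggy(const struct demo_wc *wc, struct demo_wce **wce)
{
    struct demo_wce *entry = malloc(sizeof(*entry));

    assert(wc && wce);
    if (!entry)
        return -1;
    *wce = entry;

    /* sizeof(*wce) is the size of a pointer, and the destination is the
     * pointer itself, so this overwrites *wce with zero bytes. */
    memset(wce, 0, sizeof(*wce));

    if (!*wce) {
        /* The reported crash happened here: (*wce)->wc = *wc with *wce == NULL.
         * The sketch reports the condition instead of dereferencing. */
        free(entry);
        return -2;
    }
    (*wce)->wc = *wc;
    return 0;
}

/* Fixed pattern: memset clears the allocated entry, not the pointer to it. */
int demo_wc_2_wce_fixed(const struct demo_wc *wc, struct demo_wce **wce)
{
    assert(wc && wce);
    *wce = malloc(sizeof(**wce));
    if (!*wce)
        return -1;
    memset(*wce, 0, sizeof(**wce));
    (*wce)->wc = *wc;
    return 0;
}
```

Calling the buggy variant leaves the caller's pointer NULL even though the allocation succeeded, while the fixed variant returns a zeroed entry carrying a copy of the work completion. Writing the size as `sizeof(**wce)` rather than naming the struct keeps the size tied to the object being cleared.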

dmitrygx added a commit to dmitrygx/libfabric that referenced this issue Sep 30, 2017
Signed-off-by: Dmitry Gladkov <dmitry.gladkov@intel.com>

dmitrygx commented Oct 1, 2017

Thanks for reporting this. Yes, you're right.

memset(*wce, 0, sizeof(struct fi_ibv_wce));

should be there instead.

I've verified this fix (#3355) on a system (iWARP verbs device) where the crash reproduces almost every time.

shefty added a commit that referenced this issue Oct 2, 2017