btl/openib: delay UCX warning to add_procs() #6137
If UCX is available, then pml/ucx will be used instead of pml/ob1 + btl/openib, so there is no need to warn about btl/openib not supporting Infiniband. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Looks ok to me.
Hi all, not sure how, but this PR completely breaks OMPI over UCX (and btl???): OMPI just crashes and dumps core.
@hoopoepg can you please provide more details on the crash? It obviously did not crash in my environment. What if you
@ggouaillardet it works fine without openib (using your command line).
@ggouaillardet it crashes at a different location every time; it could be in UCX init (ibv_open_device) or on a memory access.
valgrind reports many errors. I will sort out which ones are caused by this PR, since I am unable to reproduce any crash in my environment (mlx4 + centos7 + ucx master).
@ggouaillardet I'm using osu_bw to reproduce the issue. Env: Red Hat Enterprise Linux Server release 7.4 (Maipo). Command: mpirun -np 2 ./osu_bw
I found a gross bug of mine :-( Can you please give the attached patch a try?
diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 9ec57c0..cc7a982 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1048,6 +1048,7 @@ int mca_btl_openib_add_procs(
         opal_show_help("help-mpi-btl-openib.txt", "ib port not selected",
                        true, opal_process_info.nodename,
                        ibv_get_device_name(openib_btl->device->ib_dev), openib_btl->port_num);
+        return OPAL_SUCCESS;
     }
     btl_rank = get_openib_btl_params(openib_btl, &lcl_subnet_id_port_cnt);
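For readers following the thread, the idea behind this PR and the one-line patch above can be sketched in standalone C. This is only an illustration under stated assumptions: `sketch_btl`, `sketch_init`, `sketch_add_procs` and `SKETCH_SUCCESS` are hypothetical names, not Open MPI symbols, and the real `mca_btl_openib_add_procs()` does considerably more. The sketch shows a component that stays silent at init time, warns only once it is actually asked to add peers, and then returns early so the remaining setup is skipped.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; not Open MPI definitions. */
enum { SKETCH_SUCCESS = 0 };

struct sketch_btl {
    const char *device_name;
    int port_num;
    bool allowed;      /* false when this port should not be driven by the BTL */
    int warnings;      /* number of warnings printed so far */
    int nprocs_added;  /* peers registered by sketch_add_procs() */
};

/* Init only records the decision. No warning is printed here, because
 * another component may end up being selected and this one never used. */
static void sketch_init(struct sketch_btl *btl, const char *dev,
                        int port, bool allowed)
{
    btl->device_name = dev;
    btl->port_num = port;
    btl->allowed = allowed;
    btl->warnings = 0;
    btl->nprocs_added = 0;
}

/* The warning is delayed until the component is really asked to add peers. */
static int sketch_add_procs(struct sketch_btl *btl, int nprocs)
{
    if (!btl->allowed) {
        fprintf(stderr, "warning: %s port %d not selected\n",
                btl->device_name, btl->port_num);
        btl->warnings++;
        /* Return right after warning so the per-peer setup below
         * is not executed for a port that was never set up. */
        return SKETCH_SUCCESS;
    }
    btl->nprocs_added += nprocs;
    return SKETCH_SUCCESS;
}
```

Without the early return, the function would fall through into the per-peer setup after printing the warning, which is the kind of mistake the patch above is aimed at.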
@ggouaillardet doesn't help - same issue
Thanks for the report. Can you please run the following commands and tell me which work and which crash?
Also, could you please post a stack trace? I understand the crash location varies between runs, but I'd like to at least understand if it crashes during
Last but not least, which UCX version are you using? My system runs the same OFED release (with
UCX - master
MPI is master + your patch
Thanks for the traces! I am still unable to reproduce the issue, but I might have a lead on what is going on. Anyway, could you please apply the following patch (it only collects traces) on top of
Out of curiosity, does the command below crash?
If you get a chance, could you check out
@ggouaillardet
Thanks! I just realized I forgot to upload my patch with extended traces ... Anyway, I think I see what could be causing the issue, and I will upload a fix tomorrow.
…led. Fixes an issue introduced in open-mpi/ompi@0a2ce58 Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Many thanks to Sergey Oblomov for reporting this issue and the countless traces provided when troubleshooting it. Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
…led. Fixes an issue introduced in open-mpi/ompi@0a2ce58 This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master. Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Many thanks to Sergey Oblomov for reporting this issue and the countless traces provided when troubleshooting it. This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master. Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>