btl/openib: delay UCX warning to add_procs() #6152
@hjelmn please review when you have a chance
@hppritcha It looks OK to me on its own. Just waiting on #6184 and its associated v4.0.x PR before approving.
@ggouaillardet #6184 is moot now, is this still a WIP-DNM situation?
@ggouaillardet has marked this as a blocker on the corresponding master PR #6184, but we just closed that PR without merging because BTL openib has been removed from master. @ggouaillardet describes the seriousness here: #6184 (comment)
@ggouaillardet While this is no longer important for master, it'd be nice for v4.0.x. Also, is there someone else who could review this? hjelmn may not have the time anymore.
@gpaulsen This PR is currently half baked. At the very least, I have to backport the two commits from #6184 (that was never merged since According to @hoopoepg the outcome is still worse in some cases (a workaround is to disable I will backport the two commits from now and test it on my cluster (I have a single IB QDR port, and no issue with this PR on master). Can you please bring up the topic at today's telcon and decide how to move forward?
If UCX is available, then pml/ucx will be used instead of pml/ob1 + btl/openib, so there is no need to warn about btl/openib not supporting Infiniband. Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp> (cherry picked from commit open-mpi/ompi@0a2ce58)
…led. Fixes an issue introduced in open-mpi/ompi@0a2ce58 This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master. Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Many thanks to Sergey Oblomov for reporting this issue and the countless traces provided when troubleshooting it. This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master. Refs. open-mpi#6137 Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Force-pushed from 99d8576 to 8da4605
@gpaulsen I backported the other commits and it is running fine in my environment.
Thanks. We discussed this in today's Webex and would like to hear from either @xinzhao3 or @jladd-mlnx on whether this should go into v4.0.x.
From a design standpoint, this is a good PR. However, I don't think I'm the right person to review a change like this in the OpenIB BTL.
@hppritcha @ggouaillardet @jsquyres @hjelmn, So, based on what I read, I THINK this PR is ready to go into v4.0.x. Do you agree? Can we remove the WIP label and merge?
@gpaulsen Someone has to review it.
@gpaulsen do you want to merge this PR? I think it's ready.
Can one of the admins verify this patch?
If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
(cherry picked from commit 0a2ce58)
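
The idea in the description above (record the condition early, but only warn once the BTL is actually put to use in add_procs()) can be illustrated with a small standalone sketch. This is not the code from this PR: the struct, the function names, and the warning text are all hypothetical stand-ins, and it assumes that add_procs() is only reached when pml/ob1 + btl/openib is really selected, which is what makes deferring the warning safe.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the component state; NOT the real Open MPI structures. */
struct toy_openib_state {
    bool ib_port_seen;      /* set at init time: an InfiniBand port was found */
    bool warned;            /* ensures the warning is printed at most once */
    int  warnings_emitted;  /* counter kept only so the behavior is observable */
};

/* Init time: remember the condition, but do not warn yet.
 * If pml/ucx ends up being selected, the warning would only be noise. */
static void toy_component_init(struct toy_openib_state *s, bool found_ib_port)
{
    s->ib_port_seen = found_ib_port;
    s->warned = false;
    s->warnings_emitted = 0;
}

/* add_procs time: assumed to run only when this BTL is actually used,
 * so this is where the deferred warning is finally shown. */
static void toy_add_procs(struct toy_openib_state *s)
{
    if (s->ib_port_seen && !s->warned) {
        fprintf(stderr, "toy warning: InfiniBand port in use by this BTL\n");
        s->warned = true;
        s->warnings_emitted++;
    }
}
```

In this sketch, calling toy_component_init() alone never prints anything, and repeated calls to toy_add_procs() print the warning only once.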