Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

btl/openib: delay UCX warning to add_procs() #6152

Merged
merged 3 commits into from
Jun 3, 2019

Conversation

ggouaillardet
Copy link
Contributor

If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.

Signed-off-by: Gilles Gouaillardet gilles@rist.or.jp

(cherry picked from commit 0a2ce58)

@ggouaillardet
Copy link
Contributor Author

@hjelmn per the discussion in #6137, there is an issue that has to be fixed before this PR can be merged.

@hppritcha
Copy link
Member

@hjelmn please review when you have a chance

@hjelmn
Copy link
Member

hjelmn commented Dec 20, 2018

@hppritcha It looks ok to me on its own. Just waiting on #6184 and it's associated v4.0.x PR before approving.

@hppritcha
Copy link
Member

@ggouaillardet #6184 is moot now, is this still a WIP-DNM situation?

@jsquyres
Copy link
Member

@ggouaillardet has marked this as a blocker on the corresponding master PR #6184, but we just closed that PR without merging because BTL openib has been removed from master.

@ggouaillardet describes the seriousness here: #6184 (comment)

@gpaulsen
Copy link
Member

gpaulsen commented Mar 18, 2019

@ggouaillardet While this is no longer important for master, it'd be nice for v4.0.x
Can you confirm the status of this PR please?

Also is there someone else who could review this? hjlemn may not have the time anymore.

@ggouaillardet
Copy link
Contributor Author

@gpaulsen This PR is currently half baked. At the very least, I have to backport the two commits from #6184 (that was never merged since btl/openib was removed from master first).

According to @hoopoepg the outcome is still worst in some cases (a workaround is to disable btl/openib) and I do not have the hardware to investigate it (it seems we need both an IB and Ethernet port in order to evidence the issue).

I will backport the two commits from now and test it on my cluster (I have a single IB QDR port, and no issue with this PR on master).

Can you please bring the topic at today's telcon and decide how to move forward ?

If UCX is available, then pml/ucx will be used instead of
pml/ob1 + btl/openib, so there is no need to warn about
btl/openib not supporting Infiniband.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>

(cherry picked from commit open-mpi/ompi@0a2ce58)
…led.

Fixes an issue introduced in open-mpi/ompi@0a2ce58

This is a one-off commit for the v4.0.x branch since btl/openib has been removed from master.

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Many thanks to Sergey Oblomov for reporting this issue
and the countless traces provided when troubleshooting it.

This is a one-off commit for the v4.0.x branch since btl/openib has been removed
 from master.

Refs. open-mpi#6137

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet ggouaillardet force-pushed the topic/v4.0.x/ucx_warning branch from 99d8576 to 8da4605 Compare March 19, 2019 00:26
@ggouaillardet
Copy link
Contributor Author

@gpaulsen I backported the other commits and it is running fine in my environment.

@gpaulsen
Copy link
Member

Thanks.

We discussed in today's Webex and would like to hear from either @xinzhao3, @jladd-mlnx for a suggestion if this should go into v4.0.x.

@hppritcha hppritcha requested review from jladd-mlnx and removed request for hjelmn March 26, 2019 15:40
@hppritcha hppritcha assigned jladd-mlnx and unassigned hjelmn Mar 26, 2019
@gpaulsen gpaulsen modified the milestones: v4.0.1, v4.0.2 Mar 27, 2019
@jladd-mlnx
Copy link
Member

From a design standpoint, this is a good PR. However, I don't think I'm the right person to review a change like this in the OpenIB BTL.

@gpaulsen
Copy link
Member

@hppritcha @ggouaillardet @jsquyres @hjelmn,

So, based on what I read, I THINK this PR is ready to go into v4.0.x. Do you agree? Can we remove the WIP label and merge?

@jsquyres
Copy link
Member

@gpaulsen Someone has to review it.

@gpaulsen gpaulsen requested review from hppritcha and removed request for jladd-mlnx May 17, 2019 18:54
@hppritcha
Copy link
Member

@gpaulsen do you want to merge this PR? I think its ready.

@AboorvaDevarajan
Copy link
Member

Can one of the admins verify this patch?

@gpaulsen gpaulsen merged commit 18f1037 into open-mpi:v4.0.x Jun 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants