Openib BTL being selected and used when not desired #7107
Comments
I think the help message could be a bit better:
There's no indication there that this is about the
@mwheinz What is your suggestion here? Specifically: the intent to use UCX is the same, regardless of whether Open MPI is compiled with UCX support or not. Are you asking for a re-wording of the message? Or something else?
Jeff, my question is: if we aren't using verbs, why am I getting the warning at all?
Heck, in the above example, I'm not using OPA *or* InfiniBand at all!
This problem (and a related hang) appears to be caused by the openib btl loading even though it is not desired. Using
prevents both the warning and the apparently related hang. However, it leaves the question of why the openib btl is being used at all. Using this updated command line:
I still see
And if I use
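The specific command lines referenced above did not survive in this copy of the issue. As a rough sketch of what is being described, assuming the standard Open MPI MCA syntax and the IMB-EXT benchmark mentioned later in the thread (process count, binary path, and exact flags are illustrative, not the reporter's actual invocation):

```
# Exclude the openib BTL using the standard MCA "^" exclusion syntax
# (quote the caret if your shell treats ^ specially):
mpirun -np 2 --mca btl '^openib' ./IMB-EXT

# Explicitly request the cm PML and the OFI MTL, the path the reporter
# intended to use:
mpirun -np 2 --mca pml cm --mca mtl ofi ./IMB-EXT
```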
I would rephrase that: the problem is that the OFI MTL is not being used when you expect it to be. It is understandable that the openib BTL is being used. Specifically:
1. The OFI MTL elected not to be used, so the CM PML was not used.
2. Open MPI therefore fell back to the OB1 PML.
3. OB1 uses BTLs for its transports, and the openib BTL elected to run.
Hence, the real question is the first element in this chain: why did the OFI MTL elect not to be used? I suspect it has something to do with the selection logic in the OFI MTL -- perhaps something specific to do with the TCP OFI provider...? There was definitely discussion of not wanting to use the TCP OFI provider for the OFI MTL (and instead using the TCP BTL). I don't know/remember where all that conversation landed, but that's where I'd look.
Again, I think the question is: why did the OFI MTL elect not to run (and therefore fall back to the OB1 PML and whatever BTLs elected to run)?
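One way to dig into that question is to raise the MCA framework selection verbosity so each PML/MTL/BTL component reports whether it elected to run and why. A diagnostic sketch, not a command from the original thread (verbosity levels and output format vary between releases):

```
# Print the component-selection decisions for the PML, MTL, and BTL
# frameworks; the output shows which components are considered, which
# decline to run, and why.
mpirun -np 2 \
    --mca pml_base_verbose 100 \
    --mca mtl_base_verbose 100 \
    --mca btl_base_verbose 100 \
    --mca pml cm --mca mtl ofi \
    ./IMB-EXT 2>&1 | tee selection.log
```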
Note that the correct notation is:
Same question: why did the OFI MTL elect not to run? Per our off-issue discussion in Slack, if there's a hang with the openib BTL on HFI devices, that is a secondary issue. You may or may not be interested in addressing that issue (i.e., if PSM/PSM2 is never intended to be supported through openib, it may not be worth digging into that).
Agreed. At this point, I've established the following characteristics:
The most astonishing bit for #3 is that I am running a 2-process B2B benchmark, but I also see this:
[hdsmpriv01:39668] 81 more processes have sent help message help-mca-bml-r2.txt / unreachable proc
I've updated the problem description to more accurately reflect my current understanding and attached a verbose log of a sample run of the IMB-EXT benchmark.
It's probably due to BTLs being used for one-sided MPI operations, but the CM/OFI MTL being used for point-to-point.
I literally just figured that out - decided to try some other apps to collect more info and, yup, the problem only occurs for one-sided communications.
Does this also occur with OMPI 3.1.x?
Probably. I.e., I don't think the MTLs support one-sided operations. So this behavior is probably not new.
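For context on why one-sided traffic behaves differently: IMB-EXT exercises one-sided operations (MPI_Put, MPI_Get, MPI_Accumulate), which go through the osc framework rather than the MTL. A hedged sketch, not a command from the thread, of how to see which one-sided component gets selected:

```
# In 4.0.x the rdma osc component drives BTLs directly, even when
# point-to-point traffic uses the cm PML + OFI MTL; this prints the
# osc component-selection decisions.
mpirun -np 2 --mca osc_base_verbose 100 --mca pml cm --mca mtl ofi ./IMB-EXT
```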
Makes sense.
@hppritcha has indicated that this may be fixed in 4.0.2. I will re-test with that and report back.
I've confirmed this is fixed in 4.0.2.
Thank you for taking the time to submit an issue!
Background information
When using the OFI MTL with 4.0.1 and 4.0.2, I am seeing the following warning:
After research, I discovered that the problem is that OMPI is selecting a BTL transport even though the OFI MTL was selected. Moreover, attempting to exclude openib as a transport simply causes OMPI to select a different BTL instead of using the selected MTL. If I exclude all BTLs except self, the benchmark begins reporting errors instead, although it then runs to completion anyway.
The problem does not occur when the PSM2 MTL is selected.
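The command lines behind these observations are not preserved in this copy of the report; hedged approximations using standard MCA syntax (process count and binary path are illustrative):

```
# Exclude every BTL except self -- the configuration where the benchmark
# starts reporting errors but still runs to completion:
mpirun -np 2 --mca btl self ./IMB-EXT

# Select the PSM2 MTL instead of OFI -- the configuration that does not
# show the problem:
mpirun -np 2 --mca pml cm --mca mtl psm2 ./IMB-EXT
```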
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.1, 4.0.2
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Compiled from git clone of 4.0.[1|2]
Please describe the system on which you are running
Details of the problem