Add BTL_OFI_BLACKLIST #6734
Co-authored-by: jlbyrne-hpe <john.l.byrne@hpe.com> Signed-off-by: guserav <erik.zeiske@web.de>
Can one of the admins verify this patch?
ok to test
Can you add some color as to why the OFI BTL shouldn't use the psm2 provider? We should have a note either in the commit log or in a comment about why it is excluded. I'm also not sure customers are going to love having to exclude both MTL and BTL providers. Maybe that's ok, but we should think about it.
What happens if a user supplies --mca btl_ofi_provider_include BLAH on the command line? It does not look like you are checking the source of the include / exclude lists to see if the user supplied "include" but not "exclude" (and therefore the exclude list should be ignored), ...etc.
EDIT: Fixed mtl_ofi... from original text to be btl_ofi..., which is what I originally intended.
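The check asked for in this review comment could look roughly like the following. This is only a sketch under stated assumptions: `param_source`, `list_param`, `filter_mode`, and `resolve_filter` are hypothetical names made up for illustration, and the code does not use the real MCA variable API or reflect what the PR actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: records whether a value is the built-in default or user-set. */
enum param_source { PARAM_SRC_DEFAULT, PARAM_SRC_USER };

/* Hypothetical stand-in for an include or exclude list parameter. */
struct list_param {
    const char *value;          /* provider name(s), or NULL if unset */
    enum param_source source;   /* where the value came from */
};

enum filter_mode { FILTER_NONE, FILTER_INCLUDE, FILTER_EXCLUDE, FILTER_CONFLICT };

/* Decide which list to honor, based on who supplied each value. */
static enum filter_mode resolve_filter(const struct list_param *inc,
                                       const struct list_param *exc)
{
    assert(inc != NULL && exc != NULL);

    bool user_inc = inc->value != NULL && inc->source == PARAM_SRC_USER;
    bool user_exc = exc->value != NULL && exc->source == PARAM_SRC_USER;

    if (user_inc && user_exc) {
        return FILTER_CONFLICT;   /* both given by the user: report an error */
    }
    if (user_inc) {
        return FILTER_INCLUDE;    /* user include wins; default exclude is ignored */
    }
    if (exc->value != NULL) {
        return FILTER_EXCLUDE;    /* user-set or default exclude list applies */
    }
    if (inc->value != NULL) {
        return FILTER_INCLUDE;    /* only a default include is present */
    }
    return FILTER_NONE;
}
```

With a user-supplied include of BLAH and a default exclude list, this sketch returns `FILTER_INCLUDE`, which is the "exclude list should be ignored" case described above.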
Mainly to satisfy the coding style. Signed-off-by: guserav <erik.zeiske@web.de>
878fc75 to 36fa464
No release currently supports one-sided communication using ofi. The OFI BTL was added to allow one-sided support for the 4.0.0 release, and it was removed from the release during testing when some versions of psm2 broke things. See "OFI issues on Open MPI v4.0.0rc1 (Ralph H Castain) 20 Sep 2018" on the devel list. The hope is to enable one-sided support for some providers without causing trouble for others, and without doing a lot more work. The email above suggests a common OFI component, but that is a lot more work. The goal of the patch is to enable one-sided support for some providers with minimal effort without breaking anything.
I don't love having separate MTL/BTL include/exclude parameters, but it may not be possible to avoid having two sets of parameters. My understanding is that enabling one-sided for psm2 negatively impacts two-sided performance. Given that the use of MPI one-sided is rare, we may need one-sided include/exclude parameters whatever we do. (See my final question below.)
Despite the comment, the behavior is not exactly the same as with mtl. Including a provider will override its entry on the exclusion list, but the exclusion list is not simply ignored as it is in mtl. We also deliberately chose not to make the include a list, but to keep it as a single provider. However, as I noted above, one-sided is rare. Maybe the correct approach is for BTL OFI to be disabled unless a provider is specified, instead of trying to enable it automatically if ofi is present?
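The include/exclude behavior described in this comment can be sketched as below. This is an illustration only, not the PR's code: `default_exclude`, `in_list`, and `provider_allowed` are hypothetical names, the list contains only psm2 because that is the only entry named in this thread, and the sketch assumes that providers other than the included one are still filtered by the exclusion list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical default exclusion list; only "psm2" is taken from the discussion. */
static const char *const default_exclude[] = { "psm2", NULL };

/* Return true if provider `name` appears in the NULL-terminated list. */
static bool in_list(const char *const *list, const char *name)
{
    for (size_t i = 0; list[i] != NULL; i++) {
        if (strcmp(list[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * Decide whether a provider may be used.
 * `include` is a single provider name, or NULL when nothing was specified.
 */
static bool provider_allowed(const char *name, const char *include,
                             const char *const *exclude)
{
    assert(name != NULL && exclude != NULL);

    if (include != NULL && strcmp(include, name) == 0) {
        return true;                 /* explicit include overrides the exclusion list */
    }
    return !in_list(exclude, name);  /* the exclusion list still applies otherwise */
}
```

Under these assumptions, psm2 is rejected by default, accepted when it is named as the include, and a provider that is not on the list is accepted either way.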
@jlbyrne-hpe Forgive me -- I mis-typed: I meant to say My concern is that the help messages say that the include/exclude params are mutually exclusive, but the code does not treat them that way. Put differently: what happens if the user specifies both? Make sense? If MTL OFI is exhibiting this same behavior, then it is also incorrect and should be fixed.
@jsquyres Understood.
bot:ompi:retest
@guserav @jlbyrne-hpe Let me take a step back and ask: what is the exact problem you are trying to solve here? The title is "Add BTL_OFI_BLACKLIST", but there is no BTL_OFI_BLACKLIST term or variable anywhere here. It's the Can you specify exactly what problem you are trying to solve? Is there something wrong with the psm2 OFI provider and/or the OFI BTL? If so, have you talked to Intel about it?
HPE worked together with an Intel intern and Howard Pritchard to develop the original BTL OFI, which was tested on psm2, gni, and our in-progress zhpe provider. It was proposed and accepted into Open MPI, and all seemed fine until the 4.0 release tests, when it broke existing psm2 installations. I do not recall the exact issue, but the contention was that the problems occurred only with certain versions of the Libfabric/OPA libraries. If this is true, the solution is to update things, but if you don't care about one-sided, why should you have to update things that are working? Of course, by this time, the intern had left and there were other priorities, so things were left as they were. So the problem we are trying to solve is how to get an Open MPI release which will support one-sided for OFI providers without breaking existing working two-sided installations. It was suggested by Howard that an exclusion list with psm2 on it by default might be the easiest way forward. The use of the term "blacklist" in the PR was perhaps an unfortunate choice of words.
This seems like a big hammer to fix an OFI-version specific problem. It also restricts quite a bit more than just the Given that your goal is to restrict behavior on Intel hardware, I think Intel needs to be involved in the discussion.
@jsquyres Actually, given the existence of BTL OFI in the tree, I consider it to be a low-effort fix to enable a feature a few users will care about while not breaking things for the majority of users who don't. In the 4.0.x branch, a system with psm2 and Ethernet would perform one-sided over Ethernet; on master with this patch, the expectation is that a psm2 system will still use Ethernet by default, but this can be overridden, and other providers shouldn't have to worry. As to the other items on the exclusion list, this is the exclusion list from MTL OFI with psm2 added. In the absence of a common OFI component, a common OFI header would be useful for sharing things like this. We can certainly address any documentation issues you desire. I do think that given the lack of use of one-sided, the best choice is off-by-default, though.
Well, we analyzed this problem extensively on a call today with Intel, HPE, and LANL participants and determined that this blacklist approach to working around PSM2 libfabric provider limitations will not work, at least not while the PSM2 MTL is active. The underlying issue is that using PSM2 both through libfabric and directly does not work. Although it appears to be okay to call what we are noticing is that the blacklist approach works if one selects the OFI MTL. It does not work if one goes with the default of the PSM2 MTL. Even if one excludes the PSM2 provider in a blacklist for the OFI BTL, the very fact that one calls fi_getinfo while the PSM2 provider is present in libfabric is sufficient to trigger the use of PSM2_finalize with libfabric. This results in messages like this:
Although it might be useful to include a blacklist in the OFI BTL, it won't solve the problem of the PSM2 provider conflicting with PSM2 being used as the MTL. So I am closing this PR; a new one can be opened when a better solution is found.
Built with ompi/master HEAD commit 5d51b23. I needed to configure ompi/master with '--disable-picky' to build with the PR commits. I did not need to use '--disable-picky' without the PR commits. Is this expected?
No - developers should never |
Don’t initialize OFI BTL if only PSM2 provider is available (override with MCA parameter)