5.0.4 and newer -- LSF Affinity hostfile bug #12794
Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - it might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps needing to rely solely on rankfiles and revised cmd line options - we may not be able to utilize LSF "integration"). Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in HWLOC XML format. It will likely take a while to get a fix as support time is quite limited.
As for LSF support... damn... As for rankfile mapping: I have never done that before, so let me know if this is correct? (I couldn't upload XML files, so they had to be compressed.)
Best I can suggest is that you contact IBM through your LSF contract support and point out that if they want OMPI to function on LSF going forward, they probably need to put a little effort into supporting it. 🤷‍♂️ The XML looks fine - thanks! Will update as things progress.
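(For reference, a topology dump like the one attached can be produced with hwloc's lstopo tool, or programmatically. A minimal C sketch, assuming hwloc 2.x; the output file name topo.xml is just an example:)

```c
/* export_topo.c - dump the local hardware topology to XML.
 * A sketch equivalent to running "lstopo topo.xml"; build with:
 *   cc export_topo.c -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);   /* discover the local machine */

    /* Write the topology as XML; 0 = default export flags (hwloc 2.x API) */
    if (hwloc_topology_export_xml(topo, "topo.xml", 0) < 0)
        fprintf(stderr, "XML export failed\n");

    hwloc_topology_destroy(topo);
    return 0;
}
```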
prrte @ 42169d1cebf75318ced0306172d3a452ece13352 is the last good one.
workaround:
Ouch - I would definitely advise against doing so. It might work for a particular application, but almost certainly will cause breakage in general.
But the documentation still suggests that Open MPI can be built with LSF support.
Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.
Which Open MPI release?
Thank you.
They seem to indicate that v5.0.3 is working, but all of the v5.0.x releases appear to at least build for them.
Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢
I tried all 5.0.x versions, but they won't pass configure. I managed to build 4.0.3.
I use the tarball to build; could that be the problem?
You could check whether we pass anything important in our configure step (see the initial message).
I don't know what the issue could be, but I think this thread shouldn't be cluttered; rather, open a new issue IMHO. This issue is deeper (not a build issue).
I did open a ticket.
@bgoglin This implies that the envvar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.
This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).
@bgoglin Hmmm... we removed this code from PRRTE: flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED; because we only want the topology to contain the CPUs the user is allowed to use (note: all CPUs will still be in the complete_cpuset field if we need them - we use the return from get_allowed_cpuset()). I'll work out the issue for LSF as a separate problem - we don't see problems elsewhere, so it has something to do with what LSF is doing. My question for you is: how do I ensure the cpuset returned by get_allowed_cpuset() is accurate?
Just ignore this corner case. @sb22bs said using this envvar is a workaround. It was designed for strange buggy cases, e.g., when cgroups are misconfigured. I can try to better document that this envvar is a bad idea unless you really know what you are doing. Just consider that get_allowed_cpuset() is always correct.
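(For illustration, the distinction discussed above can be seen with a few hwloc calls. A minimal C sketch, assuming hwloc 2.x: the allowed cpuset reflects cgroup restrictions, while the root object's complete_cpuset still lists every CPU.)

```c
/* allowed_vs_complete.c - show allowed vs. complete cpusets.
 * A sketch of the semantics discussed above (hwloc 2.x). */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    char buf[512];

    hwloc_topology_init(&topo);
    /* PRRTE no longer sets HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED here,
     * so the main cpusets contain only CPUs the user may use. */
    hwloc_topology_load(topo);

    /* Reflects cgroup restrictions (always correct per the comment above) */
    hwloc_bitmap_snprintf(buf, sizeof(buf),
                          hwloc_topology_get_allowed_cpuset(topo));
    printf("allowed:  %s\n", buf);

    /* Still contains every CPU, even disallowed ones */
    hwloc_bitmap_snprintf(buf, sizeof(buf),
                          hwloc_get_root_obj(topo)->complete_cpuset);
    printf("complete: %s\n", buf);

    hwloc_topology_destroy(topo);
    return 0;
}
```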
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem).
I don't have 5.0.4 installed, so I don't know if that is affected, but I am quite confident
that this also occurs for 5.0.4 (since the submodule for prrte is the same for 5.0.4 and 5.0.5).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From the sources. A little bit of ompi_info -c info, and env-vars:
Version numbers are of course different for 5.0.5, otherwise the same.
Please describe the system on which you are running
Operating system/version:
AlmaLinux 9.4
Computer hardware:
Tested on various hardware, both with and without hardware threads (see below).
Network type:
Not relevant, I think.
Details of the problem
The problem relates to the interaction between LSF and Open MPI.
A couple of issues are shown here.
Bug introduced between 5.0.3 and 5.0.5
I encounter problems running simple programs (hello-world) in a multinode configuration:
This will run on 4 nodes, each using 2 cores.
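(The submission command was stripped above; the test program itself was just a plain hello-world along these lines - a sketch, not the reporter's exact source:)

```c
/* hello.c - minimal MPI hello-world of the kind used for these tests.
 * Build with: mpicc hello.c -o hello */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);
    printf("Hello from rank %d of %d on %s\n", rank, size, host);
    MPI_Finalize();
    return 0;
}
```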
Output from 5.0.3:
This looks reasonable, and the LSF affinity file corresponds to this binding.
Note that these nodes do not have hyper-threading enabled.
So our guess is that LSF always expresses affinity in terms of HWTs, which is OK.
It still obeys the default core binding, which is what our end-users
would expect.
5.0.5
Clearly something went wrong when parsing the affinity hostfile.
The hostfile looks like this (for both 5.0.3 and 5.0.5):
(different job, hence different nodes/ranks)
So the above indicates some regression in this handling. I tried to track
something down in prrte, but I am not skilled enough in the logic happening there.
I tracked the prrte submodule hashes of Open MPI to these:
5.0.3: 3a70fac9a21700b31c4a9f9958afa207a627f0fa
5.0.4: b68a0acb32cfc0d3c19249e5514820555bcf438b
5.0.5: b68a0acb32cfc0d3c19249e5514820555bcf438b
So my suspicion is that 5.0.4 also has this.
Now, these things are relatively easily fixed.
I just do:
unset LSB_AFFINITY_HOSTFILE
and rely on cgroups. Then I get the correct behaviour.
Correct bindings etc.
By unsetting it, I also fall back to the default Open MPI binding:
5.0.3
Note here that it says core instead of hwt.
5.0.5
So the same thing happens - good!
Nodes with HW threads
This is likely related to the above; I just put it here for completeness.
As mentioned above, I can do
unset LSB_AFFINITY_HOSTFILE
and get correct bindings. However, this works only when there are no HWTs.
Here is the same thing for a node with 2 HWTs per core (EPYC Milan, 32 cores/socket in a 2-socket system).
Only requesting 4 cores here.
5.0.3
This looks OK. Still binding to the cgroup cores.
5.0.5
This looks bad: wrong core binding; it should have been 6,7 on both nodes.
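(One way to verify what each rank is actually bound to, independently of what mpirun reports - a minimal hwloc sketch, assuming hwloc 2.x:)

```c
/* printbind.c - print the PUs the calling process is bound to.
 * Run one instance per rank to cross-check the reported bindings. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char buf[256];

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* Ask the OS for this process's current CPU binding */
    if (hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS) == 0) {
        hwloc_bitmap_list_snprintf(buf, sizeof(buf), set);
        printf("bound to PUs: %s\n", buf);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}
```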
If you need more information, let me know!