MPI_Comm_split_type not creating disjoint subgroups for certain cases #12812
Comments
Please provide the output of hwloc-ls.
The way it looks based on the output is that rank 2 (which is bound to cores spanning two L3 domains, since cores 0-7 are on the first L3 domain, cores 8-15 are on the second L3 domain, etc.) created a communicator that includes all ranks that have been bound to either the first or the second L3 domain (which is why it ends up with 6 ranks). On the other hand, rank 0 is only bound to cores on the first L3 domain, and its communicator only includes ranks that have been bound to that L3 domain.
Funny, that's kind of what I suspected based on the published documentation of the Genoa architecture, but with the topology output things would have been clearer. Indeed, taking into account that the split is supposed to create disjoint communicators, what should we expect from a non-symmetric case like the one here? One potential outcome could be to eliminate processes spanning multiple domains from the split operation and return something like this:
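For the bindings reported here, that could look roughly like the following (ranks 2 and 5, whose bindings span two L3 domains, would end up on their own, or possibly with MPI_COMM_NULL):
comm(0): 0, 1
comm(1): 0, 1
comm(2): 2
comm(3): 3, 4
comm(4): 3, 4
comm(5): 5
comm(6): 6, 7
comm(7): 6, 7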
In this case it is not obvious to users why some processes (here 2 and 5) are not part of a larger set.
Is it still possible to create disjoint communicators for non-symmetric cases, with each process belonging to a specific communicator based on some rule, say "I only belong to the first domain (in some ordering)"?
@mshanthagit that is also what I was about to suggest :-) @bosilca I agree that there is no clear and good solution in this scenario, so the primary goal has to be that it is consistent across processes. That being said, having processes end up just by themselves in a comm is probably not desirable. Would it make sense to have a rule along the lines of: if a process is part of multiple L3 domains, Comm_split_type is applied as if the process were only part of the first domain it belongs to?
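For the bindings in this issue, that rule would (if I read it correctly) fold rank 2 into the first L3 domain and rank 5 into the second, giving three disjoint groups:
comm(0): 0, 1, 2
comm(1): 0, 1, 2
comm(2): 0, 1, 2
comm(3): 3, 4, 5
comm(4): 3, 4, 5
comm(5): 3, 4, 5
comm(6): 6, 7
comm(7): 6, 7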
I quickly looked into the code base, and we use this API extensively in the collective modules. We need to find a solution that would make sense for our internal collectives as well.
If we can easily figure out where most of its resources are bound to, sure. But otherwise, I would define the 'first domain' as the domain of the first core that it is bound to.
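A user-level sketch of that idea (an illustration only, not the Open MPI internals; it assumes hwloc >= 2.x, where HWLOC_OBJ_L3CACHE exists): take the first PU of the process binding, find its L3 ancestor, and use that domain's logical index as the color for a plain MPI_Comm_split.

```c
#include <hwloc.h>
#include <mpi.h>

/* Illustrative sketch: derive a split "color" from the L3 domain that
 * contains the first PU of this process' binding. Returns MPI_UNDEFINED
 * if the binding or the L3 object cannot be determined. */
static int first_l3_color(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t binding = hwloc_bitmap_alloc();
    int color = MPI_UNDEFINED;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    if (hwloc_get_cpubind(topo, binding, HWLOC_CPUBIND_PROCESS) == 0) {
        int first_pu = hwloc_bitmap_first(binding); /* OS index of first bound PU */
        hwloc_obj_t pu = (first_pu >= 0) ?
            hwloc_get_pu_obj_by_os_index(topo, (unsigned)first_pu) : NULL;
        hwloc_obj_t l3 = pu ?
            hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_L3CACHE, pu) : NULL;
        if (l3 != NULL)
            color = (int)l3->logical_index;
    }
    hwloc_bitmap_free(binding);
    hwloc_topology_destroy(topo);
    return color;
}

/* Possible usage in user code, as a workaround / thought experiment:
 *   MPI_Comm_split(MPI_COMM_WORLD, first_l3_color(), 0, &newcomm);    */
```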
Yes, that's how we found the issue, from the usage in a collective component :-)
The simple fix, aka strict matching, is relatively easy to do by changing
@bosilca what does strict matching do? Say in the example above, how does the split happen?
Strictly the same binding at a specific level. I gave the outcome for the example provided here a few comments above. On Genoa with the bindings provided by the user (
@bosilca this is not consistent: rank 2 cannot think that it is alone in its communicator while ranks 0 and 1 think that it is part of their subgroup.
@bosilca I think it could lead to incorrect behavior (pardon me if I am wrong). Say I do an allreduce with the new comm after the split, what's the behavior? What will rank 2 have? Ignore that, I thought I saw 2 in the other comms.
@mshanthagit as @edgargabriel noticed, my example was incorrect. It should now be fixed: ranks 2 and 5 will be alone in their own communicator.
Maybe we should do a 2-step process: the first step is to add the simple solution that @bosilca suggested and backport it to 5.0.x and 4.1.x. There is clearly value in this solution, since it fixes an inconsistency. If somebody has the cycles to work on my suggested approach, we can still do that at a later stage, maybe for 6.0. (There is a good chance that it might not happen though, in which case we at least have the first fix.)
Just thinking out loud, the solution @bosilca suggested creates 5 communicators whereas one would expect 3 (as there are 3 L3 domains). Will there be any side effects?
Honestly, I think that users binding processes as in the example here (overlapping several domains) deserve what they get, and any split type is good as long as it is consistent across the board. The strict mode has the advantage of being a one-liner.
I don't necessarily disagree with you @bosilca, I would just caution that sometimes these decisions are not dominated by MPI requirements. In this particular instance, it was the compute performance that was significantly better than using
Or maybe a different mapping pattern? Not sure exactly what you are trying to achieve, but seems like mapping to L3 is something we have already enabled, so I'm a tad confused.
I am happy to take any help @rhc54, I couldn't find a solution for how to map 8 processes onto a group of 3 CCDs, that is, the first three ranks to the first L3 domain, the next three ranks to the second L3 domain, and the last two ranks to the third L3 domain, given that the node has a second package with another 3 L3 domains and we would like to repeat the same pattern there. I tried mapping by L3 domains, but it didn't do what we wanted, not because the mapping didn't work correctly, but because I couldn't find a syntax to express this slight imbalance of 3/3/2 ranks per L3 domain.
There is only so much we can do automatically in the MPI library. For everything else, the users can fall back to either a manual
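For the 3/3/2 pattern described above, one manual option could be a rankfile; the following is a sketch only (hostname taken from the output above, per-rank core lists just one possible within-CCD layout, and the exact slot semantics should be checked against the rankfile documentation for the release in use):

```
rank 0=electra016 slot=0-2
rank 1=electra016 slot=3-5
rank 2=electra016 slot=6-7
rank 3=electra016 slot=8-10
rank 4=electra016 slot=11-13
rank 5=electra016 slot=14-15
rank 6=electra016 slot=16-18
rank 7=electra016 slot=19-21
```

Passing it with something like --rankfile on 4.1.x or --map-by rankfile:file=<file> on 5.0.x keeps every rank inside a single L3 domain, which is also what keeps the split consistent.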
@bosilca does that mean that a process that does not fulfill this criterion (i.e., one that does not utilize that specific hardware resource instance and no other instance of the same hardware resource type) should actually not be part of any resulting communicator, e.g. get MPI_COMM_NULL?
Yeah, we do that frequently - can you post your topology so I can give you the correct cmd line?
@rhc54 the output of hwloc-ls is listed above on the ticket, is that what you are looking for or do you need additional information?
I need the actual topology file output - the XML output - so I can use it as input to PRRTE.
Yes, we can return MPI_COMM_NULL in that case.
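For what it's worth, MPI_Comm_split_type can already return MPI_COMM_NULL today (when a process passes MPI_UNDEFINED as the split type), so portable callers should be prepared for it anyway; a minimal caller-side check might look like this:

```c
MPI_Comm newcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0,
                    MPI_INFO_NULL, &newcomm);
if (newcomm == MPI_COMM_NULL) {
    /* this process was not placed in any L3-based subcommunicator */
} else {
    /* ... use newcomm for L3-local collectives ... */
    MPI_Comm_free(&newcomm);
}
```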
@rhc54 I pinged you on slack for that, thank you!
@edgargabriel Sent you a note over the weekend to verify the fix - committed upstream.
Thank you for taking the time to submit an issue!
Background information
MPI_Comm_split_type is creating groups with overlapping ranks in certain cases where ranks are bound to cores across resource domains (say L3). For example, consider the following where ranks 2 and 5 share two L3 domains.
(program using MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0, info, &newcomm); on a Genoa machine with 8 cores per L3)
mpirun -np 8 --map-by ppr:8:numa:pe=3 --report-bindings ./a.out
[electra016:1374129] Rank 0 bound to package[0][core:0-2]
[electra016:1374129] Rank 1 bound to package[0][core:3-5]
[electra016:1374129] Rank 2 bound to package[0][core:6-8]
[electra016:1374129] Rank 3 bound to package[0][core:9-11]
[electra016:1374129] Rank 4 bound to package[0][core:12-14]
[electra016:1374129] Rank 5 bound to package[0][core:15-17]
[electra016:1374129] Rank 6 bound to package[0][core:18-20]
[electra016:1374129] Rank 7 bound to package[0][core:21-23]
Hello --- my rank: 0, my comm_size: 8
Hello --- my rank: 1, my comm_size: 8
Hello --- my rank: 7, my comm_size: 8
Hello --- my rank: 6, my comm_size: 8
Hello --- my rank: 5, my comm_size: 8
Hello --- my rank: 4, my comm_size: 8
Hello --- my rank: 3, my comm_size: 8
Hello --- my rank: 2, my comm_size: 8
From split comm: my rank: 0, my split_comm_size: 3
From split comm: my rank: 2, my split_comm_size: 6
From split comm: my rank: 4, my split_comm_size: 4
From split comm: my rank: 6, my split_comm_size: 3
From split comm: my rank: 1, my split_comm_size: 3
From split comm: my rank: 3, my split_comm_size: 4
From split comm: my rank: 5, my split_comm_size: 6
From split comm: my rank: 7, my split_comm_size: 3
As we can see from the above, there are only two ranks with comm_size 6! Although it doesn't print out the ranks within each communicator, here's what it would be:
comm(0): 0, 1, 2
comm(1): 0, 1, 2
comm(2): 0, 1, 2, 3, 4, 5
comm(3): 2, 3, 4, 5
comm(4): 2, 3, 4, 5
comm(5): 2, 3, 4, 5, 6, 7
comm(6): 5, 6, 7
comm(7): 5, 6, 7
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I tested with 5.0.x and 4.1.6
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source (5.0.x)
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
Please describe, in detail, the problem that you are having, including the behavior you expect to see, the actual behavior that you are seeing, steps to reproduce the problem, etc. It is most helpful if you can attach a small program that a developer can use to reproduce your problem.
Details in the background section. Here is an example program:
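A minimal version along these lines reproduces it (a sketch reconstructed from the output format above; error checking omitted and MPI_INFO_NULL used in place of the info object mentioned earlier):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, split_size;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello --- my rank: %d, my comm_size: %d\n", rank, size);

    /* Split COMM_WORLD by L3 cache domain (Open MPI extension to the
     * standard split types). */
    MPI_Comm_split_type(MPI_COMM_WORLD, OMPI_COMM_TYPE_L3CACHE, 0,
                        MPI_INFO_NULL, &newcomm);
    MPI_Comm_size(newcomm, &split_size);
    printf("From split comm: my rank: %d, my split_comm_size: %d\n",
           rank, split_size);

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}
```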