use PMIx to manage CIDs #4674
Conversation
Just a suggestion: instead of creating a new API, just use the PMIx_Job_control_nb function and define new attributes. You could have one (or perhaps all?) procs call it to "init a counter", giving it a name, and perhaps allow passing another attribute for "scope" (using the PMIx scope values). Subsequent calls might use PMIx_Get to fetch the counter, adding an attribute meaning "increment the counter after get by all procs in scope". Needs refinement, of course, but might simplify things.
Small self-debate over using job control. This isn't directly a "control", and so one could argue for a new API. On the other hand, one can envision using the counter(s) for control purposes, perhaps even using attributes to request that the host RM take certain actions upon reaching a specified counter level. Up to you - just passing along some thoughts.
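A rough sketch of what the suggested counter-via-job-control flow might look like. The attribute keys "pmix.cntr.init" and "pmix.cntr.range" below are invented for illustration and would need to be standardized; only PMIx_Job_control_nb, PMIX_INFO_LOAD, and the callback signature are existing PMIx API:

```c
#include <pmix.h>
#include <stdbool.h>

static volatile bool done = false;

/* standard pmix_info_cbfunc_t signature */
static void cntr_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                        void *cbdata, pmix_release_cbfunc_t release_fn,
                        void *release_cbdata)
{
    done = true;                 /* counter created (or request failed) */
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
}

static pmix_status_t init_counter(const char *counter_name)
{
    pmix_info_t directives[2];
    pmix_data_range_t range = PMIX_RANGE_LOCAL;   /* node-local scope */

    /* hypothetical attribute: create a named counter */
    PMIX_INFO_LOAD(&directives[0], "pmix.cntr.init", counter_name, PMIX_STRING);
    /* hypothetical attribute: counter scope, expressed with PMIx range values */
    PMIX_INFO_LOAD(&directives[1], "pmix.cntr.range", &range, PMIX_DATA_RANGE);

    /* NULL targets: the directive applies to the caller's own job */
    return PMIx_Job_control_nb(NULL, 0, directives, 2, cntr_cbfunc, NULL);
}
```

A later PMIx_Get on the counter name, carrying an "increment after get" attribute, would then hand each caller the next value in turn.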
Thanks @rhc54 for the insight! From a design point of view, I am still on the fence regarding whether the collective part should happen at the MPI level or the PMIx level.
Personally, I'd opt for having PMIx (and thus, the host RM) perform the collective so it would be more broadly applicable to other programming paradigms. I see a common need for counters (both global and local) in various models. The only caveat is the need for the host RM to provide a fast collective, but we are seeing that coming into play (e.g., what Mellanox did for SLURM in the PMIx plugin).
Force-pushed from 840ee22 to af394d3.
bot:mellanox:retest
@ggouaillardet what is the main purpose of this PR? Performance optimization of Comm create operations?
@artpol84 the goal is to improve CID allocation performance. Long story short, with the current algorithm it can take several iterations to agree on the lowest available CID; by involving PMIx, all the tasks on a node can agree locally in one iteration, which should reduce the total number of iterations.
I was touching the current algorithm some time ago and may not recall all of the details. But I think that it's not only …
You are absolutely right!
bot:mellanox:retest
Can you briefly explain how you use PMIx? Fence-like exchange?
I added the new PMIx_CID call. The non-master MPI tasks still perform a bunch of MPI-level steps (gather and broadcast), and I am still on the fence whether the MPI stuff should be done at the PMIx level or not.
And how does the server know the CID? Is it stored in the key?
I’m referring to this: …
Via the newly introduced PMIx_CID call.
Just to be clear: I believe we had converged on not introducing a new PMIx API, but instead to use the PMIx_Job_control_nb API to create/increment generic counters (using PMIx scope values to indicate their range) that OMPI could use as CIDs. It would then be up to the RM to determine the method of ensuring uniqueness across the indicated scope.
@artpol84 I don't agree with your description of the algorithm. If it were as you describe, then the CID algorithm would be wrong.
@ggouaillardet, there is a very good reason to do the CID allocation the way we do it today. Your approach leaves gaps in the CID space seen by each process, as a CID allocated by whatever centralized entity will be globally unique for the entire job, independently of the processes participating in each communicator. Thus, non-overlapping communicators will not be able to have the same CID (unlike today). In other words, you are converting a locally-dense CID list into a globally-dense CID list, and this is bad for memory consumption and/or performance.

The locally-dense CID scheme guarantees that, from each process's perspective, the CID space is as compact as possible. This compactness ensures optimality in terms of the memory needed to store the array, and in terms of access time when doing the cid-to-comm translation we need for each incoming MPI message.

If we decide to relax the performance-friendly array translation, then CID selection becomes a single broadcast: once the participants decide on a leader (an operation local to each process), this leader can pick a CID, add/shift its daemon ID to make it unique (as an example), and broadcast it to all participants. This would however require supporting a sparse CID space, and thus using a hash table to translate a CID into a comm pointer on each process. The impact on injection rate will certainly be significant, especially in multi-threaded scenarios.
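To make the trade-off concrete, here is a small illustrative sketch (not OMPI code; the names and sizes are invented) contrasting the O(1) array translation that a locally-dense CID space allows with the hash-table lookup that a sparse, globally-unique CID space would force onto the receive fast path:

```c
#include <stdint.h>
#include <stddef.h>

struct ompi_communicator;   /* opaque here */

/* Locally-dense CIDs: one array slot per CID. A single load per incoming
 * message; memory grows only with the highest CID this process has seen. */
static struct ompi_communicator *comm_table[4096];

static inline struct ompi_communicator *lookup_dense(uint32_t cid)
{
    return comm_table[cid];          /* bounds check omitted for brevity */
}

/* Sparse, globally-unique CIDs (e.g. a leader picks a value and shifts in
 * its daemon ID): a hash table becomes necessary, so every incoming
 * message pays a hash computation plus probing. */
#define HBITS 12
static struct { uint32_t cid; struct ompi_communicator *comm; }
    comm_hash[1u << HBITS];

static inline struct ompi_communicator *lookup_sparse(uint32_t cid)
{
    uint32_t h = (cid * 2654435761u) >> (32 - HBITS);  /* Fibonacci hash */
    while (NULL != comm_hash[h].comm) {                /* linear probing */
        if (comm_hash[h].cid == cid) {
            return comm_hash[h].comm;
        }
        h = (h + 1) & ((1u << HBITS) - 1);
    }
    return NULL;
}
```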
@bosilca let me explain my description. …
@bosilca I'm not sure I fully understand your argument. Surely this is simply a question of the algorithm used to identify the desired counter value, isn't it? For example, why can't we have a counter that tracks non-overlapping groups of participants? This would seem to result in the same values you desire, but without multiple steps. As for whether you have some "root" proc compute it or not - I have no preference either way. I do believe we should implement counter support in PMIx regardless, as I can see uses for it beyond CID assignments.
If I understood @ggouaillardet correctly, this is not what he is proposing. He actually tries to reduce the number of participants in the next-CID calculation down to 1 proc per node.
I have concerns about the performance of this operation for the following reasons: …
Urrr... no, I don't think that is what @ggouaillardet is proposing. We propose that: …
I'm not sure if it will be faster or slower than what we have today. There are certainly fewer required steps, but the overall time may be longer - need to test it to see. It will probably depend significantly on the inter-node transport underlying the RM for non-node-local counters.
@rhc54 it seems that I misunderstood the concept. I was confused by the description here: …
Specifically by the phrase …
If CID agreement is going to be handled by PMIx, I'm very concerned about performance. Currently we have a relatively slow but generic algorithm, compared to the less-generic MPICH one. We can't make it even slower. @ggouaillardet, from @rhc54's description it seems like you are proposing an algorithm similar to what MPICH implements at the MPI level, but based at the PMIx level. @ggouaillardet, was there any particular use case that you are optimizing for?
@bosilca @ggouaillardet @jladd-mlnx …
Is this also related to Mellanox verification or an unrelated cluster?
Unrelated - Cisco cluster
I see, thank you
@artpol84 this PR will not be merged in the immediate future. When you get some time, I invite you to read "Trade-offs in Context Identifier Allocation in MPI" at http://www.icl.utk.edu/files/publications/2016/icl-utk-988-2016.pdf. Long story short, in order to convert a CID (e.g. an int) into a communicator pointer on the receive fast path, we currently use an array lookup, which requires the CID space seen by each process to stay dense.
This is what I meant by issues on the comm fastpath.
Should we then create a new framework for CID management first? @bosilca any thoughts?
+1 I was about to propose that.
bot:mellanox:retest
Force-pushed from 8909fa4 to 2b9cda4.
@artpol84 I fixed the crash ... I have no idea why SHMEM+UCX was the only configuration crashing ...
Force-pushed from ddf9793 to 1317f5a.
bot:mellanox:retest
👍
Force-pushed from 1317f5a to 31a4c0f.
@bosilca @artpol84 the second commit creates the new CID framework discussed above. It has both blocking (e.g. …) and non-blocking interfaces. When moving …. Last but not least, the ….
Involve PMIx so all tasks on the same node can locally agree on the next CID in one iteration, and hopefully reduce the total number of iterations it takes to find the lowest available CID.
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Force-pushed from 31a4c0f to 60c6576.
Can one of the admins verify this patch?
@ggouaillardet This looks like a good direction. Any idea where this PR stands? Could it be rebased and potentially applied to master?
The IBM CI (PGI) build failed! Please review the log, linked below. Gist: https://gist.github.com/9d84cc55f41f04fe20d100ec76c474af
Should really take a look at what we are doing in the sessions prototype. There we break the need for most CID agreement.
Let's just close this one as it is terribly stale and would need to be redone anyway.
Optimize CID assignment by using PMIx
This is currently at a very early prototype stage.
Use the PMIx server so that an unused CID can be retrieved in one shot for all the tasks running on the same node when working on an intra communicator.
When invoking ompi_comm_nextcid_nb(), first gather the list of participating ranks on the local task with the lowest rank. Then this "master" task will ask PMIx for a CID unused by all the local participants, and then "broadcast" it to the non-master tasks.

Several things have to be fixed yet:
- PMIx_CID is not a great name, right?
- PMIx_CID …

Currently, the optimized path is only taken with intra communicators, and is a three-step tango:
1. gather (MPI)
2. request an unused CID (PMIx call)
3. broadcast (MPI)

There are two ways of improving this off the top of my head:
1. take the optimized path at the MPI level on inter communicators too
2. move the gather/broadcast into PMIx, and PMIx performs a (kind of) collective operation (that would work regardless of whether the communicator is intra or inter)
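For illustration, here is a rough MPI-level analogue of that three-step tango (the real logic lives inside OMPI's communicator machinery, not user code). PMIx_CID is the prototype's hypothetical new call, node_comm stands for a node-local communicator such as one obtained from MPI_Comm_split_type, and error handling is omitted:

```c
#include <mpi.h>
#include <stdlib.h>

/* hypothetical: ask the local PMIx server for the lowest CID not used by
 * any of the listed node-local participants */
extern int PMIx_CID(const int *local_ranks, int nlocal, int *cid);

int next_cid(MPI_Comm node_comm)
{
    int node_rank, nlocal, world_rank, cid = -1;
    MPI_Comm_rank(node_comm, &node_rank);
    MPI_Comm_size(node_comm, &nlocal);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* step 1 (MPI): gather the participating ranks on the node "master" */
    int *ranks = (0 == node_rank) ? malloc(nlocal * sizeof(int)) : NULL;
    MPI_Gather(&world_rank, 1, MPI_INT, ranks, 1, MPI_INT, 0, node_comm);

    /* step 2 (PMIx): the master asks the PMIx server for an unused CID */
    if (0 == node_rank) {
        PMIx_CID(ranks, nlocal, &cid);
        free(ranks);
    }

    /* step 3 (MPI): broadcast the agreed CID to the non-master tasks */
    MPI_Bcast(&cid, 1, MPI_INT, 0, node_comm);
    return cid;
}
```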