use PMIx to manage CIDs #4674

Closed

Conversation

ggouaillardet
Contributor

Optimize CID assignment by using PMIx

This is currently at very early prototype stage.

Use the PMIx server so that an unused CID can be retrieved in one shot for all the tasks running on the same node when working on an intra communicator.

When invoking ompi_comm_nextcid_nb(), first gather the list of participating ranks onto the local task with the lowest rank. This "master" task then asks PMIx for a CID unused by all the local participants and "broadcasts" it to the non-master tasks.

Several things still have to be fixed:

  • return the CID to PMIx when it is no longer used
  • finalize the PMIx interface (PMIx_CID is not a great name, right?)
  • support external PMIx installations that do not implement PMIx_CID

Currently, the optimized path is only taken with intra communicators, and is a three-step tango (a rough sketch follows the list):

  • gather the participants on the "master" task (MPI)
  • get a CID for the participants (master makes a single PMIx call)
  • the master broadcasts the assigned CID to the other tasks (MPI)
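
For illustration, here is a rough sketch of that three-step path, assuming the experimental PMIx_CID() call added by this PR (name and signature still up for debate) and using MPI_Comm_split_type() to build the node-local group; the real code lives inside ompi_comm_nextcid_nb() and is organized differently:

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

/* hypothetical prototype of the experimental call added by this PR:
 * ask the local PMIx server for a CID unused by all listed local tasks */
extern int PMIx_CID(const int *local_ranks, int nlocal, uint32_t *cid);

static int next_cid_intra(MPI_Comm comm, uint32_t *cid)
{
    MPI_Comm local_comm;
    int local_rank, local_size, world_rank, rc = 0;
    int *ranks = NULL;

    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* step 1: gather the participants sharing a node onto the lowest
     * local rank (the "master") */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &local_comm);
    MPI_Comm_rank(local_comm, &local_rank);
    MPI_Comm_size(local_comm, &local_size);
    if (0 == local_rank) {
        ranks = malloc(local_size * sizeof(int));
    }
    MPI_Gather(&world_rank, 1, MPI_INT, ranks, 1, MPI_INT, 0, local_comm);

    /* step 2: the master asks PMIx for a CID unused by all local participants */
    if (0 == local_rank) {
        rc = PMIx_CID(ranks, local_size, cid);
        free(ranks);
    }

    /* step 3: broadcast the assigned CID to the non-master tasks */
    MPI_Bcast(cid, 1, MPI_UINT32_T, 0, local_comm);
    MPI_Comm_free(&local_comm);
    return rc;
}
```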

Off the top of my head, there are two ways of improving this:

  • perform the broadcast/gather at the MPI level on inter communicators too
  • all tasks call PMIx and PMIx performs a (kind of) collective operation (that would work regardless of whether the communicator is intra or inter)

@rhc54
Contributor

rhc54 commented Dec 28, 2017

Just a suggestion: instead of creating a new API, just use the PMIx_Job_control_nb function and define new attributes. You could have one proc (or perhaps all of them?) call it to "init a counter" (giving it a name), maybe allowing passage of another attribute for "scope" (using the PMIx scope values). Subsequent calls might use PMIx_Get to "get" the counter, adding an attribute to "increment the counter after get by all procs in scope".

Needs refinement, of course, but might simplify things.
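
To make that concrete, here is a hedged sketch under the assumption of hypothetical attribute names ("ompi.cid.cntr.init", "ompi.cid.cntr.incr"); only PMIx_Job_control_nb(), PMIx_Get() and the INFO/VALUE macros are existing PMIx APIs, the counter semantics themselves are not defined anywhere today:

```c
#include <stdbool.h>
#include <pmix.h>

static void cntr_cbfunc(pmix_status_t status, pmix_info_t *info, size_t ninfo,
                        void *cbdata, pmix_release_cbfunc_t release_fn,
                        void *release_cbdata)
{
    (void)status; (void)info; (void)ninfo;
    /* counter initialized (or not); let the server release its data */
    if (NULL != release_fn) {
        release_fn(release_cbdata);
    }
    *(volatile int *)cbdata = 1;
}

/* "init a counter" named ompi.cid, scoped to the local node (hypothetical attributes) */
static pmix_status_t init_cid_counter(void)
{
    pmix_info_t directives[2];
    volatile int done = 0;
    bool node_scope = true;

    PMIX_INFO_LOAD(&directives[0], "ompi.cid.cntr.init", "ompi.cid", PMIX_STRING);
    PMIX_INFO_LOAD(&directives[1], "ompi.cid.cntr.node.scope", &node_scope, PMIX_BOOL);

    pmix_status_t rc = PMIx_Job_control_nb(NULL, 0, directives, 2,
                                           cntr_cbfunc, (void *)&done);
    if (PMIX_SUCCESS == rc) {
        while (0 == done) { ; }  /* busy-wait for brevity only */
    }
    PMIX_INFO_DESTRUCT(&directives[0]);
    PMIX_INFO_DESTRUCT(&directives[1]);
    return rc;
}

/* fetch the counter and ask the server to increment it after the get;
 * job_proc is assumed to be the wildcard-rank proc for our nspace */
static pmix_status_t get_next_cid(const pmix_proc_t *job_proc, uint32_t *cid)
{
    pmix_info_t info;
    pmix_value_t *val = NULL;
    bool incr = true;

    PMIX_INFO_LOAD(&info, "ompi.cid.cntr.incr", &incr, PMIX_BOOL);
    pmix_status_t rc = PMIx_Get(job_proc, "ompi.cid", &info, 1, &val);
    if (PMIX_SUCCESS == rc) {
        *cid = val->data.uint32;
        PMIX_VALUE_RELEASE(val);
    }
    PMIX_INFO_DESTRUCT(&info);
    return rc;
}
```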

@rhc54
Contributor

rhc54 commented Dec 28, 2017

Small self-debate over using job control. This isn't directly a "control", and so one could argue for a new API. On the other hand, one can envision using the counter(s) for control purposes, perhaps even using attributes to request that the host RM take certain actions upon reaching a specified counter level.

Up to you - just passing along some thoughts.

@ggouaillardet
Contributor Author

Thanks @rhc54 for the insight!

From a design point of view, I am still on the fence regarding whether the collective part should happen at the MPI or the PMIx level.
If PMIx happens to be the way to go, then I can definitely use PMIx_Job_control_nb, which could return a local key name, and then PMIx_Get that key. That would allow reusing some code (e.g. notifying all the peers when the key has been set) and also spare the need to define a new API.

@rhc54
Contributor

rhc54 commented Dec 29, 2017

Personally, I'd opt for having PMIx (and thus, the host RM) perform the collective so it would be more broadly applicable to other programming paradigms. I see a common need for counters (both global and local) in various models. The only caveat is the need for the host RM to provide a fast collective, but we are seeing that coming into play (e.g., what Mellanox did for SLURM in the PMIx plugin).

@ggouaillardet ggouaillardet force-pushed the topic/ompi_comm_nextcid_nb branch 2 times, most recently from 840ee22 to af394d3 on January 5, 2018 07:39
@artpol84
Contributor

bot:mellanox:retest

@artpol84
Contributor

@ggouaillardet what is the main purpose of this PR? Performance optimization of Comm create operations?
If so, can you share performance evaluation results?

@ggouaillardet
Contributor Author

@artpol84 the goal is to improve CID allocation performance.

Long story short, with n MPI tasks on N nodes:

  • all MPI tasks have to agree on the lowest commonly available CID
  • in the worst cases, that can take up to n iterations if the CID spaces are very fragmented
  • with this PR, the worst case would take "only" N iterations
  • by using PMIx, the optimization also works with direct launch
  • at this stage, this is a work in progress; when the code gets stable, I will run some benchmarks
    (or I will likely ask for some help, since I currently do not have access to a large cluster)

@artpol84
Contributor

in the worst cases, that can take up to n iterations if the CID spaces are very fragmented

I was touching the current algorithm some time ago and may not recall all of the details, but I think it's not only about n or N; fragmentation also has its say.
Consider 2 ranks with the following used CIDs:
rank=0: 0, 2, 4, 6, 8
rank=1: 1, 3, 5, 7, 9
The lowest common CID will be 10, and I believe it will take the current algorithm 5 steps:

  • step1: r0:1, r1:2
  • step2: r0:3, r1:4
  • ...

@ggouaillardet
Contributor Author

You are absolutely right!
That being said, by using PMIx, the MPI tasks on a local node can agree upon the lowest commonly available CID in one shot, so that should decrease the total number of iterations.

@artpol84
Contributor

bot:mellanox:retest

@artpol84
Contributor

Can you briefly explain how you use PMIx? Fence-like exchange?

@ggouaillardet
Contributor Author

I added the new PMIx_CID API. As of now, the master process (i.e. the local task with the lowest rank) asks the server for a CID for the list of local tasks. The list of local tasks is "locally gathered" at the MPI level, and then, once the algorithm completes, the result is "locally broadcast" at the MPI level too.

Non-master MPI tasks still perform a bunch of allreduces, but with a neutral argument.

I am still on the fence about whether the MPI stuff should be done at the PMIx level or not.

@artpol84
Contributor

And how does the server know the CID? Is it stored in a key?

@artpol84
Contributor

I’m referring to this:
‘the master process (i.e. the local task with the lowest rank) asks the server for a CID for the list of local tasks.’

@ggouaillardet
Contributor Author

pmix_rank_info_t was extended with a pmix_bitmap_t of available CIDs.

@ggouaillardet
Contributor Author

Via the newly introduced PMIx_CID() API.
The client basically forwards the request to the server, which replies with the new CID.

@rhc54
Contributor

rhc54 commented Jan 10, 2018

Just to be clear: I believe we had converged on not introducing a new PMIx API, but instead to use the PMIx_Job_control_nb API to create/increment generic counters (using PMIx scope values to indicate their range) that OMPI could use as CIDs. It would then be up to the RM to determine the method of ensuring uniqueness across the indicated scope.

@bosilca
Member

bosilca commented Jan 10, 2018

@artpol84 I don't agree with your description of the algorithm. If it is correct, then the CID algorithm is wrong.
Starting with:
rank=0: 0, 2, 4, 6, 8
rank=1: 1, 3, 5, 7, 9
the first step should instead be P0:1, P1:2, but then the second proposal is P0:3, P1:2 (and not P0:3, P1:4 as suggested). The reason is that at that point not enough information has been exchanged between processes to decide that 2 is not a good candidate (for example, P1 does not know whether P0 has key 2 reserved).

@bosilca
Member

bosilca commented Jan 10, 2018

@ggouaillardet, there is a very good reason to do the CID allocation the way we do it today.

Your approach leaves gaps in the CID space seen by each process, as a CID allocated by whatever centralized entity will be globally unique for the entire job, independent of the processes participating in each communicator. Thus, non-overlapping communicators will not be able to have the same CID (unlike today). In other words, you are converting a locally-dense CID list into a globally-dense CID list, and this is bad for memory consumption and/or performance. The locally-dense CID space guarantees that, from each process's perspective, the CID space is as compact as possible. This compactness ensures optimality in terms of the memory needed to store the array, and in terms of access time when doing the cid-to-comm translation we need for each incoming MPI message.

If we decide to relax the performance-friendly array translation, then CID selection becomes a single broadcast: once the participants decide on a leader (an operation local to each process), this leader can pick a CID, add/shift its daemon ID to make it unique (as an example), and broadcast it to all participants. This will, however, require supporting a sparse CID space, and thus using a hash table to translate a CID into a comm pointer on each process. The impact on injection rate will certainly be significant, especially in multi-threaded scenarios.

@artpol84
Contributor

artpol84 commented Jan 10, 2018

@bosilca let me explain my description.

  • At the first step both ranks select their lowest available CID: rank0 - 1, rank1 - 2.
  • They do an allreduce with the MAX operation, which I believe will result in CID=2 (coming from rank1).
  • At the next stage, each rank checks whether this MAX value is available locally: rank0 will find it busy and set its flag to 0, while rank1 obviously will be OK with this CID.
  • Next they do another allreduce with the MIN operation to converge on the flag. As everybody will see that the flag is 0 (rank0 determines this), they will go ahead with (current_cid + 1), which is going to be 3.
  • At the second step both ranks will search starting from CID=3, which will result in rank0 - 3, rank1 - 4.
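
To make the step count concrete, here is a toy single-process simulation of that MAX/MIN iteration (not the actual ompi_comm_nextcid() code). It additionally assumes that CID 0 (MPI_COMM_WORLD) is in use on both ranks, which makes the first-round proposals match the 1 and 2 above, and that each round restarts the search just above the rejected candidate; as the exchange with @bosilca above shows, the per-round proposals depend on exactly this kind of bookkeeping:

```c
#include <stdbool.h>
#include <stdio.h>

#define NCIDS  32
#define NRANKS 2

static bool used[NRANKS][NCIDS];   /* used[r][c]: rank r already uses CID c */

static int lowest_free(int rank, int start)
{
    for (int c = start; c < NCIDS; c++) {
        if (!used[rank][c]) {
            return c;
        }
    }
    return -1;
}

int main(void)
{
    /* CID 0 everywhere, then rank 0 holds 2,4,6,8 and rank 1 holds 1,3,5,7,9 */
    used[0][0] = used[1][0] = true;
    for (int c = 1; c < 10; c++) {
        used[c % 2][c] = true;
    }

    int start = 0, steps = 0, agreed = -1;
    while (agreed < 0) {
        steps++;
        /* "allreduce MAX": every rank proposes its lowest free CID >= start */
        int max = 0;
        for (int r = 0; r < NRANKS; r++) {
            int proposal = lowest_free(r, start);
            if (proposal > max) {
                max = proposal;
            }
        }
        /* "allreduce MIN" on a flag saying whether the MAX is free locally */
        int flag = 1;
        for (int r = 0; r < NRANKS; r++) {
            if (used[r][max]) {
                flag = 0;
            }
        }
        if (flag) {
            agreed = max;
        } else {
            start = max + 1;
        }
    }
    /* prints: agreed on CID 10 after 5 steps */
    printf("agreed on CID %d after %d steps\n", agreed, steps);
    return 0;
}
```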

@rhc54
Contributor

rhc54 commented Jan 10, 2018

@bosilca I'm not sure I fully understand your argument. Surely this is simply a question of the algorithm used to identify the desired counter value, isn't it? For example, why can't we have a counter that tracks non-overlapping groups of participants? This would seem to result in the same values you desire, but without multiple steps.

As for whether you have some "root" proc compute it or not - I have no preference either way. I do believe we should implement counter support in PMIx regardless as I can see uses for it beyond CID assignments.

@artpol84
Contributor

artpol84 commented Jan 10, 2018

@bosilca

Your approach leaves gaps in the CID space seen by each process, as a CID allocated by whatever centralized entity will be globally unique for the entire job, independent of the processes participating in each communicator.

If I understood @ggouaillardet correctly, this is not what he is proposing. He is actually trying to reduce the number of participants in the next-CID calculation down to 1 proc per node.
In his proposal:

  • every process will publish its bitmap of available CIDs to PMIx in a format that is known to the PMIx server.
  • During the next-CID calculation, processes will determine a master for each node. This master requests the next CID available on this node from the PMIx server, and since the PMIx server has the full bitmaps, it can calculate the node-local next CID.
  • Then they perform inter-node communication using MPI in the fashion we have now.

I have concerns about performance of this operation for the following reasons:

  • It puts the PMIx server on the "critical path": a server that is not bound to any core and most probably will not be running at the time of invocation, which will introduce delays. If this approach assumed fetching data directly from the shared-memory key storage (and maybe it does), it could be better, but the key-fetch overhead may still be relatively high.
  • Second, as I noted, this is not just a problem of n processes or N nodes. The main issue here is the fragmentation of the CID space. In the example I provided for 2 procs, you will need 5 steps even though you have only 2 procs. If I am missing something and @bosilca's version is correct, it will take even more steps.

@rhc54
Contributor

rhc54 commented Jan 10, 2018

Urrr...no, I don't think that is what @ggouaillardet is proposing. We propose that:

  • some proc at the beginning of time initializes a counter using PMIx_Job_control_nb, possibly requesting the "non-overlapping participants" algorithm for it, and specifying its scope (e.g., node-local vs nspace-wide).

  • when a group of procs wants a new CID, they request one, specifying the counter to be used to generate it. The request flows up to the RM (if the counter isn't node-local - otherwise, the PMIx server simply handles it), which computes/retrieves the next value. This value is communicated down to all participants.

I'm not sure if it will be faster or slower than what we have today. There are certainly fewer required steps, but the overall time may be longer; we need to test it to see. It will probably depend significantly on the inter-node transport underlying the RM for non-node-local counters.

@artpol84
Contributor

@rhc54 it seems that I misunderstood the concept. I was confused by the description here:

I added the new PMIx_CID API. As of now, the master process (i.e. the local task with the lowest
rank) asks the server for a CID for the list of local tasks. The list of local tasks is "locally gathered"
at the MPI level, and then, once the algorithm completes, the result is "locally broadcast"
at the MPI level too.

Specifically by the phrase "the master process ... asks the server for a CID for the list of local tasks". This seemed to be a node-local operation.

@artpol84
Contributor

artpol84 commented Jan 10, 2018

If CID agreement is going to be handled by PMIx, I'm very concerned about performance. Currently we have a relatively slow but generic algorithm, compared to the less generic MPICH one. We can't make it even slower.

@ggouaillardet from @rhc54's description, it seems like you are proposing an algorithm similar to what MPICH implements at the MPI level, but based at the PMIx level.

@ggouaillardet was there any particular use case that you are optimizing for?

@artpol84
Contributor

artpol84 commented Jan 18, 2018

@bosilca @ggouaillardet @jladd-mlnx
I was thinking about alternative solutions to this issue, and I wonder if we could just increase the CID to 64 bits.
This would open a huge space of possible CIDs, and we would be able to use an Allreduce with the MAX operation and always converge in one iteration.
I guess this approach may introduce issues on the comm fastpath, but this is the kind of solution I'd rather invest my time in, as it seems very beneficial if an efficient implementation in OMPI can be found.
I also believe this would outperform even the centralized approach that @rhc54 referred to as a previous implementation, since there mpirun seems to be a bottleneck, plus it relies on OOB-based communication.
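
For what it's worth, here is a minimal sketch of one way to read that idea, assuming each rank keeps a monotonically increasing 64-bit "never used yet" counter so that the MAX is guaranteed to be unused by every participant; the price is exactly the sparse CID space and comm-fastpath concern mentioned above:

```c
#include <mpi.h>
#include <stdint.h>

/* advanced every time this rank consumes a CID, never reused */
static uint64_t my_next_cid = 0;

static uint64_t one_shot_cid(MPI_Comm comm)
{
    uint64_t agreed;

    /* single allreduce: the MAX of "next never-used" values has never been
     * used by any participant, so it can be adopted immediately */
    MPI_Allreduce(&my_next_cid, &agreed, 1, MPI_UINT64_T, MPI_MAX, comm);
    my_next_cid = agreed + 1;
    return agreed;
}
```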

@artpol84
Contributor

One solid signature was in the dynamics (comm_spawn and friends) that was showing on one cluster

Is this also related to the Mellanox verification, or an unrelated cluster?

@rhc54
Contributor

rhc54 commented Jan 18, 2018

Unrelated - Cisco cluster

@artpol84
Contributor

I see, thank you

@ggouaillardet
Contributor Author

@artpol84 this PR will not be merged in the immediate future.

When you get some time, I invite you to read Trade-offs in Context Identifier Allocation in MPI at http://www.icl.utk.edu/files/publications/2016/icl-utk-988-2016.pdf

Long story short, in order to convert a CID (e.g. an int) into an ompi_communicator_t *, we simply call opal_pointer_array_get_item(), which is basically an array access.
The pro is that it is very fast. The con is that we always want to allocate the lowest CID in order to keep the size of this array low. @bosilca discusses other approaches in the paper above, and this PR is a different (and hopefully more efficient) way to allocate the lowest CID.
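
For reference, the fast path being protected amounts to roughly the following (a simplified rendition of the existing lookup, shown only to illustrate why a dense, low-valued CID space keeps it cheap):

```c
#include "ompi/communicator/communicator.h"
#include "opal/class/opal_pointer_array.h"

/* one O(1) array access per incoming message; the array is sized by the
 * largest CID ever allocated, hence the push to keep CIDs low and dense */
static inline ompi_communicator_t *cid_to_comm(uint32_t cid)
{
    return (ompi_communicator_t *)
        opal_pointer_array_get_item(&ompi_mpi_communicators, (int) cid);
}
```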

@artpol84
Contributor

artpol84 commented Jan 18, 2018

Long story short, in order to convert a CID (e.g. an int) into an ompi_communicator_t *, we simply call opal_pointer_array_get_item(), which is basically an array access.
The pro is that it is very fast. The con is that we always want to allocate the lowest CID in order to keep the size of this array low. @bosilca discusses other approaches in the paper above, and this PR is a different (and hopefully more efficient) way to allocate the lowest CID.

This is what I meant by issues on the comm fastpath.
I also noticed that OB1 only supports 16-bit CIDs, and the same applies to UCX.
I don't know whether 64K communicators is sufficient, but if it is, I think the current algorithm and its descendants are too complex, because they try to solve this for 32 bits while that CID space can never be used due to PML limitations.
In these circumstances, I believe that even a Reduce of a bitmap (the MPICH way) could be better suited, as it has fixed complexity. For apps that create just a few comms it won't be on the fastpath, but for those that create them frequently, this might be an optimization.
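
For illustration, here is a minimal sketch of that fixed-complexity bitmap reduction (not MPICH's actual implementation), assuming a 64K CID space to match the 16-bit PML limit noted above:

```c
#include <mpi.h>
#include <stdint.h>

#define CID_WORDS 1024   /* 1024 x 64 bits = 64K possible CIDs */

/* local_free has bit c set iff CID c is free on this rank; one fixed-size
 * allreduce yields the set of CIDs free on every rank, then pick the lowest */
static int lowest_common_free_cid(const uint64_t local_free[CID_WORDS], MPI_Comm comm)
{
    uint64_t global_free[CID_WORDS];

    MPI_Allreduce(local_free, global_free, CID_WORDS, MPI_UINT64_T, MPI_BAND, comm);
    for (int w = 0; w < CID_WORDS; w++) {
        if (0 != global_free[w]) {
            for (int b = 0; b < 64; b++) {
                if (global_free[w] & ((uint64_t)1 << b)) {
                    return w * 64 + b;
                }
            }
        }
    }
    return -1;   /* no commonly free CID */
}
```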

@ggouaillardet
Contributor Author

Should we then create a new framework for CID management first,
and then provide various components optimized for various use cases?

@bosilca any thoughts ?

@artpol84
Contributor

+1 I was about to propose that.

@ggouaillardet
Contributor Author

:bot:mellanox:retest

@ggouaillardet
Contributor Author

@artpol84 I fixed the crash ...
the root cause was an isend(MPI_INT) matched with an irecv(MPI_BYTE).

I have no idea why SHMEM+UCX was the only configuration crashing ...
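
For anyone hitting something similar, the mismatch in isolation looks like this (illustrative snippet, not the actual PR diff): MPI_BYTE on the receive side may match any send datatype, but the receive count must then be expressed in bytes, otherwise the message is truncated.

```c
#include <mpi.h>

static void exchange(const int *sbuf, int *rbuf, int count, int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];

    /* receiver counts in bytes because it posts MPI_BYTE against an MPI_INT send */
    MPI_Irecv(rbuf, count * (int) sizeof(int), MPI_BYTE, peer, 0, comm, &reqs[0]);
    MPI_Isend(sbuf, count, MPI_INT, peer, 0, comm, &reqs[1]);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}
```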

@ggouaillardet ggouaillardet force-pushed the topic/ompi_comm_nextcid_nb branch 2 times, most recently from ddf9793 to 1317f5a on January 22, 2018 02:22
@ggouaillardet
Contributor Author

:bot:mellanox:retest

@artpol84
Contributor

👍

@ggouaillardet ggouaillardet force-pushed the topic/ompi_comm_nextcid_nb branch from 1317f5a to 31a4c0f on January 25, 2018 07:26
@ggouaillardet
Contributor Author

@bosilca @artpol84 the second commit creates the cid framework in order to allocate CIDs.

It has both blocking (e.g. nextcid) and non-blocking (e.g. nextcid_nb) variants.
The rationale is that only MPI_Comm_idup() requires a non-blocking allocation mechanism, so the other subroutines (which are likely to be used more frequently) can be optimized by using blocking collectives, which have less overhead.

When moving comm_cid into the cid/basic component, I noted that the activate() function shares some structures with getnextcid(), so I created activate and activate_nb callbacks. They might not be specific to a given component, so the implementation is in cid/base. I am not sure that was the right move; I could have left everything in ompi/communicator, or called cid/base directly and not added yet another callback. Anyway, I am very open to reconsidering this.

Last but not least, the query() callback takes the ompi_mpi_thread_multiple parameter, so some components can be optimized for non-MPI_THREAD_MULTIPLE apps.
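
To give a feel for the shape being described, here is a hedged sketch of what the cid module interface might look like; the type and field names are illustrative guesses based on the description above, not the actual code in this PR:

```c
#include <stdbool.h>
#include "ompi/communicator/communicator.h"
#include "ompi/request/request.h"

/* component-level query(): told whether the app runs MPI_THREAD_MULTIPLE */
typedef int (*ompi_cid_query_fn_t)(bool thread_multiple, int *priority);

/* blocking and non-blocking CID allocation */
typedef int (*ompi_cid_nextcid_fn_t)(ompi_communicator_t *newcomm,
                                     ompi_communicator_t *comm);
typedef int (*ompi_cid_nextcid_nb_fn_t)(ompi_communicator_t *newcomm,
                                        ompi_communicator_t *comm,
                                        ompi_request_t **request);

/* communicator activation, sharing state with the CID allocation above */
typedef int (*ompi_cid_activate_fn_t)(ompi_communicator_t **newcomm,
                                      ompi_communicator_t *comm);
typedef int (*ompi_cid_activate_nb_fn_t)(ompi_communicator_t **newcomm,
                                         ompi_communicator_t *comm,
                                         ompi_request_t **request);

typedef struct ompi_cid_base_module_t {
    ompi_cid_nextcid_fn_t     getnextcid;     /* dup, split, create, ... */
    ompi_cid_nextcid_nb_fn_t  getnextcid_nb;  /* MPI_Comm_idup           */
    ompi_cid_activate_fn_t    activate;
    ompi_cid_activate_nb_fn_t activate_nb;
} ompi_cid_base_module_t;
```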

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Involve PMIx so all tasks on the same node can locally agree
on the next CID in one iteration, and hopefully reduce the total number
of iterations it takes to find the lowest available CID.

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
@ggouaillardet ggouaillardet force-pushed the topic/ompi_comm_nextcid_nb branch from 31a4c0f to 60c6576 on January 26, 2018 01:47
@open-mpi open-mpi deleted a comment from ibm-ompi Feb 17, 2018
@open-mpi open-mpi deleted a comment from ibm-ompi Feb 17, 2018
@AboorvaDevarajan
Member

Can one of the admins verify this patch?

@gpaulsen
Member

gpaulsen commented Jun 7, 2019

@ggouaillardet This looks like a good direction. Any idea where this PR stands? Could it be rebased and potentially applied to master?

@ibm-ompi

The IBM CI (PGI) build failed! Please review the log, linked below.

Gist: https://gist.github.com/9d84cc55f41f04fe20d100ec76c474af

@hjelmn
Member

hjelmn commented Feb 27, 2020

We should really take a look at what we are doing in the sessions prototype; there we eliminate the need for most CID agreement.

@rhc54
Contributor

rhc54 commented Feb 27, 2020

Let's just close this one as it is terribly stale and would need to be redone anyway.

@rhc54 rhc54 closed this Feb 27, 2020