MPICH support enhancements #398
I had run across this design document on the wire protocol. If there are other specific code/document references it would be good to have them here. I wonder if any of the versions of MPI that we support in production have this capability, or if it is possible to graft this into the older versions?
I had seen that, but didn't link it because I'm not sure that's entirely in line with the actual implementation in use. The 2.0 interface wasn't stabilized when that went up; I'll ask Pavan about docs for it. As to older versions, I'm not sure how far back it goes precisely, but this was the default mechanism for all mpich derivatives as of at least 7 years ago, and should be available in all production builds unless explicitly removed at configuration time. It's the mechanism that Hydra, and MPD before it, use to bootstrap a job, so it's carefully supported. MPICH1 probably won't take it though, unfortunately; I think PMI came in after development stopped on that branch.
Hmm... it looks like we build mvapich with pmgr and no pm option for some reason.
In this configuration it assumes pmgr bootstrap rather than simple PMI, which I suppose ties back to #221. In principle, we could actually support all of these off of the same backend. Normally that's not possible, because there's no way to negotiate, but since we control the daemon and they all take a socket or specified connector, we can disambiguate based on which the job decides to use. The one mpich2 we have in dotkit is configured to use simple; I didn't check the others, but Intel MPI should also be using one of the two.
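That disambiguation idea could be sketched roughly as follows: peek at the first bytes the job sends on the socket and route to the matching bootstrap backend. The `cmd=` prefixes come from MPICH's simple PMI client; the pmgr fallback rule here is a made-up placeholder, not the real pmgr handshake.

```python
def classify_bootstrap(first_bytes: bytes) -> str:
    """Guess which bootstrap protocol a connecting job speaks from the
    first bytes it sends. Requests in the PMI-1 "simple" wire protocol
    are newline-terminated key=value lines beginning with "cmd=";
    anything else is assumed (as a placeholder rule) to be pmgr."""
    if first_bytes.startswith(b"cmd=initack"):
        return "simple-pmi-v1-port"   # PMI_PORT-style connection
    if first_bytes.startswith(b"cmd="):
        return "simple-pmi-v1"        # PMI_FD-style connection
    return "pmgr"                     # placeholder: not a real pmgr check
```

The ordering matters: `cmd=initack` must be tested before the generic `cmd=` prefix.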
Funny meta-thought on this. We'll need a bootstrap system for running an initial flux instance across systems that don't have a resource manager set up at some point, and it occurs to me that the rsh-tree hydra backend may be a convenient way to do that until we have something of our own.
@trws I could not decipher the Argonne end of the call this morning due to the bad audio. Would you mind summarizing the next steps here? (Thanks)
Thanks for the reminder @garlick, I meant to do this earlier, but had a meeting with Robin about using flux/capacitor for some of the high-throughput users they've been seeing on Catalyst and it slipped out of cache. One or two feature issues will be popping up shortly from that. Next steps:
Further, as discussed earlier in this issue, the official recommendation from Pavan et al. on supporting PMI in mpich derivatives is to support the PMI "simple" v1 wire protocol. Secondarily, the PMI API approach we're using now also works, but depends heavily on library path and configuration control.
@trws, For now I was thinking on the call that if we want to optionally produce a nodefile, this is something that could be done in an initial program...
It's a bit more than just a list of hostnames. The general format is either hostname replicated once per core to be used on that host, or … Anyway, I'm not quite sure I understand what you mean by using an initial program in place of a command. It is something that could be done entirely in post-processing; a basic one could just use your lua-hostlist to do an expand and replace spaces with newlines and it would be almost there. But since we may want to provide specific information based on the way the job is allocated, it seemed like it might be useful to have it in our control.
Ok, I was thinking it was just a list of hostnames. I was thinking a per-instance hostlist could be generated optionally via an rc-like scriptlet once we had an initial program structure that could accomplish that. Sounds like you'd need to generate possibly a different one per job.
Are we expecting Flux to launch a hydra-bootstrapped MPI program directly, or through a sub instance? |
Sub instance. It's just so someone could run mpiexec inside a flux instance and have it work as expected. The user would do something like:
The mpiexec then would use flux under the covers to bootstrap itself, then bootstrap the job. It's an extra level of portability is all, since a lot of people have batch scripts that use hydra options for rank distribution and binding options, which we then all of a sudden support without much extra effort. The nodelist would be the same for the lifetime of an instance, so it could be done with the initial program infrastructure as @grondo mentioned. I had just been thinking that we might not want to have to do that for every job, and having hydra request it from us seemed like a reasonable way to only pay for it when they actually need it without forcing the user to request it.
@trws, I definitely prefer if hydra requests the nodelist only when it needs it. I wonder if they've thought of having the code for getting the nodelist and launching their daemons dynamically loadable? Then RMs could provide their own hooks for this stuff instead of adding it directly to the hydra codebase (maybe they already do this, I don't know). On the initial program thing I was getting ahead of myself there. I was thinking eventually we'd have an initial program framework with unit or rc files that were activated by options somehow passed to the script itself, and to get hydra working you'd have to enable the option. Obviously hydra directly requesting the file is much better, since it will work without an extra option.
@grondo that did actually come up in the conversation at Cluster, though I had forgotten it until you mentioned it. Hydra uses dynamic loading to pick up libraries for this kind of thing from some versions of PBS descendants, but frequently runs into problems with installations using the wrong system because they don't find the library the user thought it would find. It's the same issue we're having with getting PMI to behave sanely. Having a way to generically hook in would certainly be good, but as it is hydra is set up to work by specializing itself based on what it finds on the target. In our case it would probably look for …
I should say, I suggested offering a library call, and was explicitly asked for something else in that conversation. Badly configured library setups apparently account for a non-trivial percentage of their support volume.
So is the idea that mpiexec will use our scalable program launch to start the hydra daemons? (sorry, I know you asked that this morning but I either forgot or didn't hear the answer)
Only for the case where hydra is being used as the bootstrapper, which should only be for users who have pre-existing hydra/mpiexec-based batch scripts they want to keep using. It's just the easier way for them to set up hydra support for flux in the short term, since it already has its own overlay etc. As I think I mentioned earlier though, the idea is really to support two models so that a user can do either of these and expect them to work:
The hydra daemons would get launched in the first case, so they can provide support for any and all options that version of mpiexec accepts, making porting a code from, say, Argonne to Livermore a bit easier. The second would use our launch to run the job directly and provide flux-based PMI (or other) support for bootstrapping, with no hydra involvement of any kind.
Understood - I meant to ask if in the first case, mpiexec is using rsh or equivalent to start its daemons on the hosts in the nodelist, or if flux would be launching them as a parallel program. |
Sorry, I wasn't sure. It can technically do either, but normally it uses whatever the native mechanism is, since that tends to be rather faster than doing an rsh-tree on the raw nodes. In our case, the biggest part of supporting flux in hydra is having it use our job-launch facilities, like it uses srun in slurm currently. That's just step two, and something we didn't talk about on the call today, but we should decide on what interface we want to give them for that. Since hydra expects raw execution, …
Just a follow-up note here on Intel MPI. A discussion on the slurm-dev mail list revealed a reference that implies that you can switch the Intel MPI PMI library at runtime by setting …
FYI -- more info about OpenMPI PMIx and mvapich PMIX from recent discussion on the slurm mail list: https://groups.google.com/forum/#!topic/slurm-devel/eCO9gBmTsTg
Ugh... mvapich pmix... why... Well, the list to support everyone "natively" now looks like:
fun...
This might be useful later, so I'll paste it here: the beta SLURM PMIx module: https://github.com/artpol84/slurm/tree/pmix-step2/src/plugins/mpi/pmix
Oops, should my last two comments go in #365?
It looks like there are ubuntu packages for mpich. On my 14.04LTS system I was able to install
it appears to have been compiled without pmi options:
I confirmed that this version does something when I set PMI_FD in the environment of mpi_hello compiled with it, so based on simple_pmi.c I conclude it must be capable of using the simple v1 wire protocol?
If we're going to implement the simple wire protocol, at least there is a straightforward way to build MPI programs that use it for test. A side note for travis-ci: on Ubuntu 12.04, the packages are mpich2, libmpich2-dev
Yeah. If you give no PMI option that's what you get. You can also give mpich derivatives multiple PMIs, then select the one you want at runtime by name with another environment variable. The default one will implement both the v1 and v2 wire protocols IIRC, but apparently the v2 protocol is out of favor at the moment. OpenMPI used to be able to do something similar, but I haven't tried since this whole PMIx thing happened.
As discussed in flux-framework#398, as a precursor to implementing the PMI simple v1 wire protocol, pull the mpich2 package into the travis-ci environment in place of OpenMPI.
Spelunking src/pmi/simple, the wire protocol appears to consist of exchanges corresponding to these PMI calls:
PMI_Init
PMI_Get_universe_size
PMI_Get_appnum
PMI_Barrier
PMI_Finalize
PMI_KVS_Get_my_name
PMI_KVS_Put
PMI_KVS_Get
PMI_Publish_name
PMI_Unpublish_name
PMI_Lookup_name
PMI_Spawn_multiple
That's all there is. PMI functions not listed for the most part return data from environment variables or data obtained during initialization.
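To make the shape of those exchanges concrete, here is a rough single-process sketch of a server-side dispatcher for a subset of the calls above. Requests and responses are single newline-terminated lines of space-separated key=value pairs; the command and field names are as they appear in MPICH's src/pmi/simple sources, but the limits and the barrier behavior are simplifications, not Flux's actual implementation.

```python
def parse(line):
    """Parse a wire line like 'cmd=put kvsname=k key=a value=b' into a dict."""
    return dict(field.split("=", 1) for field in line.strip().split(" "))

def fmt(**kv):
    """Format a response line from key=value pairs, newline-terminated."""
    return " ".join(f"{k}={v}" for k, v in kv.items()) + "\n"

class SimplePMIServer:
    """Toy dispatcher for a subset of the PMI-1 simple v1 wire protocol."""
    def __init__(self, size=1, rank=0, kvsname="kvs_0"):
        self.size, self.rank, self.kvsname = size, rank, kvsname
        self.kvs = {}

    def handle(self, line):
        req = parse(line)
        cmd = req["cmd"]
        if cmd == "init":
            return fmt(cmd="response_to_init",
                       pmi_version=req["pmi_version"],
                       pmi_subversion=req["pmi_subversion"], rc=0)
        if cmd == "get_maxes":  # buffer size limits (values are arbitrary here)
            return fmt(cmd="maxes", kvsname_max=256, keylen_max=64,
                       vallen_max=1024)
        if cmd == "get_appnum":
            return fmt(cmd="appnum", appnum=0)
        if cmd == "get_my_kvsname":
            return fmt(cmd="my_kvsname", kvsname=self.kvsname)
        if cmd == "put":
            self.kvs[req["key"]] = req["value"]
            return fmt(cmd="put_result", rc=0)
        if cmd == "get":
            value = self.kvs.get(req["key"])
            if value is None:
                return fmt(cmd="get_result", rc=-1)
            return fmt(cmd="get_result", rc=0, value=value)
        if cmd == "barrier_in":
            # single-process sketch: a real server waits for all ranks
            return fmt(cmd="barrier_out")
        if cmd == "finalize":
            return fmt(cmd="finalize_ack")
        return fmt(cmd="unknown", rc=-1)
```

A real server would of course sit behind the PMI_FD descriptor (or a socket) and loop over incoming lines, coordinating barriers and the KVS across all ranks.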
I also verified that the ubuntu 14.04 mpich supports an alternative to file descriptor passing. Set PMI_ID and PMI_PORT, and the client connects to the server on PMI_PORT and runs through an initack handshake (C: cmd=initack pmiid=…).
After which the cmd=init handshake begins as above.
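The PMI_PORT preamble described above might look like the following sketch, which uses a socketpair to stand in for the TCP connection a real client would make to host:port. The reply sequence is based on a reading of MPICH's simple_pmi.c and may not match every version.

```python
import socket

def server_initack(conn: socket.socket, size: int, rank: int) -> None:
    """Answer a client's cmd=initack line, then push size/rank/debug
    settings; after this preamble the client proceeds with the normal
    cmd=init handshake over the same connection."""
    line = conn.makefile("r").readline()   # e.g. "cmd=initack pmiid=0\n"
    assert line.startswith("cmd=initack"), line
    for reply in (b"cmd=initack\n",
                  b"cmd=set size=%d\n" % size,
                  b"cmd=set rank=%d\n" % rank,
                  b"cmd=set debug=0\n"):
        conn.sendall(reply)
```

A real server would also carry the per-rank PMI_ID it handed out through the environment, rather than trusting the pmiid the client reports.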
The host:port option I think was the original, but is no longer used by the standard launchers. It's a relic from before MPD if I recall correctly. Not to say it shouldn't be considered if it's more convenient for some reason, but that code path may be less frequently tested these days.
travis hasn't whitelisted mpich yet - see travis-ci/apt-package-safelist#406
I thought it might be good anyway to have mpich built to the side in travis, for more control over pm options we want to support and therefore should test against. Got this working for gcc but there is a known bug that causes clang to segfault building mpich. I can't reproduce this on my ubuntu 14.04 system with clang-3.4-1ubuntu3.
GCC is installed even in the clang builder. You can force mpich to build with gcc.
Thanks @grondo that did the trick!
Drop the somewhat contrived boot_pmi.c class from the broker, and rewrite the PMI bootstrap code using pmi-client.h interfaces directly. I think this clarifies the code even though it is quite verbose. If PMI doesn't implement pmi_get_id(), derive the session-id from the "appnum" (numerical jobid). Don't attempt to call pmi_get_clique_ranks() unless epgm is enabled. Neither pmi_get_id() nor pmi_get_clique_ranks() are implemented in the "simple v1" PMI wire protocol, so allowing these functions to be unimplemented enables Flux to be launched by mpiexec.hydra, which addresses one goal of flux-framework#398.
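The fallback logic in that change might be sketched like this, with Python stand-ins for the pmi-client.h calls; the method and exception names here are illustrative, not the real C interface.

```python
class PMIUnimplemented(Exception):
    """Raised by a PMI client stand-in when the active wire protocol or
    library does not implement a given call (e.g. get_id under simple v1)."""

def derive_session_id(pmi) -> str:
    """Prefer the PMI-provided id; when get_id is unimplemented (as in
    the simple v1 wire protocol), derive the session-id from the
    numerical appnum (jobid), as the commit message describes."""
    try:
        return pmi.get_id()
    except PMIUnimplemented:
        return str(pmi.get_appnum())
```

Tolerating unimplemented calls this way is what lets the broker bootstrap under launchers that only speak the minimal simple v1 protocol, such as mpiexec.hydra.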
We now have both process and process-manager support for the PMI-1 simple wire protocol and can launch MPICH programs directly. We have a test case of hydra launching Flux. I think all that's left is to support mpirun-hydra under Flux, and I'm not sure that we really need that. Closing this issue. Let's open a new one focused on mpirun-hydra if we need it.
Agreed, and great work on those by the way @garlick. MPI support in flux has benefitted tremendously from your efforts both on MPICH and OpenMPI support.
Thanks :-)
There have been a number of tickets on this topic, including #221 and others on the actual PMI API support. To try and get an MPI perspective, I snagged Pavan today to get a rundown of how MPICH expects to interact with PMIs and resource managers so hopefully we can make this as painless as possible for users (and hopefully us too).
Our current solution of providing a PMI library at the API level certainly works, but is not and will not be binary compatible with MPICH/MVAPICH/Intel MPI/etc., because they build with "simple" PMI by default (see the source). Using LD_PRELOAD or the like and building MPICH to suit are fine for testing, but being able to support the wire protocol would be a nice add-in, and make it completely native to just use
flux run mpich-based-thing
without any extra user work. This would mean providing a socket that offers that wire protocol, probably in a module or launcher plugin, which maps pretty much one-to-one onto the PMI API.
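That one-to-one mapping might be tabulated like this: each PMI-1 API call pairs with a wire request/response command, as read from MPICH's src/pmi/simple sources. Treat this as an illustrative subset, not a complete spec.

```python
# PMI-1 API call -> (wire request cmd, wire response cmd), as read from
# MPICH's src/pmi/simple sources. Illustrative subset, not a full spec.
PMI_WIRE_MAP = {
    "PMI_Init":              ("init", "response_to_init"),
    "PMI_Get_universe_size": ("get_universe_size", "universe_size"),
    "PMI_Get_appnum":        ("get_appnum", "appnum"),
    "PMI_KVS_Get_my_name":   ("get_my_kvsname", "my_kvsname"),
    "PMI_KVS_Put":           ("put", "put_result"),
    "PMI_KVS_Get":           ("get", "get_result"),
    "PMI_Barrier":           ("barrier_in", "barrier_out"),
    "PMI_Finalize":          ("finalize", "finalize_ack"),
}
```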
The other thing that came up was the option of supporting flux as a hydra target, which they appear to be interested in. Requests for this: