Singleton MPI initialization and spawn #10590
@awlauria After your recent update of submodule pointers, things got even worse. Now I cannot even call `MPI_Init`.
@dalcinl Sigh. I didn't expect it to solve this issue, but good to know. Thanks.
FWIW: the error appears to be in the MPI layer, not in PMIx or PRRTE. A quick look identifies the following culprit code (taken from the main branch):

    val = NULL;
    OPAL_MODEX_RECV_VALUE(rc, PMIX_LOCAL_PEERS,
                          &pname, &val, PMIX_STRING);
    if (PMIX_SUCCESS == rc && NULL != val) {
        peers = opal_argv_split(val, ',');
        free(val);
    } else {
        ret = opal_pmix_convert_status(rc);
        error = "local peers";
        goto error;
    }
    /* if we were unable to retrieve the #local peers, set it here */
    if (0 == opal_process_info.num_local_peers) {
        opal_process_info.num_local_peers = opal_argv_count(peers) - 1;
    }

You can see that not finding "local_peers", which you won't find in the case of a singleton, incorrectly results in returning an error, even though the code that follows correctly knows how to deal with that situation.
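Per the fix that eventually landed (see the commit message further down), the lookup was made optional rather than required. A minimal sketch of that shape, assuming the `OPAL_MODEX_RECV_VALUE_OPTIONAL` variant of the macro; this is an illustration, not the committed patch:

    val = NULL;
    OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCAL_PEERS,
                                   &pname, &val, PMIX_STRING);
    if (PMIX_SUCCESS == rc && NULL != val) {
        peers = opal_argv_split(val, ',');
        free(val);
    } else {
        /* A singleton publishes no peer list; that is not an error. */
        peers = NULL;
    }
    /* if we were unable to retrieve the #local peers, set it here */
    if (NULL != peers && 0 == opal_process_info.num_local_peers) {
        opal_process_info.num_local_peers = opal_argv_count(peers) - 1;
    }

The extra `NULL != peers` guard is there because `opal_argv_count(NULL)` returns 0, and writing -1 into `num_local_peers` would be wrong for a singleton, where zero local peers is already the right answer.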
I took a crack at this and got singleton without spawn working, but with spawn I'm still hitting an error. I have a lead on what needs fixing (something with starting the …).
* Fixes open-mpi#10590
* Singletons will not have a PMIx value for `PMIX_LOCAL_PEERS` so make that optional instead of required.
* `&` is being confused as an application argument in `prte` instead of the background character
* Replace with `--daemonize` which is probably better anyway

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
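To make the third and fourth bullets concrete: the singleton code path was backgrounding `prte` with a shell-style `&`, which `prte` then parsed as an application argument. A hypothetical sketch of the corrected command-line construction — the function name and its surroundings are illustrative, not the actual schizo/prrte code:

    /* Build the command line used to start prte for a singleton. */
    static char **build_prte_cmdline(const char *prte_path)
    {
        char **cmd = NULL;
        opal_argv_append_nosize(&cmd, prte_path);
        /* Ask prte to put itself in the background instead of
         * appending "&", which prte would treat as an app argument. */
        opal_argv_append_nosize(&cmd, "--daemonize");
        return cmd;
    }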
Two changes were required to fix this issue: making `PMIX_LOCAL_PEERS` optional instead of required, and starting `prte` with `--daemonize` instead of backgrounding it with `&`. With those two fixes I was able to run a singleton without `mpiexec`.
PRRTE change has been committed - thanks Josh! Will now port it over to PRRTE v3.0. Note: the fix was done in the schizo/prrte component and therefore only applies to Open MPI.
Using the singleton test suite, the …
@open-mpi/ucx Can you all check to see if we are missing a commit from main?
@hoopoepg Hey Sergey, can we see about why singleton Comm_spawn_multiple() runs fail with UCX? Are we missing something that's currently in main, but not in v5.0?
It is really strange - we didn't do anything specific for spawn functionality.
@karasevb, can you please take a look?
I cannot reproduce the singleton MPI_Comm_spawn_multiple failure with UCX. I've built v5.0.x (647d793) plus the patches that @jjhursey mentioned, and it works well with both the ob1 and ucx PMLs.
@karasevb I created a fresh build of Open MPI. Below is the current state of singletons on the various branches/releases (using the singleton test suite): …
Odd data point. With the … I'll see if I can track that down.
@jjhursey I'm not sure you got my request for advice about a related issue in mpi4py/mpi4py#247. I'm just asking for your comment on whether this is a known issue that can somehow be worked around via MCA params, or whether I should just disable the tests as a known failure.
@janjust do you have any idea how to fix this?
Sorry, I haven't gotten to it just yet. I'm planning on looking into it today.
Hm, weird. I'm not really sure why it's segfaulting in Finalize, but I can take a look; it could be a double free.
@janjust I spent some time today tracking this down. I suspected that it was due to the environment I was running in for testing (an isolated Docker container). I found the fix and posted a PR: #10758. I posted a summary of my investigation to the PR. Locally, I applied a similar fix to …
Looks like singleton MPI init and spawn are broken in the main branch.
Look at this reproducer:
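The reproducer itself did not survive the copy; a minimal stand-in that exercises the same path (singleton init followed by a self-spawn) could look like the following — the file name `spawn_test.c` and the process count are illustrative:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        MPI_Comm parent, intercomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_get_parent(&parent);
        if (MPI_COMM_NULL == parent) {
            /* Parent: spawn two children running this same executable. */
            MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 2, MPI_INFO_NULL, 0,
                           MPI_COMM_SELF, &intercomm, MPI_ERRCODES_IGNORE);
            printf("parent: spawn succeeded\n");
            MPI_Comm_disconnect(&intercomm);
        } else {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            printf("child %d: started\n", rank);
            MPI_Comm_disconnect(&parent);
        }
        MPI_Finalize();
        return 0;
    }

Built with `mpicc spawn_test.c -o spawn_test`, the two launch modes described below would then be `mpiexec -n 1 ./spawn_test` versus a plain `./spawn_test`.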
Now I run that code using Open MPI v4.1.2 (system package from Fedora 36) the following two ways: once through `mpiexec`, and once by executing the program directly. Note that the second way does not use `mpiexec` (that is, what the MPI standard calls singleton MPI initialization).

Next I run the code with ompi/main. I've configured with: …

The first way (using `mpiexec`) seems to work just fine. The second way (singleton MPI init) fails: …

PS: Lack of singleton MPI initialization complicates things for some Python users wanting to dynamically spawn MPI processes as needed via mpi4py, without requiring the parent process to be launched through `mpiexec`.