v4.2 GDS issue with singleton spawn #2705

Closed
jjhursey opened this issue Aug 22, 2022 · 8 comments

@jjhursey
Member

Background information

What version of the PMIx Reference Library are you using?

Testing note on this comment

Describe how PMIx was installed

Built with Open MPI main and manually adjusted the submodule pointers.

Please describe the system on which you are running

  • Operating system/version: RHEL 8.4
  • Computer hardware: ppc64le
  • Network type: Single node shared memory

Details of the problem

Using PRRTE master and OpenPMIx master we have been able to get singleton MPI_Comm_spawn working. However, if we move to OpenPMIx v4.2 then it fails. See open-mpi/ompi#10688

The MPI tests I'm using are located here

shell$ mpirun --np 1 ./simple_spawn ./simple_spawn
Spawning './simple_spawn' ... OK
shell$ ./simple_spawn ./simple_spawn
[f5n18:3788975] PMIX ERROR: PROC-ENTRY-NOT-FOUND in file server/pmix_server.c at line 3588
[f5n18:3788914] pml_ucx.c:191  Error: Failed to receive UCX worker address: Take next option (-46)
[f5n18:3788914] OPAL ERROR: Error in file dpm/dpm.c at line 480

I noticed that if I force the hash GDS component, it works correctly:

shell$ export PMIX_MCA_gds=hash
shell$ ./simple_spawn ./simple_spawn
Spawning './simple_spawn' ... OK

So this seems to only impact the singleton spawn case, and is related to the GDS component (verbose output indicates that it is using ds21)

@rhc54
Contributor

rhc54 commented Aug 22, 2022

Is it the singleton or the child that is complaining?

@rhc54
Contributor

rhc54 commented Aug 22, 2022

Oh, and what key are they looking for?

@rhc54
Contributor

rhc54 commented Aug 22, 2022

Just off the top of my head: I'd guess that the error comes from the child and that it is looking for a modex key. The problem looks to be in the dpm code - once you have spawned the prte server, you have to PMIx_Commit the singleton's modex data to it so the children can look it up.

@jjhursey
Member Author

It looks like it is modex information (if I force OMPI_MCA_pml=ob1, it fails in the modex in the btls instead of in UCX). From what I can tell it is the child that is complaining (in addition to the server).

@jjhursey
Member Author

It looks like we are calling PMIx_Commit() just after the server starts:

@rhc54
Contributor

rhc54 commented Aug 22, 2022

Okay, here's the easiest solution. Once you detect that you are a singleton, push PMIX_MCA_gds=hash into your environment. This will make no difference to the singleton as that is all it can use, but it will cause prte to restrict itself to the hash, which will force all child procs to use their hash components.

See if that works.
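
For illustration, the suggestion above might look roughly like this in C. This is only a sketch under stated assumptions: the helper name `force_hash_gds_for_singleton` and the `is_singleton` flag are hypothetical, and it uses plain POSIX `setenv` rather than whatever environment helper the real Open MPI code would use. The only thing taken from this thread is the variable itself, `PMIX_MCA_gds=hash`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for setenv() under strict C modes */
#endif

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not an Open MPI or PMIx API).
 * If this process is a singleton, export PMIX_MCA_gds=hash so that a
 * server started later from this process, and the children it launches,
 * inherit the restriction to the hash GDS component.
 * Returns 0 on success (or when nothing needs to be done), -1 on error. */
static int force_hash_gds_for_singleton(bool is_singleton)
{
    if (!is_singleton) {
        /* Launched under mpirun/prte: leave the environment alone. */
        return 0;
    }

    /* overwrite=1 replaces any existing value; passing 0 instead would
     * respect a value the user already set (a design choice). */
    if (setenv("PMIX_MCA_gds", "hash", 1) != 0) {
        perror("setenv(PMIX_MCA_gds)");
        return -1;
    }
    return 0;
}
```

The call would have to happen before the server is started, since a child process only inherits the environment that exists at the time it is forked.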

@jjhursey
Copy link
Member Author

Yeah, that works. I'll post a PR to OMPI to make that change.

It bugged me that it worked with OpenPMIx master but not the v4.2 branch. Is that because dstore is not in the main branch anymore so I'm effectively just using hash there?

@rhc54
Contributor

rhc54 commented Aug 22, 2022

Correct. The problem with dstore is that it is way behind in how to handle anything other than the original job info. The hash component is far more versatile. Once we get to PMIx v5, dstore will go away and we will have a shmem hash component instead - will solve many problems.

jjhursey added a commit to jjhursey/ompi that referenced this issue Aug 22, 2022
 * See openpmix/openpmix#2705

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
jjhursey added a commit to jjhursey/ompi that referenced this issue Aug 25, 2022
 * See openpmix/openpmix#2705

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 245424c)
MamziB pushed a commit to MamziB/ompi that referenced this issue Oct 26, 2022
 * See openpmix/openpmix#2705

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
MamziB pushed a commit to MamziB/ompi that referenced this issue Oct 26, 2022
 * See openpmix/openpmix#2705

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
yli137 pushed a commit to yli137/ompi that referenced this issue Jan 10, 2024
 * See openpmix/openpmix#2705

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>