
mpi4py: Run mpi4py tests with a relocated Open MPI installation #12526

Merged: 2 commits merged into open-mpi:main from test/ompi-relocate on May 31, 2024

Conversation

@dalcinl (Contributor) commented May 6, 2024

Addressing #12349 (comment).

github-actions (bot) commented May 6, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

1ca576e: mpi4py: Run mpi4py tests with a relocated Open MPI...

  • check_signed_off: does not contain a valid Signed-off-by line

1163918: mpi4py: Support for workflow_dispatch trigger

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@dalcinl (Contributor, Author) commented May 6, 2024

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

@wenduwan (Contributor) commented May 6, 2024

bot:aws:retest

@rhc54 (Contributor) commented May 6, 2024

> @wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes - and that worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition on recovering resources. Still, with that caveat, spawn is working.

Is there something else that is wrong?

@wenduwan added the mpi4py-all (Run the optional mpi4py CI tests) label on May 6, 2024
@dalcinl force-pushed the test/ompi-relocate branch from 007eb6a to 37427d5 on May 7, 2024 06:00
@dalcinl (Contributor, Author) commented May 7, 2024

> I tested singleton spawn,

Indeed, the spawn tests, when run in isolation from the other tests, seem to be working fine.
Did you ever merge those changes to stop using accept/connect and move to something more robust?

> Is there something else that is wrong?

Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like ompi's internal state somehow ends up broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.

@dalcinl (Contributor, Author) commented May 7, 2024

@rhc54 The following command line is the minimal set of tests to trigger the issue:

env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm

I'm seeing the following lines in the output:

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
[kw61149:364081] OPAL ERROR: Unreachable in file ../../../ompi-main/ompi/communicator/comm.c at line 2372
[kw61149:364081] 0: Error in ompi_comm_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR

@rhc54 (Contributor) commented May 7, 2024

> Did you ever merge those changes to stop using accept/connect and move to something more robust?

Afraid not - I've been off doing other things, but I'm returning to these issues now. No timeline for completion.

> [kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��

Odd - I'm pretty sure we fixed that one. It took me some time, plus help from a user who could reproduce it, to track it down, and the fix went into PMIx master two weeks ago.

Checking the state of OMPI's PMIx submodule, it is (a) out of date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked a change into it rather than updating the submodule pointer?

Anyway, the problem is indeed fixed - just not over there for some reason.

@janjust (Contributor) commented May 7, 2024

I'll push a submodule update to this PR shortly.

@wenduwan (Contributor) commented May 8, 2024

@janjust do you have bandwidth to update this PR? I can do it otherwise.

@wenduwan (Contributor) commented May 8, 2024

Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.

@wenduwan (Contributor) commented:

Howard leapfrogged me with #12565; we will likely merge that instead.

@wenduwan (Contributor) commented:

@dalcinl Could you please rebase the PR?

Hats off to @hppritcha for chasing down the problem(s).

dalcinl added 2 commits on May 31, 2024 13:05, both signed off (Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>):

  • mpi4py: Run mpi4py tests with a relocated Open MPI...
  • mpi4py: Support for workflow_dispatch trigger
@dalcinl force-pushed the test/ompi-relocate branch from 37427d5 to bead2b3 on May 31, 2024 10:05
@dalcinl (Contributor, Author) commented May 31, 2024

@wenduwan If everything goes right, then maybe we should get rid of all the machinery for running the spawn tests conditionally? What do you think? Should we do it now in this PR, or would you prefer I submit another one?

@wenduwan (Contributor) commented:

Yes, you are right. We are doing that in #12591.

@wenduwan (Contributor) commented:

You have helped a lot. I will rebase #12591 after merging this PR.

@wenduwan merged commit 55a2ac8 into open-mpi:main on May 31, 2024 (14 checks passed)