
mpi4py: Run mpi4py tests with a relocated Open MPI installation #12526

Merged: 2 commits merged into open-mpi:main from test/ompi-relocate on May 31, 2024

Conversation

@dalcinl (Contributor) commented May 6, 2024

Addressing #12349 (comment).

github-actions (bot) commented May 6, 2024

Hello! The Git Commit Checker CI bot found a few problems with this PR:

1ca576e: mpi4py: Run mpi4py tests with a relocated Open MPI...

  • check_signed_off: does not contain a valid Signed-off-by line

1163918: mpi4py: Support for workflow_dispatch trigger

  • check_signed_off: does not contain a valid Signed-off-by line

Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks!

@dalcinl (Contributor, Author) commented May 6, 2024

@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

@wenduwan (Contributor) commented May 6, 2024

bot:aws:retest

@rhc54 (Contributor) commented May 6, 2024

> @wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested.

I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes - and that worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition on recovering resources. Still, with that caveat, spawn is working.

Is there something else that is wrong?

@wenduwan added the mpi4py-all (Run the optional mpi4py CI tests) label on May 6, 2024
@dalcinl force-pushed the test/ompi-relocate branch from 007eb6a to 37427d5 on May 7, 2024 06:00
@dalcinl (Contributor, Author) commented May 7, 2024

> I tested singleton spawn,

Indeed, the spawn tests, when run in isolation from the other tests, seem to be working fine.
Did you ever merge those changes to stop using accept/connect and move to something more robust?

> Is there something else that is wrong?

Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, it looks like ompi's internal state somehow ends up broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN.

@dalcinl (Contributor, Author) commented May 7, 2024

@rhc54 The following command line is the minimal set of tests to trigger the issue:

env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm

I'm seeing the following lines in the output:

[kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��
[kw61149:364081] OPAL ERROR: Unreachable in file ../../../ompi-main/ompi/communicator/comm.c at line 2372
[kw61149:364081] 0: Error in ompi_comm_get_rprocs
setUpClass (test_ulfm.TestULFMInter) ... ERROR

@rhc54 (Contributor) commented May 7, 2024

> Did you ever merge those changes to stop using accept/connect and move to something more robust?

Afraid not - I've been off doing other things, but I'm returning to these issues now. No timeline for completion.

> [kw61149:364081] [[389,1],1] selected pml ob1, but peer [[389,1],0] on kw61149 selected pml ��

Odd - I'm pretty sure we fixed that one. It took me some time, plus help from a user who could reproduce it, to track it down, and the fix went into PMIx master two weeks ago.

Checking the state of OMPI's PMIx submodule, it is (a) out of date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked a change into it rather than updating the submodule pointer?

Anyway, the problem is indeed fixed - just not over there for some reason.

@janjust (Contributor) commented May 7, 2024

I'll push a submodule update to this PR shortly.

@wenduwan (Contributor) commented May 8, 2024

@janjust do you have bandwidth to update this PR? I can do it otherwise.

@wenduwan (Contributor) commented May 8, 2024

Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one.

@wenduwan (Contributor) commented:

Howard leapfrogged me with #12565; we will likely merge that instead.

@wenduwan (Contributor) commented:

@dalcinl Could you please rebase the PR?

Hats off to @hppritcha for chasing down the problem(s).

dalcinl added 2 commits on May 31, 2024 13:05, both signed off (Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>):

  • mpi4py: Run mpi4py tests with a relocated Open MPI...
  • mpi4py: Support for workflow_dispatch trigger
@dalcinl force-pushed the test/ompi-relocate branch from 37427d5 to bead2b3 on May 31, 2024 10:05
@dalcinl (Contributor, Author) commented May 31, 2024

@wenduwan If everything goes right, then maybe we should get rid of all the machinery for running the spawn tests conditionally? What do you think? Should we do it now in this PR, or would you prefer I submit another one?

@wenduwan (Contributor) commented:

Yes, you are right. We are doing that in #12591.

@wenduwan (Contributor) commented:

You have helped a lot. I will rebase #12591 after merging this PR.

@wenduwan merged commit 55a2ac8 into open-mpi:main on May 31, 2024 (14 checks passed)