-
Notifications
You must be signed in to change notification settings - Fork 868
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
mpi4py: Run mpi4py tests with a relocated Open MPI installation #12526
Conversation
Hello! The Git Commit Checker CI bot found a few problems with this PR: 1ca576e: mpi4py: Run mpi4py tests with a relocated Open MPI...
1163918: mpi4py: Support for workflow_dispatch trigger
Please fix these problems and, if necessary, force-push new commits back up to the PR branch. Thanks! |
1ca576e
to
007eb6a
Compare
@wenduwan An obvious problem is that Spawn is still broken for other reasons. Even if you add the label to run the spawn tests, I believe these will fail before we reach the point where relocation is tested. |
bot:aws:retest |
I'm unsure what error you are referring to here. I tested singleton spawn, and that worked fine. I then ran your loop spawn program for 500 iterations, testing it with 1-3 processes and spawning 1-3 processes - and that worked fine. I did have to allow oversubscription when combining 3 procs with spawning 3 procs, or else it would run out of slots after 3 iterations, so it looks like there might be a race condition on recovering resources. Still, with that caveat, spawn is working. Is there something else that is wrong? |
007eb6a
to
37427d5
Compare
Indeed, the spawn tests run in isolation of other tests seem to be working fine.
Yes, look at the failing mpi4py/run_spawn/mpi4py-tests results [link]. After running all the spawn tests, looks like ompi's internal state ends somehow broken, and attempting to create an intercommunicator fails with MPI_ERR_INTERN. |
@rhc54 The following command line is the minimal set of tests to trigger the issue: env MPI4PY_TEST_SPAWN=1 mpiexec -n 2 python test/main.py -v test_comm test_spawn test_ulfm I'm seeing the following lines in the output:
|
Afraid not - been off doing other things, but am returning to these issues now. No timeline for completion.
Odd - I'm pretty sure we fixed that one. It took me some time and help from a user that could reproduce it to track it down, and the fix went into PMIx master two weeks ago. Checking the state of OMPI's PMIx submodule, it is (a) out-of-date by more than a month, and therefore missing that fix, and (b) in some odd detached state. It's almost like someone cherry-picked some change into it rather than updating the submodule pointer? Anyway, the problem is indeed fixed - just not over there for some reason. |
I'll push a submodule update to this PR shortly |
@janjust do you have bandwidth to update this PR? I can do it otherwise. |
Turns out I don't have permission to push to the PR. Opened #12532 and will rebase this one. |
Howard leapfrogged me with #12565 will likely merge that instead. |
@dalcinl Could you please rebase the PR? Hats off to @hppritcha for chasing down the problem(s). |
Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
Signed-off-by: Lisandro Dalcin <dalcinl@gmail.com>
37427d5
to
bead2b3
Compare
@wenduwan If everything goes right, then maybe we should get rid of all the stuff related to running spawn tests conditionally? What do you think? Should we do it now in this PR, or would you rather prefer I submit another one? |
Yes you are right. We are doing it in #12591 |
You have helped a lot. I will rebase #12591 after merging this PR. |
Addressing #12349 (comment).