mpi4py: Remaining spawn/accept/connect issues #12307
Comments
@dalcinl Are you using Open MPI v5.0.2?
@janjust Last time I tried it was ompi@master. At this point I'm losing track of all the accumulated issues with their respective branches.
@dalcinl thanks for putting this together. Just to check: with these failures, you are not trying to run oversubscribed, correct?
Oh, hold on... Yes, I may eventually run oversubscribed if the tests spawn too many processes. But I'm setting the proper MCA parameters to allow for that. Am I missing something? Also, see above: I'm reporting failures even in singleton init mode, and in that case I don't think I'm oversubscribing the machine. Also note that I'm reporting that the deadlocks are not always reproducible, so any potential issue with oversubscription does not seem to be perfectly reproducible either.
I can repeat my local tests tomorrow with current main and then report the outcome.
Folks, I've updated the description. All my local tests are with ompi@main. @janjust My CI also failed with deadlock using ompi@v5.0.x, see here.
I think I have a fix in prrte for the singleton problem:
Odd that you don't see the assert that I observed. I'm having problems reproducing this one:
Could you do a run with
so I can better try to reproduce?
@hppritcha This is what I get from
I could possibly help, but I can do nothing without at least some description of what these tests do. Just naming a test in somebody's test suite and saying "it doesn't work" isn't very helpful for those of us not dedicated to using that test suite. Ultimately, I don't really care - if I can help, then I will. But if you'd rather not provide the info, then that's okay too - Howard can struggle on his own.
@rhc54 I'll submit a trivial reproducer here as soon as I can. The issue is not particular to my test suite; any spawn example in singleton init mode with a relocated ompi install tree should suffice (issue: setting OPAL_PREFIX is not enough, PATH has to be updated as well for spawn to succeed).
I'm not concerned about that one - @hppritcha indicates he is already addressing it. I'm talking about the other ones you cite here. @hppritcha FWIW: I'm reworking dpm.c to use
Sorry, I mixed up issues; I was talking about #12349. Regarding spawn test suites, what mine does that probably no other one does is issue spawn/spawn_multiple calls in rapid succession from both COMM_SELF and COMM_WORLD, asking for a lot of short-lived child processes, maybe heavily oversubscribing the machine, and testing things like spawn arguments that are only relevant at the root process. The failures smell to me like race conditions. Maybe the key to the issue is the flaky "publish/lookup" handshake you mentioned above. Your update may very well fix things for good.
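Since MPI_Comm_spawn_multiple is mentioned above but never shown in this thread, here is a minimal illustrative sketch of such a call; it is not taken from the mpi4py test suite, and the "./child" binary name and process counts are placeholder assumptions.

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  /* Two commands launched in one collective call; these arrays are
     significant only at the root rank (0 here). "./child" is a placeholder
     for a program that disconnects from its parent communicator. */
  char *commands[2] = {"./child", "./child"};
  int maxprocs[2] = {2, 3};
  MPI_Info infos[2] = {MPI_INFO_NULL, MPI_INFO_NULL};
  MPI_Comm intercomm;

  MPI_Comm_spawn_multiple(2, commands, MPI_ARGVS_NULL, maxprocs, infos,
                          0, MPI_COMM_WORLD, &intercomm,
                          MPI_ERRCODES_IGNORE);

  MPI_Comm_disconnect(&intercomm);
  MPI_Finalize();
  return 0;
}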
@rhc54 @hppritcha Here you have a C reproducer, as simple as it can get.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);
  if (MPI_COMM_NULL == comm) {
    for (int i = 0; i < 100; i++) {
      if (0 == rank) printf("%d\n", i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_SELF, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }
  } else {
    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);
  }
  MPI_Finalize();
  return 0;
}

Build and run like below. Other failing test cases can be generated by changing the -np argument to mpiexec and the command-line argument to the program. You can also change SELF -> WORLD in the C code above.
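For concreteness, here is a hedged sketch of what the "SELF -> WORLD" change mentioned above could look like; it is an editorial illustration derived from the reproducer, not a separately verified test case. With MPI_COMM_WORLD as the spawning communicator, the spawn becomes collective over all parent ranks, and the command, maxprocs, and info arguments are significant only at the root.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);
  if (MPI_COMM_NULL == comm) {
    /* parents: every rank participates in the collective spawn */
    for (int i = 0; i < 100; i++) {
      if (0 == rank) printf("%d\n", i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0 /* root */,
                     MPI_COMM_WORLD, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }
  } else {
    /* children: mirror the parent-side barrier/disconnect */
    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);
  }
  MPI_Finalize();
  return 0;
}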
I can somewhat reproduce this - it either hangs at the very end, or it segfaults somewhere before the end. Always happens in that "next_cid" code. Afraid that code is deep voodoo and I have no idea what it is doing, or why. My rewrite gets rid of all that stuff, but it will take me a while to complete it. I've also been asked to leave the old code in parallel, so I'll add an MCA param to select between the two methods.
I hope your new method will become the default... The old code is evidently broken.
@rhc54 After your diagnosis, what would you suggest for the mpi4py test suite? Should I just skip all these spawn tests as known failures, at least until we all have your new implementation available?
You might as well skip them - they will just keep failing for now, and I doubt anyone will take the time to try to work through that code in OMPI to figure out the problem.
FWIW: in fairness, the existing code seems to work fine when not heavily stressed, as we haven't heard complaints from the field. Not saying there's anything wrong with your test - it is technically correct and we therefore should pass it. Just noting that the test is higher-stress than we see in practice.
Some recent changes broke singleton support - twice in a month. First, remove the problematic PMIX_RELEASE of jdata when it's not ready to be released. For some reason this showed up in singleton mode with debug enabled: various asserts would fail when this PMIX_RELEASE was invoked. This was due to the fact that the jdata had been put on a list of jdatas, so the opal_list destructor was having a fit trying to release a jdata which was still in a list. It turns out this jdata is being released in the code starting at line 95 of prte_finalize.c. I assume that with debug not enabled the jdata is released twice, rather than failing the assert in prted_comm.c. Second, some work to add session IDs for tracking allocations also broke singleton support. This patch restores the singleton functionality. Related to issue open-mpi/ompi#12307. Signed-off-by: Howard Pritchard <howardp@lanl.gov>
We skipped all the connect/accept/spawn tests for years; that's why we are in this mess.
Let's be fair here - we actually do run connect/accept/spawn tests, but they are low-stress versions. The only reason this one fails is that it is a high-stress test with a tight loop over comm-spawn. The current code works just fine for people who actually use it, which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn). Modifying it to support high-stress operations isn't trivial, but probably doable. I concur with other comments, though, that this isn't a high priority.
I have had quite a few emails over the years asking about spawn-related issues from Python folks using mpi4py. That's the reason I stress-test implementations within my test suite.
I'm not even sure how high-stress is precisely defined. How could I modify my tests to be lower-stress? I've already bracketed my spawn calls with barriers, but that's clearly not enough. Should I use sleep() or something like that? Should I serialize all spawn calls from COMM_SELF? I'm really afraid that if I stop testing spawn functionality and don't keep an eye on it, at some point it will become simply unusable.
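For illustration only, here is one possible reading of "serialize all spawn calls from COMM_SELF": parent ranks take turns, so at most one spawn is in flight at a time. This is an editorial sketch under that assumption, not advice taken from the thread, and the helper name spawn_barrier_disconnect is made up.

#include <mpi.h>

/* Spawn a single child from COMM_SELF, sync with it, and disconnect. */
static void spawn_barrier_disconnect(const char *cmd)
{
  MPI_Comm child;
  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                 MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
  MPI_Barrier(child);
  MPI_Comm_disconnect(&child);
}

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);
  if (MPI_COMM_NULL != parent) {
    /* child: mirror the parent-side barrier/disconnect */
    MPI_Barrier(parent);
    MPI_Comm_disconnect(&parent);
  } else {
    /* parents take turns: only rank == turn spawns in each round */
    for (int turn = 0; turn < size; turn++) {
      if (rank == turn)
        spawn_barrier_disconnect(argv[0]);
      MPI_Barrier(MPI_COMM_WORLD);
    }
  }
  MPI_Finalize();
  return 0;
}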
This storyline (nobody uses this feature) is getting old. I have had similar contacts as @dalcinl: people tried to use it, but it was broken, so they found another way. I have no idea what low-stress and high-stress testing could be. It works or it doesn't.
Sigh - seems a rather pointless debate, doesn't it? Fact is, nobody in the OMPI community has historically been inclined to spend time worrying about it, so the debate is rather moot. Kudos to @hppritcha for trying to take it on, or at least portions of it (e.g., the singleton comm-spawn case).
Looking at the last two years of updates in the DPM-related code, many of us (you/ICM/LANL/Amazon/UTK) tried to do so. Smaller steps, but it got to a point where it kind of works. The only thing left is to have a solution that makes it work everywhere, because this is a critical feature outside the HPC market.
Agreed - my "low stress" was simply a single invocation of "comm_spawn" by a process in a job. "High stress" is when a process in a job sits there and calls "comm_spawn" in a loop. That fills the system with lots of coupled jobs, requires that the system (both MPI and RTE) be able to fully clean up/recover between jobs, etc. We have historically been satisfied with making the "low stress" operation work. Occasionally, I'd take a crack at the "loop spawn" test and at least keep it from hanging, but it was always a struggle and didn't last for very long. And my "loop spawn" test was very simple, just looping over spawn and disconnecting; it never actually had the conjoined jobs do anything. I agree that it is an important feature outside HPC, and it obviously should work. Perhaps my new approach will succeed where we previously failed - we'll have to wait and see.
There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.
First, you need to switch to branch testing/ompi-dpm, otherwise some of the reproducers below will be skipped as known failures. I'm configuring ompi@main the following way:
I've enabled oversubscription via both Open MPI and PRTE config files.
Afterwards, try the following:
Sometimes I get a segfault, sometimes a deadlock, and a few times the run completes successfully.
The following narrowing of tests may help figure out the problem:
It may run OK many times, but eventually I get a failure and the following output:
This other narrowed-down test also has issues, but it does not always fail:
It may run fine occasionally, but most of the time it deadlocks.
cc @hppritcha