Hangs on failures on master #1379
Comments
@rhc54 and I chatted about this on the phone. He's looking into it.
@jsquyres I think I have this fixed, or at least the problem associated with MPI_Abort. I'm not sure if/why it would show up in 2.x, as it was due to a change in the IOF a couple of days ago. Let me know if you see any continuing problems, and if there is anything going on in the 2.x branch.
I think my MTT problems were all caused by this master issue, but those failures increased the load on my servers, thereby causing a cascade of other failures. So it's kinda hard to tell if the real root cause was originally on master. We'll let it percolate through MTT over the next several days and see what happens.
I found another error that was also causing problems in certain cases and fixed it too. It should have made it into tonight's tarball, so hopefully we'll see the impact soon.
My MTT tonight is looking very clean, so hopefully this has resolved the problem.
In the Cisco MTT cluster, we're seeing a large number of hangs on tests that are supposed to fail (e.g., they call MPI_ABORT). Specifically, the test MPI processes do not die, even when their HNP and local orted are gone. The MPI processes keep spinning and consuming CPU cycles.
I'm seeing this across a variety of configure command line options. I.e., it doesn't seem to be specific to a single problematic configure option.
It looks like the hangs are of two flavors:
1. Tests where one process calls MPI_ABORT: the aborting process (and usually the HNP and local orted) goes away, but the surviving MPI processes are never killed.
2. Tests that get stuck in a PMIX fence, typically in MPI_INIT (and sometimes MPI_FINALIZE).
The Intel test MPI_Abort_c is an example of case 1. In this test, MPI_COMM_WORLD rank 0 calls MPI_ABORT, and everyone else calls an MPI_ALLREDUCE.
It looks like the MCW rank 0 process is gone/dead, and all the others are stuck in the MPI_ALLREDUCE. The HNP and local orted are gone, too. I.e., the RTE thread in the MPI processes somehow didn't kill them, either when they got the abort signal or when the HNP / local orted went away.
I see the same pattern in the IBM test environment/abort: MCW 0 calls abort, and everyone else calls sleep. In this case, MCW 0, the HNP, and the local orted are all gone, but all the other processes are stuck looping in sleep().
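For reference, the pattern both of these tests exercise looks roughly like the sketch below (my paraphrase, not the actual Intel/IBM test source):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, in = 0, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Rank 0 aborts the job */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* Everyone else blocks in a collective (the IBM environment/abort
           test loops in sleep() here instead).  These ranks should be
           killed by the RTE, but in the hangs above they keep spinning. */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```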
The Intel test MPI_Errhandler_fatal_f is an example of case 2. In this test, processes don't seem to get past MPI_INIT:
I see a bunch of tests like this (hung in MPI_INIT) -- not just Fortran tests, and not just tests that are supposed to fail. In these cases, it looks like a server gets overloaded, things start slowing down, and then even positive tests get stuck in the PMIX fence in MPI_INIT.
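To be clear about how simple these hung positive tests can be, they are essentially no more involved than this sketch (mine, not any specific MTT test):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Under heavy load, the hang is here: MPI_Init never returns
       because the PMIX fence never completes. */
    MPI_Init(&argc, &argv);

    printf("past MPI_Init\n");

    /* Similar fence hangs show up in MPI_Finalize as well. */
    MPI_Finalize();
    return 0;
}
```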
I've also seen similar stack traces where PMIX is stuck on a fence, but in MPI_FINALIZE. E.g., in the t_winerror test: