Hangs on failures on master #1379
Comments
@rhc54 and I chatted about this on the phone. He's looking into it.
@jsquyres I think I have this fixed, or at least the problem associated with MPI_Abort. I'm not sure if/why it would show up in 2.x, as it was due to a change in the IOF a couple of days ago. Let me know if you see any continuing problems, and if there is anything going on in the 2.x branch.
I think my MTT problems were all caused by this master issue, but those failures increased the load on my servers, thereby causing a cascade of other failures. So it's kinda hard to tell if the real root cause was originally on master. We'll let it percolate through MTT over the next several days and see what happens.
I found another error that was also causing problems in certain cases and fixed it too. It should have made it into tonight's tarball, so hopefully we'll see the impact soon.
My MTT tonight is looking very clean, so hopefully this has resolved the problem.
In the Cisco MTT cluster, we're seeing a large number of hangs on tests that are supposed to fail (e.g., they call MPI_ABORT). Specifically, the test MPI processes do not die, even when their HNP and local orted are gone. The MPI processes keep spinning and consuming CPU cycles.
I'm seeing this across a variety of configure command line options. I.e., it doesn't seem to be specific to a single problematic configure option.
It looks like the hangs are of two flavors:
1. Tests where one process calls MPI_ABORT: the aborting process (and usually the HNP and local orted) goes away, but the surviving MPI processes are never killed.
2. Tests that get stuck in a PMIX fence, typically in MPI_INIT (and sometimes MPI_FINALIZE).
The Intel test MPI_Abort_c is an example of case 1. In this test, MPI_COMM_WORLD rank 0 calls MPI_ABORT, and everyone else calls an MPI_ALLREDUCE.
It looks like the MCW rank 0 process is gone/dead, and all the others are stuck in the MPI_ALLREDUCE. The HNP and local orted are gone, too. I.e., the RTE thread in the MPI processes somehow didn't kill them, either when they got the abort signal or when the HNP / local orted went away.
I see the same pattern in the IBM test environment/abort: MCW 0 calls abort, and everyone else calls sleep. In this case, MCW 0, the HNP, and the local orted are all gone, but all the other processes are stuck looping in sleep().
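For reference, the pattern both of these tests exercise looks roughly like the sketch below (my paraphrase, not the actual Intel/IBM test source):

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, in = 0, out = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (0 == rank) {
        /* Rank 0 aborts the job */
        MPI_Abort(MPI_COMM_WORLD, 1);
    } else {
        /* Everyone else blocks in a collective (the IBM environment/abort
           test loops in sleep() here instead).  These ranks should be
           killed by the RTE, but in the hangs above they keep spinning. */
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```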
The Intel test MPI_Errhandler_fatal_f is an example of case 2. In this test, processes don't seem to get past MPI_INIT:
I see a bunch of tests like this (hung in MPI_INIT) -- not just Fortran tests, and not just tests that are supposed to fail. In these cases, it looks like a server gets overloaded, things start slowing down, and then even positive tests get stuck in the PMIX fence in MPI_INIT.
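To be clear about how simple these hung positive tests can be, they are essentially no more involved than this sketch (mine, not any specific MTT test):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Under heavy load, the hang is here: MPI_Init never returns
       because the PMIX fence never completes. */
    MPI_Init(&argc, &argv);

    printf("past MPI_Init\n");

    /* Similar fence hangs show up in MPI_Finalize as well. */
    MPI_Finalize();
    return 0;
}
```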
I've also seen similar stack traces where PMIX is stuck on a fence, but in MPI_FINALIZE. E.g., in the t_winerror test: