MPI process hangs when using MPIR in OpenMPI v3.0.x #5349

Closed
kent-cheung-arm opened this issue Jun 28, 2018 · 3 comments

@kent-cheung-arm

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

v3.0.0 and v3.0.2

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from source tarball with GCC 7.1

Please describe the system on which you are running

  • Operating system/version: RHEL7
  • Computer hardware: aarch64
  • Network type: self

Details of the problem

An MPI process appears to be stuck recursively calling OPAL_MCA_PMIX2X_PMIx_Init () when GDB is attached to it after mpirun has run to MPIR_Breakpoint. This occurs only for the last MPI process; the other processes are stopped at PMPI_Init ().

A simple reproducer is available here: gdb-only.zip

Here's how the reproducer can be used:

$ make
mpicc -g hello.c -o hello_c
$ ./run.sh 
$ cat logfile 
...
#0  0x0000ffffb7e4bec8 in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#2  0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#3  0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
...
#997 0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#998 0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#999 0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
...

(the backtrace has been limited to 1000 lines)

The reproducer starts mpirun under gdb, sets MPIR_being_debugged=1, and runs to MPIR_Breakpoint. It then attaches to the second (last) hello_c process and prints the backtrace to logfile, which can then be inspected.
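
Since the logfile can run to 1000 near-identical lines, a small helper may make it easier to inspect. The following is only a sketch and is not part of the reproducer: the function names `parse_frames` and `collapse_runs` are hypothetical, and the regular expression assumes gdb's usual `#N  0xADDR in FUNC () ...` frame layout as seen in the excerpt above. It collapses consecutive frames that share the same address and function into a single entry.

```python
import re
from itertools import groupby

# Matches gdb backtrace lines such as:
#   #1  0x0000ffffb6f34ad4 in OPAL_MCA_PMIX2X_PMIx_Init () from /path/lib.so
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(r"^#(\d+)\s+(?:(0x[0-9a-fA-F]+) in )?(\S+) \(")


def parse_frames(text):
    """Return a list of (frame_number, address, function) tuples.

    Lines that do not look like backtrace frames (for example "...")
    are skipped. address is None when gdb printed no address.
    """
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((int(match.group(1)), match.group(2), match.group(3)))
    return frames


def collapse_runs(frames):
    """Group consecutive frames that share the same (address, function).

    Returns a list of (first_frame, last_frame, address, function) tuples,
    one per run, in backtrace order.
    """
    runs = []
    for (addr, func), group in groupby(frames, key=lambda f: (f[1], f[2])):
        members = list(group)
        runs.append((members[0][0], members[-1][0], addr, func))
    return runs
```

Reading the logfile and passing its contents through `collapse_runs(parse_frames(text))` would reduce the backtrace excerpted above to two entries: frame #0 in pthread_cond_wait, followed by a single run of OPAL_MCA_PMIX2X_PMIx_Init frames.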

@kent-cheung-arm changed the title from "OPAL_MCA_PMIX2X_PMIx_Init recursively called when using MPIR (OpenMPI v3.0.0/OpenMPI v3.0.2)" to "MPI process hangs when using MPIR in OpenMPI v3.0.x" on Jun 28, 2018
@kent-cheung-arm
Author

We have seen a similar problem when running the reproducer on SLES12 and Ubuntu 16.04 systems. In those cases, both MPI processes appear to be stuck calling ompi_mpi_init ():

$ cat logfile 
...
#0  0x0000ffffb7e7fe78 in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x0000ffffb6f91e74 in OPAL_MCA_PMIX2X_PMIx_Commit () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#2  0x0000ffffb6f6a344 in pmix2x_commit () from /software/mpi//openmpi-3.0.2_gnu-7.1/lib/openmpi/mca_pmix_pmix2x.so
#3  0x0000ffffb7f0aaec in ompi_mpi_init () from /software/mpi/openmpi-3.0.2_gnu-7.1/lib/libmpi.so.40
...
#997 0x0000ffffb7f0aaec in ompi_mpi_init () from /software/mpi/openmpi-3.0.2_gnu-7.1/lib/libmpi.so.40
#998 0x0000ffffb7f0aaec in ompi_mpi_init () from /software/mpi/openmpi-3.0.2_gnu-7.1/lib/libmpi.so.40
#999 0x0000ffffb7f0aaec in ompi_mpi_init () from /software/mpi/openmpi-3.0.2_gnu-7.1/lib/libmpi.so.40
...

@rhc54
Contributor

rhc54 commented Jun 29, 2018

Please see #5321 - might be the same problem.

@xavier1arm

xavier1arm commented Jul 25, 2018

#5321 fix works.
This issue can be closed.
Thanks!
