Missing symbols in internal PMIx on master #6815

Closed
devreal opened this issue Jul 15, 2019 · 9 comments

Contributor

devreal commented Jul 15, 2019

I'm trying to run Open MPI master (git commit 020a591) on our IB cluster but the startup fails with the following error:

$ mpirun -n 2 -N 1 ./test_mpi_rget_fetch_op
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v12: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_kval_t_class (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v20: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v20.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_kval_t_class (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v21: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v21.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_bfrops_base_print_uint (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v3: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v3.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_bfrops_base_print_uint (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v4: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so: undefined symbol: pmix_bfrops_base_pack_coord (ignored)
--------------------------------------------------------------------------
We were unable to find any usable plugins for the BFROPS framework. This PMIx
framework requires at least one plugin in order to operate. This can be caused
by any of the following:

* we were unable to build any of the plugins due to some combination
  of configure directives and available system support

* no plugin was selected due to some combination of MCA parameter
  directives versus built plugins (i.e., you excluded all the plugins
  that were built and/or could execute)

* the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
  "mca_base_component_path", is set and doesn't point to any location
  that includes at least one usable plugin for this framework.

Please check your installation and environment.
--------------------------------------------------------------------------

Configure suggests the use of internal PMIx:

$ ./configure CC=gcc CXX=g++ FTN=gfortran --with-ucx=$HOME//opt/ucx-1.6.x-gnu/ --without-verbs --prefix=$HOME/opt/openmpi-git-ucx
[...]
Miscellaneous
-----------------------
CUDA support: no
HWLOC support: internal
Libevent support: internal
PMIx support: Internal

The symbols are indeed missing:

$ nm $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so | grep " U pmix_bfrops"
                 U pmix_bfrops_base_copy_coord
                 U pmix_bfrops_base_copy_envar
                 U pmix_bfrops_base_copy_regattr
                 U pmix_bfrops_base_pack_coord
                 U pmix_bfrops_base_pack_envar
                 U pmix_bfrops_base_pack_iof_channel
                 U pmix_bfrops_base_pack_regattr
                 U pmix_bfrops_base_print_coord
                 U pmix_bfrops_base_print_envar
                 U pmix_bfrops_base_print_iof_channel
                 U pmix_bfrops_base_print_regattr
                 U pmix_bfrops_base_unpack_coord
                 U pmix_bfrops_base_unpack_envar
                 U pmix_bfrops_base_unpack_iof_channel
                 U pmix_bfrops_base_unpack_regattr

I could not find a library that defines the missing symbol(s):

$ nm $HOME/opt/openmpi-git-ucx/lib/pmix/*.so | grep pmix_bfrops_base_copy_coord
                 U pmix_bfrops_base_copy_coord

The problem does not occur on the 3.1.x and 4.0.x branches. Any idea where these symbols should come from and why they are missing? I looked through issues here and in the PMIx repo but couldn't find anything related...
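
The two nm checks above can be folded into a single pass. What follows is only a rough sketch, assuming plain nm output (no -A option); the helper name unresolved_symbols is made up for illustration and is not part of Open MPI or PMIx. It reads nm output for any number of objects and prints the symbols that are referenced as undefined (U) but not defined by any of the scanned objects.

```shell
#!/bin/sh
# Hypothetical helper: read `nm` output on stdin and print every symbol that
# appears as undefined ("U name") but is never defined ("addr TYPE name")
# in any of the scanned objects.
unresolved_symbols() {
    awk '
        NF == 2 && $1 == "U" { undef[$2] = 1; next }  # undefined reference
        NF == 3              { def[$3] = 1 }          # line with an address: a definition
        END {
            for (s in undef)
                if (!(s in def))
                    print s
        }
    ' | sort
}

# Example (paths taken from this report, adjust as needed):
#   nm "$HOME"/opt/openmpi-git-ucx/lib/pmix/*.so \
#      "$HOME"/opt/openmpi-git-ucx/lib/openmpi/*.so \
#     | unresolved_symbols | grep pmix_
```

Symbols provided by system libraries such as libc will also show up unless those libraries are included in the scan, hence the final grep in the example.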

Contributor

ggouaillardet commented Jul 15, 2019 via email

Contributor Author

devreal commented Jul 15, 2019

@ggouaillardet Thanks for your reply! I can confirm that the symbols are exported by mca_pmix_pmix4x.so. However, even after make distclean followed by autogen.pl and the usual configure, make, make install, the symbols are not found when mca_bfrops_v4.so is loaded. I tried preloading mca_pmix_pmix4x.so via LD_PRELOAD to force the symbols into memory, but that did not work (div-by-zero in OPAL_MCA_PMIX3X_pmix_hash_table_get_value_ptr); I guess that's not intended. I also don't see a dependency of mca_bfrops_v4.so on mca_pmix_pmix4x.so with ldd:

$ ldd $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so
	linux-vdso.so.1 =>  (0x00007fff6a168000)
	libm.so.6 => /lib64/libm.so.6 (0x00002b3c2dd3b000)
	libutil.so.1 => /lib64/libutil.so.1 (0x00002b3c2e03d000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00002b3c2e240000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00002b3c2e444000)
	libc.so.6 => /lib64/libc.so.6 (0x00002b3c2e660000)
	/lib64/ld-linux-x86-64.so.2 (0x00002b3c2d90d000)

How are the symbols from mca_pmix_pmix4x.so supposed to be loaded? I can see in strace that mca_pmix_pmix4x.so is opened, mmapped, and munmapped before mca_bfrops_v4.so is opened:

stat("$HOME/opt/openmpi-git-ucx/lib/openmpi/mca_pmix_pmix4x.so", {st_mode=S_IFREG|0755, st_size=1712960, ...}) = 0
open("$HOME/opt/openmpi-git-ucx/lib/openmpi/mca_pmix_pmix4x.so", O_RDONLY|O_CLOEXEC) = 10
read(10, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0 =\2\0\0\0\0\0"..., 832) = 832
fstat(10, {st_mode=S_IFREG|0755, st_size=1712960, ...}) = 0
mmap(NULL, 3714880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 10, 0) = 0x2b95f8bb6000
mprotect(0x2b95f8d36000, 2093056, PROT_NONE) = 0
mmap(0x2b95f8f35000, 36864, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 10, 0x17f000) = 0x2b95f8f35000
mmap(0x2b95f8f3e000, 12096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x2b95f8f3e000
close(10)                               = 0
mprotect(0x2b95f8f35000, 12288, PROT_READ) = 0
munmap(0x2b95f759e000, 2118288)         = 0
munmap(0x2b95f8bb6000, 3714880)         = 0            // <- mca_pmix_pmix4x.so is unmapped here

I assume that mca_pmix_pmix4x.so is opened using dlopen, but I cannot really figure out where that happens...
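
To see at a glance which plugin files the loader actually opens, and in which order, the strace log can be reduced to the successful open/openat calls on MCA plugins. This is a minimal sketch, not an Open MPI tool: the function name opened_plugins is hypothetical, and it assumes the log was captured with something like strace -f -e trace=open,openat.

```shell
#!/bin/sh
# Hypothetical helper: read an strace log on stdin and print the path of every
# MCA plugin (mca_*.so) that was opened successfully, in log order.
# Failed calls (e.g. "= -1 ENOENT ...") and stat() lines are skipped because
# the pattern requires an open/openat call ending in a numeric file descriptor.
opened_plugins() {
    sed -n 's/.*open\(at\)\{0,1\}(.*"\([^"]*mca_[^"]*\.so\)".*) = [0-9][0-9]*$/\2/p'
}

# Example (assumed invocation, adjust to your setup):
#   strace -f -e trace=open,openat -o trace.log mpirun -n 2 ./test_mpi_rget_fetch_op
#   opened_plugins < trace.log
```

If plugins from more than one PMIx generation appear in the resulting list, that would be a hint that leftover files in the install prefix are being picked up.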

Contributor

rhc54 commented Jul 15, 2019

Try the patch at #6817 and see if that helps your situation. You will need to run autogen.pl and configure again.

Contributor Author

devreal commented Jul 15, 2019

Thanks for the PR. Unfortunately, now I am seeing the FPE I mentioned earlier during startup:

$ mpirun -n 2 ./test_mpi_rget_fetch_op
[n063102:78867] *** Process received signal ***
[n063102:78867] Signal: Floating point exception (8)
[n063102:78867] Signal code: Integer divide-by-zero (1)
[n063102:78867] Failing at address: 0x2b96ccc549ec
[n063102:78867] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b96c90f25d0]
[n063102:78867] [ 1] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_hash_table_get_value_ptr+0x5c)[0x2b96ccc549ec]
[n063102:78867] [ 2] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(+0x106d57)[0x2b96ccd2bd57]
[n063102:78867] [ 3] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(+0x106f91)[0x2b96ccd2bf91]
[n063102:78867] [ 4] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_mca_base_framework_register+0x123)[0x2b96ccd258e3]
[n063102:78867] [ 5] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_mca_base_framework_open+0x11)[0x2b96ccd25a41]
[n063102:78867] [ 6] openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so(+0x230ac)[0x2b96cd9d30ac]
[n063102:78867] [ 7] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_pmix_bfrop_base_select+0xc2)[0x2b96cb921932]
[n063102:78867] [ 8] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_pmix_rte_init+0x7d8)[0x2b96cb8dc638]
[n063102:78867] [ 9] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_PMIx_server_init+0x269)[0x2b96cb8bc4f9]
[n063102:78867] [10] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x2d9)[0x2b96cb84d859]
[n063102:78867] [11] openmpi-git-ucx/lib/libopen-rte.so.0(pmix_server_init+0x337)[0x2b96c80412e7]
[n063102:78867] [12] openmpi-git-ucx/lib/openmpi/mca_ess_hnp.so(+0x469f)[0x2b96ca6f069f]
[n063102:78867] [13] openmpi-git-ucx/lib/libopen-rte.so.0(orte_init+0x2a4)[0x2b96c8085c04]
[n063102:78867] [14] openmpi-git-ucx/lib/libopen-rte.so.0(orte_submit_init+0x900)[0x2b96c808a490]
[n063102:78867] [15] mpirun[0x40100f]
[n063102:78867] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b96c93213d5]
[n063102:78867] [17] mpirun[0x400e8e]
[n063102:78867] *** End of error message ***
Floating point exception

Just to make sure that the correct revision is used:

$ ompi_info | grep revision
  Open MPI repo revision: v2.x-dev-7044-g30b37ff
  Open RTE repo revision: v2.x-dev-7044-g30b37ff
      OPAL repo revision: v2.x-dev-7044-g30b37ff

I am currently recompiling with debug symbols to get a better stack trace and will report back soon...

Contributor

ggouaillardet commented Jul 15, 2019 via email

Contributor

ggouaillardet commented Jul 15, 2019 via email

Contributor Author

devreal commented Jul 15, 2019

@ggouaillardet Thanks a lot for spotting this. I wasn't aware that old PMI libraries in the installation directory might cause this problem (it makes sense when I think about it, of course...). This solves both the FPE and the missing symbols, which apparently were both artifacts of the stale libraries. Apologies for the noise (also to @rhc54); I will make sure to clear the installation directory in the future :)
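
For anyone who runs into the same thing, here is a minimal sketch of one way to spot leftover plugins in an install prefix. It rests on assumptions that are not from this thread: you touch a marker file right before make install, and the install step gives the files it writes a fresh modification time. The helper name stale_plugins and the marker path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list MCA plugin files under DIR that are not newer
# than MARKER, i.e. files the most recent `make install` did not rewrite.
stale_plugins() {
    dir=$1
    marker=$2
    find "$dir" -type f -name 'mca_*.so' ! -newer "$marker" | sort
}

# Example (assumed workflow, adjust paths to your setup):
#   touch /tmp/install.marker
#   make install
#   stale_plugins "$HOME/opt/openmpi-git-ucx/lib" /tmp/install.marker
```

Anything this prints was installed by an earlier build and is a candidate for removal. Installing into an empty prefix avoids the question entirely.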

devreal closed this as completed Jul 15, 2019
Contributor Author

devreal commented Jul 15, 2019

@ggouaillardet Thinking about this a little more: why doesn't Open MPI ignore the older PMI libraries and just pick the ones it was built to use? It should be aware of the version it was linked against (especially if it's internal), and all the libraries seem to carry the version number in their file name. The same should be true for libraries that are loaded by PMI itself, right? I might be missing something here, as I am not familiar with PMI at all, but this still puzzles me a bit...

@ggouaillardet
Contributor

@devreal Open MPI has a modular architecture so that it can load third-party modules (possibly provided by an ISV), which is why it cannot load only "the ones it was built to use". Keep in mind that this is the master branch, so even if some version checks are performed, the Open MPI version of this previous mca_pmix_pmix3x.so module might be the same as that of the current mca_pmix_pmix4x.so.
