Missing symbols in internal PMIx on master #6815
Joseph,
these symbols should be provided by
$HOME/opt/openmpi-git-ucx/lib/openmpi/mca_pmix_pmix4x.so
When upgrading the git master branch, you sometimes have to
make distclean
and then the usual configure && make && make install
Incidentally, I noticed
configure FTN=gfortran
I am not sure this does anything; the syntax is
configure FC=gfortran
Cheers,
Gilles
…On Mon, Jul 15, 2019 at 4:14 PM Joseph Schuchart ***@***.***> wrote:
I'm trying to run Open MPI master (git commit 020a591) on our IB cluster
but the startup fails with the following error:
$ mpirun -n 2 -N 1 ./test_mpi_rget_fetch_op
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v12: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_kval_t_class (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v20: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v20.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_kval_t_class (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v21: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v21.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_bfrops_base_print_uint (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v3: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v3.so: undefined symbol: OPAL_MCA_PMIX4X_pmix_bfrops_base_print_uint (ignored)
[n062402:194324] pmix_mca_base_component_repository_open: unable to open mca_bfrops_v4: $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so: undefined symbol: pmix_bfrops_base_pack_coord (ignored)
--------------------------------------------------------------------------
We were unable to find any usable plugins for the BFROPS framework. This PMIx
framework requires at least one plugin in order to operate. This can be caused
by any of the following:
* we were unable to build any of the plugins due to some combination
of configure directives and available system support
* no plugin was selected due to some combination of MCA parameter
directives versus built plugins (i.e., you excluded all the plugins
that were built and/or could execute)
* the PMIX_INSTALL_PREFIX environment variable, or the MCA parameter
"mca_base_component_path", is set and doesn't point to any location
that includes at least one usable plugin for this framework.
Please check your installation and environment.
--------------------------------------------------------------------------
Configure suggests the use of internal PMIx:
$ ./configure CC=gcc CXX=g++ FTN=gfortran --with-ucx=$HOME//opt/ucx-1.6.x-gnu/ --without-verbs --prefix=$HOME/opt/openmpi-git-ucx
[...]
Miscellaneous
-----------------------
CUDA support: no
HWLOC support: internal
Libevent support: internal
PMIx support: Internal
The symbols are indeed missing
$ nm $HOME/opt/openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so | grep " U pmix_bfrops"
U pmix_bfrops_base_copy_coord
U pmix_bfrops_base_copy_envar
U pmix_bfrops_base_copy_regattr
U pmix_bfrops_base_pack_coord
U pmix_bfrops_base_pack_envar
U pmix_bfrops_base_pack_iof_channel
U pmix_bfrops_base_pack_regattr
U pmix_bfrops_base_print_coord
U pmix_bfrops_base_print_envar
U pmix_bfrops_base_print_iof_channel
U pmix_bfrops_base_print_regattr
U pmix_bfrops_base_unpack_coord
U pmix_bfrops_base_unpack_envar
U pmix_bfrops_base_unpack_iof_channel
U pmix_bfrops_base_unpack_regattr
I could not find a library that defines the missing symbol(s):
$ nm $HOME/opt/openmpi-git-ucx/lib/pmix/*.so | grep pmix_bfrops_base_copy_coord
U pmix_bfrops_base_copy_coord
The problem does not occur on the 3.1.x and 4.0.x branches. Any idea where
these symbols should come from and why they are missing? I looked through
issues here and in the PMIx repo but couldn't find anything related...
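One caveat about the search above: it only covers lib/pmix/*.so, while the reply at the top of this thread names lib/openmpi/mca_pmix_pmix4x.so as the provider. Below is a minimal sketch of a broader search. The helper name who_defines is made up for illustration, it assumes the output format of GNU nm -A, and it treats every symbol type other than "U" as defined, which is a simplification.

```shell
# who_defines SYMBOL: read `nm -A` output on stdin and print each file in
# which a symbol whose name contains SYMBOL has a type other than "U".
# Hypothetical helper, not part of Open MPI or PMIx.
who_defines() {
    awk -v sym="$1" '
        NF >= 3 && index($NF, sym) > 0 && $(NF - 1) != "U" {
            file = $1
            sub(/:[^:]*$/, "", file)   # drop the ":<address>" suffix added by -A
            print file
        }'
}

# Example (paths taken from the report above):
#   nm -A "$HOME"/opt/openmpi-git-ucx/lib/pmix/*.so \
#         "$HOME"/opt/openmpi-git-ucx/lib/openmpi/*.so 2>/dev/null |
#       who_defines pmix_bfrops_base_copy_coord | sort -u
```

The substring match is deliberate, so that prefixed names such as the OPAL_MCA_PMIX4X_ variants seen in the error messages are also reported.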
|
@ggouaillardet Thanks for your reply! I can confirm that the symbols are exported by
How are the symbols from
I assume that |
Try the patch at #6817 and see if that helps your situation. You will need to run autogen.pl and configure again. |
Thanks for the PR. Unfortunately, now I am seeing the FPE I mentioned earlier during startup:
Just to make sure that the correct revision is used:
I am currently recompiling with debug symbols to get a better stack trace, will report back soon... |
Joseph,
OPAL_MCA_PMIX3X_pmix_hash_table_get_value_ptr is a red flag that strongly suggests your install directory contains an old mca_pmix_pmix3x.so module.
If you remove your install directory and then run “make install”, that could fix your problem.
Cheers,
Gilles
…Sent from my iPod
On Jul 15, 2019, at 23:16, Joseph Schuchart ***@***.***> wrote:
Thanks for the PR. Unfortunately, now I am seeing the FPE I mentioned earlier during startup:
$ mpirun -n 2 ./test_mpi_rget_fetch_op
[n063102:78867] *** Process received signal ***
[n063102:78867] Signal: Floating point exception (8)
[n063102:78867] Signal code: Integer divide-by-zero (1)
[n063102:78867] Failing at address: 0x2b96ccc549ec
[n063102:78867] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x2b96c90f25d0]
[n063102:78867] [ 1] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_hash_table_get_value_ptr+0x5c)[0x2b96ccc549ec]
[n063102:78867] [ 2] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(+0x106d57)[0x2b96ccd2bd57]
[n063102:78867] [ 3] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(+0x106f91)[0x2b96ccd2bf91]
[n063102:78867] [ 4] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_mca_base_framework_register+0x123)[0x2b96ccd258e3]
[n063102:78867] [ 5] openmpi-git-ucx/lib/pmix/mca_bfrops_v12.so(OPAL_MCA_PMIX4X_pmix_mca_base_framework_open+0x11)[0x2b96ccd25a41]
[n063102:78867] [ 6] openmpi-git-ucx/lib/pmix/mca_bfrops_v4.so(+0x230ac)[0x2b96cd9d30ac]
[n063102:78867] [ 7] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_pmix_bfrop_base_select+0xc2)[0x2b96cb921932]
[n063102:78867] [ 8] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_pmix_rte_init+0x7d8)[0x2b96cb8dc638]
[n063102:78867] [ 9] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(OPAL_MCA_PMIX3X_PMIx_server_init+0x269)[0x2b96cb8bc4f9]
[n063102:78867] [10] openmpi-git-ucx/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x2d9)[0x2b96cb84d859]
[n063102:78867] [11] openmpi-git-ucx/lib/libopen-rte.so.0(pmix_server_init+0x337)[0x2b96c80412e7]
[n063102:78867] [12] openmpi-git-ucx/lib/openmpi/mca_ess_hnp.so(+0x469f)[0x2b96ca6f069f]
[n063102:78867] [13] openmpi-git-ucx/lib/libopen-rte.so.0(orte_init+0x2a4)[0x2b96c8085c04]
[n063102:78867] [14] openmpi-git-ucx/lib/libopen-rte.so.0(orte_submit_init+0x900)[0x2b96c808a490]
[n063102:78867] [15] mpirun[0x40100f]
[n063102:78867] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b96c93213d5]
[n063102:78867] [17] mpirun[0x400e8e]
[n063102:78867] *** End of error message ***
Floating point exception
Just to make sure that the correct revision is used:
$ ompi_info | grep revision
Open MPI repo revision: v2.x-dev-7044-g30b37ff
Open RTE repo revision: v2.x-dev-7044-g30b37ff
OPAL repo revision: v2.x-dev-7044-g30b37ff
I am currently recompiling with debug symbols to get a better stack trace, will report back soon...
|
Joseph,
One more thing, you should not have a
opal/mca/pmix/pmix3x
directory in your source tree.
“git clean -dn” can help you spot files that should probably not be there.
Cheers,
Gilles
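Taken together, the advice in this thread amounts to a clean rebuild. A rough sketch follows, assuming the current directory is a git checkout of Open MPI. The function names are hypothetical, and the configure arguments are trimmed from the original report, with FC in place of FTN. Because removing the install prefix is destructive, the sketch only prints the commands unless DRY_RUN=0 is set.

```shell
# _run: print the command when DRY_RUN is unset or 1, otherwise execute it.
_run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# clean_rebuild PREFIX: hypothetical wrapper around the steps suggested above.
#   git clean -dn  only lists untracked files and directories (nothing is deleted)
#   make distclean resets the build tree (it fails if the tree was never configured)
#   rm -rf PREFIX  removes stale modules such as an old mca_pmix_pmix3x.so
#   autogen.pl     regenerates the build system of a git checkout
clean_rebuild() {
    prefix=$1
    _run git clean -dn &&
    _run make distclean &&
    _run rm -rf "$prefix" &&
    _run ./autogen.pl &&
    _run ./configure CC=gcc CXX=g++ FC=gfortran --prefix="$prefix" &&
    _run make &&
    _run make install
}
```

Calling clean_rebuild "$HOME/opt/openmpi-git-ucx" with the default settings prints the seven commands for review. Site-specific options such as --with-ucx would still need to be added to the configure line.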
|
@ggouaillardet Thanks a lot for spotting this, I wasn't aware that old pmi libraries in the installation directory might cause this problem (it makes sense when I think about it of course...). This solves both the FPE and the missing symbols, which apparently were both artifacts of this. Apologies for the noise (also to @rhc54), I will make sure to clear the installation directory in the future :) |
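A quick way to spot this kind of leftover before running anything is to list the PMIx components in the install tree. The sketch below is illustrative only: pmix_components is a made-up helper, and it assumes the lib/openmpi/mca_pmix_pmix*.so naming seen in the backtrace above.

```shell
# pmix_components PREFIX: print the PMIx glue components installed under
# PREFIX/lib/openmpi, one per line. Hypothetical helper for illustration.
pmix_components() {
    for f in "$1"/lib/openmpi/mca_pmix_pmix*.so; do
        # When nothing matches, the unexpanded pattern is skipped here.
        if [ -e "$f" ]; then
            basename "$f"
        fi
    done
}

# Example: pmix_components "$HOME/opt/openmpi-git-ucx"
```

More than one entry (for example both mca_pmix_pmix3x.so and mca_pmix_pmix4x.so) suggests that an older install was not cleaned out of the prefix.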
@ggouaillardet Thinking about this a little more: why doesn't Open MPI ignore the older PMI libraries and just pick the ones it was built to use? It should be aware of the version it was linked against (esp. if it's internal) and all the libraries seem to carry the version number in their file name. The same should be true for libraries that are loaded by PMI itself, right? I might be missing something here as I am not familiar with PMI at all but this still puzzles me a bit... |
@devreal Open MPI has a modular architecture so it can load third-party modules (possibly provided by an ISV), which is why it cannot load only "the ones it was built to use". Keep in mind this is the |