Topic/pmixstat #12

Closed
wants to merge 19 commits into from

jjhursey
Owner

@jjhursey jjhursey commented Feb 5, 2020

CI testing for upstream: open-mpi#7202

bgoglin and others added 19 commits November 27, 2019 12:41
Both opal_hwloc_base_get_relative_locality() and _get_locality_string()
iterate over hwloc levels to build the proc locality information.
Unfortunately, NUMA nodes are not in those normal levels anymore since 2.0.
We have to explicitly look at the special NUMA level to get that locality info.

I am factoring the core of the iterations out into dedicated "_by_depth"
functions and calling them again for the NUMA level at the end of the loops.
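
The refactoring described above can be pictured with a small toy model. This is only a sketch in Python, not the Open MPI or hwloc code: the `TOPOLOGY` dictionary, the `NUMA_DEPTH` placeholder value, and both function names are hypothetical, chosen to mirror the "_by_depth" idea of running one helper per normal level and then once more for the special NUMA level.

```python
# Toy model: a "topology" maps depth -> list of (object type, set of CPU ids).
# Normal levels use depths 0..N-1; the NUMA level sits at a special negative
# depth, outside the normal levels. All names here are hypothetical.
NUMA_DEPTH = -3  # placeholder value for this sketch only

TOPOLOGY = {
    0: [("machine", {0, 1, 2, 3})],
    1: [("package", {0, 1}), ("package", {2, 3})],
    2: [("core", {0}), ("core", {1}), ("core", {2}), ("core", {3})],
    NUMA_DEPTH: [("numa", {0, 1}), ("numa", {2, 3})],
}


def locality_by_depth(topology, depth, cpus_a, cpus_b):
    """Return the object types at one depth that hold CPUs of both processes."""
    shared = set()
    for obj_type, cpuset in topology.get(depth, []):
        if cpuset & cpus_a and cpuset & cpus_b:
            shared.add(obj_type)
    return shared


def relative_locality(topology, cpus_a, cpus_b):
    """Walk the normal levels, then visit the NUMA level explicitly."""
    shared = set()
    for depth in sorted(d for d in topology if d >= 0):
        shared |= locality_by_depth(topology, depth, cpus_a, cpus_b)
    # NUMA is not among the normal levels, so the loop above never sees it.
    shared |= locality_by_depth(topology, NUMA_DEPTH, cpus_a, cpus_b)
    return shared
```

In this model, processes bound to CPUs 0 and 1 share the machine, the package, and the NUMA node; without the final explicit call, the NUMA entry would be missing from the result.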

Thanks to Hatem Elshazly for reporting the NUMA communicator split failure
at https://www.mail-archive.com/users@lists.open-mpi.org/msg33589.html

It looks like only the opal_hwloc_base_get_locality_string() part is needed
to fix that split, but there's no reason not to fix get_relative_locality()
as well.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
Forgot to include a fix for the Fortran test used to check whether
new dtags are supported.

Related to open-mpi#7268

This patch is already included on v4.0.x branch.

Signed-off-by: Howard Pritchard <howardp@lanl.gov>
Signed-off-by: Artem Ryabov <artemry@mellanox.com>
…x_ci_for_release_branches

Enabled Mellanox CI for release branches (changes for master branch).
Fix the C types for the following:

* MPI_UNWEIGHTED
* MPI_WEIGHTS_EMPTY
* MPI_ARGV_NULL
* MPI_ARGVS_NULL
* MPI_ERRCODES_IGNORE

There is lengthy discussion on
open-mpi#7210 describing the issue; the
gist of it is that the C and Fortran types for several MPI global
sentinel values should agree (specifically: their sizes must(**)
agree).  We erroneously had several of these array-like sentinel
values be "array-like" values in C.  E.g., MPI_ERRCODES_IGNORE was an
(int *) in C while its corresponding Fortran type was "integer,
dimension(1)".  On a 64 bit platform, this resulted in C expecting the
symbol size to be sizeof(int*)==8 while Fortran expected the symbol
size to be sizeof(INTEGER, DIMENSION(1))==4.

That is incorrect -- the corresponding C type needed to be (int).
Then both C and Fortran expect the size of the symbol to be the same.
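
The size arithmetic above can be checked with a short sketch. It uses Python's `ctypes` to query the C type sizes on the machine it runs on; the 4-byte size for the Fortran `integer, dimension(1)` is an assumption taken from the text (default INTEGER of 4 bytes), and the helper name `sizes_agree` is hypothetical.

```python
import ctypes

# Assumption for this sketch: the Fortran default INTEGER is 4 bytes,
# so "integer, dimension(1)" occupies 4 bytes.
FORTRAN_INTEGER_DIM1_SIZE = 4

# Size the C side expects when the symbol is declared as (int *).
c_pointer_size = ctypes.sizeof(ctypes.POINTER(ctypes.c_int))

# Size the C side expects when the symbol is declared as (int).
c_int_size = ctypes.sizeof(ctypes.c_int)


def sizes_agree(c_size, fortran_size):
    """Return True when both languages expect the same symbol size."""
    return c_size == fortran_size


print("int *:", c_pointer_size,
      "agrees:", sizes_agree(c_pointer_size, FORTRAN_INTEGER_DIM1_SIZE))
print("int  :", c_int_size,
      "agrees:", sizes_agree(c_int_size, FORTRAN_INTEGER_DIM1_SIZE))
```

On a 64-bit platform the pointer declaration is the one that disagrees with the Fortran side, which is the mismatch the commit removes.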

(**) NOTE: This code has been wrong for years.  This mismatch of types
typically worked because, due to Fortran's call-by-reference
semantics, Open MPI was comparing the *addresses* of these instances,
not their *types* (or sizes) -- so even if C expected the size of the
symbol to be X and Fortran expected the size of the symbol to be Y
(where X!=Y), all we really checked at run time was that the addresses
of the symbols were the same.  But it caused linker warning messages,
and even caused errors in some cases.
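
To see why an address comparison hides a type or size mismatch, here is a minimal sketch using Python's `ctypes`. It is an illustration of the idea only, not Open MPI's code: `SENTINEL`, `is_sentinel`, `alias`, and `lookalike` are hypothetical names.

```python
import ctypes

# Hypothetical stand-in for the storage behind a sentinel such as
# MPI_ERRCODES_IGNORE; the name is illustrative only.
SENTINEL = ctypes.c_int(0)


def is_sentinel(arg):
    """Match by address only; the declared type and size of arg are ignored."""
    return ctypes.addressof(arg) == ctypes.addressof(SENTINEL)


# A view of the same storage through a different, smaller type.
alias = ctypes.c_byte.from_address(ctypes.addressof(SENTINEL))

# A distinct object with the same type and value as the sentinel.
lookalike = ctypes.c_int(0)
```

Here `is_sentinel(alias)` is true even though `alias` has a different type and size, while `is_sentinel(lookalike)` is false despite the identical type and value. A run-time check of this kind never notices the disagreement that the linker sees.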

Specifically: due to a GNU ld bug
(https://sourceware.org/bugzilla/show_bug.cgi?id=25236), the 5 common
symbols are incorrectly versioned VER_NDX_LOCAL because their
definitions in Fortran sources have smaller st_size than those in
libmpi.so.

This makes the Fortran library not linkable with lld in distributions
that ship openmpi built with -Wl,--version-script
(https://bugs.llvm.org/show_bug.cgi?id=43748):

  % mpifort -fuse-ld=lld /dev/null
  ld.lld: error: corrupt input file: version definition index 0 for symbol
  mpi_fortran_argv_null_ is out of bounds
  >>> defined in /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_usempif08.so
  ...

If we fix the C and Fortran symbols to actually be the same size, the
problem goes away and the GNU ld bug does not come into play.

This commit also fixes a minor issue that MPI_UNWEIGHTED and
MPI_WEIGHTS_EMPTY were not declared as Fortran arrays (not fully fixed
by commit 107c007).

Fixes open-mpi#7209

Signed-off-by: Fangrui Song <i@maskray.me>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Fix Fortran st_size fields of mpi_fortran_argv_null_, mpi_fortran_weights_empty_, mpi_fortran_unweighted_, mpi_fortran_errcodes_ignore_, and mpi_fortran_argvs_null_
Signed-off-by: Dmitry Gladkov <dmitrygla@mellanox.com>
hwloc/base: fix opal proc locality wrt NUMA nodes on hwloc 2.0
These -D's are for C compilation, not Fortran compilation.  Remove
this useless statement.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Automake's Fortran compilation rules inexplicably use CPPFLAGS and
AM_CPPFLAGS.  Unfortunately, this can cause problems in some cases
(e.g., picking up already-installed mpi.mod in a system-default
include search path).

So in relevant module-using Fortran compilation Makefile.am's, zero
out CPPFLAGS and AM_CPPFLAGS.

This has a side-effect of requiring that we compile the one .c file in
the F08 library in a new, separate subdirectory (with its own
Makefile.am that does _not_ have CPPFLAGS/AM_CPPFLAGS zeroed out).

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Ralph Castain <rhc@pmix.org>
Always consider retrieval of HOSTNAME to be optional
SPML/UCX: Fix compilation warnings with GCC
Signed-off-by: Austen Lauria <awlauria@us.ibm.com>
Will be replaced by PRRTE. Ensure that OMPI and OPAL layers build
without reference to ORTE. Setup opal/pmix framework to be static.
Remove support for all PMI-1 and PMI-2 libraries. Add support for
"external" pmix component as well as internal v4 one.

remove orte: misc fixes

 - UCX fixes
 - VPATH issue
 - oshmem fixes
 - remove useless definition
 - Add PRRTE submodule
 - Get autogen.pl to traverse PRRTE submodule
 - Remove stale orcm reference
 - Configure embedded PRRTE
 - Correctly pass the prefix to PRRTE
 - Correctly set the OMPI_WANT_PRRTE am_conditional
 - Move prrte configuration to the end of OMPI's configure.ac
 - Make mpirun a symlink to prun, when available
 - Fix makedist with --no-orte/--no-prrte option
 - Add a `--no-prrte` option which is the same as the legacy
   `--no-orte` option.
 - Remove embedded PMIx tarball. Replace it with new submodule
   pointing to OpenPMIx master repo's master branch
 - Some cleanup in PRRTE integration and add config summary entry
 - Correctly set the hostname
 - Fix locality
 - Fix singleton operations

Signed-off-by: Ralph Castain <rhc@pmix.org>
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
Signed-off-by: Ralph Castain <rhc@pmix.org>
@ibm-ompi
Collaborator

ibm-ompi commented Feb 5, 2020

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/f80e5bdffc4738ddcf704c0114e780b6

@ibm-ompi
Collaborator

ibm-ompi commented Feb 5, 2020

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0c6e44147f05d5343ae6ab34c2609bcb

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:retest

@ibm-ompi
Collaborator

ibm-ompi commented Feb 5, 2020

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:gnu:retest
bot:ibm:nodes:3:test

@ibm-ompi
Collaborator

ibm-ompi commented Feb 5, 2020

The IBM CI (XL) build failed! Please review the log, linked below.

Gist: https://gist.github.com/d88abb002df75079a19cf59de7c9241b

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:prrte:retest

@ibm-ompi
Collaborator

ibm-ompi commented Feb 5, 2020

The IBM CI (GNU/Scale) build failed! Please review the log, linked below.

Gist: https://gist.github.com/ibm-ompi/88b480fb98dfd2bdac939f023e1f8fe6

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:prrte:retest

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:xl:retest

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:prrte:retest

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:prrte:retest

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:retest

@jjhursey
Owner Author

jjhursey commented Feb 5, 2020

bot:ibm:retest

@jjhursey
Owner Author

jjhursey commented Feb 6, 2020

bot:ibm:retest

@ibm-ompi
Collaborator

ibm-ompi commented Feb 6, 2020

The IBM CI (PRRTE) build failed! Please review the log, linked below.

Gist: https://gist.github.com/19d84ead9473d35fd7343b6c24ff8bfa

@jjhursey
Owner Author

jjhursey commented Feb 6, 2020

bot:ibm:retest

@ibm-ompi
Collaborator

ibm-ompi commented Feb 6, 2020

The IBM CI (PRRTE) build failed! Please review the log, linked below.

Gist: https://gist.github.com/0bc63ee355523e9b3b0f6dcc39d8dffd

@jjhursey
Owner Author

jjhursey commented Feb 6, 2020

bot:ibm:retest

@ibm-ompi
Collaborator

ibm-ompi commented Feb 6, 2020

The IBM CI (PRRTE) build failed! Please review the log, linked below.

Gist: https://gist.github.com/47664cf5630ac7307abaca1c4e822de3

@jjhursey
Owner Author

jjhursey commented Feb 6, 2020

bot:ibm:retest

@jjhursey
Owner Author

jjhursey commented Feb 6, 2020

bot:ibm:retest

@jjhursey
Owner Author

bot:ibm:pgi:retest

@jjhursey jjhursey closed this Feb 24, 2020
@jjhursey jjhursey deleted the topic/pmixstat branch February 24, 2020 22:53
jjhursey pushed a commit that referenced this pull request Jan 6, 2023