Master tls refactor Part 2, rework address table as hash, and other minor changes #6

Open
wants to merge 239 commits into
base: master-tls-refactor_v4

Conversation

janjust
Owner

@janjust janjust commented Aug 11, 2020

No description provided.

cniethammer and others added 11 commits June 30, 2020 22:04
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
Signed-off-by: Christoph Niethammer <niethammer@hlrs.de>
We completely disable C11 atomic op support for _Atomic for
all Intel compilers prior to 20200310 (currently the latest
released version) by switching to our pre-C11 atomic
operations.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
`--enable-mem-debug` `#define`s `realloc`/`free` as macros, but these
macros also match when the same names appear as struct members in member
references. Rename the members to avoid this matching.

See open-mpi#6995

Signed-off-by: Bert Wesarg <bert.wesarg@tu-dresden.de>
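
A minimal sketch of the macro/member clash described in this commit message. All names here (`debug_free`, `struct allocator`, `release`, `sketch_release_via_member`) are hypothetical and are not taken from the Open MPI sources; the wrapper only counts calls. It assumes a function-like `free` macro, similar in spirit to what a memory-debug build might define.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static size_t debug_free_calls = 0;

/* Defined before the macro below, so this call reaches the real free(). */
static void debug_free(void *ptr, const char *file, int line)
{
    (void) file;
    (void) line;
    debug_free_calls++;
    free(ptr);
}

/* Hypothetical stand-in for a memory-debug macro. */
#define free(ptr) debug_free((ptr), __FILE__, __LINE__)

struct allocator {
    /* If this member were named `free`, a call such as `a.free(p)` would be
     * rewritten by the macro into `a.debug_free((p), ...)` and fail to
     * compile. A different member name sidesteps the expansion. */
    void (*release)(void *ptr);
};

/* This use of free() is intended to expand to debug_free(). */
static void plain_release(void *ptr)
{
    free(ptr);
}

/* Returns how many times the debug wrapper ran during one release. */
static size_t sketch_release_via_member(void)
{
    struct allocator a = { .release = plain_release };
    size_t before = debug_free_calls;
    void *p = malloc(16);

    assert(NULL != p);
    a.release(p); /* member access is untouched by the macro */
    return debug_free_calls - before;
}
```

Renaming the member keeps the debug macro active for ordinary calls while leaving member access alone.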
Signed-off-by: Aurelien Bouteiller <bouteill@icl.utk.edu>
EFA incorrectly implements FI_DELIVERY_COMPLETE in earlier libfabric
versions. While FI_DELIVERY_COMPLETE would be advertised by the
provider, completions would return too early by not accounting for
bounce buffers on the receive side. This would cause the BTL
to receive early completions that lead to correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
The btl/ofi does not currently utilize the common ofi include/exclude
list. Added verification code similar to the mtl/ofi that will check if
the info object is in the include or exclude list. If it isn't in the
include list or is in the exclude list, validate_info will return
OPAL_ERROR. The btl/ofi will no longer pass a provider name as a hint
when calling getinfo, instead filtering the provider during
validate_info.

This patch also moves the is_in_list MTL function into common code and
adds additional debugging output to the BTL to match the MTL standard.

Signed-off-by: William Zhang <wilzhang@amazon.com>
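
The include/exclude check described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the function names, the return codes `SKETCH_SUCCESS`/`SKETCH_ERROR`, the comma-separated list format, and the exact-name matching are all hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_SUCCESS 0
#define SKETCH_ERROR   (-1)

/* Return true if `name` equals one entry of the comma-separated `list`. */
static bool sketch_is_in_list(const char *list, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = list;

    while ('\0' != *p) {
        const char *end = strchr(p, ',');
        size_t len = (NULL != end) ? (size_t) (end - p) : strlen(p);

        if (len == name_len && 0 == strncmp(p, name, len)) {
            return true;
        }
        if (NULL == end) {
            break;
        }
        p = end + 1;
    }
    return false;
}

/* Reject a provider that is missing from a given include list or present in
 * a given exclude list. A NULL list means "no such filter was set". */
static int sketch_validate_provider(const char *include, const char *exclude,
                                    const char *prov_name)
{
    assert(NULL != prov_name);

    if (NULL != include && !sketch_is_in_list(include, prov_name)) {
        return SKETCH_ERROR;
    }
    if (NULL != exclude && sketch_is_in_list(exclude, prov_name)) {
        return SKETCH_ERROR;
    }
    return SKETCH_SUCCESS;
}
```

Filtering after the query, rather than passing a provider name as a hint, lets one shared check serve both the include and the exclude case.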
…take2

Second take on fixing the Intel _Atomic atomic operation warning
btl/ofi: Use common provider include/exclude list
(`prte_hwloc_base_get_locality_string` never returns locality string with L0).

Signed-off-by: Mikhail Kurnosov <mkurnosov@gmail.com>
@janjust janjust changed the title from "Master tls refactor Part 2, address table is hash, and other minor changes" to "Master tls refactor Part 2, rework address table as hash, and other minor changes" Aug 11, 2020
jjhursey and others added 18 commits August 11, 2020 08:58
opal/hwloc: fix a typo in parsing locality string
bug fix: des->tag = hdr->frag, should be hdr->tag
The ofi_rxm provider is dependent upon the underlying hardware for its
implementation of FI_DELIVERY_COMPLETE. Since this can lead to early
completions, we disable the provider to avoid correctness issues.

This is not an issue in the mtl/ofi as it does not require
FI_DELIVERY_COMPLETE.

Signed-off-by: William Zhang <wilzhang@amazon.com>
btl/ofi: Disable EFA provider in versions earlier than libfabric 1.12.0
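
A hedged sketch of the kind of version gate this title implies: reject the EFA provider when the runtime libfabric version is older than 1.12. The helper names and the `SKETCH_VERSION` macro are hypothetical; the major/minor packing is an assumption modeled on how libfabric encodes versions, and the block deliberately avoids calling any libfabric API so it stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed packing: major in the upper 16 bits, minor in the lower 16. */
#define SKETCH_VERSION(major, minor) \
    (((uint32_t) (major) << 16) | (uint32_t) (minor))

/* Return false for "efa" when the runtime version predates 1.12.
 * Every other provider name is accepted by this particular check. */
static bool sketch_provider_usable(const char *prov_name,
                                   uint32_t runtime_version)
{
    assert(NULL != prov_name);

    if (0 == strcmp(prov_name, "efa") &&
        runtime_version < SKETCH_VERSION(1, 12)) {
        return false;
    }
    return true;
}
```

In a real build the runtime version would come from the library itself rather than from a caller-supplied constant.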
The C++ bindings were removed a while ago;
MPI::ERRORS_THROW_EXCEPTIONS and MPI_ERRORS_THROW_EXCEPTIONS no longer
exist.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
MPI-4 is finally cleaning up its language: an MPI "exception" does not
actually exist.  The only thing that exists is an MPI "error" (and
associated handlers).  This commit replaces all relevant uses of the
word "exception" with "error".  Note that this is still applicable in
versions of the MPI standard less than MPI-4.0 (indeed, nearly all the
cases fixed in this commit are just changes to comments, anyway).

One exception to this is the Java bindings, where there's an
MPIException class.  In hindsight, it probably should have been named
MPIError, but changing it now would break anyone who is using the Java
bindings.

Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
…rs-and-exceptions

Cleanup of MPI errors and exceptions
the ofi mtl mrecv was not properly setting the message in/out
arg to MPI_MRECV to MPI_MESSAGE_NULL.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
Improve configury to check whether icc supports `-Wno-long-double`.
This prevents seeing 100s of messages like this:

icc: command line warning #10148: option '-Wno-long-double' not supported

A similar patch will be needed for pmix.

Signed-off-by: Howard Pritchard <hppritcha@gmail.com>
Add comments in the ADAPT module

Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
* piggybacking Bull functionalities

* coll/adapt: Fix naming conventions and C11 atomic use

This commit fixes some naming convention issues, such as function names
which should follow the naming ompi_coll_adapt instead of
mca_coll_adapt, reserved for component and module naming (cf. tuned
collective component);

It also fixes the use of _Atomic construct, which is only valid in C11.
OPAL constructs have already been adapted to that use, so use
opal_atomic_* types instead.

* coll/adapt: Remove unused component field in module

This commit removes an unneeded field referencing the component in the
module of adapt, as it is already available through the
mca_coll_adapt_component global variable.

Signed-off-by: Marc Sergent <marc.sergent@atos.net>
Co-authored-by: Lemarinier, Pierre <pierre.lemarinier@atos.net>
Co-authored-by: pierrele <31764860+pierrele@users.noreply.github.com>
API consistent with other collective modules
Add comments
Other minor cleanups.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
As it is possible to have multiple outstanding non-blocking collectives
provided by different collective modules, we need a consistent
mechanism to allow them to select unique tags for each instance of a
collective.

Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
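
One way such a tag-selection mechanism could look is a per-communicator counter that hands out tags from a reserved range. The sketch below is an assumption, not the code in this commit: the struct, the function names, and the range bounds `SKETCH_TAG_BASE`/`SKETCH_TAG_END` are hypothetical, and it uses plain C11 `<stdatomic.h>` rather than OPAL's atomic wrappers.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_TAG_BASE (-100)  /* first tag handed out (hypothetical) */
#define SKETCH_TAG_END  (-1000) /* lower bound of the range (hypothetical) */

struct sketch_comm {
    atomic_int next_tag; /* next unreserved tag; tags count downward */
};

static void sketch_comm_init(struct sketch_comm *comm)
{
    atomic_init(&comm->next_tag, SKETCH_TAG_BASE);
}

/* Reserve `count` consecutive tags and return the first one. The caller may
 * use first, first - 1, ..., first - count + 1. When the range would be
 * overrun, allocation wraps back to SKETCH_TAG_BASE. */
static int sketch_reserve_tags(struct sketch_comm *comm, int count)
{
    int seen = atomic_load(&comm->next_tag);
    int first, next;

    assert(count > 0);
    do {
        first = seen;
        if (first - count < SKETCH_TAG_END) {
            first = SKETCH_TAG_BASE; /* wrap around */
        }
        next = first - count;
        /* On failure, `seen` is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak(&comm->next_tag, &seen, next));

    return first;
}
```

Because every collective module would draw from the same counter, two outstanding non-blocking collectives on one communicator do not receive overlapping tags until the range wraps.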
Set request in ibcast.c to empty when the count is 0.

Signed-off-by: Xi Luo <xluo12@vols.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
Signed-off-by: George Bosilca <bosilca@icl.utk.edu>
The reduce scatter block and reduce scatter algorithms were hitting
correctness issues for non-commutative strided tests. We will revert to
the original default algorithms for those two collectives (basic linear
and non-overlapping, respectively) in the non-commutative op case.

See open-mpi#8010

Signed-off-by: William Zhang <wilzhang@amazon.com>
paklui and others added 29 commits November 17, 2020 09:29
…iling param.c

Signed-off-by: Pak Lui <pak.lui@amd.com>
- fixed a potential leak in error handling

Signed-off-by: Sergey Oblomov <sergeyo@nvidia.com>
COLL TUNED: Use per-rank data size instead of total size for decision in allgatherv
oshmem/tools/oshmem_info: fix fortran keyword issue when compiling param.c
Signed-off-by: Ralph Castain <rhc@pmix.org>
Do not pass --enable-debug to internal hwloc
Seems like a copy/pasted typo in ob1 comments

Signed-off-by: Julien EMMANUEL <julien.emmanuel@inria.fr>
The selectable list is sorted from lowest to highest priority, so the
user-defined preferences should be appended to the list.
The preference treatment should also maintain the order provided by the user
(first item has highest priority), so switch the loop order.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
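
The ordering rule in this commit message can be illustrated with a small sketch. It assumes the selectable list is a plain array of component names sorted from lowest to highest priority; the function names and the array representation are hypothetical and do not reflect the actual list type used in coll/han.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Move list[idx] to the end of the array, shifting later entries down. */
static void sketch_move_to_tail(const char **list, size_t n, size_t idx)
{
    const char *item;

    assert(idx < n);
    item = list[idx];
    memmove(&list[idx], &list[idx + 1], (n - idx - 1) * sizeof(list[0]));
    list[n - 1] = item;
}

/* `list` is sorted from lowest to highest priority. `prefs` is the user's
 * order, first item most preferred. Walking prefs backwards and appending
 * each match leaves the user's first choice in the last (highest) slot. */
static void sketch_apply_preferences(const char **list, size_t n,
                                     const char *const *prefs, size_t nprefs)
{
    for (size_t p = nprefs; p-- > 0;) {
        for (size_t i = 0; i < n; i++) {
            if (0 == strcmp(list[i], prefs[p])) {
                sketch_move_to_tail(list, n, i);
                break;
            }
        }
    }
}
```

Iterating the preferences front to back would instead leave the user's last choice at the tail, which is the inversion the loop-order switch avoids.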
Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
…e components

Also make coll/tuned the default for shared memory communication
as coll/sm has shown performance issues that need investigation.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
This has been shown to be more effective in achieving overlap
of inter- and intra-node communication and reduces the initial
delay before hitting the network.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
In ob1 we have four similar conditions but they are not written
in a uniform way

Signed-off-by: Julien EMMANUEL <julien.emmanuel@inria.fr>
Typo in ob1 comments, and uniform conditions
…arning-wpool

PML/UCX/WPOOL: fixed coverity issue
 - fix path to getdate.sh
 - do not prepend "date" to the revision
 - support git worktree

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Thanks FX Coudert for reporting this issue and pointing
to a solution.

Refs. open-mpi#8218

Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
Signed-off-by: Jeff Squyres <jsquyres@cisco.com>
autogen.pl: patch libtool.m4 for OSX Big Sur
Resolves the PRRTE launch scale limitation

Signed-off-by: Ralph Castain <rhc@pmix.org>
Exclude HAN, don't include it.

Signed-off-by: Joseph Schuchart <schuchart@icl.utk.edu>
coll/han: fix coll preference selection in mca_coll_han_comm_create_new
Signed-off-by: Leonid Genkin <lgenkin@nvidia.com>
Replace usage of the deprecated NB API of UCX with NBX
…ry functionality to

support world_comm rank translation

Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
Update opal/mca/common/ucx/common_ucx_wpool.c
Update opal/mca/common/ucx/common_ucx_wpool.h
Update opal/mca/common/ucx/common_ucx_wpool_int.h

Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@nvidia.com>
Co-authored-by: Artem Polyakov <artpol84@gmail.com>

Signed-off-by: Tomislav Janjusic <tomislavj@mellanox.com>
@janjust janjust force-pushed the master-tls-refactor_v5 branch from a24893e to b7683a4 Compare November 30, 2020 16:44