#2281: Add new allreduce algorithms for Group, Objgroup and Collection #2337

Draft
wants to merge 70 commits into
base: develop
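For readers unfamiliar with the recursive doubling scheme named in the commit titles of this PR, the following is a minimal single-process sketch of the idea, not the code from this branch. It assumes a power-of-two rank count and a commutative sum; `recursiveDoublingSum` is a hypothetical name used only for illustration, and the "ranks" are simulated as vector slots rather than vt nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Simulates a recursive-doubling allreduce (sum) across values.size() ranks
// inside one process. Assumption: the rank count is a power of two.
// Returns the value held by each rank after the final step.
std::vector<long> recursiveDoublingSum(std::vector<long> values) {
  std::size_t const n = values.size();
  assert(n > 0 && (n & (n - 1)) == 0 && "rank count must be a power of two");

  // In the step with distance `mask`, rank r exchanges its partial result
  // with rank (r XOR mask) and both combine the two partial results.
  for (std::size_t mask = 1; mask < n; mask <<= 1) {
    std::vector<long> next(n);
    for (std::size_t rank = 0; rank < n; ++rank) {
      std::size_t const partner = rank ^ mask;
      next[rank] = values[rank] + values[partner];
    }
    values = std::move(next);
  }
  return values;
}
```

After log2(n) steps every simulated rank holds the full sum. A real implementation would also have to handle non-power-of-two rank counts and operand ordering for non-commutative operators, both of which the commits in this PR mention.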
e5772c7
#2240: Initial work for new allreduce
JacobDomagala Mar 24, 2024
bfa369c
#2240: Semi working Rabenseifner
JacobDomagala Mar 27, 2024
69238fe
#2240: Working Rabenseifner (non-commutative ops)
JacobDomagala Apr 4, 2024
708a9da
#2240: Fix non power of 2 for new allreduce
JacobDomagala Apr 7, 2024
1fad3f5
#2240: Initial work for adding recursive doubling allreduce algorithm
JacobDomagala Apr 10, 2024
7d936f5
#2240: Make sure the order of reduce operations is correct
JacobDomagala Apr 11, 2024
f140748
#2240: Working Recursive doubling
JacobDomagala Apr 15, 2024
538dc03
#2240: Code cleanup and make Rabenseifner work with any Op type
JacobDomagala Apr 16, 2024
6e0f7d7
#2240: Improve accuracy of timing allreduce algorithms in allreduce.cc
JacobDomagala Apr 26, 2024
6e8e9fe
#2240: Add unit tests for new allreduce and cleanup code
JacobDomagala May 21, 2024
7f6f4eb
#2240: DataHandler for Rabenseifner allreduce that provides common AP…
JacobDomagala May 28, 2024
b5124c8
#2240: Fix warnings
JacobDomagala May 28, 2024
4bee334
#2240: Update ObjGroup test to use custom DataHandler for Rabenseifne…
JacobDomagala May 30, 2024
310f40b
#2240: Add unit test for Rabenseifner with Kokkos::View as DataType a…
JacobDomagala May 31, 2024
a00c6d4
#2240: Move function definitions to impl.h file for Rabenseifner
JacobDomagala Jun 3, 2024
11a8c45
#2240: Add allreduce print category and use it in rabenseifner instea…
JacobDomagala Jun 4, 2024
108226b
#2240: Provide documentation for RecursiveDoubling algorithm
JacobDomagala Jun 4, 2024
aec07ac
#2240: Use vt_debug_print for RecursiveDoubling allreduce
JacobDomagala Jun 4, 2024
8bc5659
#2240: Update allreduce perf tests to use array of payload sizes
JacobDomagala Jun 5, 2024
c3d1e59
#2240: Fix runtime failure in allreduce perf test
JacobDomagala Jun 7, 2024
9ead728
#2240: Working allreduce perf test with Kokkos
JacobDomagala Jun 16, 2024
ba11fb6
#2240: Working RecursiveDoubling with multiple allreduce in flight
JacobDomagala Jun 17, 2024
e8ba585
#2240: Update Rabenseifner to use ID for each allreduce and update tests
JacobDomagala Jun 18, 2024
2c184cf
#2240: Fix failing unit and performance tests for multiple allreduce …
JacobDomagala Jun 25, 2024
213c6af
#2240: Fix compile issues on some compilers and runtime issue with pa…
JacobDomagala Jul 2, 2024
2e80f87
#2240: Update logs
JacobDomagala Jul 6, 2024
7f78905
#2240: Fix issues with handlers being executed and payload not being …
JacobDomagala Jul 16, 2024
02ddc52
#2240: Add helpers and use Kokkos::View for internals of Rabenseifner…
JacobDomagala Jul 17, 2024
8329b0f
#2240: Store Reducers by tuple(ProxyType, DataType, OperandType)
JacobDomagala Jul 18, 2024
383fdee
#2281: Initial work to make collective group info contain the informa…
JacobDomagala Jul 24, 2024
9b060d6
#2281: Working nodes information for each group node
JacobDomagala Jul 29, 2024
06f65f9
#2281: Working original Rabenseifner with groups
JacobDomagala Aug 8, 2024
939000f
#2281: Working allreduce with Groups. Only thing missing is executing…
JacobDomagala Aug 9, 2024
76dad58
#2281: Working Handler execution on ObjGroup
JacobDomagala Aug 12, 2024
f3cfd76
#2281: Fix issue with collective groups' default_group and root_node …
JacobDomagala Aug 13, 2024
ad66011
#2281: Initial work for using new allreduce within collections
JacobDomagala Aug 21, 2024
f8cbdba
#2281: Working Rabenseifner (without final handler) for collection
JacobDomagala Aug 25, 2024
f3b04c1
#2281: Add template param for Rabenseifner to determine how to procee…
JacobDomagala Aug 25, 2024
b397a71
#2281: Working handler execution
JacobDomagala Aug 26, 2024
e7a5a14
#2281: Make final handler in Rabenseifner a callback
JacobDomagala Aug 29, 2024
67ef917
#2281: Fix the issue with DataHan::reduce
JacobDomagala Aug 29, 2024
7fc0b0e
#2281: Make GroupManager and ObjManager also use callback for final h…
JacobDomagala Aug 29, 2024
5ad6779
#2281: Working perf test with collection
JacobDomagala Sep 2, 2024
6caa9ee
#2281: Setup perf test for groups and remove the parent proxy, as we …
JacobDomagala Sep 3, 2024
8b31099
#2281: Remove RabenseifnerGroup as we've now integrated groups and co…
JacobDomagala Sep 3, 2024
d32f954
#2281: Add unit tests for new callback (BcastCollective)
JacobDomagala Sep 3, 2024
2385cc3
#2281: Refactor Rabensifer a bit so it doesn't require that many temp…
JacobDomagala Sep 4, 2024
cab5c61
#2281: Implement state holder and use it for Rabenseifner
JacobDomagala Sep 10, 2024
01ec9d7
#2281: Implement clear function for StateHolder that frees dynamic me…
JacobDomagala Sep 10, 2024
edbe990
#2281: Working Rabenseifner with updated StateHolder
JacobDomagala Sep 12, 2024
d2c8a32
#2281: Typeless Rabensifner and small fixes to StateHolder
JacobDomagala Sep 12, 2024
9cdcc29
#2281: Typeless RecursiveDoubling and general code refactor/cleanup
JacobDomagala Sep 13, 2024
5d47273
#2281: RecursiveDoubling allreduce - add Collection support
JacobDomagala Sep 16, 2024
373ec66
#2281: Add unit tests for new Collection allreduce
JacobDomagala Sep 17, 2024
afb37ee
#2281: Small fixes and code cleanup
JacobDomagala Sep 18, 2024
fe39968
#2281: Update documentation for new allreduce algorithms
JacobDomagala Sep 18, 2024
605e315
#2281: Initial work for AllreduceHolder
JacobDomagala Sep 20, 2024
a5d0c00
#2281: Fix runtime Kokkos when we run with both device/host memory sp…
JacobDomagala Sep 20, 2024
ed00a21
#2281: Working AllreduceHolder with all (Collection/Group/ObjGroup) c…
JacobDomagala Sep 21, 2024
d2fd2b0
#2281: Fixed runtime issues with StateHolder generating extra ID
JacobDomagala Sep 22, 2024
4ff246d
#2281: Update license and add reduce_op.h file for Kokkos reduce oper…
JacobDomagala Sep 26, 2024
c9eeb7a
#2281: Always create allreducers for new collective group
JacobDomagala Sep 26, 2024
e1d47ec
#2281: New allreducers no longer require Objgroup for internal commun…
JacobDomagala Sep 27, 2024
47b2013
#2281: Code refactor and minor bug fixes
JacobDomagala Sep 30, 2024
80f6f6a
#2281: Sort out the PendingSends in new allreduce methods
JacobDomagala Sep 30, 2024
b2a85d1
#2281: Don't use explicit Kokkos::HostSpace for Kokkos::View in Raben…
JacobDomagala Oct 10, 2024
2ad52fc
#2281: Cleanup StateHolder (move implementation to .cc impl.h files)
JacobDomagala Oct 10, 2024
1a4be39
#2281: Resolve issue with incorrect index generated by StateHolder::g…
JacobDomagala Oct 11, 2024
a37a58e
#2281: use or remove unused variables
cwschilly Nov 5, 2024
ac6da86
#2281: add perf tests for MPI_Allreduce
cwschilly Dec 3, 2024
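Most of the commits above concern the Rabenseifner algorithm. As a rough orientation, the sketch below simulates its two phases (reduce-scatter by recursive halving, then allgather by recursive doubling) in a single process. It is an illustration under stated assumptions, not the implementation in this PR: the rank count is a power of two, every buffer has the same length divisible by the rank count, and the operator is an element-wise sum. `rabenseifnerSum` and `Buffer` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Buffer = std::vector<long>;

// Simulates a Rabenseifner-style allreduce (element-wise sum) for
// data.size() ranks. Assumptions: power-of-two rank count, equal buffer
// lengths, and a length divisible by the rank count.
std::vector<Buffer> rabenseifnerSum(std::vector<Buffer> data) {
  std::size_t const n = data.size();
  assert(n > 0 && (n & (n - 1)) == 0 && "rank count must be a power of two");
  std::size_t const len = data[0].size();
  assert(len % n == 0 && "buffer length must be divisible by rank count");

  // Window [lo, hi) of the buffer each rank is currently responsible for.
  std::vector<std::size_t> lo(n, 0), hi(n, len);

  // Phase 1: reduce-scatter by recursive halving. Each step halves the
  // window; a rank reduces only the half it keeps.
  for (std::size_t dist = n / 2; dist > 0; dist /= 2) {
    auto const prev = data;
    for (std::size_t rank = 0; rank < n; ++rank) {
      std::size_t const partner = rank ^ dist;
      std::size_t const mid = (lo[rank] + hi[rank]) / 2;
      // The rank with the `dist` bit clear keeps the lower half.
      if ((rank & dist) == 0) {
        hi[rank] = mid;
      } else {
        lo[rank] = mid;
      }
      for (std::size_t i = lo[rank]; i < hi[rank]; ++i) {
        data[rank][i] = prev[rank][i] + prev[partner][i];
      }
    }
  }

  // Phase 2: allgather by recursive doubling. Each step copies the
  // partner's fully reduced window and merges the two windows.
  for (std::size_t dist = 1; dist < n; dist *= 2) {
    auto const prev = data;
    auto const prev_lo = lo;
    auto const prev_hi = hi;
    for (std::size_t rank = 0; rank < n; ++rank) {
      std::size_t const partner = rank ^ dist;
      for (std::size_t i = prev_lo[partner]; i < prev_hi[partner]; ++i) {
        data[rank][i] = prev[partner][i];
      }
      lo[rank] = std::min(prev_lo[rank], prev_lo[partner]);
      hi[rank] = std::max(prev_hi[rank], prev_hi[partner]);
    }
  }
  return data;
}
```

Compared with plain recursive doubling, each rank sends only a shrinking slice of the payload during the first phase, which is why this scheme is usually preferred for large payloads.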
3 changes: 2 additions & 1 deletion src/CMakeLists.txt
@@ -34,6 +34,7 @@ set(
   collective/scatter/
   collective/reduce/
   collective/reduce/operators collective/reduce/functors
+  collective/reduce/allreduce
   elm
   group/id group/region group/global group/msg group/collective group/rooted
   group/base
@@ -98,7 +99,7 @@ set(
   serialization/messaging serialization/traits serialization/auto_dispatch
   serialization/sizing
   utils/demangle utils/container utils/bits utils/mutex utils/file_spec
-  utils/hash utils/atomic utils/static_checks utils/string
+  utils/hash utils/atomic utils/static_checks utils/string utils/kokkos
   utils/memory utils/mpi_limits utils/compress utils/json utils/strong
   registry/auto
   registry/auto/functor registry/auto/map registry/auto/collection
191 changes: 191 additions & 0 deletions src/vt/collective/reduce/allreduce/allreduce_holder.cc
@@ -0,0 +1,191 @@
/*
//@HEADER
// *****************************************************************************
//
// allreduce_holder.cc
// DARMA/vt => Virtual Transport
//
// Copyright 2019-2024 National Technology & Engineering Solutions of Sandia, LLC
// (NTESS). Under the terms of Contract DE-NA0003525 with NTESS, the U.S.
// Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//
// * Redistributions of source code must retain the above copyright notice,
// this list of conditions and the following disclaimer.
//
// * Redistributions in binary form must reproduce the above copyright notice,
// this list of conditions and the following disclaimer in the documentation
// and/or other materials provided with the distribution.
//
// * Neither the name of the copyright holder nor the names of its
// contributors may be used to endorse or promote products derived from this
// software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact darma@sandia.gov
//
// *****************************************************************************
//@HEADER
*/
#include "allreduce_holder.h"
#include "vt/objgroup/manager.h"
#include "state_holder.h"

namespace vt::collective::reduce::allreduce {

template <typename MapT>
static inline void removeImpl(MapT& map, uint64_t key) {
  auto it = map.find(key);
  if (it == map.end()) {
    return;
  }

  // Reuse the iterator instead of looking the key up again via at()/erase(key)
  auto& [rabenseifner, recursive_doubling] = it->second;

  // Deleting a null pointer is a no-op, so no explicit null checks are needed
  delete rabenseifner;
  delete recursive_doubling;

  map.erase(it);
}

Rabenseifner* AllreduceHolder::addRabensifnerAllreducer(
  detail::StrongVrtProxy strong_proxy, detail::StrongGroup strong_group,
  size_t num_elems) {
  auto const coll_proxy = strong_proxy.get();

  auto obj_proxy = new Rabenseifner(strong_proxy, strong_group, num_elems);

  col_reducers_[coll_proxy].first = obj_proxy;

  vt_debug_print(
    verbose, allreduce, "Adding new Rabenseifner reducer for collection={:x}\n",
    coll_proxy
  );

  return obj_proxy;
}

RecursiveDoubling* AllreduceHolder::addRecursiveDoublingAllreducer(
  detail::StrongVrtProxy strong_proxy, detail::StrongGroup strong_group,
  size_t num_elems) {
  auto const coll_proxy = strong_proxy.get();

  auto obj_proxy = new RecursiveDoubling(strong_proxy, strong_group, num_elems);

  col_reducers_[coll_proxy].second = obj_proxy;

  vt_debug_print(
    verbose, allreduce,
    "Adding new RecursiveDoubling reducer for collection={:x}\n", coll_proxy
  );

  return obj_proxy;
}

Rabenseifner*
AllreduceHolder::addRabensifnerAllreducer(detail::StrongGroup strong_group) {
  auto const group = strong_group.get();

  auto obj_proxy = new Rabenseifner(strong_group);

  group_reducers_[group].first = obj_proxy;

  vt_debug_print(
    verbose, allreduce,
    "Adding new Rabenseifner reducer for group={:x}\n", group
  );

  return obj_proxy;
}

RecursiveDoubling* AllreduceHolder::addRecursiveDoublingAllreducer(
  detail::StrongGroup strong_group) {
  auto const group = strong_group.get();

  auto obj_proxy = new RecursiveDoubling(strong_group);

  group_reducers_[group].second = obj_proxy;

  vt_debug_print(
    verbose, allreduce,
    "Adding new RecursiveDoubling reducer for group={:x}\n", group
  );

  return obj_proxy;
}

Rabenseifner* AllreduceHolder::addRabensifnerAllreducer(
  detail::StrongObjGroup strong_objgroup) {
  auto const objgroup = strong_objgroup.get();

  auto obj_proxy = new Rabenseifner(strong_objgroup);

  objgroup_reducers_[objgroup].first = obj_proxy;

  vt_debug_print(
    verbose, allreduce,
    "Adding new Rabenseifner reducer for objgroup={:x}\n", objgroup
  );

  return obj_proxy;
}

RecursiveDoubling* AllreduceHolder::addRecursiveDoublingAllreducer(
  detail::StrongObjGroup strong_objgroup) {
  auto const objgroup = strong_objgroup.get();

  auto obj_proxy = new RecursiveDoubling(strong_objgroup);

  objgroup_reducers_[objgroup].second = obj_proxy;

  vt_debug_print(
    verbose, allreduce,
    "Adding new RecursiveDoubling reducer for objgroup={:x}\n", objgroup
  );

  return obj_proxy;
}

void AllreduceHolder::remove(detail::StrongVrtProxy strong_proxy) {
  auto const key = strong_proxy.get();
  StateHolder::clearAll(strong_proxy);
  removeImpl(col_reducers_, key);
}

void AllreduceHolder::remove(detail::StrongGroup strong_group) {
  auto const key = strong_group.get();
  StateHolder::clearAll(strong_group);
  removeImpl(group_reducers_, key);
}

void AllreduceHolder::remove(detail::StrongObjGroup strong_objgroup) {
  auto const key = strong_objgroup.get();
  StateHolder::clearAll(strong_objgroup);
  removeImpl(objgroup_reducers_, key);
}

} // namespace vt::collective::reduce::allreduce
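The holder above stores raw pointers and frees them by hand in `removeImpl`. One possible alternative, shown here only as a sketch and not as what this PR does, is to let the map entries own the reducers through `std::unique_ptr`, so that erasing an entry releases both objects. `FakeRabenseifner`, `FakeRecursiveDoubling`, `OwnedReducerMap`, `addRabenseifner` and `removeReducers` are hypothetical stand-ins; the real reducer types and their constructors are not reproduced.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the real Rabenseifner / RecursiveDoubling types.
struct FakeRabenseifner {};
struct FakeRecursiveDoubling {};

using OwnedReducers = std::pair<
  std::unique_ptr<FakeRabenseifner>, std::unique_ptr<FakeRecursiveDoubling>>;
using OwnedReducerMap = std::unordered_map<std::uint64_t, OwnedReducers>;

// Registers a Rabenseifner stand-in for `key`; the map entry owns it.
// The assertion guards against silently replacing an existing reducer.
inline FakeRabenseifner*
addRabenseifner(OwnedReducerMap& map, std::uint64_t key) {
  auto& slot = map[key].first;
  assert(!slot && "a reducer is already registered for this key");
  slot = std::make_unique<FakeRabenseifner>();
  return slot.get();
}

// Erasing the entry runs both unique_ptr destructors, so no manual delete
// is required. Returns true if an entry existed for `key`.
inline bool removeReducers(OwnedReducerMap& map, std::uint64_t key) {
  return map.erase(key) > 0;
}
```

The trade-off is that callers receive non-owning raw pointers that become dangling once the entry is removed, which is the same lifetime contract as the current code.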
142 changes: 142 additions & 0 deletions src/vt/collective/reduce/allreduce/allreduce_holder.h
@@ -0,0 +1,142 @@
/*
//@HEADER
// *****************************************************************************
//
// allreduce_holder.h
// DARMA/vt => Virtual Transport
//
// Copyright 2019-2024 National Technology & Engineering Solutions of Sandia, LLC
// (NTESS). Under the terms of Contract DE-NA0003525 with NTESS, the U.S.
// Government retains certain rights in this software.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions are met:
//
// * Redistributions of source code must retain the above copyright notice,
// this list of conditions and the following disclaimer.
//
// * Redistributions in binary form must reproduce the above copyright notice,
// this list of conditions and the following disclaimer in the documentation
// and/or other materials provided with the distribution.
//
// * Neither the name of the copyright holder nor the names of its
// contributors may be used to endorse or promote products derived from this
// software without specific prior written permission.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
// AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
// LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
// CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
// SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
// INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
// CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
// ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
// POSSIBILITY OF SUCH DAMAGE.
//
// Questions? Contact darma@sandia.gov
//
// *****************************************************************************
//@HEADER
*/

#if !defined INCLUDED_VT_COLLECTIVE_REDUCE_ALLREDUCE_ALLREDUCE_HOLDER_H
#define INCLUDED_VT_COLLECTIVE_REDUCE_ALLREDUCE_ALLREDUCE_HOLDER_H

#include "vt/configs/types/types_type.h"
#include "vt/collective/reduce/allreduce/type.h"
#include "vt/collective/reduce/scoping/strong_types.h"
#include "vt/objgroup/proxy/proxy_objgroup.h"

#include <cstdint>
#include <unordered_map>
#include <utility>

namespace vt::collective::reduce::allreduce {

struct Rabenseifner;
struct RecursiveDoubling;

struct AllreduceHolder {
  using RabenseifnerProxy = ObjGroupProxyType;
  using RecursiveDoublingProxy = ObjGroupProxyType;

  template <typename ReducerT>
  static decltype(auto) getAllreducer(detail::StrongVrtProxy strong_proxy);

  template <typename ReducerT>
  static decltype(auto) getAllreducer(detail::StrongGroup strong_group);

  template <typename ReducerT>
  static decltype(auto) getAllreducer(detail::StrongObjGroup strong_objgroup);

  template <typename ReducerT>
  static decltype(auto) getOrCreateAllreducer(
    detail::StrongVrtProxy strong_proxy, detail::StrongGroup strong_group,
    size_t num_elems);

  template <typename ReducerT>
  static decltype(auto) getOrCreateAllreducer(detail::StrongGroup strong_group);

  template <typename ReducerT>
  static decltype(auto)
  getOrCreateAllreducer(detail::StrongObjGroup strong_objgroup);

  static void remove(detail::StrongVrtProxy strong_proxy);
  static void remove(detail::StrongGroup strong_group);
  static void remove(detail::StrongObjGroup strong_objgroup);

private:
  template <typename ReducerT, typename MapT>
  static decltype(auto) getAllreducerImpl(MapT& map, uint64_t id);

  template <typename ReducerT, typename MapT, typename... Args>
  static decltype(auto)
  getOrCreateAllreducerImpl(MapT& map, uint64_t id, Args&&... args);

  static Rabenseifner* addRabensifnerAllreducer(
    detail::StrongVrtProxy strong_proxy, detail::StrongGroup strong_group,
    size_t num_elems);

  static RecursiveDoubling* addRecursiveDoublingAllreducer(
    detail::StrongVrtProxy strong_proxy, detail::StrongGroup strong_group,
    size_t num_elems);

  static Rabenseifner*
  addRabensifnerAllreducer(detail::StrongGroup strong_group);
  static RecursiveDoubling*
  addRecursiveDoublingAllreducer(detail::StrongGroup strong_group);

  static Rabenseifner*
  addRabensifnerAllreducer(detail::StrongObjGroup strong_objgroup);
  static RecursiveDoubling*
  addRecursiveDoublingAllreducer(detail::StrongObjGroup strong_objgroup);

  // One (Rabenseifner, RecursiveDoubling) pair per collection, group and
  // objgroup; either pointer may be null until the reducer is created.
  static inline std::unordered_map<
    VirtualProxyType, std::pair<Rabenseifner*, RecursiveDoubling*>>
    col_reducers_ = {};
  static inline std::unordered_map<
    GroupType, std::pair<Rabenseifner*, RecursiveDoubling*>>
    group_reducers_ = {};
  static inline std::unordered_map<
    ObjGroupProxyType, std::pair<Rabenseifner*, RecursiveDoubling*>>
    objgroup_reducers_ = {};
};

// Dispatch on the component kind to the matching strongly typed overload.
// A function template in a header does not need `static`.
template <typename ReducerT>
inline auto* getAllreducer(ComponentInfo type) {
  if (type.first == ComponentT::VrtColl) {
    return AllreduceHolder::getAllreducer<ReducerT>(
      detail::StrongVrtProxy{type.second});
  } else if (type.first == ComponentT::ObjGroup) {
    return AllreduceHolder::getAllreducer<ReducerT>(
      detail::StrongObjGroup{type.second});
  } else {
    return AllreduceHolder::getAllreducer<ReducerT>(
      detail::StrongGroup{type.second});
  }
}

} // namespace vt::collective::reduce::allreduce

#include "vt/collective/reduce/allreduce/allreduce_holder.impl.h"

#endif /*INCLUDED_VT_COLLECTIVE_REDUCE_ALLREDUCE_ALLREDUCE_HOLDER_H*/
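The header declares `getOrCreateAllreducerImpl`, but its definition lives in `allreduce_holder.impl.h`, which is not part of the diff shown here. The sketch below shows one way a lookup-or-create over a map of reducer pairs could work; it is an assumption about the general shape, not the PR's actual definition. `FakeRabenseifner`, `FakeRecursiveDoubling`, `FakeMap`, `slotFor` and `getOrCreate` are hypothetical names, and the stand-in types take a plain `int` instead of the real constructor arguments.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for the real reducer types.
struct FakeRabenseifner { int id; };
struct FakeRecursiveDoubling { int id; };

using FakeMap = std::unordered_map<
  std::uint64_t, std::pair<FakeRabenseifner*, FakeRecursiveDoubling*>>;

// Picks the pair member that matches the requested reducer type.
template <typename ReducerT>
ReducerT*& slotFor(FakeMap::mapped_type& entry) {
  if constexpr (std::is_same_v<ReducerT, FakeRabenseifner>) {
    return entry.first;
  } else {
    static_assert(std::is_same_v<ReducerT, FakeRecursiveDoubling>);
    return entry.second;
  }
}

// Returns the existing reducer for `id`, or creates one from `args`.
// operator[] value-initializes a missing entry, so both pointers start null.
template <typename ReducerT, typename... Args>
ReducerT* getOrCreate(FakeMap& map, std::uint64_t id, Args&&... args) {
  ReducerT*& slot = slotFor<ReducerT>(map[id]);
  if (slot == nullptr) {
    slot = new ReducerT{std::forward<Args>(args)...};
  }
  assert(slot != nullptr);
  return slot;
}
```

A second call with the same `id` and reducer type returns the first object and ignores the new arguments, so callers should not rely on later arguments taking effect.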