-
Notifications
You must be signed in to change notification settings - Fork 281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
coll: add coll_group to collective interfaces #7103
base: main
Are you sure you want to change the base?
Conversation
295e2e0
to
736c1d1
Compare
test:mpich/ch4/most Only |
cefbfd9
to
d8d297a
Compare
test:mpich/ch4/most |
test:mpich/ch4/most 1 failure - |
Try get a clean test: test:mpich/ch4/ucx |
What is meant by |
When we use
|
b3477a3
to
b6cecd0
Compare
test:mpich/ch4/most Only 2 timouts in
|
Store num_local and num_external in MPIR_Comm. Along with internode_table, they help construct internode subgroups.
This is the same as num_external.
It does not take many instructions to calculate pof2 on the fly. Use of hard coded pof2 prevents collective algorithms to be used for non-trivial coll_group.
Lightweight struct to describe sub-groups of a communicator. They intend to replace the subcomms. Preset a set of reserved subgroups to simplify common usages such as intranode group and crossnode group. Since we only expect limited number of dynamic subgroups and they should always be push/pop'ed within the scope, we don't need many dynamic slots.
Group collectives will have non-trivial coll_group that alter the rank and size of the communicator. Thease macros and functions will facilitate it.
Add coll_group, index to comm->subgroups[], to all collectives except neighborhood collectives.
Assuming the device layer collectives are not able to handle non-trivial coll_group, always fallback when coll_group != MPIR_SUBGROUP_NONE, for now. Also normalize the code style to use the fallback label. We should always fallback to mpir impl routines rather than the netmod routines (composition_beta). The composition_beta may fallback in the future when netmod coll become fancy, resulting in deadloop.
Make csel coll_group aware.
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This allows compositional collectives under threadcomm.
We call MPIR_Comm_is_parent_comm to prevent recursively entering compositional algorithms such as the _smp algorithms. Check coll_group as well as we will switch to use subgroup rather than subcomms. Also check num_external directly for trivial comm. Subcomms and comm->hierarchy_kind will be removed in the future.
Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral. Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent. It adds an assertion on coll_group == MPIR_SUBGROUPS_NONE since coll_group may alter the topology assumptions. Intercomm does not work with non-zero coll_group.
Replace the usage of subcomms with subgroups.
When root is not local rank 0, instead of adding a extra intra-node send/recv or bcast, construct an inter group that includes the root process.
Directly use information from MPIR_Process rather than from nodecomm in MPIR_Process. One step toward removing subcomms.
Use a single "cached_tree" rather than 3 different fields for each tree type.
The topology-aware tree utilities need check coll_group for correct world ranks.
Some algorithm, e.g. Allgather recexch, caches comm size-related info in communicator, thus won't work with none trivial coll_group. Add a restriction so it will fallback when coll_group != MPIR_SUBGROUP_NONE.
All subgroup collectives should use the same tag within the parent collectives. This is because all processes in the communicator has to agree on the tag to use, but group collectives may not involve all processes. It is okay to use the same tag as long as the group collectives are always issued in order. This is the case since all group collectives are spawned under a parent collective, which has to obey the non-overlapping rule.
Because the compiler can't figure out the arithmetic, it is warning: ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=] Refactor to suppress warning and for better readability.
Commit ba1b4dd left an empty branch that should be removed.
331eb14
to
10adb96
Compare
test:mpich/pmi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any strong objections to these changes at this stage. What would be good to know is the scope of additional changes coming for 4.3 based on the new interface. Comparisons vs. the previous implementation would also help, especially for review. Please add a commit summarizing the work to CHANGES.
src/mpi/coll/bcast/bcast_intra_smp.c
Outdated
if (local_rank == local_root) { | ||
mpi_errno = MPIR_Bcast(buffer, count, datatype, inter_root, | ||
comm_ptr, node_roots_group, errflag); | ||
MPIR_ERR_CHECK(mpi_errno); | ||
} | ||
|
||
/* perform the intranode broadcast on all except for the root's node */ | ||
/* perform the intranode broadcast */ | ||
if (local_size > 1) { | ||
mpi_errno = MPIR_Bcast(buffer, count, datatype, 0, comm_ptr, node_group, errflag); | ||
MPIR_ERR_CHECK(mpi_errno); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely appreciate the simplicity of this approach vs the extra messaging. Can we capture any data to confirm this is not any slower vs the previous code with non-zero root?
@@ -643,14 +643,14 @@ static csel_node_s *prune_tree(csel_node_s * root, MPIR_Comm * comm_ptr) | |||
else | |||
node = node->success; | |||
break; | |||
|
|||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
out of curiousity what is the eventual change to this block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we will no longer have subcomm, we will only remove hierarchy kind, and there is no point to do hierarchy kind check. We need check the current usages and rename the check. If it is checking for whether it's subgroup, we can remove the prune here; it has to be checked dynamically (on whether coll_group
is 0
). If it is checking for topology, we should replace the check with something more specific.
Pull Request Description
Make all (most) collective algorithms able to work within a subgroup.
MPIR_Subgroup
MPIR_Subgroup
differs fromMPIR_Group
as the latter does not live inside a communicator, thus overly complex and inefficient to use.MPIR_SUBGROUP_NONE
in place ofcoll_group
argument will provide backward collective semantics, i.e. the whole communicator collective.become
One of the goals of this PR is to make all
mpir-
layer intra-collectivescoll_group
aware.MPIR_SUBGROUP_NONE
(no group inter collectives)_smp
) only work withMPIR_SUBGROUP_NONE
, algo selection need make sure not to create recursive compositional situation[skip warnings]
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your companies PR approval manager.