coll: add coll_group to collective interfaces #7103
base: main
Commits on Sep 14, 2024
comm: store num_local and num_external in MPIR_Comm
Store num_local and num_external in MPIR_Comm. Along with internode_table, they help construct internode subgroups.
Commit 0499415
Commit b7d6412
Commit cae3828
It does not take many instructions to calculate pof2 on the fly. A hard-coded pof2 prevents collective algorithms from being used with a non-trivial coll_group.
Commit 438e7b8
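As an illustration of how cheap the on-the-fly calculation is, here is a minimal sketch that computes the largest power of two not exceeding a given size. The helper name `largest_pof2_le` is hypothetical and not taken from the MPICH source; the point is only that the value can be derived from whatever size the current coll_group reports.

```c
#include <assert.h>

/* Largest power of two that is <= n, for n >= 1.
 * Hypothetical helper name, for illustration only. */
static int largest_pof2_le(int n)
{
    assert(n >= 1);
    int pof2 = 1;
    /* Compare against n / 2 so that pof2 * 2 can never overflow. */
    while (pof2 <= n / 2) {
        pof2 *= 2;
    }
    return pof2;
}
```

The loop runs at most log2(n) times, so recomputing it per call costs little compared with a collective operation.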
Add a lightweight struct to describe subgroups of a communicator. Subgroups are intended to replace the subcomms. A set of reserved subgroups is preset to simplify common usages such as the intranode group and the crossnode group. Since we expect only a limited number of dynamic subgroups, and they should always be pushed and popped within the same scope, we don't need many dynamic slots.
Commit e7c88bd
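The following is a rough sketch of what such a subgroup table could look like: a few reserved slots plus a small stack of dynamic slots. Every name here (`subgroup_t`, `comm_t`, `subgroup_push`, `subgroup_pop`, the `SUBGROUP_*` constants) and the choice of two dynamic slots are assumptions made for illustration; the actual struct layout in the PR may differ.

```c
#include <stddef.h>

/* All names below are hypothetical, chosen for illustration only. */
enum {
    SUBGROUP_NONE = 0,          /* the whole communicator */
    SUBGROUP_NODE,              /* reserved: processes on the same node */
    SUBGROUP_NODE_CROSS,        /* reserved: one process per node */
    SUBGROUP_NUM_RESERVED,
    SUBGROUP_MAX_DYNAMIC = 2,   /* assumed: few slots, push/pop stays in scope */
    SUBGROUP_MAX = SUBGROUP_NUM_RESERVED + SUBGROUP_MAX_DYNAMIC
};

typedef struct {
    int size;                   /* number of processes in the subgroup */
    int rank;                   /* this process's rank in it, -1 if unset */
    const int *proc_table;      /* comm ranks of the members, NULL if unset */
} subgroup_t;

typedef struct {
    int rank, size;
    int num_subgroups;          /* reserved slots plus live dynamic slots */
    subgroup_t subgroups[SUBGROUP_MAX];
} comm_t;

static void comm_init(comm_t *comm, int rank, int size)
{
    comm->rank = rank;
    comm->size = size;
    comm->num_subgroups = SUBGROUP_NUM_RESERVED;
    for (int i = 0; i < SUBGROUP_MAX; i++) {
        comm->subgroups[i].size = 0;
        comm->subgroups[i].rank = -1;
        comm->subgroups[i].proc_table = NULL;
    }
    comm->subgroups[SUBGROUP_NONE].size = size;
    comm->subgroups[SUBGROUP_NONE].rank = rank;
}

/* Push a dynamic subgroup; returns its index, or -1 if no slot is free. */
static int subgroup_push(comm_t *comm, int size, int rank, const int *proc_table)
{
    if (comm->num_subgroups >= SUBGROUP_MAX) {
        return -1;
    }
    int idx = comm->num_subgroups++;
    comm->subgroups[idx].size = size;
    comm->subgroups[idx].rank = rank;
    comm->subgroups[idx].proc_table = proc_table;
    return idx;
}

/* Pop the most recently pushed dynamic subgroup; reserved slots stay. */
static void subgroup_pop(comm_t *comm)
{
    if (comm->num_subgroups > SUBGROUP_NUM_RESERVED) {
        comm->num_subgroups--;
    }
}
```

Because dynamic subgroups behave like a stack, the index returned by `subgroup_push` can be passed as coll_group to nested collectives and released with `subgroup_pop` before the enclosing function returns.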
coll: add macros to get rank/size with coll_group
Group collectives will have a non-trivial coll_group that alters the rank and size of the communicator. These macros and functions facilitate that.
Commit 2b88398
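A minimal sketch of how a rank/size macro could dispatch on coll_group is shown below. The macro name `COLL_RANK_SIZE`, the simplified `comm_t` and `subgroup_t` types, and the `ring_right_neighbor` example are stand-ins invented for this sketch, not the definitions added by the commit.

```c
/* Hypothetical, simplified types for illustration only. */
enum { SUBGROUP_NONE = 0, SUBGROUP_MAX = 4 };

typedef struct {
    int size;
    int rank;
} subgroup_t;

typedef struct {
    int rank;
    int size;
    subgroup_t subgroups[SUBGROUP_MAX];
} comm_t;

/* Fetch rank and size for the given coll_group: the communicator's own
 * values for SUBGROUP_NONE, otherwise the values stored in the subgroup. */
#define COLL_RANK_SIZE(comm_, coll_group_, rank_, size_) \
    do { \
        if ((coll_group_) == SUBGROUP_NONE) { \
            (rank_) = (comm_)->rank; \
            (size_) = (comm_)->size; \
        } else { \
            (rank_) = (comm_)->subgroups[(coll_group_)].rank; \
            (size_) = (comm_)->subgroups[(coll_group_)].size; \
        } \
    } while (0)

/* Example of a topology-neutral computation: the right neighbor in a
 * ring, expressed in the rank numbering of the selected group. */
static int ring_right_neighbor(const comm_t *comm, int coll_group)
{
    int rank, size;
    COLL_RANK_SIZE(comm, coll_group, rank, size);
    return (rank + 1) % size;
}
```

An algorithm written against the macro needs no other change to run on a subgroup, as long as it only relies on rank and size.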
coll: add coll_group argument to coll interfaces
Add coll_group, an index into comm->subgroups[], to all collectives except neighborhood collectives.
Commit e804cbc
Commit e3969cc
Commit 024377a
Commit 6338e01
ch4: fallback to mpir if coll_group is non-zero
Assuming the device-layer collectives are not able to handle a non-trivial coll_group, always fall back when coll_group != MPIR_SUBGROUP_NONE, for now. Also normalize the code style to use the fallback label. We should always fall back to the mpir impl routines rather than the netmod routines (composition_beta). The composition_beta routines may themselves fall back in the future when netmod collectives become more elaborate, resulting in an infinite loop.
Commit 049fab4
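The fallback-label style described above might look roughly like the sketch below. The function `device_bcast`, the predicate `device_can_handle`, and the `coll_path_t` return type are hypothetical; the sketch only reports which path would be taken instead of performing a broadcast.

```c
/* Hypothetical names throughout; this only models the control flow. */
enum { SUBGROUP_NONE = 0 };

typedef enum { PATH_DEVICE, PATH_GENERIC } coll_path_t;

/* Stand-in for whatever conditions the device layer checks. */
static int device_can_handle(int count)
{
    return count <= 1024;
}

static coll_path_t device_bcast(int count, int coll_group)
{
    /* The device path is assumed not to understand subgroups. */
    if (coll_group != SUBGROUP_NONE) {
        goto fallback;
    }
    if (!device_can_handle(count)) {
        goto fallback;
    }
    /* The device-specific algorithm would run here. */
    return PATH_DEVICE;

  fallback:
    /* Single exit to the generic implementation, never to another
     * device-level composition that could fall back again. */
    return PATH_GENERIC;
}
```

Routing every bail-out through one label makes it easy to verify that the fallback target is always the generic implementation.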
Commit 98fc2fc
coll: threadcomm coll to use MPIR_SUBGROUP_THREADCOMM
Use coll_group=MPIR_SUBGROUP_THREADCOMM for threadcomm collectives. This allows compositional collectives under threadcomm.
Commit 20b3244
coll: check coll_group in MPIR_Comm_is_parent_comm
We call MPIR_Comm_is_parent_comm to prevent recursively entering compositional algorithms such as the _smp algorithms. Check coll_group as well, since we will switch to using subgroups rather than subcomms. Also check num_external directly to detect a trivial comm. Subcomms and comm->hierarchy_kind will be removed in the future.
Commit a831867
coll: make non-compositional algorithm coll_group aware
Use MPIR_COLL_RANK_SIZE if the algorithm is topology neutral. Use MPIR_COLL_RANK_SIZE_NO_GROUP if the algorithm is topology dependent; it adds an assertion that coll_group == MPIR_SUBGROUP_NONE, since coll_group may alter the topology assumptions. Intercomm does not work with a non-zero coll_group.
Commit d2a6412
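To make the "no group" variant concrete, here is a small sketch of a macro that asserts the trivial coll_group before reading rank and size. The names `COLL_RANK_SIZE_NO_GROUP`, `comm_t`, and `first_rank_on_my_node`, as well as the block-wise rank layout used in the example, are assumptions for illustration and are not copied from the commit.

```c
#include <assert.h>

/* Hypothetical, simplified types for illustration only. */
enum { SUBGROUP_NONE = 0 };

typedef struct {
    int rank;
    int size;
} comm_t;

/* For topology-dependent algorithms: refuse any non-trivial coll_group,
 * then read rank and size straight from the communicator. */
#define COLL_RANK_SIZE_NO_GROUP(comm_, coll_group_, rank_, size_) \
    do { \
        assert((coll_group_) == SUBGROUP_NONE); \
        (rank_) = (comm_)->rank; \
        (size_) = (comm_)->size; \
    } while (0)

/* Example of a topology-dependent computation: it assumes comm ranks are
 * laid out block-wise across nodes, which a subgroup could invalidate. */
static int first_rank_on_my_node(const comm_t *comm, int coll_group,
                                 int ranks_per_node)
{
    int rank, size;
    COLL_RANK_SIZE_NO_GROUP(comm, coll_group, rank, size);
    (void) size;    /* unused in this example */
    return rank - rank % ranks_per_node;
}
```

The assertion turns a silent wrong-answer case into an immediate failure when such an algorithm is invoked with a subgroup.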
coll: modify bcast_intra_smp to use subgroups
Replace the usage of subcomms with subgroups.
Commit a2f92c4
coll: avoid extra intra bcast in bcast_intra_smp
When root is not local rank 0, instead of adding an extra intra-node send/recv or bcast, construct an inter group that includes the root process.
Commit 370661e
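One way to picture the inter group is as a list with one representative per node, where the root's node is represented by the root itself. The sketch below assumes a per-rank node id array and uses the lowest rank on each node as the default representative; the function name `build_inter_group` and its parameters are hypothetical.

```c
/* Fill leaders[n] with the comm rank that represents node n in the
 * inter-node group. Hypothetical helper, for illustration only.
 *   node_of[r] : node id of comm rank r, in [0, num_nodes)
 *   root       : comm rank of the broadcast root */
static void build_inter_group(const int *node_of, int size, int num_nodes,
                              int root, int *leaders)
{
    for (int n = 0; n < num_nodes; n++) {
        leaders[n] = -1;
    }
    /* Default representative: the lowest rank on each node. */
    for (int r = 0; r < size; r++) {
        if (leaders[node_of[r]] < 0) {
            leaders[node_of[r]] = r;
        }
    }
    /* Let the root represent its own node, so the data does not need an
     * extra intra-node hop before the inter-node phase. */
    leaders[node_of[root]] = root;
}
```

With six ranks on three nodes (two per node) and root 3, the representatives would be ranks 0, 3, and 4, so the inter-node phase can start directly from the root.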
Commit ae6fe4e
mpir: replace subcomm usage with subgroups
Directly use information from MPIR_Process rather than from nodecomm in MPIR_Process. One step toward removing subcomms.
Commit 0ba1a80
Commit 3543718
coll: refactor caching tree in the comm struct
Use a single "cached_tree" rather than 3 different fields for each tree type.
Commit 7ea94c7
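As a loose illustration of a single cache slot shared by several tree types, the sketch below tags the cached entry with its type and parameters and treats any mismatch as a miss. The type names, the fields, and the use of a single `parent` integer as the payload are all assumptions; the real cached tree in the communicator holds more than this.

```c
#include <stdbool.h>

/* Hypothetical names and fields, for illustration only. */
typedef enum { TREE_NONE = 0, TREE_KARY, TREE_KNOMIAL } tree_type_t;

typedef struct {
    tree_type_t type;   /* TREE_NONE means the slot is empty */
    int root;
    int k;
    int parent;         /* stand-in for the real tree payload */
} cached_tree_t;

/* Returns true and fills *parent on a hit; false on any mismatch. */
static bool cached_tree_lookup(const cached_tree_t *cache, tree_type_t type,
                               int root, int k, int *parent)
{
    if (cache->type == TREE_NONE || cache->type != type ||
        cache->root != root || cache->k != k) {
        return false;
    }
    *parent = cache->parent;
    return true;
}

/* Overwrites whatever was cached before. */
static void cached_tree_store(cached_tree_t *cache, tree_type_t type,
                              int root, int k, int parent)
{
    cache->type = type;
    cache->root = root;
    cache->k = k;
    cache->parent = parent;
}
```

A single tagged slot trades a possible extra rebuild when tree types alternate for a smaller communicator struct and one code path for cache management.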
coll: add coll_group to treealgo routines
The topology-aware tree utilities need to check coll_group for correct world ranks.
Commit ce2274d
coll: add nogroup restriction to certain algorithms
Some algorithms, e.g. Allgather recexch, cache comm size-related info in the communicator and thus won't work with a non-trivial coll_group. Add a restriction so that they fall back when coll_group != MPIR_SUBGROUP_NONE.
Commit 2bf4890
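A restriction of this kind can be pictured as a per-algorithm flag that the selection logic consults. The sketch below is only a model: the `algo_t` struct, the `nogroup_only` flag, and `select_algo` are hypothetical names and do not reflect how MPICH's algorithm selection is actually implemented.

```c
#include <stdbool.h>

/* Hypothetical names, for illustration only. */
enum { SUBGROUP_NONE = 0 };

typedef struct {
    const char *name;
    bool nogroup_only;  /* true if the algorithm caches comm-size info */
} algo_t;

/* Return the index of the first algorithm allowed for this coll_group,
 * or -1 if none qualifies. */
static int select_algo(const algo_t *algos, int num_algos, int coll_group)
{
    for (int i = 0; i < num_algos; i++) {
        if (algos[i].nogroup_only && coll_group != SUBGROUP_NONE) {
            continue;   /* restricted: skip, i.e. fall back to the next */
        }
        return i;
    }
    return -1;
}
```

With a preference list of a restricted algorithm followed by an unrestricted one, the full communicator picks the first entry while any subgroup picks the second.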
coll: check coll_group in MPIR_Sched_next_tag
All subgroup collectives should use the same tag within the parent collective. This is because all processes in the communicator have to agree on the tag to use, but group collectives may not involve all processes. It is okay to use the same tag as long as the group collectives are always issued in order. This is the case since all group collectives are spawned under a parent collective, which has to obey the non-overlapping rule.
Commit 757066a
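The tag rule above can be modeled with a counter that only advances for collectives on the full communicator. In the sketch below, `comm_t`, its two fields, and `sched_next_tag` are hypothetical, and tag wrap-around is deliberately left out; it is one possible reading of the commit message, not the actual implementation.

```c
/* Hypothetical names; tag wrap-around is omitted for brevity. */
enum { SUBGROUP_NONE = 0 };

typedef struct {
    int current_tag;    /* tag handed to the most recent parent collective */
    int next_tag;       /* next unused tag */
} comm_t;

static int sched_next_tag(comm_t *comm, int coll_group)
{
    if (coll_group != SUBGROUP_NONE) {
        /* Subgroup collective: reuse the parent's tag, so processes
         * outside the subgroup do not drift out of sync. */
        return comm->current_tag;
    }
    /* Parent collective: every process in the comm advances together. */
    comm->current_tag = comm->next_tag++;
    return comm->current_tag;
}
```

Since only full-communicator collectives advance the counter, a process that sits out a subgroup collective still agrees with its peers on the tag of the next parent collective.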
coll: refactor barrier_intra_k_dissemination
Because the compiler can't figure out the arithmetic, it warns: ‘MPIC_Waitall’ accessing 8 bytes in a region of size 0 [-Wstringop-overflow=]. Refactor to suppress the warning and improve readability.
Commit 513991d
coll/allreduce: remove a leftover empty branch
Commit ba1b4dd left an empty branch that should be removed.
Commit 10adb96