CORE: fix score update when only score given #779

Merged

Sergei-Lebedev
Contributor

What

Fixes the UCC score table when the user provides a tuning score for a TL/CL and only the total score is given.
For example, UCC_TL_SHARP_TUNE=inf results in scatter, allgather, and alltoall collectives (not supported by SHARP) being assigned to TL_SHARP in the UCC score table.

with fix
       ucc_team.c:471  UCC  INFO  ===== COLL_SCORE_MAP (team_id 32768, size 2) =====
ucc_coll_score_map.c:201  UCC  INFO  Allgather:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..4095}:TL_UCP:10 {4K..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Allgatherv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Allreduce:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..4095}:TL_SHARP:10 {4K..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoall:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..257}:TL_UCP:10 {258..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoallv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Barrier:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Bcast:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanin:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanout:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Gather:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Gatherv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatter:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatterv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
ucc_coll_score_map.c:201  UCC  INFO  Scatterv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_UCP:10
       ucc_team.c:474  UCC  INFO  ================================================
without fix
       ucc_team.c:472  UCC  INFO  ===== COLL_SCORE_MAP (team_id 32771, size 2) =====
ucc_coll_score_map.c:201  UCC  INFO  Allgather:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..4095}:TL_SHARP:10 {4K..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Allgatherv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Allreduce:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..4095}:TL_SHARP:10 {4K..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoall:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..257}:TL_SHARP:10 {258..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Alltoallv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Barrier:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Bcast:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanin:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Fanout:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Gather:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Gatherv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatter:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Reduce_scatterv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Scatter:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO  Scatterv:
ucc_coll_score_map.c:201  UCC  INFO        Host: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Cuda: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        CudaManaged: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        Rocm: {0..inf}:TL_SHARP:10
ucc_coll_score_map.c:201  UCC  INFO        RocmManaged: {0..inf}:TL_SHARP:10
       ucc_team.c:474  UCC  INFO  ================================================

How?

Pass the collectives supported by the TL to the tuning string parser, so that a score-only tuning value is applied only to those collectives.
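
To illustrate the idea, here is a minimal sketch in C of masking a score-only tuning value by the collectives a TL supports. It is a toy model, not the UCC implementation: `toy_score_t`, `toy_apply_total_score`, `toy_count_scored`, and the `TOY_COLL_*` bits are hypothetical names invented for this example, and the real score code works on per-memory-type message ranges rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical collective-type bits, for illustration only.
 * These are NOT the UCC definitions. */
enum {
    TOY_COLL_ALLGATHER = 1u << 0,
    TOY_COLL_ALLREDUCE = 1u << 1,
    TOY_COLL_ALLTOALL  = 1u << 2,
    TOY_COLL_BARRIER   = 1u << 3,
    TOY_COLL_BCAST     = 1u << 4,
    TOY_COLL_SCATTER   = 1u << 5,
    TOY_COLL_LAST      = 1u << 6,
    TOY_COLL_ALL       = TOY_COLL_LAST - 1
};

#define TOY_N_COLLS 6

/* Toy score table: one score per collective, 0 means "no score set". */
typedef struct {
    unsigned score[TOY_N_COLLS];
} toy_score_t;

/* Apply a score-only tuning value to every collective whose bit is set
 * in supported_colls. Collectives outside the mask are left untouched. */
static void toy_apply_total_score(toy_score_t *t, unsigned score,
                                  uint32_t supported_colls)
{
    unsigned i;

    assert(t != NULL);
    for (i = 0; i < TOY_N_COLLS; i++) {
        if (supported_colls & (1u << i)) {
            t->score[i] = score;
        }
    }
}

/* Count how many collectives currently have a non-zero score. */
static unsigned toy_count_scored(const toy_score_t *t)
{
    unsigned i, n = 0;

    assert(t != NULL);
    for (i = 0; i < TOY_N_COLLS; i++) {
        if (t->score[i] != 0) {
            n++;
        }
    }
    return n;
}
```

Calling `toy_apply_total_score(&t, 10, TOY_COLL_ALLREDUCE | TOY_COLL_BARRIER | TOY_COLL_BCAST)` on a zeroed table sets only those three entries, which mirrors the "with fix" table above. Passing `TOY_COLL_ALL` instead mirrors the "without fix" behavior, where every collective receives the score.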

@Sergei-Lebedev
Contributor Author

bot:retest

@Sergei-Lebedev force-pushed the topic/fix_score_coll_update branch from 8faf086 to 771ac8d on May 20, 2023 03:53
ucc_coll_score_alloc_from_str.
Return values:
UCC_OK - input alg_id can be correctly mapped to the "init" fn
UCC_ERR_NOT_SUPPORTED - CL/TL does allow changing algorithms ids for
Collaborator

typo - should be "doesn't" instead of "does"

int mt_n);

ucc_status_t
ucc_coll_score_update_from_str(const char *str,
Collaborator

Once merged, needs to be updated in TL/SHM as well.

@@ -34,7 +34,8 @@ UCC_TEST_F(test_score_update, non_overlap)
UCC_COLL_TYPE_BARRIER);
init_score(update, RLIST({RANGE(10, 20, 100), RANGE(30, 35, 1)}),
UCC_COLL_TYPE_BARRIER);
-EXPECT_EQ(UCC_OK, ucc_coll_score_update(score, update, 0, NULL, 0));
+EXPECT_EQ(UCC_OK, ucc_coll_score_update(score, update, 0, NULL, 0,
+                                        UCC_COLL_TYPE_ALL));
Collaborator

Will UCC_COLL_TYPE_ALL work for TL/SHM as well? Not all colls are supported there - will it fall back onto other TL?

@Sergei-Lebedev force-pushed the topic/fix_score_coll_update branch 2 times, most recently from f4d0aae to fe29e9c on June 2, 2023 08:31
@Sergei-Lebedev force-pushed the topic/fix_score_coll_update branch from fe29e9c to d06f9d6 on June 12, 2023 07:13
@Sergei-Lebedev merged commit fac619e into openucx:master on Jun 12, 2023
@Sergei-Lebedev deleted the topic/fix_score_coll_update branch on June 12, 2023 11:12
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024