CL/HIER: fix allreduce rab pipeline #759

Merged

Conversation

Sergei-Lebedev
Contributor

What

Fixes a segfault in the reduce-allreduce-broadcast (rab) hierarchical allreduce algorithm when pipelining is used.
Also fixes the coll trace print for pipelined schedules.

How?

For sequential and ordered pipelines, the start dependency of a pipeline fragment was set only on the previous fragment, so the completion of one fragment immediately triggered the start of the next fragment even if the schedule was already completed. This PR adds a dependency on the schedule start to prevent this.
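Conceptually, the wiring looks roughly like the sketch below. It reuses the helpers visible in the review excerpts further down (UCC_CHECK_GOTO, ucc_schedule_add_task, ucc_task_subscribe_dep, ucc_event_manager_subscribe, ucc_task_start_handler, UCC_EVENT_SCHEDULE_STARTED); the function name and the exact argument shapes are assumptions for illustration, not the PR's literal diff.

/* Illustrative sketch only -- helper shapes assumed, not the PR's diff.
 * tasks[] are the steps of one rab fragment schedule; tasks[0] is assumed
 * to be subscribed to the schedule start before this loop.  Assumes the
 * CL/HIER headers that allreduce_rab.c already includes. */
static ucc_status_t wire_frag_deps_sketch(ucc_schedule_t *schedule,
                                          ucc_coll_task_t **tasks,
                                          int n_tasks, int n_frags)
{
    ucc_status_t status = UCC_OK;
    int          i;

    for (i = 1; i < n_tasks; i++) {
        UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, tasks[i]), out, status);
        if (n_frags > 1) {
            /* pipelined: counted dependencies -- tasks[i] starts only once
             * BOTH its predecessor completed AND the schedule was
             * (re)started, so finishing a fragment cannot retrigger a start
             * after the schedule is already done.  This is safe because
             * schedule_pipelined_post resets the satisfied-deps counter on
             * every post. */
            UCC_CHECK_GOTO(ucc_task_subscribe_dep(tasks[i - 1], tasks[i],
                                                  UCC_EVENT_COMPLETED),
                           out, status);
            UCC_CHECK_GOTO(ucc_task_subscribe_dep(&schedule->super, tasks[i],
                                                  UCC_EVENT_SCHEDULE_STARTED),
                           out, status);
        } else {
            /* not pipelined: the schedule is started via ucc_schedule_start,
             * which does not reset the deps counter, so keep a plain event
             * subscription that simply starts tasks[i] when tasks[i-1]
             * completes. */
            UCC_CHECK_GOTO(ucc_event_manager_subscribe(tasks[i - 1],
                                                       UCC_EVENT_COMPLETED,
                                                       tasks[i],
                                                       ucc_task_start_handler),
                           out, status);
        }
    }
out:
    return status;
}

With pipelining enabled, the added schedule-start dependency is what prevents the segfault: a fragment that sees its predecessor complete still waits until the schedule is posted again.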

@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/fix_rab_pipeline branch 2 times, most recently from 6887dd1 to 8a98742 on March 29, 2023 09:05
tasks[i], ucc_task_start_handler),
out, status);
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, tasks[i]), out, status);
if (n_frags > 1) {
Collaborator

can this if statement be combined with the same if statement on line 101?

out, status);
for (i = 1; i < n_tasks; i++) {
UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, tasks[i]), out, status);
UCC_CHECK_GOTO(ucc_task_subscribe_dep(tasks[i-1], tasks[i],
Collaborator

can't we just keep ucc_task_subscribe_dep for both cases? IIRC, subscribe_dep is the same as ucc_event_manager_subscribe plus n_deps initialization
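For context, the recollection above corresponds roughly to the following shape. This is a hypothetical sketch of ucc_task_subscribe_dep for illustration, not the actual UCC source; the names dep_handler_sketch and task_subscribe_dep_sketch are made up, and the handler/subscribe signatures are assumptions.

/* Hypothetical sketch: a counted dependency is a normal event subscription
 * plus bookkeeping on the task's dependency counters. */
static ucc_status_t dep_handler_sketch(ucc_coll_task_t *parent,
                                       ucc_coll_task_t *task)
{
    (void)parent; /* event source, unused in this sketch */
    /* start the task only once all registered dependencies were seen */
    if (++task->n_deps_satisfied == task->n_deps) {
        return task->post(task);
    }
    return UCC_OK;
}

static ucc_status_t task_subscribe_dep_sketch(ucc_coll_task_t *parent,
                                              ucc_coll_task_t *task,
                                              ucc_event_t event)
{
    ucc_status_t status = ucc_event_manager_subscribe(parent, event, task,
                                                      dep_handler_sketch);
    if (UCC_OK != status) {
        return status;
    }
    task->n_deps++; /* one more dependency must be satisfied before start */
    return UCC_OK;
}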

Contributor Author

That's what I did initially, but it doesn't work for persistent collectives because of the deps counter.

Collaborator

We start the collective using schedule_pipelined_post, right? Doesn't it reset deps_satisfied to 0? Why exactly is the deps counter broken for persistent collectives?

Contributor Author

It does, but we use the schedule pipelined init only for the pipelined schedule; otherwise it will be ucc_schedule_start.

Collaborator

Yes, but this is what we call: ucc_cl_hier_rab_allreduce_start -> schedule_pipelined_post. Am I missing something?
Also, even if we call ucc_schedule_start, I think it would be cleaner to add n_deps_satisfied = 0 to ucc_schedule_start and then still use "subscribe_dep" alone.

Contributor Author

As discussed over the phone, we can't do that because it would break pipelined schedules. Action items for the next PRs:

  1. check whether bcast 2-step has a similar bug
  2. check whether using only the pipelined schedule would not harm performance
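To illustrate the constraint discussed in this thread, here is a tiny standalone model (not UCC code; all names are made up) of why a counted dependency goes stale when the start path does not reset the satisfied counter, which is the persistent-collective problem above.

/* Standalone model of the counted-dependency issue: if the start path does
 * not reset the satisfied-deps counter, the second start of a persistent
 * collective never fires. */
#include <stdio.h>

typedef struct {
    int n_deps;           /* dependencies registered at init time    */
    int n_deps_satisfied; /* incremented as dependency events arrive */
} model_task_t;

/* a dependency event arrives (e.g. "previous task completed");
 * returns 1 if the task may start now */
static int model_dep_event(model_task_t *t)
{
    return ++t->n_deps_satisfied == t->n_deps;
}

int main(void)
{
    model_task_t t = {.n_deps = 1, .n_deps_satisfied = 0};

    /* first run of a persistent collective: fires as expected */
    printf("run 1 starts task: %d\n", model_dep_event(&t));

    /* second run: without a reset (schedule_pipelined_post does one, plain
     * ucc_schedule_start at the time of this PR did not), the counter is
     * stale and the comparison never matches again */
    printf("run 2 starts task: %d\n", model_dep_event(&t));

    /* resetting between runs (the reviewer's suggestion) fixes this case,
     * but per the discussion it would break pipelined schedules */
    t.n_deps_satisfied = 0;
    printf("run 3 after reset starts task: %d\n", model_dep_event(&t));
    return 0;
}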

@manjugv manjugv added this to the v1.2.0 Release milestone Mar 29, 2023
@Sergei-Lebedev Sergei-Lebedev force-pushed the topic/fix_rab_pipeline branch from 8a98742 to 1d890bf on April 26, 2023 04:59
@Sergei-Lebedev Sergei-Lebedev enabled auto-merge (squash) April 26, 2023 04:59
@Sergei-Lebedev Sergei-Lebedev merged commit f506c8a into openucx:master Apr 26, 2023
janjust pushed a commit to janjust/ucc that referenced this pull request Jan 31, 2024