CL/HIER: fix allreduce rab pipeline #759
Conversation
Force-pushed from 6887dd1 to 8a98742
                                        tasks[i], ucc_task_start_handler),
                   out, status);
    UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, tasks[i]), out, status);
    if (n_frags > 1) {
Can this if statement be combined with the same if on line 101?
                   out, status);
    for (i = 1; i < n_tasks; i++) {
        UCC_CHECK_GOTO(ucc_schedule_add_task(schedule, tasks[i]), out, status);
        UCC_CHECK_GOTO(ucc_task_subscribe_dep(tasks[i-1], tasks[i],
Can't we just keep task_subscribe_dep for both cases? IIRC, subscribe_dep is the same as ucc_event_manager_subscribe plus n_deps initialization.
That's what I did initially, but it doesn't work for persistent collectives because of the deps counter.
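For context, a minimal sketch of the dependency-counter behaviour under discussion, assuming the simplified stand-in names task_t, subscribe_dep and dependency_satisfied rather than the real ucc_coll_task_t / ucc_task_subscribe_dep code:

```c
/*
 * Minimal sketch of the dependency counter, NOT the actual UCC code:
 * task_t, subscribe_dep and dependency_satisfied are simplified
 * stand-ins for the ucc_coll_task_t / ucc_task_subscribe_dep machinery.
 */
#include <stdio.h>

typedef struct task {
    int         n_deps;           /* armed once, at schedule init time   */
    int         n_deps_satisfied; /* bumped each time a dependency fires */
    const char *name;
} task_t;

/* Analogue of subscribe_dep: registers the edge AND bumps n_deps. */
static void subscribe_dep(task_t *subscriber)
{
    subscriber->n_deps++;
}

/* Called when one of the task's dependencies completes. */
static void dependency_satisfied(task_t *t)
{
    t->n_deps_satisfied++;
    if (t->n_deps_satisfied == t->n_deps) {
        printf("%s: all deps satisfied -> start\n", t->name);
    } else {
        printf("%s: %d/%d deps -> does not start\n",
               t->name, t->n_deps_satisfied, t->n_deps);
    }
}

int main(void)
{
    task_t t = { .name = "frag_task" };

    subscribe_dep(&t);        /* init: n_deps = 1 */

    /* First run of a persistent collective: behaves as expected. */
    dependency_satisfied(&t); /* 1/1 -> start */

    /* Second run: init is not repeated for a persistent collective, and
     * if start does not reset n_deps_satisfied the counter is already
     * past n_deps, so the task never starts again. */
    dependency_satisfied(&t); /* 2/1 -> does not start */
    return 0;
}
```

On the second run the counter has already moved past n_deps, so without some form of reset on start the task never fires again, which is the kind of failure described above for persistent collectives.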
We start the collective via schedule_pipelined_post, right? Doesn't it reset deps_satisfied to 0? Why exactly is the dep counter broken for persistent collectives?
It does, but we use the pipelined schedule init only for pipelined schedules; otherwise it will be ucc_schedule_start.
Yes, but this is what we call: ucc_cl_hier_rab_allreduce_start->schedule_pipelined_post. Am I missing something?
Also, even if we call ucc_schedule_start, I think it would be cleaner to add n_deps_satisfied = 0 to ucc_schedule_start and then still use subscribe_dep alone.
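Purely as an illustration of that suggestion, a hypothetical sketch assuming made-up names (sketch_task_t, sketch_schedule_start), not the real ucc_schedule_start:

```c
/*
 * Hypothetical sketch of the suggestion above, NOT the real
 * ucc_schedule_start: sketch_task_t and sketch_schedule_start are
 * made-up names used only to show the reset.
 */
typedef struct sketch_task {
    int n_deps;           /* armed once via subscribe_dep at init */
    int n_deps_satisfied; /* consumed during a run                */
} sketch_task_t;

void sketch_schedule_start(sketch_task_t *tasks, int n_tasks)
{
    for (int i = 0; i < n_tasks; i++) {
        tasks[i].n_deps_satisfied = 0; /* re-arm deps on every (re)start */
    }
    /* ... post the schedule's first task(s) here ... */
}
```

The idea is that the dependency edges and n_deps stay as set up by subscribe_dep at init, while the satisfied counter is re-armed on every start.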
As discussed over the phone, we can't do that because it would break pipelined schedules. Action items for next PRs:
- check whether bcast 2-step has a similar bug
- check whether using only the pipelined schedule would not hurt performance
Force-pushed from 8a98742 to 1d890bf
What
Fixes a segfault in the reduce-allreduce-broadcast (rab) hierarchical allreduce algorithm when pipelining is used.
Also fixes the coll trace print for pipelined schedules.
How?
For sequential and ordered pipelines, the start dependency of a pipeline fragment was set only to the previous fragment, so completion of one fragment immediately triggered the start of the next fragment even when the schedule was already completed. This PR adds a dependency on the schedule start to prevent this.
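A minimal sketch of the resulting behaviour, assuming a simplified frag_t counter rather than the real cl/hier schedule code (frag_t, dep_fired and the fragment names are illustrative only):

```c
/*
 * Minimal sketch of the fix, NOT the actual cl/hier code: each fragment
 * now has TWO start dependencies, the previous fragment's completion
 * AND the schedule's own start event.
 */
#include <stdio.h>

typedef struct frag {
    int         n_deps;
    int         n_deps_satisfied;
    const char *name;
} frag_t;

static void dep_fired(frag_t *f, const char *event)
{
    f->n_deps_satisfied++;
    if (f->n_deps_satisfied == f->n_deps) {
        printf("%s starts (after: %s)\n", f->name, event);
    } else {
        printf("%s waits after %s (%d/%d deps)\n",
               f->name, event, f->n_deps_satisfied, f->n_deps);
    }
}

int main(void)
{
    /* frag1 depends on: schedule start + frag0 completion. */
    frag_t frag1 = { .n_deps = 2, .name = "frag1" };

    /* A frag0 completion arriving on its own is no longer enough;
     * the schedule has to have been (re)started as well. */
    dep_fired(&frag1, "frag0 completed");  /* 1/2 -> waits  */
    dep_fired(&frag1, "schedule started"); /* 2/2 -> starts */
    return 0;
}
```

With only the previous-fragment dependency (n_deps == 1), the first event alone would have started frag1; that premature start is what the added schedule-start dependency prevents.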