Why are tree algorithms specifically targeted at All-Reduce? #1473
Comments
Implementing a tree algorithm for allgather/reduce-scatter is very hard, contrary to allreduce. That being said, in 2.23 we introduced the PAT algorithms, which are a variation of Bruck's algorithm that reorders the schedule to keep memory needs limited. They improve the performance of allgather/reduce-scatter at scale, but they are still a work in progress and not yet as low-latency as the tree allreduce. They also only support one GPU per node at the moment, i.e. the intra-node part is not yet implemented.
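The PAT algorithm mentioned above is described as a variation of Bruck's algorithm. As a rough illustration of the starting point (this is the classic textbook Bruck allgather, not NCCL's actual implementation), here is a minimal simulation: with `p` ranks, each exchange step doubles the number of blocks every rank holds, so the whole allgather completes in ceil(log2 p) steps instead of the ring's p−1, at the cost of each rank handling increasingly large messages and a final local reordering.

```python
def bruck_allgather(blocks):
    """Simulate Bruck's allgather over p ranks.

    blocks[r] is rank r's input block. Returns (result, steps) where
    result[r] is rank r's final buffer in global block order.
    """
    p = len(blocks)
    # Each rank's buffer starts with its own block.
    buf = [[blocks[r]] for r in range(p)]
    dist, steps = 1, 0
    while dist < p:
        # Step k: rank r receives rank (r + 2^k) % p's current buffer,
        # truncated so no rank ever holds more than p blocks.
        recv = [buf[(r + dist) % p][: p - len(buf[r])] for r in range(p)]
        for r in range(p):
            buf[r] += recv[r]
        dist *= 2
        steps += 1
    # Rank r's buffer starts at block r; rotate it into global order.
    return [buf[r][-r:] + buf[r][:-r] if r else buf[r] for r in range(p)], steps
```

For example, `bruck_allgather(list("abcde"))` finishes in 3 steps (ceil(log2 5)) with every rank holding all five blocks. The memory pressure visible here (buffers grow, and the final rotation needs the whole result staged) is the kind of cost the reordered PAT schedule is said to keep limited.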
I'm running the nccl-test `all-reduce` benchmark between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, from reading the NCCL source code, it seems that only the all-reduce operation has a tree algorithm implementation. As far as I know, the FSDP (Fully Sharded Data Parallelism) strategy achieves a global all-reduce through a combination of `all-gather` and `reduce-scatter`; I confirmed this by setting `NCCL_DEBUG_SUBSYS=COLL`. So why does it appear that only all-reduce has a tree algorithm implementation? Would using the ring algorithm when training models with the FSDP strategy result in underutilization of bandwidth?
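On the bandwidth question: a minimal simulation (illustrative only, not NCCL's implementation) shows that on a ring, reduce-scatter followed by allgather costs exactly the same 2(p−1) communication steps as a fused ring allreduce, so FSDP's decomposition is not inherently bandwidth-wasteful on a ring. The tree's advantage over the ring is latency at scale (log-depth instead of linear), which is why its absence matters for small messages rather than for bandwidth.

```python
def ring_allreduce(data):
    """Simulate ring allreduce as reduce-scatter + allgather.

    data[r] is rank r's contribution: a list of p chunk values.
    Returns (buffers, steps); after completion every rank's buffer
    holds the elementwise sum of all contributions.
    """
    p = len(data)
    buf = [list(data[r]) for r in range(p)]
    steps = 0
    # Reduce-scatter phase: p-1 steps. At step s, rank r sends chunk
    # (r - s) % p to rank (r + 1) % p, which accumulates it.
    for s in range(p - 1):
        sends = [(r, (r - s) % p, buf[r][(r - s) % p]) for r in range(p)]
        for r, c, v in sends:
            buf[(r + 1) % p][c] += v
        steps += 1
    # Now rank r owns the fully reduced chunk (r + 1) % p.
    # Allgather phase: p-1 more steps, circulating the reduced chunks.
    for s in range(p - 1):
        sends = [(r, (r + 1 - s) % p, buf[r][(r + 1 - s) % p]) for r in range(p)]
        for r, c, v in sends:
            buf[(r + 1) % p][c] = v
        steps += 1
    return buf, steps
```

With 3 ranks contributing `[1,2,3]`, `[4,5,6]`, `[7,8,9]`, every rank ends with `[12, 15, 18]` after 4 steps, i.e. 2(p−1). Running the two phases as separate collectives (as FSDP does) keeps the same per-step message sizes; what it cannot recover is the tree allreduce's logarithmic latency.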