Why are tree algorithms specifically targeted at All-Reduce? #1473

Open
jxh314 opened this issue Oct 9, 2024 · 1 comment

Comments

@jxh314

jxh314 commented Oct 9, 2024

I'm running the nccl-tests all-reduce benchmark between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, while reading the NCCL source code, I noticed that only the all-reduce operation seems to have a tree algorithm implementation. As far as I know, the FSDP (Fully Sharded Data Parallelism) strategy achieves a global all-reduce through a combination of reduce-scatter and all-gather; I confirmed this by setting NCCL_DEBUG_SUBSYS=COLL. So, why does it appear that only all-reduce has a tree algorithm implementation? Would using the ring algorithm when training models with the FSDP strategy result in underutilization of bandwidth?
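
For readers following along, the equivalence mentioned in the question can be checked with a small pure-Python simulation. This is only an illustrative sketch: the helper names (`reduce_scatter`, `all_gather`, `all_reduce`) are hypothetical stand-ins operating on plain lists, not NCCL or PyTorch calls, and the buffer length is assumed to be divisible by the number of ranks.

```python
def reduce_scatter(buffers):
    """Simulated reduce-scatter: rank r ends up with the sum of chunk r."""
    n = len(buffers)
    chunk = len(buffers[0]) // n  # assumes length is divisible by n
    return [
        [sum(buf[r * chunk + i] for buf in buffers) for i in range(chunk)]
        for r in range(n)
    ]


def all_gather(shards):
    """Simulated all-gather: every rank receives all shards, in rank order."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(buffers):
    """Simulated all-reduce: every rank receives the element-wise sum."""
    total = [sum(vals) for vals in zip(*buffers)]
    return [list(total) for _ in buffers]


# Four simulated ranks, each holding a four-element buffer.
buffers = [[rank * 10 + i for i in range(4)] for rank in range(4)]

# Reduce-scatter followed by all-gather gives the same result as all-reduce.
assert all_gather(reduce_scatter(buffers)) == all_reduce(buffers)
```

The sketch only shows that the two collectives compose into an all-reduce in terms of results; it says nothing about which algorithm NCCL picks for each of them.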

@sjeaugey
Member

sjeaugey commented Oct 9, 2024

> why does it appear that only all-reduce has a tree algorithm implementation

Implementing a Tree algorithm for allgather/reducescatter is very hard, unlike for allreduce.
The Tree algorithm performs multiple reduce/broadcast operations, so it doesn't apply to allgather/reducescatter.
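
To make the reduce/broadcast structure concrete, here is a minimal single-process sketch of an allreduce over one binary tree. It is an assumption-laden simplification, not NCCL's implementation: the function name `tree_all_reduce`, the heap-style rank layout (parent of rank `r` is `(r - 1) // 2`), and the use of one scalar per rank are all chosen purely for illustration.

```python
def tree_all_reduce(values):
    """Simulated tree allreduce: reduce to the root, then broadcast back."""
    n = len(values)
    partial = list(values)

    # Reduce phase: visit ranks from the leaves upward so that each rank
    # has absorbed its children's sums before sending to its own parent.
    for rank in range(n - 1, 0, -1):
        parent = (rank - 1) // 2
        partial[parent] += partial[rank]

    # Broadcast phase: the root holds the full sum; pass it down the tree.
    result = [None] * n
    result[0] = partial[0]
    for rank in range(1, n):
        result[rank] = result[(rank - 1) // 2]
    return result


print(tree_all_reduce([1, 2, 3, 4, 5]))  # prints [15, 15, 15, 15, 15]
```

In this sketch every rank ends with the same fully reduced value, which is the shape of an allreduce result. Allgather and reducescatter leave each rank with different or concatenated data, so the same reduce-then-broadcast pattern does not carry over directly.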

That being said, in 2.23 we introduced the PAT algorithms, a variation of Bruck's algorithm that reorders the schedule to keep memory needs limited. They improve the performance of allgather/reducescatter at scale, but they are still a work in progress and not yet as low-latency as the Tree allreduce. They also only support one GPU per node at the moment, i.e. the intra-node part is not yet implemented.
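
As background for the comment above, the following is a sketch of the textbook Bruck-style allgather, simulated in one process. It is not the PAT algorithm and does not show PAT's schedule reordering; the function name `bruck_all_gather` and the list-based "ranks" are hypothetical. The point is only that the number of communication steps grows logarithmically with the rank count, whereas a ring allgather needs `n - 1` steps.

```python
def bruck_all_gather(blocks):
    """Simulated Bruck allgather; returns (per-rank results, step count)."""
    n = len(blocks)
    # held[r] lists the blocks rank r owns, ordered r, r+1, ... (mod n).
    held = [[b] for b in blocks]
    steps = 0
    dist = 1
    while dist < n:
        # Each rank receives from rank (r + dist) % n. The final step may
        # need fewer blocks than the sender holds when n is not a power of 2.
        count = min(dist, n - dist)
        held = [held[r] + held[(r + dist) % n][:count] for r in range(n)]
        dist *= 2
        steps += 1

    # Local rotation so every rank stores the blocks in rank order 0..n-1.
    result = [[None] * n for _ in range(n)]
    for r in range(n):
        for i, block in enumerate(held[r]):
            result[r][(r + i) % n] = block
    return result, steps


gathered, steps = bruck_all_gather(["a", "b", "c", "d", "e"])
assert gathered == [["a", "b", "c", "d", "e"]] * 5
assert steps == 3  # ceil(log2(5)) steps, versus 4 for a ring of 5 ranks
```

Note that in this plain form the amount of data sent per step doubles as the algorithm progresses, which is the kind of memory pressure the comment says PAT is designed to limit.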
