FSDP does not properly set "num_nodes" on SLURM #17436
Labels
bug
Something isn't working
duplicate
This issue or pull request already exists
strategy: fsdp
Fully Sharded Data Parallel
ver: 2.0.x
Comments
weicao1990 added the bug and needs triage labels on Apr 22, 2023
Thanks for reporting.
awaelchli added the duplicate label and removed the needs triage label on Apr 22, 2023
@weicao1990 until the fix is out, you can work around this issue by setting the num_nodes value yourself via |
@awaelchli thanks a lot~
Bug description
Hi all. I train my model on SLURM, and it works well with DDPStrategy. FSDPStrategy also works well on a single node with 8 GPUs, but when I try to use 5 nodes, an error is reported on nodes 2~5.
I found that the error comes from DistributedSampler, where num_replicas = self.num_nodes * self.num_processes, as defined in the strategy file. For DDPStrategy, self.num_nodes is correctly set to 5, but for FSDPStrategy, self.num_nodes is 1, which makes "rank >= num_nodes * num_processes" true on those nodes.
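The arithmetic behind this can be sketched in plain Python. This is only an illustration, not Lightning or PyTorch code: the function names are hypothetical, the range check imitates the kind of rank validation a distributed sampler performs, and it assumes 8 processes on each of 5 nodes with global ranks assigned contiguously per node.

```python
def sampler_world_size(num_nodes: int, num_processes: int) -> int:
    """num_replicas as described in the report: nodes times processes per node."""
    return num_nodes * num_processes


def check_rank(rank: int, num_replicas: int) -> None:
    """Hypothetical stand-in for a sampler's rank validation."""
    if rank < 0 or rank >= num_replicas:
        raise ValueError(
            f"Invalid rank {rank}, rank should be in the interval "
            f"[0, {num_replicas - 1}]"
        )


def failing_ranks(actual_nodes: int, num_processes: int, strategy_num_nodes: int) -> list:
    """Return the global ranks that fail the check for a given strategy setting."""
    num_replicas = sampler_world_size(strategy_num_nodes, num_processes)
    bad = []
    for rank in range(actual_nodes * num_processes):
        try:
            check_rank(rank, num_replicas)
        except ValueError:
            bad.append(rank)
    return bad


# num_nodes correctly set to 5 (the DDPStrategy case): every rank passes.
ddp_failures = failing_ranks(actual_nodes=5, num_processes=8, strategy_num_nodes=5)

# num_nodes left at 1 (the FSDPStrategy case): ranks 8..39 are rejected.
fsdp_failures = failing_ranks(actual_nodes=5, num_processes=8, strategy_num_nodes=1)

# Map failing ranks back to 1-based node numbers.
failing_nodes = sorted({rank // 8 + 1 for rank in fsdp_failures})
```

With num_nodes stuck at 1, num_replicas is 8, so only the 8 ranks on the first node are in range; this matches the report that the single-node run works and nodes 2~5 fail.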
What version are you seeing the problem on?
2.0+
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response
cc @awaelchli @carmocca