
FSDP does not properly set "num_nodes" on SLURM #17436

Closed
weicao1990 opened this issue Apr 22, 2023 · 3 comments · Fixed by #17438
Labels
bug Something isn't working duplicate This issue or pull request already exists strategy: fsdp Fully Sharded Data Parallel ver: 2.0.x
Comments

weicao1990 commented Apr 22, 2023

Bug description

Hi all. I train my model on SLURM and it works well with DDPStrategy. With FSDPStrategy it also works well on a single node with 8 GPUs, but when I try to use 5 nodes, nodes 2 to 5 report:

ValueError: Invalid rank 8, rank should be in the interval [0, 7]

I found that this error comes from DistributedSampler:

        if rank >= num_replicas or rank < 0:
            raise ValueError(
                "Invalid rank {}, rank should be in the interval"
                " [0, {}]".format(rank, num_replicas - 1))

where num_replicas = self.num_nodes * self.num_processes is defined in the strategy file. For DDPStrategy, self.num_nodes is properly set to 5, but for FSDPStrategy it stays at 1, so the global ranks on the later nodes satisfy rank >= num_nodes * num_processes and fail the check above.
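A minimal, self-contained sketch of the failing check (values assumed from this report: 5 nodes with 8 GPUs each; the function is a stand-in for the DistributedSampler validation, not Lightning's actual code path):

```python
def validate_rank(rank: int, num_nodes: int, num_processes: int) -> None:
    # Mirrors DistributedSampler's rank check, with num_replicas computed
    # the way Lightning does: num_nodes * num_processes.
    num_replicas = num_nodes * num_processes
    if rank >= num_replicas or rank < 0:
        raise ValueError(
            "Invalid rank {}, rank should be in the interval"
            " [0, {}]".format(rank, num_replicas - 1)
        )

# With num_nodes correctly set to 5, global rank 8 (first GPU of node 2) is valid:
validate_rank(8, num_nodes=5, num_processes=8)  # 40 replicas, no error

# With the buggy FSDPStrategy default of num_nodes=1, the same rank fails:
try:
    validate_rank(8, num_nodes=1, num_processes=8)
except ValueError as err:
    print(err)  # Invalid rank 8, rank should be in the interval [0, 7]
```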

What version are you seeing the problem on?

2.0+

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning 2.0.0
#- PyTorch 2.0.0

More info

No response

cc @awaelchli @carmocca

@weicao1990 weicao1990 added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Apr 22, 2023

awaelchli commented Apr 22, 2023

Thanks for reporting.
This is a duplicate of #17028, right?
I'll take care of it

@awaelchli awaelchli self-assigned this Apr 22, 2023
@awaelchli awaelchli added duplicate This issue or pull request already exists and removed needs triage Waiting to be triaged by maintainers labels Apr 22, 2023
@awaelchli awaelchli added the strategy: fsdp Fully Sharded Data Parallel label Apr 22, 2023
@awaelchli

@weicao1990 until the fix is out, you can work around this issue by setting the num_nodes value yourself via trainer.strategy.num_nodes = x. This should unblock you.
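An illustrative sketch of the workaround pattern. The class below is a stand-in for Lightning's strategy object so the snippet runs anywhere; in a real script you would set trainer.strategy.num_nodes right after constructing the Trainer, before calling trainer.fit(...):

```python
class FakeFSDPStrategy:
    """Stand-in for FSDPStrategy on affected Lightning 2.0.x versions."""

    def __init__(self):
        self.num_nodes = 1       # buggy default: not picked up from SLURM
        self.num_processes = 8   # GPUs per node

strategy = FakeFSDPStrategy()
strategy.num_nodes = 5           # manual override, mirroring the workaround
print(strategy.num_nodes * strategy.num_processes)  # 40 replicas, ranks 0..39 valid
```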

@awaelchli awaelchli added this to the 2.0.x milestone Apr 22, 2023
@weicao1990

@awaelchli thanks a lot~
