Launch DDP on 8 devices, but only run on the first GPU #16236
Comments
Got the same problem.
Try unsetting the KUBERNETES_PORT environment variable.
It works for me... I spent one night and one morning on it... TT
Solved. Thanks!
@superhero-7 Unfortunately I don't know how the KUBERNETES_PORT relates to this problem here, or even how it solved it. Does that mean this issue is closed, or are there still some open questions?
Our machines are managed by k8s; I suppose there may be some conflict over the GLOBAL RANK environment variables between the k8s settings and the pytorch_lightning DDP settings?
I got the same issue but on a SLURM cluster. I have access to two SLURM clusters. Interestingly, on one cluster PL DDP works fine but on the second one, I experience this issue. Since I don't use K8s, I guess it would be really hard to reproduce this. Any pointers to what I could try?
You could try printing the contents of os.environ in each of the processes. Since you are using SLURM, make sure to follow exactly the instructions here.
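For illustration, a minimal sketch of such a check, filtering os.environ down to the variables that matter for rank assignment and cluster detection (the marker list is an assumption, not something from this thread):

```python
import os

# Variables relevant to Lightning's cluster detection and DDP rank
# assignment; KUBERNETES_PORT is included because it comes up later
# in this thread.
markers = ("SLURM_", "RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT",
           "KUBERNETES_PORT")

for key in sorted(os.environ):
    if any(marker in key for marker in markers):
        print(f"{key}={os.environ[key]}")
```

Running this once per srun task makes it easy to diff the two clusters' environments.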
@awaelchli Great idea! I believe I have followed the instructions correctly. For this test, I use two GPUs on a single node. First, the sbatch script for the server on which there are no issues:

```bash
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=2
#SBATCH --gres=gpu:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
```

Second, the sbatch script for the server where I observe the described issue:

```bash
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
```

Now, the os.environ output on the server where I observe no issues:
The os.environ output on the server where I observe the described issue:
From a quick scan I see that the two outputs differ in their SLURM variables.
I found a workaround. Strangely, when I additionally set ntasks=NUM_GPUS, DDP works as expected. In this case, on the problematic cluster I get SLURM_NTASKS=NUM_GPUS and the script runs correctly. So the augmented sbatch script is:

```bash
#!/usr/bin/env bash
#SBATCH --parsable
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=16G
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=3
#SBATCH --gpus=rtx_3090:2
#SBATCH --output=/some/path/%j.out
module load nccl
source /some/path/conda.sh
conda activate myenv
srun python myscript.py ...
conda deactivate
```

No idea why ntasks-per-node is not sufficient.
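As a guard against this kind of mismatch, one could assert at the start of the training script that SLURM_NTASKS matches the number of processes the job is supposed to get. This is only a sketch; `expected_world_size` is a hypothetical value matching the two-GPU, single-node example above:

```python
import os

# Hypothetical expected value: 1 node * 2 tasks (one per GPU), matching
# the sbatch scripts above.
expected_world_size = 1 * 2

ntasks = os.environ.get("SLURM_NTASKS")
if ntasks is None or int(ntasks) != expected_world_size:
    # Lightning's SLURM integration derives the world size from SLURM_NTASKS,
    # so a missing or wrong value can reproduce the single-GPU behaviour
    # described in this thread.
    raise RuntimeError(
        f"SLURM_NTASKS={ntasks!r}, expected {expected_world_size}"
    )
```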
Got a response from the cluster support. Apparently they still need to configure something on their side. TL;DR: it is a SLURM config issue, not PL related.
For SLURM users: try downgrading pytorch-lightning.
Many thanks, it really works!!!
@superhero-7 Were you able to resolve the issue on your end? I couldn't figure out whether this is an issue with Lightning or not.
I ran into the same issue. Seeing #5225 (comment) and the docs, I solved it by adding
@jasonkena That'll work, yes. Here is the proper docs link for this. The other users who commented here had an issue with the kubernetes environment variable and I fixed this in the linked PR: #18137
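For anyone stuck on a version without that fix, one possible workaround, sketched here as an assumption rather than an officially documented solution, is to pass a cluster environment explicitly so that nothing is auto-detected from KUBERNETES_PORT:

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins.environments import LightningEnvironment

# Force the default standalone environment so the Trainer spawns its own
# DDP processes instead of trusting the k8s-flavoured variables that happen
# to be set on the machine.
trainer = Trainer(
    accelerator="gpu",
    devices=8,
    strategy="ddp",
    plugins=[LightningEnvironment()],
)
```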
@awaelchli Thanks for your work! I'm using the kubernetes environment and
Reply to myself:
For those using SLURM, don't forget to launch your script with srun.
Bug description
I train the model like this; my code is below:
And it works fine and does not throw any error. But it is not running on 8 GPUs; instead, it only runs on the first GPU.
And it only initializes one MEMBER, like this:
I am so confused, because the progress bar is totally right. The length of my dataset is 1198099, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 * 8 * 37457 is almost equal to 1198099.
But the problem is that nvidia-smi only shows the first GPU running, like below:
I don't understand why this happens. I hope someone can help me, thanks a lot!
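The reporter's code block did not survive in this copy of the issue. As a hedged illustration only, a minimal, self-contained 8-GPU DDP setup of the kind being described could look like the following; ToyModel and the random dataset are placeholders, not the reporter's code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModel(pl.LightningModule):
    """Stand-in for the reporter's model; not part of the original issue."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    loader = DataLoader(data, batch_size=4, num_workers=2)

    # With a healthy launch, Lightning prints one
    # "Initializing distributed: GLOBAL_RANK: k, MEMBER: k+1/8" line per
    # process, and nvidia-smi shows a process on every GPU.
    trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="ddp",
                         max_epochs=1)
    trainer.fit(ToyModel(), loader)
```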
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
More info
No response
cc @justusschock @awaelchli