
Launch ddp on 8 devices, but only run on the first gpu #16236

Closed
@superhero-7


Bug description

I train the model as follows; my code is below:

trainer_kwargs["accelerator"] = 'gpu'
trainer_kwargs["devices"] = 8
trainer_kwargs["strategy"] = "ddp"
trainer = Trainer.from_argparse_args(trainer_config,**trainer_kwargs)
trainer.fit(model, data)

It runs without throwing any error, but it does not run on 8 GPUs; instead, it only runs on the first GPU.
Only one MEMBER is initialized, like this:
1672801542612

I am confused, because the progress bar looks right. The length of my dataset is 1198099, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 * 8 * 37457 = 1198624, which is almost equal to 1198099.
image
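As a sanity check on that arithmetic, here is a minimal sketch of the expected step count. It assumes the data is split evenly across ranks with each rank's share rounded up (my understanding of the default distributed sampler behaviour, not something confirmed in this issue); the helper name `steps_per_epoch` is hypothetical.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int, world_size: int) -> int:
    """Estimate training steps per epoch on each rank.

    Assumption: samples are divided evenly across ranks (rounded up),
    and the last partial batch on each rank is kept.
    """
    samples_per_rank = math.ceil(num_samples / world_size)
    return math.ceil(samples_per_rank / batch_size)


dataset_len = 1198099
batch_size = 4

# With 8 processes the estimate lands close to the 37457 in the progress bar.
print(steps_per_epoch(dataset_len, batch_size, world_size=8))  # 37441
# With a single process the count would be roughly eight times larger.
print(steps_per_epoch(dataset_len, batch_size, world_size=1))  # 299525
```

The estimate for 8 processes is close to the 37457 steps shown, while a single process would need about 299525 steps, so the progress bar appears to be computed for 8 devices. The small difference from 37457 is not explained by this sketch.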

But the problem is that nvidia-smi shows only the first GPU running, as below:
image

I don't understand why this happens. I hope someone can help me, thanks a lot!
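Since only one member is initialized, one thing worth ruling out is the environment the script is launched from. Below is a minimal, stdlib-only sketch that prints a few variables. The list of keys is my assumption about what might influence the process or device count (for example a restricted `CUDA_VISIBLE_DEVICES`, or variables set by a job scheduler); it is not a confirmed cause, and the helper name `launch_env_snapshot` is hypothetical.

```python
import os

# Assumed list of variables that may affect how many processes/devices
# a launcher sees. Illustrative only, not exhaustive.
KEYS = [
    "CUDA_VISIBLE_DEVICES",
    "WORLD_SIZE",
    "RANK",
    "LOCAL_RANK",
    "NODE_RANK",
    "SLURM_NTASKS",
    "SLURM_JOB_ID",
]


def launch_env_snapshot(environ=None):
    """Return the value of each key in KEYS, or '<unset>' if it is missing."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "<unset>") for key in KEYS}


if __name__ == "__main__":
    # Run this in the same shell/job that launches training.
    for key, value in launch_env_snapshot().items():
        print(f"{key}={value}")
```

Running it in the same shell or job that starts training, and attaching the output under "Error messages and logs", would show whether any of these variables is set to an unexpected value.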

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): I tried the latest and 1.7.3, and got the same problem
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.12.1 cuda 11.3
#- Python version (e.g., 3.9): 3.8.5
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: A100*8
#- How you installed Lightning(`conda`, `pip`, source): pip install pytorch_lightning==1.7.3
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

cc @justusschock @awaelchli
