Description
Bug description
I train the model like this; my code is below:
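from pytorch_lightning import Trainer

# trainer_config (an argparse.Namespace), model and data are defined elsewhere
trainer_kwargs = {}
trainer_kwargs["accelerator"] = "gpu"
trainer_kwargs["devices"] = 8  # DDP should start one process per GPU
trainer_kwargs["strategy"] = "ddp"
trainer = Trainer.from_argparse_args(trainer_config, **trainer_kwargs)
trainer.fit(model, data)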
It runs without throwing any error, but instead of running on 8 GPUs, it only runs on the first GPU.
It also initializes only one MEMBER, like this:
I'm very confused, because the progress bar looks completely correct. My dataset has 1198099 samples, and the progress bar shows 37457 steps per epoch. I set the batch size to 4, so 4 × 8 × 37457 = 1198624, which is almost equal to 1198099.
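The step arithmetic above can be checked with a short sketch. It assumes that, under DDP, Lightning swaps the training sampler for PyTorch's `DistributedSampler` (which pads the dataset so every rank gets the same number of samples) and that neither the sampler nor the DataLoader drops the last partial batch. The function name `steps_per_epoch` is hypothetical. Under these assumptions the per-process batch count depends on the world size, so the number in the progress bar hints at which world size the sampler was configured with.

```python
def _ceil_div(a: int, b: int) -> int:
    # Integer ceiling division, avoids float rounding on large counts.
    return -(-a // b)


def steps_per_epoch(num_samples: int, batch_size: int, world_size: int) -> int:
    """Training batches each process sees per epoch.

    Assumes torch's DistributedSampler (drop_last=False) and a DataLoader
    with drop_last=False, as described in the lead-in.
    """
    # DistributedSampler pads so each rank gets ceil(N / world_size) samples.
    samples_per_rank = _ceil_div(num_samples, world_size)
    # The DataLoader then yields ceil(samples_per_rank / batch_size) batches.
    return _ceil_div(samples_per_rank, batch_size)


N = 1198099  # dataset length from this report
print(steps_per_epoch(N, batch_size=4, world_size=8))  # 37441
print(steps_per_epoch(N, batch_size=4, world_size=1))  # 299525
```

The 8-process figure (37441) is close to the 37457 steps shown, while a single process would see about 299525 batches, so the progress bar appears to reflect a world size of 8. The small remaining gap is not modeled here; for example, the main progress bar may also count validation batches.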
The problem is that nvidia-smi shows only the first GPU in use, like below:
I don't understand why this happens. I hope someone can help me, thanks a lot!
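To check per-GPU usage from a script instead of reading nvidia-smi by eye, a minimal sketch like the following could be run on the training node while `trainer.fit` is running. It relies on nvidia-smi's `--query-gpu=index,utilization.gpu,memory.used` and `--format=csv,noheader,nounits` options; the helper names and the 500 MiB "busy" threshold are hypothetical choices for illustration, not part of any library.

```python
import subprocess
from typing import List, NamedTuple, Optional


class GpuUsage(NamedTuple):
    index: int
    utilization_pct: Optional[int]  # None when nvidia-smi reports "[N/A]"
    memory_used_mib: int


def parse_gpu_usage(csv_text: str) -> List[GpuUsage]:
    """Parse `nvidia-smi --query-gpu=index,utilization.gpu,memory.used
    --format=csv,noheader,nounits` output into one record per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append(GpuUsage(int(index), int(util) if util.isdigit() else None, int(mem)))
    return rows


def busy_gpus(rows: List[GpuUsage], min_memory_mib: int = 500) -> List[int]:
    """Indices of GPUs whose used memory exceeds the (hypothetical) threshold."""
    return [r.index for r in rows if r.memory_used_mib > min_memory_mib]


def query_gpu_usage() -> List[GpuUsage]:
    """Run nvidia-smi (must be on PATH) and parse its per-GPU usage."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_usage(out)


# Usage on the training node during trainer.fit:
#   rows = query_gpu_usage()
#   print(f"{len(busy_gpus(rows))}/{len(rows)} GPUs busy: {busy_gpus(rows)}")
```

If all 8 DDP processes started, one would expect memory to be allocated on every index, not just GPU 0.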
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): I tried the latest version and 1.7.3 and got the same problem
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.12.1 (CUDA 11.3)
#- Python version (e.g., 3.9): 3.8.5
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration: 8 × A100
#- How you installed Lightning(`conda`, `pip`, source): pip install pytorch_lightning==1.7.3
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response