Questions About Distributed Training #1278
Replies: 13 comments
-
Hello. This is not quite normal: since you don't decrease the per-GPU batch size, the training time should be comparable when using different numbers of GPUs. Are you using 8 GPUs in a single node or across multiple nodes? Could you please upload the result of … Moreover, the image translation models have been moved to MMGeneration with a new design, so you can also try MMGeneration.
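To make the batch-size point concrete, here is a minimal PyTorch DDP sketch (my own illustration, not MMEditing internals): each process keeps its own samples_per_gpu-sized batch, so the per-GPU compute per iteration matches single-GPU training, and the only extra per-iteration cost is the gradient all-reduce inside backward(). The file name ddp_sketch.py and the toy Conv2d model are placeholders.
```python
# Minimal single-node DDP sketch; launch with e.g.
#   torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a generator; every rank holds an identical replica.
    model = DDP(nn.Conv2d(3, 3, 3, padding=1).cuda(), device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)

    samples_per_gpu = 1  # per-GPU batch size stays fixed as you add GPUs
    for _ in range(10):
        x = torch.randn(samples_per_gpu, 3, 256, 256, device="cuda")
        loss = model(x).abs().mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across all ranks here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```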
-
-
@wangruohui
-
It should have nothing to do with iters or epochs. It might have something to do with … Have you tried MMGeneration?
-
I haven't tried it yet, but if the problem cannot be fixed I will switch to MMGeneration. Does the MMEditing code work well on your machine in distributed mode?
-
I don't have a prepared environment at hand, so you may need to wait a while if you want me to give it a try. But actually, since these two models have been supported by MMGeneration for nearly a year, we are considering deprecating them here. So I still suggest you switch to MMGeneration anyway :>
-
Ok, thank you for your advice!
-
@wangruohui
-
I tried, but my task is still pending since I borrowed some machines from others :( Previously I thought the problem lay in your hardware, because SYS is the slowest connection within a node and multi-GPU training over it would be problematic. But you said it works well on action2, so maybe you need to wait until my tasks actually run.
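In the meantime, one way to check whether a slow interconnect (e.g. a SYS link reported by nvidia-smi topo -m) is the bottleneck is to time a bare all-reduce of a gradient-sized tensor, independent of MMEditing. A rough sketch, assuming a single node, the NCCL backend, and a hypothetical file name allreduce_bench.py:
```python
# Time an all-reduce of a gradient-sized tensor; launch with e.g.
#   torchrun --nproc_per_node=8 allreduce_bench.py
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())  # single-node assumption

    # 256 MiB of float32, roughly the gradient volume of a mid-sized model.
    tensor = torch.randn(64 * 1024 * 1024, device="cuda")

    # Warm up NCCL before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if rank == 0:
        gib = tensor.numel() * 4 / 1024 ** 3
        print(f"all_reduce of {gib:.2f} GiB took {elapsed * 1000:.1f} ms per call")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
If this per-call time is on the same order as a single training iteration, communication, not compute, is what grows as you add GPUs.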
-
OK, I just got some data.
-
So, is this a bug? I also found that the training time increases with the total batch size (samples_per_gpu * gpu_number).
-
Probably not a bug, just a lack of performance optimization. Smaller models usually suffer more from communication overhead.
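For intuition: what DDP has to all-reduce every iteration is the full set of gradients, roughly 4 bytes per parameter in float32. For a small generator trained with samples_per_gpu=1 the forward/backward pass is quick, so that fixed transfer can dominate on a slow PCIe/SYS link. A back-of-the-envelope sketch (the toy module is a hypothetical stand-in, not the actual pix2pix generator):
```python
# Estimate the per-iteration gradient payload that DDP must all-reduce.
import torch.nn as nn


def grad_payload_mib(model: nn.Module) -> float:
    """MiB of float32 gradients synchronised across GPUs per iteration."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_params * 4 / 1024 ** 2


# Hypothetical stand-in; swap in the real generator/discriminator for actual numbers.
toy = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.Conv2d(64, 128, 3, padding=1),
    nn.Conv2d(128, 3, 3, padding=1),
)
print(f"{grad_payload_mib(toy):.1f} MiB of gradients per step")
```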
-
Hi guys, has this problem been solved? I ran into the same issues as mentioned above: more GPUs take longer training time, and a bigger samples_per_gpu takes longer as well.
-
Hi guys, I ran into some issues when trying to use multiple GPUs in distributed mode. I found it is not faster than a single GPU; in fact, more GPUs take longer training time. So I ran an experiment to verify this. Taking the config
configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py
as an example, the estimated training time is 2 h, 6 h, and 8 h for 1 GPU, 2 GPUs, and 8 GPUs respectively. Is this a normal phenomenon (I guess not)? How do I use distributed training correctly to speed up my experiments?
I list the commands that I used below.
1gpu:
python ./tools/train.py configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py
2gpus:
bash ./tools/dist_train.sh configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py 2
8gpus:
bash ./tools/dist_train.sh configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py 8
(The corresponding training logs for the 1gpu, 2gpus, and 8gpus runs were attached here.)
So, how can I solve this problem?