Questions About Distributed Training #1278
Replies: 13 comments
-
Hello. This is not quite normal: since you don't decrease the per-GPU batch size, the training time should be comparable when using different numbers of GPUs. Are you using 8 GPUs in a single node or across multiple nodes? Could you please upload the result of … Moreover, the image translation models have been moved to MMGeneration with a new design, so you can also try MMGeneration.
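To make the batch-size point concrete, here is a minimal PyTorch DDP sketch (my own illustration, not MMEditing internals): each process keeps its own samples_per_gpu-sized batch, so the per-GPU compute per iteration matches single-GPU training, and the only extra per-iteration cost is the gradient all-reduce inside backward(). The file name ddp_sketch.py and the toy Conv2d model are placeholders.
```python
# Minimal single-node DDP sketch; launch with e.g.
#   torchrun --nproc_per_node=2 ddp_sketch.py
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()  # single-node assumption
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a generator; every rank holds an identical replica.
    model = DDP(nn.Conv2d(3, 3, 3, padding=1).cuda(), device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)

    samples_per_gpu = 1  # per-GPU batch size stays fixed as you add GPUs
    for _ in range(10):
        x = torch.randn(samples_per_gpu, 3, 256, 256, device="cuda")
        loss = model(x).abs().mean()
        opt.zero_grad()
        loss.backward()  # DDP all-reduces gradients across all ranks here
        opt.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```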
-
-
@wangruohui
-
It should have nothing to do with iters or epochs. It might have something to do with … Have you tried MMGeneration?
-
I haven't tried it yet, but if the problem cannot be fixed I will switch to MMGeneration. Does the MMEditing code work well on your machine in distributed mode?
-
I don't have a prepared environment at hand, so you may need to wait a while if you want me to give it a try. But actually, since these two models have been supported by MMGeneration for nearly a year, we are considering deprecating them here. So I still suggest you switch to MMGeneration anyway :>
-
Ok, thank you for your advice!
-
@wangruohui
-
I tried, but my task is still pending since I borrowed some machines from others :( Previously I thought the problem lay in your hardware, because SYS is the slowest connection within a node and multi-GPU training over it would be problematic. But you said it works well on action2, so maybe you need to wait until my tasks actually run.
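In the meantime, one way to check whether a slow interconnect (e.g. a SYS link reported by nvidia-smi topo -m) is the bottleneck is to time a bare all-reduce of a gradient-sized tensor, independent of MMEditing. A rough sketch, assuming a single node, the NCCL backend, and a hypothetical file name allreduce_bench.py:
```python
# Time an all-reduce of a gradient-sized tensor; launch with e.g.
#   torchrun --nproc_per_node=8 allreduce_bench.py
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())  # single-node assumption

    # 256 MiB of float32, roughly the gradient volume of a mid-sized model.
    tensor = torch.randn(64 * 1024 * 1024, device="cuda")

    # Warm up NCCL before timing.
    for _ in range(5):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()

    iters = 20
    start = time.time()
    for _ in range(iters):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = (time.time() - start) / iters

    if rank == 0:
        gib = tensor.numel() * 4 / 1024 ** 3
        print(f"all_reduce of {gib:.2f} GiB took {elapsed * 1000:.1f} ms per call")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```
If this per-call time is on the same order as a single training iteration, communication, not compute, is what grows as you add GPUs.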
-
OK, I just got some data.
-
So, is this a bug? I also found that the training time increases with the total batch size (samples_per_gpu * gpu_number).
-
Probably not a bug, just a lack of performance optimization. Smaller models usually suffer more from communication overhead.
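For intuition: what DDP has to all-reduce every iteration is the full set of gradients, roughly 4 bytes per parameter in float32. For a small generator trained with samples_per_gpu=1 the forward/backward pass is quick, so that fixed transfer can dominate on a slow PCIe/SYS link. A back-of-the-envelope sketch (the toy module is a hypothetical stand-in, not the actual pix2pix generator):
```python
# Estimate the per-iteration gradient payload that DDP must all-reduce.
import torch.nn as nn


def grad_payload_mib(model: nn.Module) -> float:
    """MiB of float32 gradients synchronised across GPUs per iteration."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return n_params * 4 / 1024 ** 2


# Hypothetical stand-in; swap in the real generator/discriminator for actual numbers.
toy = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.Conv2d(64, 128, 3, padding=1),
    nn.Conv2d(128, 3, 3, padding=1),
)
print(f"{grad_payload_mib(toy):.1f} MiB of gradients per step")
```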
-
Hi guys, has this problem been solved? I ran into the same issues as mentioned above: more GPUs take longer training time, and a bigger samples_per_gpu takes longer as well.
-
Hi guys, I ran into some issues when trying to use multiple GPUs in distributed mode. I found it is not faster than a single GPU; in fact, more GPUs take longer training time. So I ran an experiment to verify this. Taking the config
configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py
as an example, the estimated training time is 2 h, 6 h, and 8 h for 1 GPU, 2 GPUs, and 8 GPUs respectively. Is this a normal phenomenon (I guess not)? How do I use distributed training correctly to speed up my experiments?
I list the commands that I used below.
1gpu:
python ./tools/train.py configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py
2gpus:
bash ./tools/dist_train.sh configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py 2
8gpus:
bash ./tools/dist_train.sh configs/synthesizers/pix2pix/pix2pix_vanilla_unet_bn_a2b_1x1_219200_maps.py 8
(The corresponding training logs for the 1gpu, 2gpus, and 8gpus runs were attached here.)
So, how can I solve this problem?