
Learning rate under multiple gpus #1165

Closed · Pattorio opened this issue Jul 10, 2018 · 33 comments
@Pattorio

Hi AlexeyAB,

I find that if I set learning_rate=0.001 in the .cfg file and use multiple GPUs, say 4, to train, the learning rate in the log is 0.004, which equals the number of GPUs times the lr set in the cfg file. Here are my questions:

  1. I went through the code but didn't find anything about why the lr in the log is num_gpus * lr. Could you please help me figure it out?

  2. In this situation (using 4 GPUs), what is the actual lr for each GPU, 0.001 or 0.004? Is it different from using 1 GPU and setting lr=0.001?

  3. If the actual lr for each GPU is 0.004, does that mean I have to consider the number of GPUs when I set the lr?

  4. If I want to add a new learning rate policy, what should I do? I have tried the following:
    a) add new_policy into get_policy() in parser.c.
    b) add if(net->policy == new_policy) and the corresponding operations in parse_net_options() in parser.c.
    c) add case new_policy in get_current_rate() in network.c.
    d) add new_policy into the learning_rate_policy enum in darknet.h.
    After doing these, the lr in the log file is exactly the one I set in the .cfg file; it has no relation to the number of GPUs. Is there anything else to add/modify?

Looking forward to hearing from you. Thanks!

@AlexeyAB
Owner

  1. Perhaps this was done simply by mistake ) I don't know why Joseph increased the learning_rate 4x and decreased the loaded images 4x when 4 GPUs are used. More about it: max_batches in multi gpus #1098

  2. learning_rate will be 0.004 for each GPU: https://github.com/pjreddie/darknet/blob/f6d861736038da22c9eb0739dca84003c5a5e275/examples/detector.c#L27

  3. Yes, but usually it works well with 0.004 too

  4. Yes, these places:

@Pattorio
Author

Thank you for your quick response, Alexey!

It took me some time to understand your statement that

Joseph increased the learning_rate 4x and decreased the loaded images 4x when 4 GPUs are used

I read the code and the log; here is my understanding:
First, Joseph increases the learning_rate 4x when using 4 GPUs, which is clear from:
https://github.com/pjreddie/darknet/blob/f6d861736038da22c9eb0739dca84003c5a5e275/examples/detector.c#L27
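For context, a minimal sketch of what that detector.c line does (paraphrased from pjreddie/darknet, not verbatim):

    /* Each per-GPU copy of the network gets the cfg learning rate
     * multiplied by the number of GPUs. */
    for(i = 0; i < ngpus; ++i){
    #ifdef GPU
        cuda_set_device(gpus[i]);
    #endif
        nets[i] = load_network(cfgfile, weightfile, clear);
        nets[i]->learning_rate *= ngpus;   /* 0.001 in cfg -> 0.004 in the log */
    }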

Then, in each iteration, the network trains on batch_in_cfg images. And every 4 iterations, it does a "Sync", which skips 4 * (ngpus - 1) * batch_in_cfg images (it just adds that number of images to net->seen without training on them):
https://github.com/pjreddie/darknet/blob/d3828827e70b293a3045a1eb80bfb4026095b87b/src/network.c#L1078

Therefore, he decreases the loaded images 4x when using multiple (>1) GPUs. Am I right?
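For reference, a paraphrased sketch of the sync code linked above (pjreddie/darknet src/network.c, not verbatim):

    /* Besides averaging the weights of the n per-GPU nets, sync_nets()
     * advances the global image counter for the extra GPUs' share of the
     * work, without actually loading those images. */
    void sync_nets(network **nets, int n, int interval)
    {
        /* ... average weights across the n networks ... */
        *(nets[0]->seen) += interval * (n - 1) * nets[0]->batch * nets[0]->subdivisions;
    }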

Do you have any idea about the variable interval in the code above? Setting learning_rate = ngpus * learning_rate is really confusing.

Besides, I have some other questions:

  1. You said in https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

Increase network-resolution by set in your .cfg-file (height=608 and width=608) or (height=832 and width=832) or (any value multiple of 32) - this increases the precision and makes it possible to detect small objects. if error Out of memory occurs then in .cfg-file you should increase subdivisions=16, 32 or 64

When we do 'test', we usually set batch=1 and subdivisions=1. Why is setting subdivisions=16, 32 or 64 useful? If batch=1 and subdivisions=16, net->batch = 1/16 is not an integer. Why does it work?

  2. Once I forgot to change the values of batch and subdivisions in the cfg file when testing, but the objects in the image were still detected. After changing to batch=1 and subdivisions=1, I tested again and found I got a much better result, with fewer missing/wrong boxes. Why? How do these two variables work when testing?

  3. When training, I get the log as follows.
    image
    I thought it should look like this:

Region 16 ..
Region 23 ..
Region 16 ..
Region 23 ..

One Region 16, one Region 23 - this kind of pattern, on and on.
But as shown above in the blue box, the second line should be Region 23, yet it is missing. Why? Where did it go? When would a region be ignored (not shown in the log)?

So many questions, lol. If you can answer any of them, I would appreciate it!

@AlexeyAB
Owner

  1. In my repo, batch= and subdivisions= are taken from the cfg-file only for Training. In all other cases batch=1 and subdivisions=1 are set automatically (see the sketch after this list):

    network net = parse_network_cfg_custom(cfgfile, 1); // set batch=1

  2. I don't know why that happens. In my repo it should be the same. Can you show screenshots of the difference?

  3. Do you use multi-GPU? Training runs simultaneously, so log lines can appear in a different order due to parallel execution.
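To illustrate item 1, my reading of why batch=1 with subdivisions=16 doesn't break detection here (an interpretation, not a quote of the source):

    /* In the stock parser the per-step mini-batch comes from integer division:
     *     net->batch = batch / subdivisions;   // 1 / 16 == 0 in C
     * which is why batch=1 with subdivisions=16 looks suspicious. But since
     * parse_network_cfg_custom(cfgfile, 1) forces batch=1 (and subdivisions
     * is effectively 1) for anything other than training, the cfg values are
     * simply ignored at test time. */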

@Pattorio
Author

  1. Does it mean that subdivisions is not used when testing?

  2. Sorry, my bad. I tried again. It's the same.

  3. Got it.

  4. One more question: since the net decreases the loaded images 4x when using 4 GPUs, what would be the actual epoch, total_images/batch_in_cfg or total_images/batch_in_cfg * 4?

Thanks!

@sharoseali

Hi AlexeyAB, please answer this: I trained an aeroplane dataset on the same model you defined in your repo. There were 3094 images in the dataset. The error rate stopped decreasing after the 2200th iteration, and when I tested the model it wasn't good enough (it detects, but not accurately).
Please let me know what I should do so that my model gives good results. If I increase the learning rate and continue training from the 2200th iteration, can I get better results?

@Pattorio
Author

Pattorio commented Aug 7, 2018

@sharoseali Hey, I have to say there are many possible reasons for bad detection results. The dataset itself, the learning rate, the number of iterations, the input image size, etc. are all things you should consider. It's not easy to tell you what to do based on your description.

The error rate stopped decreasing after the 2200th iteration

From my experience, continuing training may help. But you must be careful about overfitting.

@AlexeyAB
Owner

AlexeyAB commented Aug 7, 2018

@sharoseali

Thanks Alexey and Pattorio for replying.

Alexey, I now have the 5500th weight file. The link you suggested has several instructions. After making the changes to the cfg file, will I train from scratch or continue from the latest weight file?

@sharoseali

Pattorio, thanks for replying.

The box drawn around the detected aeroplane is very wide in the horizontal dimension, and if there is more than one plane in a test image it also draws wrong bounding boxes. Also, how can I find out whether my model is overfitted?
Thanks again

@AlexeyAB
Owner

AlexeyAB commented Aug 7, 2018

@sharoseali

after making the changes to the cfg file, will I train from scratch

Yes, you should start training from the beginning.

@Pattorio
Author

Pattorio commented Aug 8, 2018

@sharoseali
What do you mean by "very wide in the horizontal dimension"? Any example?

the box drawn around the detected aeroplane is very wide

You can check the anchors in your cfg file. You had better generate anchors for your own dataset;
you can use calc_anchors in this repo:
https://github.com/AlexeyAB/darknet#how-to-improve-object-detection

if there is more than one plane in a test image, it also draws wrong bounding boxes

I think it depends on your training set. If the images in your training set each contain a single object, say one plane per image, you may get results like this. You can add images that contain more than one plane to your training set.

how can I find out whether my model is overfitted?

Alexey has already answered this; see here:
https://github.com/AlexeyAB/darknet#when-should-i-stop-training

@Pattorio
Author

Pattorio commented Aug 8, 2018

@AlexeyAB

Let's continue the previous discussion.

Training under multiple GPUs seems to work like this:

If max_batch = 16, batch = 2.

For 1 GPU, it loads 2 images per iteration. After 4 iterations, it will have loaded 4*2 = 8 images. It takes 16 iterations to finish training (32 images loaded in total).

For 4 GPUs, they load batch*ngpus = 2*4 = 8 images per iteration. Each GPU gets 2 images to train on. After 4 iterations, they will have loaded 4*8 = 32 images.

Oops, now they have finished what 1 GPU would do under this setting - loading 32 images. So they just wait (actually they don't need to wait; they just skip the next 3 iterations) and say "we finished" by the 4th iteration (using "Sync").

Therefore, under the same setting, 1-GPU mode and 4-GPU mode load the same number of images.

For 4 GPUs: each GPU loads 1/4 of the images compared to the GPU in 1-GPU mode. And in each iteration, the 4 GPUs load 4x images to train on, which can be viewed as 4-GPU mode having a 4x batch size.
According to this: https://miguel-data-sc.github.io/2017-11-05-first/

For the ones unaware, general rule is “bigger batch size bigger learning rate”

So the learning rate *= ngpus.
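To make the arithmetic concrete, a toy illustration (mine, not darknet code):

    #include <stdio.h>

    int main(void)
    {
        int   batch_cfg = 2, ngpus = 4;
        int   effective_batch = batch_cfg * ngpus; /* 8 images per iteration */
        float lr_cfg  = 0.001f;
        float lr_used = lr_cfg * ngpus;            /* 0.004, as printed in the log */
        printf("effective batch = %d, lr = %g\n", effective_batch, lr_used);
        return 0;
    }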

Is this explanation reasonable?

Also, I ran an experiment; the y-label is loss. The yellow line is the training result under 4-GPU mode with cfg_lr. The pink line is the result under 4-GPU mode with lr *= ngpus. Ignore the others.

The 4x lr gives a better result.

image

@AlexeyAB
Owner

AlexeyAB commented Aug 8, 2018

@Pattorio

If max_batch = 16 and batch = 2, then

  • For 1 GPU: 32 images will be loaded in total, and learning_rate = lr_cfg

  • For 4 GPUs: 8 images will be loaded in total, and learning_rate = 4*lr_cfg

More: #1098

So maybe, according to this rule "bigger batch size bigger learning rate", it uses a 4x higher learning_rate, and to compensate, it loads 4x fewer images

According to this: https://miguel-data-sc.github.io/2017-11-05-first/

For the ones unaware, general rule is “bigger batch size bigger learning rate”

@sharoseali

@AlexeyAB
Sir, you suggested I read the section How to improve object detection. The problem is that I am using YOLOv2 and the changes suggested there relate to YOLOv3. How can I change the YOLOv2 cfg file, i.e. yolo-obj.cfg, to detect small objects?

How can I apply these instructions:

  • training for small objects - set layers = -1, 11 instead of .........
    (when I change this, the training doesn't start and ends up giving the message "no error")
    To detect small objects and increase precision, you suggest:

  • and set stride=4 instead of ..........
    (in YOLOv3 this change is under the [upsample] layer, which is not in yolo-obj.cfg)

  • recalculate anchors for your dataset for width and height from cfg-file: darknet.exe detector calc_anchors data/obj.data -num_of_clusters 9 -width 416 -height 416 then set the same 9 anchors in each of 3 [yolo]-layers in your cfg-file
    (if I increase the resolution to 608 x 608, should I use -width 416 -height 416 or -width 608 -height 608 in the command above?)

Please explain so that I can start my training ASAP.

@sharoseali

@AlexeyAB
Sir, one more thing to ask: I have to show a confusion matrix (a map between my training and testing values during the training process), which shows whether the model is over- or under-fitted.
How can I draw it? Thanks

@Pattorio
Author

Pattorio commented Aug 9, 2018

@AlexeyAB

Only 8 images in total for 4 GPUs? Why?

The log file shows that 4 GPUs load 4 times as many images as 1 GPU. And the code in detector.c also shows that the total number of images is ngpus times that of 1 GPU:

int imgs = net.batch * net.subdivisions * ngpus;

@AlexeyAB
Owner

AlexeyAB commented Aug 9, 2018

@Pattorio

For 4 GPUs the iteration counter will be increased by 16 instead of 4:
*nets[0].seen += interval * n * nets[0].batch * nets[0].subdivisions = 4 * 4 * batch_from_cfg

with interval = 4 and n = 4 (ngpus).
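Worked numbers for that line (assuming batch_from_cfg = 2 and subdivisions = 1 for illustration):

    int interval = 4, n = 4, batch = 2, subdivisions = 1;
    int seen_bump = interval * n * batch * subdivisions;  /* 32 images            */
    int iter_bump = seen_bump / (batch * subdivisions);   /* 16 iterations, not 4 */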

@AlexeyAB
Owner

AlexeyAB commented Aug 9, 2018

@sharoseali

Calculating anchors for YOLOv2 (YOLOv2 anchors are given in units of the final 13x13 feature map, i.e. 416/32, hence -width 13 -height 13):

darknet.exe detector calc_anchors data/obj.data -num_of_clusters 5 -width 13 -height 13


For yolov2-voc.cfg, for detecting small objects:

But YOLOv3 is much better at detecting small objects.

@Pattorio
Author

Pattorio commented Aug 13, 2018

@AlexeyAB

If max_batch = 16, batch = 2.

For 1 GPU, it loads 2 images per iteration. After 4 iterations, it will have loaded 4*2 = 8 images. It takes 16 iterations to finish training (32 images loaded in total).

For 4 GPUs, they load batch x ngpus = 2x4 = 8 images per iteration. Each GPU gets 2 images to train on. After 4 iterations, they will have loaded 4*8 = 32 images.

  1. Am I right?

  2. I know that for 4 GPUs the iteration counter is increased by 16 instead of 4 via
    *nets[0].seen += interval * n * nets[0].batch * nets[0].subdivisions = 4 * 4 * batch_from_cfg. Am I right if I explain it this way:

After 4 iterations, the 4 GPUs have finished what 1 GPU would do under this setting - loading 32 images. So they just wait (actually they don't need to wait; they just skip the next 12 iterations using *nets[0].seen += interval * (ngpus-1) * nets[0].batch * nets[0].subdivisions = 4 * (4-1) * batch_from_cfg) and say "we finished" by the 16th iteration.

If I am right in both 1 and 2, then even though 4 GPUs advance the iteration counter by 16, they load 4x the images of 1 GPU per iteration. Therefore, 1-GPU mode and 4-GPU mode should load the same number of images.

For 1 gpu:
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 finish (32 images in total)
For 4 gpus:
8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 finish (32 images in total)
"0" means *nets[0].seen += interval * (ngpus-1) * nets[0].batch * nets[0].subdivisions = 4 * (4-1) * batch_from_cfg)

@AlexeyAB
Owner

@Pattorio

If batch=2

For 4 gpus, they load batchngpus = 24=8 images in one iteration. Each gpu will get 2 images to train. After 4 iterations, they will load 4*8 = 32 images.

No.

  • For 4 GPUs it loads 8 images (batch x ngpus) in one iteration:

    int imgs = net.batch * net.subdivisions * ngpus;

  • these 8 images are divided among the 4 GPUs - 2 images for each GPU:

  • for(i = 0; i < n; ++i){
        data p = get_data_part(d, i, n);
        threads[i] = train_network_in_thread(nets[i], p, errors + i);
    }

  • darknet/src/data.c, lines 1267 to 1278 in a9fef1b:

        data get_data_part(data d, int part, int total)
        {
            data p = {0};
            p.shallow = 1;
            p.X.rows = d.X.rows * (part + 1) / total - d.X.rows * part / total;
            p.y.rows = d.y.rows * (part + 1) / total - d.y.rows * part / total;
            p.X.cols = d.X.cols;
            p.y.cols = d.y.cols;
            p.X.vals = d.X.vals + d.X.rows * part / total;
            p.y.vals = d.y.vals + d.y.rows * part / total;
            return p;
        }
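A worked example of that split (assuming d holds the 8 loaded images and total = 4 GPUs):

    /* part 0 -> rows [0, 2), part 1 -> rows [2, 4),
     * part 2 -> rows [4, 6), part 3 -> rows [6, 8)  -- 2 images per GPU.
     * p.shallow = 1 makes each part alias d's arrays instead of copying them. */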

@Pattorio
Author

@AlexeyAB

Sorry, too busy these days to reply.

For 4 GPUs it loads 8 images (batch x ngpus) in one iteration.

These 8 images are divided among the 4 GPUs - 2 images for each GPU.

Yes, that's what I meant!
In this case, after 4 iterations, 1-GPU mode will have loaded 8 images in total, and 4-GPU mode will have loaded 32 images in total (8 images per GPU), right?

Then, over the next 12 iterations, 1-GPU mode loads 2 images per iteration. The log looks like this:
image

However, according to *nets[0].seen += interval * (ngpus-1) * nets[0].batch * nets[0].subdivisions = 4 * (4-1) * batch_from_cfg in the multi-GPU case, 4-GPU mode will skip the next 12 iterations. The log looks like this:
image

After 16 iterations, 1-GPU mode and 4-GPU mode have both loaded the same number of images.

I think your statement "For 4 GPUs the iterations will be increased by 16 instead of 4" is the same as my statement "4-GPU mode will skip the next 12 iterations".

In my understanding, "for 4 GPUs the iterations will be increased by 16 instead of 4" is just a way of counting how much work has been done. We have already set max_batches, say 32. Increasing by 16 means it finishes the work faster, i.e. when 1 GPU has done 4/32, 4 GPUs have done 16/32. But after they finish 32/32, the total number of images they have loaded is the same.

If we treat the 4 GPUs as one unit, it finishes training 4 times faster than 1-GPU mode and loads 4 times the images of 1-GPU mode each iteration. Based on the general rule "bigger batch size bigger learning rate", 4-GPU mode actually has a bigger batch size; therefore, lr *= ngpus.

I just missed the '*' in my previous reply:

For 4 gpus, they load batchngpus = 24=8 images in one iteration. Each gpu will get 2 images to train. After 4 iterations, they will load 4*8 = 32 images.

It should be:

For 4 gpus, they load batch*ngpus = 2*4=8 images in one iteration. Each gpu will get 2 images to train. After 4 iterations, they will load 4*8 = 32 images.

@AlexeyAB
Owner

@Pattorio

1xGPU
image

4xGPU
image

These images show the same number of loaded images, 6144. But actually 6144 images were loaded for 1 GPU, while only 1536 images were loaded for 4xGPU.

I think your statement "For 4 GPUs the iterations will be increased by 16 instead of 4" is the same as my statement "4-GPU mode will skip the next 12 iterations".

Yes.

4xGPU trains 4x faster, but the iteration counter increases 16x faster.

For 1 gpu:
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 finish (32 images in total)
For 4 gpus:
8 8 8 8 0 0 0 0 0 0 0 0 0 0 0 0 finish (32 images in total)

If mini_batch=2, then for 4xGPU each GPU can't process 8 images in one iteration, because there isn't enough GPU-RAM for it. It is exactly to avoid CUDA out-of-memory errors that we use mini_batch = batch/subdivisions. And if we find that the maximum mini_batch is 2, then 2 will be used for both 1xGPU and 4xGPU.
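Concrete numbers for that constraint (illustrative values, not from the thread):

    /* If at most 2 images fit in GPU-RAM at once, pick subdivisions so that
     * batch / subdivisions == 2, e.g.: */
    int batch = 64, subdivisions = 32;
    int mini_batch = batch / subdivisions;  /* 2 images per forward-backward pass,
                                               per GPU, for 1xGPU and 4xGPU alike */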

@Pattorio
Author

Hey Alexey, thank you for your patience and response.

I am still a little confused; I am afraid I have misunderstood some of the code.

These images show the same number of loaded images, 6144. But actually 6144 images were loaded for 1 GPU, while only 1536 images were loaded for 4xGPU.

If that is true, it means that with the same max_batches in the cfg file, 4 GPUs will load 4 times fewer images than 1 GPU. I don't think that makes sense.

I understand the use of mini_batch. In both 1-GPU mode and 4-GPU mode, the network loads mini_batch = batch/subdivisions images for one forward pass.

https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L314-L328
https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L289-L298
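A paraphrased sketch of the lines linked above (not verbatim):

    float train_network(network *net, data d)
    {
        int batch = net->batch;      /* mini_batch = batch_cfg / subdivisions  */
        int n = d.X.rows / batch;    /* = subdivisions forward-backward passes */
        float sum = 0;
        int i;
        for(i = 0; i < n; ++i){
            get_next_batch(d, batch, i*batch, net->input, net->truth);
            sum += train_network_datum(net); /* forward + backward; the weight
                                                update fires once per full batch */
        }
        return (float)sum / (n*batch);
    }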

In total, each GPU does forward_network subdivisions times and then updates the network.
For 4 GPUs, after each of them updates the network, they have to sync_nets:

https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L1072-L1089

In this way, in one iteration, each GPU loads batch images and uses mini_batch images per forward_network call, subdivisions times. The 4 GPUs load 4*batch images in total.
Therefore, 6144 images were loaded for 1 GPU, and the same number for 4xGPU.

Have I misunderstood the code somewhere?

@AlexeyAB
Owner

@Pattorio

In total, each GPU does forward_network subdivisions times and then updates the network.
For 4 GPUs, after each of them updates the network, they have to sync_nets

https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L1072-L1089

In this way, in one iteration, each GPU loads batch images and uses mini_batch images per forward_network call, subdivisions times. The 4 GPUs load 4*batch images in total.

Yes. This is correct for 1 iteration. But it isn't correct for 1000 iterations.

Try a little experiment: change the source code so that the iteration counter increases by 100 instead of 1. Then everything you said remains true, but in total 100x fewer images will be loaded.

4xGPU increases the iteration counter by +16 instead of +4 - this is the main thing.
The training log for 4xGPU is incorrect.

@Pattorio
Author

@AlexeyAB

I did some experiments to find the best parameters when training with multiple GPUs.

I chose a small training set: about 90 images with 1 class.

For 1 GPU, I set the parameters in the cfg file this way.
image
And got this result:
image

Then, in the following experiments I used 4 GPUs.

DEFAULT-LR
First, I changed nothing in the cfg file and used 4 GPUs to train.
image
However, it can detect nothing.
image

4x-LESS-LR
Following your recommendation in issue #1456, I first decreased learning_rate by 4 times and kept max_batches unchanged.
image
After 2000 iterations, it can detect something, which looks good.
image

4x-LESS-LR-4X-MORE-ITERATIONS
Then I increased both burn_in and max_batches by 4 times and decreased learning_rate by 4 times, as you said in #1456.
image
I got an even better result than with 1 GPU.
image

4x-MORE-ITERATIONS
I thought it might be useful to keep learning_rate unchanged and just train for more iterations, so I increased both burn_in and max_batches by 4 times.
image
However, it didn't converge.
image
image

It seems that keeping learning_rate unchanged in multi-GPU training is not a good choice.

4x-LESS-LR-4x-MORE-ITERATION-DEFAULT-BURN_IN
Finally, I wanted to figure out whether the burn_in parameter helps in training. I decreased learning_rate and increased max_batches, but kept burn_in unchanged.
image
Not so good.
image

Conclusion from my experiments: for training with multiple GPUs, you may get good results by increasing both burn_in and max_batches by ngpus times and decreasing learning_rate by ngpus times.

The above experiments are for reference. You could add some tips about training with multiple GPUs to the Readme.

@Pattorio
Author

4xGPU increases the iteration counter by +16 instead of +4 - this is the main thing.

Oh! I think I got it. If it were increased by +4, 4-GPU mode and 1-GPU mode would load the same number of images, right?

@AlexeyAB
Owner

AlexeyAB commented Sep 14, 2018

Oh! I think I got it. If it were increased by +4, 4-GPU mode and 1-GPU mode would load the same number of images, right?

Yes.


Conclusion from my experiments: for training with multiple GPUs, you may get good results by increasing both burn_in and max_batches by ngpus times and decreasing learning_rate by ngpus times.

The above experiments are for reference. You could add some tips about training with multiple GPUs to the Readme.

Yes, there is already a note about the 4x lower learning rate.
I will add a note about 4x more burn_in= and max_batches=.

https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu

Adjust the learning rate (cfg/yolov3-voc.cfg) to fit the amount of GPUs. The learning rate should be equal to 0.001, regardless of how many GPUs are used for training. So learning_rate * GPUs = 0.001. For 4 GPUs adjust the value to learning_rate = 0.00025.
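Putting the README quote together with this thread's findings, a hypothetical cfg fragment for 4 GPUs might look like this (the single-GPU baseline values here are assumed for illustration: learning_rate=0.001, burn_in=1000, max_batches=8000):

    # hypothetical 4-GPU settings (assumed single-GPU baseline:
    # learning_rate=0.001, burn_in=1000, max_batches=8000)
    learning_rate=0.00025
    burn_in=4000
    max_batches=32000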

@Pattorio
Author

That's great!

Thank you soooooo much! You are so patient!

@Pattorio
Author

Hey Alexey! After my experiments, I find that for a large dataset (about 240 thousand images in my case), a 4x lower learning rate can make convergence too slow. It's not wise to use a 4x lower learning rate in that case.

Try an ngpus-times-lower learning rate when using a small dataset. For a large dataset, choose the learning rate wisely depending on the dataset.

@Eniac-Xie

Eniac-Xie commented Dec 9, 2018

@Pattorio Hey, have you tried warming up the model with a 4x lower learning rate for a few iterations, then using a larger learning rate for further training?

According to the Linear Scaling Rule in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", multiple GPUs (a larger batch size) should use a larger learning rate.

Besides, when we set batch=64 and subdivisions=16, each GPU only takes 4 images per forward pass, am I right? If so, I think 4 images are not enough to compute the mean and std in a BatchNorm layer; should we consider Synchronized BatchNorm as in "MegDet: A Large Mini-Batch Object Detector"?

@AlexeyAB
Owner

AlexeyAB commented Dec 9, 2018

@Eniac-Xie Hi,

Besides, when we set batch=64 and subdivisions=16, each GPU only takes 4 images per forward pass, am I right?

No. In this case each GPU uses mini-batch = 4 images for forward-backward, and batch = 64 images for weight updates.
And the 4 different weight arrays (one per GPU) are synchronized every iteration:

sync_nets(nets, n, interval);

But maybe yes - in this case we should use a 4x learning_rate, according to this article: https://arxiv.org/abs/1706.02677v2
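Worked numbers for this setting (batch=64, subdivisions=16, ngpus=4; my summary of the answer above):

    /* mini_batch = 64 / 16 = 4 images per forward-backward pass on each GPU;
     * each GPU accumulates the full batch of 64 before its weight update;
     * sync_nets() then reconciles the 4 per-GPU weight arrays.
     * Effective images per synchronized update = 64 * 4 = 256, which is why
     * the linear-scaling rule (arXiv:1706.02677) suggests a ~4x learning rate. */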

@AlexeyAB
Owner

AlexeyAB commented Dec 9, 2018

@Pattorio Hi,

I changed this in the Readme as a recommendation for small datasets: https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu

@aidevmin

@Pattorio
Thanks for the deep investigation.
As you mentioned above:

Conclusion from my experiments: for training with multiple GPUs, you may get good results by increasing both burn_in and max_batches by ngpus times and decreasing learning_rate by ngpus times.

As I understand it: batch and subdivisions are unchanged; max_batches = #GPUs * max_batches (for 1 GPU); lr = lr (for 1 GPU) / #GPUs; burn_in = #GPUs * burn_in (for 1 GPU). Is that correct?

Can you confirm that with the above config, the training time for 4 GPUs is shorter than the training time for 1 GPU?
