Learning rate under multiple GPUs #1165
Thank you for your quick response, Alexey! It took me some time to understand your point that
I read the code and the log; here is my understanding: … Then, for each iteration, the network will train … Therefore, he divides loaded_images by 4 when using multiple (>1) GPUs. Am I right? Do you have any idea about the variable …? Besides, I have some other questions:
When we do 'test', we usually set batch=1 and subdivisions=1. Why is setting subdivisions=16, 32 or 64 useful? If batch=1 and subdivisions=16, net->batch = 1/16 is not an integer. Why does it work? (See the integer-division sketch after this comment.)
One region 16, one region 23, this kind of pattern, on and on. So many questions, lol. If you can answer any of them, I would appreciate it!
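Regarding the batch/subdivisions question above: here is a minimal, self-contained sketch of the integer arithmetic involved, assuming the cfg parser divides batch by subdivisions (as parse_net_options() in parser.c appears to do) and that test mode overrides the batch size via set_batch_network(). Both assumptions are for illustration; the exact behavior may differ between darknet forks.

```c
#include <stdio.h>

/* Illustration only: how batch / subdivisions behaves with C integer division.
   Assumes the parser computes the per-forward mini-batch as batch / subdivisions. */
int main(void)
{
    int batch = 1, subdivisions = 16;

    /* Integer division: 1 / 16 == 0, so the parsed mini-batch would be 0
       if these cfg values were used directly for training. */
    int mini_batch = batch / subdivisions;
    printf("mini_batch = %d\n", mini_batch);   /* prints 0 */

    /* At test/detection time darknet typically calls set_batch_network(net, 1),
       which forces batch = 1 for every layer, so the cfg values above stop
       mattering for inference (assumption based on common usage). */
    return 0;
}
```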
Thanks!
Hi, please answer this, AlexeyAB: I trained an aeroplane data set on the same model as defined in your repo. There were 3094 images in the data set. The error rate stopped decreasing after the 2200th iteration. When I test the model, it isn't good enough (it detects, but not very accurately).
@sharoseali Hey, I have to say there are many reasons that can cause bad detection results. The data set itself, the learning rate, the number of iterations, the input image size, etc. are all things you should consider. It's not easy to tell you what to do based on your description alone.
In my experience, continuing training may help, but you must be careful about overfitting.
@sharoseali Just train for about 6000 - 10 000 iterations
Thanks Alexey and Pattorio for replying. Alexey, I now have the 5500th weight file. The link you suggested has several instructions. After making the changes in the cfg file, will I perform training from the start, or continue with the latest weight file?
Pattorio, thanks for replying. The box drawn around the detected aeroplane is very wide in the horizontal dimension, and if there is more than one plane in a test image, it also draws wrong bounding boxes. Also, how can I find out whether my model is overfitted?
Yes, you should start training from the beginning.
@sharoseali
You can check your anchors in the cfg file. You had better generate anchors for your own data set.
I think it depends on your training set. If the images in your training set each contain a single object, say one plane per image, you may get results like this. You can add images that contain more than one plane to your training set.
Alexey has replied to you; the answer is here:
Let's continue the previous discussion. Training under multiple GPUs looks like this: if max_batch = 16 and batch = 2, then for 1 GPU it loads 2 images in one iteration; after 4 iterations it will have loaded … For 4 GPUs, they load … Oops, now they have finished what 1 GPU should do under this setting: loading 32 images. So they just wait (actually they don't need to wait; they just skip the next 3 iterations) and say "we finished" at the 4th iteration (using "Sync"). Therefore, under the same setting, 1-GPU mode and 4-GPU mode load the same number of images. For 4 GPUs: each GPU will load 1/4 of the images compared to the GPU in 1-GPU mode, and in each iteration the 4 GPUs load 4x the images to train on, which can be seen as 4-GPU mode having a 4x batch size.
So the learning rate *= ngpus. Is this explanation reasonable? … I did an experiment; the y-label is loss. The yellow line is the training result under 4-GPU mode with the cfg learning rate. The pink line is the result under 4-GPU mode with lr *= ngpus. Ignore the others. The 4x lr gives a better result.
If
More: #1098 So maybe, according to this rule ("bigger batch size, bigger learning rate"), it uses a 4x higher learning_rate, and to compensate for it, it loads 4x fewer images.
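To make the "bigger batch size, bigger learning rate" rule concrete, here is a tiny illustrative sketch of the linear scaling idea (the same rule as in the paper linked later in this thread). The variable names and values are just examples, not taken from the darknet source.

```c
#include <stdio.h>

/* Linear scaling rule (illustration only): if the effective batch size grows
   by a factor k (e.g. k = number of GPUs), scale the base learning rate by k. */
int main(void)
{
    float base_lr = 0.001f;   /* learning_rate from the cfg file */
    int   ngpus   = 4;        /* effective batch is ~4x larger with 4 GPUs */

    float scaled_lr = base_lr * (float)ngpus;
    printf("scaled learning rate: %f\n", scaled_lr);   /* 0.004 */
    return 0;
}
```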
@AlexeyAB How can I manage these instructions:
Please explain so that I can start my training ASAP.
@AlexeyAB
4 GPUs load only 8 images in total? Why? The log file shows that 4 GPUs will load 4 times as many images as 1 GPU. And the code in detector.c also shows that the total number of images is ngpus times that of 1 GPU.
For 4 GPUs the iterations will be increased by 16 instead of 4: int interval = 4
Calculating anchors for yolo v2:
For … But yolov3 is much better at detecting small objects.
If I am right in both 1 and 2, then even though 4 GPUs increase the iteration counter by 16, they load 4x the images of 1 GPU in one iteration. Therefore, 1-GPU mode and 4-GPU mode should load the same number of images. For 1 GPU:
If
No.
These images show that the same number of images was loaded.
Yes. 4x GPUs train 4x faster, but the iteration counter increases 16x faster.
If
Hey Alexey, thank you for your patience and response. I am still a little confused. I am afraid I have some misunderstanding of the code.
If that is true, it means that with the same max_batches in the cfg file, 4 GPUs will load 4 times fewer images than 1 GPU. I don't think that makes sense. I understand the use of mini_batch. In both the 1-GPU model and the 4-GPU model, the network will load mini_batch = batch/subdivisions images for one forward pass. https://github.com/pjreddie/darknet/blob/680d3bde1924c8ee2d1c1dea54d3e56a05ca9a26/src/network.c#L314-L328 In total, each GPU will do forward_network subdivisions times and then update the network. In this way, in one iteration each GPU loads batch images and uses mini_batch images per forward_network call, subdivisions times. 4 GPUs will load 4*batch images in total. Do I have any misunderstanding of the code?
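To restate the loop described in this comment, here is a self-contained simulation of one training iteration on a single GPU: subdivisions forward/backward passes of mini_batch = batch/subdivisions images each, followed by a single weight update. The names are illustrative and this is not the real darknet API, just the counting logic being discussed.

```c
#include <stdio.h>

/* Simulation of one darknet-style training iteration (per GPU):
   `subdivisions` forward/backward passes of mini_batch images each,
   then one weight update. Illustrative only. */
int main(void)
{
    int batch = 64, subdivisions = 16;
    int mini_batch = batch / subdivisions;   /* 4 images per forward pass */

    int images_loaded = 0, forward_passes = 0, weight_updates = 0;

    for (int i = 0; i < subdivisions; ++i) {
        /* forward + backward on mini_batch images; gradients accumulate */
        images_loaded += mini_batch;
        forward_passes += 1;
    }
    weight_updates += 1;   /* weights are updated once per iteration */

    printf("images loaded:  %d\n", images_loaded);   /* 64 */
    printf("forward passes: %d\n", forward_passes);  /* 16 */
    printf("weight updates: %d\n", weight_updates);  /* 1  */
    return 0;
}
```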
Yes. This is correct for 1 iteration, but it isn't correct for 1000 iterations. Make a little experiment: change the source code so that the iterations are increased by 100 instead of 1. Then everything you said is still true, but in total 100x fewer images will be loaded. 4x GPUs increase the iteration counter by +16 instead of +4 - this is the main thing.
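A quick way to see the effect of the iteration counter: assuming, as stated above, that one 4-GPU step loads 4x batch images but advances the displayed counter by 16 instead of 4, the same max_batches ends up corresponding to 4x fewer loaded images. A minimal arithmetic sketch (the numbers are examples, not from the source):

```c
#include <stdio.h>

/* Arithmetic sketch based on the statements in this thread:
   - 1 GPU:  each step loads `batch` images, counter advances by 1
   - 4 GPUs: each step loads 4 * batch images, counter advances by 16 */
int main(void)
{
    int batch = 64, max_batches = 16000;

    long images_1gpu = (long)(max_batches / 1)  * batch;        /* 1,024,000 */
    long images_4gpu = (long)(max_batches / 16) * (4 * batch);  /*   256,000 */

    printf("1 GPU : %ld images\n", images_1gpu);
    printf("4 GPUs: %ld images (4x fewer)\n", images_4gpu);
    return 0;
}
```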
I did some experiments to find the best parameters when training with multiple GPUs. I chose a small training set: about 90 images with 1 class. For 1 GPU, I set the parameters in the cfg file this way: … Then, in the following experiments, I used 4 GPUs:
DEFAULT-LR
4x-LESS-LR
4x-LESS-LR-4X-MORE-ITERATIONS
4x-MORE-ITERATIONS
It seems that keeping …
4x-LESS-LR-4x-MORE-ITERATION-DEFAULT-BURN_IN
Conclusion from my experiments: for training with multiple GPUs, you might get a good result by increasing both … The experiments above are for reference. You could add some tips about training with multiple GPUs to the Readme.
Oh! I think I got it. If it is increased by +4, the 4-GPU model and the 1-GPU model will load the same number of images, right?
Yes.
Yes, there is already about a 4x lower learning rate: https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
That's great! Thank you soooooo much! You are so patient! |
Hey Alexey! After my experiments, I find that for a large dataset (about 240 thousand images in my case), a 4x lower learning rate might make convergence too slow. It's not so wise to use a 4x lower learning rate in this case. Try …
@Pattorio Hey, have you tried warming up the model with a 4x lower learning rate for a few iterations, then using a larger learning rate for further training? According to the Linear Scaling Rule in "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour", multiple GPUs (a larger batch size) should use a larger learning rate. Besides, when we set batch=64 and subdivisions=16, each GPU only takes 4 images per forward pass, am I right? If so, I think 4 images are not enough to compute the mean and std in the BatchNorm layers; should we consider Synchronized BatchNorm as in "MegDet: A Large Mini-Batch Object Detector"?
@Eniac-Xie Hi,
No. In this case each GPU uses mini-batch = 4 images for forward-backward, and batch = 64 images for weight updates. darknet/src/network_kernels.cu Line 414 in 7bd6b2f
But maybe yes; in this case we should use a 4x learning_rate, according to this article: https://arxiv.org/abs/1706.02677v2
@Pattorio Hi, I changed this in the Readme as a recommendation for small datasets: https://github.com/AlexeyAB/darknet#how-to-train-with-multi-gpu
@Pattorio
As I understand it, … Can you confirm that, with the above config, the training time in the 4-GPU case is smaller than the training time with 1 GPU?
Hi AlexeyAB,
I find that if I set learning_rate=0.001 in the .cfg file and use multiple GPUs, say 4, to train, the learning rate in the log is 0.004, which equals the number of GPUs times the lr set in the cfg file. Here are my questions:
I went through the code but didn't find anything about why the lr in the log would be num_gpus * lr. Could you please help me figure it out?
Under this situation (using 4 GPUs), what is the actual lr for each GPU, 0.001 or 0.004? Is it different from using 1 GPU and setting lr=0.001?
If the actual lr for each GPU is 0.004, does it mean that I have to take the number of GPUs into account when I set the lr?
If I want to add a new learning rate policy, what should I do? I have tried the following things:
a) add new_policy into get_policy() in parser.c
b) add if(net->policy == new_policy) and the corresponding operations in parse_net_options() in parser.c
c) add case new_policy in get_current_rate() in network.c
d) add new_policy into the learning_rate_policy struct in darknet.h
After doing these, the lr in the log file is just the one I set in the .cfg file. It has no relation to the number of GPUs. Anything else to add/modify? (See the sketch of step (c) after this comment.)
Looking forward to hearing from you. Thanks!
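For what it's worth, here is a standalone sketch of what step (c) could look like: one more case in a get_current_rate()-style switch. NEW_POLICY and its decay formula are hypothetical placeholders; only the STEP case mirrors an existing darknet policy. Note also that, judging from the num_gpus * lr value reported in the log, the multi-GPU training path appears to scale the parsed learning_rate by the number of GPUs somewhere in the setup code (e.g. in detector.c), so a policy added only in get_current_rate() would not by itself remove that factor; that interpretation is an assumption based on the observed log, not on a specific line of code.

```c
#include <stdio.h>
#include <math.h>

/* Standalone illustration of adding a case to a get_current_rate()-style switch.
   NEW_POLICY and its formula are hypothetical; STEP mirrors darknet's existing
   policy lr * scale^(iteration / step). Compile with: gcc lr_policy.c -lm */
typedef enum { CONSTANT, STEP, NEW_POLICY } policy;

float current_rate(policy p, float learning_rate, float scale, int step, int batch_num)
{
    switch (p) {
        case CONSTANT:
            return learning_rate;
        case STEP:
            return learning_rate * powf(scale, (float)(batch_num / step));
        case NEW_POLICY:
            /* hypothetical: halve the learning rate every `step` iterations */
            return learning_rate * powf(0.5f, (float)(batch_num / step));
        default:
            return learning_rate;
    }
}

int main(void)
{
    /* after 2500 iterations with step = 1000: 0.001 * 0.5^2 = 0.00025 */
    printf("%f\n", current_rate(NEW_POLICY, 0.001f, 0.0f, 1000, 2500));
    return 0;
}
```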