Question on Learning Rate Scaling #25

Open
mahfuzalhasan opened this issue May 10, 2024 · 0 comments

Hello,
First, thank you for the clean implementation, especially the model code. I have a few questions about training the Focal Transformer.

In main.py, the learning rate is scaled for DDP training as
`linear_scaled_lr = config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * dist.get_world_size() / 512.0`.
`BASE_LR` is set to 5e-4 in the config, `BATCH_SIZE` (per GPU) = 128, and `dist.get_world_size()` = 8 (8 NVIDIA A100s according to the paper). So, ultimately,
`linear_scaled_lr = (5e-4 * 128 * 8) / 512 = 5e-4 * 2 = 1e-3`. I assume that is why the learning rate is reported as 1e-3 in the paper.
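For reference, here is a minimal sketch of how I read that scaling step, with the values hard-coded instead of taken from the config and `dist` at runtime:

```python
# Minimal sketch of the linear scaling rule as I read it from main.py.
# The constants below are the reported 8x A100 settings, hard-coded for clarity.
base_lr = 5e-4        # config.TRAIN.BASE_LR
batch_per_gpu = 128   # config.DATA.BATCH_SIZE
world_size = 8        # dist.get_world_size()

linear_scaled_lr = base_lr * batch_per_gpu * world_size / 512.0
print(linear_scaled_lr)  # 0.001, i.e. the 1e-3 reported in the paper
```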

I have two questions in this regard.

  1. Is there a particular reason for using 512 as the normalization factor? I found the same constant in the Swin repo and a few other repos, but I could not find an explanation for it.
  2. Currently I am trying to use a variant of the Focal Transformer for my research. If I train with 4 GPUs and 128 samples per GPU (total batch size 512), do I need to change anything about the learning rate scaling? (See the sketch after this list for how the formula would evaluate in my setup.)
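To make the second question concrete, this is what the unmodified scaling rule would give for my setup (again with values hard-coded rather than read from the config):

```python
# Same formula, evaluated for my setup: 4 GPUs, 128 samples per GPU.
base_lr = 5e-4
batch_per_gpu = 128
world_size = 4

linear_scaled_lr = base_lr * batch_per_gpu * world_size / 512.0
print(linear_scaled_lr)  # 0.0005, i.e. the base LR is unchanged since the total batch is exactly 512
```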
It would be great if you could kindly guide me on the above queries.

Thank you.
