Question on Learning Rate Scaling #25

Open
mahfuzalhasan opened this issue May 10, 2024 · 0 comments

Hello,
First, thank you for the clean implementation, especially the model code. I have a few questions about training the Focal Transformer.

In main.py, the learning rate is scaled for DDP training as
`linear_scaled_lr = config.TRAIN.BASE_LR * config.DATA.BATCH_SIZE * dist.get_world_size() / 512.0`.
`BASE_LR` is set to 5e-4 in the config, `BATCH_SIZE` (per GPU) = 128, and `dist.get_world_size()` = 8 (8 NVIDIA A100s according to the paper). So, ultimately,
`linear_scaled_lr = (5e-4 * 128 * 8) / 512 = 5e-4 * 2 = 1e-3`. I assume that is why the learning rate is reported as 1e-3 in the paper.
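For reference, here is a minimal sketch of how I read that scaling step, with the values hard-coded instead of taken from the config and `dist` at runtime:

```python
# Minimal sketch of the linear scaling rule as I read it from main.py.
# The constants below are the reported 8x A100 settings, hard-coded for clarity.
base_lr = 5e-4        # config.TRAIN.BASE_LR
batch_per_gpu = 128   # config.DATA.BATCH_SIZE
world_size = 8        # dist.get_world_size()

linear_scaled_lr = base_lr * batch_per_gpu * world_size / 512.0
print(linear_scaled_lr)  # 0.001, i.e. the 1e-3 reported in the paper
```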

I have two questions in this regard.

  1. Is there a particular reason for using 512 as the normalization factor? I found the same constant in the Swin repo and a few other repos, but I could not find an explanation for it.
  2. Currently I am trying to use a variant of the Focal Transformer for my research. If I train with 4 GPUs and 128 samples per GPU (total batch size 512), do I need to change anything about the learning rate scaling? (See the sketch after this list for how the formula would evaluate in my setup.)
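To make the second question concrete, this is what the unmodified scaling rule would give for my setup (again with values hard-coded rather than read from the config):

```python
# Same formula, evaluated for my setup: 4 GPUs, 128 samples per GPU.
base_lr = 5e-4
batch_per_gpu = 128
world_size = 4

linear_scaled_lr = base_lr * batch_per_gpu * world_size / 512.0
print(linear_scaled_lr)  # 0.0005, i.e. the base LR is unchanged since the total batch is exactly 512
```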
It would be great if you could kindly guide me on the above queries.

Thank you.
