Learning Rate Scheduling with Multi-GPU, model sharding and gradient accumulation #8

Open
icedoom888 opened this issue Dec 6, 2024 · 0 comments
Labels: bug, enhancement, training


What happened?

With the current implementation, I find it quite hard to set the learning rate scheduler when:

  • Multiple nodes with multiple GPUs are used
  • Model sharding is used
  • Gradient accumulation is used (accum_grad_batches > 1)

Currently, the learning rate schedule is configured as follows:

# Training specifics
  max_epochs: 15
  max_steps: null
  lr:
    rate: 1e-5 # local lr
    # 15000 iterations with actual_bs = 32;
    # actual_bs = bs * num_nodes * num_gpus_per_node / num_gpus_per_model = 1 * 8 * 4 / 2 = 16
    # -> iterations = 15000 * 2 * accum_grad_batches = 60000
    iterations: 60000
    min: 3e-7 # not scaled by the number of GPUs
  accum_grad_batches: 2
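
For reference, this is the arithmetic currently needed to pick iterations by hand, following the comment above (a minimal sketch; the variable names are illustrative):

# Hand computation currently required to choose `iterations` (illustrative only).
bs = 1                   # per-GPU batch size
num_nodes = 8
num_gpus_per_node = 4
num_gpus_per_model = 2   # model sharding
accum_grad_batches = 2

# actual_bs as in the config comment: 1 * 8 * 4 / 2 = 16
actual_bs = bs * num_nodes * num_gpus_per_node // num_gpus_per_model

# The reference schedule was 15000 iterations at actual_bs = 32, so scale by the
# batch-size ratio and by the accumulation factor: 15000 * 2 * 2 = 60000.
reference_iterations, reference_bs = 15000, 32
iterations = reference_iterations * (reference_bs // actual_bs) * accum_grad_batches
print(actual_bs, iterations)  # 16 60000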

I would therefore propose introducing a configuration option where the LR is decayed according to max_epochs rather than a number of iterations.
The number of iterations is otherwise very hard to set manually, since it depends on the length of the dataset and the effective batch size.
In a multi-GPU setting with model sharding and gradient accumulation, the effective batch size can be computed as follows:
effective_bs = (batch_size * num_nodes * num_gpus_per_node) // (num_gpus_per_model * accum_grad_batches)
Ideally, the LR scheduler could be configured automatically to decrease the LR to 0 in a given number of epochs.
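
A minimal sketch of how this could work, assuming the effective batch size formula above (the function names and the dataset length are hypothetical, not part of anemoi-training):

import math

# Hypothetical helper: effective batch size, using the formula from this issue.
def effective_batch_size(batch_size, num_nodes, num_gpus_per_node,
                         num_gpus_per_model, accum_grad_batches):
    return (batch_size * num_nodes * num_gpus_per_node) // (
        num_gpus_per_model * accum_grad_batches)

# Hypothetical helper: derive the scheduler length from max_epochs instead of
# asking the user for `iterations`.
def scheduler_iterations(dataset_len, max_epochs, **parallel_cfg):
    eff_bs = effective_batch_size(**parallel_cfg)
    steps_per_epoch = math.ceil(dataset_len / eff_bs)
    return steps_per_epoch * max_epochs

# Example with the parallel settings from the config above (dataset_len is made up).
total_steps = scheduler_iterations(
    dataset_len=64000, max_epochs=15,
    batch_size=1, num_nodes=8, num_gpus_per_node=4,
    num_gpus_per_model=2, accum_grad_batches=2,
)
# `total_steps` could then be passed as T_max to, e.g.,
# torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-7)

With something like this, the user only specifies max_epochs and the minimum LR, and the schedule adapts automatically when the parallel configuration changes.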

What are the steps to reproduce the bug?

_

Version

0.3.1

Platform (OS and architecture)

Balfrin HPC

Relevant log output

No response

Accompanying data

No response

Organisation

MeteoSwiss

@icedoom888 added the bug, enhancement labels on Dec 6, 2024
@JesperDramsch JesperDramsch transferred this issue from ecmwf/anemoi-training Dec 19, 2024