Learning Rate Scheduling with Multi-GPU, model sharding and gradient accumulation #8

Open
icedoom888 opened this issue Dec 6, 2024 · 0 comments
Labels: bug, enhancement, training


What happened?

With the current implementation, I find it quite hard to set the learning rate scheduler when:

  • Multiple nodes with multiple GPUs are used
  • Model sharding is used
  • Gradient accumulation is used (accum_grad_batches > 1)

Currently, the learning rate schedule is configured as follows:

# Training specifics
  max_epochs: 15
  max_steps: null
  lr:
    rate: 1e-5 # local lr
    # 15000 iterations with actual_bs = 32;
    # actual_bs = bs * num_nodes * num_gpus_per_node / num_gpus_per_model = 1 * 8 * 4 / 2 = 16
    # -> iterations = 15000 * 2 * accum_grad_batches = 60000
    iterations: 60000
    min: 3e-7 # not scaled by the number of GPUs
  accum_grad_batches: 2
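
For reference, this is the arithmetic currently needed to pick iterations by hand, following the comment above (a minimal sketch; the variable names are illustrative):

# Hand computation currently required to choose `iterations` (illustrative only).
bs = 1                   # per-GPU batch size
num_nodes = 8
num_gpus_per_node = 4
num_gpus_per_model = 2   # model sharding
accum_grad_batches = 2

# actual_bs as in the config comment: 1 * 8 * 4 / 2 = 16
actual_bs = bs * num_nodes * num_gpus_per_node // num_gpus_per_model

# The reference schedule was 15000 iterations at actual_bs = 32, so scale by the
# batch-size ratio and by the accumulation factor: 15000 * 2 * 2 = 60000.
reference_iterations, reference_bs = 15000, 32
iterations = reference_iterations * (reference_bs // actual_bs) * accum_grad_batches
print(actual_bs, iterations)  # 16 60000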

I would therefore propose introducing a configuration option where the LR is decayed according to max_epochs rather than a number of iterations.
The number of iterations is otherwise very hard to set manually, since it depends on the length of the dataset and the effective batch size.
In a multi-GPU setting with model sharding and gradient accumulation, the effective batch size can be computed as follows:
effective_bs = (batch_size * num_nodes * num_gpus_per_node) // (num_gpus_per_model * accum_grad_batches)
Ideally, the LR scheduler could be configured automatically to decrease the LR to 0 in a given number of epochs.
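
A minimal sketch of how this could work, assuming the effective batch size formula above (the function names and the dataset length are hypothetical, not part of anemoi-training):

import math

# Hypothetical helper: effective batch size, using the formula from this issue.
def effective_batch_size(batch_size, num_nodes, num_gpus_per_node,
                         num_gpus_per_model, accum_grad_batches):
    return (batch_size * num_nodes * num_gpus_per_node) // (
        num_gpus_per_model * accum_grad_batches)

# Hypothetical helper: derive the scheduler length from max_epochs instead of
# asking the user for `iterations`.
def scheduler_iterations(dataset_len, max_epochs, **parallel_cfg):
    eff_bs = effective_batch_size(**parallel_cfg)
    steps_per_epoch = math.ceil(dataset_len / eff_bs)
    return steps_per_epoch * max_epochs

# Example with the parallel settings from the config above (dataset_len is made up).
total_steps = scheduler_iterations(
    dataset_len=64000, max_epochs=15,
    batch_size=1, num_nodes=8, num_gpus_per_node=4,
    num_gpus_per_model=2, accum_grad_batches=2,
)
# `total_steps` could then be passed as T_max to, e.g.,
# torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps, eta_min=3e-7)

With something like this, the user only specifies max_epochs and the minimum LR, and the schedule adapts automatically when the parallel configuration changes.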

What are the steps to reproduce the bug?

_

Version

0.3.1

Platform (OS and architecture)

Balfrin HPC

Relevant log output

No response

Accompanying data

No response

Organisation

MeteoSwiss

@icedoom888 added the bug, enhancement labels on Dec 6, 2024
@JesperDramsch JesperDramsch transferred this issue from ecmwf/anemoi-training Dec 19, 2024