What happened?
With the current implementation, I find it quite hard to set the learning rate scheduler when:
In fact, currently the LR can be set with:
I would therefore propose introducing a configuration option where the LR is decreased according to max_epochs rather than the number of iterations.
The number of iterations is otherwise very hard to set manually, given that it depends on the length of the dataset and on effective_bs.
In a multi-gpu setting with model sharding and gradient accumulation, the effective bs can be computed as follows:
effective_bs = (batch_size * num_nodes * num_gpus_per_node) // (num_gpus_per_model * accum_grad_batches)
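To illustrate why the iteration count is awkward to set by hand, here is a minimal sketch of how it could be derived from max_epochs. The function name `total_iterations` and the parameter `dataset_len` are hypothetical, not part of the library. It assumes effective_bs is the number of samples consumed per optimizer step and that the last partial batch of each epoch is kept.

```python
import math


def total_iterations(dataset_len: int, effective_bs: int, max_epochs: int) -> int:
    """Estimate the number of optimizer steps for a full training run.

    Assumptions (not taken from the library): effective_bs is the number of
    samples consumed per optimizer step, and a trailing partial batch still
    counts as one step.
    """
    if dataset_len <= 0 or effective_bs <= 0 or max_epochs <= 0:
        raise ValueError("all arguments must be positive")
    # Steps needed to see every sample once, rounding the last batch up.
    steps_per_epoch = math.ceil(dataset_len / effective_bs)
    return steps_per_epoch * max_epochs
```

Any change to the dataset length, the number of nodes or GPUs, or the sharding setup changes this value, which is why a fixed iteration count in the configuration tends to go stale.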
Ideally, the LR scheduler would be set automatically so that the LR decays to 0 over a given number of epochs.
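As a rough sketch of the requested behaviour, the function below decays the LR linearly to 0 over a total step count that would be derived from max_epochs (steps per epoch times max_epochs). The name `lr_at_step` and the choice of a linear shape are assumptions made for illustration; the existing scheduler may use a different curve.

```python
def lr_at_step(step: int, base_lr: float, total_steps: int) -> float:
    """Linearly decay base_lr to 0 over total_steps optimizer steps.

    total_steps is assumed to be computed from max_epochs, so the schedule
    ends at 0 exactly when training ends. The linear shape is an
    illustrative choice, not the library's current behaviour.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so steps outside [0, total_steps] never yield a negative
    # or larger-than-base learning rate.
    clamped = min(max(step, 0), total_steps)
    progress = clamped / total_steps
    return base_lr * (1.0 - progress)
```

The ratio `lr_at_step(step, 1.0, total_steps)` could, for instance, be passed as the multiplicative factor to a scheduler such as PyTorch's `LambdaLR`, so the user only has to specify max_epochs.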
What are the steps to reproduce the bug?
_
Version
0.3.1
Platform (OS and architecture)
Balfrin HPC
Relevant log output
No response
Accompanying data
No response
Organisation
MeteoSwiss