When we resume training from a checkpoint, the process may stall, because the number of steps already completed according to the checkpoint may turn out to be greater than the newly estimated total number of steps. When auto_find_batch_size is turned on, the trainer starts from the configured (larger) batch size and only reduces it after running out of memory.
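For context, the retry-on-OOM behaviour looks roughly like the sketch below; as far as I can tell, Trainer delegates this to a helper along the lines of accelerate's find_executable_batch_size, and nothing about the batch size that finally worked ends up in the checkpoint. The function name is just for illustration.

```python
# Simplified illustration of the auto_find_batch_size mechanism (not Trainer's actual code).
from accelerate.utils import find_executable_batch_size


@find_executable_batch_size(starting_batch_size=8)
def train_with_batch_size(batch_size):
    # On CUDA OOM the helper catches the error and calls this function again
    # with batch_size halved (8 -> 4 -> 2 ...). The value that finally works
    # is not persisted anywhere in the checkpoint.
    print(f"trying batch_size={batch_size}")
    ...


train_with_batch_size()
```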
Consider a simple example: we want to train a model on 100 samples for 10 epochs. Here is what happens (a rough repro sketch follows the list):
1. The trainer starts with the larger batch_size = 8. The estimated number of steps is 100 * 10 / 8 = 125.
2. We run out of GPU memory, and the batch_size is eventually reduced to 2. We now have 100 * 10 / 2 = 500 steps to go.
3. At step 150 we save a checkpoint and stop the training.
4. Later we load the checkpoint and try to continue training with the same arguments. The trainer once again tries batch_size = 8, estimates 125 total steps and... finishes the run immediately, since we have already completed 150 of those 125 steps.
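Rough repro sketch of the scenario above. model and train_dataset are placeholders for any small model and a 100-sample dataset; output_dir and save_steps are arbitrary.

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=10,
    per_device_train_batch_size=8,  # the starting value auto_find_batch_size tries first
    auto_find_batch_size=True,      # halves the batch size on CUDA OOM (8 -> 4 -> 2)
    save_steps=50,                  # so a checkpoint exists at step 150 when we stop
)

# First run: OOMs at batch_size=8, ends up training with batch_size=2 (~500 total steps),
# and is interrupted after the checkpoint at step 150.
Trainer(model=model, args=args, train_dataset=train_dataset).train()

# Second run with the same arguments: auto_find_batch_size starts from 8 again, so the
# total is re-estimated as ~125 steps, while the checkpoint already reports
# global_step = 150 -- training "finishes" immediately without taking a single step.
Trainer(model=model, args=args, train_dataset=train_dataset).train(resume_from_checkpoint=True)
```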
System Info
Who can help?
@muellerz, @pacman100
Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
Expected behavior
The batch size that was actually used should probably be saved somewhere in the checkpoint, and the trainer should be smart enough to account for it when interpreting the number of completed steps. For now, the only workaround seems to be to resume training while manually providing the same batch size, which is neither intuitive nor always possible - suppose my hardware changed, but I still want to resume training from my checkpoint.
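For completeness, here is a sketch of that manual workaround, under the same assumptions as the snippet above (placeholder model and train_dataset):

```python
# Workaround sketch: pin the batch size the checkpointed run actually settled on,
# instead of letting auto_find_batch_size start from 8 again on resume.
from transformers import Trainer, TrainingArguments

resume_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=10,
    per_device_train_batch_size=2,  # the value the original run was reduced to
    auto_find_batch_size=False,     # keeps the step estimate consistent with the checkpoint
)

# With ~500 estimated steps and only 150 completed, training actually resumes.
Trainer(model=model, args=resume_args, train_dataset=train_dataset).train(resume_from_checkpoint=True)
```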