
resume_from_checkpoint may fail with auto_find_batch_size #25956

Closed
n-splv opened this issue Sep 4, 2023 · 1 comment · Fixed by #27568

n-splv commented Sep 4, 2023

System Info

When we resume training from a checkpoint, the process may stall, because the number of steps already completed according to the checkpoint may turn out to be greater than the newly estimated total number of steps. When auto_find_batch_size is turned on, the trainer first tries a higher batch size and only lowers it after running out of memory.

Consider a simple example:
We want to train a model on 100 samples for 10 epochs. Here is what happens:

  1. The trainer first tries a larger batch_size = 8. The estimated number of steps is 100 * 10 / 8 = 125;
  2. We run out of GPU memory, and the batch_size eventually gets reduced to 2. We now have 100 * 10 / 2 = 500 steps to go;
  3. At step 150 we save a checkpoint and stop the training;
  4. Later we load the checkpoint and try to continue training with the same params. The trainer once again tries batch_size = 8, estimates 125 total steps and... finishes the process immediately, since the 150 steps already taken exceed that estimate (see the sketch below).
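
For illustration, here is a small standalone sketch of that arithmetic (plain Python; the numbers are from the example above, and the Trainer's exact rounding may differ slightly):

import math

def estimated_total_steps(num_samples, num_epochs, batch_size):
    # Rough estimate of the total optimizer steps for the whole run,
    # as in the example above.
    return math.ceil(num_samples * num_epochs / batch_size)

num_samples, num_epochs = 100, 10

steps_with_bs2 = estimated_total_steps(num_samples, num_epochs, 2)  # 500
completed_steps = 150  # a checkpoint is saved here and training is stopped

# On resume, auto_find_batch_size probes the larger batch_size = 8 again:
steps_with_bs8 = estimated_total_steps(num_samples, num_epochs, 8)  # 125

# The checkpoint's step counter already exceeds the new estimate,
# so the training loop exits right away.
print(completed_steps >= steps_with_bs8)  # True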

Who can help?

@muellerz, @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    ...
    auto_find_batch_size=True,
)
train_result = trainer.train(resume_from_checkpoint=CHECKPOINT_DIR)
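
To see the mismatch directly, you can compare the step counter recorded in the checkpoint with the step budget the resumed run will estimate. A minimal inspection sketch, assuming CHECKPOINT_DIR is the same checkpoint directory as above (the Trainer writes a trainer_state.json file into every checkpoint):

import json
from pathlib import Path

state = json.loads((Path(CHECKPOINT_DIR) / "trainer_state.json").read_text())
print(state["global_step"])  # e.g. 150 in the example above
print(state["max_steps"])    # e.g. 500, estimated with the reduced batch_size = 2

When training resumes with auto_find_batch_size=True, max_steps is re-estimated from the larger probed batch size (125 in the example), and since global_step already exceeds it, the loop finishes immediately.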

Expected behavior

The information about the batch size that was actually used should probably be saved somewhere in the checkpoint, and the trainer should be smart enough to account for it when interpreting the number of completed steps. For now, the only solution seems to be to resume training while manually providing the same batch size, which is neither intuitive nor flexible: suppose my hardware has changed but I still want to resume training from my checkpoint.
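
Until then, the manual workaround looks roughly like this (a sketch, assuming you know or have recorded the batch size the first run settled on, 2 in the example above):

args = Seq2SeqTrainingArguments(
    ...
    auto_find_batch_size=False,      # skip the batch-size probing on resume
    per_device_train_batch_size=2,   # the batch size the first run ended up using
)
train_result = trainer.train(resume_from_checkpoint=CHECKPOINT_DIR)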

@pacman100 (Contributor) commented:

cc @muellerzr
