
resume_from_checkpoint may fail with auto_find_batch_size #25956

Closed
n-splv opened this issue Sep 4, 2023 · 1 comment · Fixed by #27568

n-splv commented Sep 4, 2023

System Info

When we resume training from a checkpoint, the process may stall, because the number of steps already completed according to the checkpoint may turn out to be greater than the newly estimated total number of steps. When auto_find_batch_size is turned on, the trainer first tries a higher batch size and only lowers it after running out of memory.

Consider a simple example:
We want to train a model on 100 samples for 10 epochs. Here is what happens:

  1. The trainer first tries a larger batch_size = 8. The estimated number of steps is 100 * 10 / 8 = 125;
  2. We run out of GPU memory, and the batch_size eventually gets reduced to 2. We now have 100 * 10 / 2 = 500 steps to go;
  3. At step 150 we save a checkpoint and stop the training;
  4. Later we load the checkpoint and try to continue training with the same params. The trainer once again tries batch_size = 8, estimates 125 total steps and... finishes the process immediately, since the 150 steps already taken exceed that estimate (see the sketch below).
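
For illustration, here is a small standalone sketch of that arithmetic (plain Python; the numbers are from the example above, and the Trainer's exact rounding may differ slightly):

import math

def estimated_total_steps(num_samples, num_epochs, batch_size):
    # Rough estimate of the total optimizer steps for the whole run,
    # as in the example above.
    return math.ceil(num_samples * num_epochs / batch_size)

num_samples, num_epochs = 100, 10

steps_with_bs2 = estimated_total_steps(num_samples, num_epochs, 2)  # 500
completed_steps = 150  # a checkpoint is saved here and training is stopped

# On resume, auto_find_batch_size probes the larger batch_size = 8 again:
steps_with_bs8 = estimated_total_steps(num_samples, num_epochs, 8)  # 125

# The checkpoint's step counter already exceeds the new estimate,
# so the training loop exits right away.
print(completed_steps >= steps_with_bs8)  # True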

Who can help?

@muellerz, @pacman100

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    ...
    auto_find_batch_size=True,
)
train_result = trainer.train(resume_from_checkpoint=CHECKPOINT_DIR)
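
To see the mismatch directly, you can compare the step counter recorded in the checkpoint with the step budget the resumed run will estimate. A minimal inspection sketch, assuming CHECKPOINT_DIR is the same checkpoint directory as above (the Trainer writes a trainer_state.json file into every checkpoint):

import json
from pathlib import Path

state = json.loads((Path(CHECKPOINT_DIR) / "trainer_state.json").read_text())
print(state["global_step"])  # e.g. 150 in the example above
print(state["max_steps"])    # e.g. 500, estimated with the reduced batch_size = 2

When training resumes with auto_find_batch_size=True, max_steps is re-estimated from the larger probed batch size (125 in the example), and since global_step already exceeds it, the loop finishes immediately.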

Expected behavior

The information about the batch size that was actually used should probably be saved somewhere in the checkpoint, and the trainer should be smart enough to account for it when interpreting the number of completed steps. For now, the only solution seems to be to resume training while manually providing the same batch size, which is neither intuitive nor flexible: suppose my hardware has changed but I still want to resume training from my checkpoint.
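
Until then, the manual workaround looks roughly like this (a sketch, assuming you know or have recorded the batch size the first run settled on, 2 in the example above):

args = Seq2SeqTrainingArguments(
    ...
    auto_find_batch_size=False,      # skip the batch-size probing on resume
    per_device_train_batch_size=2,   # the batch size the first run ended up using
)
train_result = trainer.train(resume_from_checkpoint=CHECKPOINT_DIR)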

@pacman100 (Contributor) commented:

cc @muellerzr
