
Allow resume_from_checkpoint to handle auto_find_batch_size #27568

Merged
muellerzr merged 7 commits into main from muellerzr-resume-auto-batch-size on Dec 8, 2023

Conversation

muellerzr (Contributor):

What does this PR do?

This PR adds the training batch size to the TrainerState. Because the TrainerState is loaded on resume_from_checkpoint, a batch size discovered when auto_find_batch_size=True can be stored there and loaded back in automatically when resuming.

Fixes #25956
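As an illustration of the intended flow (a minimal sketch; the model, dataset, and output paths here are placeholders for illustration, not anything taken from this PR):

from transformers import Trainer, TrainingArguments

# NOTE: `model` and `train_dataset` are assumed to be defined elsewhere.
args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=64,  # may be too large and OOM
    auto_find_batch_size=True,       # halve the batch size until training fits in memory
    save_steps=50,
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()  # the batch size that actually worked is saved in the checkpoint's trainer_state.json

# On a later run, resuming picks up the stored batch size instead of
# re-running the search.
trainer.train(resume_from_checkpoint=True)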

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

Comment on lines +1533 to +1511
# In case of repeating the find_executable_batch_size, set `self._train_batch_size` properly
state = TrainerState.load_from_json(os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME))
muellerzr (Contributor Author):

I don't love that we load the state here just for one value, but it makes more sense to keep this metadata in the TrainerState than in the model metadata, and creating a whole new dataclass/state just to store it doesn't seem worth it either.

Collaborator:

Indeed, it's not ideal. How large is it / how long does it take to load?

Could we protect it behind the if state.train_batch_size is not None branch until there's a need for the state later?

muellerzr (Contributor Author):

Not large at all; it just has a few dataclass entries in it. However, we can certainly protect it to reduce I/O time.

muellerzr (Contributor Author):

Also, that's not really possible: a user should be able to run with auto_find_batch_size once, and when resuming from a checkpoint that stored information about the prior run, we load that batch size back in automatically without requiring the search again, so the state is always needed here. This only happens when resuming from a checkpoint, which should limit the cost enough.
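Putting this sub-thread together, the restore can be sketched as a small standalone helper (a reconstruction for illustration, not the Trainer's actual method; trainer._train_batch_size and state.train_batch_size are the attributes discussed above):

import os

from transformers import TrainerState

TRAINER_STATE_NAME = "trainer_state.json"  # file name the Trainer writes alongside checkpoints


def restore_auto_found_batch_size(trainer, resume_from_checkpoint: str) -> None:
    """Reload a batch size that auto_find_batch_size discovered on a prior run."""
    state = TrainerState.load_from_json(
        os.path.join(resume_from_checkpoint, TRAINER_STATE_NAME)
    )
    if state.train_batch_size is not None:
        # Skip the halving search entirely and start from the value that worked before.
        trainer._train_batch_size = state.train_batch_size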


muellerzr requested review from ArthurZucker and amyeroberts and removed the review request for amyeroberts on November 20, 2023, 14:49
ArthurZucker (Collaborator) left a comment:

LGTM
I don't know the context well enough, but loading the whole train state seems a bit illogical to me.

  • auto_find_batch_size does not overwrite the argument's batch size, which is why we have to look it up when we resume?
  • I'm not sure I understand the test either; let's make it apparent that there are two different values: the input argument and the _train_batch_size that auto_find_batch_size overwrites.

# assume that `auto_find_bs` set it to 8, and we were originally at 16
trainer.args.per_device_train_batch_size = 16
trainer.train(resume_from_checkpoint=True)
# We should be back to 16 again
Collaborator:

To 8, no?

tests/trainer/test_trainer.py (review thread: outdated, resolved)
muellerzr (Contributor Author):

@ArthurZucker agreed that it's a bit overkill. Would it be better to create a new file (something like training_metadata.json) that, for now, only gets created when auto_find_batch_size is enabled?

ArthurZucker (Collaborator):

Why don't we just overwrite the arg given by the user?

muellerzr (Contributor Author):

@ArthurZucker we still need to store it somewhere for resume_from_checkpoint. The assumption is that on a fresh resume we don't want to run through the search loop again to find the right batch size if we already found it during a prior call, so it has to be persisted somewhere on the file system.
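For context on why the search is worth caching: the auto_find_batch_size path is built on accelerate's find_executable_batch_size utility (the helper the code comment quoted earlier refers to), which retries the wrapped function with a halved batch size after each out-of-memory error. A minimal standalone sketch, with a stand-in training body rather than the Trainer's real inner loop:

from accelerate.utils import find_executable_batch_size


@find_executable_batch_size(starting_batch_size=128)
def train(batch_size):
    # Stand-in for a real training loop. On a CUDA out-of-memory error the
    # decorator frees memory, halves batch_size, and calls this function again,
    # so a fresh search can burn several failed attempts before succeeding.
    print(f"trying batch_size={batch_size}")
    # ... build dataloaders and run the training loop here ...


train()  # called with no arguments; the decorator injects batch_size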

ArthurZucker (Collaborator):

Ah okay, we don't know whether the input batch size was auto-found or not. Got it. I'm not sure we want to create a new file for this; I'm fine with loading the state, and if we need more metadata we can put it there as well.

amyeroberts (Collaborator) left a comment:

Thanks for adding!

Some comments and questions on the state management. I don't know the Trainer in depth, so I might be misunderstanding how it's meant to behave.

max_steps=2,
save_steps=1,
per_device_train_batch_size=8,
auto_find_batch_size=True,
Collaborator:

I don't think I know enough about auto_find_batch_size to understand the implication of this test. If auto_find_batch_size is True and per_device_train_batch_size is set - which one takes precedence?

muellerzr (Contributor Author):

The precedence is: a batch_size stored in the checkpoint metadata > per_device_train_batch_size > auto_find_batch_size if we still OOM.
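Spelled out as a hedged sketch (an illustrative helper, not an excerpt from trainer.py; state.train_batch_size and args.per_device_train_batch_size are the attributes discussed in this thread):

def pick_starting_batch_size(state, args):
    # 1. A batch size recorded in the checkpoint's TrainerState wins outright.
    if state is not None and state.train_batch_size is not None:
        return state.train_batch_size
    # 2. Otherwise start from the user's per_device_train_batch_size...
    # 3. ...and, if auto_find_batch_size=True, let the search loop halve it
    #    whenever training still runs out of memory.
    return args.per_device_train_batch_size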

tests/trainer/test_trainer.py (review thread: outdated, resolved)
src/transformers/trainer.py (review thread: resolved)
@@ -1641,6 +1647,7 @@ def _inner_training_loop(

        self.state = TrainerState()
        self.state.is_hyper_param_search = trial is not None
+       self.state.train_batch_size = self._train_batch_size
Collaborator:

Won't this mean the batch_size from the state is always loaded, even if self.args.auto_find_batch_size is False?

muellerzr (Contributor Author):

Yes; we care about whether it was ever called at all and whether the metadata exists inside the state.

src/transformers/trainer_callback.py (review thread: outdated, resolved)
muellerzr and others added 7 commits December 5, 2023 16:08
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
muellerzr force-pushed the muellerzr-resume-auto-batch-size branch from 0d798d9 to 780bf72 on December 5, 2023, 21:08
ArthurZucker (Collaborator) left a comment:

Thanks. Having distinct variable names, e.g. an actual train batch size vs. a user-input train batch size, might help differentiate the two, but it's a nit.

muellerzr merged commit 6757ed2 into main on Dec 8, 2023
3 checks passed
muellerzr deleted the muellerzr-resume-auto-batch-size branch on December 8, 2023, 16:51
Development

Successfully merging this pull request may close these issues.

resume_from_checkpoint may fail with auto_find_batch_size
4 participants