
Clarify batch size displayed when using DataParallel #24430

Merged
merged 1 commit into main on Jun 22, 2023

Conversation

@sgugger (Collaborator) commented on Jun 22, 2023

What does this PR do?

As pointed out in #24345, the batch size displayed when using DataParallel is unclear; this PR fixes that.

Fixes #24345

@sgugger requested a review from muellerzr on Jun 22, 2023 at 17:30
@HuggingFaceDocBuilderDev commented on Jun 22, 2023

The documentation is not available anymore as the PR was closed or merged.

@muellerzr (Contributor) left a comment

Good call, much clearer now!

@sgugger merged commit 2834c17 into main on Jun 22, 2023
@sgugger deleted the trainer_dp_bs branch on Jun 22, 2023 at 18:46
logger.info(f" Instantaneous batch size per device = {self._train_batch_size:,}")
logger.info(f" Instantaneous batch size per device = {self.args.per_device_train_batch_size:,}")
if self.args.per_device_train_batch_size != self._train_batch_size:
logger.info(f" Training with DataParallel so batch size has been adjusted to: {self._train_batch_size:,}")
logger.info(f" Total train batch size (w. parallel, distributed & accumulation) = {total_train_batch_size:,}")
@cgbahk commented on Jun 23, 2023

Thanks for taking care of this! #24345 seems resolved.

Sorry, I wrote the comment below, but it seems that is not the case 😅 I didn't account for gradient accumulation.

So please ignore the text below.


As I said in #24345 (comment), I will not be using DP anymore, but there seems to be a bug around this logging part in the DP case.

e.g. total_train_batch_size seems to be larger than the expected real value?

Maybe use

        total_train_batch_size = args.per_device_train_batch_size * args.gradient_accumulation_steps * args.world_size

instead of

        total_train_batch_size = args.train_batch_size * args.gradient_accumulation_steps * args.world_size

at least in the DP case? (See the worked example below.)

I'm not sure about this issue or my suggestion (as I'm not yet familiar with the Trainer core/internals), so for now I will not report an issue or open a PR, but you may want to keep an eye on this.

I will try to open a new issue or PR if I find enough time to check 😄
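
For reference, a small worked example of the two formulas with assumed numbers (per-device batch size 8, 4 GPUs, gradient accumulation 2), under the assumption that a DataParallel run is a single process (so world_size stays 1) and that train_batch_size already includes the GPU factor; this is a sketch of the arithmetic, not a statement about the Trainer's actual internals:

    # Worked example with hypothetical numbers; the assumptions are stated above.
    per_device_train_batch_size = 8
    n_gpu = 4
    gradient_accumulation_steps = 2
    world_size = 1  # assumption: DataParallel runs in a single process

    # Assumption: under DP the DataLoader batch already covers all GPUs.
    train_batch_size = per_device_train_batch_size * n_gpu  # 32 samples per forward pass

    # Existing log formula vs. the alternative suggested in the comment above.
    total_existing = train_batch_size * gradient_accumulation_steps * world_size              # 64
    total_suggested = per_device_train_batch_size * gradient_accumulation_steps * world_size  # 16

    # Each optimizer step consumes 32 samples x 2 accumulation steps = 64 samples,
    # which matches the existing formula under these assumptions.
    print(total_existing, total_suggested)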

Successfully merging this pull request may close these issues:
Trainer reports batch size different from argument on multiple GPUs with DP (#24345)