enable average tokens across devices #34373
Conversation
Thanks! It's a good start, let's improve it a little :)
src/transformers/trainer.py
Outdated
```python
if self.compute_loss_func is not None:
    if (self.args.average_tokens_across_devices and num_items_in_batch is not None and
            self.args.world_size > 1):
        num_items_in_batch_tensor = torch.tensor(num_items_in_batch, device=self.args.device)
        num_items_in_batch = int(self.accelerator.gather(num_items_in_batch_tensor).sum().cpu())
        device_count_for_loss = self.args.world_size
    else:
        device_count_for_loss = 1
    loss = self.compute_loss_func(outputs, labels, num_items_in_batch=num_items_in_batch)
    loss *= device_count_for_loss
elif model_name in MODEL_FOR_CAUSAL_LM_MAPPING_NAMES.values():
    loss = self.label_smoother(outputs, labels, shift_labels=True)
```
Thanks! We also need to handle the case where we don't define this, e.g. when it's passed to the model `forward()`. So what would be better is to perform the `gather` much earlier, and pass the new `num_items_in_batch` as part of the call to `compute_loss`. Then perform the `loss *=` where we call `loss *= self.args.gradient_accumulation_steps` later (right before we call `accelerator.backward()`).
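A rough sketch of what that reordering could look like (illustrative only, not the exact `Trainer` code; `compute_loss` accepting `num_items_in_batch` directly is assumed here):

```python
# Sketch: gather the token count once, before the loss is computed, so that
# models which consume num_items_in_batch in forward() also see the global count.
if self.args.average_tokens_across_devices and num_items_in_batch is not None:
    num_items_in_batch = int(
        self.accelerator.gather(
            torch.tensor(num_items_in_batch, device=self.args.device)
        ).sum()
    )

loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)

# Undo the implicit mean over processes and compensate for gradient accumulation
# in one place, right before the backward pass.
loss *= self.args.gradient_accumulation_steps
if self.args.average_tokens_across_devices:
    loss *= self.args.world_size
self.accelerator.backward(loss)
```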
Thanks for the advice! I've already fixed it, please check again.
Thanks! We're quite close. Two more nits
src/transformers/training_args.py
Outdated
```python
average_tokens_across_devices: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Whether or not to average tokens across devices."
    }
)
```
During `__post_init__` we call `setup_devices`. We can change the `average_tokens_across_devices` value to `False` if the world size is 1, I think! This then simplifies the earlier check to just `if self.args.average_tokens_across_devices`.
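Sketched out, that could look roughly like this inside `TrainingArguments.__post_init__` (a sketch only; the warning text is an assumption, and it presumes `self.world_size` is available once `setup_devices` has run):

```python
# Sketch of the suggested __post_init__ adjustment: disable the flag when
# there is only a single process, so downstream checks stay simple.
if self.average_tokens_across_devices and self.world_size <= 1:
    logger.warning(
        "average_tokens_across_devices is only meaningful with more than one "
        "process; setting it to False."
    )
    self.average_tokens_across_devices = False
```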
Thanks, I've already fixed it, please check.
src/transformers/trainer.py
Outdated
```python
loss *= self.args.gradient_accumulation_steps
if (self.args.average_tokens_across_devices and num_items_in_batch is not None and
        self.args.world_size > 1):
    loss *= self.args.world_size
```
Both of these chunks, I think, can go under an `if num_items_in_batch is not None and self.model_accepts_loss_kwargs`, since both need to be valid for the `loss *= self.args.gradient_accumulation_steps`.
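In other words, something along these lines (a sketch of the reviewer's suggestion; as the next reply notes, the author ultimately structured it differently):

```python
# Sketch: gate both scalings on the same condition, since the
# gradient-accumulation compensation only applies on the loss-kwargs path.
if num_items_in_batch is not None and self.model_accepts_loss_kwargs:
    loss *= self.args.gradient_accumulation_steps
    if self.args.average_tokens_across_devices and self.args.world_size > 1:
        loss *= self.args.world_size
```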
I think gradient accumulation is orthogonal to DDP, so I used a new if statement. Please check my code. Thanks.
It is, it's a matter of `self.model_accepts_loss_kwargs`.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
This indeed looks like the correct solution! Thanks!
Thanks for your work! LGTM!
src/transformers/training_args.py
Outdated
```python
average_tokens_across_devices: Optional[bool] = field(
    default=False,
    metadata={
        "help": "Whether or not to average tokens across devices."
    }
)
```
Maybe we could share a bit more about why this arg could be useful?
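For illustration, an expanded help string could read something like this (hypothetical wording, not necessarily the text that was merged):

```python
average_tokens_across_devices: Optional[bool] = field(
    default=False,
    metadata={
        "help": (
            "Whether or not to average tokens across devices. If enabled, the per-device "
            "num_items_in_batch values are summed across processes, so the loss is normalized "
            "by the global token count instead of the local one, which gives a more precise "
            "loss under DDP. See https://github.com/huggingface/transformers/issues/34242."
        )
    }
)
```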
Done, please review my code.
@muellerzr @SunMarc I fixed one typo to make ruff happy and pass the auto check, and added more explanation to the argument. Please check again, thank you.
One required test failed, but it's due to a timeout. Can anyone help fix it? Thanks a lot.
Thanks! Overall LG2M, cc @SunMarc
We'll merge once our CI allows us to 😅
```python
return loss.detach() / self.args.gradient_accumulation_steps
```
We modified how the loss is computed quite a lot; are we sure that this loss is the same as the one applied?
Good catch, it should be `loss.detach() / self.args.gradient_accumulation_steps / self.accelerator.num_processes` (dividing by num processes if and only if we did our loss-function num-tokens logic).
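Put concretely, the returned value would then look roughly like this (a sketch; the exact guard condition is an assumption):

```python
# Sketch: only divide by the number of processes when the gathered-token-count
# path was taken; otherwise keep the original scaling.
scaled_loss = loss.detach() / self.args.gradient_accumulation_steps
if self.args.average_tokens_across_devices and num_items_in_batch is not None:
    scaled_loss = scaled_loss / self.accelerator.num_processes
return scaled_loss
```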
Thanks! LGTM! Just a nit.
cc @ydshieh
Thanks! Rebase to make sure we have green CIs, we will merge anyway!
Thanks @techkang 🤗
* enable average tokens across devices
* reduce earlier in case model needs it
* simplify if statement
* reformat code to make ruff happy
* add doc for argument: average_tokens_across_devices
* cannot find world size when pytorch is unavailable
* format code

Co-authored-by: Zach Mueller <muellerzr@gmail.com>
Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
What does this PR do?
Fix #34242
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@muellerzr please help to check.