Fix scheduler reporting #30169

muellerzr · 2024-04-10T18:38:42Z

What does this PR do?

Fixes the issue where the learning rate reported from the scheduler belonged to the wrong iteration, and adds a test.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@amyeroberts

muellerzr · 2024-04-10T19:21:55Z

src/transformers/trainer.py

@@ -2255,17 +2255,21 @@ def _inner_training_loop(
                    # Optimizer step
                    self.optimizer.step()
                    optimizer_was_run = not self.accelerator.optimizer_step_was_skipped
+                    # Store the current LR before stepping
+                    learning_rate = self._get_learning_rate()


I had to modify the order slightly: we need to grab the LR before stepping the scheduler, but we still need to step() the scheduler before calling on_step_end.
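The ordering described in this comment can be sketched with a toy loop. This is only an illustration under stated assumptions: `ToyScheduler`, `current_lr`, and `run_one_step` are hypothetical names invented here, not the Trainer or PyTorch API, and the linear decay is an arbitrary choice.

```python
class ToyScheduler:
    """Hypothetical linear-decay scheduler; not the transformers or PyTorch API."""

    def __init__(self, base_lr, total_steps):
        self.base_lr = base_lr
        self.total_steps = total_steps
        self.step_count = 0

    def current_lr(self):
        # LR that the next optimizer step would use.
        remaining = self.total_steps - self.step_count
        return self.base_lr * remaining / self.total_steps

    def step(self):
        self.step_count += 1


def run_one_step(scheduler, on_step_end, log):
    # 1. The optimizer step would happen here, using scheduler.current_lr().
    # 2. Capture the LR that was just applied, before the scheduler moves on.
    applied_lr = scheduler.current_lr()
    # 3. Advance the scheduler so callbacks observe the post-step state.
    scheduler.step()
    # 4. Callbacks run only after the scheduler has stepped.
    on_step_end(scheduler)
    # 5. Logging receives the captured value rather than a fresh read.
    log(applied_lr)


seen_by_callback = []
logged = []
sched = ToyScheduler(base_lr=1.0, total_steps=4)
run_one_step(sched, lambda s: seen_by_callback.append(s.current_lr()), logged.append)
```

In this sketch the callback still sees the already-stepped scheduler, while the logged value is the LR that was in effect for the optimizer step.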

HuggingFaceDocBuilderDev · 2024-04-10T19:43:40Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

amyeroberts

Thanks for working on this!

Just some questions so I can better understand the intended logic

amyeroberts · 2024-04-11T07:41:29Z

src/transformers/trainer.py

+            self._maybe_log_save_evaluate(
+                tr_loss, grad_norm, self._get_learning_rate(), model, trial, epoch, ignore_keys_for_eval
+            )


Just for my own understanding - why is self._maybe_log_save_evaluate called twice in the training loop?

The first is every n steps; the second is at the end of the epoch.
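As a rough picture of the two call sites, here is a toy loop, assuming a hypothetical `toy_training_loop` function and `log_every` parameter. It is not the Trainer code; it only records when each call site would fire.

```python
def toy_training_loop(num_epochs, steps_per_epoch, log_every):
    """Record (kind, epoch, global_step) each time a call site fires."""
    events = []
    global_step = 0
    for epoch in range(num_epochs):
        for _ in range(steps_per_epoch):
            global_step += 1
            # Call site 1: inside the step loop, acting every `log_every` steps.
            if global_step % log_every == 0:
                events.append(("step", epoch, global_step))
        # Call site 2: once all steps of the epoch have run.
        events.append(("epoch_end", epoch, global_step))
    return events


events = toy_training_loop(num_epochs=2, steps_per_epoch=3, log_every=2)
```

With three steps per epoch and `log_every=2`, the step-level site fires at global steps 2, 4, and 6, and the epoch-level site fires after steps 3 and 6.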

amyeroberts · 2024-04-11T15:10:08Z

src/transformers/trainer.py

+                    self._maybe_log_save_evaluate(
+                        tr_loss, grad_norm, learning_rate, model, trial, epoch, ignore_keys_for_eval
+                    )


Why not move this line above L2260? It seems to be logging a few values before there's any change to the global state, e.g. epoch.

That would change behavior for users who want to modify something with on_step_end, and it fully changes when that callback is called. That seems far too far-reaching for this, IMO :) We want that to be a thing.

(And if we do this, it will make some tests fail IIRC, because the reported values would no longer match the expected ones.)

If I've understood correctly, this means that the elements being logged aren't all in sync i.e. some will be reporting values before a step and some after

Yes, which I think for the scheduler is important because stepping it only reports what the next LR will be when grabbing, which is not what we want. We want the applied scheduler. For the rest not as much I think, I'll look back through some eval reports to make sure that it is just an isolated incident.
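The off-by-one effect described here can be shown with a small pure-Python sketch. The `linear_decay` schedule and the variable names are assumptions made for illustration; no real scheduler class is used.

```python
def linear_decay(step, base_lr=0.5, total_steps=4):
    # Hypothetical schedule: LR used for the optimizer step at index `step`.
    return base_lr * (total_steps - step) / total_steps


applied = []           # LR the optimizer actually used at each step
read_before_step = []  # LR read before the scheduler advances
read_after_step = []   # LR read after the scheduler advances

sched_step = 0
for _ in range(3):
    applied.append(linear_decay(sched_step))
    read_before_step.append(linear_decay(sched_step))
    sched_step += 1  # stands in for scheduler.step()
    read_after_step.append(linear_decay(sched_step))
```

Reading after the scheduler has advanced yields the LR for the following step, so the report is shifted by one iteration relative to what was applied; reading before the advance matches the applied values.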

Sorry, I don't think I've fully understood what should be happening.

Yes, which I think for the scheduler is important because stepping it only reports what the next LR will be when grabbing, which is not what we want. We want the applied scheduler.

Three questions:

In terms of what we "want" - do you mean what we want to pass to _maybe_log_save_evaluate here, or something else?

What is the difference between the scheduler and the applied scheduler?

Why can't we use the applied scheduler?

I think it would be good to have @pacman100 review this first, as he's more familiar with all of these objects

For the rest not as much I think, I'll look back through some eval reports to make sure that it is just an isolated incident.

Were you able to review reports and confirm behaviour is as expected?

github-actions · 2024-07-02T08:05:15Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Fix scheduler reporting

24a3759

muellerzr requested a review from amyeroberts April 10, 2024 18:38

Fix tests

8bf61e9

muellerzr commented Apr 10, 2024

amyeroberts reviewed Apr 11, 2024

muellerzr requested a review from amyeroberts April 15, 2024 12:19

huggingface deleted a comment from github-actions bot May 13, 2024

huggingface deleted a comment from github-actions bot Jun 7, 2024

github-actions bot closed this Jul 10, 2024

WizKnight mentioned this pull request Sep 16, 2024

Accelerate x Trainer issue tracker: #33345

Open

43 tasks

SunMarc reopened this Sep 27, 2024

muellerzr added the WIP label (for long-outstanding issues/PRs that are work in progress) Oct 10, 2024
