Recommended Adafactor settings for T5 cause error #7789

Closed
OyvindTafjord opened this issue Oct 14, 2020 · 5 comments · Fixed by #10526

@OyvindTafjord (Contributor)

Environment info

  • transformers version: 3.3.1
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@sshleifer (from activity on Adafactor PRs)

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

The Adafactor docs recommend the following for T5: Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=True)

However, the init code then has:

        if lr is not None and relative_step:
            raise ValueError("Cannot combine manual lr and relative_step options")
        if warmup_init and not relative_step:
            raise ValueError("warmup_init requires relative_step=True")

which makes the recommended setting impossible: warmup_init=True requires relative_step=True, but switching to relative_step=True then trips the first check because a manual lr is set. So something seems to be missing either in the recommendations or in the implementation.
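
For concreteness, a minimal script that reproduces both failure modes (the t5-small checkpoint is just an illustration, not part of the original report):

    from transformers import Adafactor, T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The documented recommendation, plus the obvious "fix" of flipping
    # relative_step; each trips one of the two ValueErrors quoted above.
    for kwargs in (
        dict(lr=1e-3, relative_step=False, warmup_init=True),
        dict(lr=1e-3, relative_step=True, warmup_init=True),
    ):
        try:
            Adafactor(model.parameters(), **kwargs)
        except ValueError as e:
            print(kwargs, "->", e)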

Thanks!

sshleifer (Contributor) commented Oct 14, 2020

I think the doc should recommend

Adafactor(model.parameters(), relative_step=True, warmup_init=True, lr=None)

want to fix it?
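
As a sanity check, a minimal training-loop sketch using these settings (model and dataloader are placeholders, not from this thread):

    from transformers import Adafactor

    optimizer = Adafactor(
        model.parameters(),
        relative_step=True,  # lr is derived from the current step count
        warmup_init=True,    # start from a small lr and ramp up
        lr=None,             # must be None when relative_step=True
    )

    for batch in dataloader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()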

sshleifer self-assigned this Oct 14, 2020
@OyvindTafjord (Contributor, Author)

I think what corresponds to the original T5 training code is Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=False); however, that hasn't worked well for me so far (much slower than Adam, and giving me NaNs even in FP32).
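
For reference, that fixed-lr configuration would look like the sketch below; scale_parameter=False is an assumption often paired with an external learning rate, not something stated in this thread:

    from transformers import Adafactor

    optimizer = Adafactor(
        model.parameters(),
        lr=1e-3,                # fixed external learning rate
        relative_step=False,    # no step-dependent schedule
        warmup_init=False,
        scale_parameter=False,  # assumption: disable parameter scaling with a manual lr
    )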

@sonaliserro

Hello @OyvindTafjord, have you been able to fine-tune T5 with Adafactor? Thanks, Sonali

@OyvindTafjord (Contributor, Author)

No, I haven't investigated further regarding the slowness and NaNs I was getting.

stale bot added the wontfix label Jan 9, 2021
stale bot closed this as completed Jan 18, 2021
jsrozner (Contributor) commented Mar 4, 2021

This issue persists (i.e. the suggested defaults still produce the error).

I can confirm that Adafactor(lr=1e-3, relative_step=False, warmup_init=False) seems to break training (i.e. I observe no learning over 4 epochs), whereas Adafactor(model.parameters(), relative_step=True, warmup_init=True, lr=None) works well (much better than Adam).
