Recommended Adafactor settings for T5 cause error #7789

Closed
OyvindTafjord opened this issue Oct 14, 2020 · 5 comments · Fixed by #10526

@OyvindTafjord (Contributor)

Environment info

  • transformers version: 3.3.1
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.6.0 (False)
  • Tensorflow version (GPU?): 2.2.0 (False)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@sshleifer (from activity on Adafactor PRs)

Information

Model I am using (Bert, XLNet ...): T5

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

The Adafactor docs recommend the following for T5: Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=True)

However, the init code then has:

        if lr is not None and relative_step:
            raise ValueError("Cannot combine manual lr and relative_step options")
        if warmup_init and not relative_step:
            raise ValueError("warmup_init requires relative_step=True")

which makes the recommended setting impossible: warmup_init=True requires relative_step=True, but switching to relative_step=True then trips the first check because a manual lr is set. So something seems to be missing either in the recommendations or in the implementation.
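
For concreteness, a minimal script that reproduces both failure modes (the t5-small checkpoint is just an illustration, not part of the original report):

    from transformers import Adafactor, T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    # The documented recommendation, plus the obvious "fix" of flipping
    # relative_step; each trips one of the two ValueErrors quoted above.
    for kwargs in (
        dict(lr=1e-3, relative_step=False, warmup_init=True),
        dict(lr=1e-3, relative_step=True, warmup_init=True),
    ):
        try:
            Adafactor(model.parameters(), **kwargs)
        except ValueError as e:
            print(kwargs, "->", e)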

Thanks!

sshleifer (Contributor) commented Oct 14, 2020

I think the doc should recommend

Adafactor(model.parameters(), relative_step=True, warmup_init=True, lr=None)

want to fix it?
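
As a sanity check, a minimal training-loop sketch using these settings (model and dataloader are placeholders, not from this thread):

    from transformers import Adafactor

    optimizer = Adafactor(
        model.parameters(),
        relative_step=True,  # lr is derived from the current step count
        warmup_init=True,    # start from a small lr and ramp up
        lr=None,             # must be None when relative_step=True
    )

    for batch in dataloader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()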

sshleifer self-assigned this Oct 14, 2020
@OyvindTafjord (Contributor, Author)

I think what corresponds to the original T5 training code is Adafactor(model.parameters(), lr=1e-3, relative_step=False, warmup_init=False); however, that hasn't worked well for me so far (much slower than Adam, and giving me NaNs even in FP32).
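
For reference, that fixed-lr configuration would look like the sketch below; scale_parameter=False is an assumption often paired with an external learning rate, not something stated in this thread:

    from transformers import Adafactor

    optimizer = Adafactor(
        model.parameters(),
        lr=1e-3,                # fixed external learning rate
        relative_step=False,    # no step-dependent schedule
        warmup_init=False,
        scale_parameter=False,  # assumption: disable parameter scaling with a manual lr
    )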

@sonaliserro

Hello @OyvindTafjord, have you been able to fine-tune T5 with Adafactor? Thanks, Sonali

@OyvindTafjord (Contributor, Author)

No, I haven't investigated further regarding the slowness and NaNs I was getting.

stale bot added the wontfix label Jan 9, 2021
stale bot closed this as completed Jan 18, 2021
jsrozner (Contributor) commented Mar 4, 2021

This issue persists (i.e. the suggested defaults still produce the error).

I can confirm that Adafactor(lr=1e-3, relative_step=False, warmup_init=False) seems to break training (i.e. I observe no learning over 4 epochs), whereas Adafactor(model.parameters(), relative_step=True, warmup_init=True, lr=None) works well (much better than Adam).
