4.27.1 breaks fp16 training of Flaubert #22426
Comments
Thanks for reporting and providing a clear reproducer! It let me pinpoint the regression to this PR. I think we shouldn't have touched that modeling code. Let me just consult internally and I will report back here with the next steps soon!
The PR mentioned above reverts the commit that introduced the bug. This will be released in a patch (4.27.4) later today.
Awesome, thank you!
Tested on 4.27.4, this issue is fixed; thank you again!
Thanks for letting us know! |
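(For anyone landing here later: the fix shipped in the 4.27.4 patch release, so upgrading with, e.g., pip install --upgrade transformers==4.27.4 resolves the regression.)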
System Info
transformers version: 4.27.1

Who can help?
@sgugger @ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Using transformers 4.26.1 the following script behaves properly (train and validation loss decreasing); using transformers >=4.27.1, the training loss is always 0 and the validation loss is always nan.
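A minimal sketch of such a script, assuming a Flaubert sequence-classification fine-tune via the Trainer API (the checkpoint, toy dataset, and hyperparameters below are illustrative, not necessarily the reporter's exact setup; a CUDA GPU is required for fp16):

```python
import torch
from torch.utils.data import Dataset
from transformers import (
    FlaubertForSequenceClassification,
    FlaubertTokenizer,
    Trainer,
    TrainingArguments,
)

class ToyDataset(Dataset):
    """A tiny labeled French text dataset, just enough to run a few steps."""
    def __init__(self, tokenizer):
        texts = ["Ce film est excellent.", "Ce film est horrible."] * 16
        self.labels = [1, 0] * 16
        self.encodings = tokenizer(texts, truncation=True, padding=True)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = FlaubertTokenizer.from_pretrained("flaubert/flaubert_base_cased")
model = FlaubertForSequenceClassification.from_pretrained(
    "flaubert/flaubert_base_cased", num_labels=2
)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    evaluation_strategy="epoch",
    logging_steps=1,
    fp16=True,  # removing this line avoids the problem on >=4.27.1
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=ToyDataset(tokenizer),
    eval_dataset=ToyDataset(tokenizer),
)
trainer.train()  # on 4.26.1 the losses decrease; on 4.27.1 with fp16 they do not
```

On 4.26.1 the logged losses decrease as expected; on 4.27.1 with fp16 enabled, the training loss reads 0 and the evaluation loss nan, matching the behavior described above.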
Please note that the problem doesn't occur when removing fp16=True.

Expected behavior
Upgrading to >=4.27.1 should produce training behavior similar to 4.26.1.
Thank you for your help!