4.27.1 breaks fp16 training of Flaubert #22426

Closed
2 of 4 tasks
thomas-schillaci opened this issue Mar 28, 2023 · 5 comments

thomas-schillaci commented Mar 28, 2023

System Info

  • transformers version: 4.27.1
  • Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: no

Who can help?

@sgugger @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Using transformers 4.26.1, the following script behaves properly (training and validation losses decrease); using transformers >=4.27.1, the training loss is always 0 and the validation loss is always nan.
Please note that the problem does not occur when fp16=True is removed.

from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

model_name = 'flaubert/flaubert_base_cased'
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=2,
    problem_type='single_label_classification'
)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def tokenize_function(example):
    return tokenizer(example['text'], padding=True, truncation=True)


# A small slice of IMDB is enough to observe the regression
dataset = load_dataset('imdb', split='train[:1%]')
dataset = dataset.train_test_split(test_size=0.2)
dataset = dataset.map(tokenize_function, batched=True, remove_columns=['text'])

train_args = TrainingArguments(
    'out',
    report_to=[],
    logging_strategy='steps',
    logging_steps=1,
    evaluation_strategy='epoch',
    num_train_epochs=1,
    fp16=True  # removing this makes the problem disappear
)
trainer = Trainer(model=model, train_dataset=dataset['train'], eval_dataset=dataset['test'], args=train_args)
trainer.train()
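
As a stopgap on an affected version, disabling mixed precision sidesteps the issue (per the note above that the problem does not occur without fp16=True), at the cost of fp16 speed and memory savings:

train_args = TrainingArguments(
    'out',
    report_to=[],
    logging_strategy='steps',
    logging_steps=1,
    evaluation_strategy='epoch',
    num_train_epochs=1,
    fp16=False  # the default; equivalently, just drop fp16=True
)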

Expected behavior

Upgrading to >=4.27.1 should produce training behavior similar to 4.26.1.
Thank you for your help!

sgugger (Collaborator) commented Mar 28, 2023

Thanks for reporting and providing a clear reproducer! It let me pinpoint the regression to this PR. I think we shouldn't have touched that modeling code. Let me just consult internally and I will report back here with the next steps soon!
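
For background on why fp16 in particular falls over in cases like this, here is a generic illustration of a common failure mode in attention-mask code (an assumption for illustration, not necessarily what the linked PR changed): additive mask constants such as -1e9 are not representable in float16 and saturate to -inf, and a softmax over a fully masked row then yields nan.

import torch

# float16 cannot represent -1e9; it saturates to -inf (the finite minimum is -65504)
print(torch.tensor(-1e9, dtype=torch.float16))  # tensor(-inf, dtype=torch.float16)
print(torch.finfo(torch.float16).min)           # -65504.0

# once a row of attention scores is entirely -inf, softmax produces nan,
# which then propagates into the loss
row = torch.full((4,), float('-inf'))
print(torch.softmax(row, dim=-1))               # tensor([nan, nan, nan, nan])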

sgugger (Collaborator) commented Mar 29, 2023

The PR mentioned above reverts the commit that introduced the bug. The fix will be released in a patch (4.27.4) later today.
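
Assuming a standard pip-managed environment, the patch can be picked up once it is out with:

pip install --upgrade transformers==4.27.4
python -c "import transformers; print(transformers.__version__)"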

thomas-schillaci (Author) commented

Awesome, thank you!

thomas-schillaci (Author) commented

Tested on 4.27.4: the issue is fixed. Thank you again!

sgugger (Collaborator) commented Mar 30, 2023

Thanks for letting us know!
