
T5 finetune outputting gibberish #7796

Closed · 1 of 3 tasks
jsrozner opened this issue Oct 14, 2020 · 12 comments

@jsrozner (Contributor)

Environment info

  • transformers version: 3.3.1
  • Platform: Linux-4.4.0-116-generic-x86_64-with-glibc2.10
  • Python version: 3.8.5
  • PyTorch version (GPU?): 1.6.0 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: (tried with both 1 and 2 gpus)

Who can help

Summarization: @sshleifer
T5: @patrickvonplaten
examples/seq2seq: @sshleifer

Information

I am trying to finetune on a custom dataset. I posted about my specific use case here in the forums: https://discuss.huggingface.co/t/t5-tips-for-finetuning-on-crossword-clues-clue-answer/1514

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • [x] my own task or dataset: (give details below)

To reproduce

  • clone transformers from master
  • pip install -e . ; pip install -r requirements.txt
  • cd examples/seq2seq
  • modify the finetune_t5.sh script to run with a local data set (data_set/[val|test|train].[source|target])

(Note that I have changed nothing else)

python finetune.py \
    --model_name_or_path=t5-small \
    --tokenizer_name=t5-small \
    --data_dir=${HOME}/data_set \
    --learning_rate=3e-4 \
    --output_dir=$OUTPUT_DIR \
    --max_source_length=100 \
    --max_target_length=100 \
    --num_train_epochs=300 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --gpus=1 \
    --auto_select_gpus=True \
    --save_top_k=3 \
    --do_train \
    --do_predict \
    "$@"

As a baseline "does T5 work?" test, my inputs/outputs are of the form (one example per line):
(one line in train.source): This is a sentence
(the corresponding line in train.target): This

The lines are exactly as above, with a newline after each example but no other punctuation. I have not modified the tokens or the model.

Expected behavior

I expect T5 to learn to output the first word.

Observed

T5 outputs the first word followed by gibberish:

After 300 epochs, here is what we see for the first 5 lines of test.source vs. the test_generations (test.target is just the first word of each line in test.source).
test.source:
We raised a bloom, a monster
I let Satan corrupt and torment
Chapter in play is an old piece
Old skin disease liable to drain confidence
Keep a riot going inside a musical academy

test_generations:
We vsahmoastuosastostassymbossa
Issahrastahmoormentostormentastoshomment
Chapter vshygie'ny-futtahraffahtaftast
Old hygienohmahrastassahuasairtia
Keep'astifiahuassaivrasastoshygiesana

I wonder if any of the following could be affecting this:

  • choice of loss function
  • a corrupted character somewhere in one of the inputs/outputs
  • choice of task (I think it defaults to summarization)
  • need more epochs?
  • some other parameter to change?
@sshleifer (Contributor)

"some other parameter to change?": BINGO.

There is a min_length/max_length parameter you can pass to beam search (in many ways) that is affecting your generations. If you eval offline with min_length=0, max_length=3, it should work.
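
For reference, an offline eval along those lines might look like the sketch below; the checkpoint path is hypothetical, and the point is only the min_length/max_length arguments passed to generate():

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Hypothetical path to a checkpoint saved by finetune.py
ckpt_dir = "output_dir/best_tfmr"

tokenizer = T5Tokenizer.from_pretrained(ckpt_dir)
model = T5ForConditionalGeneration.from_pretrained(ckpt_dir)

inputs = tokenizer("We raised a bloom, a monster", return_tensors="pt")

# Constrain decoding so the model is allowed to stop after the first word
generated = model.generate(
    inputs["input_ids"],
    min_length=0,
    max_length=3,
    num_beams=4,
)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```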

@jsrozner (Contributor, Author)

jsrozner commented Oct 14, 2020

Cool! Sorry for the n00biness.

  1. Is there somewhere I can read about when/why this happens? (Or, in brief, why does it happen?)
  2. min_length and max_length will just limit how long the output sequence can be? Where's the best place to set them? Directly in finetune.py?
  3. Is there a different way to have the model learn when to stop outputting? (i.e., to learn by itself that it should output only one "word", since that's what all the training examples show)

@sshleifer (Contributor)

sshleifer commented Oct 14, 2020

  1. You can read the docstring for generate.
  2. I would edit finetune.py around here.
  3. It should learn good lengths within the hardcoded range; it's simply not allowed to go outside that range.
    If you set min_length=0, max_length=10, I would guess it will learn to always generate the word followed by </s>. (This "eos" symbol is automatically added to input sequences by the T5Tokenizer; see the sanity-check sketch below.)
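
A quick sanity check for the eos point (a sketch; whether T5Tokenizer appends </s> automatically has varied across transformers versions, so it is worth verifying directly):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

ids = tokenizer("This")["input_ids"]
tokens = tokenizer.convert_ids_to_tokens(ids)
print(tokens)

# If the last token is not '</s>', the targets never show the model an
# end-of-sequence marker, so it cannot learn when to stop generating.
if tokens[-1] != tokenizer.eos_token:
    print("eos is not appended automatically in this version")
```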

@jsrozner (Contributor, Author)

jsrozner commented Oct 15, 2020

Thanks! I am rerunning with max_length (I didn't see a spot for min_length).

I'm still a little confused about why this happens, though. For example:

  • why doesn't it get penalized for the gibberish? (Is padding somehow affecting what it gets penalized for?)
  • why isn't the gibberish at all linguistic, even? I would expect it to produce at least mostly English-like tokens; these strings seem entirely non-linguistic.

Related: is there an easy flag to change so that I could view part of the validation outputs at each epoch to keep track of when it learns to truncate? Right now I'm just waiting until end of training to look at the test generations.
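
(One way to do this without a dedicated flag would be a small PyTorch Lightning callback along the lines of the sketch below; it assumes the Lightning module used by finetune.py exposes model and tokenizer attributes, which is not confirmed here. It could be registered on the Trainer via its callbacks argument.)

```python
import pytorch_lightning as pl


class PreviewGenerationsCallback(pl.Callback):
    """Print a few generations at the end of every validation epoch."""

    def __init__(self, sample_texts, max_length=10):
        self.sample_texts = sample_texts
        self.max_length = max_length

    def on_validation_epoch_end(self, trainer, pl_module):
        # Assumes pl_module exposes .tokenizer and .model (an assumption here)
        tokenizer, model = pl_module.tokenizer, pl_module.model
        batch = tokenizer(self.sample_texts, return_tensors="pt", padding=True)
        generated = model.generate(
            batch["input_ids"].to(pl_module.device),
            attention_mask=batch["attention_mask"].to(pl_module.device),
            min_length=0,
            max_length=self.max_length,
        )
        for src, out in zip(self.sample_texts, generated):
            decoded = tokenizer.decode(out, skip_special_tokens=True)
            print(f"{src!r} -> {decoded!r}")
```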

@sshleifer (Contributor)

@jsrozner (Contributor, Author)

jsrozner commented Oct 15, 2020

Okay thanks, I will work on these.

I realize these are unrelated T5 issues, but before I file other feature requests/bugs I just wanted to run them by you:

  • auto_lr_find and auto_scale_batch_size (PyTorch Lightning flags) throw errors when used from the finetune.sh script. Should these be usable? (I can debug and figure out why they're not working, but I want to know whether they should be working.)
  • I am unable to get the finetune.sh script to resume from a checkpoint (I played around with this for ~2 hours last night). Should this be supported?

@sshleifer (Contributor)

sshleifer commented Oct 15, 2020

auto*: Would be nice if they worked!
It should work with --resume_from_checkpoint, but that part of Lightning has been very flaky.

I probably won't fix either of these, but I would definitely accept a PR that makes the currently broken clargs work. If you can't fix them, you could also make separate issues for the clargs that don't work, label them "Help Wanted", and see what happens.
If you make issues, make sure to include your PL version.

@danyaljj (Contributor)

@jsrozner did finetune.py work for fine-tuning T5 for you?

We're also having some difficulties and wanted to make sure it has worked for someone else, at least.

@jsrozner (Contributor, Author)

@danyaljj this will be fixed by #8435.

@danyaljj (Contributor)

Thanks, @jsrozner, for the update!
Does this address the issue here? Mainly your observation that:

But even after setting eval_beams=1, eval_max_gen_length=40, it still continues to generate many more tokens than it should

@sshleifer (Contributor)

Did you pass min_length=0 to generate?

@jsrozner (Contributor, Author)

See issue #5142 for the resolution.
