
Does MBartTokenizer remove the parameter decoder_input_ids? #8416

Closed · wmathor opened this issue Nov 9, 2020 · 3 comments · Fixed by #8421
wmathor (Contributor) commented Nov 9, 2020

Environment info

  • transformers version: 3.4.0
  • Platform: Google Colab
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7.0+cu101
  • Tensorflow version (GPU?): 2.x
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Information

Model I am using (Bert, XLNet ...): mbart

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

from transformers import MBartTokenizer

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = MBartTokenizer.from_pretrained('facebook/mbart-large-cc25').prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"]

Steps to reproduce the behavior:

KeyError                                  Traceback (most recent call last)
<ipython-input-11-b3eedaf10c3e> in <module>()
      3 batch = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro').prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
      4 input_ids = batch["input_ids"]
----> 5 target_ids = batch["decoder_input_ids"]

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in __getitem__(self, item)
    232         """
    233         if isinstance(item, str):
--> 234             return self.data[item]
    235         elif self._encodings is not None:
    236             return self._encodings[item]

KeyError: 'decoder_input_ids'
sshleifer linked a pull request Nov 9, 2020 that will close this issue

sshleifer (Contributor) commented
The docs are incorrect, sorry about that.

Try

    from transformers import MBartForConditionalGeneration, MBartTokenizer
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    article = "UN Chief Says There Is No Military Solution in Syria"
    batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX")
    translated_tokens = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
    translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
    assert translation == "Şeful ONU declară că nu există o soluţie militară în Siria"
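
(Note on the snippet: mBART was trained to begin decoding with the target language code rather than a generic start token, which is why decoder_start_token_id is set explicitly from tokenizer.lang_code_to_id here instead of relying on the model's default.)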

wmathor (Contributor, Author) commented Nov 10, 2020


Thank you for your reply. If I don't want to generate and only want to train, how should I change it?

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"] # Error
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels) #forward

sshleifer (Contributor) commented

See https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py#L138

The batch argument to that function is the same as your batch (the output of prepare_seq2seq_batch).
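
For reference, the training step in that script boils down to roughly the following (a minimal sketch rather than the exact file, assuming transformers 3.x, where shift_tokens_right lives in transformers.modeling_bart and prepare_seq2seq_batch returns the target ids under the "labels" key):

    import torch.nn.functional as F
    from transformers import MBartForConditionalGeneration, MBartTokenizer
    from transformers.modeling_bart import shift_tokens_right  # 3.x location

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["UN Chief Says There Is No Military Solution in Syria"],
        src_lang="en_XX",
        tgt_texts=["Şeful ONU declară că nu există o soluţie militară în Siria"],
        tgt_lang="ro_RO",
        return_tensors="pt",
    )
    tgt_ids = batch["labels"]  # the tokenizer returns "labels", not "decoder_input_ids"

    # Rotate the targets one position to the right; for mBART this moves the
    # trailing language code to the front, producing the decoder inputs.
    decoder_input_ids = shift_tokens_right(tgt_ids, tokenizer.pad_token_id)

    logits = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        use_cache=False,
    )[0]

    # Cross-entropy against the unshifted targets, ignoring padding.
    loss = F.cross_entropy(
        logits.view(-1, logits.shape[-1]),
        tgt_ids.view(-1),
        ignore_index=tokenizer.pad_token_id,
    )
    loss.backward()

The key point is that the labels themselves are never shifted: shift_tokens_right builds decoder_input_ids by rotating the last non-pad token to the front, and the loss is computed against the original target ids.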
