
Does MBartTokenizer remove the parameter decoder_input_ids? #8416

Closed · wmathor opened this issue Nov 9, 2020 · 3 comments · Fixed by #8421
wmathor (Contributor) commented Nov 9, 2020

Environment info

  • transformers version: 3.4.0
  • Platform: Google Colab
  • Python version: 3.7
  • PyTorch version (GPU?): 1.7.0+cu101
  • Tensorflow version (GPU?): 2.x
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help

Information

Model I am using (Bert, XLNet ...): mbart

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

from transformers import MBartTokenizer

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = MBartTokenizer.from_pretrained('facebook/mbart-large-cc25').prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"]

Steps to reproduce the behavior:

KeyError                                  Traceback (most recent call last)
<ipython-input-11-b3eedaf10c3e> in <module>()
      3 batch = MBartTokenizer.from_pretrained('facebook/mbart-large-en-ro').prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
      4 input_ids = batch["input_ids"]
----> 5 target_ids = batch["decoder_input_ids"]

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in __getitem__(self, item)
    232         """
    233         if isinstance(item, str):
--> 234             return self.data[item]
    235         elif self._encodings is not None:
    236             return self._encodings[item]

KeyError: 'decoder_input_ids'
sshleifer linked a pull request Nov 9, 2020 that will close this issue

sshleifer (Contributor) commented
The docs are incorrect, sorry about that.

Try

    from transformers import MBartForConditionalGeneration, MBartTokenizer
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
    article = "UN Chief Says There Is No Military Solution in Syria"
    batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX")
    translated_tokens = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
    translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
    assert translation == "Şeful ONU declară că nu există o soluţie militară în Siria"
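
(Note on the snippet: mBART was trained to begin decoding with the target language code rather than a generic start token, which is why decoder_start_token_id is set explicitly from tokenizer.lang_code_to_id here instead of relying on the model's default.)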

wmathor (Contributor, Author) commented Nov 10, 2020


Thank you for your reply. If I don't want to generate and only want to train, how should I change it?

example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian)
input_ids = batch["input_ids"]
target_ids = batch["decoder_input_ids"] # Error
decoder_input_ids = target_ids[:, :-1].contiguous()
labels = target_ids[:, 1:].clone()
model(input_ids=input_ids, decoder_input_ids=decoder_input_ids, labels=labels) #forward

sshleifer (Contributor) commented

See https://github.com/huggingface/transformers/blob/master/examples/seq2seq/finetune.py#L138

The batch argument to that function is the same as your batch (the output of prepare_seq2seq_batch).
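
For reference, the training step in that script boils down to roughly the following (a minimal sketch rather than the exact file, assuming transformers 3.x, where shift_tokens_right lives in transformers.modeling_bart and prepare_seq2seq_batch returns the target ids under the "labels" key):

    import torch.nn.functional as F
    from transformers import MBartForConditionalGeneration, MBartTokenizer
    from transformers.modeling_bart import shift_tokens_right  # 3.x location

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-cc25")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

    batch = tokenizer.prepare_seq2seq_batch(
        src_texts=["UN Chief Says There Is No Military Solution in Syria"],
        src_lang="en_XX",
        tgt_texts=["Şeful ONU declară că nu există o soluţie militară în Siria"],
        tgt_lang="ro_RO",
        return_tensors="pt",
    )
    tgt_ids = batch["labels"]  # the tokenizer returns "labels", not "decoder_input_ids"

    # Rotate the targets one position to the right; for mBART this moves the
    # trailing language code to the front, producing the decoder inputs.
    decoder_input_ids = shift_tokens_right(tgt_ids, tokenizer.pad_token_id)

    logits = model(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        decoder_input_ids=decoder_input_ids,
        use_cache=False,
    )[0]

    # Cross-entropy against the unshifted targets, ignoring padding.
    loss = F.cross_entropy(
        logits.view(-1, logits.shape[-1]),
        tgt_ids.view(-1),
        ignore_index=tokenizer.pad_token_id,
    )
    loss.backward()

The key point is that the labels themselves are never shifted: shift_tokens_right builds decoder_input_ids by rotating the last non-pad token to the front, and the loss is computed against the original target ids.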
