Commit

fixed multiplechoice tokenization (huggingface#12362)
* fixed multiplechoice tokenization

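With the outer brackets, each inner list was treated as one sequence pair, so the model would have seen two sequences:
1. [CLS]prompt[SEP]prompt[SEP]
2. [CLS]choice0[SEP]choice1[SEP]
That is not correct, because we want a contextualized embedding of the prompt together with each choice.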

* removed outer brackets for proper sequence generation
cronoik authored and Iwontbecreative committed Jul 15, 2021
1 parent 9af1491 commit 96421ae
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion src/transformers/file_utils.py
@@ -816,7 +816,7 @@ def _prepare_output_docstrings(output_type, config_class):
>>> choice1 = "It is eaten while held in the hand."
>>> labels = torch.tensor(0).unsqueeze(0) # choice0 is correct (according to Wikipedia ;)), batch size 1
- >>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
+ >>> encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)
>>> outputs = model(**{{k: v.unsqueeze(0) for k,v in encoding.items()}}, labels=labels) # batch size is 1
>>> # the linear classifier still needs to be trained
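
The difference between the removed and the added call can be illustrated with a toy sketch. `layout_sequences` is a hypothetical stand-in, not the real tokenizer: it assumes BERT-style `[CLS]`/`[SEP]` markers and only mimics how the positional arguments are paired into sequences. The strings `"prompt"`, `"choice0"` and `"choice1"` are placeholders matching the commit message, not the actual docstring values.

```python
def layout_sequences(text, text_pair=None):
    """Hypothetical helper: show which strings end up in the same sequence."""
    if text_pair is None:
        # One positional argument holding lists: each inner list is read
        # as a single (first, second) sequence pair.
        pairs = [tuple(item) for item in text]
    else:
        # Two positional arguments: the i-th text is paired with the
        # i-th text_pair.
        pairs = list(zip(text, text_pair))
    return [f"[CLS]{first}[SEP]{second}[SEP]" for first, second in pairs]


# Placeholder strings (assumption), standing in for the real example texts.
prompt, choice0, choice1 = "prompt", "choice0", "choice1"

# Call shape before the fix (outer brackets).
before = layout_sequences([[prompt, prompt], [choice0, choice1]])
# Call shape after the fix (two separate lists).
after = layout_sequences([prompt, prompt], [choice0, choice1])

print(before)
print(after)
```

Under these assumptions, `before` reproduces the two sequences listed in the commit message, while `after` pairs the prompt with each choice, which is the layout a multiple-choice head expects.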