fixed multiplechoice tokenization (#12362)

* fixed multiplechoice tokenization

  The model would have seen two sequences:
  1. [CLS]prompt[SEP]prompt[SEP]
  2. [CLS]choice0[SEP]choice1[SEP]
  that is not correct as we want a contextualized embedding of prompt and choice

* removed outer brackets for proper sequence generation
huggingface · Jun 25, 2021 · f866425
1 parent 4a872ca
Showing 1 changed file with 1 addition and 1 deletion.
diff --git a/src/transformers/file_utils.py b/src/transformers/file_utils.py
@@ -816,7 +816,7 @@ def _prepare_output_docstrings(output_type, config_class):
         >>> choice1 = "It is eaten while held in the hand."
         >>> labels = torch.tensor(0).unsqueeze(0)  # choice0 is correct (according to Wikipedia ;)), batch size 1
 
-        >>> encoding = tokenizer([[prompt, prompt], [choice0, choice1]], return_tensors='pt', padding=True)
+        >>> encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors='pt', padding=True)
         >>> outputs = model(**{{k: v.unsqueeze(0) for k,v in encoding.items()}}, labels=labels)  # batch size is 1
 
         >>> # the linear classifier still needs to be trained
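
To illustrate why the one-line change matters, here is a minimal sketch of how the two call shapes end up pairing the texts. `encode_pairs` is a hypothetical toy stand-in written for this note, not the `transformers` API: it only mimics the pairing rule described in the commit message and uses literal `[CLS]`/`[SEP]` markers instead of real token ids.

```python
def encode_pairs(text, text_pair=None):
    """Toy stand-in for BERT-style pair encoding (hypothetical, not the real tokenizer).

    - One argument (a list of 2-element lists): each inner list is read as one
      (first, second) sequence pair.
    - Two arguments (parallel lists): element i of `text` is paired with
      element i of `text_pair`.
    """
    if text_pair is None:
        pairs = [tuple(item) for item in text]
    else:
        pairs = list(zip(text, text_pair))
    return [f"[CLS]{first}[SEP]{second}[SEP]" for first, second in pairs]


# Placeholder strings, matching the names used in the commit message.
prompt, choice0, choice1 = "prompt", "choice0", "choice1"

# Old call shape: a single nested list, so each inner list becomes one pair.
old = encode_pairs([[prompt, prompt], [choice0, choice1]])

# New call shape: two parallel lists, so the prompt is paired with each choice.
new = encode_pairs([prompt, prompt], [choice0, choice1])

print(old)  # ['[CLS]prompt[SEP]prompt[SEP]', '[CLS]choice0[SEP]choice1[SEP]']
print(new)  # ['[CLS]prompt[SEP]choice0[SEP]', '[CLS]prompt[SEP]choice1[SEP]']
```

With the new call shape, each row of the encoding holds the prompt together with one candidate answer, so the `unsqueeze(0)` on the following line of the docstring produces tensors of shape (1, number of choices, sequence length), which is the layout a multiple-choice head expects.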