fix GPT2 token's special_tokens_mask when used with add_bos_token=True #19036
Conversation
if not self.add_bos_token:
    return super().get_special_tokens_mask(
        token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
    )

if token_ids_1 is None:
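For context, here is a minimal sketch of what the full overridden method on GPT2Tokenizer could look like with this change applied. The already_has_special_tokens branch and the pair-sequence return values are assumptions based on how the tokenizer prepends a BOS token in build_inputs_with_special_tokens, not a verbatim copy of the merged code:

```python
from typing import List, Optional

# Sketch of the method as it might live on GPT2Tokenizer (shown out of its class
# context, so self/super refer to the tokenizer); illustrative, not the merged source.
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )
    if not self.add_bos_token:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
        )
    # With add_bos_token=True a single BOS token is prepended to each sequence,
    # so only those positions are flagged as special.
    if token_ids_1 is None:
        return [1] + ([0] * len(token_ids_0))
    return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
```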
Do you think this should only be modified here and not in the tokenization_base class? I wonder if it could be silently affecting other models, but my knowledge of the tokenizer is clearly lacking!
Yes, I think it should be changed here: the other tokenizers that add special tokens override the get_special_tokens_mask method defined in the tokenization_base class.
For example:
transformers/src/transformers/models/bert/tokenization_bert.py, lines 293 to 319 in 693ba2c:
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]
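As an illustration of what this BERT-style mask looks like in practice, here is a minimal sketch; the bert-base-uncased checkpoint and the exact token IDs are assumptions that depend on the vocabulary:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
ids_without_special = tok.encode("hello", add_special_tokens=False)  # e.g. [7592]
# [CLS] and [SEP] are added around the sequence, so both ends are flagged as special.
print(tok.get_special_tokens_mask(ids_without_special))  # -> [1, 0, 1]
```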
The documentation is not available anymore as the PR was closed or merged.
Thanks for fixing!
What does this PR do?
Fix: #19035
This PR corrects the special tokens mask produced by the GPT2 tokenizer when it is used with add_bos_token=True.
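For illustration, a minimal sketch of the behaviour this fixes, assuming the gpt2 checkpoint (the exact token IDs depend on the vocabulary):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2", add_bos_token=True)  # assumed checkpoint
enc = tok("Hello world", return_special_tokens_mask=True)
# A BOS token is prepended to the input IDs, e.g. [50256, 15496, 995].
# Before this fix the special_tokens_mask came back as all zeros; with the fix
# the BOS position is flagged, e.g. [1, 0, 0].
print(enc["special_tokens_mask"])
```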
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have your input on it @sgugger, @LysandreJik, @patrickvonplaten and @ArthurZucker.