fix GPT2 token's special_tokens_mask when used with add_bos_token=True #19036
Conversation
if not self.add_bos_token:
    return super().get_special_tokens_mask(
        token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
    )

if token_ids_1 is None:
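For context, here is a minimal sketch of what the full overridden method on GPT2Tokenizer could look like with this change applied. The already_has_special_tokens branch and the pair-sequence return values are assumptions based on how the tokenizer prepends a BOS token in build_inputs_with_special_tokens, not a verbatim copy of the merged code:

```python
from typing import List, Optional

# Sketch of the method as it might live on GPT2Tokenizer (shown out of its class
# context, so self/super refer to the tokenizer); illustrative, not the merged source.
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )
    if not self.add_bos_token:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=False
        )
    # With add_bos_token=True a single BOS token is prepended to each sequence,
    # so only those positions are flagged as special.
    if token_ids_1 is None:
        return [1] + ([0] * len(token_ids_0))
    return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1))
```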
Do you think this should only be modified here and not in the tokenization_base class? I wonder if it could be silently affecting other models, but my knowledge of the tokenizer is clearly lacking!
Yes, I think it should be changed here: the other tokenizers that add special tokens override the get_special_tokens_mask method defined in the tokenization_base class.
For example:
transformers/src/transformers/models/bert/tokenization_bert.py, lines 293 to 319 in 693ba2c:
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    """
    Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
    special tokens using the tokenizer `prepare_for_model` method.

    Args:
        token_ids_0 (`List[int]`):
            List of IDs.
        token_ids_1 (`List[int]`, *optional*):
            Optional second list of IDs for sequence pairs.
        already_has_special_tokens (`bool`, *optional*, defaults to `False`):
            Whether or not the token list is already formatted with special tokens for the model.

    Returns:
        `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
    """
    if already_has_special_tokens:
        return super().get_special_tokens_mask(
            token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
        )

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]
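As an illustration of what this BERT-style mask looks like in practice, here is a minimal sketch; the bert-base-uncased checkpoint and the exact token IDs are assumptions that depend on the vocabulary:

```python
from transformers import BertTokenizer

tok = BertTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
ids_without_special = tok.encode("hello", add_special_tokens=False)  # e.g. [7592]
# [CLS] and [SEP] are added around the sequence, so both ends are flagged as special.
print(tok.get_special_tokens_mask(ids_without_special))  # -> [1, 0, 1]
```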
The documentation is not available anymore as the PR was closed or merged.
Thanks for fixing!
What does this PR do?
Fix: #19035
This PR corrects the special tokens mask produced by the GPT2 tokenizer when it is used with add_bos_token=True.
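For illustration, a minimal sketch of the behaviour this fixes, assuming the gpt2 checkpoint (the exact token IDs depend on the vocabulary):

```python
from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2", add_bos_token=True)  # assumed checkpoint
enc = tok("Hello world", return_special_tokens_mask=True)
# A BOS token is prepended to the input IDs, e.g. [50256, 15496, 995].
# Before this fix the special_tokens_mask came back as all zeros; with the fix
# the BOS position is flagged, e.g. [1, 0, 0].
print(enc["special_tokens_mask"])
```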
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Would love to have your input on it @sgugger, @LysandreJik, @patrickvonplaten and @ArthurZucker.