I tried to run run_mlm.py for a BERT model and an ALBERT model.
The "pad" token is never masked when I run bert-base-uncased, but the pad token can be masked when I run albert-base-v2.
In examples/language-modeling/run_mlm.py, get_special_tokens_mask is called on the tokenizer. When I run bert-base-uncased, get_special_tokens_mask resolves to the implementation in PreTrainedTokenizerBase, but when I run albert-base-v2 it resolves to the one in AlbertTokenizerFast.
In the PreTrainedTokenizerBase class:

```python
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    all_special_ids = self.all_special_ids  # cache the property
    special_tokens_mask = [1 if token in all_special_ids else 0 for token in token_ids_0]
    return special_tokens_mask
```
However, in the AlbertTokenizerFast class:

```python
def get_special_tokens_mask(
    self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
) -> List[int]:
    if already_has_special_tokens:
        if token_ids_1 is not None:
            raise ValueError(
                "You should not supply a second sequence if the provided sequence of "
                "ids is already formatted with special tokens for the model."
            )
        return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]
```
=> These two functions are different. With BERT, all_special_ids (which contains the cls, sep, and pad ids) defines the ids that can never be masked, but with ALBERT only the cls and sep ids are excluded. That is why the pad token can be masked when I use ALBERT.
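A minimal sketch of how the two tokenizers can be compared directly (assuming both checkpoints can be downloaded; the exact ids and output depend on the transformers version):

```python
from transformers import AutoTokenizer

# Encode a short sentence with padding so the sequence ends in pad tokens,
# then ask each tokenizer which positions it considers "special".
for name in ["bert-base-uncased", "albert-base-v2"]:
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok("hello world", padding="max_length", max_length=8)["input_ids"]
    mask = tok.get_special_tokens_mask(ids, already_has_special_tokens=True)
    print(name, ids, mask)
    # Per the behaviour described above, BERT marks the pad positions with 1
    # (excluded from MLM masking) while ALBERT leaves them at 0.
```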
I don't know why the function is called from a different class depending on whether I run bert-base-uncased or albert-base-v2.
Do you know why?
And is it correct that the pad token can be masked in the ALBERT model?
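For reference, one way to check which class each call actually resolves to (a small sketch using standard Python introspection; nothing here is specific to run_mlm.py):

```python
from transformers import AutoTokenizer

for name in ["bert-base-uncased", "albert-base-v2"]:
    tok = AutoTokenizer.from_pretrained(name)
    # __qualname__ reports the class that defines the method picked by Python's
    # method resolution order, e.g. "PreTrainedTokenizerBase.get_special_tokens_mask"
    # versus "AlbertTokenizerFast.get_special_tokens_mask".
    print(name, type(tok).get_special_tokens_mask.__qualname__)
```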