
Why padding tokens can be masked in albert model? Is it bug or right? #11158

Closed

woong97 opened this issue Apr 9, 2021 · 2 comments

Comments

woong97 commented Apr 9, 2021

I tried to run run_mlm.py with a BERT model and with an ALBERT model.
The pad token is never masked when I run bert-base-uncased, but pad tokens can be masked when I run albert-base-v2.

[bert command]

%  python run_mlm.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir ./tmp/test-mlm --line_by_line

[albert command]

%  python run_mlm.py --model_name_or_path albert-base-v2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir ./tmp/test-mlm --line_by_line
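
As far as I understand, whether a position can receive [MASK] at all comes down to the special-tokens mask that DataCollatorForLanguageModeling uses to zero out the masking probability. Roughly what I think happens (a simplified sketch in the spirit of the collator, not the library's exact code; the helper name is mine):

import torch

def sketch_mask_tokens(input_ids, special_tokens_mask, mask_token_id, mlm_probability=0.15):
    # input_ids and special_tokens_mask are tensors of shape (batch, seq_len).
    # Positions flagged as special (1) get probability 0.0 and are never selected;
    # any position left at 0 (including pads, if the mask misses them) stays eligible.
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    probability_matrix.masked_fill_(special_tokens_mask.bool(), value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100             # loss is only computed on masked positions
    input_ids[masked_indices] = mask_token_id  # (the real collator also keeps/randomizes some tokens)
    return input_ids, labels

So the question is what each tokenizer's get_special_tokens_mask returns for padding positions.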

In examples/language-modeling/run_mlm.py, I tried calling tokenizer.get_special_tokens_mask:

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
print(tokenizer.get_special_tokens_mask([0, 100, 101, 102, 2, 3, 4], already_has_special_tokens=True))
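
For reference, here is the same check without hard-coding ids (a small sketch; ids 7 and 8 are arbitrary placeholders for ordinary vocabulary tokens):

from transformers import AutoTokenizer

for name in ("bert-base-uncased", "albert-base-v2"):
    tok = AutoTokenizer.from_pretrained(name)
    # toy sequence: [CLS] w1 w2 [SEP] <pad> <pad>
    ids = [tok.cls_token_id, 7, 8, tok.sep_token_id, tok.pad_token_id, tok.pad_token_id]
    print(name, tok.get_special_tokens_mask(ids, already_has_special_tokens=True))

If I read the two implementations below correctly, bert-base-uncased flags the two trailing pad positions with 1, while albert-base-v2 leaves them at 0, so those positions stay eligible for masking.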

"get_special_tokens_mask" function is called from "class PreTrainedTokenizerBase" when I run bert-base-uncased, but "get_special_tokens_mask" function is called from "class AlbertTokenizerFast" whenn I run albert-base-v2.

In the PreTrainedTokenizerBase class:

def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
    all_special_ids = self.all_special_ids  # cache the property
    special_tokens_mask = [1 if token in all_special_ids else 0 for token in token_ids_0]

    return special_tokens_mask
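
The base-class version flags padding because all_special_ids includes the pad id. A quick check (using bert-base-uncased):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.all_special_ids)                      # the [PAD], [UNK], [CLS], [SEP], [MASK] ids
print(tok.pad_token_id in tok.all_special_ids)  # True, so pad positions get a 1 in the mask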

However, in the AlbertTokenizerFast class:

def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
    if already_has_special_tokens:
        if token_ids_1 is not None:
            raise ValueError(
                "You should not supply a second sequence if the provided sequence of "
                "ids is already formatted with special tokens for the model."
            )
        return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]

=> These two functions are different. When I use BERT, all_special_ids (which contains the cls, sep, and pad ids) gives the set of ids that cannot be masked. But when I use ALBERT, only the cls and sep ids cannot be masked, so the pad token can be masked when I use ALBERT.

I don't know why the function is called from a different class depending on whether I run bert-base-uncased or albert-base-v2.
Do you know why?

And is it correct that the pad token can be masked in the ALBERT model?

LysandreJik (Member) commented:

Related to #11163 by @sgugger

sgugger (Collaborator) commented Apr 9, 2021

This is solved by #11163

sgugger closed this as completed Apr 9, 2021