
Why padding tokens can be masked in albert model? Is it bug or right? #11158

Closed

woong97 opened this issue Apr 9, 2021 · 2 comments

Comments

woong97 commented Apr 9, 2021

I tried to run run_mlm.py with a BERT model and with an ALBERT model.
The pad token is never masked when I run bert-base-uncased, but pad tokens can be masked when I run albert-base-v2.

[bert command]

%  python run_mlm.py --model_name_or_path bert-base-uncased --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir ./tmp/test-mlm --line_by_line

[albert command]

%  python run_mlm.py --model_name_or_path albert-base-v2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train --do_eval --output_dir ./tmp/test-mlm --line_by_line
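
As far as I understand, whether a position can receive [MASK] at all comes down to the special-tokens mask that DataCollatorForLanguageModeling uses to zero out the masking probability. Roughly what I think happens (a simplified sketch in the spirit of the collator, not the library's exact code; the helper name is mine):

import torch

def sketch_mask_tokens(input_ids, special_tokens_mask, mask_token_id, mlm_probability=0.15):
    # input_ids and special_tokens_mask are tensors of shape (batch, seq_len).
    # Positions flagged as special (1) get probability 0.0 and are never selected;
    # any position left at 0 (including pads, if the mask misses them) stays eligible.
    labels = input_ids.clone()
    probability_matrix = torch.full(labels.shape, mlm_probability)
    probability_matrix.masked_fill_(special_tokens_mask.bool(), value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100             # loss is only computed on masked positions
    input_ids[masked_indices] = mask_token_id  # (the real collator also keeps/randomizes some tokens)
    return input_ids, labels

So the question is what each tokenizer's get_special_tokens_mask returns for padding positions.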

In examples/language-modeling/run_mlm.py, I tried calling tokenizer.get_special_tokens_mask:

tokenizer = AutoTokenizer.from_pretrained(model_args.model_name_or_path, **tokenizer_kwargs)
print(tokenizer.get_special_tokens_mask([0, 100, 101, 102, 2, 3, 4], already_has_special_tokens=True))
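
For reference, here is the same check without hard-coding ids (a small sketch; ids 7 and 8 are arbitrary placeholders for ordinary vocabulary tokens):

from transformers import AutoTokenizer

for name in ("bert-base-uncased", "albert-base-v2"):
    tok = AutoTokenizer.from_pretrained(name)
    # toy sequence: [CLS] w1 w2 [SEP] <pad> <pad>
    ids = [tok.cls_token_id, 7, 8, tok.sep_token_id, tok.pad_token_id, tok.pad_token_id]
    print(name, tok.get_special_tokens_mask(ids, already_has_special_tokens=True))

If I read the two implementations below correctly, bert-base-uncased flags the two trailing pad positions with 1, while albert-base-v2 leaves them at 0, so those positions stay eligible for masking.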

"get_special_tokens_mask" function is called from "class PreTrainedTokenizerBase" when I run bert-base-uncased, but "get_special_tokens_mask" function is called from "class AlbertTokenizerFast" whenn I run albert-base-v2.

In the PreTrainedTokenizerBase class:

def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
    all_special_ids = self.all_special_ids  # cache the property
    special_tokens_mask = [1 if token in all_special_ids else 0 for token in token_ids_0]

    return special_tokens_mask
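
The base-class version flags padding because all_special_ids includes the pad id. A quick check (using bert-base-uncased):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.all_special_ids)                      # the [PAD], [UNK], [CLS], [SEP], [MASK] ids
print(tok.pad_token_id in tok.all_special_ids)  # True, so pad positions get a 1 in the mask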

However, in the AlbertTokenizerFast class:

def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
    if already_has_special_tokens:
        if token_ids_1 is not None:
            raise ValueError(
                "You should not supply a second sequence if the provided sequence of "
                "ids is already formatted with special tokens for the model."
            )
        return list(map(lambda x: 1 if x in [self.sep_token_id, self.cls_token_id] else 0, token_ids_0))

    if token_ids_1 is not None:
        return [1] + ([0] * len(token_ids_0)) + [1] + ([0] * len(token_ids_1)) + [1]
    return [1] + ([0] * len(token_ids_0)) + [1]

=> These two functions are different. When I use BERT, all_special_ids (which contains the cls, sep, and pad ids) gives the set of ids that cannot be masked. But when I use ALBERT, only the cls and sep ids cannot be masked, so the pad token can be masked when I use ALBERT.

I don't know why the function is called from a different class depending on whether I run bert-base-uncased or albert-base-v2.
Do you know why?

And is it correct that the pad token can be masked in the ALBERT model?

LysandreJik (Member) commented:

Related to #11163 by @sgugger

sgugger (Collaborator) commented Apr 9, 2021

This is solved by #11163

sgugger closed this as completed Apr 9, 2021