add tests/test_tokenization_reformer.py #6333

Closed
sshleifer opened this issue Aug 7, 2020 · 6 comments
Labels: Help wanted (extra attention is needed, help appreciated), Tests (related to tests)

Comments

@sshleifer
Contributor

sshleifer commented Aug 7, 2020

I don't think there is any common test coverage for ReformerTokenizer, other than through integration tests.
A good source to copy and modify is XLMRobertaTokenizationTest.
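
For orientation, here is a rough sketch of what tests/test_tokenization_reformer.py could look like when modeled on the XLM-R test. The fixture path, the TokenizerTesterMixin hooks, and the assertions are assumptions for illustration, not the final merged test:

    import unittest

    from transformers import ReformerTokenizer

    from .test_tokenization_common import TokenizerTesterMixin

    # Assumed path to the small SentencePiece fixture used by other tokenizer tests.
    SAMPLE_VOCAB = "tests/fixtures/test_sentencepiece.model"


    class ReformerTokenizationTest(TokenizerTesterMixin, unittest.TestCase):

        tokenizer_class = ReformerTokenizer

        def setUp(self):
            super().setUp()
            # Save a tokenizer built from the fixture so the common tests
            # can reload it from self.tmpdirname.
            tokenizer = ReformerTokenizer(SAMPLE_VOCAB)
            tokenizer.save_pretrained(self.tmpdirname)

        def test_full_tokenizer(self):
            tokenizer = ReformerTokenizer(SAMPLE_VOCAB)
            tokens = tokenizer.tokenize("This is a test")
            ids = tokenizer.convert_tokens_to_ids(tokens)
            # Round-trip: every token should map to an id and back.
            self.assertEqual(tokenizer.convert_ids_to_tokens(ids), tokens)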

@sshleifer added the Help wanted and Tests labels on Aug 7, 2020
@sshleifer changed the title from "add test_tokenization_reformer.py" to "add tests/test_tokenization_reformer.py" on Aug 7, 2020
@D-Roberts
Contributor

I can help with this.

@sshleifer
Contributor Author

Awesome!

@D-Roberts
Contributor

D-Roberts commented Aug 12, 2020

@sshleifer I put together the test code and found that the following test is failing:

self = <tests.test_tokenization_reformer.ReformerTokenizationTest testMethod=test_torch_encode_plus_sent_to_model>

    @slow
    @require_torch
    def test_torch_encode_plus_sent_to_model(self):
        import torch
        from transformers import MODEL_MAPPING, TOKENIZER_MAPPING

        MODEL_TOKENIZER_MAPPING = merge_model_tokenizer_mappings(MODEL_MAPPING, TOKENIZER_MAPPING)

        tokenizers = self.get_tokenizers(do_lower_case=False)
        for tokenizer in tokenizers:
            with self.subTest(f"{tokenizer.__class__.__name__}"):

                if tokenizer.__class__ not in MODEL_TOKENIZER_MAPPING:
                    return

                config_class, model_class = MODEL_TOKENIZER_MAPPING[tokenizer.__class__]
                config = config_class()

                if config.is_encoder_decoder or config.pad_token_id is None:
                    return

                model = model_class(config)

                # Make sure the model contains at least the full vocabulary size in its embedding matrix
                is_using_common_embeddings = hasattr(model.get_input_embeddings(), "weight")
>               assert (
                    (model.get_input_embeddings().weight.shape[0] >= len(tokenizer))
                    if is_using_common_embeddings
                    else True
                )
E               AssertionError:
E               assert False

Upon further investigation I found a discrepancy between the pre-trained tokenizer and pre-trained model config around the pad token id and resulting vocab size. Please see the example below:

>>> from transformers import ReformerTokenizer, ReformerModel
>>> model = ReformerModel.from_pretrained("google/reformer-crime-and-punishment")
>>> tokenizer = ReformerTokenizer.from_pretrained("google/reformer-crime-and-punishment")
>>> print(tokenizer.vocab_size)
320
>>> print(len(tokenizer))
321
>>> print(model.config.vocab_size)
320
>>> print(model.get_input_embeddings().weight.shape[0])
320
>>> print(tokenizer.get_vocab()['<pad>'])
320
>>> print(model.config.pad_token_id)
0
>>> print(tokenizer.get_vocab()['<unk>'])
0

What is your suggestion for moving forward?

@sshleifer
Contributor Author

My suggestion would be to check in tokenization_utils_base how __len__ works, and try to make it so that ReformerTokenizer's len is 320.
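
For context, a minimal sketch of the length logic this points at, assuming it mirrors PreTrainedTokenizer.__len__ in tokenization_utils_base at the time (the attribute names here are an assumption):

    def tokenizer_len(tokenizer):
        # Assumed behaviour of PreTrainedTokenizer.__len__:
        # base SentencePiece vocab plus any tokens registered after loading.
        return tokenizer.vocab_size + len(tokenizer.added_tokens_encoder)

    # For google/reformer-crime-and-punishment the SentencePiece model has 320 pieces
    # and '<pad>' is not one of them, so it gets registered as an added token with
    # id 320, making len(tokenizer) == 321 while the embedding matrix has only 320 rows.
    # Getting len(tokenizer) down to 320 means avoiding that extra added token for this
    # checkpoint, e.g. by not registering '<pad>' as a new token.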

@D-Roberts
Contributor

@sshleifer Test merged.

@sshleifer
Contributor Author

Thx @D-Roberts !
