
SEAT score of bert-large-cased seems wrong #1

Open
inmoonlight opened this issue May 1, 2021 · 5 comments · May be fixed by #2

Comments

@inmoonlight

Dear authors,
It seems the reported SEAT score of bert-large-cased is wrong.

I was able to reproduce the results with the current code base; however, I found two errors in the code.

1. Even though I called bert-large-cased, the tokenized tokens are all lowercased.

import pytorch_pretrained_bert as bert

version = 'bert-large-cased'

tokenizer = bert.BertTokenizer.from_pretrained(version)
text = 'SEAT score of bert-large-CASED'
tokenized = tokenizer.tokenize(text)  # ['seat', 'score', 'of', 'be', '##rt', '-', 'large', '-', 'case', '##d']
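
A minimal sketch of a workaround, assuming pytorch_pretrained_bert forwards keyword arguments such as do_lower_case to the tokenizer constructor (the default appears to be do_lower_case=True regardless of the checkpoint name):

# Hypothetical workaround: explicitly disable lowercasing for cased checkpoints
tokenizer = bert.BertTokenizer.from_pretrained(version, do_lower_case=False)
tokenized = tokenizer.tokenize(text)  # subword splits may differ, but case is preserved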

2. The score is calculated from the first token's embedding vector, not from a [CLS] token.

Below is sentbias/encoders/bert.py; you can see that the text is not prepended with a [CLS] token.

''' Convenience functions for handling BERT '''
import torch
import pytorch_pretrained_bert as bert


def load_model(version='bert-large-uncased'):
    ''' Load BERT model and corresponding tokenizer '''
    tokenizer = bert.BertTokenizer.from_pretrained(version)
    model = bert.BertModel.from_pretrained(version)
    model.eval()

    return model, tokenizer


def encode(model, tokenizer, texts):
    ''' Use tokenizer and model to encode texts '''
    encs = {}
    for text in texts:
        tokenized = tokenizer.tokenize(text)  # <<< BUG: a [CLS] token should be prepended
        indexed = tokenizer.convert_tokens_to_ids(tokenized)
        segment_idxs = [0] * len(tokenized)
        tokens_tensor = torch.tensor([indexed])
        segments_tensor = torch.tensor([segment_idxs])
        enc, _ = model(tokens_tensor, segments_tensor, output_all_encoded_layers=False)

        enc = enc[:, 0, :]  # extract the last-layer representation of the first input token
        encs[text] = enc.detach().view(-1).numpy()
    return encs
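
A minimal sketch of a possible fix, assuming the standard pytorch_pretrained_bert API (encode_fixed is a hypothetical name, not necessarily what the linked PR uses): wrapping each text in [CLS] ... [SEP] makes position 0 really be the [CLS] token.

def encode_fixed(model, tokenizer, texts):
    ''' Sketch of a corrected encoder: wrap each text in [CLS] ... [SEP] '''
    encs = {}
    for text in texts:
        # Prepend [CLS] and append [SEP], as BERT expects for a single sentence
        tokenized = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']
        indexed = tokenizer.convert_tokens_to_ids(tokenized)
        segment_idxs = [0] * len(tokenized)
        tokens_tensor = torch.tensor([indexed])
        segments_tensor = torch.tensor([segment_idxs])
        enc, _ = model(tokens_tensor, segments_tensor, output_all_encoded_layers=False)

        enc = enc[:, 0, :]  # position 0 is now the [CLS] token
        encs[text] = enc.detach().view(-1).numpy()
    return encs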

@W4ngatang (Owner) commented May 10, 2021 via email

inmoonlight linked a pull request May 23, 2021 that will close this issue
@inmoonlight (Author)

@W4ngatang

Thanks for the reply!

To mitigate error propagation, I fixed the code to resolve the two issues mentioned above.

@realliyifei

Thank you for the informative conversation @inmoonlight @W4ngatang

I have some quick questions:

  1. So the purpose of enc[:, 0, :] here is to extract the last-layer representation of the [CLS] token? (Though, as mentioned above, the result would not differ much if the representation of the first subword token were picked instead.)
  2. Is there any reason not to average the tokens along the sequence axis (i.e., take the mean of each token's last-layer representation)? That might make more sense for evaluation, since it captures every token's information; see the sketch after this list.
  3. Is there any reason the code doesn't put the model, tokens, etc. on a GPU here?
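
A minimal sketch of the mean-pooling variant described in question 2, assuming the same pytorch_pretrained_bert API as the encoder above (encode_mean_pool is a hypothetical name, not part of the repo):

def encode_mean_pool(model, tokenizer, texts):
    ''' Sketch: average the last-layer token representations instead of using [CLS] '''
    encs = {}
    for text in texts:
        tokenized = ['[CLS]'] + tokenizer.tokenize(text) + ['[SEP]']
        indexed = tokenizer.convert_tokens_to_ids(tokenized)
        tokens_tensor = torch.tensor([indexed])
        segments_tensor = torch.tensor([[0] * len(tokenized)])
        enc, _ = model(tokens_tensor, segments_tensor, output_all_encoded_layers=False)

        enc = enc.mean(dim=1)  # mean over the sequence axis instead of enc[:, 0, :]
        encs[text] = enc.detach().view(-1).numpy()
    return encs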

@W4ngatang (Owner) commented May 11, 2022 via email

@realliyifei

Thank you for your reply. That’s very helpful, Alex!
