Labelling schemes #13

Open · 3 tasks
namiyousef opened this issue Mar 3, 2022 · 1 comment
Labels: documentation (Improvements or additions to documentation)


@namiyousef (Owner)

As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes inputs encoded with AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), labels must have the same length as the tensors in inputs. Does this always have to be the case?

Example

If I have the sentence "I am Yousef Nami", the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"].

However, after tokenisation the sentence becomes ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'], so BigBird expects the labels to be something like ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' or 'I-PERSON', 'B-PERSON' or 'I-PERSON', 'B-PERSON', 'I-PERSON', 'O'].
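For reference, a minimal sketch of how the mismatch shows up in practice (assuming the google/bigbird-roberta-base checkpoint and its fast tokenizer, so that word_ids() is available; the exact token split may differ):

```python
from transformers import AutoTokenizer

# Assumption: the google/bigbird-roberta-base checkpoint; any fast tokenizer behaves the same way.
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")

words = ["I", "am", "Yousef", "Nami"]
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]  # one label per word (4 labels)

encoding = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]']  (9 tokens for 4 labels)
print(encoding.word_ids())
# e.g. [None, 0, 1, 2, 2, 2, 3, 3, None]  (word index per token; None for special tokens)
```

The word_ids() mapping is what any of the label-alignment options discussed below would operate on.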

We need to answer the following:

  • Does the size of the target variable always have to match the number of input tokens (i.e. the tokenised input length)? If so, why?
  • What is the correct way of representing the target variables for tokenised entities? E.g. does ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]? (See the sketch after this list.)
  • Do the [CLS] and [SEP] tokens get labelled 'O'? What effect does this have on the classification?
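As a starting point for the second and third bullets, here is a hedged sketch of the two candidate alignments, together with the convention of masking [CLS]/[SEP] with -100 so they are ignored by PyTorch's CrossEntropyLoss. The align_labels helper and the -100 ignore index are illustrative assumptions, not something we have settled on:

```python
# Illustrative helper (not in the repo): expands word-level labels to token-level labels
# using the tokenizer's word_ids(). Assumption: special tokens are masked with -100,
# PyTorch's default ignore index, rather than being forced to 'O'.
def align_labels(word_ids, word_labels, propagate_b=True):
    token_labels = []
    previous_word = None
    for word_id in word_ids:
        if word_id is None:  # [CLS] / [SEP] / padding
            token_labels.append(-100)
        elif propagate_b or word_id != previous_word:
            token_labels.append(word_labels[word_id])  # repeat the word label as-is
        else:
            # only the first token of a word keeps B-; later tokens become I-
            token_labels.append(word_labels[word_id].replace("B-", "I-"))
        previous_word = word_id
    return token_labels

word_labels = ["O", "O", "B-PERSON", "I-PERSON"]
word_ids = [None, 0, 1, 2, 2, 2, 3, 3, None]  # from the example above
print(align_labels(word_ids, word_labels, propagate_b=True))
# [-100, 'O', 'O', 'B-PERSON', 'B-PERSON', 'B-PERSON', 'I-PERSON', 'I-PERSON', -100]
print(align_labels(word_ids, word_labels, propagate_b=False))
# [-100, 'O', 'O', 'B-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', 'I-PERSON', -100]
```

In real training the string labels would be mapped to integer class ids before being passed as labels; the mix of strings and -100 here is only to keep the sketch readable.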
namiyousef added the "bug" (Something isn't working) and "help wanted" (Extra attention is needed) labels on Mar 3, 2022
namiyousef added this to the "Submit first model" milestone on Mar 3, 2022
namiyousef added a commit that referenced this issue on Mar 3, 2022
@namiyousef (Owner, Author)

@ValerieF412 asked Yihong the following question:
Out of the labelling strategies below, which one should we go for?
Sample sentence = "Yousef Nami", labels = "B-PER I-PER", tokenised = "_Y ou sef _Na mi" (length = 5)

Strat 1: Token labels are the same as the word labels

Intuition: we are interested in predicting the labels of the words, not of the individual tokens, so when we tokenise we map each word's label onto all of its tokens.
Resulting output: "B-PER B-PER B-PER I-PER I-PER"

Strat 2: Only have a single start label, the rest should be intermediate

Intuition: we are interested in predicting only where each sequence starts, and not so much the actual classification of the words once a sequence starts.
Resulting output: "B-PER I-PER I-PER I-PER I-PER"

Strat 3: Only label the start token of each word that identifies an entity

Intuition: similar to the above, but perhaps removing the intermediate tokens decreases noise? This might make sense for an NER example, but will it apply to argument mining?
Resulting output: "B-PER x x I-PER x"

Question: do the blank positions just get the outside label "O", or do we need a separate label for them?

Yihong responded, suggesting the following strategy:

Strat 4: Add a third, "end" label for the final token of an entity

Intuition: not sure, but presumably similar to the intuition for Strat 2, except that we also give the model information about where a sequence ends?
Resulting output: "B-PER I-PER I-PER I-PER E-PER", where E-PER signifies the end of an entity sequence.

Each of the above methods may be valid, and they may have different effects on how you evaluate the model (a rough sketch of all four is below).
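To make the comparison concrete, here is a rough sketch of the four strategies as label-alignment functions. The "x" placeholder from Strat 3 is written as -100 (PyTorch's default ignore index) and the function names are made up for illustration; none of this is agreed implementation:

```python
# Rough comparison of the four strategies on the "Yousef Nami" example.
word_labels = ["B-PER", "I-PER"]  # one label per word
word_ids = [0, 0, 0, 1, 1]        # "_Y ou sef _Na mi" -> word index of each token

def strat1(word_ids, labels):
    # Strat 1: every token inherits its word's label.
    return [labels[w] for w in word_ids]

def strat2(word_ids, labels):
    # Strat 2: only the first token of an entity keeps B-; all following tokens become I-.
    out = []
    for i, w in enumerate(word_ids):
        label = labels[w]
        starts_entity = label.startswith("B-") and (i == 0 or word_ids[i - 1] != w)
        out.append(label if starts_entity else label.replace("B-", "I-"))
    return out

def strat3(word_ids, labels):
    # Strat 3: only the first token of each word is labelled; the rest are masked with -100.
    out, prev = [], None
    for w in word_ids:
        out.append(labels[w] if w != prev else -100)
        prev = w
    return out

def strat4(word_ids, labels):
    # Strat 4: like Strat 2, but the final token gets an explicit E- (end) label.
    # Simplified: assumes the entity runs to the end of the sequence, as in this example.
    out = strat2(word_ids, labels)
    out[-1] = out[-1].replace("I-", "E-")
    return out

print(strat1(word_ids, word_labels))  # ['B-PER', 'B-PER', 'B-PER', 'I-PER', 'I-PER']
print(strat2(word_ids, word_labels))  # ['B-PER', 'I-PER', 'I-PER', 'I-PER', 'I-PER']
print(strat3(word_ids, word_labels))  # ['B-PER', -100, -100, 'I-PER', -100]
print(strat4(word_ids, word_labels))  # ['B-PER', 'I-PER', 'I-PER', 'I-PER', 'E-PER']
```

Whichever strategy is chosen also determines how token predictions should be collapsed back to word level at evaluation time, which is probably where the choice has the biggest effect on model comparison.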

namiyousef added the "documentation" (Improvements or additions to documentation) label and removed the "bug" (Something isn't working) and "help wanted" (Extra attention is needed) labels on Apr 1, 2022
namiyousef changed the title from "NER Transformer based classification: embedding size always same as classification output size?" to "Labelling schemes" on Apr 1, 2022