As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes in inputs encoded with AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), labels must have the same sequence length as the tensors in inputs. Does this always have to be the case?
Example
If I have the sentence "I am Yousef Nami", the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"].
However, after tokenisation, the sentence becomes: ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'], and so BigBird expects the labels to be something like: ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON', 'I-PERSON', 'O'].
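For concreteness, here is a minimal sketch of the mismatch. The `google/bigbird-roberta-base` checkpoint is my assumption (the issue does not name one); any SentencePiece-based checkpoint shows the same effect:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")  # assumed checkpoint

sentence = "I am Yousef Nami"
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]  # one label per word

encoding = tokenizer(sentence)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
# ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]']
print(len(tokens), "tokens vs", len(word_labels), "word labels")  # 9 vs 4
```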
We need to answer the following:
Does the number of target labels always have to match the tokenised sequence length (i.e. one label per token)? If so, why?
Which is the correct way of representing the target variables for tokenised entities? E.g. does ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]?
Do the [CLS] and [SEP] tokens turn into 'O'? What effect does this have on the classification?
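On the first and third questions: the labels do need to match the tokenised length, because AutoModelForTokenClassification emits one prediction per input token. A common recipe (the one used in the Hugging Face token-classification examples, not necessarily what we should adopt here) is to expand word labels to token level via the fast tokenizer's word_ids() and to give special tokens the ignore index -100, so the loss skips them rather than forcing them to be 'O'. A rough sketch, again assuming the google/bigbird-roberta-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")  # assumed checkpoint

words = ["I", "am", "Yousef", "Nami"]
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]

encoding = tokenizer(words, is_split_into_words=True)

aligned = []
for word_idx in encoding.word_ids():
    if word_idx is None:
        # special token such as [CLS]/[SEP]: -100 is ignored by CrossEntropyLoss
        aligned.append(-100)
    else:
        # here every sub-token simply repeats its word's label
        aligned.append(word_labels[word_idx])

print(aligned)
# roughly: [-100, 'O', 'O', 'B-PERSON', 'B-PERSON', 'B-PERSON', 'I-PERSON', 'I-PERSON', -100]
```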
@ValerieF412 asked Yihong the following question:
Out of the label-alignment strategies below, which one should we go for?
Sample sentence = "Yousef Nami", Labels = "B-PER I-PER", tokenised = "_Y ou sef _Na mi" (length = 5)
Strat 1: Token labels same as word labels
Intuition: we are interested in classifying words, not tokens, so when we tokenise we map each word's label onto all of its tokens as well.
Resulting output: "B-PER B-PER B-PER I-PER I-PER"
Strat 2: Only have a single start label, the rest should be intermediate
Intuition: we are interested in predicting only where each sequence starts, and not so much the actual classification of the words once a sequence starts.
Resulting output: "B-PER I-PER I-PER I-PER I-PER"
Strat 3: Only have labels for the start tokens of words which identify an entity
Intuition: similar to above, but perhaps removing the intermediates decreases noise? This might make sense for an NER example, but will it apply to argument mining?
Resulting output: "B-PER x x I-PER x"
Question: are the blank tokens ("x") just given the outside label "O", or do we use a separate label?
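(One option for the blank "x" positions, offered as an assumption rather than a decision: use PyTorch's default ignore index -100 instead of 'O', so those positions contribute nothing to the loss rather than being trained towards 'O'. A tiny sketch with made-up logits:)

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()            # ignore_index defaults to -100
logits = torch.randn(5, 3)                       # 5 tokens, 3 classes (B-PER, I-PER, O)
labels = torch.tensor([0, -100, -100, 1, -100])  # "B-PER x x I-PER x" with x = -100
print(loss_fn(logits, labels))                   # only positions 0 and 3 contribute
```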
Yihong responded, suggesting the following strategy:
Strat 4: Add a third end label for the tokens
Intuition: not sure, but presumably similar to the intuition for Strat 2, except that we also give the model information about where a sequence ends.
Resulting output: "B-PER I-PER I-PER I-PER E-PER", where E-PER signifies the end of an entity sequence.
Each of the above methods may be valid, and they may have different effects on how you evaluate the model.
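For reference, a rough sketch (my own, with the tokenisation of the sample hard-coded rather than produced by a tokenizer) that reproduces the four "Resulting output" strings above; "x" is a placeholder for whichever label we choose for the unlabelled positions:

```python
IGNORE = "x"  # placeholder: could become 'O' or the -100 ignore index

def align_labels(word_labels, word_pieces, strategy):
    """Expand word-level labels to token level under one of the four strategies."""
    token_labels = []
    for label, pieces in zip(word_labels, word_pieces):
        if strategy == 1:
            # Strat 1: every sub-token repeats its word's label
            token_labels += [label] * len(pieces)
        elif strategy in (2, 4):
            # Strat 2/4: only the first sub-token keeps a B- tag, the rest are I-
            token_labels += [label] + [label.replace("B-", "I-")] * (len(pieces) - 1)
        elif strategy == 3:
            # Strat 3: only the first sub-token of each word is labelled
            token_labels += [label] + [IGNORE] * (len(pieces) - 1)
    if strategy == 4 and token_labels:
        # Strat 4: mark the end of the entity with an E- tag
        # (this toy version only closes the final entity; a full implementation
        # would close every entity span)
        token_labels[-1] = "E-" + token_labels[-1].split("-", 1)[1]
    return token_labels

word_labels = ["B-PER", "I-PER"]
word_pieces = [["_Y", "ou", "sef"], ["_Na", "mi"]]  # "Yousef Nami" -> 5 tokens

for strategy in (1, 2, 3, 4):
    print(strategy, " ".join(align_labels(word_labels, word_pieces, strategy)))
# 1 B-PER B-PER B-PER I-PER I-PER
# 2 B-PER I-PER I-PER I-PER I-PER
# 3 B-PER x x I-PER x
# 4 B-PER I-PER I-PER I-PER E-PER
```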
namiyousef changed the title from "NER Transformer based classification embedding size always same as classification output size?" to "Labelling schemes" on Apr 1, 2022.