As of right now, the BigBird model (loaded using AutoModelForTokenClassification) takes in inputs encoded with AutoTokenizer. When the model is trained, e.g. model(**inputs, labels=labels), labels must have the same sequence length as the tensors in inputs. Does this always have to be the case?
Example
If I have the sentence "I am Yousef Nami", the corresponding labels (for standard NER) should be: ["O", "O", "B-PERSON", "I-PERSON"].
However, after tokenisation, the sentence becomes: ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]'], and so BigBird expects the labels to be something like: ['O', 'O', 'O', 'B-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON' OR 'I-PERSON', 'B-PERSON', 'I-PERSON', 'O'].
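For concreteness, here is a minimal sketch of the mismatch. The `google/bigbird-roberta-base` checkpoint is my assumption (the issue does not name one); any SentencePiece-based checkpoint shows the same effect:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")  # assumed checkpoint

sentence = "I am Yousef Nami"
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]  # one label per word

encoding = tokenizer(sentence)
tokens = tokenizer.convert_ids_to_tokens(encoding["input_ids"])
print(tokens)
# ['[CLS]', '▁I', '▁am', '▁Y', 'ous', 'ef', '▁N', 'ami', '[SEP]']
print(len(tokens), "tokens vs", len(word_labels), "word labels")  # 9 vs 4
```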
We need to answer the following:
Does the number of target labels always have to match the tokenised sequence length (i.e. one label per token)? If so, why?
Which is the correct way of representing the target variables for tokenised entities? E.g. does ['Yousef', 'B-PERSON'] become [['▁Y', 'ous', 'ef'], ['B-PERSON', 'B-PERSON', 'B-PERSON']] or [['▁Y', 'ous', 'ef'], ['B-PERSON', 'I-PERSON', 'I-PERSON']]?
Do the [CLS] and [SEP] tokens turn into 'O'? What effect does this have on the classification?
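On the first and third questions: the labels do need to match the tokenised length, because AutoModelForTokenClassification emits one prediction per input token. A common recipe (the one used in the Hugging Face token-classification examples, not necessarily what we should adopt here) is to expand word labels to token level via the fast tokenizer's word_ids() and to give special tokens the ignore index -100, so the loss skips them rather than forcing them to be 'O'. A rough sketch, again assuming the google/bigbird-roberta-base checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/bigbird-roberta-base")  # assumed checkpoint

words = ["I", "am", "Yousef", "Nami"]
word_labels = ["O", "O", "B-PERSON", "I-PERSON"]

encoding = tokenizer(words, is_split_into_words=True)

aligned = []
for word_idx in encoding.word_ids():
    if word_idx is None:
        # special token such as [CLS]/[SEP]: -100 is ignored by CrossEntropyLoss
        aligned.append(-100)
    else:
        # here every sub-token simply repeats its word's label
        aligned.append(word_labels[word_idx])

print(aligned)
# roughly: [-100, 'O', 'O', 'B-PERSON', 'B-PERSON', 'B-PERSON', 'I-PERSON', 'I-PERSON', -100]
```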
@ValerieF412 asked Yihong the following question:
Out of the label-alignment strategies below, which one should we go for?
Sample sentence = "Yousef Nami", Labels = "B-PER I-PER", tokenised = "_Y ou sef _Na mi" (length = 5)
Strat 1: Token labels same as word labels
Intuition: we are interested in classifying words, not tokens, so when we tokenise we map each word's label onto all of its tokens as well.
Resulting output: "B-PER B-PER B-PER I-PER I-PER"
Strat 2: Only have a single start label, the rest should be intermediate
Intuition: we are interested in predicting only where each sequence starts, and not so much the actual classification of the words once a sequence starts.
Resulting output: "B-PER I-PER I-PER I-PER I-PER"
Strat 3: Only have labels for the start tokens of words which identify an entity
Intuition: similar to above, but perhaps removing the intermediates decreases noise? This might make sense for an NER example, but will it apply to argument mining?
Resulting output: "B-PER x x I-PER x"
Question: are the blank tokens ("x") just given the outside label "O", or do we use a separate label?
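(One option for the blank "x" positions, offered as an assumption rather than a decision: use PyTorch's default ignore index -100 instead of 'O', so those positions contribute nothing to the loss rather than being trained towards 'O'. A tiny sketch with made-up logits:)

```python
import torch

loss_fn = torch.nn.CrossEntropyLoss()            # ignore_index defaults to -100
logits = torch.randn(5, 3)                       # 5 tokens, 3 classes (B-PER, I-PER, O)
labels = torch.tensor([0, -100, -100, 1, -100])  # "B-PER x x I-PER x" with x = -100
print(loss_fn(logits, labels))                   # only positions 0 and 3 contribute
```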
Yihong responded, suggesting the following strategy:
Strat 4: Add a third end label for the tokens
Intuition: not sure, but presumably similar to the intuition for Strat 2, except that we also give the model information about where a sequence ends.
Resulting output: "B-PER I-PER I-PER I-PER E-PER", where E-PER signifies the end of an entity sequence.
Each of the above methods may be valid, and they may have different effects on how you evaluate the model.
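For reference, a rough sketch (my own, with the tokenisation of the sample hard-coded rather than produced by a tokenizer) that reproduces the four "Resulting output" strings above; "x" is a placeholder for whichever label we choose for the unlabelled positions:

```python
IGNORE = "x"  # placeholder: could become 'O' or the -100 ignore index

def align_labels(word_labels, word_pieces, strategy):
    """Expand word-level labels to token level under one of the four strategies."""
    token_labels = []
    for label, pieces in zip(word_labels, word_pieces):
        if strategy == 1:
            # Strat 1: every sub-token repeats its word's label
            token_labels += [label] * len(pieces)
        elif strategy in (2, 4):
            # Strat 2/4: only the first sub-token keeps a B- tag, the rest are I-
            token_labels += [label] + [label.replace("B-", "I-")] * (len(pieces) - 1)
        elif strategy == 3:
            # Strat 3: only the first sub-token of each word is labelled
            token_labels += [label] + [IGNORE] * (len(pieces) - 1)
    if strategy == 4 and token_labels:
        # Strat 4: mark the end of the entity with an E- tag
        # (this toy version only closes the final entity; a full implementation
        # would close every entity span)
        token_labels[-1] = "E-" + token_labels[-1].split("-", 1)[1]
    return token_labels

word_labels = ["B-PER", "I-PER"]
word_pieces = [["_Y", "ou", "sef"], ["_Na", "mi"]]  # "Yousef Nami" -> 5 tokens

for strategy in (1, 2, 3, 4):
    print(strategy, " ".join(align_labels(word_labels, word_pieces, strategy)))
# 1 B-PER B-PER B-PER I-PER I-PER
# 2 B-PER I-PER I-PER I-PER I-PER
# 3 B-PER x x I-PER x
# 4 B-PER I-PER I-PER I-PER E-PER
```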
namiyousef changed the title from "NER Transformer based classification embedding size always same as classification output size?" to "Labelling schemes" on Apr 1, 2022.