
High F1 score. But poor accuracy during Inference due to tokenisation #5541

Closed
sudharsan2020 opened this issue Jul 6, 2020 · 3 comments

@sudharsan2020

🐛 Bug

Information

I am using the Bert-Base-cased model to train a custom named entity recognition (NER) model with a sequence length of 512.

Language I am using the model on: English

The problem arises when using:

  • the official example scripts: token-classification/run_ner.py

The task I am working on is:

  • an official GLUE/SQuAD task: Named entity recognition
  • my own task or dataset: Custom Dataset

To reproduce

Steps to reproduce the behavior:

1. Use the default NER pipeline to load the custom-trained model:
   self.model_prediction_pipeline = pipeline(
       "ner", model=model_path, tokenizer=model_path, grouped_entities=True
   )
2. I've attached the evaluation results of the model:
eval_loss = 0.021479165139844086
eval_precision = 0.8725970149253731
eval_recall = 0.8868932038834951
eval_f1 = 0.8796870297923562
epoch = 5.0

Expected behavior

  1. The model should produce accuracy in line with the F1 score.
  2. However, during inference I am not getting an accuracy over 30%.
  3. Not sure if inconsistent tokenisation leads to the poor results.
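For context on why grouping can fragment a single name, here is a simplified sketch of the grouping rule as I understand it (my own reconstruction, not the actual transformers source): grouped_entities merges only consecutive tokens whose full tag, including the B-/I- prefix, is identical, so a B-PER piece followed by an I-PER piece starts a new group even inside one word.

```python
from itertools import groupby

# Simplified sketch (an assumption, not the transformers implementation)
# of the grouping step: consecutive token predictions are merged only
# while the full tag, B-/I- prefix included, stays the same.
tags = ["B-PER", "I-PER", "B-PER", "I-PER"]      # model output for "DANIEL , BROWN"
pieces = ["DAN", "##IE", "##L", ", BROWN"]       # WordPiece-level words

groups = [
    (tag, " ".join(piece for _, piece in grp))
    for tag, grp in groupby(zip(tags, pieces), key=lambda tp: tp[0])
]
print(groups)  # four separate groups, even though "DAN ##IE ##L" is one word
```

Because every tag change opens a new group, a single name whose subwords receive alternating B-/I- tags comes out as several fragments, which matches the split results I am seeing below.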

Environment info

  • transformers version: 3.0.0
  • Platform: Linux-4.15.0-109-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.7
  • PyTorch version (GPU?): 1.4.0
  • Tensorflow version (GPU?): NA
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
@LysandreJik
Member

Hello! Why do you believe the tokenization to be the issue here?


sudharsan2020 commented Jul 6, 2020

@LysandreJik Thanks for reaching out.

Please find my observations of the inconsistency in the tokenizer (the possible issue); I was using the HuggingFace-provided script to train the custom NER model.

1. Expected name:
AUDLEY THOMPSON

Predicted name:
{'entity_group': 'B-PER', 'score': 0.9993636608123779, 'word': 'AUDLE'},
{'entity_group': 'I-PER', 'score': 0.8126876294612885, 'word': '##Y THOMPS'}

Issue:
The last two letters of 'THOMPSON' ('ON') were dropped.

2. Expected name:
DANIEL, BROWN

Predicted name:
{'entity_group': 'B-PER', 'score': 0.9559168517589569, 'word': 'DAN'},
{'entity_group': 'I-PER', 'score': 0.9092316627502441, 'word': '##IE'},
{'entity_group': 'B-PER', 'score': 0.5071505904197693, 'word': '##L'},
{'entity_group': 'I-PER', 'score': 0.849787175655365, 'word': ', BROWN'}

Issue:
The WordPiece tokenizer splits the beginning entity into smaller pieces. However, the model predicts that as an "I-PER" entity, which makes it really difficult to merge consecutive entities.

3. Expected name:
VINEY, PAJTSHIA

Predicted name:
{'entity_group': 'B-PER', 'score': 0.9991838335990906, 'word': 'VI'},
{'entity_group': 'I-PER', 'score': 0.9591831763585409, 'word': '##Y , PA'},
{'entity_group': 'I-PER', 'score': 0.7927274107933044, 'word': '##IA'}

Issue:
The characters 'NE' are missing from the name 'VINEY'.
The characters 'JTSH' are missing from the name 'PAJTSHIA'.

4. Expected name:
Pierson, Garcia

Predicted name:
{'entity_group': 'B-PER', 'score': 0.9972472190856934, 'word': 'Pierson'},
{'entity_group': 'I-PER', 'score': 0.8200799822807312, 'word': 'GA'},
{'entity_group': 'I-PER', 'score': 0.8131067156791687, 'word': '##IA'}

Issue:
The characters 'RC' are missing from the name 'Garcia'.

Please let me know if I am missing something.
Missing characters and split tokens are the major reasons for the accuracy drop when merging the Begin (B-PER) and Inside (I-PER) entities.
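As a stopgap, the split-token part of the problem can be patched after the fact. The sketch below is a hypothetical helper of my own (merge_wordpieces is not a transformers API): it glues '##'-prefixed WordPiece outputs back onto the previous piece regardless of the B-/I- tag the model assigned, averaging the scores. It repairs splits like DAN / ##IE / ##L, but it cannot recover characters the pipeline has already dropped (e.g. the missing 'ON' in 'AUDLEY THOMPSON').

```python
def merge_wordpieces(predictions):
    """Glue '##'-prefixed WordPiece predictions onto the previous piece.

    `predictions` is a list of dicts shaped like the pipeline output:
    {'entity_group': ..., 'score': ..., 'word': ...}. A '##' piece is
    merged into the preceding word no matter which tag it carries; the
    merged score is the mean of the pieces' scores.
    """
    merged = []
    for pred in predictions:
        if merged and pred["word"].startswith("##"):
            previous = merged[-1]
            previous["word"] += pred["word"][2:]     # strip the '##' marker
            previous["scores"].append(pred["score"])
        else:
            merged.append({
                "entity_group": pred["entity_group"],
                "word": pred["word"],
                "scores": [pred["score"]],
            })
    for entity in merged:                            # average merged scores
        scores = entity.pop("scores")
        entity["score"] = sum(scores) / len(scores)
    return merged

# The fragmented output from observation 2 above:
preds = [
    {"entity_group": "B-PER", "score": 0.9559, "word": "DAN"},
    {"entity_group": "I-PER", "score": 0.9092, "word": "##IE"},
    {"entity_group": "B-PER", "score": 0.5071, "word": "##L"},
    {"entity_group": "I-PER", "score": 0.8497, "word": ", BROWN"},
]
print(merge_wordpieces(preds))  # 'DAN' + '##IE' + '##L' comes back as 'DANIEL'
```

This trusts the tag of the first piece of each word, which seems reasonable here since the fragmentation, not the leading tag, is what breaks the merge.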

stale bot commented Sep 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
