Duplicate grouped entities when using 'ner' pipeline #5609
Comments
Can you check whether this still occurs after the recently merged #4987?
Thanks for the response. Is there a special repo I have to pull from, or can I just update transformers? Assuming the latter, I just re-ran
No, you would have to install from source as explained in the readme.
Just cloned the repo (as directed in the readme) and noticed that the issue was resolved! Any estimate of when the next update will be released?
I was still having problems similar to issues #5077, #4816 and #5377. After some debugging, these are the possible reasons and fixes for wrong groupings: Looking for feedback from maintainers on my [WIP] PR #5970.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
🐛 Bug
Information
Model I am using (Bert, XLNet ...): 'ner' pipeline
Language I am using the model on (English, Chinese ...): English
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
Expected behavior
We should receive
[{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}]
but instead the output contains a duplicated 'New York':
[{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}, {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}]
The Cause of the Issue According to Me
After reading the 3.0.2 source, I noticed that lines 1047-1049 were added. I think this was done to fix a prior issue that caused the last named entity in the sequence to be occasionally omitted when grouped_entities=True. Long story short, I think this snippet was a patch that only shifted the problem from an occasional named entity omission to an occasional named entity duplicate.
The for-loop that precedes this snippet is inconsistent in that sometimes the last named entity gets successfully added anyway (e.g. if the if clause on line 1025 (first iteration) or line 1032 is entered on the last iteration). In this case, the new code at line 1047 adds the same entity a second time, producing a duplicate entry. Conversely, the last named entity won't be added by the loop if the else clause on line 1041 is entered on the last iteration. In this case, the final named entity correctly gets added once, by the new code snippet.
In short, there is a duplicate (I think) if (i) there is only one recognized named entity or (ii) the last named entity is one that the tokenizer cut up into multiple tokens. Otherwise, there is no duplicate.
nlp('Welcome to Dallas') -> duplicate 'Dallas' because 'Dallas' is the only named entity
nlp('HuggingFace is not located in Dallas') -> no duplicate because there are multiple entities and the final one 'Dallas' is not tokenized into multiple tokens
nlp('HuggingFace is located in New York City') -> duplicate 'New York City' because the final named entity 'New York City' is tokenized into multiple tokens
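To make the three cases above concrete, here is a minimal, self-contained sketch of the failure mode as I understand it. It is not the actual transformers source: the function names, the token dictionaries, and the index values are all made up for illustration, and the loop only mimics the branch structure described above (a flush inside the loop on the last iteration in some branches, followed by an unconditional flush after the loop). The second function shows one possible way to avoid the duplicate by flushing in exactly one place.

```python
# Hypothetical sketch, NOT the transformers implementation.
# Each token is a dict with made-up "word", "entity" and "index" fields.

def join_tokens(group):
    """Collapse a list of adjacent tokens into a single grouped entity."""
    return {
        "entity_group": group[0]["entity"],
        "word": " ".join(tok["word"] for tok in group),
    }


def group_entities_buggy(tokens):
    """Mimics the described logic: some branches flush on the last
    iteration, and then the post-loop code flushes again."""
    groups, current = [], []
    last_index = tokens[-1]["index"] if tokens else None
    for tok in tokens:
        is_last = tok["index"] == last_index
        if not current:
            # First token: start a group, and flush it if it is also the last.
            current = [tok]
            if is_last:
                groups.append(join_tokens(current))
            continue
        prev = current[-1]
        if tok["entity"] == prev["entity"] and tok["index"] == prev["index"] + 1:
            # Same label and adjacent: extend the group, flush if last.
            current.append(tok)
            if is_last:
                groups.append(join_tokens(current))
        else:
            # Different entity: flush the previous group, start a new one.
            groups.append(join_tokens(current))
            current = [tok]
    # Unconditional flush after the loop. This is what re-adds a group
    # that the loop already flushed on its last iteration.
    if current:
        groups.append(join_tokens(current))
    return groups


def group_entities_fixed(tokens):
    """One possible fix: never flush the open group inside the loop on the
    last iteration; flush it exactly once after the loop."""
    groups, current = [], []
    for tok in tokens:
        if current:
            prev = current[-1]
            adjacent = (tok["entity"] == prev["entity"]
                        and tok["index"] == prev["index"] + 1)
            if not adjacent:
                groups.append(join_tokens(current))
                current = []
        current.append(tok)
    if current:
        groups.append(join_tokens(current))
    return groups


# Made-up token lists standing in for the three example sentences.
DALLAS_ONLY = [{"word": "Dallas", "entity": "I-LOC", "index": 3}]
TWO_SINGLE = [
    {"word": "HuggingFace", "entity": "I-ORG", "index": 1},
    {"word": "Dallas", "entity": "I-LOC", "index": 6},
]
MULTI_LAST = [
    {"word": "HuggingFace", "entity": "I-ORG", "index": 1},
    {"word": "New", "entity": "I-LOC", "index": 5},
    {"word": "York", "entity": "I-LOC", "index": 6},
    {"word": "City", "entity": "I-LOC", "index": 7},
]

if __name__ == "__main__":
    for name, toks in [("single entity", DALLAS_ONLY),
                       ("two single-token entities", TWO_SINGLE),
                       ("multi-token last entity", MULTI_LAST)]:
        print(name)
        print("  buggy:", [g["word"] for g in group_entities_buggy(toks)])
        print("  fixed:", [g["word"] for g in group_entities_fixed(toks)])
```

In this sketch the buggy version duplicates the final group in the first and third cases and not in the second, which matches the pattern described above. Whether the real code path behaves identically would need to be checked against the actual source.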
Environment info
transformers version: 3.0.2