Duplicate grouped entities when using 'ner' pipeline #5609

Closed
2 of 4 tasks
JamesDeAntonis opened this issue Jul 8, 2020 · 6 comments
JamesDeAntonis (Contributor) commented Jul 8, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): 'ner' pipeline

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Have transformers 3.0.2 installed
  2. Run the code below:
from transformers import pipeline
nlp = pipeline('ner', grouped_entities=True)
nlp('Welcome to New York')

Expected behavior

We should receive [{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}], but instead the output duplicates 'New York': [{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}, {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}].
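Until a fix lands, one possible stopgap is to filter exact consecutive duplicates out of the pipeline's output. This is only a sketch: dedupe_groups is my own helper name, not part of transformers.

```python
def dedupe_groups(groups):
    """Drop entity groups that exactly repeat the immediately preceding group."""
    deduped = []
    for group in groups:
        if deduped and group == deduped[-1]:
            continue  # skip the duplicated trailing group described above
        deduped.append(group)
    return deduped

# Output shaped like the buggy result above:
buggy = [
    {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'},
    {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'},
]
print(dedupe_groups(buggy))  # only one 'New York' group remains
```

Note this would also collapse a genuinely repeated adjacent entity with an identical score, so it is a workaround, not a fix.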

Suspected Cause of the Issue

After reading the 3.0.2 source, I noticed that lines 1047-1049 were added. I think this was done to fix a prior issue where the last named entity in the sequence was occasionally omitted when grouped_entities=True. Long story short, I think this snippet is a patch that only shifted the problem from an occasional omission of the last named entity to an occasional duplicate.

The for-loop that precedes this snippet is inconsistent, in that sometimes the last named entity is already added inside the loop (e.g. when the if clause on line 1025 (first iteration) or line 1032 is entered on the last iteration). In that case, the new code at line 1047 appends a duplicate entry. Conversely, when the else clause on line 1041 is entered on the last iteration, the loop does not add the final named entity, and the new snippet correctly adds it.

In short, I believe there is a duplicate if (i) there is only one recognized named entity, or (ii) the tokenizer split the last named entity into multiple tokens. Otherwise, there is no duplicate.

nlp('Welcome to Dallas') -> duplicate 'Dallas', because 'Dallas' is the only named entity
nlp('HuggingFace is not located in Dallas') -> no duplicate, because there are multiple entities and the final one, 'Dallas', is not tokenized into multiple tokens
nlp('HuggingFace is located in New York City') -> duplicate 'New York City', because the final named entity is tokenized into multiple tokens
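To make the failure mode concrete, here is a stripped-down sketch of the loop shape described above (illustrative code and names, not the actual pipeline source). Two of the branches flush the in-progress group themselves on the last iteration, the else branch does not, and the post-loop append runs unconditionally, which reproduces exactly the two duplicate cases listed:

```python
def group_buggy(entities):
    """Toy grouping: consecutive equal labels form one group.
    Mirrors the loop shape reported for 3.0.2, not the real code."""
    groups, current = [], []
    last = len(entities) - 1
    for idx, ent in enumerate(entities):
        if not current or ent == current[-1]:
            current.append(ent)
            if idx == last:                 # this branch flushes the final group itself...
                groups.append(list(current))
        else:
            groups.append(list(current))    # ...but this branch does not flush on the last iteration
            current = [ent]
    if current:
        groups.append(list(current))        # the 3.0.2 patch: always append -> sometimes a duplicate
    return groups

print(group_buggy(['LOC', 'LOC']))         # single multi-token entity -> duplicated group
print(group_buggy(['ORG', 'LOC']))         # final entity is a single token -> no duplicate
print(group_buggy(['ORG', 'LOC', 'LOC']))  # final entity spans multiple tokens -> duplicated group
```

Removing the unconditional post-loop append, or the in-loop last-iteration flushes, but not keeping both, would resolve the duplication in this toy version.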

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-5.3.0-1031-azure-x86_64-with-glibc2.10
  • Python version: 3.8.1
  • PyTorch version (GPU?): 1.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no
julien-c (Member) commented Jul 8, 2020

Can you check whether this still occurs after recently merged #4987?

JamesDeAntonis (Contributor, author) commented Jul 8, 2020

Thanks for the response.

Is there a special repo I have to pull from, or can I just update transformers? Assuming the latter, I re-ran pip install --upgrade transformers, and the bug persists.

julien-c (Member) commented Jul 8, 2020

No, you would have to install from source, as explained in the README.

JamesDeAntonis (Contributor, author) commented:
Just cloned the repo (as directed in the README) and the issue is resolved! Any estimate of when the next release will be out?

cceyda (Contributor) commented Jul 22, 2020

I was still having problems similar to issues #5077, #4816, and #5377.

After some debugging, these are the likely causes of, and fixes for, the wrong groupings:

Looking for feedback from maintainers on my [WIP] PR #5970

  • [Bug Fix] Add an option ignore_subwords to ignore subsequent ##wordpieces in predictions, because some models are trained on only the first token of each word and not on the subsequent wordpieces (the BERT NER default). So it makes sense to do the same thing at inference time.

    • The simplest fix is to just group the subwords with the first wordpiece.
      • [TODO] How to handle ignored scores? Just set them to 0 and calculate a zero-invariant mean?
      • [TODO] Handle a wordpiece prefix other than ##? Possible approaches: get it from the tokenizer (but currently most tokenizers don't have a wordpiece_prefix property), or add an _is_subword(token) helper.
  • [Bug Fix] Shouldn't group entities that both carry a 'B' tag, even if they are the same type:

    • (B-type1 B-type1) != (B-type1 I-type1)
  • [Feature add] Added an option to skip_special_tokens, because it is harder to remove them after grouping.

  • [Additional Changes] Remove the B/I prefix on returned grouped_entities.

  • [Feature Request/TODO] Return indexes?

  • [Bug TODO] Can't use a fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string').

stale bot commented Sep 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 20, 2020
stale bot closed this as completed on Sep 29, 2020