Duplicate grouped entities when using 'ner' pipeline #5609

Closed
2 of 4 tasks
JamesDeAntonis opened this issue Jul 8, 2020 · 6 comments
JamesDeAntonis (Contributor) commented Jul 8, 2020

🐛 Bug

Information

Model I am using (Bert, XLNet ...): 'ner' pipeline

Language I am using the model on (English, Chinese ...): English

The problem arises when using:

  • the official example scripts: (give details below)
  • my own modified scripts: (give details below)

The task I am working on is:

  • an official GLUE/SQuAD task: (give the name)
  • my own task or dataset: (give details below)

To reproduce

Steps to reproduce the behavior:

  1. Have transformers 3.0.2 installed
  2. Run the code below:
from transformers import pipeline
nlp = pipeline('ner', grouped_entities=True)
nlp('Welcome to New York')

Expected behavior

We should receive [{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}], but instead the output duplicates 'New York': [{'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}, {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'}].
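Until a fix lands, one possible stopgap is to filter exact consecutive duplicates out of the pipeline's output. This is only a sketch: dedupe_groups is my own helper name, not part of transformers.

```python
def dedupe_groups(groups):
    """Drop entity groups that exactly repeat the immediately preceding group."""
    deduped = []
    for group in groups:
        if deduped and group == deduped[-1]:
            continue  # skip the duplicated trailing group described above
        deduped.append(group)
    return deduped

# Output shaped like the buggy result above:
buggy = [
    {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'},
    {'entity_group': 'I-LOC', 'score': 0.9984402656555176, 'word': 'New York'},
]
print(dedupe_groups(buggy))  # only one 'New York' group remains
```

Note this would also collapse a genuinely repeated adjacent entity with an identical score, so it is a workaround, not a fix.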

Suspected Cause of the Issue

After reading the 3.0.2 source, I noticed that lines 1047-1049 were added. I think this was done to fix a prior issue where the last named entity in the sequence was occasionally omitted when grouped_entities=True. Long story short, I think this snippet is a patch that only shifted the problem from an occasional omission of the last named entity to an occasional duplicate.

The for-loop that precedes this snippet is inconsistent, in that sometimes the last named entity is already added inside the loop (e.g. when the if clause on line 1025 (first iteration) or line 1032 is entered on the last iteration). In that case, the new code at line 1047 appends a duplicate entry. Conversely, when the else clause on line 1041 is entered on the last iteration, the loop does not add the final named entity, and the new snippet correctly adds it.

In short, I believe there is a duplicate if (i) there is only one recognized named entity, or (ii) the tokenizer split the last named entity into multiple tokens. Otherwise, there is no duplicate.

nlp('Welcome to Dallas') -> duplicate 'Dallas', because 'Dallas' is the only named entity
nlp('HuggingFace is not located in Dallas') -> no duplicate, because there are multiple entities and the final one, 'Dallas', is not tokenized into multiple tokens
nlp('HuggingFace is located in New York City') -> duplicate 'New York City', because the final named entity is tokenized into multiple tokens
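To make the failure mode concrete, here is a stripped-down sketch of the loop shape described above (illustrative code and names, not the actual pipeline source). Two of the branches flush the in-progress group themselves on the last iteration, the else branch does not, and the post-loop append runs unconditionally, which reproduces exactly the two duplicate cases listed:

```python
def group_buggy(entities):
    """Toy grouping: consecutive equal labels form one group.
    Mirrors the loop shape reported for 3.0.2, not the real code."""
    groups, current = [], []
    last = len(entities) - 1
    for idx, ent in enumerate(entities):
        if not current or ent == current[-1]:
            current.append(ent)
            if idx == last:                 # this branch flushes the final group itself...
                groups.append(list(current))
        else:
            groups.append(list(current))    # ...but this branch does not flush on the last iteration
            current = [ent]
    if current:
        groups.append(list(current))        # the 3.0.2 patch: always append -> sometimes a duplicate
    return groups

print(group_buggy(['LOC', 'LOC']))         # single multi-token entity -> duplicated group
print(group_buggy(['ORG', 'LOC']))         # final entity is a single token -> no duplicate
print(group_buggy(['ORG', 'LOC', 'LOC']))  # final entity spans multiple tokens -> duplicated group
```

Removing the unconditional post-loop append, or the in-loop last-iteration flushes, but not keeping both, would resolve the duplication in this toy version.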

Environment info

  • transformers version: 3.0.2
  • Platform: Linux-5.3.0-1031-azure-x86_64-with-glibc2.10
  • Python version: 3.8.1
  • PyTorch version (GPU?): 1.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no
julien-c (Member) commented Jul 8, 2020

Can you check whether this still occurs after recently merged #4987?

JamesDeAntonis (Contributor, author) commented Jul 8, 2020

Thanks for the response.

Is there a special repo I have to pull from, or can I just update transformers? Assuming the latter, I re-ran pip install --upgrade transformers, and the bug persists.

julien-c (Member) commented Jul 8, 2020

No, you would have to install from source, as explained in the README.

JamesDeAntonis (Contributor, author) commented:
Just cloned the repo (as directed in the README) and the issue is resolved! Any estimate of when the next release will be out?

cceyda (Contributor) commented Jul 22, 2020

I was still having problems similar to issues #5077, #4816, and #5377.

After some debugging, these are the likely causes of, and fixes for, the wrong groupings:

Looking for feedback from maintainers on my [WIP] PR #5970

  • [Bug Fix] Add an option ignore_subwords to ignore subsequent ##wordpieces in predictions, because some models are trained on only the first token of each word and not on the subsequent wordpieces (the BERT NER default). So it makes sense to do the same thing at inference time.

    • The simplest fix is to just group the subwords with the first wordpiece.
      • [TODO] How to handle ignored scores? Just set them to 0 and calculate a zero-invariant mean?
      • [TODO] Handle a wordpiece prefix other than ##? Possible approaches: get it from the tokenizer (but currently most tokenizers don't have a wordpiece_prefix property), or add an _is_subword(token) helper.
  • [Bug Fix] Shouldn't group entities that both carry a 'B' tag, even if they are the same type:

    • (B-type1 B-type1) != (B-type1 I-type1)
  • [Feature add] Added an option to skip_special_tokens, because it is harder to remove them after grouping.

  • [Additional Changes] Remove the B/I prefix on returned grouped_entities.

  • [Feature Request/TODO] Return indexes?

  • [Bug TODO] Can't use a fast tokenizer with grouped_entities ('BertTokenizerFast' object has no attribute 'convert_tokens_to_string').

stale bot commented Sep 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label on Sep 20, 2020
stale bot closed this as completed on Sep 29, 2020