There should be indexes in the output of the grouped entity NER pipeline
The standard NER pipeline from transformers outputs entities that contain the word, score, entity type, and index. The following snippet demonstrates the normal behavior of the NER pipeline with the default grouped_entities=False option.
from transformers import pipeline

nlp_without_grouping = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City."
print(nlp_without_grouping(sequence))
[
{'word': 'Hu', 'score': 0.9992662668228149, 'entity': 'I-ORG', 'index': 1},
{'word': '##gging', 'score': 0.9808881878852844, 'entity': 'I-ORG', 'index': 2},
{'word': 'Face', 'score': 0.9953625202178955, 'entity': 'I-ORG', 'index': 3},
{'word': 'Inc', 'score': 0.9993382096290588, 'entity': 'I-ORG', 'index': 4},
{'word': 'New', 'score': 0.9990268349647522, 'entity': 'I-LOC', 'index': 11},
{'word': 'York', 'score': 0.9988483190536499, 'entity': 'I-LOC', 'index': 12},
{'word': 'City', 'score': 0.9991773366928101, 'entity': 'I-LOC', 'index': 13}
]
However, the NER pipeline with grouped_entities=True outputs only word, score, and entity type. Here's the code snippet and output. There's also the problem of 'New York City' being duplicated, but I will address that in a new issue.
from transformers import pipeline

nlp_with_grouping = pipeline("ner", grouped_entities=True)
sequence = "Hugging Face Inc. is a company based in New York City."
print(nlp_with_grouping(sequence))
[
{'entity_group': 'I-ORG', 'score': 0.9937137961387634, 'word': 'Hugging Face Inc'},
{'entity_group': 'I-LOC', 'score': 0.9990174969037374, 'word': 'New York City'},
{'entity_group': 'I-LOC', 'score': 0.9990174969037374, 'word': 'New York City'}
]
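I believe that the grouped entities returned should also include the indexes of the tokens that make up each entity. For the example above, the 'Hugging Face Inc' group would carry 'indexes': [1, 2, 3, 4] and the 'New York City' group would carry 'indexes': [11, 12, 13].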
Any application that needs to locate grouped named entities in the input requires some sort of index. The standard NER pipeline provides one, and the grouped entity NER pipeline should as well.
In my case, I am trying to append the type to the text right after the named entity ("Apple" would become "Apple <I-ORG>") so I need to be able to locate the named entity within my phrase.
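As a rough sketch of this use case, the snippet below appends the entity type after the last token of each group, using the proposed `indexes` field. It assumes the WordPiece tokens of the sentence are available as a list with `[CLS]` at position 0, which matches the index values shown earlier. The helpers `join_wordpieces` and `tag_entities` are hypothetical names for illustration and are not part of transformers.

```python
from typing import Dict, List


def join_wordpieces(tokens: List[str]) -> str:
    """Naively join WordPiece tokens; '##' marks a continuation of the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        elif text:
            text += " " + token
        else:
            text = token
    return text


def tag_entities(tokens: List[str], groups: List[Dict]) -> str:
    """Insert '<TYPE>' right after the last token of every entity group."""
    # Map the last token index of each group to its entity type.
    tag_after = {max(group["indexes"]): group["entity_group"] for group in groups}
    out = []
    for position, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # skip special tokens but keep their positions in the count
        out.append(token)
        if position in tag_after:
            out.append(f"<{tag_after[position]}>")
    return join_wordpieces(out)


# Assumed tokenization of the example sentence ([CLS] is index 0).
tokens = ["[CLS]", "Hu", "##gging", "Face", "Inc", ".", "is", "a", "company",
          "based", "in", "New", "York", "City", ".", "[SEP]"]
groups = [
    {"entity_group": "I-ORG", "word": "Hugging Face Inc", "indexes": [1, 2, 3, 4]},
    {"entity_group": "I-LOC", "word": "New York City", "indexes": [11, 12, 13]},
]
tagged = tag_entities(tokens, groups)
print(tagged)
```

Because the join is naive, punctuation ends up separated by spaces; a real tokenizer's detokenization would handle that better.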
Your contribution
I have been able to fix this by adding two lines to the group_sub_entities function:
def group_sub_entities(self, entities: List[dict]) -> dict:
    """ Returns grouped sub entities """
    # Get the first entity in the entity group
    entity = entities[0]["entity"]
    scores = np.mean([entity["score"] for entity in entities])
    tokens = [entity["word"] for entity in entities]
    indexes = [entity["index"] for entity in entities]  # my added line

    entity_group = {
        "entity_group": entity,
        "score": np.mean(scores),
        "word": self.tokenizer.convert_tokens_to_string(tokens),
        "indexes": indexes,  # my added line
    }
    return entity_group
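To try the idea without loading a model, here is a standalone sketch of the same grouping step. It is not the pipeline's code: the tokenizer call is replaced by a hypothetical `join_wordpieces` helper, and the sub-entities are passed in already split into one list per group, using the values from the ungrouped output above.

```python
from statistics import mean
from typing import List


def join_wordpieces(tokens: List[str]) -> str:
    """Stand-in for the tokenizer's detokenization: '##' continues the previous token."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]
        elif text:
            text += " " + token
        else:
            text = token
    return text


def group_sub_entities(entities: List[dict]) -> dict:
    """Merge the sub-entities of one group, keeping their token indexes."""
    return {
        "entity_group": entities[0]["entity"],
        "score": mean(e["score"] for e in entities),
        "word": join_wordpieces([e["word"] for e in entities]),
        "indexes": [e["index"] for e in entities],  # the proposed extra field
    }


org = [
    {"word": "Hu", "score": 0.9992662668228149, "entity": "I-ORG", "index": 1},
    {"word": "##gging", "score": 0.9808881878852844, "entity": "I-ORG", "index": 2},
    {"word": "Face", "score": 0.9953625202178955, "entity": "I-ORG", "index": 3},
    {"word": "Inc", "score": 0.9993382096290588, "entity": "I-ORG", "index": 4},
]
loc = [
    {"word": "New", "score": 0.9990268349647522, "entity": "I-LOC", "index": 11},
    {"word": "York", "score": 0.9988483190536499, "entity": "I-LOC", "index": 12},
    {"word": "City", "score": 0.9991773366928101, "entity": "I-LOC", "index": 13},
]

grouped = [group_sub_entities(org), group_sub_entities(loc)]
for group in grouped:
    print(group)
```

Each printed group carries the same `entity_group`, `score`, and `word` fields as before, plus the `indexes` list of its tokens.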
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
The group_sub_entities function referenced above is defined in transformers/src/transformers/pipelines.py, line 1042 in 7fad617.