Add indexes to grouped entity NER pipeline #5676

Closed
prithvikannan opened this issue Jul 11, 2020 · 4 comments

prithvikannan commented Jul 11, 2020

🚀 Feature request

There should be indexes in the output of the grouped entity NER pipeline

The standard NER pipeline from transformers outputs entities that contain the word, score, entity type, and index. The following snippet demonstrates the normal behavior of the NER pipeline with the default grouped_entities=False option.

from transformers import pipeline
nlp_without_grouping = pipeline("ner")
sequence = "Hugging Face Inc. is a company based in New York City."
print(nlp_without_grouping(sequence))

[
    {'word': 'Hu', 'score': 0.9992662668228149, 'entity': 'I-ORG', 'index': 1},
    {'word': '##gging', 'score': 0.9808881878852844, 'entity': 'I-ORG', 'index': 2},
    {'word': 'Face', 'score': 0.9953625202178955, 'entity': 'I-ORG', 'index': 3},
    {'word': 'Inc', 'score': 0.9993382096290588, 'entity': 'I-ORG', 'index': 4},
    {'word': 'New', 'score': 0.9990268349647522, 'entity': 'I-LOC', 'index': 11},
    {'word': 'York', 'score': 0.9988483190536499, 'entity': 'I-LOC', 'index': 12},
    {'word': 'City', 'score': 0.9991773366928101, 'entity': 'I-LOC', 'index': 13}
]

However, the NER pipeline with grouped_entities=True outputs only word, score, and entity type. Here's the code snippet and output. There's also the problem of 'New York City' being duplicated, but I will address that in a new issue.

from transformers import pipeline
nlp_with_grouping = pipeline("ner", grouped_entities=True) 
sequence = "Hugging Face Inc. is a company based in New York City."
print(nlp_with_grouping(sequence))

[
    {'entity_group': 'I-ORG', 'score': 0.9937137961387634, 'word': 'Hugging Face Inc'},
    {'entity_group': 'I-LOC', 'score': 0.9990174969037374, 'word': 'New York City'},
    {'entity_group': 'I-LOC', 'score': 0.9990174969037374, 'word': 'New York City'}
]

I believe the grouped entities returned should also include the indexes of the tokens that make up each entity. Sample output would look like this:

[
    {'entity_group': 'I-ORG', 'score': 0.9930560886859894, 'word': 'Hugging Face Inc', 'indexes': [1, 2, 3, 4]},
    {'entity_group': 'I-LOC', 'score': 0.998809814453125, 'word': 'New York City', 'indexes': [11, 12, 13]},
    {'entity_group': 'I-LOC', 'score': 0.998809814453125, 'word': 'New York City', 'indexes': [11, 12, 13]}
]

Motivation

Any application that needs to locate grouped named entities in the original text requires some sort of index. The standard NER pipeline already provides one, and the grouped entity NER pipeline should too.

In my case, I am trying to append the type to the text right after the named entity ("Apple" would become "Apple <I-ORG>") so I need to be able to locate the named entity within my phrase.
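
To make the use case concrete, here is a minimal sketch of how the proposed `indexes` field could be used to append the entity type after each entity. It assumes BERT-style WordPiece tokens (with `[CLS]` at position 0, so that `Hu` is index 1 as in the output above) and a deliberately naive detokenizer; `annotate_entities` is a hypothetical helper, not part of transformers.

```python
from typing import Dict, List


def annotate_entities(tokens: List[str], grouped: List[dict]) -> str:
    """Append "<LABEL>" after the last token of every grouped entity.

    Hypothetical helper: `grouped` is assumed to carry the proposed
    "indexes" field, which the pipeline does not return at the time
    of this issue.
    """
    # Map the last token index of each group to its label. Duplicate
    # groups (like the repeated 'New York City') collapse to one key.
    label_after: Dict[int, str] = {}
    for group in grouped:
        label_after[max(group["indexes"])] = group["entity_group"]

    pieces: List[str] = []
    for i, token in enumerate(tokens):
        if token in ("[CLS]", "[SEP]"):
            continue  # special tokens still count toward the index
        pieces.append(token)
        if i in label_after:
            pieces.append(f"<{label_after[i]}>")

    # Naive WordPiece detokenization: glue "##" pieces to the previous token.
    return " ".join(pieces).replace(" ##", "")


tokens = ["[CLS]", "Hu", "##gging", "Face", "Inc", ".", "is", "a", "company",
          "based", "in", "New", "York", "City", ".", "[SEP]"]
grouped = [
    {"entity_group": "I-ORG", "word": "Hugging Face Inc", "indexes": [1, 2, 3, 4]},
    {"entity_group": "I-LOC", "word": "New York City", "indexes": [11, 12, 13]},
    {"entity_group": "I-LOC", "word": "New York City", "indexes": [11, 12, 13]},
]

annotated = annotate_entities(tokens, grouped)
print(annotated)
# Hugging Face Inc <I-ORG> . is a company based in New York City <I-LOC> .
```

The spaces before punctuation come from the naive join; a real application would map token indexes back to character offsets instead.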

Your contribution

I was able to fix this by adding two lines to the group_sub_entities function:

    # Assumes `import numpy as np` and `from typing import List` at module level
    def group_sub_entities(self, entities: List[dict]) -> dict:
        """
        Returns grouped sub entities
        """
        # Use the label of the first entity as the label of the whole group
        entity = entities[0]["entity"]
        # Average the per-token scores
        score = np.mean([e["score"] for e in entities])
        tokens = [e["word"] for e in entities]
        indexes = [e["index"] for e in entities]    # my added line

        entity_group = {
            "entity_group": entity,
            "score": score,
            "word": self.tokenizer.convert_tokens_to_string(tokens),
            "indexes": indexes,    # my added line
        }
        return entity_group
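
For anyone who wants to try the idea without patching the library, the following is a standalone sketch of the same logic applied to the ungrouped output shown earlier. It is not the pipeline's actual code: `convert_wordpieces_to_string` is a hypothetical stand-in for `self.tokenizer.convert_tokens_to_string` that assumes BERT-style `##` continuation pieces, and the mean is computed in plain Python rather than with NumPy.

```python
from typing import List


def convert_wordpieces_to_string(tokens: List[str]) -> str:
    # Hypothetical stand-in for the tokenizer method; assumes "##" pieces.
    return " ".join(tokens).replace(" ##", "").strip()


def group_sub_entities(entities: List[dict]) -> dict:
    """Merge the tokens of one entity into a single grouped entry."""
    return {
        # The first token's label is used for the whole group.
        "entity_group": entities[0]["entity"],
        # Plain-Python mean of the per-token scores.
        "score": sum(e["score"] for e in entities) / len(entities),
        "word": convert_wordpieces_to_string([e["word"] for e in entities]),
        # The proposed addition: keep the original token indexes.
        "indexes": [e["index"] for e in entities],
    }


# The four I-ORG tokens from the ungrouped pipeline output above.
org_tokens = [
    {"word": "Hu", "score": 0.9992662668228149, "entity": "I-ORG", "index": 1},
    {"word": "##gging", "score": 0.9808881878852844, "entity": "I-ORG", "index": 2},
    {"word": "Face", "score": 0.9953625202178955, "entity": "I-ORG", "index": 3},
    {"word": "Inc", "score": 0.9993382096290588, "entity": "I-ORG", "index": 4},
]

org_group = group_sub_entities(org_tokens)
print(org_group)
```

On this input the group comes out as 'Hugging Face Inc' with indexes [1, 2, 3, 4], and the averaged score agrees with the grouped pipeline's I-ORG score up to floating-point rounding.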

stale bot commented Sep 11, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Sep 11, 2020

sasi143 commented Sep 15, 2020

I am facing the same issue. Has this been fixed?

stale bot removed the wontfix label Sep 15, 2020

stale bot commented Nov 14, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Nov 14, 2020
stale bot closed this as completed Nov 22, 2020

Narsil (Contributor) commented Dec 8, 2020

Fixed by #8781
