Counting multiple entity mentions #22

Open
acxcv opened this issue Feb 1, 2023 · 1 comment
Labels
enhancement New feature or request

Comments


acxcv commented Feb 1, 2023

I edited this post. I had been confused about how entity counts are handled in processed docs.

Example

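>>> from collections import Counter
>>> import spacy
>>> # Assumed setup: a blank English pipeline with the dbpedia_spotlight component
>>> nlp = spacy.blank('en')
>>> nlp.add_pipe('dbpedia_spotlight')
>>> text = "Barcelona is a city on the coast of northeastern Spain. It is the capital and largest city of the autonomous community of Catalonia, as well as the second most populous municipality of Spain."
>>> doc = nlp(text)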

Expected behavior

>>> Counter(doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1}) 

Actual behavior

>>> Counter(doc.ents)
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1})

This makes sense because entities may share the same surface form but point to different identifiers in the KG. However, it did cause some confusion for me.

This seems to be equivalent to

>>> Counter(doc.spans['dbpedia_spotlight'])
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1}) 

Solution
In order to achieve what I wanted, I needed to call

>>> Counter(ent.kb_id_ for ent in doc.ents)
Counter({'http://dbpedia.org/resource/City': 2, 'http://dbpedia.org/resource/Spain': 2, 'http://dbpedia.org/resource/Province_of_Barcelona': 1, 'http://dbpedia.org/resource/Catalan_language': 1, 'http://dbpedia.org/resource/Spanish_language': 1, 'http://dbpedia.org/resource/Autonomous_communities_of_Spain': 1, 'http://dbpedia.org/resource/Catalonia': 1, 'http://dbpedia.org/resource/Populous_(company)': 1, 'http://dbpedia.org/resource/Municipalities_of_Spain': 1})                                         

In this example, counting kb_id_ values gave the same result as counting the string form of each entity (its __str__ output). However, I would discourage relying on __str__ unless you are interested in surface forms only.

>>> Counter(str(ent) for ent in doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1})

If this becomes relevant to other users, I suggest implementing something like doc.unique_ents for the set of entities in the document and doc.unique_ent_counts for the Counter dict.

@acxcv acxcv changed the title Entities occuring multiple times in a text How to handle entities occurring multiple times in a text? Feb 2, 2023
@acxcv acxcv changed the title How to handle entities occurring multiple times in a text? Counting multiple entity mentions Feb 2, 2023
@MartinoMensio
Owner

Hi @acxcv ,
Thank you for clarifying your issue. I agree that the current way of doing it, Counter(ent.kb_id_ for ent in doc.ents), is not very intuitive.

What gives me pause at the moment is that we have two different definitions of an entity:

  • the first one, which is the most natural to think about, is the entity itself. "Spain" appears twice in your text, and therefore it should be counted as one single entity with 2 appearances.
  • the second one, which should probably be called "entity mention" or "entity appearance" (in the spaCy world these are Span objects), also contains the information about where in the text it appears (the 1st "Spain" has ent.start=9 and ent.start_char=49, while the 2nd "Spain" has ent.start=34 and ent.start_char=185).
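To make the two definitions concrete, here is a minimal sketch in plain Python. The Mention class and group_mentions helper are hypothetical stand-ins (not part of spaCy or this library); the token and character offsets are the ones for the example text above.

```python
from collections import defaultdict
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical stand-in for a spaCy Span: one appearance of an entity."""
    text: str
    kb_id: str
    start: int       # token offset
    start_char: int  # character offset


def group_mentions(mentions):
    """Map each entity ID to the list of mentions that refer to it."""
    groups = defaultdict(list)
    for mention in mentions:
        groups[mention.kb_id].append(mention)
    return dict(groups)


SPAIN = "http://dbpedia.org/resource/Spain"
CATALONIA = "http://dbpedia.org/resource/Catalonia"

mentions = [
    Mention("Spain", SPAIN, start=9, start_char=49),
    Mention("Catalonia", CATALONIA, start=23, start_char=122),
    Mention("Spain", SPAIN, start=34, start_char=185),
]

entities = group_mentions(mentions)
# Two entities (first definition), three mentions (second definition).
print(len(entities), len(mentions))
print([m.start for m in entities[SPAIN]])
```

Under the first definition the document has two entities here; under the second it has three mentions, two of which point to the same entity.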

The behaviour of the Counter depends on how __hash__() and __eq__() are defined, since these determine the identity and equality of objects. In spaCy, entities are instances of the Span class, whose hash and comparison also take the position (start and end) into account: https://github.com/explosion/spaCy/blob/master/spacy/tokens/span.pyx#L156

The desired behaviour would be:
doc.ents[1] == doc.ents[6] because they have the same ID (http://dbpedia.org/resource/Spain). I see only advantages in this result.

But, due to the definition of a Span (the class that holds entities), the result is:
doc.ents[1] != doc.ents[6] because they have different start and end positions
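The effect can be reproduced without spaCy. The sketch below uses a hypothetical ToySpan dataclass (an assumption for illustration, not spaCy's actual Span) whose equality and hash include the position, which is enough to make Counter treat two mentions of the same entity as different keys.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen + eq gives __eq__ and __hash__ over all fields
class ToySpan:
    """Hypothetical span whose identity includes its position in the text."""
    text: str
    kb_id: str
    start: int
    end: int


SPAIN = "http://dbpedia.org/resource/Spain"
first = ToySpan("Spain", SPAIN, start=9, end=10)
second = ToySpan("Spain", SPAIN, start=34, end=35)

# Same surface form and same ID, but different positions: not equal.
print(first == second)

# Counting the objects keeps the two mentions apart ...
by_object = Counter([first, second])
# ... while counting an explicit key merges them.
by_id = Counter(span.kb_id for span in [first, second])
print(len(by_object), by_id[SPAIN])
```

Counting by an explicit key such as kb_id leaves the span's own identity untouched, which is why it is the less invasive option.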

I think one reason for this is that the standard built-in spaCy models (e.g. en_core_web_md and en_core_web_lg) have no concept of an entity ID (they do NER only, not NEL). Maybe this will change in the future, as I see they are developing a lot of exciting things on the EntityLinker side.

Unfortunately, I think it is better to keep this behaviour for the default operations on entities.
As you suggested, adding new properties to the doc would be the better option.
The new properties would go under the extension object, for example doc._.unique_ents, as putting them directly under doc would be more difficult.
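A rough sketch of how such extension properties could look, not an actual implementation from this library. The names unique_ents and unique_ent_counts follow the suggestion in this thread; returning ID strings in order of first appearance is an assumption made here. The registration uses spaCy's Doc.set_extension with a getter and is skipped if spaCy is not installed; demo_doc is a stand-in object, not a real Doc.

```python
from collections import Counter
from types import SimpleNamespace


def unique_ent_counts(doc):
    """Count mentions per entity ID instead of per span."""
    return Counter(ent.kb_id_ for ent in doc.ents)


def unique_ents(doc):
    """Entity IDs without duplicates, in order of first appearance."""
    return list(dict.fromkeys(ent.kb_id_ for ent in doc.ents))


try:
    from spacy.tokens import Doc
except ImportError:  # spaCy not available: the helpers above still work
    Doc = None

if Doc is not None:
    # Exposes the helpers as doc._.unique_ent_counts and doc._.unique_ents.
    Doc.set_extension("unique_ent_counts", getter=unique_ent_counts, force=True)
    Doc.set_extension("unique_ents", getter=unique_ents, force=True)

# Stand-in for a processed doc (assumption: only .ents and .kb_id_ are needed).
SPAIN = "http://dbpedia.org/resource/Spain"
CITY = "http://dbpedia.org/resource/City"
demo_doc = SimpleNamespace(
    ents=[SimpleNamespace(kb_id_=k) for k in (CITY, SPAIN, CITY, SPAIN)]
)
print(unique_ent_counts(demo_doc))
print(unique_ents(demo_doc))
```

Entities without a linked ID have an empty kb_id_ in spaCy, so a real implementation would probably need to decide whether to skip them or count them by surface form.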

I can provide an implementation, but not very soon because I am quite busy writing my PhD thesis at the moment.

Martino

@MartinoMensio MartinoMensio added the enhancement New feature or request label Feb 7, 2023