Counting multiple entity mentions #22

acxcv · 2023-02-01T18:38:24Z

I edited this post. I had been confused about how entity counts are handled in processed docs.

Example

>>> from collections import Counter
>>> # nlp is assumed to be a spaCy pipeline that already includes the dbpedia_spotlight component
>>> text = "Barcelona is a city on the coast of northeastern Spain. It is the capital and largest city of the autonomous community of Catalonia, as well as the second most populous municipality of Spain."
>>> doc = nlp(text)

Expected behavior

>>> Counter(doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1})

Actual behavior

>>> Counter(doc.ents)
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1})

This makes sense because there may be entities that share the same surface form but point to different identifiers in the KG. However, it did cause some confusion for me.

This seems to be equivalent to

>>> Counter(doc.spans['dbpedia_spotlight'])
Counter({Barcelona: 1, Catalan: 1, Spanish: 1, city: 1, Spain: 1, city: 1, autonomous community: 1, Catalonia: 1, populous: 1, municipality: 1, Spain: 1})

Solution
In order to achieve what I wanted, I needed to call

>>> Counter(ent.kb_id_ for ent in doc.ents)
Counter({'http://dbpedia.org/resource/City': 2, 'http://dbpedia.org/resource/Spain': 2, 'http://dbpedia.org/resource/Province_of_Barcelona': 1, 'http://dbpedia.org/resource/Catalan_language': 1, 'http://dbpedia.org/resource/Spanish_language': 1, 'http://dbpedia.org/resource/Autonomous_communities_of_Spain': 1, 'http://dbpedia.org/resource/Catalonia': 1, 'http://dbpedia.org/resource/Populous_(company)': 1, 'http://dbpedia.org/resource/Municipalities_of_Spain': 1})

In this example, counting kb_id_ values led to the same result as counting the __str__ output. However, I would discourage using __str__ unless you're interested in surface forms only.

>>> Counter(ent.__str__() for ent in doc.ents)
Counter({'city': 2, 'Spain': 2, 'Barcelona': 1, 'Catalan': 1, 'Spanish': 1, 'autonomous community': 1, 'Catalonia': 1, 'populous': 1, 'municipality': 1})
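To illustrate why the two counts can diverge, here is a minimal sketch that does not use spaCy at all. The Mention class and the mentions in it are made-up stand-ins (not output from the pipeline above); they only mimic the text and kb_id_ attributes of an entity span.

```python
from collections import Counter
from typing import NamedTuple


class Mention(NamedTuple):
    """Hypothetical stand-in for an entity span: surface text plus linked ID."""
    text: str
    kb_id_: str


# Made-up mentions: the two "Barcelona" mentions share a surface form
# but are linked to different resources; the two "Spain" mentions are
# linked to the same resource.
mentions = [
    Mention("Barcelona", "http://dbpedia.org/resource/Barcelona"),
    Mention("Barcelona", "http://dbpedia.org/resource/FC_Barcelona"),
    Mention("Spain", "http://dbpedia.org/resource/Spain"),
    Mention("Spain", "http://dbpedia.org/resource/Spain"),
]

# Counting surface forms merges the two different "Barcelona" entities.
surface_counts = Counter(m.text for m in mentions)

# Counting identifiers keeps them apart and still merges the "Spain" mentions.
id_counts = Counter(m.kb_id_ for m in mentions)

print(surface_counts)
print(id_counts)
```

Here the surface count reports "Barcelona" twice, while the ID count reports each Barcelona resource once, which is the case where counting surface forms would be misleading.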

If this becomes relevant to other users, I suggest implementing something like doc.unique_ents for the set of entities in the document and doc.unique_ent_counts for the Counter dict.


MartinoMensio · 2023-02-03T10:13:05Z

Hi @acxcv,
Thank you for clarifying your issue. I agree that the current way of doing it, Counter(ent.kb_id_ for ent in doc.ents), is not very intuitive.

What gives me pause at the moment is that we have two different definitions of an entity:

The first one, which is the most natural to think about, is the entity itself. "Spain" appears twice in your text, and therefore it should be counted as one single entity with 2 appearances.
The second one, which should probably be called "entity mention" or "entity appearance" (in spaCy these are Span objects), also contains the information about where in the text it appears (the 1st "Spain" has ent.start=9 and ent.start_char=49, while the 2nd "Spain" has ent.start=34 and ent.start_char=185).

The behaviour of the Counter depends on the definitions of __hash__() and __eq__(), which determine identity and equality between objects. In spaCy, entities are instances of the Span class, whose hash and comparison also take the position (start and end) into account: https://github.com/explosion/spaCy/blob/master/spacy/tokens/span.pyx#L156

The desired behaviour would be:
doc.ents[1] == doc.ents[6] because they have the same ID (http://dbpedia.org/resource/Spain). I only see positive points in this result.

But, due to the definition of a Span (the class that holds entities), the result is:
doc.ents[1] != doc.ents[6] because they have different start and end positions.
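The effect of position-based equality on Counter can be reproduced without spaCy. The sketch below uses a hypothetical PositionedMention class (not spaCy's Span) whose __eq__ and __hash__ include the token offsets. The start values 9 and 34 are taken from the comment above; the end values assume single-token mentions.

```python
from collections import Counter


class PositionedMention:
    """Hypothetical stand-in for a span whose identity includes its position."""

    def __init__(self, text, start, end, kb_id_):
        self.text = text
        self.start = start
        self.end = end
        self.kb_id_ = kb_id_

    def _key(self):
        # Position is part of the identity, so two mentions of the same
        # entity at different offsets are different objects for Counter.
        return (self.start, self.end, self.kb_id_)

    def __eq__(self, other):
        if not isinstance(other, PositionedMention):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

    def __repr__(self):
        return self.text


spain_id = "http://dbpedia.org/resource/Spain"
first_spain = PositionedMention("Spain", 9, 10, spain_id)    # end assumed
second_spain = PositionedMention("Spain", 34, 35, spain_id)  # end assumed

# Counting the objects themselves yields two separate keys, each with count 1.
by_object = Counter([first_spain, second_spain])

# Counting by identifier collapses both mentions into one key with count 2.
by_id = Counter(m.kb_id_ for m in (first_spain, second_spain))

print(by_object)
print(by_id)
```

Counter uses the hash and equality of its keys, so the only ways to merge the two mentions are to change those definitions or to count a derived key such as kb_id_.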

I think one reason for this is that the standard built-in spaCy models (e.g. en_core_web_md and en_core_web_lg) do not have the concept of an entity ID (they only do NER, not NEL). Maybe this will change in the future, as I see they are developing a lot of exciting things on the EntityLinker side.

Unfortunately, I think it is better to keep this behaviour for the default operations on entities.
As you suggested, adding new properties to the doc would be better.
The new properties would go under the extension object, for example doc._.unique_ents, since putting them directly on doc would be more difficult.
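A rough sketch of what such properties might compute is shown here. The function names unique_ents and unique_ent_counts follow the suggestion in this thread, but the behaviour (returning IDs rather than spans, skipping entities without an ID) is an assumption and not an existing feature of the library. The stand-in doc is built with SimpleNamespace so that the example runs without spaCy.

```python
from collections import Counter
from types import SimpleNamespace


def unique_ent_counts(doc):
    """Count mentions per knowledge-base ID, skipping entities without an ID."""
    return Counter(ent.kb_id_ for ent in doc.ents if ent.kb_id_)


def unique_ents(doc):
    """Return the distinct knowledge-base IDs in order of first appearance."""
    # Counter is a dict subclass, so its keys keep insertion order.
    return list(unique_ent_counts(doc))


# Hypothetical stand-in for a processed doc: only .ents and .kb_id_ are needed.
fake_doc = SimpleNamespace(
    ents=[
        SimpleNamespace(kb_id_="http://dbpedia.org/resource/City"),
        SimpleNamespace(kb_id_="http://dbpedia.org/resource/Spain"),
        SimpleNamespace(kb_id_="http://dbpedia.org/resource/City"),
        SimpleNamespace(kb_id_=""),  # an unlinked mention, ignored
        SimpleNamespace(kb_id_="http://dbpedia.org/resource/Spain"),
    ]
)

counts = unique_ent_counts(fake_doc)
ids = unique_ents(fake_doc)
print(counts)
print(ids)
```

With spaCy installed, functions like these could be registered as getters through Doc.set_extension("unique_ent_counts", getter=unique_ent_counts), after which doc._.unique_ent_counts would be available on every doc.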

I can provide an implementation, but not very soon because I am quite busy writing my PhD thesis at the moment.

Martino

acxcv changed the title ~~Entities occuring multiple times in a text~~ How to handle entities occurring multiple times in a text? Feb 2, 2023

acxcv changed the title ~~How to handle entities occurring multiple times in a text?~~ Counting multiple entity mentions Feb 2, 2023

MartinoMensio added the enhancement New feature or request label Feb 7, 2023
