Wrong ED result despite wildly different context and wiki embeddings #122

Open
abhinavkulkarni opened this issue Oct 31, 2022 · 0 comments

abhinavkulkarni commented Oct 31, 2022

Hi,

I get the wrong result for what looks like an obvious case. Below is a description of Erika B Hess, who is an artist (link). This entity is not in the wiki_2019 DB; however, the DB does contain another Erika Hess, the skier (link), and ED links the mention to her.

As the following code snippet shows, the context surrounding the first mention is very different from the Wikipedia description of the skier, yet ED links to the skier with high confidence.

import requests

API_URL = "http://0.0.0.0:5555"
text_doc = "Erika Hess is a painter, curator, and the host of the I Like Your Work Podcast. Her paintings and drawings center around gender, motherhood, and the environment.   In this episode, Erika and I talk about the emotions in her work, what she’s learned through her experience and why it’s so important to have a community behind you as an artist.    Also mentioned in today’s episode:     Widening the space as an artist 3:48 Erika’s background and her current work 6:41 How feelings and emotions are intertwined in Erika’s work 12:09 Being a mother and a painter 14:36 The importance of community in art, especially as a woman 20:50 How the I Like Your Work Podcast started and what it’s taught Erika 25:50 Erika’s membership and what encouraged her to start one 37:45 The importance of boundaries as an artist 44:53    If you enjoyed this episode, please rate, review and share it!    Connect with Erika:  Erika’s work: https://www.erikabhess.com/ Erika’s podcast: www.ilikeyourworkpodcast.com/home Erika’s membership: https://www.ilikeyourworkpodcast.com/yourlisteners"

ed_result = requests.post(API_URL, json={
    "text": text_doc,
    "spans": [(0, 10)]
}).json()

print(ed_result)

I get the following output:

[[0, 10, 'Erika Hess', 'Erika_Hess', 0.3872783780234141, 0.0, 'NULL']]

From my experimentation, a threshold of 0.35 seems appropriate for accepting an ED result.
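To make the threshold concrete, here is a minimal sketch of how it could be applied to the response. It assumes each row has the layout `[start, length, mention, entity, ED confidence, MD confidence, tag]`, which matches the output shown here but is an assumption about the API rather than something confirmed in this issue. The helper name `accepted_links` is hypothetical.

```python
# Assumption: index 4 of each result row holds the ED confidence score.
ED_CONF_INDEX = 4
THRESHOLD = 0.35  # acceptance threshold found by experimentation


def accepted_links(ed_result, threshold=THRESHOLD):
    """Return (mention, entity, score) for rows whose ED confidence clears the threshold."""
    return [
        (row[2], row[3], row[ED_CONF_INDEX])
        for row in ed_result
        if row[ED_CONF_INDEX] >= threshold
    ]


# The row returned for the Erika Hess example.
sample = [[0, 10, 'Erika Hess', 'Erika_Hess', 0.3872783780234141, 0.0, 'NULL']]
print(accepted_links(sample))
```

With the 0.35 threshold the skier link is accepted, which is exactly the false positive described in this issue.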

So I ran an experiment: I replaced every occurrence of Erika Hess with the name of another well-known woman (none of whom is a painter or artist) to see whether REL would still link to her, even though the mention context and the entity's Wikipedia embedding are completely mismatched.

import requests

API_URL = "http://0.0.0.0:5555"
text_doc = "Erika Hess is a painter, curator, and the host of the I Like Your Work Podcast. Her paintings and drawings center around gender, motherhood, and the environment.   In this episode, Erika and I talk about the emotions in her work, what she’s learned through her experience and why it’s so important to have a community behind you as an artist.    Also mentioned in today’s episode:     Widening the space as an artist 3:48 Erika’s background and her current work 6:41 How feelings and emotions are intertwined in Erika’s work 12:09 Being a mother and a painter 14:36 The importance of community in art, especially as a woman 20:50 How the I Like Your Work Podcast started and what it’s taught Erika 25:50 Erika’s membership and what encouraged her to start one 37:45 The importance of boundaries as an artist 44:53    If you enjoyed this episode, please rate, review and share it!    Connect with Erika:  Erika’s work: https://www.erikabhess.com/ Erika’s podcast: www.ilikeyourworkpodcast.com/home Erika’s membership: https://www.ilikeyourworkpodcast.com/yourlisteners"


replacements = [
    {'Erika': 'Kara', 'Hess': 'Swisher'},  # journalist
    {'Erika': 'Nancy', 'Hess': 'Pelosi'},  # politician
    {'Erika': 'Kamala', 'Hess': 'Harris'},  # politician
    {'Erika': 'Michelle', 'Hess': 'Obama'},  # politician
    {'Erika': 'Hillary', 'Hess': 'Clinton'},  # politician
    {'Erika': 'Elizabeth', 'Hess': 'Warren'},  # politician
    {'Erika': 'Alexandria', 'Hess': 'Ocasio-Cortez'},  # politician
    {'Erika': 'Kim', 'Hess': 'Kardashian'},  # celebrity
]

for replacement in replacements:
    text_doc_ = text_doc
    for k, v in replacement.items():
        text_doc_ = text_doc_.replace(k, v)

    # The first two words are the mention
    words = text_doc_.split()[:2]
    length = len(words[0]) + 1 + len(words[1])
    span = (0, length)

    ed_result = requests.post(API_URL, json={
        "text": text_doc_,
        "spans": [span]
    }).json()

    print(ed_result)

I get the following output:

[[0, 12, 'Kara Swisher', 'Kara_Swisher', 0.3872783780234141, 0.0, 'NULL']]
[[0, 12, 'Nancy Pelosi', 'Nancy_Pelosi', 0.9522967375717587, 0.0, 'NULL']]
[[0, 13, 'Kamala Harris', 'Kamala_Harris', 0.45961057079516854, 0.0, 'NULL']]
[[0, 14, 'Michelle Obama', 'Michelle_Obama', 0.40272704217312827, 0.0, 'NULL']]
[[0, 15, 'Hillary Clinton', 'Hillary_Clinton', 0.45967763466114964, 0.0, 'NULL']]
[[0, 16, 'Elizabeth Warren', 'Elizabeth_Warren', 0.7480285258692932, 0.0, 'NULL']]
[[0, 24, 'Alexandria Ocasio-Cortez', 'Alexandria_Ocasio-Cortez', 0.3872783780234141, 0.0, 'NULL']]
[[0, 14, 'Kim Kardashian', 'Kim_Kardashian', 0.3872783780234141, 0.0, 'NULL']]
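One detail in these numbers is easy to miss: several entities come back with exactly the same score. The sketch below only tabulates the scores reported in this issue (the eight rows above plus the Erika Hess row from the first run) and checks them against the 0.35 threshold. It does not explain why the value repeats. The variable names are illustrative.

```python
from collections import Counter

# (entity, ED confidence) pairs copied from the outputs reported in this issue.
results = [
    ("Erika_Hess", 0.3872783780234141),  # from the first experiment
    ("Kara_Swisher", 0.3872783780234141),
    ("Nancy_Pelosi", 0.9522967375717587),
    ("Kamala_Harris", 0.45961057079516854),
    ("Michelle_Obama", 0.40272704217312827),
    ("Hillary_Clinton", 0.45967763466114964),
    ("Elizabeth_Warren", 0.7480285258692932),
    ("Alexandria_Ocasio-Cortez", 0.3872783780234141),
    ("Kim_Kardashian", 0.3872783780234141),
]

# Count how often each exact score occurs.
score_counts = Counter(score for _, score in results)
most_common_score, count = score_counts.most_common(1)[0]

# Entities that share the most frequent score, and entities above the threshold.
shared = [entity for entity, score in results if score == most_common_score]
accepted = [entity for entity, score in results if score >= 0.35]

print(f"{count} entities share the score {most_common_score}: {shared}")
print(f"{len(accepted)} of {len(results)} links clear the 0.35 threshold")
```

Every link clears the threshold even though none of the contexts match, and the repeated value is also the lowest score in the set.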

It seems that REL ED weights an exact string match heavily and gives little weight to how well the mention context matches the entity's Wikipedia embedding. This is a problem when evaluating ED on out-of-DB mentions whose surface form matches another entity in the DB. Is this primarily because the REL ED system is not trained to handle entities that are not in the DB? I run into this often: a single-name mention (such as Barack) gets linked to an entity (such as Barack Obama), even though the context and the Wikipedia description of the linked entity do not match at all.

Thanks!
