
Tweaking ED pipeline #121

Open
abhinavkulkarni opened this issue Oct 26, 2022 · 8 comments

Comments

@abhinavkulkarni

abhinavkulkarni commented Oct 26, 2022

Hi,

Thanks again for the great work!

I am currently evaluating REL for ED purposes and comparing it against other ED techniques, chiefly against BLINK from Facebook AI Research. They both take into account the context in which a mention occurs, are two-staged, and use neural approaches. BLINK does well, but can be slow and requires a GPU to run, which is a limitation for me.

Although REL is fast and lightweight, I find that it often misses obvious cases. I am looking for some guidance on how to tweak the internal workings of REL to get more accurate results.

The following results were obtained by running REL on a podcast description and a particular episode description, separated by a newline.

That is, in the code:

# REL entity-linking request: concatenate the podcast and episode summaries
# and let REL detect the mentions itself (empty "spans" list).
import requests

text_doc = podcast_summary + '\n' + episode_summary
el_result = requests.post(API_URL, json={
    "text": text_doc,
    "spans": []
}).json()
  • For this episode, the mention Shadi Hamid is identified as Brookings_Institution with score 0.9991938769817352 and NER tag PER. This is particularly egregious: Shadi Hamid's Wikipedia page is not returned as the first candidate.

  • For this episode, the mention Lauren Bonner from the podcast description is identified as Lauren_Samuels with score 0.9993583559989929, even though the last names are quite different, while the mention Ray J is (correctly) identified as Ray_J, albeit with a lower score of 0.8136761486530304.

  • For this episode, the mention Charlamagne Tha God from the podcast description gets a score of only 0.7140538295110067, even though words like comedians, outspoken celebrities, and thought-leaders appear in the context (which should make it easy to match the embedding learned from his Wikipedia profile, which contains similar words).

  • For this episode, the mention Dave Smith is always identified as Dave_Smith_(engineer) with very high confidence, even though Dave_Smith_(comedian), the correct answer, appears in the candidate set and the context even contains words such as government, foreign policy, and all things Libertarian, which should match his Wikipedia description more closely.

The last point is particularly important since Dave Smith is quite a common name and there are at least four Dave Smiths on Wikipedia, with very different descriptions.

Thanks!

@arjenpdevries
Member

Very interesting project, and nice failure analysis!

My suspicion is that the REL data is too old for your more contemporary data. E.g., the Shadi Hamid WP page was only filled with some text in 2020, and I think Lauren Bonner does not have a Wikipedia page. At the end of 2013, the Charlamagne Tha God page had only one occurrence of "Comedy"; the rest is really about works and shows, so it is harder to match. The Dave Smith page was created in 2021, so it is not in our data.

(PS: Did you use the 2014 or the 2019 data?)

I think you need a newer Wikipedia target (and newer embeddings) to get better results for these cases. I'll ask the team whether anyone is working on this anyway, or can help out.

@abhinavkulkarni
Author

@arjenpdevries: Thanks for the prompt response. I am using the wiki_2019 dataset.

Okay, so the Shadi Hamid case is resolved, as his page was only filled in after 2019.

Still, the Charlamagne Tha God case is perplexing. I am not quite sure what his page looked like in 2019, but he is a pretty big celebrity and a comedian, so I am sure his page, even back in 2019, should have had those references.

As for Lauren Bonner, yes, she is not in WP; however, a very different candidate, Lauren_Samuels, was returned with a high probability of 0.9993583559989929. I suppose I could keep a list of WP titles and aliases and prune out such outliers using some implementation of Levenshtein distance, but that would make ED slower.
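For instance, a rough sketch of the post-filter I have in mind (a hypothetical helper applied to the API output, not part of REL itself; the max_ratio threshold is made up and would need tuning):

def levenshtein(a: str, b: str) -> int:
    # Plain dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]

def plausible_link(mention: str, wiki_title: str, max_ratio: float = 0.3) -> bool:
    # Keep a prediction only if the normalized edit distance between the mention
    # and the title (underscores replaced, disambiguation suffix stripped)
    # stays within max_ratio.
    title = wiki_title.replace("_", " ").split(" (")[0].lower()
    mention = mention.lower()
    distance = levenshtein(mention, title)
    return distance / max(len(mention), len(title)) <= max_ratio

# e.g. plausible_link("Lauren Bonner", "Lauren_Samuels") -> False (prune),
# while plausible_link("Ray J", "Ray_J") -> True (keep)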

Also, thanks for clarifying the Dave Smith situation.

Looking forward to the newer WP target and embeddings, if they are available.

Thanks!

@arjenpdevries
Member

The golden route would be to link to a different KG that actually specializes in your domain, but we have not really experimented with other KGs yet, either. Or make a Wikipedia page for Lauren Bonner ;-) I assume BLINK has the same problem in this specific case; it also links to WP as a target, doesn't it?

@hasibi
Member

hasibi commented Oct 26, 2022

Indeed a nice analysis, thanks for sharing it with us.

I also recommend using REL with a newer Wikipedia version. A version of REL trained on the 2021-03 Wikipedia dump has been made available by Giovanni Sorice from the University of Pisa on the SoBigData platform.

Note that the SoBigData API of REL is made by a third party, and the results may not necessarily be in line with what we report and confirm in the paper.

Updating REL to newer Wikipedia versions is also on our to-do list, but it might take a while before it becomes available (other REL features have priority now).

If you want to deploy REL on another Wikipedia version, you can follow this tutorial: https://rel.readthedocs.io/en/latest/tutorials/deploy_REL_new_wiki/

@hasibi
Member

hasibi commented Oct 26, 2022

Also, for the Charlamagne Tha God example, the entity is identified correctly, but you are wondering why the confidence score is relatively low, right?
Note that the confidence score is not a calibrated score in the range [0, 1], but a learned value.


@abhinavkulkarni
Author

@hasibi: There seem to be two scores: one for NER and the other for ED. It seems I was looking at the NER score.

Can you provide more details about the ED score? Is lower better? Would you recommend a threshold that achieves a good trade-off between precision and recall?

Concatenating the podcast and episode descriptions (episode link), I get the following result:

[
 [
  0,
  19,
  "Charlamagne Tha God",
  "Charlamagne_tha_God",
  0.3872783780234141,
  0.7140538295110067,
  "PER"
 ],
 [
  540,
  13,
  "Michael Cohen",
  "Michael_Cohen_(lawyer)",
  0.45666060472584635,
  0.9998109936714172,
  "PER"
 ],
 [
  564,
  10,
  "Jim Norton",
  "Jim_Norton_(comedian)",
  0.4219384342889336,
  0.9998306930065155,
  "PER"
 ],
 [
  650,
  19,
  "Charlamagne Tha God",
  "Charlamagne_tha_God",
  0.3872783780234141,
  0.9017136096954346,
  "PER"
 ],
 [
  677,
  16,
  "Stephen A. Smith",
  "Stephen_A._Smith",
  0.3872783780234141,
  0.9961599707603455,
  "PER"
 ]
]

@hasibi
Member

hasibi commented Oct 31, 2022

I was talking about ED scores. The higher, the better. We do not recommend a threshold, as we optimize for F1. If you want higher precision, at the expense of lower recall, you can choose your own threshold. This threshold is mainly intended for downstream tasks, where the confidence of the entity linker is needed.
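For example, assuming (as concluded above) that the fifth element of each result tuple is the ED confidence and the sixth the MD/NER confidence, such a downstream filter could look like the sketch below; ED_THRESHOLD is a hypothetical value to be tuned on your own data.

# Hypothetical post-filter on el_result from the earlier requests.post call:
# keep only entities whose ED confidence clears a user-chosen threshold.
ED_THRESHOLD = 0.4  # illustrative value, not a recommendation

high_precision_links = [
    (start, length, mention, entity, ed_conf, md_conf, tag)
    for start, length, mention, entity, ed_conf, md_conf, tag in el_result
    if ed_conf >= ED_THRESHOLD
]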
