Entity Mapping Preprocessing #169

Open
kimwongyuda opened this issue Nov 17, 2022 · 1 comment

@kimwongyuda

Hi, first of all, thank you for the nice work.

Consider the following input example.

"Everaldo has played for Guarani and Santa Cruz in the Campeonato Brasileiro, before moving to Mexico where he played for Chiapas and Necaxa.", entity: Guarani.

During training, a [MASK] token replaces the Guarani entity in the input.
The model is then trained with cross-entropy loss to predict Guarani for that [MASK] token.
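
To make the setup concrete, here is a toy sketch of the loss for one masked entity. It is not LUKE's actual training code: the vocabulary, the ids, and the logits are made up for illustration, and `cross_entropy` is a hypothetical helper written in plain Python.

```python
import math

def cross_entropy(logits, target_index):
    """Softmax cross-entropy for a single prediction (numerically stable)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

# Hypothetical toy entity vocabulary; real ids and entries will differ.
toy_entity_vocab = {
    "[PAD]": 0,
    "[UNK]": 1,
    "[MASK]": 2,
    "Guarani FC": 3,
    "Guarani language": 4,
}

# Made-up scores the model might assign to each entity at the [MASK] position.
logits = [0.0, 0.0, 0.0, 2.0, 1.0]

# The loss depends on which vocabulary entry is used as the target.
loss_fc = cross_entropy(logits, toy_entity_vocab["Guarani FC"])
loss_language = cross_entropy(logits, toy_entity_vocab["Guarani language"])
print(loss_fc, loss_language)
```

The point of the sketch is that the target has to be an id from the entity vocabulary, so the surface string Guarani cannot be used directly as a label.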

However, entity_vocab.json has no entry for "Guarani".
It only has "Guarani language", "Guarani FC", "Tupi\u2013Guarani languages", and "Guarani mythology".
In this example, I believe Guarani refers to Guarani FC.

So, is the model trained to predict Guarani FC for [MASK]?
If so, the model needs to know that Guarani means Guarani FC,
and I guess we need to map the mention Guarani to the vocabulary entry Guarani FC.
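
The mismatch can be reproduced with a small lookup. This is only a sketch: `toy_entity_vocab` is an in-memory stand-in holding the four entries quoted above with made-up ids (the real file's layout may differ), and `find_candidates` is a hypothetical helper, not part of LUKE.

```python
# Stand-in for the relevant part of the entity vocabulary; ids are invented.
toy_entity_vocab = {
    "Guarani language": 10,
    "Guarani FC": 11,
    "Tupi\u2013Guarani languages": 12,
    "Guarani mythology": 13,
}

def find_candidates(mention, vocab):
    """Return vocabulary titles that contain the mention (case-insensitive)."""
    needle = mention.lower()
    return sorted(title for title in vocab if needle in title.lower())

mention = "Guarani"
exact_match = mention in toy_entity_vocab
candidates = find_candidates(mention, toy_entity_vocab)

print("exact match:", exact_match)
for title in candidates:
    print("candidate:", title)
```

A plain string match leaves four candidates and no exact hit, so the mention text alone does not say which entry is the right target.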

Does the preprocessing described in https://github.com/studio-ousia/luke/blob/master/pretraining.md deal with such issues?

Thank you.

@ryokan0123
Contributor

Hi @kimwongyuda,

The pretraining data of LUKE is constructed with text from Wikipedia, which is already annotated with ground-truth entities.
So, the ambiguity of entity mentions will not be an issue in pretraining.

In your example, if "Guarani" is not in the entity vocabulary, the target entity for [MASK] will be [UNK], or the entity will be ignored, depending on the setting.
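
As a rough illustration of the two behaviours described above (a sketch, not the actual LUKE preprocessing code): the label comes from the entity the Wikipedia text is annotated with, not from the mention string. The function `resolve_label`, the flag `ignore_unknown`, the ids, and the assumption that the mention is annotated with the title "Guarani FC" are all hypothetical.

```python
UNK_TOKEN = "[UNK]"

# Toy vocabulary with invented ids.
toy_entity_vocab = {"[PAD]": 0, UNK_TOKEN: 1, "[MASK]": 2, "Guarani FC": 3}

def resolve_label(annotated_title, vocab, ignore_unknown=False):
    """Map an annotated entity title to a training label.

    Returns the vocabulary id if the title is known. Otherwise returns
    None (entity skipped) when ignore_unknown is True, else the [UNK] id.
    """
    if annotated_title in vocab:
        return vocab[annotated_title]
    if ignore_unknown:
        return None
    return vocab[UNK_TOKEN]

# Mention text "Guarani", assumed to be annotated with the title "Guarani FC".
known = resolve_label("Guarani FC", toy_entity_vocab)

# An annotated title that is missing from the vocabulary.
as_unk = resolve_label("Some Rare Entity", toy_entity_vocab)
skipped = resolve_label("Some Rare Entity", toy_entity_vocab, ignore_unknown=True)

print(known, as_unk, skipped)
```

Under these assumptions no string matching between the mention and the vocabulary is needed, because the annotation already names the entity.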
