Named Entity Recognition and Classification for languages other than EN/FR #136
Comments
Hello @dddpt! Sorry for the slow response :( For non-English/French texts, no NER is used, which means "terms" are selected only via the Wikipedia anchors of that language. NER is useful for general text (journalism, history, ...) because the named entity classes are very general. It does not help in more specialized domains, such as scientific ones. The Wikipedia vocabulary provides reliable terms, whereas NER is often noisy.
Hi @kermitt2, Thanks for the answer ;-) A Wikipedia anchor is the text of a link from one Wikipedia article to another, right? So each time a term (or a sequence of terms?) matches any anchor in the Wikipedia of the corresponding language, it is recognized as an entity? (And while I'm at it, is there a technical report or article detailing entity-fishing in addition to the readthedocs?)
It is recognized as an entity candidate; this is more or less how all entity linking tools work (although often not at full scale). In English, for instance, entity-fishing considers 206 million "terms" (anchors, plus article titles and synonyms, single- or multi-word) for every input. Each of these 206 million terms is associated with one or several Wikidata entities.
So indeed, plenty of words and multi-word terms (what I call "mentions") might be considered, leading to a massive number of entity candidates. The challenge is to 1) select the most likely correct entity candidate and 2) decide whether that most likely candidate is acceptable (that is, reject some "linking" because the term is used as a common word, not as a reference to a particular entity). Only a few candidates are finally selected as final entity labels. The "best" mentions and entities are selected by learning from the disambiguation made by Wikipedia contributors when they add anchors in Wikipedia.
For an overview, see this presentation at WikiDataCon: https://grobid.s3.amazonaws.com/presentations/29-10-2017.pdf
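To make the two steps above concrete, here is a minimal sketch of anchor-based candidate generation followed by selection and rejection. It is not entity-fishing's actual code: the `ANCHORS` table, its numbers, the entity labels, the function names and the `min_link_prob` threshold are all hypothetical, and a simple prior stands in for the learned, context-aware disambiguation described above.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionary: surface form -> (link probability, candidates).
# "Link probability" stands for how often the term appears as an anchor rather
# than as plain text; each candidate has a prior. All values are made up, and
# the entity labels are placeholders, not real Wikidata identifiers.
ANCHORS: Dict[str, Tuple[float, Dict[str, float]]] = {
    "paris": (0.60, {"paris_city": 0.95, "paris_mythology": 0.05}),
    "apple": (0.10, {"apple_company": 0.70, "apple_fruit": 0.30}),
    "new york": (0.80, {"new_york_city": 0.85, "new_york_state": 0.15}),
}


def find_mentions(tokens: List[str], max_len: int = 3) -> List[Tuple[int, int, str]]:
    """Return (start, end, surface) for every n-gram found in ANCHORS."""
    mentions = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            surface = " ".join(tokens[start:end]).lower()
            if surface in ANCHORS:
                mentions.append((start, end, surface))
    return mentions


def link(text: str, min_link_prob: float = 0.3) -> List[Tuple[str, str]]:
    """Select one entity per mention, rejecting likely common-word uses."""
    results = []
    for _start, _end, surface in find_mentions(text.split()):
        link_prob, candidates = ANCHORS[surface]
        # Step 2: reject mentions that are rarely used as links.
        if link_prob < min_link_prob:
            continue
        # Step 1: keep the most likely candidate (here: highest prior only).
        best = max(candidates, key=candidates.get)
        results.append((surface, best))
    return results


if __name__ == "__main__":
    print(link("I ate an apple in New York"))
```

With this toy table, "apple" is found as a mention but rejected because its link probability is below the threshold, while "New York" is kept and resolved to its highest-prior candidate. A real system would rank candidates using the surrounding context rather than the prior alone.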
Great, thanks for the detailed reply 👍
I am using entity-fishing on a corpus of ~35k documents with a French, an Italian and a German version.
In the entity-fishing documentation, there is this paragraph:
What does it mean for non-English/French texts?
Is another named entity recognition system used?
Should I expect worse results for entity recognition on German and Italian?
The German Wikipedia has the best coverage of the topics in my corpus, so I was thinking of focusing on the German version of the corpus. Now I'm wondering whether I should instead focus on the French version, hoping for better recognition performance. Any hints?
Thanks for this great tool! :-)