This repository has been archived by the owner on Jul 25, 2024. It is now read-only.
After DH2019 and a meeting before that with Marijn Schraagen, I have the idea that a multiNER might not be the best solution for collecting as many locations as possible from a(ny) historical corpus. Instead, I think it might be a better idea to enhance one NER system with a list of placenames, preferably historical and multilingual, so that the chances of finding placenames are significantly increased. For the purposes of this project, all other entities are ignored anyway.
Do a brief proof of concept with spaCy, enhanced with both these Dutch historical placenames and relevant bits of Geonames. Find entities and compare with either the Italian or Dutch Golden Standard (or both) using evaluate.py.
I ran spaCy with two sets from the Geonames database: 1) with the feature classes P, L, and A (see the README at the bottom of this page); 2) with only P. In both cases I extracted the regular (i.e. English) place name and its alternate Italian name. For 1) this resulted in 271432 toponyms, and for 2) 162317. These lists were added to spaCy in three ways: a) as patterns in the EntityRuler before the ner component; b) as patterns in the EntityRuler after the ner component; and c) as plain strings to match using a PhraseMatcher.
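For reference, the three setups could look roughly like the sketch below. It is not the script used for the runs above: it assumes the spaCy v3 API, uses a blank Italian pipeline so that it runs without downloading a model, and the four-name `TOPONYMS` list and the helper names `add_ruler` and `phrase_match` are made up for illustration.

```python
import spacy
from spacy.matcher import PhraseMatcher

# Hypothetical toy gazetteer standing in for the Geonames lists.
TOPONYMS = ["Venezia", "Venice", "Firenze", "Florence"]


def add_ruler(nlp, toponyms, position="before"):
    """Setups a) and b): an EntityRuler placed before or after "ner".

    If the pipeline has no "ner" component (as in a blank pipeline),
    the ruler is simply appended.
    """
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    placement = {position: "ner"} if "ner" in nlp.pipe_names else {}
    ruler = nlp.add_pipe("entity_ruler", **placement)
    ruler.add_patterns([{"label": "LOC", "pattern": name} for name in toponyms])
    return ruler


def phrase_match(nlp, toponyms, text):
    """Setup c): match the toponyms as plain strings with a PhraseMatcher."""
    matcher = PhraseMatcher(nlp.vocab)  # matches on exact token text by default
    matcher.add("LOC", [nlp.make_doc(name) for name in toponyms])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]


nlp = spacy.blank("it")  # assumption: no trained model, so there is no "ner" here
add_ruler(nlp, TOPONYMS, position="before")
doc = nlp("Da Venezia a Firenze")
print([(ent.text, ent.label_) for ent in doc.ents])
print(phrase_match(nlp, TOPONYMS, "Da Venezia a Firenze"))
```

With a trained pipeline (for example `spacy.load("it_core_news_sm")`, assuming that model is installed), `position` decides whether the gazetteer entities are set before the statistical model sees the text or are added afterwards. With the ruler's default settings, a ruler placed after ner only adds entities that do not overlap with ones ner has already found.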
Results, however, were very disappointing. Giving only the scores for LOC, here is spaCy without any toponyms added:
| spaCy config | precision | recall | f1-score |
|---|---|---|---|
| None | 0.399 | 0.644 | 0.441 |
And with the various modifications:
| spaCy config | precision | recall | f1-score |
|---|---|---|---|
| a1 | 0.264 | 0.740 | 0.356 |
| a2 | 0.307 | 0.654 | 0.384 |
| b1 | 0.263 | 0.734 | 0.354 |
| b2 | x | x | x |
| c1 | 0.337 | 0.651 | 0.406 |
| c2 | 0.335 | 0.651 | 0.405 |
Note that b2 was never run, because of pure disappointment on the developers' side.
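A note on reading these numbers: the f1-scores are lower than the harmonic mean of the precision and recall in the same row. For the unmodified model, 2 × 0.399 × 0.644 / (0.399 + 0.644) is about 0.493, not 0.441. That would be consistent with scores being computed per document and then averaged, but evaluate.py is not shown here, so this is only a guess. The sketch below illustrates the effect on made-up data; the function names and the toy documents are hypothetical.

```python
def prf(gold, predicted):
    """Set-based precision, recall and F1 for one document."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def macro_average(docs):
    """Average the per-document scores; docs is a list of (gold, predicted) pairs."""
    scores = [prf(gold, predicted) for gold, predicted in docs]
    return tuple(sum(score[i] for score in scores) / len(scores) for i in range(3))


# Made-up example: one document tagged perfectly, one with many false positives.
toy_docs = [
    (["Venezia", "Firenze"], ["Venezia", "Firenze"]),
    (["Roma", "Napoli"], ["Roma", "Milano", "Torino", "Genova"]),
]
precision, recall, f1 = macro_average(toy_docs)
harmonic = 2 * precision * recall / (precision + recall)
print(f"macro P={precision:.3f} R={recall:.3f} F1={f1:.3f}, harmonic mean of P and R={harmonic:.3f}")
```

On this toy data the averaged F1 comes out below the harmonic mean of the averaged precision and recall, the same pattern as in the tables above.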
Perhaps I was expecting too much?
Anyhow, this does not seem the way to go for now, although I am still tempted by the method and want to investigate further why it doesn't work as well as expected (e.g. which kinds of placenames are missed, which words are labelled LOC when they aren't, and what can be done to improve this).
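A possible starting point for that investigation is to count, across documents, which gold placenames are missed most often and which strings are most often tagged LOC by mistake. The sketch below assumes the gold and predicted locations are available as lists of strings per document; `error_report` and the example data are hypothetical and not part of evaluate.py.

```python
from collections import Counter


def error_report(docs, top_n=10):
    """Return the most frequent missed and spurious LOC strings.

    docs is a list of (gold, predicted) pairs, each a list of strings.
    """
    missed, spurious = Counter(), Counter()
    for gold, predicted in docs:
        gold, predicted = set(gold), set(predicted)
        missed.update(gold - predicted)    # false negatives
        spurious.update(predicted - gold)  # false positives
    return missed.most_common(top_n), spurious.most_common(top_n)


# Made-up example: "Marco" is wrongly tagged as LOC in both documents.
example_docs = [
    (["Venezia", "Roma"], ["Venezia", "Marco"]),
    (["Roma"], ["Marco", "Roma"]),
]
missed, spurious = error_report(example_docs)
print("missed:", missed)
print("spurious:", spurious)
```

If the spurious list turns out to be dominated by gazetteer entries that are also common words or personal names, that would point towards filtering the Geonames lists rather than dropping the approach.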