Entity identification for historical documents in Dutch, developed within the Clariah+ project at VU Amsterdam.
While the primary use case is to process historical Dutch documents, the broader goal of this project is to develop an adaptive framework that can process any set of Dutch documents: for instance, documents with or without pre-recognized entities (gold NER or not), or documents whose entities may or may not be linkable (in-KB entities or not).
We achieve this flexibility in two ways:
- we create an unsupervised system based on recent techniques such as BERT embeddings
- we involve human experts by allowing them to enrich or alter the tool's output
The current solution is entirely unsupervised and works as follows:
- Obtain documents (supported formats so far: MediaWiki and NIF)
- Extract entity mentions (gold NER or by running spaCy)
- Create initial NAF documents that also contain the recognized entities
- Compute BERT sentence+mention embeddings
- Enrich them with word2vec document embeddings
- Bucket mentions based on their similarity
- Cluster the embeddings within each bucket with the HAC algorithm (this and the bucketing step are sketched below)
- Run the evaluation with a Rand index-based score
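To make the bucket-then-cluster step concrete, here is a minimal, hypothetical sketch (not the code in `main.py` or `algorithm_utils.py`). It assumes each mention is a dict with a surface `form` and a precomputed `embedding`, buckets mentions by an exact lowercased form match for brevity (the real system buckets by form similarity), and clusters each bucket with average-linkage HAC over cosine distances via SciPy; the `distance_threshold` value is an arbitrary placeholder.

```python
# Hypothetical sketch of the bucket-then-cluster step (not the repository code).
from collections import defaultdict

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def bucket_by_form(mentions):
    """Group mentions whose lowercased surface forms are identical.

    Exact matching is used here only to keep the example short; the actual
    system buckets by form similarity.
    """
    buckets = defaultdict(list)
    for mention in mentions:
        buckets[mention["form"].lower()].append(mention)
    return buckets


def cluster_bucket(bucket, distance_threshold=0.4):
    """Cluster the embeddings of one bucket with HAC (average linkage, cosine)."""
    if len(bucket) == 1:
        return [1]  # a singleton bucket is its own identity
    embeddings = np.vstack([m["embedding"] for m in bucket])
    tree = linkage(embeddings, method="average", metric="cosine")
    return fcluster(tree, t=distance_threshold, criterion="distance")


def assign_identities(mentions):
    """Give every mention an identity label of the shape '<form>/<cluster id>'."""
    for form, bucket in bucket_by_form(mentions).items():
        for mention, label in zip(bucket, cluster_bucket(bucket)):
            mention["identity"] = f"{form}/{label}"
    return mentions
```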
We compare our identity clustering algorithm against five baselines (one of them is sketched after the list):
- string-similarity - forms that are identical or sufficiently similar are coreferential.
- one-form-one-identity - all occurrences of the same form refer to the same entity.
- one-form-and-type-one-identity - all occurrences of the same form, when this form is of the same semantic type, refer to the same entity.
- one-form-in-document-one-identity - all occurrences of the same form within a document are coreferential. All occurrences across documents are not.
- one-form-and-type-in-document-one-identity - all occurrences of the same form that have the same semantic type within a document are coreferential; the rest are not.
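For illustration, below is a hypothetical sketch of the one-form-one-identity baseline; the actual implementation lives in `baselines.py` and may differ in its details.

```python
# Hypothetical sketch of the one-form-one-identity baseline (not the repository code).
from collections import defaultdict


def one_form_one_identity(mentions):
    """Assign every occurrence of the same surface form to the same entity."""
    clusters = defaultdict(list)
    for mention in mentions:
        clusters[mention["form"]].append(mention)
    return clusters


# Toy example: the two "Amsterdam" mentions are treated as coreferential,
# even if they would refer to different entities in reality.
mentions = [
    {"form": "Amsterdam", "doc": "doc1"},
    {"form": "Amsterdam", "doc": "doc2"},
    {"form": "Rotterdam", "doc": "doc1"},
]
print({form: len(group) for form, group in one_form_one_identity(mentions).items()})
# {'Amsterdam': 2, 'Rotterdam': 1}
```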
- The scripts `make_wiki_corpus.py` and `make_nif_corpus.py` create a corpus (as Python classes) from the source data we download in MediaWiki or NIF format, respectively. The script `make_wiki_corpus.py` expects the file `data/input_data/nlwikinews-latest-pages-articles.xml` as input, which is a collection of Wikinews documents in Dutch in XML format. The script `make_nif_corpus.py` expects the input file `abstracts_nl{num}.ttl`, where `num` is a number between 0 and 43, inclusive. These extraction scripts use some functions from `pickle_utils.py` and from `wiki_utils.py`.
- The script `main.py` executes the algorithm procedure described above. It relies on functions in several utility files: `algorithm_utils.py`, `bert_utils.py`, `analysis_utils.py`, `pickle_utils.py`, `naf_utils.py`.
- Evaluation functions are stored in the file `evaluation.py` (a sketch of Rand index-based scoring follows this list).
- Baselines are run by running the file `baselines.py` (with no arguments).
- The classes we work with are defined in the file `classes.py`.
- Configuration files are found in the folder `cfg`. These are loaded through the script `config.py`.
- All data is stored in the folder `data`.
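For context, evaluating against gold identity clusters amounts to comparing two clusterings of the same mentions with a Rand index-style score. The snippet below is a hypothetical illustration using scikit-learn, not the code in `evaluation.py`.

```python
# Hypothetical illustration of Rand index-based scoring with scikit-learn;
# evaluation.py in this repository may compute the score differently.
from sklearn.metrics import adjusted_rand_score, rand_score

# Cluster labels per mention: gold identities vs. system output.
gold = ["ent1", "ent1", "ent2", "ent3", "ent3"]
system = ["a", "a", "b", "b", "c"]

print("Rand index:", rand_score(gold, system))
print("Adjusted Rand index:", adjusted_rand_score(gold, system))
```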
To prepare your environment with the right packages, run `bash install.sh`.
Then download the corpora you would like to work with and store them in `data/{corpus_name}/input_data`. To reuse the config files found in `cfg` and run on Wikinews or DBpedia abstracts, you can do the following:
- For `wikinews`, download `nlwikinews-latest-pages-articles.xml`, for example from here, and store it in `data/wikinews/input_data` (make sure you unpack it).
- For `dbpedia_abstracts`, you can download .ttl files from this link. Each .ttl file contains many abstracts, so it is advisable to start with a single file to understand what is going on. Download and unpack the file, then store it in `data/dbpedia_abstracts/input_data`.
Then you should be able to run `make_wiki_corpus.py` and `make_nif_corpus.py` to load the corpora, and to run `main.py` directly in order to process the corpora with our tool. Make sure that you use the right config file in these scripts (e.g., `wikinews50.cfg` will let you process 50 files from Wikinews).
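As a sanity check before running the corpus scripts, a small helper along the following lines (hypothetical, not part of the repository) can confirm that the downloaded files ended up in the `data/{corpus_name}/input_data` layout described above; the expected file names are taken from the descriptions in this README.

```python
# Hypothetical helper (not part of the repository) to verify the input-data layout.
from pathlib import Path

EXPECTED_INPUTS = {
    "wikinews": ["nlwikinews-latest-pages-articles.xml"],
    # The NIF abstracts are split over abstracts_nl0.ttl .. abstracts_nl43.ttl;
    # you do not need all of them, starting with one file is fine.
    "dbpedia_abstracts": [f"abstracts_nl{num}.ttl" for num in range(44)],
}


def check_input_data(corpus_name, data_root="data"):
    """Report which expected input files are present for a corpus."""
    input_dir = Path(data_root) / corpus_name / "input_data"
    present = [f for f in EXPECTED_INPUTS[corpus_name] if (input_dir / f).exists()]
    print(f"{corpus_name}: {len(present)} expected file(s) found in {input_dir}")
    return present


if __name__ == "__main__":
    check_input_data("wikinews")
    check_input_data("dbpedia_abstracts")
```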
- Filip Ilievski (f.ilievski@vu.nl)
- Sophie Arnoult (sophie.arnoult@posteo.net)