This project was created as part of a step-by-step video tutorial. It uses spaCy's entity linking functionality and Prodigy to disambiguate "Emerson" mentions in text to unique identifiers from Wikidata. As an example use case, we consider three different people called Emerson: an Australian tennis player, an American writer, and a Brazilian footballer. See here for the previous scripts for spaCy v2.x.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `download` | Download a spaCy model with pretrained vectors and NER component |
| `kb` | Create the Knowledge Base in spaCy and write it to file |
| `corpus` | Create a training and dev set from the manually annotated data |
| `train` | Train a new Entity Linking component |
| `evaluate` | Final evaluation on the dev data and printing the results |
| `clean` | Remove intermediate files |

The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `download` → `kb` → `corpus` → `train` → `evaluate` |
| `training` | `kb` → `corpus` → `train` → `evaluate` |

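
The "only re-run if inputs have changed" behaviour can be pictured as comparing a stored hash of each command's inputs against the current one. The sketch below illustrates that general idea only and is not Weasel's actual implementation; the `hash_inputs` and `run_workflow` helpers, the `deps` mapping, and the in-memory `state` dictionary are hypothetical names introduced for this example.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, List


def hash_inputs(paths: List[Path]) -> str:
    """Return one digest covering the names and contents of all input files."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(str(path).encode("utf8"))
        digest.update(path.read_bytes() if path.exists() else b"<missing>")
    return digest.hexdigest()


def run_workflow(
    steps: List[str],
    deps: Dict[str, List[Path]],
    actions: Dict[str, Callable[[], None]],
    state: Dict[str, str],
) -> List[str]:
    """Run the steps in order, skipping any step whose inputs are unchanged.

    `state` maps a step name to the input hash seen on its last run
    (an assumption of this sketch; a real tool would persist this to disk).
    Returns the names of the steps that were actually executed.
    """
    executed = []
    for name in steps:
        current = hash_inputs(deps.get(name, []))
        if state.get(name) == current:
            continue  # inputs unchanged since the last run: skip
        actions[name]()
        state[name] = current
        executed.append(name)
    return executed
```

Calling `run_workflow` twice with the same `state` and unchanged files executes nothing the second time; editing one of the input files causes the dependent steps to run again.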
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/emerson_annotated_text.jsonl` | Local | The annotated data |
| `assets/entities.csv` | Local | The entities in the knowledge base |
| `assets/emerson_input_text.txt` | Local | The original input text |

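
For illustration, a file like `assets/entities.csv` could be loaded into lookup tables before its entries are added to the knowledge base. This is a minimal sketch that assumes each row has three comma-separated columns (identifier, name, description) and no header row. That layout and the `load_entities` helper are assumptions, not something the text above confirms, and the sample rows use made-up placeholder identifiers rather than real Wikidata IDs.

```python
import csv
import io
from typing import Dict, TextIO, Tuple


def load_entities(handle: TextIO) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Read (id, name, description) rows into two id-keyed dictionaries."""
    names: Dict[str, str] = {}
    descriptions: Dict[str, str] = {}
    for row in csv.reader(handle, delimiter=","):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        entity_id, name, description = (cell.strip() for cell in row[:3])
        names[entity_id] = name
        descriptions[entity_id] = description
    return names, descriptions


# Placeholder rows: the identifiers and names are invented for this example.
sample = io.StringIO(
    "Q_TENNIS,Emerson A,Australian tennis player\n"
    "Q_WRITER,Emerson B,American writer\n"
    "Q_FOOTBALL,Emerson C,Brazilian footballer\n"
)
names, descriptions = load_entities(sample)
```

With a real file, the same function could be called on a handle opened with `open("assets/entities.csv", encoding="utf8", newline="")`.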
To perform the manual annotation in Prodigy, we have written a custom recipe, `el_recipe.py`. As input, we need to provide the knowledge base `my_kb` and the NER pipeline `my_nlp` that are created with the scripts described in the previous section. Further, the file `emerson_input_text.txt` lists 30 sentences from Wikipedia that contain just the mention "Emerson" and not the full name. These sentences are then annotated with Prodigy by executing the command

`prodigy entity_linker.manual emersons_annotated assets/emerson_input_text.txt temp/my_nlp/ temp/my_kb assets/entities.csv -F scripts/el_recipe.py`

The final results are stored to file with

`prodigy db-out emersons_annotated >> emerson_annotated_text.jsonl`

This JSONL file is included here as well in the `assets` subdirectory, so the scripts can be run without having to (re)do this manual annotation.
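
To show how such a JSONL file might be turned into a training and dev set, here is a minimal sketch. It assumes each line is a JSON object with a `text` field, an `answer` field, a `spans` list with `start` and `end` character offsets, and an `accept` list holding the selected identifier. That is the usual shape of Prodigy's choice-style output, but it is not spelled out above, so treat the field names as assumptions. The `read_annotations` and `split_per_entity` helpers and the dev-set size of two examples per entity are hypothetical.

```python
import json
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (text, (start, end) character offsets of the mention, entity identifier)
Example = Tuple[str, Tuple[int, int], str]


def read_annotations(lines: Iterable[str]) -> List[Example]:
    """Keep accepted records that have a mention span and a chosen identifier."""
    examples: List[Example] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record.get("answer") != "accept":
            continue  # skip rejected or ignored annotations
        if not record.get("accept") or not record.get("spans"):
            continue  # nothing selected, or no mention marked
        span = record["spans"][0]
        examples.append(
            (record["text"], (span["start"], span["end"]), record["accept"][0])
        )
    return examples


def split_per_entity(
    examples: List[Example], dev_per_entity: int = 2, seed: int = 0
) -> Tuple[List[Example], List[Example]]:
    """Hold out a fixed number of examples per entity for the dev set."""
    by_entity: Dict[str, List[Example]] = defaultdict(list)
    for example in examples:
        by_entity[example[2]].append(example)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train: List[Example] = []
    dev: List[Example] = []
    for entity_id in sorted(by_entity):
        group = by_entity[entity_id]
        rng.shuffle(group)
        dev.extend(group[:dev_per_entity])
        train.extend(group[dev_per_entity:])
    return train, dev
```

Splitting per entity, rather than over the whole list, keeps every identifier represented in both sets, which matters when there are only a few examples per person. The functions could be applied to the included file by passing an open handle on `assets/emerson_annotated_text.jsonl` to `read_annotations`.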