nel_emerson

🪐 Weasel Project: Disambiguation of "Emerson" mentions in sentences (Entity Linking)

This project was created as part of a step-by-step video tutorial. It uses spaCy's entity linking functionality and Prodigy to disambiguate "Emerson" mentions in text to unique identifiers from Wikidata. As an example use-case, we consider three different people called Emerson: an Australian tennis player, an American writer, and a Brazilian footballer. See here for the previous scripts for spaCy v2.x.

📋 project.yml

The project.yml defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using weasel run [name]. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| download | Download a spaCy model with pretrained vectors and NER component |
| kb | Create the Knowledge Base in spaCy and write it to file |
| corpus | Create a training and dev set from the manually annotated data |
| train | Train a new Entity Linking component |
| evaluate | Final evaluation on the dev data and printing the results |
| clean | Remove intermediate files |

⏭ Workflows

The following workflows are defined by the project. They can be executed using weasel run [name] and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| all | download → kb → corpus → train → evaluate |
| training | kb → corpus → train → evaluate |
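
The "only re-run if their inputs have changed" behaviour can be illustrated with a small checksum comparison. This is a minimal sketch of the general idea, not Weasel's actual implementation: the function names, the lock file name `example.lock`, and its JSON layout are all assumptions made for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def current_state(inputs):
    """Map each input path to its current checksum."""
    return {str(p): file_checksum(p) for p in inputs}


def needs_rerun(name, inputs, lock_path):
    """True if the inputs of command `name` differ from the last recorded run."""
    recorded = {}
    if lock_path.exists():
        recorded = json.loads(lock_path.read_text()).get(name, {})
    return current_state(inputs) != recorded


def record_run(name, inputs, lock_path):
    """Store the current input checksums for command `name` (hypothetical lock format)."""
    lock = json.loads(lock_path.read_text()) if lock_path.exists() else {}
    lock[name] = current_state(inputs)
    lock_path.write_text(json.dumps(lock, indent=2))


# Demonstration on a throwaway file standing in for a command input.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "entities.csv"
    lock_file = Path(tmp) / "example.lock"
    source.write_text("first version\n")

    before_first_run = needs_rerun("kb", [source], lock_file)  # nothing recorded yet
    record_run("kb", [source], lock_file)
    after_first_run = needs_rerun("kb", [source], lock_file)   # inputs unchanged
    source.write_text("second version\n")
    after_edit = needs_rerun("kb", [source], lock_file)        # input was modified
```

In this sketch a command is skipped only when every recorded input checksum matches the file currently on disk.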

🗂 Assets

The following assets are defined by the project. They can be fetched by running weasel assets in the project directory.

| File | Source | Description |
| --- | --- | --- |
| assets/emerson_annotated_text.jsonl | Local | The annotated data |
| assets/entities.csv | Local | The entities in the knowledge base |
| assets/emerson_input_text.txt | Local | The original input text |
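
As an illustration of how the entity file might be consumed, the following sketch loads entity rows into two lookup dictionaries. It assumes each row holds an identifier, a name, and a description, in that order and without a header; the real layout of assets/entities.csv may differ. The identifiers and names in the sample are placeholders, not the actual Wikidata IDs used by the project.

```python
import csv
import io

# Placeholder rows in an ASSUMED "id,name,description" layout.
SAMPLE_CSV = """\
Q_PLACEHOLDER_1,Emerson (tennis player),Australian tennis player
Q_PLACEHOLDER_2,Emerson (writer),American writer
Q_PLACEHOLDER_3,Emerson (footballer),Brazilian footballer
"""


def load_entities(handle):
    """Read entity rows and return (id -> name, id -> description) dictionaries."""
    names = {}
    descriptions = {}
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        qid, name, description = row[0], row[1], row[2]
        names[qid] = name
        descriptions[qid] = description
    return names, descriptions


names, descriptions = load_entities(io.StringIO(SAMPLE_CSV))
for qid in names:
    print(qid, names[qid], descriptions[qid], sep=" | ")
```

To read the real file instead, the same function could be called with `open("assets/entities.csv", encoding="utf8")` in place of the in-memory sample.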

Prodigy annotation

To perform the manual annotation in Prodigy, we have written a custom recipe el_recipe.py.

As input, we need to provide the knowledge base my_kb and the NER pipeline my_nlp, both of which are created by the scripts described above. In addition, the file emerson_input_text.txt lists 30 sentences from Wikipedia, each containing only the mention "Emerson" rather than a full name. These sentences are then annotated with Prodigy by executing the command:

prodigy entity_linker.manual emersons_annotated assets/emerson_input_text.txt temp/my_nlp/ temp/my_kb assets/entities.csv -F scripts/el_recipe.py

The final annotations are then exported to a file with:

prodigy db-out emersons_annotated >> emerson_annotated_text.jsonl

This JSONL file is also included in the assets subdirectory, so the scripts can be run without having to redo the manual annotation.
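
A rough sketch of how the exported annotations could be turned into training and dev examples is shown below. It assumes each JSONL record has a "text" field, a "spans" list with character offsets, an "accept" list holding the chosen entity identifier, and an "answer" field; these field names are an assumption about the export format and should be checked against the actual file. The sample sentences and identifiers are invented for illustration, and the random split is a simplification that may differ from what the project's corpus command does.

```python
import json
import random


def make_record(text, qid, answer):
    """Build one sample record in the ASSUMED export format."""
    start = text.index("Emerson")
    span = {"start": start, "end": start + len("Emerson"), "label": "PERSON"}
    return json.dumps({"text": text, "spans": [span], "accept": [qid], "answer": answer})


# Invented sample lines; the last one was rejected by the annotator.
SAMPLE_LINES = [
    make_record("Emerson won the title.", "Q_PLACEHOLDER_1", "accept"),
    make_record("The essay was written by Emerson.", "Q_PLACEHOLDER_2", "accept"),
    make_record("A late goal by Emerson decided the match.", "Q_PLACEHOLDER_3", "accept"),
    make_record("Nobody knew which Emerson was meant.", "Q_PLACEHOLDER_1", "reject"),
]


def read_annotations(lines):
    """Keep accepted records with exactly one chosen entity."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        accepted = record.get("accept", [])
        if record.get("answer") != "accept" or len(accepted) != 1:
            continue
        span = record["spans"][0]
        examples.append((record["text"], (span["start"], span["end"]), accepted[0]))
    return examples


def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle deterministically and hold out a fraction (at least one) as dev data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


examples = read_annotations(SAMPLE_LINES)
train, dev = split_train_dev(examples)
```

Each resulting example is a tuple of the sentence, the character offsets of the mention, and the selected identifier, which is roughly the information an entity linking component needs for training.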