🪐 Weasel Project: Disambiguation of "Emerson" mentions in sentences (Entity Linking)

This project was created as part of a step-by-step video tutorial. It uses spaCy's entity linking functionality and Prodigy to disambiguate "Emerson" mentions in text to unique identifiers from Wikidata. As an example use-case, we consider three different people called Emerson: an Australian tennis player, an American writer, and a Brazilian footballer. See here for the previous scripts for spaCy v2.x.

📋 project.yml

The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.

⏯ Commands

The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.

| Command | Description |
| --- | --- |
| `download` | Download a spaCy model with pretrained vectors and NER component |
| `kb` | Create the Knowledge Base in spaCy and write it to file |
| `corpus` | Create a training and dev set from the manually annotated data |
| `train` | Train a new Entity Linking component |
| `evaluate` | Final evaluation on the dev data and printing the results |
| `clean` | Remove intermediate files |
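
To make the `corpus` step more concrete, here is a minimal sketch of a per-entity train/dev split. It is not the project's own script: the function name `split_by_label`, the `dev_per_label` parameter, and the placeholder labels are assumptions made for illustration. The idea is to hold out a fixed number of sentences per entity so that every "Emerson" appears in both sets.

```python
import random
from collections import defaultdict


def split_by_label(examples, dev_per_label=2, seed=0):
    """Split (text, label) pairs into train and dev lists, holding out
    `dev_per_label` examples of each label for the dev set."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, dev = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        dev.extend(items[:dev_per_label])
        train.extend(items[dev_per_label:])
    return train, dev


# Hypothetical data: 30 sentences spread evenly over 3 placeholder labels.
examples = [(f"sentence {i}", f"ID_{i % 3}") for i in range(30)]
train, dev = split_by_label(examples)
print(len(train), len(dev))  # 24 6
```

Splitting per label rather than over the whole list avoids a dev set that happens to miss one of the entities, which matters with only 30 annotated sentences.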

⏭ Workflows

The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.

| Workflow | Steps |
| --- | --- |
| `all` | `download` → `kb` → `corpus` → `train` → `evaluate` |
| `training` | `kb` → `corpus` → `train` → `evaluate` |
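
The "only re-run if their inputs have changed" behaviour can be pictured as a checksum comparison. The sketch below is not Weasel's implementation; `file_checksum`, `needs_rerun`, and the `recorded` dictionary are hypothetical names, and it only assumes that a step can be skipped when the hashes of its input files match those stored after the previous run.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path):
    """Return the MD5 hex digest of a file's contents."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def needs_rerun(deps, recorded):
    """Compare current checksums of `deps` with the `recorded` ones.

    Returns (changed, current): `changed` is True when any checksum
    differs, and `current` is the mapping to store for the next run."""
    current = {str(p): file_checksum(p) for p in deps}
    return current != recorded, current


with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "input.csv"  # stand-in for a command's input file
    dep.write_text("a,b\n")
    first, lock = needs_rerun([dep], recorded={})    # nothing recorded yet
    second, lock = needs_rerun([dep], recorded=lock)  # unchanged input
    dep.write_text("a,b,c\n")
    third, lock = needs_rerun([dep], recorded=lock)   # input was edited

print(first, second, third)  # True False True
```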

🗂 Assets

The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.

| File | Source | Description |
| --- | --- | --- |
| `assets/emerson_annotated_text.jsonl` | Local | The annotated data |
| `assets/entities.csv` | Local | The entities in the knowledge base |
| `assets/emerson_input_text.txt` | Local | The original input text |
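
As an illustration of how the entity list might be consumed, the following sketch parses a CSV of entities into lookup dictionaries. The column layout (identifier, name, description) is an assumption, not something this README specifies, and the identifiers and names in the sample are placeholders rather than real Wikidata QIDs; only the three descriptions come from the use-case described above.

```python
import csv
import io


def load_entities(handle):
    """Read rows of (identifier, name, description) from an open CSV
    handle and return two dicts keyed by identifier."""
    names, descriptions = {}, {}
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        entity_id, name, description = row
        names[entity_id] = name
        descriptions[entity_id] = description
    return names, descriptions


# Placeholder rows; the real file may use a different layout.
sample = io.StringIO(
    "ID_TENNIS,Emerson A,Australian tennis player\n"
    "ID_WRITER,Emerson B,American writer\n"
    "ID_FOOTBALL,Emerson C,Brazilian footballer\n"
)
names, descriptions = load_entities(sample)
print(len(names), descriptions["ID_WRITER"])  # 3 American writer
```

For the real asset, the same function could be called with `open("assets/entities.csv", newline="", encoding="utf8")`, provided the file follows the assumed three-column layout.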

Prodigy annotation

To perform the manual annotation in Prodigy, we wrote a custom recipe, `scripts/el_recipe.py`.

As input, the recipe needs the knowledge base `my_kb` and the NER pipeline `my_nlp`, which are created by the commands described above. In addition, the file `emerson_input_text.txt` lists 30 sentences from Wikipedia that contain only the mention "Emerson" and not the full name. These sentences are annotated with Prodigy by running:

prodigy entity_linker.manual emersons_annotated assets/emerson_input_text.txt temp/my_nlp/ temp/my_kb assets/entities.csv -F scripts/el_recipe.py

The resulting annotations are then exported to a file with:

prodigy db-out emersons_annotated >> emerson_annotated_text.jsonl

This JSONL file is also included in the `assets` subdirectory, so the scripts can be run without having to (re)do the manual annotation.
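
For readers who want to inspect the exported annotations, here is a minimal sketch that reads JSONL lines and keeps the accepted links. It assumes each record carries a `text` field, an `answer` field, and an `accept` list of selected option identifiers, as Prodigy's multiple-choice tasks typically do; whether the custom recipe produces exactly these fields is an assumption, and the sample records and identifiers are made up.

```python
import json
from collections import Counter


def accepted_links(lines):
    """Yield (text, entity_id) for every task that was accepted with
    exactly one selected option."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        task = json.loads(line)
        if task.get("answer") != "accept":
            continue  # skip rejected or ignored tasks
        selected = task.get("accept", [])
        if len(selected) != 1:
            continue  # skip tasks with no choice or an ambiguous one
        yield task["text"], selected[0]


# Hypothetical records standing in for lines of the exported file.
sample_lines = [
    json.dumps({"text": "Emerson won the match.", "accept": ["ID_TENNIS"], "answer": "accept"}),
    json.dumps({"text": "Emerson wrote essays.", "accept": ["ID_WRITER"], "answer": "accept"}),
    json.dumps({"text": "Emerson was mentioned.", "accept": [], "answer": "ignore"}),
]
links = list(accepted_links(sample_lines))
counts = Counter(entity_id for _, entity_id in links)
print(len(links), dict(counts))  # 2 {'ID_TENNIS': 1, 'ID_WRITER': 1}
```

Counting links per identifier like this is a quick way to check that each entity has enough annotated sentences before training.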
