diff --git a/docs/working-with-spacy.md b/docs/working-with-spacy.md
index c8b2355dc..551709378 100644
--- a/docs/working-with-spacy.md
+++ b/docs/working-with-spacy.md
@@ -173,12 +173,11 @@ Then we have sentences:
 
 ## Entity Linking
 
-Unfortunately, spaCy does not provide any pre-trained Entity Linking model currently. However, we found another great
-Entity Linking package called [Radboud Entity Linker (REL)](https://github.com/informagi/REL#rel-radboud-entity-linker).
+Unfortunately, spaCy does not currently provide a pre-trained entity linking model.
+However, we found another great entity linking package called [Radboud Entity Linker (REL)](https://github.com/informagi/REL#rel-radboud-entity-linker).
 
-In this section, we introduce an entity linking [script](../scripts/entity_linking.py) which links texts to both Wikipedia and Wikidata entities, using spaCy NER and
-REL Entity Linker. The input should be a JSONL file which has one json object per line, like [this](https://github.com/castorini/pyserini/blob/master/integrations/resources/sample_collection_jsonl/documents.jsonl),
-while the output is also a JSONL file, where each json object is of format:
+In this section, we introduce an entity linking [script](../scripts/entity_linking.py) which links texts to both Wikipedia and Wikidata entities, using spaCy NER and the REL entity linker.
+The input should be a JSONL file with one JSON object per line, like [this](https://github.com/castorini/pyserini/blob/master/integrations/resources/sample_collection_jsonl/documents.jsonl), while the output is also a JSONL file, where each JSON object has the following format:
 
 ```
 {
@@ -211,20 +210,17 @@ For example, given the input file
 
 ### Input Prep
 
-Let us take MS MARCO passage dataset as an example. We need to download the MS MARCO passage dataset and convert the tsv collection into jsonl files by following the
-detailed instruction [here](https://github.com/x389liu/pyserini/blob/master/docs/experiments-msmarco-passage.md#data-prep).
-Now we should have 9 jsonl files in `collections/msmarco-passage/collection_jsonl`, and each file path can be considered as
-`input_path` in our scripts.
+Let us take the MS MARCO passage dataset as an example.
+We need to download the MS MARCO passage dataset and convert the TSV collection into JSONL files by following the detailed instructions [here](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#data-prep).
+Now we should have 9 JSONL files in `collections/msmarco-passage/collection_jsonl`, and each file path can serve as `input_path` in our script.
 
 ### REL
 
-First, we follow the github [instruction](https://github.com/informagi/REL#installation-from-source) to install REL and
-download required generic file, appropriate wikipedia corpus as well as the corresponding ED model. Then we set up
-variable `base_url` as explained in this [tutorial](https://github.com/informagi/REL/blob/master/tutorials/01_How_to_get_started.md#how-to-get-started).
+First, we follow the GitHub [instructions](https://github.com/informagi/REL#installation-from-source) to install REL and download the required generic file, the appropriate Wikipedia corpus, and the corresponding ED model.
+Then we set the variable `base_url` as explained in this [tutorial](https://github.com/informagi/REL/blob/master/tutorials/01_How_to_get_started.md#how-to-get-started).
 Note that the `base_url` and ED model path are required as `rel_base_url` and `rel_ed_model_path` in our script respectively.
-Another parameter `rel_wiki_version` depends on the version of wikipedia corpus downloaded, e.g.
-`wiki_2019` for 2019 Wikipedia corpus.
+Another parameter, `rel_wiki_version`, depends on the version of the Wikipedia corpus downloaded, e.g. `wiki_2019` for the 2019 Wikipedia corpus.
 
 ### wikimapper
 
@@ -243,6 +239,5 @@ python entity_linking.py --input_path [input_jsonl_file] --rel_base_url [base_url]
     --spacy_model [en_core_web_sm, en_core_web_lg, etc.] --output_path [output_jsonl_file]
 ```
 
-It should take about 5 to 10 minutes to run entity linking on 5,000 MS MARCO passages on Compute Canada. See
-[this](https://github.com/castorini/onboarding/blob/master/docs/cc-guide.md#compute-canada) for instructions about
-running scripts on Compute Canada.
+It should take about 5 to 10 minutes to run entity linking on 5,000 MS MARCO passages on Compute Canada.
+See [this](https://github.com/castorini/onboarding/blob/master/docs/cc-guide.md#compute-canada) for instructions about running scripts on Compute Canada.
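
A note on the JSONL convention the doc change above relies on: each line of the input file is one standalone JSON object. A minimal sketch of producing and consuming such a file, assuming the `id`/`contents` field names used by Pyserini's sample collection; `read_jsonl` is a hypothetical helper, not part of the script being documented:

```python
import json

def read_jsonl(path):
    """Yield one parsed JSON object per line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Write a tiny two-document collection in the assumed shape, then read it back.
with open("documents.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "doc1", "contents": "Paris is the capital of France."}) + "\n")
    f.write(json.dumps({"id": "doc2", "contents": "Ottawa is the capital of Canada."}) + "\n")

docs = list(read_jsonl("documents.jsonl"))
print(len(docs), docs[0]["id"])  # → 2 doc1
```

Each of the 9 files produced during input prep follows this one-object-per-line layout, which is why any of them can be passed directly as `input_path`.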