Commit c7ece99 — Fix typos and reformat sentences
x389liu committed Nov 12, 2020 (1 parent c038312)
Showing 1 changed file with 12 additions and 17 deletions: docs/working-with-spacy.md

Then we have sentences:

## Entity Linking

Unfortunately, spaCy does not currently provide any pre-trained entity linking model.
However, we found another great entity linking package called [Radboud Entity Linker (REL)](https://github.com/informagi/REL#rel-radboud-entity-linker).

In this section, we introduce an entity linking [script](../scripts/entity_linking.py) which links texts to both Wikipedia and Wikidata entities, using spaCy NER and the REL entity linker.
The input should be a JSONL file with one JSON object per line, like [this](https://github.com/castorini/pyserini/blob/master/integrations/resources/sample_collection_jsonl/documents.jsonl), and the output is also a JSONL file, where each JSON object has the format:

```
{
  ...
}
```

For example, given the input file

### Input Prep

Let us take the MS MARCO passage dataset as an example.
We need to download the MS MARCO passage dataset and convert the TSV collection into JSONL files by following the detailed instructions [here](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md#data-prep).
Now we should have 9 JSONL files in `collections/msmarco-passage/collection_jsonl`, and each file path can be used as `input_path` in our scripts.
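As a rough sketch of what the script consumes, each input file can be read line by line as JSON objects; the field names below (`id`, `contents`) follow the Pyserini sample collection linked above:

```python
import json

def read_jsonl_lines(lines):
    """Yield (doc_id, text) pairs from an iterable of JSONL lines."""
    for line in lines:
        line = line.strip()
        if line:
            obj = json.loads(line)
            yield obj["id"], obj["contents"]

# A one-document example in the sample collection's format.
sample = ['{"id": "doc1", "contents": "Albert Einstein was born in Ulm."}']
print(list(read_jsonl_lines(sample)))
# → [('doc1', 'Albert Einstein was born in Ulm.')]
```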

### REL

First, we follow the GitHub [instructions](https://github.com/informagi/REL#installation-from-source) to install REL and download the required generic file, the appropriate Wikipedia corpus, and the corresponding ED model.
Then we set up the variable `base_url` as explained in this [tutorial](https://github.com/informagi/REL/blob/master/tutorials/01_How_to_get_started.md#how-to-get-started).

Note that the `base_url` and the ED model path are passed to our script as `rel_base_url` and `rel_ed_model_path` respectively.
Another parameter, `rel_wiki_version`, depends on the version of the Wikipedia corpus downloaded, e.g. `wiki_2019` for the 2019 Wikipedia corpus.
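Since all three REL parameters hang off the same download, it can help to derive them from one base directory and corpus year. The directory layout below (`<base_url>/<wiki_version>/generated/model`) is an assumption for illustration; check where your REL download actually placed the ED model:

```python
from pathlib import PurePosixPath

def rel_params(base_url, wiki_year):
    """Build the REL arguments the script expects from one base path and corpus year.

    The 'generated/model' suffix is an assumed layout, not mandated by REL.
    """
    wiki_version = f"wiki_{wiki_year}"
    ed_model_path = str(PurePosixPath(base_url) / wiki_version / "generated" / "model")
    return {
        "rel_base_url": base_url,
        "rel_wiki_version": wiki_version,
        "rel_ed_model_path": ed_model_path,
    }

print(rel_params("/data/rel", 2019))
```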

### wikimapper

```
python entity_linking.py --input_path [input_jsonl_file] --rel_base_url [base_url] \
    --spacy_model [en_core_web_sm, en_core_web_lg, etc.] --output_path [output_jsonl_file]
```

It should take about 5 to 10 minutes to run entity linking on 5,000 MS MARCO passages on Compute Canada.
See [this guide](https://github.com/castorini/onboarding/blob/master/docs/cc-guide.md#compute-canada) for instructions about running scripts on Compute Canada.
