LTR documentation refactoring (castorini#1093)
+ ltr documentation refactoring
+ move query in searcher
stephaniewhoo authored Apr 1, 2022
1 parent a7d31b2 commit 79cd796
Showing 11 changed files with 281 additions and 167 deletions.
40 changes: 24 additions & 16 deletions docs/experiments-ltr-msmarco-document-reranking.md
@@ -11,19 +11,10 @@ Learning-to-rank serves as a second-stage reranker after BM25 retrieval; we use

We're going to use the repository's root directory as the working directory.

First, prepare queries:

```bash
mkdir collections/msmarco-ltr-document

python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/queries.dev.small.json
```

The above scripts convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
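
For reference, a minimal sketch of inspecting the converted queries (assuming the converter writes one JSON object per line; the field names come from the description above):

```python
import json

# Peek at the first converted query; the path matches the --output above.
with open('collections/msmarco-ltr-document/queries.dev.small.json') as f:
    first_query = json.loads(f.readline())

# Per the description above, we expect at least these fields:
for field in ['text', 'text_unlemm', 'raw', 'text_bert_tok']:
    print(field, '->', str(first_query.get(field, '<missing>'))[:60])
```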

Download our already trained IBM model:

```bash
@@ -43,16 +34,15 @@ Now we have everything ready and can run inference:
```bash
python -m pyserini.search.lucene.ltr \
--index msmarco-doc-per-passage-ltr \
--queries collections/msmarco-ltr-document \
--model collections/msmarco-ltr-document/msmarco-passage-ltr-mrr-v1 \
--ibm-model collections/msmarco-ltr-document/ibm_model/ \
--data document \
--topic tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--output runs/run.ltr.msmarco-doc.tsv \
--granularity document \
--max-passage --hits 10000
```

**TODO**: `--queries collections/msmarco-ltr-document` should refer to the file, i.e., `collections/msmarco-ltr-document/queries.dev.small.json`.

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

```bash
@@ -66,12 +56,29 @@ QueriesRanked: 5193
#####################
```
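
The script reports MRR@100 over the 5,193 dev queries. As a rough illustration of the metric (not the official implementation), mean reciprocal rank can be computed like this, where `run` and `qrels` are hypothetical dictionaries mapping each query id to its ranked document ids and its relevant document ids, respectively:

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank at cutoff k.

    run:   dict[qid] -> list of docids, best first
    qrels: dict[qid] -> set of relevant docids
    """
    total = 0.0
    for qid, ranking in run.items():
        rr = 0.0
        for rank, docid in enumerate(ranking[:k], start=1):
            if docid in qrels.get(qid, set()):
                rr = 1.0 / rank
                break
        total += rr
    return total / len(run)
```

For the actual numbers, rely on the official evaluation script above.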

**TODO**: Add conversion to `trec_eval` format here - basically, make the passage and document pages parallel with each other.
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
For that, we first need to convert the run file and the qrels file into TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.ltr.msmarco-doc.tsv --output runs/run.ltr.msmarco-doc.trec

$ python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/qrels.dev.small.trec
```
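
Under the hood, the run conversion is just a change of line format: the MS MARCO run file has `qid <tab> docid <tab> rank` lines, while TREC run files use `qid Q0 docid rank score tag`. A rough sketch of the idea (the real converter is `pyserini.eval.convert_msmarco_run_to_trec_run`; using `1/rank` as a stand-in score is an assumption that only preserves ordering):

```python
# Illustrative only -- use pyserini.eval.convert_msmarco_run_to_trec_run in practice.
def msmarco_run_to_trec(in_path, out_path, tag='ltr'):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            qid, docid, rank = line.split()
            # TREC run format: qid Q0 docid rank score run_tag
            fout.write(f'{qid} Q0 {docid} {rank} {1.0 / int(rank):.6f} {tag}\n')
```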

And then run the `trec_eval` tool:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-ltr-document/qrels.dev.small.trec runs/run.ltr.msmarco-doc.trec

map all 0.3109
recall_1000 all 0.9268
```

## Building the Index from Scratch
First, we need to download the collection.

```bash
Expand All @@ -95,6 +102,7 @@ python scripts/ltr_msmarco/convert_passage_doc.py \
```

The above script will convert the collection and queries to JSON files with `text_unlemm`, `analyzed`, `text_bert_tok`, and `raw` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
Next, we need to convert the MS MARCO JSON collection into Anserini's JSONL files (which have one JSON object per line):

```bash
34 changes: 8 additions & 26 deletions docs/experiments-ltr-msmarco-passage-reranking.md
@@ -9,33 +9,12 @@ LTR serves as a second-stage reranker after BM25 retrieval.

## Data Prep

We're going to use `collections/msmarco-ltr-passage/` as the working directory to preprocess the data.
First, download the MS MARCO passage dataset `collectionandqueries.tar.gz`, per instructions [here](experiments-msmarco-passage.md).
Then:
We're going to use the repository's root directory as the working directory.

```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
mkdir collections/msmarco-ltr-passage/
```

**TODO**: Change to the queries already stored in `tools/topics-and-qrels/`; we don't need to process training queries, and we actually don't need to download the corpus at this point (only for building the index from scratch below).

The above scripts convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields.
The first two scripts take ~1 min and the third one is a bit longer (~1.5h) since it processes _all_ the training queries (although not necessary for running the commands below).

Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).

## Performing Retrieval

Download our already trained IBM model:
@@ -57,10 +36,10 @@ The following command generates our reranking results with our prebuilt index:
```bash
python -m pyserini.search.lucene.ltr \
--index msmarco-passage-ltr \
--queries collections/msmarco-ltr-passage \
--model runs/msmarco-passage-ltr-mrr-v1 \
--ibm-model collections/msmarco-ltr-passage/ibm_model/ \
--data passage \
--topic tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--output runs/run.ltr.msmarco-passage.tsv
```

@@ -109,7 +88,9 @@ On the other hand, recall@1000 provides the upper bound effectiveness of downstream
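
As a reminder of what this metric captures, here is a minimal sketch of recall@1000 (again with hypothetical `run`/`qrels` dictionaries, not Pyserini's evaluation code):

```python
def recall_at_k(run, qrels, k=1000):
    """Average, over queries, of the fraction of relevant docs found in the top k."""
    per_query = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        retrieved = set(run.get(qid, [])[:k])
        per_query.append(len(retrieved & relevant) / len(relevant))
    return sum(per_query) / len(per_query)
```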

## Building the Index from Scratch

To build an index from scratch, we need to preprocess the collection.

First, download the MS MARCO passage dataset `collectionandqueries.tar.gz`, per instructions [here](experiments-msmarco-passage.md).

```bash
python scripts/ltr_msmarco/convert_passage.py \
@@ -118,6 +99,7 @@ python scripts/ltr_msmarco/convert_passage.py \
```

The above script will convert the collection to JSON files with `text_unlemm`, `analyzed`, `text_bert_tok` and `raw` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
Next, we need to convert the MS MARCO JSON collection into Anserini's JSONL format:

```bash
72 changes: 22 additions & 50 deletions docs/experiments-msmarco-irst.md
@@ -14,21 +14,12 @@ For IRST, we make the corpus as well as the pre-built indexes available to downl
Here, we start from MS MARCO [passage corpus](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) that has already been processed.
As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).


The script below converts queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields:

```bash
mkdir irst_test
python scripts/ltr_msmarco/convert_queries.py --input path_to_topics --output irst_test/queries.irst_topics.dev.small.json
```

Here, `path_to_topics` is the path to a topics file in the `tools/topics-and-qrels/` folder, e.g., `tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt`.


### Performing End-to-End Retrieval Using Pretrained Model

Download pretrained IBM models:
```bash
mkdir irst_test/

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
```
@@ -38,8 +29,8 @@ Next, we can run our script to get our end-to-end results.
IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.1
Expand All @@ -48,29 +39,28 @@ python -m pyserini.search.lucene.irst \
IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--max_sim
--max-sim
```
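
Apart from the different `--alpha` values, the two runs differ in how IBM Model 1 translation probabilities are aggregated when scoring a query term against a document: summing over document terms versus taking the single best match (`--max-sim`). A purely schematic sketch of that distinction (this is not Pyserini's actual scoring function, and the role of `--alpha`, a weighting/smoothing parameter, is omitted):

```python
def term_translation_score(query_term, doc_terms, translate, use_max=False):
    """Aggregate t(query_term | doc_term) over the document's terms.

    translate: dict mapping (query_term, doc_term) -> translation probability,
    i.e., a toy stand-in for the downloaded IBM Model 1 table.
    """
    probs = [translate.get((query_term, doc_term), 0.0) for doc_term in doc_terms]
    if not probs:
        return 0.0
    return max(probs) if use_max else sum(probs) / len(doc_terms)
```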

For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `tools/topics-and-qrels/topics.dl19-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `dl19-passage` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `dl20` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `msmarco-passage-dev-subset` <br />



After the run finishes, we can evaluate the results with `trec_eval`:

For TREC DL 2019, use this command to evaluate your run file:
@@ -81,23 +71,14 @@ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl19-passage irs

Similarly for TREC DL 2020,
```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl20-passage irst_test/regression_test_sum.dl20.txt
```

For MS MARCO Passage V1, there is no need to use the `-l 2` option (it tells `trec_eval` to treat only judgments of 2 or higher as relevant, which matters for the graded TREC DL judgments):
```bash
python -m pyserini.eval.trec_eval -c -m ndcg_cut -m map -m recip_rank msmarco-passage-dev-subset irst_test/regression_test_sum.msmarco-passage-dev-subset.txt
```

`--qrel_file`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `tools/topics-and-qrels/qrels.dl19-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `tools/topics-and-qrels/qrels.dl20-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt` <br />


Note that we evaluate MRR and NDCG at a cutoff of 10 hits to match the official evaluation metrics.
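
For reference, a simplified sketch of NDCG@10 with linear gain (one common formulation; `trec_eval`'s exact implementation details may differ), where `ranking` is a list of doc ids and `relevance` maps doc ids to graded judgments:

```python
import math

def ndcg_at_k(ranking, relevance, k=10):
    gains = [relevance.get(docid, 0) for docid in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```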



## Document Reranking

Expand All @@ -107,15 +88,6 @@ Note that we evaluate MRR and NDCG at a cutoff of 10 hits to match the official
For MS MARCO document ranking, each MS MARCO document is first segmented into passages, and each passage is treated as a unit of indexing.
We utilize the MaxP technique during ranking, i.e., we score each document based on its highest-scoring passage.
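
A minimal sketch of the MaxP aggregation step (the `docid#segment` id format and the dictionary layout here are assumptions for illustration):

```python
def maxp_aggregate(segment_run):
    """Collapse passage-level scores into document scores by taking the max.

    segment_run: dict[qid] -> list of (segment_id, score) pairs,
    where a segment id like 'D123#4' encodes document id + segment index.
    """
    doc_run = {}
    for qid, segments in segment_run.items():
        best = {}
        for segment_id, score in segments:
            docid = segment_id.split('#')[0]
            if docid not in best or score > best[docid]:
                best[docid] = score
        doc_run[qid] = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return doc_run
```

The aggregation script used below applies the same idea to the reranked segment run.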

The scripts below convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields:

```bash
mkdir irst_test
python scripts/ltr_msmarco/convert_queries.py --input path_to_topics --output irst_test/queries.irst_topics.dev.small.json
```

We can now perform retrieval in Anserini to generate the baseline.

### Performing End-to-End Retrieval Using Pretrained Model


@@ -132,7 +104,7 @@ IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--index msmarco-document-segment-ltr \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.3 \
@@ -143,23 +115,28 @@ IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--index msmarco-document-segment-ltr \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--hits 10000 \
--max_sim
--max-sim
```


For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/topics.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/topics.msmarco-doc.dev.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `dl19-doc` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `dl20-doc` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `msmarco-doc` <br />

The reranked run file contains the top 10,000 document segments, so we need to use the MaxP technique to get a score for each document.

@@ -172,7 +149,6 @@ We can use the official TREC evaluation tool, `trec_eval`, to compute other metrics.
For that, we first convert the run file into TREC format:

```bash
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py --input irst_test/regression_test_sum_maxP.irst_topics.tsv --output irst_test/regression_test_sum_maxP.irst_topics.trec
```


For TREC DL 2019, use this command to evaluate your run file:

```bash
@@ -187,11 +163,7 @@ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl20-doc irst_te
For MS MARCO Document V1, there is no need to use the `-l 2` option:
```bash
python -m pyserini.eval.trec_eval -c -M 100 -m ndcg_cut -m map -m recip_rank msmarco-doc-dev irst_test/regression_test_sum_maxP.msmarco-doc.trec
```

`--qrel_file`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/qrels.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/qrels.dl20-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/qrels.msmarco-doc.dev.txt` <br />

## Results
### Passage Ranking Datasets
16 changes: 5 additions & 11 deletions integrations/sparse/test_lucenesearcher_check_irst.py
@@ -36,18 +36,13 @@ def setUp(self):
ibm_model_tar_name = 'ibm_model_1_bert_tok_20211117.tar.gz'
os.system(f'wget {ibm_model_url} -P irst_test/')
os.system(f'tar -xzvf irst_test/{ibm_model_tar_name} -C irst_test')
# queries process
os.system('python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--output irst_test/queries.dev.small.json')
# qrel
self.qrels_path = f'{self.pyserini_root}/tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt'

def test_sum_aggregation(self):
os.system('python -m pyserini.search.lucene.irst \
--qrels tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.dev.small.json \
--topics ./tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_sum.txt \
--alpha 0.1 ')
@@ -67,13 +62,12 @@ def test_sum_aggregation(self):

def test_max_aggregation(self):
os.system('python -m pyserini.search.lucene.irst \
--qrels tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.dev.small.json \
--topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_max.txt \
--alpha 0.3 \
--max_sim')
--max-sim')

score_cmd = f'{self.pyserini_root}/tools/eval/trec_eval.9.0.4/trec_eval \
-c -M1000 -m map -m ndcg_cut.20 {self.qrels_path} irst_test/regression_test_max.txt'
@@ -39,11 +39,13 @@ def test_reranking(self):
ibm_model_url = 'https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-ibm.tar.gz'
ibm_model_tar_name = 'model-ltr-ibm.tar.gz'
os.system(f'wget {ibm_model_url} -P ltr_test/')

# queries process
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-doc.dev.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.search.lucene.ltr --data document --model ltr_test/msmarco-passage-ltr-mrr-v1/ --index msmarco-doc-per-passage-ltr --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp} --max-passage --hits 10000')
os.system(f'python -m pyserini.search.lucene.ltr \
--topic tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--model ltr_test/msmarco-passage-ltr-mrr-v1/ \
--qrel tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--index msmarco-doc-per-passage-ltr --ibm-model ltr_test/ibm_model/ \
--granularity document --output ltr_test/{outp} --max-passage --hits 10000')

result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @100:'), result.find('\nQueriesRanked: 5193\n#####################\n')
@@ -41,8 +41,12 @@ def test_reranking(self):
os.system(f'wget {ibm_model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
#queries process
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.search.lucene.ltr --model ltr_test/msmarco-passage-ltr-mrr-v1 --data passage --index msmarco-passage-ltr --ibm-model ltr_test/ibm_model/ --queries ltr_test --output-format tsv --output ltr_test/{outp}')
os.system(f'python -m pyserini.search.lucene.ltr \
--model ltr_test/msmarco-passage-ltr-mrr-v1 \
--topic tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--index msmarco-passage-ltr --ibm-model ltr_test/ibm_model/ \
--output-format tsv --output ltr_test/{outp}')
result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_passage_eval.py tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @10:'), result.find('\nQueriesRanked: 6980\n#####################\n')
mrr = result[a+31:b]