LTR documentation refactoring (castorini#1093)
+ ltr documentation refactoring
+ move query in searcher
stephaniewhoo authored Apr 1, 2022
1 parent a7d31b2 commit 79cd796
Showing 11 changed files with 281 additions and 167 deletions.
40 changes: 24 additions & 16 deletions docs/experiments-ltr-msmarco-document-reranking.md
@@ -11,19 +11,10 @@ Learning-to-rank serves as a second-stage reranker after BM25 retrieval; we use

We're going to use the repository's root directory as the working directory.

First, prepare queries:

```bash
mkdir collections/msmarco-ltr-document

python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/queries.dev.small.json
```

The above scripts convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
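
For reference, a minimal sketch of inspecting the converted queries (assuming the converter writes one JSON object per line; the field names come from the description above):

```python
import json

# Peek at the first converted query; the path matches the --output above.
with open('collections/msmarco-ltr-document/queries.dev.small.json') as f:
    first_query = json.loads(f.readline())

# Per the description above, we expect at least these fields:
for field in ['text', 'text_unlemm', 'raw', 'text_bert_tok']:
    print(field, '->', str(first_query.get(field, '<missing>'))[:60])
```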

Download our already trained IBM model:

```bash
@@ -43,16 +34,15 @@ Now we have everything ready and can run inference:
```bash
python -m pyserini.search.lucene.ltr \
--index msmarco-doc-per-passage-ltr \
--queries collections/msmarco-ltr-document \
--model collections/msmarco-ltr-document/msmarco-passage-ltr-mrr-v1 \
--ibm-model collections/msmarco-ltr-document/ibm_model/ \
--data document \
--topic tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--output runs/run.ltr.msmarco-doc.tsv \
--granularity document \
--max-passage --hits 10000
```

**TODO**: `--queries collections/msmarco-ltr-document` should refer to the file, i.e., `collections/msmarco-ltr-document/queries.dev.small.json`.

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

```bash
@@ -66,12 +56,29 @@ QueriesRanked: 5193
#####################
```
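
The script reports MRR@100 over the 5,193 dev queries. As a rough illustration of the metric (not the official implementation), mean reciprocal rank can be computed like this, where `run` and `qrels` are hypothetical dictionaries mapping each query id to its ranked document ids and its relevant document ids, respectively:

```python
def mrr_at_k(run, qrels, k=100):
    """Mean reciprocal rank at cutoff k.

    run:   dict[qid] -> list of docids, best first
    qrels: dict[qid] -> set of relevant docids
    """
    total = 0.0
    for qid, ranking in run.items():
        rr = 0.0
        for rank, docid in enumerate(ranking[:k], start=1):
            if docid in qrels.get(qid, set()):
                rr = 1.0 / rank
                break
        total += rr
    return total / len(run)
```

For the actual numbers, rely on the official evaluation script above.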

**TODO**: Add conversion to `trec_eval` format here - basically, make the passage and document pages parallel with each other.
We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
For that, we first need to convert the run file and the qrels file into TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.ltr.msmarco-doc.tsv --output runs/run.ltr.msmarco-doc.trec

$ python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
--input tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--output collections/msmarco-ltr-document/qrels.dev.small.trec
```
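
Under the hood, the run conversion is just a change of line format: the MS MARCO run file has `qid <tab> docid <tab> rank` lines, while TREC run files use `qid Q0 docid rank score tag`. A rough sketch of the idea (the real converter is `pyserini.eval.convert_msmarco_run_to_trec_run`; using `1/rank` as a stand-in score is an assumption that only preserves ordering):

```python
# Illustrative only -- use pyserini.eval.convert_msmarco_run_to_trec_run in practice.
def msmarco_run_to_trec(in_path, out_path, tag='ltr'):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            qid, docid, rank = line.split()
            # TREC run format: qid Q0 docid rank score run_tag
            fout.write(f'{qid} Q0 {docid} {rank} {1.0 / int(rank):.6f} {tag}\n')
```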

And then run the `trec_eval` tool:

```bash
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
collections/msmarco-ltr-document/qrels.dev.small.trec runs/run.ltr.msmarco-doc.trec

map all 0.3109
recall_1000 all 0.9268
```

## Building the Index from Scratch
First, we need to download the collection.

```bash
Expand All @@ -95,6 +102,7 @@ python scripts/ltr_msmarco/convert_passage_doc.py \
```

The above script will convert the collection and queries to JSON files with `text_unlemm`, `analyzed`, `text_bert_tok`, and `raw` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
Next, we need to convert the MS MARCO JSON collection into Anserini's JSONL files (which have one JSON object per line):

```bash
34 changes: 8 additions & 26 deletions docs/experiments-ltr-msmarco-passage-reranking.md
@@ -9,33 +9,12 @@ LTR serves as a second-stage reranker after BM25 retrieval.

## Data Prep

We're going to use `collections/msmarco-ltr-passage/` as the working directory to preprocess the data.
First, download the MS MARCO passage dataset `collectionandqueries.tar.gz`, per instructions [here](experiments-msmarco-passage.md).
Then:
We're going to use the repository's root directory as the working directory.

```bash
mkdir collections/msmarco-ltr-passage/

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.eval.small.tsv \
--output collections/msmarco-ltr-passage/queries.eval.small.json

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.dev.small.tsv \
--output collections/msmarco-ltr-passage/queries.dev.small.json

python scripts/ltr_msmarco/convert_queries.py \
--input collections/msmarco-passage/queries.train.tsv \
--output collections/msmarco-ltr-passage/queries.train.json
mkdir collections/msmarco-ltr-passage/
```

**TODO**: Change to the queries already stored in `tools/topics-and-qrels/`; we don't need to process training queries, and we actually don't need to download the corpus at this point (only for building the index from scratch below).

The above scripts convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields.
The first two scripts take ~1 min and the third one is a bit longer (~1.5h) since it processes _all_ the training queries (although not necessary for running the commands below).

Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).

## Performing Retrieval

Download our already trained IBM model:
@@ -57,10 +36,10 @@ The following command generates our reranking results with our prebuilt index:
```bash
python -m pyserini.search.lucene.ltr \
--index msmarco-passage-ltr \
--queries collections/msmarco-ltr-passage \
--model runs/msmarco-passage-ltr-mrr-v1 \
--ibm-model collections/msmarco-ltr-passage/ibm_model/ \
--data passage \
--topic tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--output runs/run.ltr.msmarco-passage.tsv
```

@@ -109,7 +88,9 @@ On the other hand, recall@1000 provides the upper bound effectiveness of downstream
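
As a reminder of what this metric captures, here is a minimal sketch of recall@1000 (again with hypothetical `run`/`qrels` dictionaries, not Pyserini's evaluation code):

```python
def recall_at_k(run, qrels, k=1000):
    """Average, over queries, of the fraction of relevant docs found in the top k."""
    per_query = []
    for qid, relevant in qrels.items():
        if not relevant:
            continue
        retrieved = set(run.get(qid, [])[:k])
        per_query.append(len(retrieved & relevant) / len(relevant))
    return sum(per_query) / len(per_query)
```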

## Building the Index from Scratch

To build an index from scratch, we need to preprocess the collection.

First, download the MS MARCO passage dataset `collectionandqueries.tar.gz`, per instructions [here](experiments-msmarco-passage.md).

```bash
python scripts/ltr_msmarco/convert_passage.py \
@@ -118,6 +99,7 @@ python scripts/ltr_msmarco/convert_passage.py \
```

The above script will convert the collection to JSON files with `text_unlemm`, `analyzed`, `text_bert_tok` and `raw` fields.
Note that the tokenization script depends on spaCy; our implementation currently depends on v3.2.1 (this is potentially important as tokenization might change from version to version).
Next, we need to convert the MS MARCO JSON collection into Anserini's JSONL format:

```bash
72 changes: 22 additions & 50 deletions docs/experiments-msmarco-irst.md
@@ -14,21 +14,12 @@ For IRST, we make the corpus as well as the pre-built indexes available to downl
Here, we start from MS MARCO [passage corpus](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) that has already been processed.
As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).


The script below converts queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields:

```bash
mkdir irst_test
python scripts/ltr_msmarco/convert_queries.py --input path_to_topics --output irst_test/queries.irst_topics.dev.small.json
```

Here, `path_to_topics` is the path to a topics file in the `tools/topics-and-qrels/` folder, e.g., `tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt`.


### Performing End-to-End Retrieval Using Pretrained Model

Download pretrained IBM models:
```bash
mkdir irst_test/

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
```
@@ -38,8 +29,8 @@ Next, we can run our script to get our end-to-end results.
IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.1
Expand All @@ -48,29 +39,28 @@ python -m pyserini.search.lucene.irst \
IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--max_sim
--max-sim
```
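
Apart from the different `--alpha` values, the two runs differ in how IBM Model 1 translation probabilities are aggregated when scoring a query term against a document: summing over document terms versus taking the single best match (`--max-sim`). A purely schematic sketch of that distinction (this is not Pyserini's actual scoring function, and the role of `--alpha`, a weighting/smoothing parameter, is omitted):

```python
def term_translation_score(query_term, doc_terms, translate, use_max=False):
    """Aggregate t(query_term | doc_term) over the document's terms.

    translate: dict mapping (query_term, doc_term) -> translation probability,
    i.e., a toy stand-in for the downloaded IBM Model 1 table.
    """
    probs = [translate.get((query_term, doc_term), 0.0) for doc_term in doc_terms]
    if not probs:
        return 0.0
    return max(probs) if use_max else sum(probs) / len(doc_terms)
```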

For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `tools/topics-and-qrels/topics.dl19-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `dl19-passage` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `dl20` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `msmarco-passage-dev-subset` <br />



After the run finishes, we can evaluate the results with `trec_eval`:

For TREC DL 2019, use this command to evaluate your run file:
@@ -81,23 +71,14 @@ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl19-passage irs

Similarly for TREC DL 2020,
```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl20-passage irst_test/regression_test_sum.dl20.txt
```

For MS MARCO Passage V1, there is no need to use the `-l 2` option (it tells `trec_eval` to treat only judgments of 2 or higher as relevant, which matters for the graded TREC DL judgments):
```bash
python -m pyserini.eval.trec_eval -c -m ndcg_cut -m map -m recip_rank msmarco-passage-dev-subset irst_test/regression_test_sum.msmarco-passage-dev-subset.txt
```

`--qrel_file`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Passage: `tools/topics-and-qrels/qrels.dl19-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Passage: `tools/topics-and-qrels/qrels.dl20-passage.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Passage V1: `tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt` <br />


Note that we evaluate MRR and NDCG at a cutoff of 10 hits to match the official evaluation metrics.
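
For reference, a simplified sketch of NDCG@10 with linear gain (one common formulation; `trec_eval`'s exact implementation details may differ), where `ranking` is a list of doc ids and `relevance` maps doc ids to graded judgments:

```python
import math

def ndcg_at_k(ranking, relevance, k=10):
    gains = [relevance.get(docid, 0) for docid in ranking[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```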



## Document Reranking

Expand All @@ -107,15 +88,6 @@ Note that we evaluate MRR and NDCG at a cutoff of 10 hits to match the official
For MS MARCO document ranking, each MS MARCO document is first segmented into passages, and each passage is treated as a unit of indexing.
We utilize the MaxP technique during ranking, i.e., we score each document based on its highest-scoring passage.
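
A minimal sketch of the MaxP aggregation step (the `docid#segment` id format and the dictionary layout here are assumptions for illustration):

```python
def maxp_aggregate(segment_run):
    """Collapse passage-level scores into document scores by taking the max.

    segment_run: dict[qid] -> list of (segment_id, score) pairs,
    where a segment id like 'D123#4' encodes document id + segment index.
    """
    doc_run = {}
    for qid, segments in segment_run.items():
        best = {}
        for segment_id, score in segments:
            docid = segment_id.split('#')[0]
            if docid not in best or score > best[docid]:
                best[docid] = score
        doc_run[qid] = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return doc_run
```

The aggregation script used below applies the same idea to the reranked segment run.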

The scripts below convert queries to JSON objects with `text`, `text_unlemm`, `raw`, and `text_bert_tok` fields:

```bash
mkdir irst_test
python scripts/ltr_msmarco/convert_queries.py --input path_to_topics --output irst_test/queries.irst_topics.dev.small.json
```

We can now perform retrieval in Anserini to generate the baseline.

### Performing End-to-End Retrieval Using Pretrained Model


@@ -132,7 +104,7 @@ IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--index msmarco-document-segment-ltr \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.3 \
@@ -143,23 +115,28 @@ IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.irst_topics.dev.small.json \
--topics topics \
--index msmarco-document-segment-ltr \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--hits 10000 \
--max_sim
--max-sim
```


For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/topics.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/topics.msmarco-doc.dev.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `dl19-doc` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `dl20-doc` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `msmarco-doc` <br />

The reranked run file contains the top 10,000 document segments, so we need to use the MaxP technique to get a score for each document.

@@ -172,7 +149,6 @@ We can use the official TREC evaluation tool, `trec_eval`, to compute other metrics.
For that, we first convert the run file into TREC format:

```bash
python tools/scripts/msmarco/convert_msmarco_to_trec_run.py --input irst_test/regression_test_sum_maxP.irst_topics.tsv --output irst_test/regression_test_sum_maxP.irst_topics.trec
```


For TREC DL 2019, use this command to evaluate your run file:

```bash
@@ -187,11 +163,7 @@ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl20-doc irst_te
For MS MARCO Document V1, there is no need to use the `-l 2` option:
```bash
python -m pyserini.eval.trec_eval -c -M 100 -m ndcg_cut -m map -m recip_rank msmarco-doc-dev irst_test/regression_test_sum_maxP.msmarco-doc.trec
```

`--qrel_file`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/qrels.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/qrels.dl20-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/qrels.msmarco-doc.dev.txt` <br />

## Results
### Passage Ranking Datasets
16 changes: 5 additions & 11 deletions integrations/sparse/test_lucenesearcher_check_irst.py
@@ -36,18 +36,13 @@ def setUp(self):
ibm_model_tar_name = 'ibm_model_1_bert_tok_20211117.tar.gz'
os.system(f'wget {ibm_model_url} -P irst_test/')
os.system(f'tar -xzvf irst_test/{ibm_model_tar_name} -C irst_test')
# queries process
os.system('python scripts/ltr_msmarco/convert_queries.py \
--input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--output irst_test/queries.dev.small.json')
# qrel
self.qrels_path = f'{self.pyserini_root}/tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt'

def test_sum_aggregation(self):
os.system('python -m pyserini.search.lucene.irst \
--qrels tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.dev.small.json \
--topics ./tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_sum.txt \
--alpha 0.1 ')
@@ -67,13 +62,12 @@ def test_sum_aggregation(self):

def test_max_aggregation(self):
os.system('python -m pyserini.search.lucene.irst \
--qrels tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--tran_path irst_test/ibm_model_1_bert_tok_20211117/ \
--query_path irst_test/queries.dev.small.json \
--topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--tran-path irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-passage-ltr \
--output irst_test/regression_test_max.txt \
--alpha 0.3 \
--max_sim')
--max-sim')

score_cmd = f'{self.pyserini_root}/tools/eval/trec_eval.9.0.4/trec_eval \
-c -M1000 -m map -m ndcg_cut.20 {self.qrels_path} irst_test/regression_test_max.txt'
@@ -39,11 +39,13 @@ def test_reranking(self):
ibm_model_url = 'https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/model-ltr-ibm.tar.gz'
ibm_model_tar_name = 'model-ltr-ibm.tar.gz'
os.system(f'wget {ibm_model_url} -P ltr_test/')

# queries process
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-doc.dev.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.search.lucene.ltr --data document --model ltr_test/msmarco-passage-ltr-mrr-v1/ --index msmarco-doc-per-passage-ltr --ibm-model ltr_test/ibm_model/ --queries ltr_test --output ltr_test/{outp} --max-passage --hits 10000')
os.system(f'python -m pyserini.search.lucene.ltr \
--topic tools/topics-and-qrels/topics.msmarco-doc.dev.txt \
--model ltr_test/msmarco-passage-ltr-mrr-v1/ \
--qrel tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--index msmarco-doc-per-passage-ltr --ibm-model ltr_test/ibm_model/ \
--granularity document --output ltr_test/{outp} --max-passage --hits 10000')

result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_doc_eval.py --judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt --run ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @100:'), result.find('\nQueriesRanked: 5193\n#####################\n')
@@ -41,8 +41,12 @@ def test_reranking(self):
os.system(f'wget {ibm_model_url} -P ltr_test/')
os.system(f'tar -xzvf ltr_test/{ibm_model_tar_name} -C ltr_test')
#queries process
os.system('python scripts/ltr_msmarco/convert_queries.py --input tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt --output ltr_test/queries.dev.small.json')
os.system(f'python -m pyserini.search.lucene.ltr --model ltr_test/msmarco-passage-ltr-mrr-v1 --data passage --index msmarco-passage-ltr --ibm-model ltr_test/ibm_model/ --queries ltr_test --output-format tsv --output ltr_test/{outp}')
os.system(f'python -m pyserini.search.lucene.ltr \
--model ltr_test/msmarco-passage-ltr-mrr-v1 \
--topic tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt \
--qrel tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
--index msmarco-passage-ltr --ibm-model ltr_test/ibm_model/ \
--output-format tsv --output ltr_test/{outp}')
result = subprocess.check_output(f'python tools/scripts/msmarco/msmarco_passage_eval.py tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt ltr_test/{outp}', shell=True).decode(sys.stdout.encoding)
a,b = result.find('#####################\nMRR @10:'), result.find('\nQueriesRanked: 6980\n#####################\n')
mrr = result[a+31:b]