IRST with normal index + wp stat pickle file (#1116)
+ Change IRST to use the normal (standard pre-built) index, with the WordPiece term frequencies of the collection stored in a pickle file

Co-authored-by: Yuqi Liu <y899liu@uwaterloo.ca>
stephaniewhoo and Yuqi Liu authored Apr 15, 2022
1 parent 5e05f60 commit ac4cc9a
Showing 4 changed files with 278 additions and 144 deletions.
docs/experiments-msmarco-irst.md: 145 additions & 58 deletions
For IRST, we make the corpus as well as the pre-built indexes available to download.
Here, we start from the MS MARCO [passage corpus](https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md) that has already been processed.
As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
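
The pre-built indexes referenced below can also be fetched programmatically. A minimal sketch using Pyserini's pre-built index support (the index name matches the one used in the retrieval commands below; the query text is arbitrary, just a smoke test):

```python
from pyserini.search.lucene import LuceneSearcher

# Downloads and caches the pre-built MS MARCO passage index on first use.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')

hits = searcher.search('what is a lobster roll', k=10)
for i, hit in enumerate(hits):
    print(f'{i + 1:2} {hit.docid:15} {hit.score:.4f}')
```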

### Performing End-to-End Retrieval Using an Already-Trained Model

The IBM model we used in this experiment is described in the Boytsov et al. [paper](https://arxiv.org/pdf/2102.06815.pdf).
Note that there is a separate guide for training the IBM model on [FlexNeuART](https://github.com/oaqa/FlexNeuART/tree/master/demo).

Download trained IBM model:
```bash
mkdir irst_test/

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
```

Download the WordPiece (wp) term frequency statistics for the passage collection:
```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-passage.20220411.pickle -P irst_test/
```
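
As a quick sanity check, you can inspect the downloaded statistics in Python. This is a minimal sketch, assuming the pickle holds a mapping from BERT WordPiece tokens to collection-level term frequencies (the exact schema is an assumption, not documented here):

```python
import pickle

with open('irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle', 'rb') as f:
    wp_stats = pickle.load(f)

print(type(wp_stats))  # confirm the container type before relying on it
if isinstance(wp_stats, dict):
    print(len(wp_stats), 'entries')
    for token, freq in list(wp_stats.items())[:5]:
        print(token, freq)  # a few sample entries
```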

Next, we can run our script to get our end-to-end results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.1 \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--topics topics \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--index msmarco-v1-passage \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--max-sim \
--wp-stat irst_test/bert_wp_term_freq.msmarco-passage.20220411.pickle
```
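
The only difference between the two runs is how per-query-token translation evidence is aggregated over document tokens, toggled by `--max-sim`. The sketch below illustrates that idea under an assumed IBM Model 1 formulation, where `T[q][d]` is the translation probability of query token `q` given document token `d`; it is illustrative only, not the exact Pyserini implementation, and `--alpha` (presumably the interpolation weight with exact-match evidence) is omitted:

```python
from collections import Counter

def irst_term_score(q_tok, doc_toks, T, max_sim=False):
    """Aggregate translation evidence for one query token (illustrative)."""
    tf = Counter(doc_toks)
    n = len(doc_toks)
    # Each document token contributes translation prob * relative frequency.
    contribs = [T.get(q_tok, {}).get(d_tok, 0.0) * (count / n)
                for d_tok, count in tf.items()]
    # IRST(Max) keeps only the strongest translation link per query token;
    # IRST(Sum) adds up the evidence from all document tokens.
    return max(contribs, default=0.0) if max_sim else sum(contribs)

# Toy usage with a hypothetical translation table.
T = {'lobster': {'lobster': 0.7, 'seafood': 0.2}}
doc = ['seafood', 'roll', 'lobster', 'lobster']
print(irst_term_score('lobster', doc, T))                # sum variant
print(irst_term_score('lobster', doc, T, max_sim=True))  # max variant
```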

For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

For the TREC DL 2019 and 2020 passage runs, evaluate with the `dl19-passage` and `dl20-passage` qrels using the `-l 2` option (counting only judgments of 2 or higher as relevant), e.g. `python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -l 2 dl20-passage <runfile>`.

For MS MARCO Passage V1, there is no need to use the `-l 2` option:
```bash
python -m pyserini.eval.trec_eval -c -M 10 -m ndcg_cut.10 -m map -m recip_rank msmarco-passage-dev-subset irst_test/regression_test_sum.msmarco-passage-dev-subset.txt
```
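
Here `-M 10` truncates each ranking to depth 10 before scoring, so `recip_rank` reproduces the official MRR@10. For intuition, a minimal sketch that computes MRR@10 directly, assuming standard TREC-format run (`qid Q0 docid rank score tag`) and qrels (`qid 0 docid rel`) files:

```python
def mrr_at_10(run_path, qrels_path):
    """Mean reciprocal rank at depth 10 (illustrative reimplementation)."""
    relevant = {}
    with open(qrels_path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            if int(rel) > 0:
                relevant.setdefault(qid, set()).add(docid)
    rr = {}
    with open(run_path) as f:
        for line in f:
            qid, _, docid, rank, _, _ = line.split()
            if qid in relevant and docid in relevant[qid] and int(rank) <= 10:
                # Keep the best (lowest) rank of any relevant hit.
                rr[qid] = max(rr.get(qid, 0.0), 1.0 / int(rank))
    # Queries with no relevant document in the top 10 contribute 0.
    return sum(rr.values()) / len(relevant) if relevant else 0.0
```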


## Document Reranking


### Data Preprocessing

For MS MARCO doc, each document is first segmented into passages, and each passage is treated as a unit of indexing.
We utilize the MaxP technique during ranking, that is, scoring each document based on its highest-scoring passage (sketched below).
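
A minimal sketch of the MaxP aggregation, assuming per-segment scores keyed by a `docid#segment` naming convention (the separator is an assumption for illustration):

```python
from collections import defaultdict

def maxp(segment_scores):
    """Collapse segment-level scores into document scores via max-pooling."""
    doc_scores = defaultdict(float)
    for seg_id, score in segment_scores.items():
        docid = seg_id.split('#')[0]  # e.g., 'D12345#3' -> 'D12345'
        doc_scores[docid] = max(doc_scores[docid], score)
    return dict(doc_scores)

# Toy usage: D1's best segment wins.
print(maxp({'D1#0': 1.2, 'D1#1': 2.7, 'D2#0': 1.9}))
```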

### Performing End-to-End Retrieval Using an Already-Trained Model

Download the trained IBM model. Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the IBM Model 1 trained on MS MARCO passage data instead.

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
```

Download the WordPiece (wp) term frequency statistics for the document collection:
```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc.20220411.pickle -P irst_test/
```

Next, we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.3 \
--hits 1000 \
--wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--hits 1000 \
--max-sim \
--wp-stat irst_test/bert_wp_term_freq.msmarco-doc.20220411.pickle
```


For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets. For example, a TREC DL 2019 document run pairs `--topics tools/topics-and-qrels/topics.dl19-doc.txt` with `--irst_topics dl19-doc-full`, matching the run file names evaluated below.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/topics.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/topics.msmarco-doc.dev.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `dl19-doc-full` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `dl20-doc-full` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `msmarco-doc-full` <br />

We can use the official TREC evaluation tool, trec_eval, to compute the metrics.

For TREC DL 2019, use this command to evaluate your run file:

```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -M 100 dl19-doc irst_test/regression_test_sum.dl19-doc-full.txt
```

Similarly for TREC DL 2020:
```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -M 100 dl20-doc irst_test/regression_test_sum.dl20-doc-full.txt
```

For MS MARCO Doc V1:
```bash
python -m pyserini.eval.trec_eval -c -M 100 -m ndcg_cut.10 -m map -m recip_rank msmarco-doc-dev irst_test/regression_test_sum.msmarco-doc-full.txt
```


## Document Segment Reranking


### Data Preprocessing

For MS MARCO doc, each document is first segmented into passages, and each passage is treated as a unit of indexing.
We utilize the MaxP technique during ranking, that is, scoring each document based on its highest-scoring passage (see the sketch in the Document Reranking section above).

### Performing End-to-End Retrieval Using an Already-Trained Model


Download the trained IBM model. Please note that we did not have time to train a new IBM model on MS MARCO doc data, so we used the IBM Model 1 trained on MS MARCO passage data instead.

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/pyserini-models/ibm_model_1_bert_tok_20211117.tar.gz -P irst_test/
tar -xzvf irst_test/ibm_model_1_bert_tok_20211117.tar.gz -C irst_test
```

Download the WordPiece (wp) term frequency statistics for the segmented document collection:
```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle -P irst_test/
```

Next, we can run our script to get our retrieval results.

IRST (Sum)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_sum.irst_topics.txt \
--alpha 0.3 \
--segments \
--hits 10000 \
--wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
```

IRST (Max)
```bash
python -m pyserini.search.lucene.irst \
--translation-model irst_test/ibm_model_1_bert_tok_20211117/ \
--topics topics \
--index msmarco-v1-doc-segmented \
--output irst_test/regression_test_max.irst_topics.txt \
--alpha 0.3 \
--hits 10000 \
--segments \
--max-sim \
--wp-stat irst_test/bert_wp_term_freq.msmarco-doc-segmented.20220411.pickle
```


For different topics, the `--topics` and `--irst_topics` values differ; since Pyserini has all these topics available, we can pass in different values to run on different datasets.

`--topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `tools/topics-and-qrels/topics.dl19-doc.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `tools/topics-and-qrels/topics.dl20.txt` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `tools/topics-and-qrels/topics.msmarco-doc.dev.txt` <br />

`--irst_topics`: <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2019 Document: `dl19-doc-seg` <br />
&nbsp;&nbsp;&nbsp;&nbsp;TREC DL 2020 Document: `dl20-doc-seg` <br />
&nbsp;&nbsp;&nbsp;&nbsp;MS MARCO Document V1: `msmarco-doc-seg` <br />

The reranked run file contains the top 10000 document segments, so the MaxP technique is used to derive a score for each document. We can then use the official TREC evaluation tool, trec_eval, to compute the metrics.

For TREC DL 2019, use this command to evaluate your run file:

```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -M 100 dl19-doc irst_test/regression_test_sum.dl19-doc-seg.txt
```

Similarly for TREC DL 2020; there is no need to use the `-l 2` option:
```bash
python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -M 100 dl20-doc irst_test/regression_test_max.dl20-doc-seg.txt
```

For MS MARCO Doc V1, there is no need to use the `-l 2` option:
```bash
python -m pyserini.eval.trec_eval -c -M 100 -m ndcg_cut.10 -m map -m recip_rank msmarco-doc-dev irst_test/regression_test_sum.msmarco-doc-seg.txt
```

## Results
### Passage Ranking Datasets

| Topics        | Method     | MRR@10 | nDCG@10 | MAP   |
|:--------------|:-----------|:------:|:-------:|:-----:|
| DL19          | IRST(Sum)  | -      | 0.526   | 0.328 |
| DL19          | IRST(Max)  | -      | 0.537   | 0.328 |
| DL20          | IRST(Sum)  | -      | 0.558   | 0.352 |
| DL20          | IRST(Max)  | -      | 0.546   | 0.337 |
| MS MARCO Dev  | IRST(Sum)  | 0.221  | -       | -     |
| MS MARCO Dev  | IRST(Max)  | 0.215  | -       | -     |


### Document Ranking Datasets

| Topics        | Method     | MRR@100 | nDCG@10 | MAP   |
|:--------------|:-----------|:-------:|:-------:|:-----:|
| DL19          | IRST(Sum)  | -       | 0.551   | 0.253 |
| DL19          | IRST(Max)  | -       | 0.491   | 0.221 |
| DL20          | IRST(Sum)  | -       | 0.556   | 0.383 |
| DL20          | IRST(Max)  | -       | 0.502   | 0.337 |
| MS MARCO Dev  | IRST(Sum)  | 0.303   | -       | -     |
| MS MARCO Dev  | IRST(Max)  | 0.253   | -       | -     |

### Document Segment Ranking Datasets

| Topics        | Method     | MRR@100 | nDCG@10 | MAP   |
|:--------------|:-----------|:-------:|:-------:|:-----:|
| DL19          | IRST(Sum)  | -       | 0.560   | 0.271 |
| DL19          | IRST(Max)  | -       | 0.520   | 0.243 |
| DL20          | IRST(Sum)  | -       | 0.536   | 0.376 |
| DL20          | IRST(Max)  | -       | 0.510   | 0.350 |
| MS MARCO Dev  | IRST(Sum)  | 0.296   | -       | -     |
| MS MARCO Dev  | IRST(Max)  | 0.260   | -       | -     |