diff --git a/README.md b/README.md
index 43d578c8c..e7fecc406 100644
--- a/README.md
+++ b/README.md
@@ -387,22 +387,22 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe

 + Reproducing [runs directly from the Python package](docs/pypi-reproduction.md)
 + Reproducing [Robust04 baselines for ad hoc retrieval](docs/experiments-robust04.md)
-+ Reproducing the [BM25 baseline for MS MARCO (V1) Passage Ranking](docs/experiments-msmarco-passage.md)
-+ Reproducing the [BM25 baseline for MS MARCO (V1) Document Ranking](docs/experiments-msmarco-doc.md)
-+ Reproducing the [multi-field BM25 baseline for MS MARCO (V1) Document Ranking from Elasticsearch](docs/experiments-elastic.md)
-+ Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
-+ Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
-+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1)](docs/experiments-unicoil.md)
-+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
-+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V2) Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
-+ Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)
-+ Reproducing [SPLADEv2 experiments for MS MARCO (V1) Passage Ranking](docs/experiments-spladev2.md)
++ Reproducing the [BM25 baseline for MS MARCO V1 Passage Ranking](docs/experiments-msmarco-passage.md)
++ Reproducing the [BM25 baseline for MS MARCO V1 Document Ranking](docs/experiments-msmarco-doc.md)
++ Reproducing the [multi-field BM25 baseline for MS MARCO V1 Document Ranking from Elasticsearch](docs/experiments-elastic.md)
++ Reproducing [BM25 baselines on the MS MARCO V2 Collections](docs/experiments-msmarco-v2.md)
++ Reproducing [DeepImpact experiments for MS MARCO V1 Passage Ranking](docs/experiments-deepimpact.md)
++ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO V1](docs/experiments-unicoil.md)
++ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V1 Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
++ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V2 Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
++ Reproducing [uniCOIL experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-unicoil.md)
++ Reproducing [SPLADEv2 experiments for MS MARCO V1 Passage Ranking](docs/experiments-spladev2.md)

 ### Dense Retrieval

-+ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert.md)
-+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert-v2.md)
-+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
++ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert.md)
++ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert-v2.md)
++ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
 + Reproducing [DPR experiments](docs/experiments-dpr.md)
 + Reproducing [BPR experiments](docs/experiments-bpr.md)
 + Reproducing [ANCE experiments](docs/experiments-ance.md)
diff --git a/docs/experiments-spladev2.md b/docs/experiments-spladev2.md
index 38b63741b..940ade516 100644
--- a/docs/experiments-spladev2.md
+++ b/docs/experiments-spladev2.md
@@ -1,4 +1,4 @@
-# Pyserini: SPLADEv2 for MS MARCO Passage Ranking
+# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking

 This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:

@@ -13,9 +13,8 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with SPLADE processing:

 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar

 tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
@@ -23,7 +22,6 @@ tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/

 To confirm, `msmarco-passage-distill-splade-max.tar` should have MD5 checksum of `95b89a7dfd88f3685edcc2d1ffb120d1`.

-
 ## Indexing

 We can now index these documents:
@@ -48,9 +46,8 @@ To ensure that the tokenization in the index aligns exactly with the queries, we
 First, fetch the MS MARCO passage ranking dev set queries:

 ```
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/DrL4HLqgmT6orJL/download -O collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz
 ```

@@ -91,6 +88,7 @@ The final evaluation metric is very close to the one reported in the paper (0.36
 Alternatively, we can use one-the-fly query encoding.

 First, download the model checkpoint from NAVER's github [repo](https://github.com/naver/splade/tree/main/weights/splade_max):
+
 ```bash
 mkdir splade-distil-max
 cd splade-distil-max
diff --git a/docs/experiments-unicoil-tilde-expansion.md b/docs/experiments-unicoil-tilde-expansion.md
index c9f25a78f..11c236bd0 100644
--- a/docs/experiments-unicoil-tilde-expansion.md
+++ b/docs/experiments-unicoil-tilde-expansion.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL (w/ TILDE) for MS MARCO (V1) Passage Ranking
+# Pyserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking

 This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper:

@@ -11,21 +11,22 @@ The original uniCOIL model is described here:

 In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive.
 As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective.
-For details of how to use TILDE to expand documents, please refer to the [TIDLE repo](https://github.com/ielab/TILDE).
+For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE).
 For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL).

 In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting.
 Thus, no neural inference is involved.

+Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md) based on Java.
+
 ## Data Prep

 We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

 tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
@@ -40,7 +41,7 @@ We can now index these docs:
 ```
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \
- -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
+ -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -57,8 +58,8 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-passage-dev-subset \
  --encoder ielab/unicoil-tilde200-msmarco-passage \
- --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
- --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
+ --index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
+ --output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
  --impact \
  --hits 1000 --batch 32 --threads 12 \
  --output-format msmarco
@@ -72,7 +73,7 @@ A complete run typically takes around 20 minutes.
 The output is in MS MARCO output format, so we can directly evaluate:

 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
 ```

 The results should be as follows:
@@ -91,9 +92,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the queries:

 ```
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/GZEPQkNQGoszHTx/download -O collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz
 ```

@@ -103,8 +103,8 @@ We can now run retrieval:

 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \
- --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
- --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
+ --index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
+ --output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
  --impact \
  --hits 1000 --batch 32 --threads 12 \
  --output-format msmarco
@@ -116,7 +116,7 @@ Since we're not applying neural inference over the queries, retrieval is faster,
 The output is in MS MARCO output format, so we can directly evaluate:

 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
 ```

 The results should be as follows:
diff --git a/docs/experiments-unicoil.md b/docs/experiments-unicoil.md
index 16540fed3..8b3a502cc 100644
--- a/docs/experiments-unicoil.md
+++ b/docs/experiments-unicoil.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL (w/ doc2query-T5) for MS MARCO (V1)
+# Pyserini: uniCOIL w/ doc2query-T5 for MS MARCO V1

 This page describes how to reproduce the uniCOIL experiments in the following paper:

@@ -8,7 +8,7 @@ In this guide, we start with a version of the MS MARCO passage corpus that has a
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md) based on Java.
+Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-unicoil.md) based on Java.

 ## Passage Ranking

@@ -18,12 +18,11 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

 ```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-b8.tar -P collections/
-
-# Alternate mirror
+# Alternate mirrors of the same data, pick one:
+wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-b8.tar -P collections/
 wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-b8.tar

-tar -xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
+tar xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
 ```

 To confirm, `msmarco-passage-unicoil-b8.tar` should have MD5 checksum of `eb28c059fad906da2840ce77949bffd7`.
@@ -35,7 +34,7 @@ We can now index these docs:
 ```
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-passage-unicoil-b8/ \
- -index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+ -index indexes/lucene-index.msmarco-passage.unicoil-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -52,7 +51,7 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-passage-dev-subset \
  --encoder castorini/unicoil-d2q-msmarco-passage \
- --index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+ --index indexes/lucene-index.msmarco-passage.unicoil-b8 \
  --output runs/run.msmarco-passage-unicoil-b8.tsv \
  --impact \
  --hits 1000 --batch 36 --threads 12 \
@@ -86,9 +85,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the MS MARCO passage ranking dev set queries:

 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/QGoHeBm4YsAgt6H/download -O collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz
 ```

@@ -98,7 +96,7 @@ We can now run retrieval:

 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
- --index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+ --index indexes/lucene-index.msmarco-passage.unicoil-b8 \
  --output runs/run.msmarco-passage-unicoil-b8.tsv \
  --impact \
  --hits 1000 --batch 36 --threads 12 \
@@ -133,12 +131,11 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download -O collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar

-tar -xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
+tar xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
 ```

 To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have MD5 checksum of `88f365b148c7702cf30c0fb95af35149`.
@@ -147,10 +144,10 @@ To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have M

 We can now index these docs:

-```
+```bash
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
- -index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+ -index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -166,7 +163,7 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-doc-dev \
  --encoder castorini/unicoil-d2q-msmarco-passage \
- --index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+ --index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
  --output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
  --impact \
  --hits 1000 --batch 36 --threads 12 \
@@ -201,9 +198,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the MS MARCO passage ranking dev set queries:

 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-doc.dev.unicoil.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/6D5JtJQxYpPbByM/download -O collections/topics.msmarco-doc.dev.unicoil.tsv.gz
 ```

@@ -213,7 +209,7 @@ We can now run retrieval:

 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-doc.dev.unicoil.tsv.gz \
- --index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+ --index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
  --output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
  --impact \
  --hits 1000 --batch 36 --threads 12 \