Refactor sparse learned model docs (castorini#816)

crystina-z · Oct 9, 2021 · 8599c81
1 parent 509bb5a

Showing 4 changed files with 46 additions and 52 deletions.
diff --git a/README.md b/README.md
@@ -387,22 +387,22 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
 
 + Reproducing [runs directly from the Python package](docs/pypi-reproduction.md)
 + Reproducing [Robust04 baselines for ad hoc retrieval](docs/experiments-robust04.md)
-+ Reproducing the [BM25 baseline for MS MARCO (V1) Passage Ranking](docs/experiments-msmarco-passage.md)
-+ Reproducing the [BM25 baseline for MS MARCO (V1) Document Ranking](docs/experiments-msmarco-doc.md)
-+ Reproducing the [multi-field BM25 baseline for MS MARCO (V1) Document Ranking from Elasticsearch](docs/experiments-elastic.md)
-+ Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
-+ Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
-+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1)](docs/experiments-unicoil.md)
-+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
-+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V2) Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
-+ Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)
-+ Reproducing [SPLADEv2 experiments for MS MARCO (V1) Passage Ranking](docs/experiments-spladev2.md)
++ Reproducing the [BM25 baseline for MS MARCO V1 Passage Ranking](docs/experiments-msmarco-passage.md)
++ Reproducing the [BM25 baseline for MS MARCO V1 Document Ranking](docs/experiments-msmarco-doc.md)
++ Reproducing the [multi-field BM25 baseline for MS MARCO V1 Document Ranking from Elasticsearch](docs/experiments-elastic.md)
++ Reproducing [BM25 baselines on the MS MARCO V2 Collections](docs/experiments-msmarco-v2.md)
++ Reproducing [DeepImpact experiments for MS MARCO V1 Passage Ranking](docs/experiments-deepimpact.md)
++ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO V1](docs/experiments-unicoil.md)
++ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V1 Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
++ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V2 Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
++ Reproducing [uniCOIL experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-unicoil.md)
++ Reproducing [SPLADEv2 experiments for MS MARCO V1 Passage Ranking](docs/experiments-spladev2.md)
 
 ### Dense Retrieval
 
-+ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert.md)
-+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert-v2.md)
-+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
++ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert.md)
++ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert-v2.md)
++ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
 + Reproducing [DPR experiments](docs/experiments-dpr.md)
 + Reproducing [BPR experiments](docs/experiments-bpr.md)
 + Reproducing [ANCE experiments](docs/experiments-ance.md)

diff --git a/docs/experiments-spladev2.md b/docs/experiments-spladev2.md
@@ -1,4 +1,4 @@
-# Pyserini: SPLADEv2 for MS MARCO Passage Ranking
+# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking
 
 This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:
 
@@ -13,17 +13,15 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with SPLADE processing:
 
 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar
 
 tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
 ```
 
 To confirm, `msmarco-passage-distill-splade-max.tar` should have MD5 checksum of `95b89a7dfd88f3685edcc2d1ffb120d1`.
 
-
 ## Indexing
 
 We can now index these documents:
@@ -48,9 +46,8 @@ To ensure that the tokenization in the index aligns exactly with the queries, we
 First, fetch the MS MARCO passage ranking dev set queries: 
 
 ```
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/DrL4HLqgmT6orJL/download -O collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz
 ```
 
@@ -91,6 +88,7 @@ The final evaluation metric is very close to the one reported in the paper (0.36
 Alternatively, we can use on-the-fly query encoding.
 
 First, download the model checkpoint from NAVER's github [repo](https://github.com/naver/splade/tree/main/weights/splade_max):
+
 ```bash
 mkdir splade-distil-max
 cd splade-distil-max

diff --git a/docs/experiments-unicoil-tilde-expansion.md b/docs/experiments-unicoil-tilde-expansion.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL (w/ TILDE) for MS MARCO (V1) Passage Ranking
+# Pyserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking
 
 This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper:
 
@@ -11,21 +11,22 @@ The original uniCOIL model is described here:
 
 In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive.
 As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective.
-For details of how to use TILDE to expand documents, please refer to the [TIDLE repo](https://github.com/ielab/TILDE).
+For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE).
 For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
 In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting.
 Thus, no neural inference is involved.
 
+Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md) based on Java.
+
 ## Data Prep
 
 We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:
 
 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar
 
 tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
@@ -40,7 +41,7 @@ We can now index these docs:
 ```
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \
- -index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
+ -index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -57,8 +58,8 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-passage-dev-subset \
                           --encoder ielab/unicoil-tilde200-msmarco-passage \
-                          --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
-                          --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
+                          --index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
+                          --output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
                           --impact \
                           --hits 1000 --batch 32 --threads 12 \
                           --output-format msmarco
@@ -72,7 +73,7 @@ A complete run typically takes around 20 minutes.
 The output is in MS MARCO output format, so we can directly evaluate:
 
 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
 ```
 
 The results should be as follows:
@@ -91,9 +92,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the queries:
 
 ```
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/GZEPQkNQGoszHTx/download -O collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz
 ```
 
@@ -103,8 +103,8 @@ We can now run retrieval:
 
 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \
-                          --index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
-                          --output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
+                          --index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
+                          --output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
                           --impact \
                           --hits 1000 --batch 32 --threads 12 \
                           --output-format msmarco
@@ -116,7 +116,7 @@ Since we're not applying neural inference over the queries, retrieval is faster,
 The output is in MS MARCO output format, so we can directly evaluate:
 
 ```bash
-python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
+python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
 ```
 
 The results should be as follows:
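
The hunks in this file rename the on-disk index and run paths from `msmarco-passage-unicoil-tilde-expansion-b8` to `msmarco-passage.unicoil-tilde-expansion-b8`. For a checkout that already holds artifacts under the old names, a minimal sketch of carrying them over might look like the following. The helper `rename_legacy_paths` and the `RENAMES` table are hypothetical and not part of Pyserini; only the path strings come from the diff.

```python
from pathlib import Path

# Old -> new relative paths, copied from the diff (the mapping itself is illustrative).
RENAMES = {
    "indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8":
        "indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8",
    "runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv":
        "runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv",
}


def rename_legacy_paths(root, renames=RENAMES):
    """Rename artifacts under root from old to new names (hypothetical helper).

    Returns the (old, new) pairs that were actually renamed. An entry is
    skipped when its old path is missing or its new path already exists.
    """
    root = Path(root)
    done = []
    for old, new in renames.items():
        src, dst = root / old, root / new
        if src.exists() and not dst.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            src.rename(dst)
            done.append((old, new))
    return done
```

Calling `rename_legacy_paths(".")` from the repository root would move any existing index or run file to its new name and leave everything else untouched; running it a second time is a no-op.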

diff --git a/docs/experiments-unicoil.md b/docs/experiments-unicoil.md
@@ -1,4 +1,4 @@
-# Pyserini: uniCOIL (w/ doc2query-T5) for MS MARCO (V1)
+# Pyserini: uniCOIL w/ doc2query-T5 for MS MARCO V1
 
 This page describes how to reproduce the uniCOIL experiments in the following paper:
 
@@ -8,7 +8,7 @@ In this guide, we start with a version of the MS MARCO passage corpus that has a
 Thus, no neural inference is involved.
 For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).
 
-Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md) based on Java.
+Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-unicoil.md) based on Java.
 
 ## Passage Ranking
 
@@ -18,12 +18,11 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:
 
 ```bash
-wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-b8.tar -P collections/
-
-# Alternate mirror
+# Alternate mirrors of the same data, pick one:
+wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-b8.tar -P collections/
 wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-b8.tar
 
-tar -xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
+tar xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
 ```
 
 To confirm, `msmarco-passage-unicoil-b8.tar` should have MD5 checksum of `eb28c059fad906da2840ce77949bffd7`.
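
The checksum comparison in the line above can be scripted. This is a minimal sketch using only Python's standard `hashlib`; the helper name `md5_of` is hypothetical, and the expected digest is the one quoted in the guide.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a multi-gigabyte tarball never sits in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Expected value quoted in the guide for msmarco-passage-unicoil-b8.tar.
EXPECTED_MD5 = "eb28c059fad906da2840ce77949bffd7"
```

For an intact download, `md5_of("collections/msmarco-passage-unicoil-b8.tar") == EXPECTED_MD5` should hold; a mismatch suggests a truncated or corrupted file and a retry from the other mirror.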
@@ -35,7 +34,7 @@ We can now index these docs:
 ```
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-passage-unicoil-b8/ \
- -index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+ -index indexes/lucene-index.msmarco-passage.unicoil-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -52,7 +51,7 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-passage-dev-subset \
                           --encoder castorini/unicoil-d2q-msmarco-passage \
-                          --index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+                          --index indexes/lucene-index.msmarco-passage.unicoil-b8 \
                           --output runs/run.msmarco-passage-unicoil-b8.tsv \
                           --impact \
                           --hits 1000 --batch 36 --threads 12 \
@@ -86,9 +85,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the MS MARCO passage ranking dev set queries:
 
 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/QGoHeBm4YsAgt6H/download -O collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz
 ```
 
@@ -98,7 +96,7 @@ We can now run retrieval:
 
 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
-                          --index indexes/lucene-index.msmarco-passage-unicoil-b8 \
+                          --index indexes/lucene-index.msmarco-passage.unicoil-b8 \
                           --output runs/run.msmarco-passage-unicoil-b8.tsv \
                           --impact \
                           --hits 1000 --batch 36 --threads 12 \
@@ -133,12 +131,11 @@ We're going to use the repository's root directory as the working directory.
 First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:
 
 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download -O collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar
 
-tar -xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
+tar xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
 ```
 
 To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have MD5 checksum of `88f365b148c7702cf30c0fb95af35149`.
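
The two `wget` lines in this hunk are mirrors of the same tarball, so only one needs to succeed. Below is a minimal sketch of trying them in order with Python's standard library. `fetch_first_available` is a hypothetical helper, not a Pyserini function, and its `opener` parameter is an assumption added only so the fallback logic can be exercised without network access.

```python
import shutil
import urllib.request
from pathlib import Path

# Mirror URLs copied from the guide; both are stated to serve the same file.
MIRRORS = [
    "https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar",
    "https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download",
]


def fetch_first_available(urls, dest, opener=urllib.request.urlopen):
    """Download to dest from the first URL that works and return that URL.

    Raises RuntimeError if every mirror fails.
    """
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    errors = []
    for url in urls:
        try:
            with opener(url) as response, open(dest, "wb") as out:
                shutil.copyfileobj(response, out)
            return url
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            dest.unlink(missing_ok=True)  # drop any partial download
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A call such as `fetch_first_available(MIRRORS, "collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar")` would mirror the intent of the shell commands; the MD5 check quoted above still applies to whichever copy is fetched.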
@@ -147,10 +144,10 @@ To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have M
 
 We can now index these docs:
 
-```
+```bash
 python -m pyserini.index -collection JsonVectorCollection \
  -input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
- -index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+ -index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
  -generator DefaultLuceneDocumentGenerator -impact -pretokenized \
  -threads 12
 ```
@@ -166,7 +163,7 @@ We can now run retrieval:
 ```bash
 python -m pyserini.search --topics msmarco-doc-dev \
                           --encoder castorini/unicoil-d2q-msmarco-passage \
-                          --index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+                          --index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
                           --output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
                           --impact \
                           --hits 1000 --batch 36 --threads 12 \
@@ -201,9 +198,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
 First, fetch the MS MARCO passage ranking dev set queries:
 
 ```bash
+# Alternate mirrors of the same data, pick one:
 wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-doc.dev.unicoil.tsv.gz -P collections/
-
-# Alternate mirror
 wget https://vault.cs.uwaterloo.ca/s/6D5JtJQxYpPbByM/download -O collections/topics.msmarco-doc.dev.unicoil.tsv.gz
 ```
 
@@ -213,7 +209,7 @@ We can now run retrieval:
 
 ```bash
 python -m pyserini.search --topics collections/topics.msmarco-doc.dev.unicoil.tsv.gz \
-                          --index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
+                          --index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
                           --output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
                           --impact \
                           --hits 1000 --batch 36 --threads 12 \