Commit

Refactor sparse learned model docs (castorini#816)
lintool authored Oct 9, 2021
1 parent 509bb5a commit 8599c81
Showing 4 changed files with 46 additions and 52 deletions.
26 changes: 13 additions & 13 deletions README.md
@@ -387,22 +387,22 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe

+ Reproducing [runs directly from the Python package](docs/pypi-reproduction.md)
+ Reproducing [Robust04 baselines for ad hoc retrieval](docs/experiments-robust04.md)
+ Reproducing the [BM25 baseline for MS MARCO (V1) Passage Ranking](docs/experiments-msmarco-passage.md)
+ Reproducing the [BM25 baseline for MS MARCO (V1) Document Ranking](docs/experiments-msmarco-doc.md)
+ Reproducing the [multi-field BM25 baseline for MS MARCO (V1) Document Ranking from Elasticsearch](docs/experiments-elastic.md)
+ Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
+ Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1)](docs/experiments-unicoil.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V2) Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)
+ Reproducing [SPLADEv2 experiments for MS MARCO (V1) Passage Ranking](docs/experiments-spladev2.md)
+ Reproducing the [BM25 baseline for MS MARCO V1 Passage Ranking](docs/experiments-msmarco-passage.md)
+ Reproducing the [BM25 baseline for MS MARCO V1 Document Ranking](docs/experiments-msmarco-doc.md)
+ Reproducing the [multi-field BM25 baseline for MS MARCO V1 Document Ranking from Elasticsearch](docs/experiments-elastic.md)
+ Reproducing [BM25 baselines on the MS MARCO V2 Collections](docs/experiments-msmarco-v2.md)
+ Reproducing [DeepImpact experiments for MS MARCO V1 Passage Ranking](docs/experiments-deepimpact.md)
+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO V1](docs/experiments-unicoil.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V1 Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO V2 Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-unicoil.md)
+ Reproducing [SPLADEv2 experiments for MS MARCO V1 Passage Ranking](docs/experiments-spladev2.md)

### Dense Retrieval

+ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert.md)
+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V1) Collections](docs/experiments-tct_colbert-v2.md)
+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
+ Reproducing [TCT-ColBERTv1 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert.md)
+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V1 Collections](docs/experiments-tct_colbert-v2.md)
+ Reproducing [TCT-ColBERTv2 experiments on the MS MARCO V2 Collections](docs/experiments-msmarco-v2-tct_colbert-v2.md)
+ Reproducing [DPR experiments](docs/experiments-dpr.md)
+ Reproducing [BPR experiments](docs/experiments-bpr.md)
+ Reproducing [ANCE experiments](docs/experiments-ance.md)
10 changes: 4 additions & 6 deletions docs/experiments-spladev2.md
@@ -1,4 +1,4 @@
# Pyserini: SPLADEv2 for MS MARCO Passage Ranking
# Pyserini: SPLADEv2 for MS MARCO V1 Passage Ranking

This page describes how to reproduce with Pyserini the DistilSPLADE-max experiments in the following paper:

@@ -13,17 +13,15 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with SPLADE processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-distill-splade-max.tar -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/poCLbJDMm7JxwPk/download -O collections/msmarco-passage-distill-splade-max.tar

tar xvf collections/msmarco-passage-distill-splade-max.tar -C collections/
```

To confirm, `msmarco-passage-distill-splade-max.tar` should have MD5 checksum of `95b89a7dfd88f3685edcc2d1ffb120d1`.
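The checksum comparison can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `md5_of_file` and `verify_md5` are illustrative and are not part of Pyserini or of this commit.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that a multi-gigabyte tarball never has to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected):
    """Compare the file's digest against an expected hex string (case-insensitive)."""
    return md5_of_file(path) == expected.strip().lower()
```

For the tarball above, the call would be `verify_md5("collections/msmarco-passage-distill-splade-max.tar", "95b89a7dfd88f3685edcc2d1ffb120d1")`.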


## Indexing

We can now index these documents:
@@ -48,9 +46,8 @@ To ensure that the tokenization in the index aligns exactly with the queries, we
First, fetch the MS MARCO passage ranking dev set queries:

```
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz -P collections/
# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/DrL4HLqgmT6orJL/download -O collections/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz
```

@@ -91,6 +88,7 @@ The final evaluation metric is very close to the one reported in the paper (0.36
Alternatively, we can use on-the-fly query encoding.

First, download the model checkpoint from NAVER's github [repo](https://github.com/naver/splade/tree/main/weights/splade_max):

```bash
mkdir splade-distil-max
cd splade-distil-max
26 changes: 13 additions & 13 deletions docs/experiments-unicoil-tilde-expansion.md
@@ -1,4 +1,4 @@
# Pyserini: uniCOIL (w/ TILDE) for MS MARCO (V1) Passage Ranking
# Pyserini: uniCOIL w/ TILDE for MS MARCO V1 Passage Ranking

This page describes how to reproduce experiments using uniCOIL with TILDE document expansion on the MS MARCO passage corpus, as described in the following paper:

@@ -11,21 +11,22 @@ The original uniCOIL model is described here:
In the original uniCOIL paper, doc2query-T5 is used to perform document expansion, which is slow and expensive.
As an alternative, Zhuang and Zuccon proposed to use the TILDE model to expand the documents instead, resulting in a faster and cheaper process that is just as effective.
For details of how to use TILDE to expand documents, please refer to the [TIDLE repo](https://github.com/ielab/TILDE).
For details of how to use TILDE to expand documents, please refer to the [TILDE repo](https://github.com/ielab/TILDE).
For additional details on the original uniCOIL design (with doc2query-T5 expansion), please refer to the [COIL repo](https://github.com/luyug/COIL/tree/main/uniCOIL).

In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL + TILDE, i.e., gone through document expansion and term re-weighting.
Thus, no neural inference is involved.

Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil-tilde-expansion.md) based on Java.

## Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-tilde-expansion-b8.tar -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/6LECmLdiaBoPwrL/download -O collections/msmarco-passage-unicoil-tilde-expansion-b8.tar

tar -xvf collections/msmarco-passage-unicoil-tilde-expansion-b8.tar -C collections/
@@ -40,7 +41,7 @@ We can now index these docs:
```
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-tilde-expansion-b8/ \
-index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
-index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```
@@ -57,8 +58,8 @@ We can now run retrieval:
```bash
python -m pyserini.search --topics msmarco-passage-dev-subset \
--encoder ielab/unicoil-tilde200-msmarco-passage \
--index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
--output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
--index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
--output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
--impact \
--hits 1000 --batch 32 --threads 12 \
--output-format msmarco
@@ -72,7 +73,7 @@ A complete run typically takes around 20 minutes.
The output is in MS MARCO output format, so we can directly evaluate:

```bash
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
```
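For readers who want to see what the evaluation measures, here is a rough sketch of MRR@10 over a run in MS MARCO output format. It is not the code of `pyserini.eval.msmarco_passage_eval`. It assumes run lines of the form `qid docid rank`, qrels lines of the form `qid 0 docid rel`, and averaging over every query in the qrels; the function names are illustrative.

```python
from collections import defaultdict


def load_qrels(lines):
    """Parse qrels lines (assumed format: qid 0 docid rel) into {qid: {relevant docids}}."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _, docid, rel = parts
        if int(rel) > 0:
            relevant[qid].add(docid)
    return relevant


def load_run(lines):
    """Parse run lines (assumed format: qid docid rank) into {qid: [(rank, docid), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        qid, docid, rank = parts
        run[qid].append((int(rank), docid))
    return run


def mrr_at_k(relevant, run, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k,
    averaged over all queries that have at least one relevant document."""
    if not relevant:
        return 0.0
    total = 0.0
    for qid, rel_docs in relevant.items():
        for rank, docid in sorted(run.get(qid, [])):
            if rank > k:
                break
            if docid in rel_docs:
                total += 1.0 / rank
                break
    return total / len(relevant)
```

A query whose first relevant passage appears at rank 2 contributes 0.5, and a query with no relevant passage in the top 10 contributes 0.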

The results should be as follows:
@@ -91,9 +92,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
First, fetch the queries:

```
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz -P collections/
# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/GZEPQkNQGoszHTx/download -O collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz
```

@@ -103,8 +103,8 @@ We can now run retrieval:

```bash
python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil-tilde-expansion.tsv.gz \
--index indexes/lucene-index.msmarco-passage-unicoil-tilde-expansion-b8 \
--output runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv \
--index indexes/lucene-index.msmarco-passage.unicoil-tilde-expansion-b8 \
--output runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv \
--impact \
--hits 1000 --batch 32 --threads 12 \
--output-format msmarco
@@ -116,7 +116,7 @@ Since we're not applying neural inference over the queries, retrieval is faster,
The output is in MS MARCO output format, so we can directly evaluate:

```bash
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage-unicoil-tilde-expansion-b8.tsv
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-tilde-expansion-b8.tsv
```

The results should be as follows:
36 changes: 16 additions & 20 deletions docs/experiments-unicoil.md
@@ -1,4 +1,4 @@
# Pyserini: uniCOIL (w/ doc2query-T5) for MS MARCO (V1)
# Pyserini: uniCOIL w/ doc2query-T5 for MS MARCO V1

This page describes how to reproduce the uniCOIL experiments in the following paper:

@@ -8,7 +8,7 @@ In this guide, we start with a version of the MS MARCO passage corpus that has a
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md) based on Java.
Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-unicoil.md) based on Java.

## Passage Ranking

@@ -18,12 +18,11 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

```bash
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-passage-unicoil-b8.tar -P collections/

# Alternate mirror
# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-b8.tar

tar -xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
tar xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
```
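Since the mirrors serve the same file, only one download needs to succeed. The sketch below tries each mirror in order and stops at the first success. It uses only the Python standard library; `download_first_available` and its injectable `fetch` parameter are illustrative names, not part of Pyserini.

```python
import os
import urllib.request


def download_first_available(urls, dest, fetch=urllib.request.urlretrieve):
    """Try each mirror in order and return the URL that succeeded.

    `fetch(url, dest)` defaults to urllib's urlretrieve; it is a parameter
    so the function can be exercised without network access.
    """
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
    errors = []
    for url in urls:
        try:
            fetch(url, dest)
            return url
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            errors.append((url, exc))
    raise RuntimeError(f"all mirrors failed: {errors}")
```

It would be called with the two mirror URLs listed above and `dest="collections/msmarco-passage-unicoil-b8.tar"`, after which the `tar` step proceeds as before.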

To confirm, `msmarco-passage-unicoil-b8.tar` should have MD5 checksum of `eb28c059fad906da2840ce77949bffd7`.
@@ -35,7 +34,7 @@ We can now index these docs:
```
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-b8/ \
-index indexes/lucene-index.msmarco-passage-unicoil-b8 \
-index indexes/lucene-index.msmarco-passage.unicoil-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```
@@ -52,7 +51,7 @@ We can now run retrieval:
```bash
python -m pyserini.search --topics msmarco-passage-dev-subset \
--encoder castorini/unicoil-d2q-msmarco-passage \
--index indexes/lucene-index.msmarco-passage-unicoil-b8 \
--index indexes/lucene-index.msmarco-passage.unicoil-b8 \
--output runs/run.msmarco-passage-unicoil-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
@@ -86,9 +85,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
First, fetch the MS MARCO passage ranking dev set queries:

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil.tsv.gz -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/QGoHeBm4YsAgt6H/download -O collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz
```

@@ -98,7 +96,7 @@ We can now run retrieval:

```bash
python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
--index indexes/lucene-index.msmarco-passage-unicoil-b8 \
--index indexes/lucene-index.msmarco-passage.unicoil-b8 \
--output runs/run.msmarco-passage-unicoil-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
@@ -133,12 +131,11 @@ We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download -O collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar

tar -xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
tar xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
```

To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have MD5 checksum of `88f365b148c7702cf30c0fb95af35149`.
@@ -147,10 +144,10 @@ To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have M

We can now index these docs:

```
```bash
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
-index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
-index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
```
@@ -166,7 +163,7 @@ We can now run retrieval:
```bash
python -m pyserini.search --topics msmarco-doc-dev \
--encoder castorini/unicoil-d2q-msmarco-passage \
--index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
--index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
--output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
@@ -201,9 +198,8 @@ Alternatively, we can use pre-tokenized queries with pre-computed weights.
First, fetch the MS MARCO document ranking dev set queries:

```bash
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-doc.dev.unicoil.tsv.gz -P collections/

# Alternate mirror
wget https://vault.cs.uwaterloo.ca/s/6D5JtJQxYpPbByM/download -O collections/topics.msmarco-doc.dev.unicoil.tsv.gz
```

@@ -213,7 +209,7 @@ We can now run retrieval:

```bash
python -m pyserini.search --topics collections/topics.msmarco-doc.dev.unicoil.tsv.gz \
--index indexes/lucene-index.msmarco-doc-unicoil-d2q-b8 \
--index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
--output runs/run.msmarco-doc-unicoil-d2q-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
