Minor tweaks to uniCOIL docs: fixed indentation levels (#764)
lintool authored Sep 17, 2021
1 parent 4ea3ada commit 331dfe7
Showing 2 changed files with 15 additions and 11 deletions.
README.md: 2 changes (1 addition & 1 deletion)
@@ -392,7 +392,7 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
+ Reproducing the [multi-field BM25 baseline for MS MARCO (V1) Document Ranking from Elasticsearch](docs/experiments-elastic.md)
+ Reproducing [BM25 baselines on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2.md)
+ Reproducing [DeepImpact experiments for MS MARCO (V1) Passage Ranking](docs/experiments-deepimpact.md)
-+ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil.md)
++ Reproducing [uniCOIL experiments with doc2query-T5 expansions for MS MARCO (V1)](docs/experiments-unicoil.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V1) Passage Ranking](docs/experiments-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments with TILDE expansions for MS MARCO (V2) Passage Ranking](docs/experiments-msmarco-v2-unicoil-tilde-expansion.md)
+ Reproducing [uniCOIL experiments on the MS MARCO (V2) Collections](docs/experiments-msmarco-v2-unicoil.md)
docs/experiments-unicoil.md: 24 changes (14 additions & 10 deletions)
@@ -11,8 +11,9 @@ For details on how to train uniCOIL and perform inference, please see [this guid
Note that Anserini provides [a comparable reproduction guide](https://github.com/castorini/anserini/blob/master/docs/experiments-msmarco-passage-unicoil.md) based on Java.
Here, we can get _exactly_ the same results from Python.

-# Passage Ranking
-## Data Prep
+## Passage Ranking
+
+### Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:
@@ -26,10 +27,10 @@ wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msm
tar -xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
```

-To confirm, `msmarco-passage-unicoil-b8.tar` should have MD5 checksum of `eb28c059fad906da2840ce77949bffd7`.
+To confirm, `msmarco-passage-unicoil-b8.tar` should have MD5 checksum of `eb28c059fad906da2840ce77949bffd7`.

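To check the digest locally before indexing, a minimal sketch in Python (any MD5 tool such as `md5sum` works just as well):

```python
import hashlib

# Stream the tarball in chunks and compare its MD5 digest
# against the checksum quoted above.
md5 = hashlib.md5()
with open('collections/msmarco-passage-unicoil-b8.tar', 'rb') as f:
    for chunk in iter(lambda: f.read(1 << 20), b''):
        md5.update(chunk)

print(md5.hexdigest())  # expect eb28c059fad906da2840ce77949bffd7
```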
-## Indexing
+### Indexing

We can now index these docs:

@@ -47,7 +48,7 @@ Upon completion, we should have an index with 8,841,823 documents.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 20 minutes.
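To confirm the document count once indexing finishes, the index can be inspected from Python; a minimal sketch, assuming a hypothetical index path (substitute whatever index directory you told the indexer to write):

```python
from pyserini.index import IndexReader

# Hypothetical path; use the index directory created above.
reader = IndexReader('indexes/lucene-index.msmarco-passage-unicoil-b8')

# stats() reports aggregate counts; 'documents' should read 8,841,823.
print(reader.stats())
```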


-## Retrieval
+### Retrieval

To ensure that the tokenization in the index aligns exactly with the queries, we use pre-tokenized queries.
First, fetch the MS MARCO passage ranking dev set queries:
@@ -90,8 +91,9 @@ QueriesRanked: 6980
#####################
```
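A note on the pre-tokenized queries used above: uniCOIL assigns term weights over BERT's WordPiece vocabulary, so a query must be split into exactly the tokens the indexed documents were split into. A minimal sketch of that tokenization, assuming the HuggingFace `transformers` package and the `bert-base-uncased` vocabulary that uniCOIL builds on:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Prints the WordPiece tokens; rare words come back as
# '##'-prefixed subword pieces.
print(tokenizer.tokenize('anserini tokenization'))
```

A mismatched tokenizer would silently produce query tokens with no postings in the index, which is why the guide ships pre-tokenized queries.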

-# Document Ranking
-## Data Prep
+## Document Ranking
+
+### Data Prep

We're going to use the repository's root directory as the working directory.
First, we need to download and extract the MS MARCO document dataset with uniCOIL processing:
@@ -107,7 +109,8 @@ tar -xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C col

To confirm, `msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar` should have MD5 checksum of `88f365b148c7702cf30c0fb95af35149`.

-## Indexing
+
+### Indexing

We can now index these docs:

@@ -123,7 +126,9 @@ The important indexing options to note here are `-impact -pretokenized`: the fir

The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around an hour.

-## Retrieval
+
+### Retrieval

We can now run retrieval:

```bash
@@ -137,7 +142,6 @@ $ python -m pyserini.search --topics msmarco-doc-dev \
--output-format msmarco
```
-

Query evaluation is much slower than with bag-of-words BM25; a complete run can take around 40 minutes.
Note that the important option here is `-impact`, where we specify impact scoring.
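For context on that option: with impact scoring, every posting stores a precomputed, quantized term weight, and a document's score is simply the sum of the stored weights for the query's tokens, with no IDF or length normalization applied at query time. A toy sketch of the idea, with made-up tokens and weights (not Pyserini internals):

```python
# Toy illustration of impact scoring: ranking reduces to summing
# precomputed, quantized term weights over the query's tokens.
def impact_score(query_tokens, doc_weights):
    """doc_weights maps token -> quantized impact weight for one document."""
    return sum(doc_weights.get(token, 0) for token in query_tokens)

# Hypothetical per-document weights, as a uniCOIL index might store them.
doc = {'clinical': 112, 'trial': 97, '##s': 23}
print(impact_score(['clinical', 'trial'], doc))  # 209
```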

