This page describes how to reproduce the uniCOIL experiments in the following paper:
Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.
In this guide, we start with a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting. Thus, no neural inference is involved. For details on how to train uniCOIL and perform inference, please see this guide.
Note that Anserini provides a comparable reproduction guide based on Java.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO passage dataset with uniCOIL processing:
# Alternate mirrors of the same data, pick one:
wget https://git.uwaterloo.ca/jimmylin/unicoil/-/raw/master/msmarco-passage-unicoil-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/Rm6fknT432YdBts/download -O collections/msmarco-passage-unicoil-b8.tar
tar xvf collections/msmarco-passage-unicoil-b8.tar -C collections/
To confirm, msmarco-passage-unicoil-b8.tar is ~3.3 GB and has MD5 checksum eb28c059fad906da2840ce77949bffd7.
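The checksum comparison can also be scripted. The following is a minimal sketch using only Python's standard library; the helper names md5_of_file and verify_download are hypothetical and not part of Pyserini, and the path assumes the tarball was saved under collections/ as in the commands above.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_md5):
    """Return True only if the file exists and its MD5 matches expected_md5."""
    p = Path(path)
    return p.is_file() and md5_of_file(p) == expected_md5.lower()


# Assumed location of the download; adjust if you saved it elsewhere.
tarball = "collections/msmarco-passage-unicoil-b8.tar"
expected = "eb28c059fad906da2840ce77949bffd7"
if verify_download(tarball, expected):
    print("checksum OK")
else:
    print("checksum mismatch or file missing")
```

Reading in chunks keeps memory use flat, which matters for a multi-gigabyte tarball.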
We can now index these docs:
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-passage-unicoil-b8/ \
-index indexes/lucene-index.msmarco-passage.unicoil-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
The important indexing options to note here are -impact -pretokenized: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
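To make the role of these two options more concrete, here is a rough sketch of what one input record might look like. It assumes that each line of the collection is a JSON object with "id", "contents", and "vector" fields, where "vector" maps already-tokenized terms to integer impact weights; the record below and the is_valid_record helper are hypothetical, so inspect the extracted files to confirm the actual layout.

```python
import json


def is_valid_record(line):
    """Check that a JSON line has an id and a non-empty map of positive integer weights."""
    record = json.loads(line)
    vector = record.get("vector")
    return (
        "id" in record
        and isinstance(vector, dict)
        and len(vector) > 0
        and all(
            isinstance(w, int) and not isinstance(w, bool) and w > 0
            for w in vector.values()
        )
    )


# Hypothetical record: the terms are wordpiece-style tokens that should be
# indexed as-is (hence -pretokenized), and the integers are stored directly
# as term weights rather than raw term frequencies (hence -impact).
example = json.dumps(
    {
        "id": "0",
        "contents": "example passage text",
        "vector": {"example": 12, "passage": 7, "##s": 3},
    }
)
print(is_valid_record(example))
```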
Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around 15 minutes.
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-passage-unicoil-d2q in the command below.
We can now run retrieval:
python -m pyserini.search --topics msmarco-passage-dev-subset \
--encoder castorini/unicoil-d2q-msmarco-passage \
--index indexes/lucene-index.msmarco-passage.unicoil-b8 \
--output runs/run.msmarco-passage.unicoil-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
--output-format msmarco
Here, we are using the transformer model to encode the queries on the fly using the CPU.
Note that the important option here is --impact, where we specify impact scoring.
With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that we're adding neural inference on the CPU.
A complete run typically takes around 30 minutes.
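As a conceptual illustration of impact scoring, the score of a document can be thought of as the sum, over terms shared with the query, of the query weight times the document weight. The sketch below is a toy in-memory version of that idea, not how Anserini or Lucene implement it; the function names and the example weights are made up.

```python
def impact_score(query_weights, doc_weights):
    """Sum of query weight times document weight over terms present in both."""
    return sum(
        q_weight * doc_weights[term]
        for term, q_weight in query_weights.items()
        if term in doc_weights
    )


def rank(query_weights, docs, k=10):
    """Score every document, drop zero scores, and return the top k (docid, score) pairs."""
    scored = [(docid, impact_score(query_weights, vec)) for docid, vec in docs.items()]
    scored = [(docid, score) for docid, score in scored if score > 0]
    # Sort by descending score, breaking ties by docid for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))[:k]


# Made-up weights purely for illustration.
query = {"a": 2, "b": 1}
docs = {
    "d1": {"a": 3},
    "d2": {"a": 1, "b": 5},
    "d3": {"c": 9},
}
print(rank(query, docs))
```

A real engine reaches the same ranking by traversing postings lists rather than scoring every document.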
The output is in MS MARCO output format, so we can directly evaluate:
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-b8.tsv
The results should be something along these lines:
#####################
MRR @10: 0.3508734138354477
QueriesRanked: 6980
#####################
There might be small differences in score due to non-determinism in neural inference; see these notes for detail. The above score was obtained on Linux.
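For readers unfamiliar with the metric, the following sketch shows how MRR@10 can be computed from a run. It assumes each run line has the form qid, docid, rank separated by tabs, and that relevance judgments are available as a mapping from qid to a set of relevant docids. The function names are hypothetical, and this is an illustration of the metric rather than a replacement for the evaluation script used above.

```python
from collections import defaultdict


def parse_run(lines):
    """Parse 'qid<TAB>docid<TAB>rank' lines into {qid: [(rank, docid), ...]}."""
    run = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        qid, docid, rank = line.split("\t")
        run[qid].append((int(rank), docid))
    return run


def mrr_at_k(run, qrels, k=10):
    """Mean over judged queries of 1/rank of the first relevant hit in the top k."""
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in sorted(run.get(qid, [])):
            if rank > k:
                break
            if docid in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels)


# Toy data: q1 has its relevant passage at rank 2, q2 has none retrieved.
toy_run = parse_run(["q1\td1\t1", "q1\td2\t2", "q2\td3\t1", "q2\td4\t2"])
toy_qrels = {"q1": {"d2"}, "q2": {"d9"}}
print(mrr_at_k(toy_run, toy_qrels))
```

Queries with no relevant passage in the top k contribute zero, so they lower the average rather than being skipped.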
Alternatively, we can use pre-tokenized queries with pre-computed weights. First, fetch the MS MARCO passage ranking dev set queries:
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-passage.dev-subset.unicoil.tsv.gz -P collections/
wget https://vault.cs.uwaterloo.ca/s/QGoHeBm4YsAgt6H/download -O collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz
The MD5 checksum of the topics file is 1af1da05ae5fe0b9d8ddf2d143b6e7f8.
We can now run retrieval:
python -m pyserini.search --topics collections/topics.msmarco-passage.dev-subset.unicoil.tsv.gz \
--index indexes/lucene-index.msmarco-passage.unicoil-b8 \
--output runs/run.msmarco-passage.unicoil-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
--output-format msmarco
Here, we also specify --impact for impact scoring.
Since we're not applying neural inference over the queries, retrieval is faster, typically taking less than 10 minutes.
The output is in MS MARCO output format, so we can directly evaluate:
python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.unicoil-b8.tsv
The results should be as follows:
#####################
MRR @10: 0.35155222404147896
QueriesRanked: 6980
#####################
Note that in this case, the results should be deterministic.
The rest of this guide covers MS MARCO document ranking. As with passage ranking, you can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset (segmented into passages) with uniCOIL processing:
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -P collections/
wget https://vault.cs.uwaterloo.ca/s/ZmF6SKpgMZJYXd6/download -O collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar
tar xvf collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar -C collections/
To confirm, msmarco-doc-per-passage-expansion-unicoil-d2q-b8.tar should have MD5 checksum 88f365b148c7702cf30c0fb95af35149.
We can now index these docs:
python -m pyserini.index -collection JsonVectorCollection \
-input collections/msmarco-doc-per-passage-expansion-unicoil-d2q-b8/ \
-index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
-generator DefaultLuceneDocumentGenerator -impact -pretokenized \
-threads 12
The important indexing options to note here are -impact -pretokenized: the first tells Anserini not to encode BM25 doclengths into Lucene's norms (encoding them is the default behavior), and the second tells it not to apply any additional tokenization to the uniCOIL tokens.
The indexing speed may vary; on a modern desktop with an SSD (using 12 threads, per above), indexing takes around an hour.
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-doc-per-passage-unicoil-d2q in the command below.
We can now run retrieval:
python -m pyserini.search --topics msmarco-doc-dev \
--encoder castorini/unicoil-d2q-msmarco-passage \
--index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
--output runs/run.msmarco-doc.unicoil-d2q-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
--max-passage --max-passage-hits 100 \
--output-format msmarco
Here, we are using the transformer model to encode the queries on the fly using the CPU.
Note that the important option here is --impact, where we specify impact scoring.
With these impact scores, query evaluation is already slower than bag-of-words BM25; on top of that we're adding neural inference on the CPU.
A complete run can take around 40 minutes.
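The retrieval command above also passes --max-passage --max-passage-hits 100, because the index contains passage-level segments while evaluation is at the document level. A minimal sketch of that kind of max-passage aggregation follows. It assumes segment ids have the form docid, a "#" delimiter, then a segment number, which is an assumption you should check against the collection; the function name and the toy hits are hypothetical.

```python
def max_passage(hits, delimiter="#", max_hits=100):
    """Collapse (segment_id, score) hits to one score per document: the max over its segments."""
    best = {}
    for segment_id, score in hits:
        docid = segment_id.split(delimiter)[0]
        if docid not in best or score > best[docid]:
            best[docid] = score
    # Rank documents by their best segment score, ties broken by docid.
    ranked = sorted(best.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:max_hits]


# Toy segment-level hits; ids and scores are made up.
segment_hits = [("D1#0", 5.0), ("D2#3", 7.0), ("D1#2", 9.0), ("D3#1", 1.0)]
print(max_passage(segment_hits))
```

Retrieving 1000 segments before collapsing leaves enough candidates to fill 100 distinct documents, which matches the MRR@100 evaluation used for this task.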
The output is in MS MARCO output format, so we can directly evaluate:
python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev --run runs/run.msmarco-doc.unicoil-d2q-b8.tsv
The results should be something along these lines:
#####################
MRR @100: 0.3530641289682811
QueriesRanked: 5193
#####################
There might be small differences in score due to non-determinism in neural inference; see these notes for detail. The above score was obtained on Linux.
Alternatively, we can use pre-tokenized queries with pre-computed weights. First, fetch the MS MARCO document ranking dev set queries:
# Alternate mirrors of the same data, pick one:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/topics.msmarco-doc.dev.unicoil.tsv.gz -P collections/
wget https://vault.cs.uwaterloo.ca/s/6D5JtJQxYpPbByM/download -O collections/topics.msmarco-doc.dev.unicoil.tsv.gz
The MD5 checksum of the topics file is 40e5f64500272ecde270e55beecd5e94.
We can now run retrieval:
python -m pyserini.search --topics collections/topics.msmarco-doc.dev.unicoil.tsv.gz \
--index indexes/lucene-index.msmarco-doc.unicoil-d2q-b8 \
--output runs/run.msmarco-doc.unicoil-d2q-b8.tsv \
--impact \
--hits 1000 --batch 36 --threads 12 \
--max-passage --max-passage-hits 100 \
--output-format msmarco
Here, we also specify --impact for impact scoring.
Since we're not applying neural inference over the queries, retrieval is faster, typically taking less than 10 minutes.
The output is in MS MARCO output format, so we can directly evaluate:
python -m pyserini.eval.msmarco_doc_eval --judgments msmarco-doc-dev --run runs/run.msmarco-doc.unicoil-d2q-b8.tsv
The results should be as follows:
#####################
MRR @100: 0.352997702662614
QueriesRanked: 5193
#####################
Note that in this case, the results should be deterministic.
Reproduction Log
- Results reproduced by @ArthurChen189 on 2021-07-13 (commit 228d5c9)
- Results reproduced by @lintool on 2021-07-14 (commit ed88e4c)
- Results reproduced by @lintool on 2021-09-17 (commit 79eb5cf)
- Results reproduced by @mayankanand007 on 2021-09-18 (commit 331dfe7)
- Results reproduced by @apokali on 2021-09-23 (commit 82f8422)