This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:
Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.
For uniCOIL, we make available for download both the processed corpora (sparse vectors) and the pre-built indexes.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on the V2 data or to finish the doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using a model trained on the MS MARCO (V1) passage corpus.
Here, we start from the MS MARCO V2 passage corpus that has already been processed with uniCOIL, i.e., gone through term reweighting. As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
Download the sparse representation of the corpus generated by uniCOIL:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_passage_unicoil_noexp_0shot.tar -C collections/
To confirm, msmarco_v2_passage_unicoil_noexp_0shot.tar is 24 GB and has an MD5 checksum of d9cc1ed3049746e68a2c91bf90e5212d.
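To verify the integrity of the download before unpacking, a quick check along these lines (md5sum from coreutils; on macOS, md5 -r prints equivalent output):

md5sum collections/msmarco_v2_passage_unicoil_noexp_0shot.tar
# expect: d9cc1ed3049746e68a2c91bf90e5212d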
Index the sparse vectors:
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_noexp_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
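As a quick sanity check on the freshly built index, you can inspect its statistics from Python; a minimal sketch, assuming Pyserini's IndexReader API:

from pyserini.index.lucene import IndexReader

# Report document count, unique terms, etc. for the impact index we just built.
reader = IndexReader('indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/')
print(reader.stats())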
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-noexp-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
--topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt \
--batch 144 --threads 36 \
--hits 1000 \
--impact
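If you just want to spot-check a query or two interactively, the same retrieval can be run through Pyserini's Python API; a minimal sketch, assuming the LuceneImpactSearcher interface and the pre-built index (the query is an arbitrary example):

from pyserini.search.lucene import LuceneImpactSearcher

# Pre-built index plus on-the-fly query encoding with the zero-shot uniCOIL model.
searcher = LuceneImpactSearcher.from_prebuilt_index(
    'msmarco-v2-passage-unicoil-noexp-0shot',
    'castorini/unicoil-noexp-msmarco-passage')

hits = searcher.search('how long is the life cycle of a flea')  # arbitrary example query
for i, hit in enumerate(hits[:10]):
    print(f'{i + 1:2} {hit.docid:25} {hit.score:.4f}')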
To evaluate using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt
Results:
map all 0.1334
recip_rank all 0.1343
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt
Results:
recall_100 all 0.4983
recall_1000 all 0.7010
Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil-noexp.
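Concretely, that variant would look something like the following sketch: the --encoder flag is dropped, so the pre-tokenized queries in the topics file are used as-is (the output filename here is an arbitrary choice):

python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
--topics msmarco-v2-passage-dev-unicoil-noexp \
--output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.pre-encoded.txt \
--batch 144 --threads 36 \
--hits 1000 \
--impact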
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
After the TREC 2021 Deep Learning Track submissions, we were able to complete the doc2query-T5 expansions. Here, we repeat the above steps with the expanded passage corpus.
Download the sparse representation of the corpus generated by uniCOIL:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_passage_unicoil_0shot.tar -C collections/
To confirm, msmarco_v2_passage_unicoil_0shot.tar is 41 GB and has an MD5 checksum of 1949a00bfd5e1f1a230a04bbc1f01539.
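Each record in the extracted corpus is a JSON document whose vector field maps terms to integer impact weights, i.e., the format consumed by JsonVectorCollection. A minimal sketch for peeking at one record, assuming the extracted files are JSONL (possibly gzipped):

import glob, gzip, json

# Grab the first file in the extracted corpus; exact filenames inside the tarball may vary.
path = sorted(glob.glob('collections/msmarco_v2_passage_unicoil_0shot/*'))[0]
opener = gzip.open if path.endswith('.gz') else open

with opener(path, 'rt') as f:
    doc = json.loads(f.readline())

print(doc['id'])
# Ten highest-weighted terms in the uniCOIL impact vector.
print(sorted(doc['vector'].items(), key=lambda kv: -kv[1])[:10])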
Index the sparse vectors:
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_passage_unicoil_0shot/ \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
--topics msmarco-v2-passage-dev \
--encoder castorini/unicoil-msmarco-passage \
--output runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt \
--batch 144 --threads 36 \
--hits 1000 \
--impact
To evaluate using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt
Results:
map all 0.1488
recip_rank all 0.1501
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt
Results:
recall_100 all 0.5515
recall_1000 all 0.7613
Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on the V2 data or to finish the doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner, using a model trained on the MS MARCO (V1) passage corpus.
Here, we start from the MS MARCO V2 segmented document corpus that has already been processed with uniCOIL, i.e., gone through term reweighting. As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
Download the sparse representation of the corpus generated by uniCOIL:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar -C collections/
To confirm, msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar is 54 GB and has an MD5 checksum of 28261587d6afde56efd8df4f950e7fb4.
Index the sparse vectors:
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-segmented-unicoil-noexp-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-noexp-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
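The --max-passage and --max-passage-hits options handle this aggregation automatically. For intuition, here is a rough sketch of the MaxP step applied to a segment-level ranking, assuming segment ids of the form docid#segment (the helper name is hypothetical):

from collections import defaultdict

# Hypothetical helper: collapse a segment-level ranking into a document-level one (MaxP).
def maxp(segment_hits, k=1000):
    # segment_hits: list of (segment_id, score) pairs, e.g., ('msmarco_doc_00_123#4', 12.7)
    best = defaultdict(lambda: float('-inf'))
    for segid, score in segment_hits:
        docid = segid.split('#')[0]            # strip the segment suffix
        best[docid] = max(best[docid], score)  # keep each document's best segment score
    return sorted(best.items(), key=lambda kv: -kv[1])[:k]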
To evaluate using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt
Results:
map all 0.2047
recip_rank all 0.2066
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt
Results:
recall_100 all 0.7198
recall_1000 all 0.8854
We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil-noexp.
You can skip the data prep and indexing steps if you use our pre-built indexes. Skip directly down to the "Retrieval" section below.
After the TREC 2021 Deep Learning Track submissions, we were able to complete the doc2query-T5 expansions. Here, we repeat the above steps with the expanded segmented document corpus.
Download the sparse representation of the corpus generated by uniCOIL:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot.tar -P collections/
tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot.tar -C collections/
To confirm, msmarco_v2_doc_segmented_unicoil_0shot.tar is 62 GB and has an MD5 checksum of 889db095113cc4fe152382ccff73304a.
Index the sparse vectors:
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_0shot/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-segmented-unicoil-0shot in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
To evaluate using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt
Results:
map all 0.2217
recip_rank all 0.2242
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt
Results:
recall_100 all 0.7556
recall_1000 all 0.9056
We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil.
A second version (v2) of the expanded segmented document corpus is also available. Download the sparse representation of the corpus generated by uniCOIL:
wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -P collections/
tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot_v2.tar -C collections/
To confirm, msmarco_v2_doc_segmented_unicoil_0shot_v2.tar is 72 GB and has an MD5 checksum of c5639748c2cbad0152e10b0ebde3b804.
Index the sparse vectors:
python -m pyserini.index.lucene \
--collection JsonVectorCollection \
--input collections/msmarco_v2_doc_segmented_unicoil_0shot_v2/ \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--generator DefaultLuceneDocumentGenerator \
--threads 32 \
--impact --pretokenized
If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-segmented-unicoil-0shot-v2 in the command below.
Sparse retrieval with uniCOIL:
python -m pyserini.search.lucene \
--index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot-v2/ \
--topics msmarco-v2-doc-dev \
--encoder castorini/unicoil-msmarco-passage \
--output runs/run.msmarco-doc-v2-segmented-unicoil-0shot-v2.dev.txt \
--batch 144 --threads 36 \
--hits 10000 --max-passage --max-passage-hits 1000 \
--impact
For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
To evaluate using trec_eval:
$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-0shot-v2.dev.txt
Results:
map all 0.2388
recip_rank all 0.2419
$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
runs/run.msmarco-doc-v2-segmented-unicoil-0shot-v2.dev.txt
Results:
recall_100 all 0.7789
recall_1000 all 0.9120
We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
These results differ slightly from the regressions in Anserini because here we are performing on-the-fly query encoding, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil.