Pyserini: uniCOIL w/ doc2query-T5 on MS MARCO V2

This page describes how to reproduce retrieval experiments with the uniCOIL model on the MS MARCO V2 collections. Details about our model can be found in the following paper:

Jimmy Lin and Xueguang Ma. A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques. arXiv:2106.14807.

For uniCOIL, we make the corpus (sparse vectors) as well as the pre-built indexes available to download.

Passage Ranking (No Expansion)

If you use our pre-built indexes, you can skip the data prep and indexing steps and go directly to the "Sparse retrieval with uniCOIL" step below.

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data and we did not have time to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner using a model trained on the MS MARCO (V1) passage corpus.

Here, we start from the MS MARCO V2 passage corpus that has already been processed with uniCOIL, i.e., gone through term reweighting. As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).
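To make "term reweighting" concrete, here is a minimal sketch of impact scoring. It assumes that each corpus record carries a `vector` field mapping tokens to integer weights and that a query-document score is the sum of products of matching query and document weights. The record, the query weights, and the `impact_score` helper are all hypothetical and only illustrate the idea; they are not taken from the actual corpus or from Pyserini.

```python
import json

# Hypothetical record in the style of one JSON line of the sparse corpus
# (toy tokens and weights, not real uniCOIL output).
record = json.loads(
    '{"id": "toy_passage_0", "vector": {"lucene": 12, "index": 7, "search": 3}}'
)

# Hypothetical weights that a query encoder might assign to query tokens.
query_weights = {"lucene": 5, "search": 2, "java": 4}


def impact_score(query, doc_vector):
    """Sum of query weight * document weight over tokens present in both."""
    return sum(w * doc_vector[t] for t, w in query.items() if t in doc_vector)


# "lucene" and "search" match; "java" does not appear in the document.
print(impact_score(query_weights, record["vector"]))  # 5*12 + 2*3 = 66
```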

Download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_noexp_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_noexp_0shot.tar -C collections/

To confirm, msmarco_v2_passage_unicoil_noexp_0shot.tar is 24 GB and has an MD5 checksum of d9cc1ed3049746e68a2c91bf90e5212d.
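If you prefer to check the download from Python, the following sketch computes the MD5 checksum in chunks so that the 24 GB file is never loaded into memory at once. The helper name `md5_of_file` and the chunk size are our own choices; the path and expected checksum are the ones given above.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


tar_path = "collections/msmarco_v2_passage_unicoil_noexp_0shot.tar"
expected = "d9cc1ed3049746e68a2c91bf90e5212d"

# Only check if the tarball has actually been downloaded.
if os.path.exists(tar_path):
    ok = md5_of_file(tar_path) == expected
    print("checksum OK" if ok else "checksum MISMATCH")
```

The same helper works for the other tarballs on this page if you substitute the corresponding path and checksum.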

Index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_passage_unicoil_noexp_0shot/ \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-noexp-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-noexp-0shot/ \
  --topics msmarco-v2-passage-dev \
  --encoder castorini/unicoil-noexp-msmarco-passage \
  --output runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 1000 \
  --impact

To evaluate, using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt

Results:
map                   	all	0.1334
recip_rank            	all	0.1343

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-noexp-0shot.dev.txt

Results:
recall_100            	all	0.4983
recall_1000           	all	0.7010

Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
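As an illustration of what a cutoff of 100 hits means for MRR, the sketch below computes the reciprocal rank of the first relevant hit within the top 100 and averages over all judged queries. This is a simplified stand-in with toy data, not the trec_eval implementation; the function names and the dictionary layout of `run` and `qrels` are assumptions made for this example.

```python
def reciprocal_rank(ranked_ids, relevant_ids, cutoff=100):
    """1/rank of the first relevant hit within the cutoff, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(run, qrels, cutoff=100):
    """Average over every judged query; queries without hits contribute 0."""
    total = sum(
        reciprocal_rank(run.get(qid, []), relevant, cutoff)
        for qid, relevant in qrels.items()
    )
    return total / len(qrels)


# Toy data: q1 finds its relevant passage at rank 2, q2 never finds one.
run = {"q1": ["d3", "d1", "d2"], "q2": ["d9", "d8"]}
qrels = {"q1": {"d1"}, "q2": {"d7"}}
print(mean_reciprocal_rank(run, qrels))  # (1/2 + 0) / 2 = 0.25
```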

These results differ slightly from the regressions in Anserini because here we encode queries on the fly, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil-noexp.

Passage Ranking (With doc2query-T5 Expansion)

If you use our pre-built indexes, you can skip the data prep and indexing steps and go directly to the "Sparse retrieval with uniCOIL" step below.

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.

Download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_passage_unicoil_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_passage_unicoil_0shot.tar -C collections/

To confirm, msmarco_v2_passage_unicoil_0shot.tar is 41 GB and has an MD5 checksum of 1949a00bfd5e1f1a230a04bbc1f01539.

Index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_passage_unicoil_0shot/ \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-passage-unicoil-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-v2-passage-unicoil-0shot/ \
  --topics msmarco-v2-passage-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 1000 \
  --impact

To evaluate, using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt

Results:
map                     all     0.1488
recip_rank              all     0.1501

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-passage-dev \
    runs/run.msmarco-v2-passage-unicoil-0shot.dev.txt

Results:
recall_100              all     0.5515
recall_1000             all     0.7613

Note that we evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
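To show why recall is reported at two depths, here is a small sketch of recall at k: the fraction of a query's relevant passages that appear in the top k hits. The helper `recall_at_k` and the toy ranking are hypothetical and only mirror the definition; the numbers above come from trec_eval.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items found among the top-k ranked items."""
    if not relevant_ids:
        return 0.0
    found = set(ranked_ids[:k]) & set(relevant_ids)
    return len(found) / len(relevant_ids)


# Toy ranking with three relevant passages, one of which is never retrieved.
ranked = ["d1", "d5", "d2", "d7"]
relevant = {"d1", "d2", "d9"}

print(recall_at_k(ranked, relevant, k=2))  # only d1 is found: 1 of 3
print(recall_at_k(ranked, relevant, k=4))  # d1 and d2 are found: 2 of 3
```

A deeper cutoff can only keep or increase recall, which is why recall at 1000 is the more relevant figure when the run feeds a reranker.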

These results differ slightly from the regressions in Anserini because here we encode queries on the fly, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-passage-dev-unicoil.

Document Ranking (No Expansion)

If you use our pre-built indexes, you can skip the data prep and indexing steps and go directly to the "Sparse retrieval with uniCOIL" step below.

For the TREC 2021 Deep Learning Track, we did not have time to train a new uniCOIL model on V2 data and we did not have time to finish doc2query-T5 expansions. Thus, we applied uniCOIL without expansions in a zero-shot manner using a model trained on the MS MARCO (V1) passage corpus.

Here, we start from the MS MARCO V2 segmented document corpus that has already been processed with uniCOIL, i.e., gone through term reweighting. As an alternative, we also make available pre-built indexes (in which case the indexing step can be skipped).

Download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar -C collections/

To confirm, msmarco_v2_doc_segmented_unicoil_noexp_0shot.tar is 54 GB and has an MD5 checksum of 28261587d6afde56efd8df4f950e7fb4.

Index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_doc_segmented_unicoil_noexp_0shot/ \
  --index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-segmented-unicoil-noexp-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-noexp-0shot/ \
  --topics msmarco-v2-doc-dev \
  --encoder castorini/unicoil-noexp-msmarco-passage \
  --output runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 10000 --max-passage --max-passage-hits 1000 \
  --impact

For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.
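The MaxP step can be sketched as follows: each document is scored by its best-scoring segment, and the documents are then ranked by that score. This is an illustration only, not Pyserini's implementation. It assumes segment ids have the form `<docid>#<segment number>`; the `max_passage` helper and the toy ids and scores are hypothetical.

```python
def max_passage(segment_hits, max_docs=1000, delimiter="#"):
    """Collapse (segment_id, score) hits to per-document maxima, best first."""
    best = {}
    for segment_id, score in segment_hits:
        # Assumed id layout: everything before the delimiter is the docid.
        doc_id = segment_id.split(delimiter)[0]
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:max_docs]


# Toy segment ranking: docA and docB each appear twice.
hits = [
    ("docA#2", 9.5),
    ("docB#0", 8.0),
    ("docA#0", 7.1),
    ("docC#4", 6.3),
    ("docB#3", 5.9),
]
print(max_passage(hits, max_docs=2))  # [('docA', 9.5), ('docB', 8.0)]
```

Retrieving many more segments than the number of documents wanted leaves room for several segments of the same document to collapse into one entry.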

To evaluate, using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt

Results:
map                   	all	0.2047
recip_rank            	all	0.2066

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-noexp-0shot.dev.txt

Results:
recall_100            	all	0.7198
recall_1000           	all	0.8854

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.
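For reference, average precision at a cutoff can be sketched as below: precision is accumulated at each rank where a relevant document appears within the top 100, then divided by the total number of relevant documents for the query. The helper `average_precision` and the toy data are hypothetical; the reported numbers come from trec_eval.

```python
def average_precision(ranked_ids, relevant_ids, cutoff=100):
    """Average precision over the top `cutoff` hits for a single query."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Divide by all relevant documents, retrieved or not.
    return precision_sum / len(relevant_ids)


# Relevant documents at ranks 1 and 3; a third one is never retrieved.
ranked = ["d1", "d4", "d2", "d8"]
relevant = {"d1", "d2", "d9"}
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 3
```

For a query with exactly one relevant document, this quantity equals the reciprocal rank of that document.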

These results differ slightly from the regressions in Anserini because here we encode queries on the fly, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil-noexp.

Document Ranking (With doc2query-T5 Expansion)

If you use our pre-built indexes, you can skip the data prep and indexing steps and go directly to the "Sparse retrieval with uniCOIL" step below.

After the TREC 2021 Deep Learning Track submissions, we were able to complete doc2query-T5 expansions.

Download the sparse representation of the corpus generated by uniCOIL:

wget https://rgw.cs.uwaterloo.ca/JIMMYLIN-bucket0/data/msmarco_v2_doc_segmented_unicoil_0shot.tar -P collections/

tar -xvf collections/msmarco_v2_doc_segmented_unicoil_0shot.tar -C collections/

To confirm, msmarco_v2_doc_segmented_unicoil_0shot.tar is 62 GB and has an MD5 checksum of 889db095113cc4fe152382ccff73304a.

Index the sparse vectors:

python -m pyserini.index.lucene \
  --collection JsonVectorCollection \
  --input collections/msmarco_v2_doc_segmented_unicoil_0shot/ \
  --index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot/ \
  --generator DefaultLuceneDocumentGenerator \
  --threads 32 \
  --impact --pretokenized

If you've skipped the data prep and indexing steps and wish to directly use our pre-built indexes, use --index msmarco-v2-doc-segmented-unicoil-0shot in the command below.

Sparse retrieval with uniCOIL:

python -m pyserini.search.lucene \
  --index indexes/lucene-index.msmarco-doc-v2-segmented-unicoil-0shot/ \
  --topics msmarco-v2-doc-dev \
  --encoder castorini/unicoil-msmarco-passage \
  --output runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt \
  --batch 144 --threads 36 \
  --hits 10000 --max-passage --max-passage-hits 1000 \
  --impact

For the document corpus, since we are searching the segmented version, we retrieve the top 10k segments and perform MaxP to obtain the top 1000 documents.

To evaluate, using trec_eval:

$ python -m pyserini.eval.trec_eval -c -M 100 -m map -m recip_rank msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt

Results:
map                     all     0.2217
recip_rank              all     0.2242

$ python -m pyserini.eval.trec_eval -c -m recall.100,1000 msmarco-v2-doc-dev \
    runs/run.msmarco-doc-v2-segmented-unicoil-0shot.dev.txt

Results:
recall_100              all     0.7556
recall_1000             all     0.9056

We evaluate MAP and MRR at a cutoff of 100 hits to match the official evaluation metrics. However, we measure recall at both 100 and 1000 hits; the latter is a common setting for reranking.

These results differ slightly from the regressions in Anserini because here we encode queries on the fly, whereas the Anserini regressions use pre-encoded queries. To reproduce the Anserini results, use pre-encoded queries with --topics msmarco-v2-doc-dev-unicoil.
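One way to see how much two runs differ (for example, one with on-the-fly query encoding and one with pre-encoded queries) is to compare their top-k hits per query. The sketch below assumes both run files use the standard six-column TREC format (`qid Q0 docid rank score tag`); the helper names and the toy lines are hypothetical.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Map each query id to its docids ordered by rank."""
    hits = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _q0, doc_id, rank, _score, _tag = parts
        hits[qid].append((int(rank), doc_id))
    return {qid: [doc for _, doc in sorted(h)] for qid, h in hits.items()}


def topk_overlap(run_a, run_b, k=10):
    """Per-query fraction of the top-k slots that both runs share."""
    shared_queries = run_a.keys() & run_b.keys()
    return {
        qid: len(set(run_a[qid][:k]) & set(run_b[qid][:k])) / k
        for qid in shared_queries
    }


# Toy runs; with real files, pass open(path) instead of these lists.
lines_a = ["q1 Q0 d1 1 9.0 runA", "q1 Q0 d2 2 8.0 runA"]
lines_b = ["q1 Q0 d2 1 9.1 runB", "q1 Q0 d3 2 7.0 runB"]
overlap = topk_overlap(parse_trec_run(lines_a), parse_trec_run(lines_b), k=2)
print(overlap)  # {'q1': 0.5}
```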

Reproduction Log