Pyserini: BM25 Baseline for MS MARCO Passage Retrieval

This guide contains instructions for running BM25 baselines on the MS MARCO passage ranking task. It is nearly identical to a similar guide in Anserini, except that everything here is in Python (no Java). Note that there is a separate guide for the MS MARCO document ranking task.

Setup Note: If you're instantiating an Ubuntu VM locally or in the cloud (such as AWS or GCP), provision enough resources up front: more than 6 GB of RAM and roughly 100 GB of SSD storage (70 to 80 GB can also suffice for this task). This avoids having to go back and reconfigure the machine repeatedly. A configuration that works for Anserini on this task will work for Pyserini as well.

Data Prep

This guide requires the development installation for additional resources that are not shipped with the Python module; for the (more limited) runs that work directly from the Python module installed via pip, see this guide.

We're going to use collections/msmarco-passage/ as the working directory. First, we need to download and extract the MS MARCO passage dataset:

mkdir collections/msmarco-passage

wget https://msmarco.blob.core.windows.net/msmarcoranking/collectionandqueries.tar.gz -P collections/msmarco-passage

# Alternative mirror:
# wget https://www.dropbox.com/s/9f54jg2f71ray3b/collectionandqueries.tar.gz -P collections/msmarco-passage

tar xvfz collections/msmarco-passage/collectionandqueries.tar.gz -C collections/msmarco-passage

To confirm, collectionandqueries.tar.gz should have MD5 checksum of 31644046b18952c1386cd4564ba2ae69.
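The checksum can also be verified from Python. The following is a minimal sketch using only the standard library; the helper name `md5_of_file` is made up for illustration and is not part of Pyserini.

```python
import hashlib

# Checksum quoted in this guide for collectionandqueries.tar.gz.
EXPECTED_MD5 = "31644046b18952c1386cd4564ba2ae69"


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage (after the download completes):
#   path = "collections/msmarco-passage/collectionandqueries.tar.gz"
#   assert md5_of_file(path) == EXPECTED_MD5
```

Reading in chunks keeps memory use flat even though the tarball is large.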

Next, we need to convert the MS MARCO tsv collection into Anserini's jsonl files (which have one json object per line):

python tools/scripts/msmarco/convert_collection_to_jsonl.py \
 --collection-path collections/msmarco-passage/collection.tsv \
 --output-folder collections/msmarco-passage/collection_jsonl

The above script should generate 9 jsonl files in collections/msmarco-passage/collection_jsonl, each with 1M lines (except for the last one, which should have 841,823 lines).
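To make the conversion concrete, here is a rough sketch of what happens to each line. It is not the actual `convert_collection_to_jsonl.py` script; it assumes each tsv line is `docid<TAB>passage` and that each output object carries an `id` and a `contents` field. The helper names are hypothetical.

```python
import json


def tsv_line_to_json(line):
    """Convert one 'docid<TAB>passage' line into a JSON string."""
    doc_id, text = line.rstrip("\n").split("\t", 1)
    return json.dumps({"id": doc_id, "contents": text})


def shard_sizes(total_docs, docs_per_file=1_000_000):
    """Line counts per output file when documents are split sequentially."""
    full, rest = divmod(total_docs, docs_per_file)
    return [docs_per_file] * full + ([rest] if rest else [])


sizes = shard_sizes(8_841_823)
print(len(sizes), sizes[-1])  # 9 841823
```

The second helper reproduces the arithmetic behind the expected file count: eight full files of 1M lines plus one final file with the remaining 841,823 lines.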

We can now index these docs as a JsonCollection using Anserini:


python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input collections/msmarco-passage/collection_jsonl \
  --index indexes/lucene-index-msmarco-passage \
  --generator DefaultLuceneDocumentGenerator \
  --threads 9 \
  --storePositions --storeDocvectors --storeRaw

Upon completion, we should have an index with 8,841,823 documents. The indexing speed may vary; on a modern desktop with an SSD, indexing takes a couple of minutes.

Performing Retrieval on the Dev Queries

The 6980 queries in the development set are already stored in the repo. Let's take a peek:

$ head tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
1048585	what is paula deen's brother
2	 Androgen receptor define
524332	treating tension headaches without medication
1048642	what is paranoid sc
524447	treatment of varicose veins in legs
786674	what is prime rate in canada
1048876	who plays young dr mallard on ncis
1048917	what is operating system misconfiguration
786786	what is priority pass
524699	tricare service number

$ wc tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
    6980   48335  290193 tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt
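The listing above suggests one query per line, with the id and the query text separated by a tab. As an illustration only (this is a standard-library sketch, not Pyserini's own loader, and `load_topics` is a hypothetical name), such a file could be read like this:

```python
def load_topics(lines):
    """Parse 'qid<TAB>query' lines into a dict mapping int qid -> query text."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        qid, query = line.split("\t", 1)
        topics[int(qid)] = query.strip()  # some queries carry stray whitespace
    return topics


# Usage:
#   with open("tools/topics-and-qrels/topics.msmarco-passage.dev-subset.txt") as f:
#       topics = load_topics(f)
#   print(len(topics))
```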

Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:

python -m pyserini.search.lucene \
  --index indexes/lucene-index-msmarco-passage \
  --topics msmarco-passage-dev-subset \
  --output runs/run.msmarco-passage.bm25tuned.txt \
  --output-format msmarco \
  --hits 1000 \
  --bm25 --k1 0.82 --b 0.68

Here, we set the BM25 parameters to k1=0.82, b=0.68 (tuned by grid search). The option --output-format msmarco says to generate output in the MS MARCO output format. The option --hits specifies the number of documents to return per query. Thus, the output file should have approximately 6980 × 1000 = 6.98M lines (slightly fewer if some queries return fewer than 1000 hits).
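For intuition about what k1 and b control, the sketch below scores a single query term using a common textbook form of BM25. This is an assumption for illustration: the exact formula inside Lucene may differ in details such as constant factors and document-length encoding, and the function name is hypothetical.

```python
import math


def bm25_term_score(tf, df, doc_len, avg_doc_len, num_docs, k1=0.82, b=0.68):
    """Textbook BM25 contribution of one term to one document's score.

    tf: term frequency in the document; df: number of documents containing
    the term; k1 controls term-frequency saturation; b controls how strongly
    scores are normalized by document length.
    """
    idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)
```

With b=0 the document length has no effect; with larger k1, repeated occurrences of a term keep adding to the score for longer before saturating.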

Retrieval speed will vary by hardware: on a reasonably modern CPU with an SSD, we might get around 13 qps (queries per second), and so the entire run should finish in under ten minutes (using a single thread). We can perform multi-threaded retrieval by using the --threads and --batch-size arguments. For example, with --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run will finish in a couple of minutes.
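The single-threaded estimate follows directly from the query count and throughput. A quick back-of-the-envelope check, assuming the 13 qps figure holds on your hardware:

```python
def estimated_seconds(num_queries, qps):
    """Wall-clock estimate for a run at a constant query throughput."""
    return num_queries / qps


secs = estimated_seconds(6980, 13)
print(f"{secs:.0f} s (~{secs / 60:.1f} min)")  # 537 s (~8.9 min)
```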

After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:

$ python tools/scripts/msmarco/msmarco_passage_eval.py \
   tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt runs/run.msmarco-passage.bm25tuned.txt

#####################
MRR @10: 0.18741227770955546
QueriesRanked: 6980
#####################
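For readers unfamiliar with the metric, MRR@10 averages, over all queries, the reciprocal rank of the first relevant passage within the top 10 (or 0 if none appears). The sketch below is a simplified illustration, not the official evaluation script; the data layout (dicts keyed by query id) and the function name are assumptions.

```python
def mrr_at_k(run, qrels, k=10):
    """Mean reciprocal rank at cutoff k.

    run:   {qid: [docid, ...]} with docids in ranked order
    qrels: {qid: set of relevant docids}
    Queries with no relevant hit in the top k contribute 0.
    """
    if not qrels:
        return 0.0
    total = 0.0
    for qid, relevant in qrels.items():
        for rank, docid in enumerate(run.get(qid, [])[:k], start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels)
```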

We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@10. For that we first need to convert the run file into TREC format:

python -m pyserini.eval.convert_msmarco_run_to_trec_run \
   --input runs/run.msmarco-passage.bm25tuned.txt \
   --output runs/run.msmarco-passage.bm25tuned.trec

python tools/scripts/msmarco/convert_msmarco_to_trec_qrels.py \
   --input tools/topics-and-qrels/qrels.msmarco-passage.dev-subset.txt \
   --output collections/msmarco-passage/qrels.dev.small.trec
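To illustrate what the run conversion involves, the sketch below maps one MS MARCO run line (assumed here to be `qid<TAB>docid<TAB>rank`) to the six-column TREC run format (`qid Q0 docid rank score tag`). It is not the Pyserini converter itself: the function name, the run tag, and the choice of 1/rank as the score are assumptions. Since trec_eval orders results by score, any score that decreases with rank preserves the original ordering.

```python
def msmarco_to_trec_line(line, tag="sketch"):
    """Convert 'qid<TAB>docid<TAB>rank' into a TREC-format run line."""
    qid, docid, rank = line.split()
    score = 1.0 / int(rank)  # assumption: any score decreasing in rank works
    return f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"


print(msmarco_to_trec_line("1048585\t7187158\t1"))
# 1048585 Q0 7187158 1 1.000000 sketch
```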

And then run the trec_eval tool:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap \
   collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec

map                   	all	0.1957
recall_1000           	all	0.8573

Average precision or AP (also called mean average precision, MAP) and recall@1000 (recall at rank 1000) are the two metrics we care about the most. AP captures aspects of both precision and recall in a single metric, and is the most common metric used by information retrieval researchers. On the other hand, recall@1000 provides an upper bound on the effectiveness of downstream reranking modules (i.e., rerankers are useless if there isn't a relevant document in the results).
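As a small worked illustration of the two metrics for a single query, consider the sketch below. It uses the standard definitions rather than trec_eval's actual code, and the function names are hypothetical; MAP is the mean of the per-query AP values.

```python
def average_precision(ranked, relevant):
    """AP for one query: mean of precision values at each relevant hit,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docid in enumerate(ranked, start=1):
        if docid in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def recall_at_k(ranked, relevant, k=1000):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


ranked = ["a", "b", "c", "d"]
relevant = {"a", "c", "z"}
print(round(average_precision(ranked, relevant), 4))  # 0.5556
print(round(recall_at_k(ranked, relevant), 4))        # 0.6667
```

Here "z" is never retrieved, so no reranker operating on this list could surface it, which is why recall bounds downstream effectiveness.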

Reproduction Log