This guide contains instructions for running BM25 baselines on the MS MARCO document ranking task, which is nearly identical to a similar guide in Anserini, except that everything is in Python here (no Java). Note that there is a separate guide for the MS MARCO passage ranking task.
Setup Note: If you're instantiating an Ubuntu VM on your system or in the cloud (AWS or GCP), provision enough resources up front (RAM > 8 GB and storage > 100 GB, SSD), since tasks such as building the index can take some time to finish. This avoids having to go back and fix the machine configuration repeatedly. A configuration that works for Anserini on this task will work with Pyserini as well.
The guide requires the development installation for additional resources that are not shipped with the Python module; for the (more limited) runs that work directly from the Python module installed via pip, see this guide.
We're going to use the repository's root directory as the working directory. First, we need to download and extract the MS MARCO document dataset:
mkdir collections/msmarco-doc
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docs.trec.gz -P collections/msmarco-doc
# Alternative mirror:
# wget https://www.dropbox.com/s/w6caao3sfx9nluo/msmarco-docs.trec.gz -P collections/msmarco-doc
To confirm, msmarco-docs.trec.gz should have MD5 checksum of d4863e4f342982b51b9a8fc668b2d0c0.
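Since everything in this guide is in Python, the checksum can also be verified without an external tool. The following is a minimal sketch using only the standard library; the helper name `md5_of_file` is hypothetical, and the path assumes the download location used above.

```python
import hashlib
from pathlib import Path

EXPECTED_MD5 = "d4863e4f342982b51b9a8fc668b2d0c0"  # checksum quoted in this guide


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if __name__ == "__main__":
    # Assumed location: the directory used in the wget command above.
    archive = Path("collections/msmarco-doc/msmarco-docs.trec.gz")
    if archive.exists():
        actual = md5_of_file(archive)
        print("OK" if actual == EXPECTED_MD5 else f"MISMATCH: {actual}")
    else:
        print(f"{archive} not found; download it first")
```

Reading in chunks keeps memory use flat even though the archive is several gigabytes.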
There's no need to uncompress the file, as Anserini can directly index gzipped files. Build the index with the following command:
python -m pyserini.index.lucene \
--collection CleanTrecCollection \
--input collections/msmarco-doc \
--index indexes/lucene-index-msmarco-doc \
--generator DefaultLuceneDocumentGenerator \
--threads 1 \
--storePositions --storeDocvectors --storeRaw
On a modern desktop with an SSD, indexing takes around 40 minutes. There should be a total of 3,213,835 documents indexed.
The 5193 queries in the development set are already stored in the repo. Let's take a peek:
$ head tools/topics-and-qrels/topics.msmarco-doc.dev.txt
174249 does xpress bet charge to deposit money in your account
320792 how much is a cost to run disneyland
1090270 botulinum definition
1101279 do physicians pay for insurance from their salaries?
201376 here there be dragons comic
54544 blood diseases that are sexually transmitted
118457 define bona fides
178627 effects of detox juice cleanse
1101278 do prince harry and william have last names
68095 can hives be a sign of pregnancy
$ wc tools/topics-and-qrels/topics.msmarco-doc.dev.txt
5193 35787 220304 tools/topics-and-qrels/topics.msmarco-doc.dev.txt
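A minimal sketch of reading this file with only the standard library, assuming the two columns are separated by a tab. `parse_topics` is a hypothetical helper name for illustration, not a Pyserini function.

```python
import io


def parse_topics(lines):
    """Parse tab-delimited (query id, query) lines into a dict keyed by query id."""
    topics = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, in case the query text contains tabs.
        qid, query = line.split("\t", 1)
        topics[qid] = query
    return topics


# Two sample lines copied from the listing above, with an explicit tab separator.
sample = io.StringIO(
    "174249\tdoes xpress bet charge to deposit money in your account\n"
    "1090270\tbotulinum definition\n"
)
topics = parse_topics(sample)
for qid, query in topics.items():
    print(qid, "->", query)
```

To read the real file, pass an open file handle for tools/topics-and-qrels/topics.msmarco-doc.dev.txt instead of the in-memory sample.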
Each line contains a tab-delimited (query id, query) pair. Conveniently, Pyserini already knows how to load and iterate through these pairs. We can now perform retrieval using these queries:
python -m pyserini.search.lucene \
--index indexes/lucene-index-msmarco-doc \
--topics msmarco-doc-dev \
--output runs/run.msmarco-doc.bm25tuned.txt \
--output-format msmarco \
--hits 100 \
--bm25 --k1 4.46 --b 0.82
Here, we set the BM25 parameters to k1=4.46, b=0.82 (tuned by grid search).
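To give some intuition for what k1 and b control, here is a sketch of the textbook BM25 contribution of a single query term. This is an illustration under stated assumptions, not the scoring code that Lucene runs (Lucene's implementation differs in details such as how document lengths are encoded). The function name and the example statistics are made up.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=4.46, b=0.82):
    """Textbook BM25 contribution of one query term to one document's score."""
    # Inverse document frequency: rarer terms get more weight.
    idf = math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # Term-frequency saturation: a larger k1 lets repeated terms keep adding score.
    return idf * (tf * (k1 + 1.0)) / (tf + k1 * norm)


# Hypothetical statistics: a term found in 1,000 of 3,213,835 documents,
# scored against a document of average length.
for tf in (1, 2, 4, 8):
    score = bm25_term_score(tf, doc_len=1000, avg_doc_len=1000,
                            n_docs=3_213_835, doc_freq=1000)
    print(tf, round(score, 3))
```

With a relatively large k1 such as 4.46, the score keeps growing noticeably as the term frequency rises, which plausibly suits long documents; b=0.82 penalizes documents that are longer than average.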
The option --output-format msmarco says to generate output in the MS MARCO output format.
The option --hits specifies the number of documents to return per query.
Note that for the MS MARCO Document Ranking Leaderboard, the official metric is MRR@100, so submissions should only return 100 hits per query.
Retrieval speed will vary by hardware:
On a reasonably modern CPU with an SSD, we might get around 18 qps (queries per second), and so the entire run should finish in under five minutes (using a single thread).
We can perform multi-threaded retrieval by using the --threads and --batch-size arguments.
For example, with --threads 16 --batch-size 64 on a CPU with sufficient cores, the entire run will finish in under a minute.
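The timing estimates follow from simple arithmetic on the figures quoted in this guide. A short sketch (the throughput number is the approximate one given above and will differ across machines):

```python
NUM_QUERIES = 5193       # size of the development set, from this guide
SINGLE_THREAD_QPS = 18   # approximate figure quoted above; hardware dependent

# Single-threaded wall-clock time for the whole run.
single_thread_seconds = NUM_QUERIES / SINGLE_THREAD_QPS
print(f"single thread: ~{single_thread_seconds / 60:.1f} minutes")

# Finishing in under 60 seconds requires at least this aggregate throughput.
required_qps = NUM_QUERIES / 60
print(f"under a minute needs more than {required_qps:.1f} qps overall")
```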
After the run finishes, we can evaluate the results using the official MS MARCO evaluation script:
$ python tools/scripts/msmarco/msmarco_doc_eval.py \
--judgments tools/topics-and-qrels/qrels.msmarco-doc.dev.txt \
--run runs/run.msmarco-doc.bm25tuned.txt
#####################
MRR @100: 0.2770296928568702
QueriesRanked: 5193
#####################
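For readers unfamiliar with the metric, the following sketch shows how MRR@100 is defined: for each query, take the reciprocal of the rank of the first relevant document within the top 100 (or 0 if there is none), then average over queries. It is not the official script. The function name and toy ids are hypothetical, and this version averages over the queries in the qrels, which may differ from how the official script handles missing queries.

```python
def mrr_at_k(qrels, run, k=100):
    """Mean reciprocal rank of the first relevant document within the top k.

    qrels: dict mapping query id -> set of relevant document ids
    run:   dict mapping query id -> list of document ids in rank order
    """
    total = 0.0
    for qid, relevant in qrels.items():
        ranked = run.get(qid, [])[:k]
        for rank, docid in enumerate(ranked, start=1):
            if docid in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(qrels) if qrels else 0.0


# Toy data with hypothetical ids: q1 has its relevant document at rank 2,
# q2 has no relevant document retrieved.
qrels = {"q1": {"d1"}, "q2": {"d5"}}
run = {"q1": ["d3", "d1"], "q2": ["d1", "d2"]}
print(mrr_at_k(qrels, run))  # (1/2 + 0) / 2 = 0.25
```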
We can also use the official TREC evaluation tool, trec_eval, to compute metrics other than MRR@100.
For that we first need to convert the run file into TREC format:
python -m pyserini.eval.convert_msmarco_run_to_trec_run \
--input runs/run.msmarco-doc.bm25tuned.txt \
--output runs/run.msmarco-doc.bm25tuned.trec
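Conceptually, the conversion maps each three-column MS MARCO line (query id, document id, rank) to the six-column TREC run layout (query id, Q0, document id, rank, score, run tag). The sketch below illustrates the idea. The reciprocal-rank score, the `anserini` tag, the helper name, and the sample document id are assumptions for illustration and are not guaranteed to match what the Pyserini converter writes.

```python
def msmarco_line_to_trec(line, tag="anserini"):
    """Convert one 'qid<TAB>docid<TAB>rank' line to a TREC run line."""
    qid, docid, rank = line.rstrip("\n").split("\t")
    # MS MARCO lines carry no score, so derive a decreasing one from the rank.
    score = 1.0 / int(rank)
    return f"{qid} Q0 {docid} {rank} {score:.6f} {tag}"


# The document id here is made up for the example.
print(msmarco_line_to_trec("174249\tD3126539\t1"))
```

Because trec_eval orders documents by score rather than by the rank column, the score has to decrease as the rank increases for the original ordering to be preserved.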
And then run the trec_eval tool:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.100 -mmap \
tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/run.msmarco-doc.bm25tuned.trec
map all 0.2770
recall_100 all 0.8076
Let's compare to the baseline provided by Microsoft. First, download:
wget https://msmarco.blob.core.windows.net/msmarcoranking/msmarco-docdev-top100.gz -P runs
gunzip runs/msmarco-docdev-top100.gz
Then, run trec_eval to compare:
$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.100 -mmap \
tools/topics-and-qrels/qrels.msmarco-doc.dev.txt runs/msmarco-docdev-top100
map all 0.2219
recall_100 all 0.7564
We can see that Anserini's (tuned) BM25 baseline is already much better than the baseline provided by the organizers: MAP of 0.2770 vs. 0.2219, and recall@100 of 0.8076 vs. 0.7564.
Reproduction Log*
- Results reproduced by @JeffreyCA on 2020-09-14 (commit 49fd7cb)
- Results reproduced by @jhuang265 on 2020-09-14 (commit 2ed2acc)
- Results reproduced by @Dahlia-Chehata on 2020-11-12 (commit 55c3dbc)
- Results reproduced by @rakeeb123 on 2020-12-07 (commit 3bcd4e5)
- Results reproduced by @jrzhang12 on 2021-01-03 (commit 7caedfc)
- Results reproduced by @HEC2018 on 2021-01-04 (commit 46a6d47)
- Results reproduced by @KaiSun314 on 2021-01-08 (commit aeec31f)
- Results reproduced by @yemiliey on 2021-01-18 (commit 98f3236)
- Results reproduced by @larryli1999 on 2021-01-04 (commit 74a87e4)
- Results reproduced by @ArthurChen189 on 2021-01-04 (commit 7261223)
- Results reproduced by @printfCalvin on 2021-04-12 (commit 0801f7f)
- Results reproduced by @saileshnankani on 2021-04-26 (commit 6d48609)
- Results reproduced by @andrewyguo on 2021-04-30 (commit ecfed61)
- Results reproduced by @mayankanand007 on 2021-05-04 (commit a9d6f66)
- Results reproduced by @rootofallevii on 2021-05-14 (commit e764797)
- Results reproduced by @jpark621 on 2021-06-13 (commit f614111)
- Results reproduced by @nimasadri11 on 2021-06-28 (commit d31e2e6)
- Results reproduced by @mzzchy on 2021-07-05 (commit 45083f5)
- Results reproduced by @d1shs0ap on 2021-07-16 (commit a6b6545)
- Results reproduced by @apokali on 2021-08-19 (commit 45a2fb4)
- Results reproduced by @leungjch on 2021-09-12 (commit c71a69e)
- Results reproduced by @AlexWang000 on 2021-10-10 (commit 8599c81)
- Results reproduced by @manveertamber on 2021-12-05 (commit c280dad)
- Results reproduced by @lingwei-gu on 2021-12-15 (commit 7249409)
- Results reproduced by @tyao-t on 2021-12-19 (commit fc54ed6)
- Results reproduced by @kevin-wangg on 2022-01-05 (commit b9fcae7)
- Results reproduced by @vivianliu0 on 2021-01-06 (commit 937ec63)
- Results reproduced by @mikhail-tsir on 2022-01-10 (commit f1084a0)
- Results reproduced by @AceZhan on 2022-01-14 (commit 68be809)
- Results reproduced by @jh8liang on 2022-02-06 (commit e03e068)
- Results reproduced by @HAKSOAT on 2022-03-11 (commit 7796685)