To run benchmark on bm25 implementations, simply run:
# For bm25_pt
python -m benchmark.on_bm25_pt -d "<dataset>"
# For rank-bm25
python -m benchmark.on_rank_bm25 -d "<dataset>"
# for Pyserini
python -m benchmark.on_pyserini -d "<dataset>"
# For elastic, After starting the server, run:
python -m benchmark.on_elastic -d "<dataset>"
# for PISA
python -m benchmark.on_pisa -d "<dataset>"
where <dataset>
is the name of the dataset to be used.
The available datasets are public BEIR datasets: trec-covid
, nfcorpus
, fiqa
, arguana
, webis-touche2020
, quora
, scidocs
, scifact
, cqadupstack
, nq
, msmarco
, hotpotqa
, dbpedia-entity
, fever
, climate-fever
,
For rank-bm25
, due to the long runtime, we can sample queries
python -m benchmark.on_rank_bm25 -d "<dataset>" --samples <num_samples>
For rank-bm25
, we can also specify the method with --method
to be used:
rank
(default)bm25l
bm25+
Results will be saved in results/
directory.
If you want to use elastic search, you need to start the server first.
First, download the elastic search from here. You will get a file, e.g. elasticsearch-8.14.0-linux-x86_64.tar.gz
. Extract the file and ensure it is in the same directory as the bm25-benchmarks
directory.
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.14.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-8.14.0-linux-x86_64.tar.gz
# remove the tar file
rm elasticsearch-8.14.0-linux-x86_64.tar.gz
Then, start the server with the following command:
./elasticsearch-8.14.0/bin/elasticsearch -E xpack.security.enabled=false -E thread_pool.search.size=1 -E thread_pool.write.size=1
The results are benchmarked using Kaggle notebooks to ensure reproducibility. Each one is run on single-core, Intel Xeon CPU @ 2.20GHz, using 30GB RAM.
The shorthands used are:
BM25PT
forbm25_pt
PSRN
forpyserini
R-BM25
forrank-bm25
BM25S
forbm25
, andBM25S+J
for Numba JIT version ofbm25s
(v0.2.0+)ES
forelasticsearch
PISA
for the Pisa Engine (via thepyterrier_pisa
Python bindings)OOM
for out-of-memory errorDNT
for did not terminate (i.e. went over 12 hours)
dataset | PISA | BM25S+J | BM25S | ES | PSRN | PT | R-BM25 |
---|---|---|---|---|---|---|---|
arguana | 128.01 | 869.95 | 573.91 | 13.67 | 11.95 | 110.51 | 2 |
climate-fever | 25.66 | 38.49 | 13.09 | 4.02 | 8.06 | OOM | 0.03 |
cqadupstack | 281.65 | 396.5 | 170.91 | 13.38 | DNT | OOM | 0.77 |
dbpedia-entity | 114.05 | 71.8 | 13.44 | 10.68 | 12.69 | OOM | 0.11 |
fever | 60.76 | 53.84 | 20.19 | 7.45 | 10.52 | OOM | 0.06 |
fiqa | 546.61 | 1237.39 | 717.78 | 16.96 | 12.51 | 20.52 | 4.46 |
hotpotqa | 39.70 | 47.16 | 20.88 | 7.11 | 10.41 | OOM | 0.04 |
msmarco | 125.18 | 39.18 | 12.2 | 11.88 | 11.01 | OOM | 0.07 |
nfcorpus | 2639.38 | 5696.21 | 1196.16 | 45.84 | 32.94 | 256.67 | 224.66 |
nq | 117.34 | 109.47 | 41.85 | 12.16 | 11.04 | OOM | 0.1 |
quora | 528.45 | 479.71 | 272.04 | 21.8 | 15.58 | 6.49 | 1.18 |
scidocs | 678.06 | 1448.32 | 767.05 | 17.93 | 14.1 | 41.34 | 9.01 |
scifact | 1100.60 | 2787.84 | 1317.12 | 20.81 | 15.02 | 184.3 | 47.6 |
trec-covid | 140.92 | 483.84 | 85.64 | 7.34 | 8.53 | 3.73 | 1.48 |
webis-touche2020 | 201.19 | 390.03 | 60.59 | 13.53 | 12.36 | OOM | 1.1 |
Notes:
- For Rank-BM25, larger datasets are ran with 1000 samples rather than the full dataset to ensure it finishes within 12h (limit for Kaggle notebooks).
- For ES and BM25S, we can set a number of threads to use. However, you might not see an improvement, in fact you might even see a decrease in throughput in the case of BM25S due to how multi-threading is implemented. Click below to see the results.
Show BM25S & ES multi-threaded (4T) performance (Q/s)
dataset | PISA | BM25S | ES |
---|---|---|---|
arguana | 311.00 | 211 | 33.37 |
climate-fever | 69.89 | 22.06 | 8.13 |
cqadupstack | 694.82 | 248.87 | 27.76 |
dbpedia-entity | 294.66 | 26.18 | 15.49 |
fever | 166.77 | 47.03 | 14.07 |
fiqa | 1240.22 | 449.82 | 36.33 |
hotpotqa | 109.44 | 45.02 | 10.35 |
msmarco | 340.29 | 21.64 | 18.19 |
nfcorpus | 3188.22 | 784.24 | 81.07 |
nq | 312.09 | 77.49 | 19.18 |
quora | 1534.93 | 308.58 | 43.02 |
scidocs | 1461.20 | 614.23 | 46.36 |
scifact | 2620.73 | 645.88 | 50.93 |
trec-covid | 268.96 | 100.88 | 13.5 |
webis-touche2020 | 297.79 | 202.39 | 26.55 |
Show normalized table wrt Rank-BM25
dataset | PISA | BM25S | ES | PSRN | PT | Rank |
---|---|---|---|---|---|---|
arguana | 64.01 | 286.96 | 6.84 | 5.98 | 55.26 | 1 |
climate-fever | 855.33 | 436.33 | 134 | 268.67 | nan | 1 |
cqadupstack | 365.78 | 221.96 | 17.38 | nan | nan | 1 |
dbpedia-entity | 1036.82 | 122.18 | 97.09 | 115.36 | nan | 1 |
fever | 1012.67 | 336.5 | 124.17 | 175.33 | nan | 1 |
fiqa | 122.56 | 160.94 | 3.8 | 2.8 | 4.6 | 1 |
hotpotqa | 992.50 | 522 | 177.75 | 260.25 | nan | 1 |
msmarco | 1788.29 | 174.29 | 169.71 | 157.29 | nan | 1 |
nfcorpus | 11.75 | 5.32 | 0.2 | 0.15 | 1.14 | 1 |
nq | 1173.40 | 418.5 | 121.6 | 110.4 | nan | 1 |
quora | 447.84 | 230.54 | 18.47 | 13.2 | 5.5 | 1 |
scidocs | 75.26 | 85.13 | 1.99 | 1.56 | 4.59 | 1 |
scifact | 23.12 | 27.67 | 0.44 | 0.32 | 3.87 | 1 |
trec-covid | 95.22 | 57.86 | 4.96 | 5.76 | 2.52 | 1 |
webis-touche2020 | 182.90 | 55.08 | 12.3 | 11.24 | nan | 1 |
# Docs | # Queries | # Tokens | |
---|---|---|---|
msmarco | 8,841,823 | 6,980 | 340,859,891 |
hotpotqa | 5,233,329 | 7,405 | 169,530,287 |
trec-covid | 171,332 | 50 | 20,231,412 |
webis-touche2020 | 382,545 | 49 | 74,180,340 |
arguana | 8,674 | 1,406 | 947,470 |
fiqa | 57,638 | 648 | 5,189,035 |
nfcorpus | 3,633 | 323 | 614,081 |
climate-fever | 5,416,593 | 1,535 | 318,190,120 |
nq | 2,681,468 | 3,452 | 148,249,808 |
scidocs | 25,657 | 1,000 | 3,211,248 |
quora | 522,931 | 10,000 | 4,202,123 |
dbpedia-entity | 4,635,922 | 400 | 162,336,256 |
cqadupstack | 457,199 | 13,145 | 44,857,487 |
fever | 5,416,568 | 6,666 | 318,184,321 |
scifact | 5,183 | 300 | 812,074 |
The following results follow the same setup as the queries/s benchmarks described above (single-core).
dataset | PISA | BM25S | ES | PSRN | PT | Rank |
---|---|---|---|---|---|---|
arguana | 3608.75 | 4314.79 | 3591.63 | 1225.18 | 638.1 | 5021.3 |
climate-fever | 5474.84 | 4364.43 | 3825.89 | 6880.42 | nan | 7085.51 |
cqadupstack | 4380.73 | 4800.89 | 3725.43 | nan | nan | 5370.32 |
dbpedia-entity | 9532.44 | 7576.28 | 6333.82 | 8501.7 | nan | 9110.36 |
fever | 5633.20 | 4921.88 | 3879.63 | 7007.5 | nan | 5482.64 |
fiqa | 4655.99 | 5959.25 | 4035.11 | 3735.38 | 421.51 | 6455.53 |
hotpotqa | 10199.26 | 7420.39 | 5455.6 | 10342.5 | nan | 9407.9 |
msmarco | 9873.08 | 7480.71 | 5391.29 | 9686.07 | nan | 12455.9 |
nfcorpus | 2484.47 | 3169.4 | 1688.15 | 692.05 | 442.2 | 3579.47 |
nq | 6923.50 | 6083.86 | 5742.13 | 6652.33 | nan | 6048.85 |
quora | 37954.43 | 28002.4 | 8189.75 | 22818.5 | 6251.26 | 47609.2 |
scidocs | 3076.90 | 4107.46 | 3008.45 | 2137.64 | 312.72 | 4232.15 |
scifact | 2560.91 | 3253.63 | 2649.57 | 880.53 | 442.61 | 3792.84 |
trec-covid | 4736.72 | 4600.14 | 2966.98 | 3768.1 | 406.37 | 4672.62 |
webis-touche2020 | 2301.53 | 2971.96 | 2484.87 | 2718.41 | nan | 3115.96 |
We use abbreviations for datasets of BEIR benchmarks.
Click to show dataset abbreviations
AG
for arguanaCD
for cqadupstackCF
for climate-feverDB
for dbpedia-entityFQ
for fiqaFV
for feverHP
for hotpotqaMS
for msmarcoNF
for nfcorpusNQ
for nqQR
for quoraSD
for scidocsSF
for scifactTC
for trec-covidWT
for webis-touche2020
k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.9 | 0.4 | Lucene | 41.1 | 40.8 | 28.2 | 16.2 | 31.9 | 23.8 | 63.8 | 62.9 | 22.8 | 31.8 | 30.5 | 78.7 | 15.0 | 67.6 | 58.9 | 44.2 |
1.2 | 0.75 | ATIRE | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
1.2 | 0.75 | BM25+ | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.1 | 61.0 | 33.2 |
1.2 | 0.75 | BM25L | 39.5 | 49.6 | 29.8 | 13.5 | 29.4 | 25.0 | 46.6 | 55.9 | 21.4 | 32.2 | 28.1 | 80.3 | 15.8 | 68.7 | 62.9 | 33.0 |
1.2 | 0.75 | Lucene | 39.9 | 48.7 | 30.1 | 13.7 | 30.3 | 25.3 | 50.3 | 58.5 | 22.6 | 31.8 | 29.1 | 80.5 | 15.6 | 68.0 | 61.0 | 33.2 |
1.2 | 0.75 | Robertson | 39.9 | 49.2 | 29.9 | 13.7 | 30.3 | 25.4 | 50.3 | 58.5 | 22.6 | 31.9 | 29.2 | 80.4 | 15.5 | 68.3 | 59.0 | 33.8 |
1.5 | 0.75 | ES | 42.0 | 47.7 | 29.8 | 17.8 | 31.1 | 25.3 | 62.0 | 58.6 | 22.1 | 34.4 | 31.6 | 80.6 | 16.3 | 69.0 | 68.0 | 35.4 |
1.5 | 0.75 | Lucene | 39.7 | 49.3 | 29.9 | 13.6 | 29.9 | 25.1 | 48.1 | 56.9 | 21.9 | 32.1 | 28.5 | 80.4 | 15.8 | 68.7 | 62.3 | 33.1 |
1.5 | 0.75 | PSRN | 40.0 | 48.4 | 29.8 | 14.2 | 30.0 | 25.3 | 50.0 | 57.6 | 22.1 | 32.6 | 28.6 | 80.6 | 15.6 | 68.8 | 63.4 | 33.5 |
1.5 | 0.75 | PT | 45.0 | 44.9 | -- | -- | -- | 22.5 | -- | -- | -- | 31.9 | -- | 75.1 | 14.7 | 67.8 | 58.0 | -- |
1.5 | 0.75 | Rank | 39.6 | 49.5 | 29.6 | 13.6 | 29.9 | 25.3 | 49.3 | 58.1 | 21.1 | 32.1 | 28.5 | 80.3 | 15.8 | 68.5 | 60.1 | 32.9 |
1.2 | 0.75 | PISA | 38.8 | 41.2 | 27.8 | 13.8 | 30.6 | 24.6 | 49.1 | 58.1 | 22.6 | 34.4 | 28.2 | 72 | 15.7 | 68.8 | 63.7 | 31.1 |
k1 | b | method | Avg. | AG | CD | CF | DB | FQ | FV | HP | MS | NF | NQ | QR | SD | SF | TC | WT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0.9 | 0.4 | Lucene | 77.3 | 98.8 | 71.1 | 63.3 | 67.5 | 74.3 | 95.7 | 88.0 | 85.3 | 47.7 | 89.6 | 99.5 | 56.5 | 97.0 | 39.2 | 86.0 |
1.2 | 0.75 | ATIRE | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
1.2 | 0.75 | BM25+ | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.7 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
1.2 | 0.75 | BM25L | 77.2 | 99.4 | 73.4 | 57.3 | 66.1 | 77.3 | 93.7 | 85.7 | 85.0 | 47.7 | 89.3 | 99.5 | 57.7 | 97.0 | 40.8 | 87.5 |
1.2 | 0.75 | Lucene | 77.4 | 99.3 | 73.0 | 59.0 | 67.0 | 76.5 | 94.2 | 86.8 | 85.6 | 47.8 | 89.8 | 99.5 | 57.3 | 97.0 | 40.3 | 87.2 |
1.2 | 0.75 | Robertson | 77.4 | 99.3 | 73.2 | 59.1 | 66.7 | 76.8 | 94.2 | 86.8 | 85.9 | 47.5 | 89.8 | 99.5 | 57.3 | 96.7 | 40.2 | 87.4 |
1.5 | 0.75 | ES | 76.9 | 99.2 | 74.2 | 58.8 | 63.6 | 76.7 | 95.9 | 85.2 | 85.1 | 39.0 | 90.8 | 99.6 | 57.9 | 98.0 | 41.3 | 88.0 |
1.5 | 0.75 | Lucene | 77.2 | 99.3 | 73.3 | 57.8 | 66.3 | 77.2 | 93.8 | 86.1 | 85.2 | 47.7 | 89.5 | 99.6 | 57.5 | 97.0 | 40.6 | 87.4 |
1.5 | 0.75 | PSRN | 76.7 | 99.2 | 74.2 | 58.7 | 66.2 | 76.7 | 94.2 | 86.4 | 85.1 | 37.1 | 89.4 | 99.6 | 57.4 | 97.7 | 41.1 | 87.2 |
1.5 | 0.75 | PT | 73.0 | 98.3 | -- | -- | -- | 72.5 | -- | -- | -- | 51.0 | -- | 98.9 | 56.0 | 97.8 | 36.3 | -- |
1.5 | 0.75 | Rank | 77.1 | 99.4 | 73.4 | 57.5 | 66.4 | 77.4 | 93.6 | 87.7 | 82.6 | 47.6 | 89.5 | 99.5 | 57.4 | 96.7 | 40.5 | 87.5 |
1.2 | 0.75 | PISA | 77.1 | 98.8 | 72.2 | 59.8 | 67.6 | 76.4 | 93.7 | 86.8 | 86.9 | 38.7 | 89.1 | 98.9 | 56.9 | 97 | 46 | 87.7 |
- BM25+
- BM25L
- ATIRE
- Robertson
- Lucene (k1=1.2, b=0.75)
- Lucene (k1=0.9, b=0.4)
- Lucene (k1=1.5, b=0.75): DB, MS, FV, CF, NQ, HP, Remaining
- ES: FV, CF, NQ, MS, HP, DB, Remaining
- PT: MS, CF, FV, DB, HP, NQ, WT, CD, Remaining
- Rank: DB, HP, CF, FV, MS, NQ, CD, Remaining
- PSRN: CD, FV, HP, MS, DB, NQ, Remaining
- PISA: NQ, DB, CF, HP, FV, MS, CD, Remaining
- BM25+J: Sub-1m, remaining