Add adore reproduce experiments (castorini#785)
Add ADORE retrieval stage reproduction
Co-authored-by: Hang Li <cecillll.lee@gmail.com>
ArvinZhuang authored and MXueguang committed Nov 5, 2021
1 parent ed0c6e2 commit b3b2b05
Showing 3 changed files with 115 additions and 2 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -408,6 +408,7 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
+ Reproducing [DistilBERT KD experiments](docs/experiments-distilbert_kd.md)
+ Reproducing [DistilBERT Balanced Topic Aware Sampling experiments](docs/experiments-distilbert_tasb.md)
+ Reproducing [SBERT dense retrieval experiments](docs/experiments-sbert.md)
+ Reproducing [ADORE dense retrieval experiments](docs/experiments-adore.md)
+ Reproducing [Vector PRF experiments](docs/experiments-vector-prf.md)

## Baselines
102 changes: 102 additions & 0 deletions docs/experiments-adore.md
@@ -0,0 +1,102 @@
# Pyserini: Reproducing ADORE Results

This guide provides instructions to reproduce the following dense retrieval work:

> Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma. [Optimizing Dense Retrieval Model Training with Hard Negatives](https://arxiv.org/pdf/2104.08051.pdf)

Starting with v0.12.0, you can reproduce these results directly from the [Pyserini PyPI package](https://pypi.org/project/pyserini/).
Since dense retrieval depends on neural networks, Pyserini requires a more complex set of dependencies to use this feature.
See [package installation notes](../README.md#package-installation) for more details.

Note that we have observed minor differences in scores between different computing environments (e.g., Linux vs. macOS).
However, the differences usually appear in the fifth digit after the decimal point, and do not appear to be a cause for concern from a reproducibility perspective.
Thus, while the scoring script provides results to much higher precision, we have intentionally rounded to four digits after the decimal point.

## MS MARCO Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--index msmarco-passage-adore-bf \
--encoded-queries adore-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.adore.bf.tsv \
--output-format msmarco
```

The option `--encoded-queries` specifies the use of encoded queries (i.e., queries that have already been converted into dense vectors and cached).

Unfortunately, "on-the-fly" query encoding, i.e., converting text queries into dense vectors as part of the dense retrieval process, is not available for this model. The original ADORE implementation is based on an old version of transformers (`transformers==2.8.0`), while Pyserini uses a newer version, under which the base model (`roberta-base`) behaves differently.
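
The same retrieval can also be run from Python. Here is a sketch using the `pyserini.dsearch` API of contemporary Pyserini releases (treat the class and method names as assumptions if your version differs):

```python
from pyserini.dsearch import QueryEncoder, SimpleDenseSearcher

# Load the cached ADORE query vectors; they are keyed by query text.
encoder = QueryEncoder.load_encoded_queries('adore-msmarco-passage-dev-subset')

# Brute-force (flat) Faiss index over the ADORE passage embeddings.
searcher = SimpleDenseSearcher.from_prebuilt_index('msmarco-passage-adore-bf', encoder)

# The query text must match a dev-subset query exactly, since there is
# no on-the-fly encoding for this model.
hits = searcher.search("what is paula deen's brother")
for i, hit in enumerate(hits[:5]):
    print(f'{i + 1:2} {hit.docid:7} {hit.score:.5f}')
```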

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.adore.bf.tsv
#####################
MRR @10: 0.34661947969254514
QueriesRanked: 6980
#####################
```
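
For reference, the MRR@10 that this script reports can be computed by hand. A minimal sketch, assuming the standard MS MARCO TSV formats (qrels rows `qid 0 docid 1`; run rows `qid docid rank`, with ranks ascending per query):

```python
from collections import defaultdict

def mrr_at_10(qrels_path: str, run_path: str) -> float:
    relevant = defaultdict(set)
    with open(qrels_path) as f:
        for line in f:
            qid, _, docid, _ = line.split()
            relevant[qid].add(docid)

    queries = set()
    best_rank = {}  # qid -> rank of the first relevant passage within the top 10
    with open(run_path) as f:
        for line in f:
            qid, docid, rank = line.split()
            queries.add(qid)
            if int(rank) <= 10 and docid in relevant[qid] and qid not in best_rank:
                best_rank[qid] = int(rank)

    # Queries with no relevant passage in the top 10 contribute 0 to the mean.
    return sum(1.0 / r for r in best_rank.values()) / len(queries)
```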

We can also use the official TREC evaluation tool, `trec_eval`, to compute metrics other than MRR@10.
To do so, we first need to convert the run file to the TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.adore.bf.tsv --output runs/run.msmarco-passage.adore.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.adore.bf.trec
map all 0.3523
recall_1000 all 0.9688
```
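
The conversion itself is mechanical: each `qid docid rank` row becomes a six-column TREC row with a pseudo-score. A sketch of the idea (not the actual converter script; the rank-derived score is an assumption):

```python
def msmarco_run_to_trec(msmarco_path: str, trec_path: str, tag: str = 'adore') -> None:
    # MS MARCO: qid <tab> docid <tab> rank  ->  TREC: qid Q0 docid rank score tag
    with open(msmarco_path) as fin, open(trec_path, 'w') as fout:
        for line in fin:
            qid, docid, rank = line.split()
            # Use a rank-derived pseudo-score so higher-ranked passages sort first.
            score = 1.0 / int(rank)
            fout.write(f'{qid} Q0 {docid} {rank} {score} {tag}\n')
```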

## TREC DL2019 Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics dl19-passage \
--index msmarco-passage-adore-bf \
--encoded-queries adore-dl19-passage \
--batch-size 36 \
--threads 12 \
--output runs/run.dl19-passage.adore.bf.trec
```

As above, "on-the-fly" query encoding is not available for this model.

To evaluate:

```bash
$ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage runs/run.dl19-passage.adore.bf.trec
map all 0.4188
recall_1000 all 0.7759
ndcg_cut_10 all 0.6832
```

## TREC DL2020 Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics dl20 \
--index msmarco-passage-adore-bf \
--encoded-queries adore-dl20-passage \
--batch-size 36 \
--threads 12 \
--output runs/run.dl20-passage.adore.bf.trec
```

As above, "on-the-fly" query encoding is not available for this model.

To evaluate:

```bash
$ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage runs/run.dl20-passage.adore.bf.trec
map all 0.4418
recall_1000 all 0.8151
ndcg_cut_10 all 0.6655
```

## Reproduction Log[*](reproducibility.md)

14 changes: 12 additions & 2 deletions docs/experiments-vector-prf.md
@@ -40,6 +40,10 @@ Here's how our results stack up against all available models and datasets in Pys
| SBERT | Original | 0.4060 | 0.5985 | 0.7872 |
| SBERT | Average PRF 3 | 0.4354 | 0.6149 | 0.7937 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.4371 | 0.6149 | 0.7941 |
| ADORE | Original | 0.4188 | 0.5946 | 0.7759 |
| ADORE | Average PRF 3 | 0.4672 | 0.6263 | 0.7890 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.4629 | 0.6325 | 0.7950 |


#### TREC DL 2020 Passage

@@ -63,6 +67,9 @@ Here's how our results stack up against all available models and datasets in Pys
| SBERT | Original | 0.4124 | 0.5734 | 0.7937 |
| SBERT | Average PRF 3 | 0.4258 | 0.5781 | 0.8169 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.4342 | 0.5851 | 0.8226 |
| ADORE | Original | 0.4418 | 0.5949 | 0.8151 |
| ADORE | Average PRF 3 | 0.4706 | 0.6176 | 0.8323 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.4760 | 0.6193 | 0.8251 |

#### MS MARCO Passage V1

@@ -87,7 +94,10 @@ The PRF does not perform well with sparse judgements like in MS MARCO, the resul
| DistilBERT Balanced | Rocchio PRF 5 A0.4 B0.6 | 0.2969 | 0.4178 | 0.9702 |
| SBERT | Original | 0.3373 | 0.4453 | 0.9558 |
| SBERT | Average PRF 3 | 0.3094 | 0.4183 | 0.9446 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.3034 | 0.4157 | 0.9529 |
| ADORE | Original | 0.3523 | 0.4637 | 0.9688 |
| ADORE | Average PRF 3 | 0.3188 | 0.4330 | 0.9583 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.3209 | 0.4376 | 0.9669 |
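
Both PRF variants above operate directly on the dense vectors: Average PRF k replaces the query vector with the mean of the query and top-k passage vectors, while Rocchio PRF weights the original query by alpha and the mean feedback vector by beta. A minimal numpy sketch of that arithmetic (illustrative only, not the code path used in these experiments):

```python
import numpy as np

def average_prf(query_vec: np.ndarray, feedback_vecs: np.ndarray) -> np.ndarray:
    # New query = mean of the original query vector and the top-k passage vectors.
    return np.mean(np.vstack([query_vec[np.newaxis, :], feedback_vecs]), axis=0)

def rocchio_prf(query_vec: np.ndarray, feedback_vecs: np.ndarray,
                alpha: float = 0.4, beta: float = 0.6) -> np.ndarray:
    # New query = alpha * original query + beta * mean of the top-k passage
    # vectors; "Rocchio PRF 5 A0.4 B0.6" corresponds to k=5, alpha=0.4, beta=0.6.
    return alpha * query_vec + beta * np.mean(feedback_vecs, axis=0)
```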

## Reproducing Results

@@ -145,7 +155,7 @@ _Note: TREC DL 2019, TREC DL 2020, and MS MARCO Passage V1 use the same passage

_Note: If you have pre-computed queries available, the `--encoder` can be replaced with `--encoded-queries` to avoid "on-the-fly" query encoding by passing in the path to your pre-computed query file.
For example, Pyserini has the ANCE pre-computed query available for MS MARCO Passage V1, so instead of using `--encoder castorini/ance-msmarco-passage`,
one can use `--encoded-queries ance-msmarco-passage-dev-subset`. For the ADORE model, only `--encoded-queries` can be used; on-the-fly encoding is not available._

With these parameters, one can easily reproduce the results above; for example, `TREC DL 2019 Passage with ANCE Average Vector PRF 3` is obtained by running the corresponding retrieval command with the Average PRF parameters set accordingly.
