Add adore reproduce experiments #785

Merged
merged 4 commits on Sep 29, 2021
1 change: 1 addition & 0 deletions README.md
@@ -408,6 +408,7 @@ With Pyserini, it's easy to [reproduce](docs/reproducibility.md) runs on a numbe
+ Reproducing [DistilBERT KD experiments](docs/experiments-distilbert_kd.md)
+ Reproducing [DistilBERT Balanced Topic Aware Sampling experiments](docs/experiments-distilbert_tasb.md)
+ Reproducing [SBERT dense retrieval experiments](docs/experiments-sbert.md)
+ Reproducing [ADORE dense retrieval experiments](docs/experiments-adore.md)
+ Reproducing [Vector PRF experiments](docs/experiments-vector-prf.md)

## Baselines
102 changes: 102 additions & 0 deletions docs/experiments-adore.md
@@ -0,0 +1,102 @@
# Pyserini: Reproducing ADORE Results

This guide provides instructions to reproduce the following dense retrieval work:

> Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, Shaoping Ma. [Optimizing Dense Retrieval Model Training with Hard Negatives](https://arxiv.org/pdf/2104.08051.pdf)
Starting with v0.12.0, you can reproduce these results directly from the [Pyserini PyPI package](https://pypi.org/project/pyserini/).
Since dense retrieval depends on neural networks, Pyserini requires a more complex set of dependencies to use this feature.
See [package installation notes](../README.md#package-installation) for more details.
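As a minimal sketch of the setup (assuming a CPU-only environment and that PyTorch and `faiss-cpu` cover the additional dependencies; the installation notes linked above remain the authoritative reference):

```bash
# Minimal sketch (assumption: CPU-only environment); see the package
# installation notes for the authoritative instructions.
$ pip install pyserini
$ pip install torch faiss-cpu
```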

Note that we have observed minor differences in scores between different computing environments (e.g., Linux vs. macOS).
However, the differences usually appear in the fifth digit after the decimal point, and do not appear to be a cause for concern from a reproducibility perspective.
Thus, while the scoring script provides results to much higher precision, we have intentionally rounded to four digits after the decimal point.

## MS MARCO Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
--index msmarco-passage-adore-bf \
--encoded-queries adore-msmarco-passage-dev-subset \
--batch-size 36 \
--threads 12 \
--output runs/run.msmarco-passage.adore.bf.tsv \
--output-format msmarco
```

The option `--encoded-queries` specifies the use of encoded queries (i.e., queries that have already been converted into dense vectors and cached).
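These pre-encoded queries are downloaded automatically the first time they are used. As a quick sanity check (a sketch, assuming the default cache location of `~/.cache/pyserini`; adjust if you have configured a different cache directory), the downloaded query embeddings can be listed with:

```bash
# Assumption: the default Pyserini cache directory is used.
$ ls ~/.cache/pyserini/queries/
```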

Unfortunately, "on-the-fly" query encoding, i.e., converting text queries into dense vectors as part of the dense retrieval process, is not available for this model. The original ADORE implementation is based on an older version of transformers (`transformers==2.8.0`); Pyserini uses a newer version, in which the base model (`roberta-base`) behaves differently.
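For comparison, here is a sketch of what on-the-fly encoding looks like for a model that does support it (ANCE; the `--encoder castorini/ance-msmarco-passage` option is taken from the Vector PRF guide, and the prebuilt index name `msmarco-passage-ance-bf` is assumed):

```bash
# Sketch only: ANCE supports on-the-fly query encoding via --encoder;
# ADORE does not, so the --encoded-queries option above must be used instead.
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-ance-bf \
                             --encoder castorini/ance-msmarco-passage \
                             --batch-size 36 \
                             --threads 12 \
                             --output runs/run.msmarco-passage.ance.bf.tsv \
                             --output-format msmarco
```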

To evaluate:

```bash
$ python -m pyserini.eval.msmarco_passage_eval msmarco-passage-dev-subset runs/run.msmarco-passage.adore.bf.tsv
#####################
MRR @10: 0.34661947969254514
QueriesRanked: 6980
#####################
```

We can also use the official TREC evaluation tool `trec_eval` to compute metrics other than MRR@10.
For that, we first need to convert the run and qrels files into TREC format:

```bash
$ python -m pyserini.eval.convert_msmarco_run_to_trec_run --input runs/run.msmarco-passage.adore.bf.tsv --output runs/run.msmarco-passage.adore.bf.trec
$ python -m pyserini.eval.trec_eval -c -mrecall.1000 -mmap msmarco-passage-dev-subset runs/run.msmarco-passage.adore.bf.trec
map all 0.3523
recall_1000 all 0.9688
```
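As a cross-check (a sketch, not part of the original guide), `trec_eval` can also approximate MRR@10 on the converted run by capping each ranking at 10 documents:

```bash
# Assumption: trec_eval's -M flag limits the number of retrieved documents per
# query, so the reciprocal rank computed here corresponds to MRR@10 above.
$ python -m pyserini.eval.trec_eval -c -M 10 -m recip_rank msmarco-passage-dev-subset runs/run.msmarco-passage.adore.bf.trec
```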

## TREC DL2019 Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics dl19-passage \
--index msmarco-passage-adore-bf \
--encoded-queries adore-dl19-passage \
--batch-size 36 \
--threads 12 \
--output runs/run.dl19-passage.adore.bf.trec
```

As above, the "on-the-fly" query encoding feature cannot be used with this model.

To evaluate:

```bash
$ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -m recall.1000 -l 2 dl19-passage runs/run.dl19-passage.adore.bf.trec
map all 0.4188
recall_1000 all 0.7759
ndcg_cut_10 all 0.6832
```

## TREC DL2020 Passage

**ADORE retrieval** with brute-force index:

```bash
$ python -m pyserini.dsearch --topics dl20 \
--index msmarco-passage-adore-bf \
--encoded-queries adore-dl20-passage \
--batch-size 36 \
--threads 12 \
--output runs/run.dl20-passage.adore.bf.trec
```

As above, the "on-the-fly" query encoding feature cannot be used with this model.

To evaluate:

```bash
$ python -m pyserini.eval.trec_eval -c -m map -m ndcg_cut.10 -m recall.1000 -l 2 dl20-passage runs/run.dl20-passage.adore.bf.trec
map all 0.4418
recall_1000 all 0.8151
ndcg_cut_10 all 0.6655
```

## Reproduction Log[*](reproducibility.md)

14 changes: 12 additions & 2 deletions docs/experiments-vector-prf.md
@@ -40,6 +40,10 @@ Here's how our results stack up against all available models and datasets in Pys
| SBERT | Original | 0.4060 | 0.5985 | 0.7872 |
| SBERT | Average PRF 3 | 0.4354 | 0.6149 | 0.7937 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.4371 | 0.6149 | 0.7941 |
| ADORE | Original | 0.4188 | 0.5946 | 0.7759 |
| ADORE | Average PRF 3 | 0.4672 | 0.6263 | 0.7890 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.4629 | 0.6325 | 0.7950 |


#### TREC DL 2020 Passage

@@ -63,6 +67,9 @@ Here's how our results stack up against all available models and datasets in Pys
| SBERT | Original | 0.4124 | 0.5734 | 0.7937 |
| SBERT | Average PRF 3 | 0.4258 | 0.5781 | 0.8169 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.4342 | 0.5851 | 0.8226 |
| ADORE | Original | 0.4418 | 0.5949 | 0.8151 |
| ADORE | Average PRF 3 | 0.4706 | 0.6176 | 0.8323 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.4760 | 0.6193 | 0.8251 |

#### MS MARCO Passage V1

@@ -87,7 +94,10 @@ The PRF does not perform well with sparse judgements like in MS MARCO, the resul
| DistillBERT Balanced | Rocchio PRF 5 A0.4 B0.6 | 0.2969 | 0.4178 | 0.9702 |
| SBERT | Original | 0.3373 | 0.4453 | 0.9558 |
| SBERT | Average PRF 3 | 0.3094 | 0.4183 | 0.9446 |
| SBERT | Rocchio PRF 5 A0.4 B0.6 | 0.3034 | 0.4157 | 0.9529 |
| ADORE | Original | 0.3523 | 0.4637 | 0.9688 |
| ADORE | Average PRF 3 | 0.3188 | 0.4330 | 0.9583 |
| ADORE | Rocchio PRF 5 A0.4 B0.6 | 0.3209 | 0.4376 | 0.9669 |

## Reproducing Results

@@ -145,7 +155,7 @@ _Note: TREC DL 2019, TREC DL 2020, and MS MARCO Passage V1 use the same passage

_Note: If you have pre-computed queries available, the `--encoder` can be replaced with `--encoded-queries` to avoid "on-the-fly" query encoding by passing in the path to your pre-computed query file.
For example, Pyserini has the ANCE pre-computed query available for MS MARCO Passage V1, so instead of using `--encoder castorini/ance-msmarco-passage`,
one can use `--encoded-queries ance-msmarco-passage-dev-subset`._
one can use `--encoded-queries ance-msmarco-passage-dev-subset`. For the ADORE model, only `--encoded-queries` can be used; on-the-fly encoding is not available._
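For instance, a hypothetical ADORE Average PRF 3 run on the MS MARCO dev subset might look like the sketch below (assumptions: the PRF flags are named `--prf-depth` and `--prf-method`, matching the command template above, which is not reproduced here):

```bash
# Sketch only, with assumed PRF flag names (--prf-depth, --prf-method).
# ADORE must use --encoded-queries; on-the-fly encoding is unavailable.
$ python -m pyserini.dsearch --topics msmarco-passage-dev-subset \
                             --index msmarco-passage-adore-bf \
                             --encoded-queries adore-msmarco-passage-dev-subset \
                             --batch-size 36 \
                             --threads 12 \
                             --prf-depth 3 \
                             --prf-method avg \
                             --output runs/run.msmarco-passage.adore.avg-prf3.tsv \
                             --output-format msmarco
```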

With these parameters, one can easily reproduce the results above. For example, to reproduce `TREC DL 2019 Passage with ANCE Average Vector PRF 3`, the command is:
```