Difficulties to reproduce results on Robust 04 #22
Comments
Hi Martin! Thanks for reporting. I'm looking into these issues (as well as related #21).
Thanks @seanmacavaney for your fast reply and for looking into it. Much appreciated!
Hi Martin,
Thank you for pointing this out. While Sean is looking into the third question (#21), I'll try to provide some information about the others.
This appears to be caused by a preprocessing difference when using Anserini to prepare the data. If I use an Indri index built without stopword removal and without stemming, I get P@20 = 0.4470 (from trec_eval) and nDCG@20 = 0.51797 (from gdeval). Indri and Anserini behave differently here: Anserini returns the raw document, whereas Indri (via pyndri) returns the document tokens. The project README should probably be updated to point this out. I don't see an obvious reason for the remaining nDCG difference (0.51774 vs. 0.51797), but it may be that my Indri version or configuration differs from what Sean used.
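To make the Indri/Anserini difference concrete, here is a minimal, hedged Python sketch of reading a document through pyndri (the index path is a placeholder; this code is not taken from this repository):

```python
import pyndri

# Open an Indri index; pyndri returns documents as sequences of token ids,
# i.e. already tokenized (and stopped/stemmed if the index was built that way),
# whereas Anserini gives back the raw stored document text.
index = pyndri.Index('/path/to/indri-index')   # placeholder path
token2id, id2token, id2df = index.get_dictionary()

int_doc_id = index.document_base()             # first internal document id
ext_doc_id, token_ids = index.document(int_doc_id)
tokens = [id2token[t] for t in token_ids if t > 0]  # id 0 marks stopped/OOV terms
print(ext_doc_id, ' '.join(tokens))            # tokenized text, not the raw document
```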
Some background: this repository is a simplified version of an in-house toolkit called srank, which is what was originally used to conduct the experiments. The OpenNIR toolkit is based on srank with some experimental/unpublished items removed and other cleanup. (Exporting data from srank to the format used in this repo is the step Sean referred to in #21.) I have successfully trained CEDR using this repository starting from the pre-trained VBERT weights, but this would obviously be affected by #21 if that issue does turn out to indicate a test data leak. I also spot-checked one robust04 fold (using this repo) where VBERT was also trained. However, after looking through per-fold results from running OpenNIR over the past few days (details below), I don't see this as convincing evidence: the per-fold metrics vary a lot, with some looking fine even though the aggregation is lower than expected.

Regardless of #21, the metrics you report (i.e., P@20 = 0.3790 and nDCG@20 = 0.4347) look low to me. As mentioned, I have been training VBERT and CEDR-KNRM using OpenNIR to look for any differences between the two repositories that were missed. In this setting, I get P@20 = 0.4167 and nDCG@20 = 0.4826 with CEDR-KNRM, which are higher than the metrics you obtained. I've obtained similar metrics both with this repository (using Indri preprocessing) and with a different TensorFlow v1 codebase (using Anserini). The difference in our results here may be related to the document preprocessing, as with the first experiment, but I'm not confident of this given that Anserini was also used with the TFv1 code. It's worth noting that the TFv1 setting also involved several other changes to the training setup (e.g., a larger batch size and a different sampling approach).

Regardless of how preprocessing may be affecting these results, it's clear that something else is also going on; nDCG@20 = 0.4826 is still lower than expected. My initial theory was that VBERT fine-tuning is sensitive to the random seed (which is effectively different on different hardware even when fixed), as others have observed, but the experiments I've run do not support this.
Hi Andrew, thanks for your quick and detailed reply!
I meanwhile started a Vanilla BERT training run using OpenNIR but in order to make sure we're using the same configuration, can you please share the exact Vanilla BERT and CEDR-KNRM training commands you've used?
Not sure I fully understand. You don't see this as convincing evidence for what?
I understand that metrics vary significantly across folds, but my main goal was to reproduce the metrics obtained with the provided CEDR-KNRM checkpoint for fold 1 only. Or do you refer to another variance here? Also, can you please elaborate on what you mean by aggregation? Is it aggregation of test results on different folds? Sorry for my ignorance, I just want to make sure I fully understand.
I initially trained VBERT and CEDR-KNRM on an 8 GB GTX 1080 and had to adjust the training configuration to fit into that memory budget. But before testing with Indri together with this repo, I'd like to give OpenNIR on fold 1 a try. On the 8 GB GTX 1080 I'm currently running Vanilla BERT training, but will rerun it later on an RTX 2080 Ti with the command line you provide. And thanks for sharing OpenNIR with the community, great initiative!
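For reference, a hedged sketch of what such an OpenNIR invocation might look like; the config paths below are assumptions based on OpenNIR's config directory layout rather than the exact commands used in this thread, so please check the OpenNIR README for the correct names:

```bash
# Hedged sketch -- config names are assumptions, not the exact commands from this thread.
bash scripts/pipeline.sh config/vanilla_bert config/robust/fold1   # Vanilla BERT ranker
bash scripts/pipeline.sh config/cedr/knrm config/robust/fold1      # CEDR-KNRM ranker
```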
One more question regarding the metrics you reported: I get similar numbers on the fold 1 validation set. My "low" numbers are obtained on the fold 1 test set, i.e. from running the checkpoint with the highest P@20 validation metric on the test set. Are you reporting (a statistic of) the validation set metric here? Is the "variance" you mentioned previously the validation metric variance across epochs? Is "aggregation" a (running) average over epochs?
Sure, here is the script I used. I made some changes to the default configuration as well.
Right, I'm referring to the variance in per-fold metrics (e.g., nDCG@20 for fold 1). I meant that I am not convinced that VBERT was being correctly trained at that time. I missed that you were concentrating on fold 1 only, which changes a bit of what I said about the OpenNIR results (see below). By aggregation I meant averaging the per-query metrics from across all five folds, so I think we're on the same page.
I missed that these were fold 1 metrics before. Your metrics are roughly in line with what I see for fold 1, so I don't think there is anything grossly wrong with your training setup.
My bad, the confusion comes from the fact that I thought the metrics you reported were across all folds. P@20 = 0.4167 and nDCG@20 = 0.4826 are the test set metrics across folds 1-5. The details are shown here.
Thanks for clarifying, Andrew. This means that our fold 1 results are now close (and I don't seem to have any gross errors in my training setup). I meanwhile also trained CEDR-KNRM with OpenNIR and again got results (P@20 = 0.3660 and nDCG@20 = 0.4259) that are close to your fold 1 results and to the fold 1 results I got with the CEDR repo. I'll later run your script for training on all folds, but I am now quite confident that I can reproduce the numbers you obtained.
Thanks for all your help on this so far! The remaining question is what caused the gap between these numbers and those reported in the paper, i.e. those obtained from the aggregated results in #18 (P@20 = 0.4667 and nDCG@20 = 0.5381 when evaluating the concatenated per-fold run files).
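For concreteness, a hedged sketch of how such an aggregated evaluation could be carried out; the run file, qrels, and gdeval script names are placeholders, not the exact commands used in this thread:

```bash
# Concatenate the per-fold test runs into a single run covering all queries,
# then evaluate with trec_eval (P@20) and gdeval (nDCG@20).
cat cedrknrm-robust-f1.run cedrknrm-robust-f2.run cedrknrm-robust-f3.run \
    cedrknrm-robust-f4.run cedrknrm-robust-f5.run > cedrknrm-robust-all.run

trec_eval -m P.20 qrels.robust04.txt cedrknrm-robust-all.run
perl gdeval.pl qrels.robust04.txt cedrknrm-robust-all.run   # nDCG@20 by default
```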
@seanmacavaney do you have any updates on this? I'm in the process of selecting potential candidates for a ranking pipeline and it is therefore important for me to be able to reproduce the numbers in the paper.
Hi @krasserm -- sorry for the delays. I'm trying to balance a variety of priorities right now, and I have not had much time to dig into this.
I've spent some time looking into the reproduction issues from a different angle by implementing CEDR-KNRM in a toolkit [1] known to work well with Transformer-based models like PARADE [2]. tl;dr If you're interested in using CEDR, variants replacing BERT with ELECTRA perform well and sometimes exceed the results reported in the paper. If you're curious about why CEDR has been hard to reproduce, I've made some progress, but there are still missing pieces. Results with BERT base are slightly better than the previous ones. I also tried ELECTRA base, which has performed well elsewhere [2,3], and saw a bigger improvement.
CEDR-KNRM, BERT-KNRM, and VBERT correspond to the models from the paper. +MS MARCO indicates fine-tuning on MS MARCO prior to robust04 (as in [2,3]). The main model differences compared to this repo are (1) using the weight initialization from PyTorch 0.4 rather than the initialization in 1.0+ and (2) adding a hidden layer to the network predicting the relevance score. The 4-passage setting considers 900 total tokens, which is close to this repo's setting. The 8-passage setting uses 1800 tokens.

ELECTRA's nDCG@20 sometimes surpasses the paper's best results, but this model was not used originally (and didn't yet exist, of course). ELECTRA is essentially BERT with improved pretraining, so the difference between the two is surprising. Overall, something is still missing, though, because the BERT configurations are substantially lower than CEDR-KNRM's ~0.538 nDCG@20 from the paper. Given that the PyTorch version in this repo seems to have been wrong, it's possible an earlier version of PyTorch was used to produce the original results. You can find additional results and reproduction instructions here.

[1] Capreolus. https://capreolus.ai
[2] PARADE: Passage Representation Aggregation for Document Reranking. Canjia Li, Andrew Yates, Sean MacAvaney, Ben He, and Yingfei Sun. arXiv 2020.
[3] Comparing Score Aggregation Approaches for Document Retrieval with Pretrained Transformers. Xinyu Zhang, Andrew Yates, and Jimmy Lin. ECIR 2021.
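To make difference (2) above concrete, here is a minimal, hedged PyTorch sketch of the two scoring heads; the class names, feature dimensions, and hidden size are made up for illustration and are not taken from either codebase:

```python
import torch
import torch.nn as nn

class LinearScoreHead(nn.Module):
    """Single linear layer mapping the combined features to a relevance score."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.combine = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.combine(features)                        # (batch, 1)

class HiddenScoreHead(nn.Module):
    """Adds one hidden layer before the final score, as in the reimplementation."""
    def __init__(self, feature_dim: int, hidden_dim: int = 100):
        super().__init__()
        self.hidden = nn.Linear(feature_dim, hidden_dim)
        self.combine = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.combine(torch.relu(self.hidden(features)))

# Hypothetical combined features, e.g. [CLS] embedding concatenated with KNRM kernel scores.
features = torch.randn(2, 768 + 11 * 16)
print(LinearScoreHead(features.shape[1])(features).shape)    # torch.Size([2, 1])
print(HiddenScoreHead(features.shape[1])(features).shape)    # torch.Size([2, 1])
```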
This is a follow-up on #21. I tried to reproduce the results on Robust 04 but failed to do so using the code in this repository. In the following I report my results on test fold `f1`, obtained in 3 experiments.

Experiment 1: Use provided CEDR-KNRM weights and `.run` files

When evaluating the provided `cedrknrm-robust-f1.run` file from #18, I'm getting P@20 = 0.4470 and nDCG@20 = 0.5177. When using a `.run` file generated with the provided weights `cedrknrm-robust-f1.p`, I'm getting P@20 = 0.4290 and nDCG@20 = 0.5038. I'd expect these metrics to be equal to those of the provided `cedrknrm-robust-f1.run` file. What is the reason for this difference?
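For reference, a hedged sketch of this step; the flag names are based on this repository's `rerank.py` and may differ slightly, and all paths are placeholders rather than the exact commands used:

```bash
# Generate a fold 1 run file with the provided CEDR-KNRM weights ...
python rerank.py \
  --model cedr_knrm \
  --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --run data/robust/f1.test.run \
  --model_weights cedrknrm-robust-f1.p \
  --out_path my-cedrknrm-robust-f1.run

# ... then evaluate the generated and the provided run files the same way.
trec_eval -m P.20 data/robust/qrels my-cedrknrm-robust-f1.run
trec_eval -m P.20 data/robust/qrels cedrknrm-robust-f1.run
```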
Experiment 2: Train my own BERT and CEDR-KNRM models

This is where I'm getting results that are far below the expected results (only for CEDR-KNRM, not for Vanilla BERT). I started by training and evaluating a Vanilla BERT ranker. I'm getting P@20 = 0.3690 and nDCG@20 = 0.4231, which is consistent with evaluating the provided `vbert-robust-f1.run` file: that gives P@20 = 0.3550 and nDCG@20 = 0.4219, which comes quite close. I understand that here I simply ignored the inconsistencies reported in #21, but it is at least a coarse cross-check of model performance on a single fold. When training a CEDR-KNRM model with this BERT model as initialization (see the sketch below), I'm getting P@20 = 0.3790 and nDCG@20 = 0.4347. This is slightly better than a Vanilla BERT ranker but far below the performance obtained in Experiment 1. I also repeated Experiment 2 with `f1.test.run`, `f1.valid.run` and `f1.train.pairs` files that I generated myself from Anserini runs with a default BM25 configuration, and I still get results very close to those above.

Has anyone been able to get results similar to those in Experiment 1 by training a BERT and CEDR-KNRM model as explained in the project's README?
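A hedged sketch of the training sequence referred to above; flag names are based on this repository's `train.py` and may differ slightly, and the output directories are placeholders:

```bash
# 1) Train and validate a Vanilla BERT ranker on fold 1.
python train.py \
  --model vanilla_bert \
  --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --qrels data/robust/qrels \
  --train_pairs data/robust/f1.train.pairs \
  --valid_run data/robust/f1.valid.run \
  --model_out_dir models/vbert-f1

# 2) Train CEDR-KNRM, initializing BERT from the weights learned in step 1.
python train.py \
  --model cedr_knrm \
  --datafiles data/robust/queries.tsv data/robust/documents.tsv \
  --qrels data/robust/qrels \
  --train_pairs data/robust/f1.train.pairs \
  --valid_run data/robust/f1.valid.run \
  --initial_bert_weights models/vbert-f1/weights.p \
  --model_out_dir models/cedrknrm-f1
```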
Experiment 3: Use provided `vbert-robust-f1.p` weights as initialization for CEDR-KNRM training

I made this experiment in an attempt to debug the performance gap found in the previous experiment. I'm fully aware that training and evaluating a CEDR-KNRM model on fold 1 (i.e. `f1`) with the provided `vbert-robust-f1.p` is invalid because of the inconsistencies reported in #21: the folds used for training/validating/testing `vbert-robust-f1.p` differ from those in `data/robust/f[1-5]*`. In other words, validation and evaluation of the trained CEDR-KNRM model is done with queries that were used for training the provided `vbert-robust-f1.p`, so this setup partially uses training data for evaluation, which of course gives better evaluation results. I was surprised to see that with this invalid setup I'm able to reproduce the numbers obtained in Experiment 1, or at least come very close: I'm getting a CEDR-KNRM performance of P@20 = 0.4400 and nDCG@20 = 0.5050. Given these results and the inconsistencies reported in #21, I wonder whether the performance of the `cedrknrm-robust-f[1-5].run` checkpoints is the result of an invalid CEDR-KNRM training and evaluation setup or, more likely, whether I did something wrong. Any hints appreciated!
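As a possibly useful diagnostic for the fold mismatch suspected here (and in #21), a hedged Python sketch that compares the query IDs in the provided fold 1 run file with those in `data/robust/f1.test.run`; the file names are assumptions, and both files are assumed to be in standard TREC run format with the query ID in the first column:

```python
def run_query_ids(path):
    """Collect the set of query IDs appearing in a TREC-format run file."""
    with open(path) as f:
        return {line.split()[0] for line in f if line.strip()}

provided = run_query_ids('cedrknrm-robust-f1.run')     # run file from #18 (assumed name)
repo_fold = run_query_ids('data/robust/f1.test.run')   # fold definition shipped in this repo

print('queries only in provided run:', sorted(provided - repo_fold))
print('queries only in repo fold 1 :', sorted(repo_fold - provided))
print('overlap:', len(provided & repo_fold), 'of', len(provided | repo_fold))
```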