
Difficulty achieving the few-shot learning improvements on FiQA reported in Table 3 #16

Open
xhluca opened this issue Feb 17, 2023 · 1 comment


xhluca commented Feb 17, 2023

I found Contriever quite interesting based on Table 3 of the paper (few-shot retrieval): Contriever-MSMarco achieves a score of 38.1 when fine-tuned on FiQA, which is much higher than BERT-MSMarco at ~31. The difference is even bigger when comparing Contriever and BERT (the checkpoints that were not first fine-tuned on MS MARCO), where Contriever achieves a 10-point improvement:

[Screenshot: Table 3 of the Contriever paper, few-shot retrieval results including FiQA]

I've tried a similar setup (similar to DPR), with the following differences (a minimal sketch of the resulting training step is included after this list):

  1. Trained for 20 epochs instead of 500.
  2. AdamW instead of ASAM.
  3. Included BM25 hard negatives (i.e. top retrieved results that are not a gold label) in addition to in-batch negative sampling.
  4. Batch size of 128 instead of 256 (though the number of negatives per query should be the same, thanks to the hard negatives).
  5. Instead of early stopping, I trained for the full 20 epochs and saved the checkpoint from the epoch with the best dev NDCG@10.
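
For concreteness, here is a minimal sketch of the training step described above. This is my own code, not this repo's: the pooling follows the Contriever model card, while the temperature of 0.05 and the helper names (`embed`, `contrastive_step`) are my own placeholders and assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# My own setup, not the repo's training code.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

def mean_pooling(token_embeddings, mask):
    # Average over non-padding tokens, as shown on the Contriever model card.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = encoder(**inputs)
    return mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

def contrastive_step(queries, positives, bm25_negatives, temperature=0.05):
    # queries / positives / bm25_negatives: parallel lists of B strings each.
    q = embed(queries)                     # (B, d)
    d = embed(positives + bm25_negatives)  # (2B, d); rows 0..B-1 are the golds
    scores = q @ d.T / temperature         # in-batch + BM25 hard negatives
    labels = torch.arange(len(queries))    # gold passage for query i is column i
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

This is also why I expect difference 4 to be benign: with a batch size of 128, each query sees 255 negatives (127 other in-batch golds plus 128 hard negatives), the same count as a 256 in-batch-only setup.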

It seems that under those settings, the improvement isn't as large as the difference reported in the paper:

| split | epoch | metric | model_name | learning_rate | k=1 | k=3 | k=5 | k=10 | k=100 | k=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test | 7 | ndcg | facebook/contriever-msmarco | 1e-05 | 0.24383 | 0.25005 | 0.2608 | 0.28715 | 0.36118 | 0.39975 |
| test | 16 | ndcg | facebook/contriever | 3e-05 | 0.25 | 0.23583 | 0.24952 | 0.2732 | 0.35149 | 0.39019 |
| test | 12 | ndcg | roberta-base | 5e-05 | 0.25309 | 0.22701 | 0.24416 | 0.26293 | 0.33809 | 0.37927 |
| test | 16 | ndcg | bert-base-uncased | 2e-05 | 0.21451 | 0.20465 | 0.21947 | 0.23826 | 0.31088 | 0.35118 |

Note that the NDCG@10 of the Contriever model is 3.49 points higher than that of bert-base-uncased (I tried learning rates between 1e-5 and 5e-5), which is smaller than the 10.3-point improvement shown in the screenshot (26.1 -> 36.4). I am not surprised that the absolute results are lower given the differences in hyperparameters, but the gap in the improvement surprises me. Is Contriever harder to fine-tune when using the Adam optimizer? Or are we expected to use a batch size of 256 and/or avoid hard negatives from BM25?
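
For reference, the table above comes from a standard BEIR-style FiQA evaluation; the sketch below shows roughly what I run. The checkpoint path is hypothetical, and I'm assuming the fine-tuned encoder has been exported in a sentence-transformers-compatible format with mean pooling:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download FiQA in BEIR format.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# "my-finetuned-contriever" is a hypothetical local path to the fine-tuned encoder.
model = DRES(models.SentenceBERT("my-finetuned-contriever"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot",
                              k_values=[1, 3, 5, 10, 100, 1000])
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # NDCG@{1,3,5,10,100,1000}, i.e. the columns in the table above
```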

Is it possible to either:

  1. provide the HuggingFace checkpoints of contriever and contriever-msmarco fine-tuned on FiQA, or
  2. share scripts that let me reproduce the process of fine-tuning contriever or contriever-msmarco on FiQA and saving the checkpoint as a HuggingFace model?
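
For option 2, the export step I have in mind is just the usual transformers round-trip; a minimal sketch, with a hypothetical output directory, reusing `encoder` and `tokenizer` from the training sketch above:

```python
from transformers import AutoModel, AutoTokenizer

save_dir = "contriever-msmarco-fiqa"  # hypothetical output directory
encoder.save_pretrained(save_dir)     # `encoder` / `tokenizer` come from the
tokenizer.save_pretrained(save_dir)   # training sketch earlier in this issue

# The result can then be reloaded exactly like the official checkpoints:
encoder = AutoModel.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```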

Thank you!

@siyuanseever

I have the same requirements and have encountered similar problems: my few-shot implementation scores are much worse than those in the paper.
