
Difficulty achieving the few-shot learning improvements on FiQA reported in Table 3 #16

Open
xhluca opened this issue Feb 17, 2023 · 1 comment


xhluca commented Feb 17, 2023

I found Contriever quite interesting based on Table 3 of the paper (few-shot retrieval): Contriever-MSMarco achieves a score of 38.1 when fine-tuned on FiQA, which is much higher than BERT-MSMarco at ~31. The difference is even bigger when comparing Contriever and BERT (the checkpoints that were not first fine-tuned on MS MARCO), where Contriever achieves a 10-point improvement:

[Screenshot: Table 3 of the Contriever paper, few-shot retrieval results including FiQA]

I've tried a similar setup (similar to DPR), with the following differences (a minimal sketch of the resulting training step is included after this list):

  1. Trained for 20 epochs instead of 500.
  2. AdamW instead of ASAM.
  3. Included BM25 hard negatives (i.e. top retrieved results that are not a gold label) in addition to in-batch negative sampling.
  4. Batch size of 128 instead of 256 (though the number of negatives per query should be the same, thanks to the hard negatives).
  5. Instead of early stopping, I trained for the full 20 epochs and saved the checkpoint from the epoch with the best dev NDCG@10.
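
For concreteness, here is a minimal sketch of the training step described above. This is my own code, not this repo's: the pooling follows the Contriever model card, while the temperature of 0.05 and the helper names (`embed`, `contrastive_step`) are my own placeholders and assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# My own setup, not the repo's training code.
tokenizer = AutoTokenizer.from_pretrained("facebook/contriever-msmarco")
encoder = AutoModel.from_pretrained("facebook/contriever-msmarco")
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-5)

def mean_pooling(token_embeddings, mask):
    # Average over non-padding tokens, as shown on the Contriever model card.
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.0)
    return token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = encoder(**inputs)
    return mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])

def contrastive_step(queries, positives, bm25_negatives, temperature=0.05):
    # queries / positives / bm25_negatives: parallel lists of B strings each.
    q = embed(queries)                     # (B, d)
    d = embed(positives + bm25_negatives)  # (2B, d); rows 0..B-1 are the golds
    scores = q @ d.T / temperature         # in-batch + BM25 hard negatives
    labels = torch.arange(len(queries))    # gold passage for query i is column i
    loss = F.cross_entropy(scores, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

This is also why I expect difference 4 to be benign: with a batch size of 128, each query sees 255 negatives (127 other in-batch golds plus 128 hard negatives), the same count as a 256 in-batch-only setup.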

It seems that under those settings, the improvement isn't as large as the difference reported in the paper:

| split | epoch | metric | model_name | learning_rate | k=1 | k=3 | k=5 | k=10 | k=100 | k=1000 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| test | 7 | ndcg | facebook/contriever-msmarco | 1e-05 | 0.24383 | 0.25005 | 0.2608 | 0.28715 | 0.36118 | 0.39975 |
| test | 16 | ndcg | facebook/contriever | 3e-05 | 0.25 | 0.23583 | 0.24952 | 0.2732 | 0.35149 | 0.39019 |
| test | 12 | ndcg | roberta-base | 5e-05 | 0.25309 | 0.22701 | 0.24416 | 0.26293 | 0.33809 | 0.37927 |
| test | 16 | ndcg | bert-base-uncased | 2e-05 | 0.21451 | 0.20465 | 0.21947 | 0.23826 | 0.31088 | 0.35118 |

Note that the NDCG@10 of the Contriever model is 3.49 points higher than that of bert-base-uncased (I tried learning rates between 1e-5 and 5e-5), which is smaller than the 10.3-point improvement shown in the screenshot (26.1 -> 36.4). I am not surprised that the absolute results are lower given the differences in hyperparameters, but the gap in the improvement surprises me. Is Contriever harder to fine-tune when using the Adam optimizer? Or are we expected to use a batch size of 256 and/or avoid hard negatives from BM25?
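
For reference, the table above comes from a standard BEIR-style FiQA evaluation; the sketch below shows roughly what I run. The checkpoint path is hypothetical, and I'm assuming the fine-tuned encoder has been exported in a sentence-transformers-compatible format with mean pooling:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download FiQA in BEIR format.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/fiqa.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")

# "my-finetuned-contriever" is a hypothetical local path to the fine-tuned encoder.
model = DRES(models.SentenceBERT("my-finetuned-contriever"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot",
                              k_values=[1, 3, 5, 10, 100, 1000])
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # NDCG@{1,3,5,10,100,1000}, i.e. the columns in the table above
```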

Is it possible to either:

  1. provide the HuggingFace checkpoints of contriever and contriever-msmarco fine-tuned on FiQA, or
  2. share scripts that let me reproduce the process of fine-tuning contriever or contriever-msmarco on FiQA and saving the checkpoint as a HuggingFace model?
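
For option 2, the export step I have in mind is just the usual transformers round-trip; a minimal sketch, with a hypothetical output directory, reusing `encoder` and `tokenizer` from the training sketch above:

```python
from transformers import AutoModel, AutoTokenizer

save_dir = "contriever-msmarco-fiqa"  # hypothetical output directory
encoder.save_pretrained(save_dir)     # `encoder` / `tokenizer` come from the
tokenizer.save_pretrained(save_dir)   # training sketch earlier in this issue

# The result can then be reloaded exactly like the official checkpoints:
encoder = AutoModel.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```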

Thank you!

@siyuanseever

I have the same requirements and have encountered similar problems: my few-shot implementation scores are much worse than those in the paper.
