Train set and test set ranking distribution difference #16
Comments
Good questions :)
That's not exactly true: for BART and the other models, the checkpoint is selected based on its performance on the evaluation set, and a model that overfits the training set too much would not perform well on the evaluation set.
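As a generic illustration of that checkpoint-selection argument, here is a minimal sketch; it is not taken from this repository, and the helper names are hypothetical.

```python
# Minimal, hypothetical sketch of validation-based checkpoint selection:
# the checkpoint kept is the one that scores best on the held-out
# evaluation set, so a checkpoint that heavily overfits the training set
# is unlikely to be the one selected.
def select_best_checkpoint(checkpoint_paths, validation_score):
    """checkpoint_paths: iterable of saved checkpoint paths.
    validation_score: caller-supplied callable mapping a checkpoint path
    to a scalar (e.g., mean ROUGE on the validation split)."""
    best_path, best_score = None, float("-inf")
    for path in checkpoint_paths:
        score = validation_score(path)
        if score > best_score:
            best_path, best_score = path, score
    return best_path
```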
We found diverse beam search to be very useful for generating diverse candidates. Please refer to https://github.com/yixinL7/BRIO/blob/main/gen_candidate.py.
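For readers unfamiliar with diverse beam search, here is a minimal sketch using the Hugging Face transformers generate API; the parameter values are illustrative assumptions, and gen_candidate.py linked above is the authoritative version.

```python
# Minimal sketch of diverse beam search candidate generation with
# Hugging Face transformers. Parameter values here are illustrative;
# see gen_candidate.py for the settings actually used.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(name)
model = BartForConditionalGeneration.from_pretrained(name).to(device)

article = "..."  # one source document
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt").to(device)

# Diverse beam search: the beams are split into groups and a diversity
# penalty discourages different groups from emitting the same tokens,
# so the returned candidates differ from one another.
outputs = model.generate(
    **inputs,
    num_beams=16,
    num_beam_groups=16,
    diversity_penalty=1.0,
    num_return_sequences=16,
    max_length=140,
    min_length=55,
    no_repeat_ngram_size=3,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```

With num_beam_groups equal to num_beams, each group runs a single beam, which maximizes diversity across the returned candidates.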
PEGASUS is only fine-tuned on XSum.
Firstly, having similar ROUGE scores doesn't necessarily mean the data distributions are the same. For example, if you calculate the extractive-oracle performance on the training set and the test set of CNN/DM, you will find the score is higher on the test set.
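To make the extractive-oracle comparison concrete, here is a rough, hypothetical sketch of a greedy sentence-selection oracle using the rouge_score package; it is not the exact script behind the numbers above.

```python
# Greedy extractive oracle (hypothetical sketch): add source sentences one
# at a time as long as they improve ROUGE against the reference. Averaging
# the resulting scores separately over the training and test splits is one
# way to see that similar system-level ROUGE does not imply identical data
# distributions.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def oracle_score(source_sentences, reference, max_sentences=3):
    selected, best = [], 0.0
    for _ in range(max_sentences):
        gain, pick = 0.0, None
        for sent in source_sentences:
            if sent in selected:
                continue
            candidate = " ".join(selected + [sent])
            scores = scorer.score(reference, candidate)
            score = (scores["rouge1"].fmeasure + scores["rouge2"].fmeasure) / 2
            if score - best > gain:
                gain, pick = score - best, sent
        if pick is None:  # no remaining sentence improves the score
            break
        selected.append(pick)
        best += gain
    return best
```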
Hi, thanks for the reply, but I am still a little bit confused.
This is true. But when the model is used to generate candidates on the training set, it has already seen the ground-truth summaries during training, so p_{model}(reference_summary) = 1 as you mentioned. How, then, can the average max ROUGE score on the training set be almost equivalent to that on the test set?
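For clarity, a small hypothetical sketch of the "average max ROUGE" statistic being discussed is below; the function and variable names are illustrative, not from the repo.

```python
# Hypothetical sketch of "average max ROUGE": for each example, score every
# generated candidate against the reference, keep the best one, and average
# that best score over the split. Running this on the training split and on
# the test split yields the two numbers being compared in this thread.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def average_max_rouge(candidate_lists, references):
    """candidate_lists[i] holds the generated candidates for references[i]."""
    total = 0.0
    for candidates, reference in zip(candidate_lists, references):
        total += max(scorer.score(reference, c)["rouge1"].fmeasure for c in candidates)
    return total / len(references)
```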
Also, as you recommended in #14, I checked the paper SummaReranker: A Multi-Task Mixture-of-Experts Re-ranking Framework for Abstractive Summarization, and it indeed uses some special tricks for this mismatch problem.
I'd like to emphasize my point that if the model overfits the training set too much, it will not perform well on the evaluation set. So it's possible that the selected checkpoint doesn't really overfit the training data.
Hi, sorry for taking your time, but here I think it is maybe not an overfitting problem but a memorization problem. If the training set and the validation set give the same results with respect to some metrics, what is the point of the validation set? I think the purpose of a validation set is to test the model's ability on data from the same distribution as the training data, but not exactly the same data, given the model's memorization capacity. I admit this is an empirical question. And thanks so much for providing the reranking data and generation scripts. But considering the large dataset… So just to be clear, the whole process of…
I have to admit this surprises me a lot, because from my previous experience training a transformer model from scratch on a translation or summarization task, the …
Hi, since the model used for CNN/DM is facebook/bart-large-cnn, the model was actually fine-tuned on the CNN/DM training set. Considering neural models' amazing capacity for memorization, the candidates generated on the training set for the evaluation model should be nearly perfect. Do I understand this correctly? How do you avoid this in order to generate useful data for ranking? And was PEGASUS also fine-tuned on CNN/DM before generating the summary candidates? Thanks.