Pre-training step #6
@dschaehi I will provide the splits and the pretrain script tonight. Pretrained VL models have always excluded these test images from the pretraining dataset, because the finetuning uses the same dataset, just in a different way. For example, it is absolutely wrong to pretrain a VL model with the masked language modelling objective, where the model sees the whole caption (except the randomly chosen masked words), and then later finetune this VL model on the image captioning task, because the pretraining step has already seen the test caption that the finetuned model is supposed to predict. In summary, when the same dataset is used for pretraining and finetuning, regardless of the task, the finetuning test dataset should be excluded from the pretraining dataset.
In our case, the pretraining dataset (image captioning) is completely different from the finetuning dataset (Natural Language Explanations); only the images are shared. Whether or not it is fair to use the finetuning test images during pretraining is debatable, but the general principle is that the test dataset should be something the model has never seen before and has no idea about. Essentially, allowing the model to learn about these finetuning NLE test images in a different way (e.g. image captioning) distills knowledge about these images into the pretrained model. Therefore, pretraining with the NLE test images is wrong, and we avoided it. Hope this is clear now.
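A minimal sketch of the filtering step described above: drop every pretraining caption whose image also appears in an NLE test split. The file names and JSON layouts below are assumptions for illustration only, not the repo's actual scripts or formats.

```python
# Sketch (assumed file names/formats): exclude NLE test images from the
# combined captioning data used for pretraining.
import json

def load_image_ids(path):
    """Collect the image ids referenced by an annotation file (assumed layout)."""
    with open(path) as f:
        data = json.load(f)
    return {entry["image_id"] for entry in data}

# Hypothetical NLE test annotation files (e.g. VQA-X / e-SNLI-VE / ACT-X style).
nle_test_ids = set()
for test_file in ["vqax_test.json", "esnlive_test.json", "actx_test.json"]:
    nle_test_ids |= load_image_ids(test_file)

# Hypothetical merged captioning annotations (COCO, Flickr30k, VG, paragraphs).
with open("pretrain_captions.json") as f:
    captions = json.load(f)

# Keep only caption entries whose image never occurs in a fine-tuning test split.
filtered = [c for c in captions if c["image_id"] not in nle_test_ids]

with open("pretrain_captions_filtered.json", "w") as f:
    json.dump(filtered, f)

print(f"kept {len(filtered)} of {len(captions)} caption entries")
```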
Hi @fawazsammani, thank you again for your answer!
This is great. Thanks!
I find this a bit confusing to follow. If I understand correctly, only the images from the fine-tuning datasets are shared with the pre-training datasets, which is acceptable (though debatable) because they are used for two different tasks, i.e., image captioning vs. NLE?
@dschaehi Correct, but we avoid this anyway. Regards
Hi @fawazsammani, thanks for the clarification so far.
Hi again @dschaehi. However, if you need the pretrained model, it is already available in the Models section. I do not see any need to train it again and waste computational resources when we have already done so :) Regards
Hi @fawazsammani, |
Hello @dschaehi, feel free to reopen this issue if you have any other doubts. Regards
Great! Thank you very much! |
Hi @fawazsammani,
Since the repo provides pretrained models but not a pretraining script, I am wondering which split to use to pretrain on the four datasets mentioned in the paper (i.e., COCO Captions, Flickr30k, VG, and image paragraph captioning). I think this is not well described in the paper. Would I need to split the datasets for pre-training, or can I pre-train the model on the entire datasets without splitting?