Hello! I'm very interested in your great work! I have two questions about pretraining.
Does the generalization ability of UMT come from CLIP? If so, then regardless of which pretraining dataset is used, stage 1 amounts to approximating the weights of the open-source CLIP. So is the choice of pretraining dataset in stage 1 actually important?
Here's another question. Is the stage-2 pretraining helpful for visual-only tasks? If we fine-tune the stage-2 pretrained model on a visual-only dataset, will it outperform the stage-1 pretrained model?
Looking forward to your reply!
High-quality video should be better. However, when I used WebVid, which is ~10x larger than K400, with 1/10 the epochs, the result was worse. That's why I only use videos from action recognition datasets; see InternVideo2.
Good question! Under a full fine-tuning setting, the stage-2 checkpoint performs similarly to the stage-1 checkpoint. But under a frozen-tuning setting, the multi-modal training helps, and stage 2 performs much better.
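For readers unfamiliar with the distinction: in full fine-tuning all parameters are updated, while in frozen tuning (linear probing) the pretrained backbone is frozen and only a new task head is trained. A minimal PyTorch sketch, assuming a generic `nn.Module` backbone standing in for a UMT stage-1 or stage-2 checkpoint:

```python
# Sketch of full-tuning vs. frozen-tuning (linear probing).
# The toy backbone below is a stand-in assumption; a real UMT
# checkpoint would be loaded in its place.
import torch.nn as nn

def build_classifier(backbone: nn.Module, feat_dim: int, num_classes: int,
                     freeze_backbone: bool) -> nn.Module:
    """Attach a linear head; optionally freeze the pretrained backbone."""
    if freeze_backbone:
        for p in backbone.parameters():
            p.requires_grad = False  # frozen tuning: only the head learns
    return nn.Sequential(backbone, nn.Linear(feat_dim, num_classes))

backbone = nn.Linear(16, 16)  # placeholder for a pretrained encoder
model = build_classifier(backbone, feat_dim=16, num_classes=400,
                         freeze_backbone=True)
# Only head parameters remain trainable under the frozen setting.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

With `freeze_backbone=False` (full fine-tuning), the optimizer updates every parameter, which is the setting where the stage-1 and stage-2 checkpoints perform similarly.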
Your answer is really helpful, thank you!
If I want to apply the model to video domains other than action recognition, would it help to continue stage-1 pretraining on videos from those domains? Or do you have any other suggestions for improving performance in other video domains?
Looking forward to your reply!