
Set up a pretrained backbone ViT and a training pipeline #27

Closed

Grutschus opened this issue Nov 22, 2023 · 2 comments · Fixed by #32
Grutschus (Owner) commented Nov 22, 2023

Now that we know that the model can generally train, we need to figure out how to design the training pipeline:

  • Which backbone do we use? (There are different ViT sizes.)
  • We need to set up the model so that pretrained weights are used for the backbone
  • Which data preprocessing operations should we use in the training pipeline (data augmentation, transforming the data to the right shape, etc.)?
  • Which data preprocessing operations should be used in the validation pipeline (which metrics to consider, whether to use different sampling operations)?
  • We need to set up the classification head. For that, we need to consider which one to use, whether we can use a pretrained version that only needs fine-tuning (plus, of course, a fresh initialization of the final layer), etc.
  • Run the training on a tiny subset of the data to check whether the model successfully overfits the sample (see the sanity-check sketch below)

This obviously requires heavy reading into the VideoMAEv2 paper :D
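
For the last item, here is a minimal sketch of such an overfitting sanity check in plain PyTorch. `model` and `dataset` are placeholders for whatever our config ends up producing, and all names and hyperparameters here are illustrative, not final:

```python
import torch
from torch.utils.data import DataLoader, Subset

def overfit_check(model, dataset, num_samples=8, steps=200, lr=1e-3):
    """Train on a single tiny batch; the loss should drop towards zero
    if the model, data pipeline, and loss are wired up correctly."""
    loader = DataLoader(Subset(dataset, range(num_samples)), batch_size=num_samples)
    clips, labels = next(iter(loader))  # one fixed batch, reused every step;
                                        # assumes dataset yields (clip, label)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```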

Grutschus added this to the Development freeze milestone Nov 22, 2023
Grutschus self-assigned this Nov 22, 2023
Grutschus changed the title from "Setup up a pretrained backbone ViT and a training pipeline" to "Set up a pretrained backbone ViT and a training pipeline" Nov 22, 2023
Grutschus (Owner, Author) commented:

Just gave the VideoMAEv2 paper a proper read - good one, can recommend :D

Here are my key insights:

  • The goal of VideoMAEv2 is to provide a large foundation model for video understanding. The key challenge in training ViT models on video data is, as usual, model and data scaling. To address it, the authors present a method for self-supervised video pretraining: large amounts of unlabeled data can be used to train a vanilla ViT encoder by masking the videos in a clever way (see the sketch below). For us, that means we can simply use their pretrained weights and get a (hopefully) very capable encoder.
  • It might not be strictly necessary to fine-tune the backbone for our purposes. Apparently, the authors themselves only fine-tuned the model for a handful of epochs to achieve their results.
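
To make the "clever masking" a bit more concrete, here is a rough sketch of the tube-masking idea from VideoMAE (VideoMAEv2 additionally masks the decoder input, which I'm leaving out here); the shapes and the mask ratio are illustrative:

```python
import torch

def tube_mask(num_temporal_tokens: int, num_patches: int, mask_ratio: float = 0.9):
    """Sample one random spatial mask and repeat it across all temporal
    token slices ("tubes"), so a masked patch stays hidden in every frame
    and cannot be trivially reconstructed from neighbouring frames."""
    num_masked = int(num_patches * mask_ratio)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:num_masked]] = True   # (P,)
    return mask.unsqueeze(0).expand(num_temporal_tokens, -1)  # (T, P)
```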

Grutschus (Owner, Author) commented:

I have now looked a little into the code.

In our framework, we pretty much have the ViT backbone that has been pretrained and then fine-tuned on the Kinetics-400 dataset. Unfortunately, only the ViT-B and ViT-S backbones are available. While it is possible to request weights for larger transformers from the original authors of VideoMAEv2, we cannot even fit ViT-B onto the T4 GPUs - we'd need more compute. Thus, we are somewhat stuck with the ViT-S backbone for now. This is unfortunate, since the real benefits of these foundation models apparently come from scaling them up.

Nonetheless, we have made our lives pretty easy with the task we chose. The classification head used in the framework is literally just an fc layer (and, optionally, a dropout layer).
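
For reference, that head amounts to nothing more than this (a sketch in plain PyTorch; embed_dim=384 matches ViT-S, and the class count is just an example):

```python
import torch.nn as nn

class LinearClsHead(nn.Module):
    """Dropout followed by a single fully connected layer."""
    def __init__(self, embed_dim: int = 384, num_classes: int = 5, dropout: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):  # x: (B, embed_dim) pooled backbone features
        return self.fc(self.dropout(x))
```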

So, here is the plan:

  1. We lock the weights of the backbone completely, train only the classification head, and see where that takes us.
    This is almost a trivial experiment, since we effectively only have (EMBED_DIM + 1) x NUM_CLASSES weights to train, i.e. roughly 2k. I don't expect any great results from it (see the freezing sketch after this list).
  2. The next thing we could try is fine-tuning the backbone further. The authors of the original VideoMAEv2 paper have provided some of the hyperparameters they used for fine-tuning.
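
For step 1, the freezing could look like the sketch below. The `backbone`/`cls_head` attribute names are assumptions about how the model is structured, and the printed count should match the (EMBED_DIM + 1) x NUM_CLASSES estimate, e.g. (384 + 1) x 5 = 1925 ≈ 2k for ViT-S and five classes:

```python
def freeze_backbone(model):
    """Freeze everything except the classification head (step 1 above).
    Assumes the model exposes `backbone` and `cls_head` attributes --
    adjust to whatever our framework actually calls them."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")  # expect ~ (embed_dim + 1) * num_classes
```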
