
Set up a pretrained backbone ViT and a training pipeline #27

Closed

Grutschus opened this issue Nov 22, 2023 · 2 comments · Fixed by #32
Grutschus (Owner) commented Nov 22, 2023

Now that we know that the model can generally train, we need to figure out how to design the training pipeline:

  • Which backbone do we use? (There are different ViT sizes.)
  • We need to set up the model so that pretrained weights are used for the backbone
  • Which data preprocessing operations should we use in the training pipeline (data augmentation, transforming the data to the right shape, etc.)?
  • Which data preprocessing operations should be used in the validation pipeline (which metrics to consider, whether to use different sampling operations)?
  • We need to set up the classification head. For that, we need to consider which one to use, whether we can use a pretrained version that only needs fine-tuning (plus, of course, a fresh initialization of the final layer), etc.
  • Run the training on a tiny subset of the data to check whether the model successfully overfits the sample (see the sanity-check sketch below)

This obviously requires heavy reading into the VideoMAEv2 paper :D
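
For the last item, here is a minimal sketch of such an overfitting sanity check in plain PyTorch. `model` and `dataset` are placeholders for whatever our config ends up producing, and all names and hyperparameters here are illustrative, not final:

```python
import torch
from torch.utils.data import DataLoader, Subset

def overfit_check(model, dataset, num_samples=8, steps=200, lr=1e-3):
    """Train on a single tiny batch; the loss should drop towards zero
    if the model, data pipeline, and loss are wired up correctly."""
    loader = DataLoader(Subset(dataset, range(num_samples)), batch_size=num_samples)
    clips, labels = next(iter(loader))  # one fixed batch, reused every step;
                                        # assumes dataset yields (clip, label)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    model.train()
    for step in range(steps):
        optimizer.zero_grad()
        loss = criterion(model(clips), labels)
        loss.backward()
        optimizer.step()
        if step % 50 == 0:
            print(f"step {step}: loss {loss.item():.4f}")
```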

Grutschus added this to the Development freeze milestone Nov 22, 2023
Grutschus self-assigned this Nov 22, 2023
Grutschus changed the title from "Setup up a pretrained backbone ViT and a training pipeline" to "Set up a pretrained backbone ViT and a training pipeline" Nov 22, 2023
Grutschus (Owner, Author) commented:

Just gave the VideoMAEv2 paper a proper read - good one, can recommend :D

Here are my key insights:

  • The goal of VideoMAEv2 is to provide a large foundation model for video understanding. The key challenge in training ViT models on video data is, as usual, model and data scaling. To address it, the authors present a method for self-supervised video pretraining: large amounts of unlabeled data can be used to train a vanilla ViT encoder by masking the videos in a clever way (see the sketch below). For us, that means we can simply use their pretrained weights and get a (hopefully) very capable encoder.
  • It might not be strictly necessary to fine-tune the backbone for our purposes. Apparently, the authors themselves only fine-tuned the model for a handful of epochs to achieve their results.
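
To make the "clever masking" a bit more concrete, here is a rough sketch of the tube-masking idea from VideoMAE (VideoMAEv2 additionally masks the decoder input, which I'm leaving out here); the shapes and the mask ratio are illustrative:

```python
import torch

def tube_mask(num_temporal_tokens: int, num_patches: int, mask_ratio: float = 0.9):
    """Sample one random spatial mask and repeat it across all temporal
    token slices ("tubes"), so a masked patch stays hidden in every frame
    and cannot be trivially reconstructed from neighbouring frames."""
    num_masked = int(num_patches * mask_ratio)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[torch.randperm(num_patches)[:num_masked]] = True   # (P,)
    return mask.unsqueeze(0).expand(num_temporal_tokens, -1)  # (T, P)
```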

Grutschus (Owner, Author) commented:

I have now looked a little into the code.

In our framework, we pretty much have the ViT backbone that has been pretrained and then fine-tuned on the Kinetics-400 dataset. Unfortunately, only the ViT-B and ViT-S backbones are available. While it is possible to request weights for larger transformers from the original authors of VideoMAEv2, we cannot even fit ViT-B onto the T4 GPUs - we'd need more compute. Thus, we are somewhat stuck with the ViT-S backbone for now. This is unfortunate, since the real benefits of these foundation models apparently come from scaling them up.

Nonetheless, we have made our lives pretty easy with the task we chose. The classification head used in the framework is literally just an fc layer (and, optionally, a dropout layer).
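
For reference, that head amounts to nothing more than this (a sketch in plain PyTorch; embed_dim=384 matches ViT-S, and the class count is just an example):

```python
import torch.nn as nn

class LinearClsHead(nn.Module):
    """Dropout followed by a single fully connected layer."""
    def __init__(self, embed_dim: int = 384, num_classes: int = 5, dropout: float = 0.5):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):  # x: (B, embed_dim) pooled backbone features
        return self.fc(self.dropout(x))
```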

So, here is the plan:

  1. We lock the weights of the backbone completely, train only the classification head, and see where that takes us.
    This is almost a trivial experiment, since we effectively only have (EMBED_DIM + 1) x NUM_CLASSES weights to train, i.e. roughly 2k. I don't expect any great results from it (see the freezing sketch after this list).
  2. The next thing we could try is fine-tuning the backbone further. The authors of the original VideoMAEv2 paper have provided some of the hyperparameters they used for fine-tuning.
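
For step 1, the freezing could look like the sketch below. The `backbone`/`cls_head` attribute names are assumptions about how the model is structured, and the printed count should match the (EMBED_DIM + 1) x NUM_CLASSES estimate, e.g. (384 + 1) x 5 = 1925 ≈ 2k for ViT-S and five classes:

```python
def freeze_backbone(model):
    """Freeze everything except the classification head (step 1 above).
    Assumes the model exposes `backbone` and `cls_head` attributes --
    adjust to whatever our framework actually calls them."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"trainable parameters: {trainable}")  # expect ~ (embed_dim + 1) * num_classes
```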
