Description:
I'm fine-tuning a VideoMAE model for binary classification on home camera footage to distinguish between two actions. Here’s a summary of my setup and the issues I’m facing:
Dataset & Variations:
I have two primary datasets:
Small Dataset: ~120 clips for quicker iteration.
Full Dataset: ~3k clips.
All videos are 6 seconds long, though I've also tested with 3-second clips.
I've also created variations with blurred or blacked-out backgrounds to help with recognition.
Model & Configuration:
The model classifies actions using 16 uniformly sampled frames per video.
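For reference, uniform sampling of 16 frames can be sketched as follows (a minimal illustration assuming the clip is described by its total frame count; the notebook's actual sampler may differ):

```python
def uniform_frame_indices(total_frames: int, num_samples: int = 16) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip."""
    if total_frames <= 0:
        raise ValueError("clip has no frames")
    step = total_frames / num_samples
    # Take the midpoint of each of num_samples equal segments,
    # clamped to the last valid index for short clips.
    return [min(int(step * i + step / 2), total_frames - 1) for i in range(num_samples)]

# A 6-second clip at 30 fps has 180 frames.
print(uniform_frame_indices(180))
```

For a 3-second clip the same call simply spaces the 16 indices more densely, so the model sees the same temporal coverage either way.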
I’ve tried several VideoMAE checkpoints (small, base, and large), including variants fine-tuned on Something-Something V2 (SSV2) and Kinetics.
Hyperparameters tested:
Batch sizes of 2, 4, and 8.
Epochs ranging from 4 to 16.
Learning rate set to 5e-5.
I removed the RandomCrop transformation since it often crops the person entirely out of the frame.
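With RandomCrop ruled out, subject-preserving augmentations such as horizontal flips and mild brightness jitter are a common substitute. A minimal NumPy sketch (illustrative only; clips are assumed to be uint8 arrays of shape (T, H, W, C), and the function name is not from the notebook):

```python
import numpy as np

def augment_clip(clip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply subject-preserving augmentations to a (T, H, W, C) uint8 clip."""
    out = clip
    # Horizontal flip with probability 0.5 (keeps the person fully in frame).
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]
    # Mild brightness jitter: scale all pixels by a factor in [0.8, 1.2].
    scale = rng.uniform(0.8, 1.2)
    out = np.clip(out.astype(np.float32) * scale, 0, 255).astype(np.uint8)
    return out

rng = np.random.default_rng(0)
clip = rng.integers(0, 256, size=(16, 224, 224, 3), dtype=np.uint8)
aug = augment_clip(clip, rng)
print(aug.shape)  # same shape as the input
```

Both transforms are applied identically across all frames of a clip, which preserves temporal consistency; per-frame randomization would add flicker the model never sees at inference time.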
I'm using the Hugging Face Video Classification Colab Notebook as a starting point: Training Notebook.
Problem: Despite these variations, the model overfits immediately. I’ve also tested using the UCF101 dataset to rule out dataset-specific issues and got similar results to the Hugging Face VideoMAE colab, so the code seems fine.
Request: Any advice on addressing this overfitting issue would be greatly appreciated. Specifically, I'm looking for guidance on:
Additional hyperparameter adjustments.
Potential model architecture changes (if applicable).
Dataset augmentation techniques that might improve generalization.
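One low-cost lever I can try alongside augmentation is early stopping on validation loss, since with ~120 clips the model can memorize the training set within a few epochs. A minimal tracker sketch in pure Python (the Hugging Face Trainer also provides an `EarlyStoppingCallback` for the same purpose; the class below is just an illustration):

```python
class EarlyStopper:
    """Stop training when validation loss hasn't improved for `patience` epochs."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.75]):
    if stopper.step(loss):
        print(f"stopping after epoch {epoch}")
        break
```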
Thank you for any help or insights you can provide!