During the training of LLaVA-Video, I observed some inconsistencies in how video data augmentation is handled.
Standard video data augmentation typically applies random cropping during training and center cropping (or multi-cropping) during testing, in both cases after resizing the frames while preserving the original aspect ratio.
However, the current LLaVA-Video code deviates from this approach. Instead of preserving the aspect ratio, it resizes videos directly from their original resolution (720x1280) to a fixed square shape (384x384), following the image-processing logic defined by the SigLIP image processor. This results in distorted aspect ratios.
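To make the difference concrete, here is a minimal sketch of the geometry involved. It is not taken from the LLaVA-Video codebase: all function names are hypothetical, and it assumes that "720x1280" means a height of 720 and a width of 1280. It only computes sizes and crop boxes, so it can be checked without any image library.

```python
import random


def resize_keep_aspect(height, width, short_side):
    """Return (new_h, new_w) with the shorter side scaled to short_side."""
    scale = short_side / min(height, width)
    return round(height * scale), round(width * scale)


def center_crop_box(height, width, crop):
    """Return (top, left, bottom, right) of a centered crop x crop window."""
    top = (height - crop) // 2
    left = (width - crop) // 2
    return top, left, top + crop, left + crop


def random_crop_box(height, width, crop, rng=random):
    """Return (top, left, bottom, right) of a random crop x crop window."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    return top, left, top + crop, left + crop


def squash_ratio(height, width, target):
    """Horizontal scale divided by vertical scale for a direct resize
    to target x target. A value of 1.0 means no distortion."""
    return (target / width) / (target / height)


# Assumed input: height 720, width 1280, target side 384.
resized = resize_keep_aspect(720, 1280, 384)   # aspect-preserving resize
test_box = center_crop_box(*resized, 384)      # test-time crop
train_box = random_crop_box(*resized, 384)     # train-time crop
distortion = squash_ratio(720, 1280, 384)      # direct 384x384 resize
```

Under these assumptions, the aspect-preserving path keeps object proportions and discards part of the width, whereas the direct resize keeps the full frame but compresses it horizontally to about 56% of its vertical scale.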
Image preprocessing, on the other hand, appears to be handled properly: it applies whichever augmentation variant is specified in the config.
Image preprocessing code
Am I understanding your codebase correctly? Please let me know if I have misunderstood anything.