[Llava-video] Wrong data augmentation for video data #318

HYUNJS · 2024-10-21T10:54:18Z

During the training of LLaVA-Video, I observed some inconsistencies in how video data augmentation is handled.

Typically, standard video data augmentation first resizes the frames while preserving the original aspect ratio, then applies random cropping during training and center-cropping (or multi-cropping) during testing.
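For reference, here is a minimal sketch of the geometry behind that standard recipe. It only computes sizes and crop boxes and does no pixel work. The helper names `resize_shorter_side` and `crop_box` are hypothetical and not part of the LLaVA-Video codebase, and I am assuming that 720x1280 means height 720 and width 1280.

```python
import random


def resize_shorter_side(height, width, target):
    """Return (new_h, new_w) so that the shorter side equals `target`
    while the aspect ratio is preserved (up to rounding)."""
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)


def crop_box(height, width, crop, train, rng=random):
    """Return (top, left, bottom, right) of a crop x crop window.

    Random position when train=True, centered otherwise.
    """
    if crop > height or crop > width:
        raise ValueError("crop size exceeds the frame size")
    if train:
        top = rng.randint(0, height - crop)   # randint is inclusive on both ends
        left = rng.randint(0, width - crop)
    else:
        top = (height - crop) // 2
        left = (width - crop) // 2
    return top, left, top + crop, left + crop


# Assumed example: a 720 (H) x 1280 (W) frame and a 384 x 384 target.
new_h, new_w = resize_shorter_side(720, 1280, 384)   # (384, 683)
box = crop_box(new_h, new_w, 384, train=False)       # (0, 149, 384, 533)
```

For video, the box would normally be sampled once per clip and reused for every frame, so that the crop stays temporally consistent.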

However, the current LLaVA-Video code deviates from this approach. Instead of preserving the aspect ratio, it resizes videos directly from their original resolution (720x1280) to a fixed square shape (384x384), following the image-processing logic defined by the SigLIP image processor. This results in distorted aspect ratios.
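To put a number on the distortion, the sketch below compares the horizontal and vertical scale factors of a direct resize. `aspect_distortion` is a hypothetical helper written for illustration, and it again assumes 720x1280 means height 720 and width 1280.

```python
def aspect_distortion(height, width, target_h, target_w):
    """Ratio of horizontal to vertical scale factors for a direct resize.

    1.0 means the aspect ratio is preserved; values below 1.0 mean the
    content is squeezed horizontally relative to its height.
    """
    scale_h = target_h / height
    scale_w = target_w / width
    return scale_w / scale_h


# Direct resize of a 720 x 1280 frame to 384 x 384 (assumed H x W order).
ratio = aspect_distortion(720, 1280, 384, 384)
print(f"relative horizontal squeeze: {ratio:.4f}")
```

Under that assumption the ratio comes out to about 0.56, meaning objects end up roughly 56% as wide, relative to their height, as they were in the source frame.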

SigLIP image preprocessor:

transforms = [
    convert_to_rgb,
    to_numpy_array,
    partial(resize, size=self.size, resample=self.resample, data_format=self.data_format),
    partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
    partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
    partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)

In contrast, image data is preprocessed correctly: the code applies whichever augmentation variant the config specifies.
Image preprocessing code

Am I correctly understanding your codebase? Please let me know if I have misunderstood anything.
