Reproduce the pre-training tasks of Video-LLaMAv2, but the video dimensions are misaligned. #141

CauchyFanUpdate · 2024-12-27T16:02:46Z

I am attempting to reproduce the pre-training tasks of Video-LLaMAv2. I have already downloaded the Valley and LLaVA-image datasets and started experimenting with pre-training. However, I noticed that the video tensor produced by LazySupervisedDataset and DataCollatorForSupervisedDataset has shape (16, 3, 336, 336), while in the forward method of VideoLLaMA2MistralForCausalLM it has shape (2, 3, 336, 336), even though I have not modified the code. I couldn't find where the change occurs or understand the logic behind it. Could you help me resolve this issue?
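To narrow down where the frame dimension changes, the shapes can be logged at each boundary between the collator and the model. The helper below is a minimal sketch, not part of the Video-LLaMA2 codebase: `describe` and `trace_shapes` are hypothetical names, and it only assumes that tensors expose a `.shape` attribute.

```python
import functools


def describe(obj):
    """Summarize tensor-like shapes nested inside dicts, lists, and tuples."""
    if hasattr(obj, "shape"):
        return tuple(obj.shape)
    if isinstance(obj, dict):
        return {key: describe(value) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [describe(value) for value in obj]
    # Non-tensor leaves (strings, ints, None) are reported by type name only.
    return type(obj).__name__


def trace_shapes(label, log=print):
    """Decorator (hypothetical helper) that logs input shapes before each call."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            log(f"[{label}] args={describe(args)} kwargs={describe(kwargs)}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator
```

One way to use it, assuming the classes are importable in the training script, is to wrap `DataCollatorForSupervisedDataset.__call__` and `VideoLLaMA2MistralForCausalLM.forward` with `trace_shapes(...)` before training starts and compare the logged shapes. If the frame count seen in `forward` equals 16 divided by the number of visible GPUs, one thing worth checking (an assumption, not confirmed in this thread) is whether the script runs as a single process with several GPUs visible, since `torch.nn.DataParallel` splits tensor inputs along dimension 0.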

CauchyFanUpdate · 2024-12-27T16:06:47Z

By the way, my training parameters in VSCode are:
"--model_type", "videollama2_mistral",
"--model_path", "checkpoints/Mistral-7B-Instruct-v0.2",
"--vision_tower", "checkpoints/clip-vit-large-patch14-336",
"--mm_projector_type", "stc_connector_v35",
"--tune_mm_mlp_adapter", "True",
"--data_path", "datasets/videollava_pt/valley_llavaimage.json",
"--data_folder", "datasets/videollava_pt/",
"--mm_vision_select_layer", "-2",
"--num_frames", "16",
"--bf16", "True",
"--tf32", "True",
"--fp16", "False",
"--output_dir", "output/",
"--num_train_epochs", "1",
"--per_device_train_batch_size", "1",
"--per_device_eval_batch_size", "4",
"--gradient_accumulation_steps", "1",
"--evaluation_strategy", "no",
"--save_strategy", "steps",
"--save_steps", "500",
"--save_total_limit", "99",
"--learning_rate", "1e-3",
"--warmup_ratio", "0.03",
"--weight_decay", "0.",
"--lr_scheduler_type", "cosine",
"--logging_steps", "1",
"--model_max_length", "2048",
"--gradient_checkpointing", "True",
"--dataloader_num_workers", "4",
"--report_to", "tensorboard",
"--run_name", "mistral_7b_16f",

Meanwhile, the parameters in my preprocessor_config.json for clip-vit-large-patch14-336 are as follows:
{
  "crop_size": 336,
  "do_center_crop": true,
  "do_normalize": true,
  "do_resize": true,
  "feature_extractor_type": "CLIPFeatureExtractor",
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "resample": 3,
  "size": 336
}

CauchyFanUpdate · 2024-12-30T02:38:51Z

@BenoitHanotte @hill2hill @lixin4ever @hangzhang-nlp Could you help me with this issue? It's very important to me, and I’ve already spent about two weeks on it. I would greatly appreciate your assistance in resolving it.

CauchyFanUpdate · 2025-01-02T11:54:56Z

@BenoitHanotte @hill2hill @lixin4ever @hangzhang-nlp I hope you can take the time to help address my concerns. The purpose of our platform is to solve users' problems, not merely to serve as a display tool without proper follow-up management. Such a situation might raise doubts about the impact of the method.

clownrat6 · 2025-01-03T08:02:53Z

Sorry for the late reply 🙏🙏. Could you please fork our codebase and commit your modifications? I tried the provided commands, but I couldn't reproduce the bug.
