
Text-to-Video Results Degrade After Fine-Tuning on the OpenVid HD Subset (~0.4M Items) #203

@levon-khachatryan opened this issue Nov 27, 2024 · 2 comments

First of all, congratulations on your fantastic work, and thank you for open-sourcing it!

I encountered an issue while fine-tuning the 384p miniFLUX model on the OpenVid HD subset (~0.4M items). After 40K steps, I noticed a degradation in the quality of the generated text-to-video results.

Details of my setup:

  • Dataset: The prompts for the OpenVid HD subset were generated using VILA.
  • Training Script: I used train_pyramid_flow.sh without making any modifications.
  • Model: 384p miniFLUX.

Issue:
Attached, you can find a comparison of the original 384p miniFLUX (on the left) and the fine-tuned version (on the right). The prompt used was:
"A fat rabbit wearing a purple robe walking through a fantasy landscape."
Is this degradation expected, or could there be an issue with the fine-tuning process? I would greatly appreciate any insights or recommendations for debugging and improving the results.

1.mp4

Thank you in advance for your time and support!

Best regards,
Levon

@jy0205 (Owner) commented Dec 2, 2024

Hi! How many GPUs did you use for your fine-tuning? Maybe you can try a lower learning rate, e.g., 1e-5?
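For reference, a minimal sketch of what building the optimizer with that lower rate could look like; the AdamW choice, weight decay, and linear warmup below are assumptions for illustration, not necessarily what train_pyramid_flow.sh actually configures:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-5,
                             warmup_steps: int = 1000):
    """AdamW with a short linear warmup; a common choice when fine-tuning."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    # Ramp the learning rate from 0 to `lr` over `warmup_steps` steps to
    # avoid an early loss spike when resuming from the released checkpoint.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / max(1, warmup_steps))
    )
    return optimizer, scheduler
```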

@levon-khachatryan (Author) commented

Hello,

Thank you for your response. I am currently using 8 GPUs (8x A100) with a learning rate of 1e-5.

One observation I made is that the current training code lacks mixed training with both image and video data. The paper mentions that image data is used at a proportion of 12.5% in each batch during training, but the published code relies solely on video data. Since we ran train_pyramid_flow.sh unmodified, our fine-tuning was video-only, and this discrepancy might be contributing to the quality degradation we're observing.
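For illustration, here is a minimal sketch of how such a mix could be approximated at the step level, assuming separate image and video DataLoaders (`image_loader` and `video_loader` are hypothetical names, not from the released code). It draws an image batch on roughly 12.5% of steps rather than mixing images into every batch, which is a simplification of what the paper describes:

```python
import random

IMAGE_RATIO = 0.125  # image proportion reported in the paper

def endless(loader):
    """Restart a DataLoader whenever its epoch ends."""
    while True:
        for batch in loader:
            yield batch

def mixed_batches(image_loader, video_loader, num_steps, seed=0):
    """Yield (is_image, batch): an image batch on ~12.5% of steps, video otherwise."""
    rng = random.Random(seed)
    image_iter, video_iter = endless(image_loader), endless(video_loader)
    for _ in range(num_steps):
        if rng.random() < IMAGE_RATIO:
            yield True, next(image_iter)
        else:
            yield False, next(video_iter)
```

In the training loop, the image batches would presumably be handled as single-frame clips, but that part depends on details of the released code that I have not verified.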

Please let me know if you require any additional details or have recommendations on how to proceed.

Best regards,
Levon
