
Text-to-Video Results Degrade After Fine-Tuning on the OpenVid HD Subset (~0.4M Items) #203

@levon-khachatryan opened this issue Nov 27, 2024 · 2 comments

First of all, congratulations on your fantastic work, and thank you for open-sourcing it!

I encountered an issue while fine-tuning the 384p miniFLUX model on the OpenVid HD subset (~0.4M items). After 40K steps, I noticed a degradation in the quality of the generated text-to-video results.

Details of my setup:

  • Dataset: The prompts for the OpenVid HD subset were generated using VILA.
  • Training Script: I used train_pyramid_flow.sh without making any modifications.
  • Model: 384p miniFLUX.

Issue:
Attached, you can find a comparison of the original 384p miniFLUX (on the left) and the fine-tuned version (on the right). The prompt used was:
"A fat rabbit wearing a purple robe walking through a fantasy landscape."
Is this degradation expected, or could there be an issue with the fine-tuning process? I would greatly appreciate any insights or recommendations for debugging and improving the results.

1.mp4

Thank you in advance for your time and support!

Best regards,
Levon

@jy0205 (Owner) commented Dec 2, 2024

Hi! How many GPUs did you use for your fine-tuning? Maybe you can try a lower learning rate, e.g., 1e-5?
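For reference, a minimal sketch of what building the optimizer with that lower rate could look like; the AdamW choice, weight decay, and linear warmup below are assumptions for illustration, not necessarily what train_pyramid_flow.sh actually configures:

```python
import torch

def build_finetune_optimizer(model: torch.nn.Module, lr: float = 1e-5,
                             warmup_steps: int = 1000):
    """AdamW with a short linear warmup; a common choice when fine-tuning."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-2)
    # Ramp the learning rate from 0 to `lr` over `warmup_steps` steps to
    # avoid an early loss spike when resuming from the released checkpoint.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lambda step: min(1.0, (step + 1) / max(1, warmup_steps))
    )
    return optimizer, scheduler
```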

@levon-khachatryan (Author) commented

Hello,

Thank you for your response. I am currently using 8 GPUs (8x A100) with a learning rate of 1e-5.

One observation I made is that the current training code lacks mixed training with both image and video data. The paper mentions that image data is used at a proportion of 12.5% in each batch during training, but the published code relies solely on video data. Since we ran train_pyramid_flow.sh unmodified, our fine-tuning was video-only, and this discrepancy might be contributing to the quality degradation we're observing.
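For illustration, here is a minimal sketch of how such a mix could be approximated at the step level, assuming separate image and video DataLoaders (`image_loader` and `video_loader` are hypothetical names, not from the released code). It draws an image batch on roughly 12.5% of steps rather than mixing images into every batch, which is a simplification of what the paper describes:

```python
import random

IMAGE_RATIO = 0.125  # image proportion reported in the paper

def endless(loader):
    """Restart a DataLoader whenever its epoch ends."""
    while True:
        for batch in loader:
            yield batch

def mixed_batches(image_loader, video_loader, num_steps, seed=0):
    """Yield (is_image, batch): an image batch on ~12.5% of steps, video otherwise."""
    rng = random.Random(seed)
    image_iter, video_iter = endless(image_loader), endless(video_loader)
    for _ in range(num_steps):
        if rng.random() < IMAGE_RATIO:
            yield True, next(image_iter)
        else:
            yield False, next(video_iter)
```

In the training loop, the image batches would presumably be handled as single-frame clips, but that part depends on details of the released code that I have not verified.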

Please let me know if you require any additional details or have recommendations on how to proceed.

Best regards,
Levon
