Instruction about training Open-Flamingo from scratch #129
By the way, I noticed that the README provides a Training section. However, it lacks a lot of details :)
Thank you for your interest @HenryHZY. Can you please let me know what details (in addition to loss) would be most helpful and I will be sure to add them. Currently training from scratch is not possible as MMC4 is not public yet (but will be very soon).
@anas-awadalla Thanks for your quick reply. Taking your running command as an example, how can I change it to train only on LAION-2B, starting from a pre-trained OPT-1.3B?
By the way, I would like to ask about the contribution of MMC4 to training. Have you conducted an ablation study comparing MMC4 + LAION-2B against LAION-2B only? Thank you very much for your time and consideration!
Got it. This is currently not an option but definitely should be! I will open an issue (feel free to contribute, or if not I can do this next week). As for your second point, we have not done these experiments, but I agree that they would be very useful data points.
Thank you for the wonderful code release, and I have a question about training Flamingo 9B:
Yes, it is using #137, and I successfully trained Flamingo 3B (not 9B) with this code.
@Soonhwan-Kwon The issue here is that you are adding a cross-attention layer after every layer in LLaMA 7B. I am not sure what the total number of parameters is with this setup, but it is way larger than 9B. You should set a larger cross-attention interval so that a gated cross-attention block is only inserted every few decoder layers.
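For intuition on why a cross-attention block after every LLaMA-7B layer overshoots 9B parameters, here is a rough back-of-the-envelope sketch. The dimensions and the per-block breakdown are my assumptions (hidden size 4096, visual width 1024, 4x feed-forward expansion), not numbers taken from the repo; the interval itself is what the `cross_attn_every_n_layers` argument controls.

```python
# Back-of-the-envelope estimate (assumed dimensions, not repo measurements) of
# the parameters added by gated cross-attention blocks at different intervals.
def gated_xattn_block_params(hidden=4096, visual=1024, ffw_mult=4):
    # cross-attention: q/out projections on the language side, k/v from vision
    attn = 2 * hidden * hidden + 2 * visual * hidden
    # feed-forward: two linear layers with a ffw_mult expansion
    ffw = 2 * hidden * (ffw_mult * hidden)
    return attn + ffw

n_decoder_layers = 32  # LLaMA-7B
for every_n in (1, 4):
    added = (n_decoder_layers // every_n) * gated_xattn_block_params()
    print(f"cross_attn_every_n_layers={every_n}: ~{added / 1e9:.1f}B extra params")
```

Under these assumptions, every-layer insertion adds roughly 5-6B parameters on top of the 7B language model, while inserting a block every fourth layer lands near the released 9B scale.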
Thank you for the quick reply! You saved my day. Thank you!
@Soonhwan-Kwon @anas-awadalla Thanks for your great reply!! I will try it later :)
Hi @anas-awadalla, thanks for the great repo. I'm trying to reproduce OpenFlamingo results using mpt-1b-redpajama-200b with a single 40GB A100 node. Even though the results on VQA tasks are similar to what is reported, the COCO CIDEr numbers are much worse. In the recently released paper, it was mentioned that 8 A100 nodes were used for training. So I'm wondering: have you done any experiments on how long I need to train to match the performance of the 8-node setup? Do I have to train long enough to see 5M MMC4 and 10M LAION samples? Have you seen any influence of the effective batch size on the final metrics when training on multiple nodes versus a single GPU?
Hello @itzsid! For all the models we released, we trained on 120M samples from LAION and 60M from mmc4. How many samples have you trained your version on? What is the performance on COCO for you? For our version of OpenFlamingo 3B we used effective batch sizes of 1152 and 2304 for mmc4 and LAION respectively, with 1875 warmup steps. However, you can use much lower batch sizes and still get similar performance; you should just scale the warmup steps accordingly.
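One way to read "scale the warmup steps accordingly" (my interpretation, not an official recipe) is to hold the number of warmup samples roughly constant, so a smaller effective batch size gets proportionally more warmup steps. A quick sketch using the mmc4 numbers quoted above (effective batch size 1152, 1875 warmup steps) as the reference point:

```python
# Hedged sketch: keep the warmup *sample* count constant when shrinking the
# effective batch size. Reference numbers come from the comment above
# (mmc4: batch 1152, 1875 warmup steps); the proportional rule is my assumption.
REF_BATCH, REF_WARMUP_STEPS = 1152, 1875
warmup_samples = REF_BATCH * REF_WARMUP_STEPS  # 2,160,000 samples

for batch in (1152, 288, 96):
    steps = warmup_samples // batch
    print(f"effective batch {batch:4d} -> ~{steps} warmup steps")
```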
@anas-awadalla I trained for approximately 10M samples. Zero-shot COCO CIDEr is 36.55 for me vs. 75.9 using the released model. I think one of the issues is that the loss curve I get for LAION does not exactly match the Figure 5 results in the paper. My LAION loss curve looks like this:
We apply smoothing to the loss curves in the paper, so these loss plots look fine to me! Is that 10M samples of LAION and 5M samples of MMC4, then? If so, it seems like your training run is on track. Here is how 0-shot COCO improves during the training of our mpt-1b-redpajama-200b-dolly model (data scale vs. CIDEr score*):

*Note that these are validation scores, so the numbers will look a little different from what we report in the paper.
Thanks @anas-awadalla. This is super helpful. I'll train the models longer and check the performance after 10M mmc4 + 20M laion.
@anas-awadalla I get values similar to the above after going through 150M samples. Thanks for the help! Next, I'm trying to train a larger model with MPT-7B (anas-awadalla/mpt-7b). I'm wondering how much you reduced the batch size to fit in memory? I'm using a 40GB A100. Also, I use amp_bf16 as suggested in the paper. These are the current args for the 7B model:
Great! We used DDP with 80GB A100s for the 9B model. You should be able to train with higher batch sizes on the 40GB ones using our FSDP implementation. You can add the flags `--fsdp`, `--fsdp_use_orig_params`, and `--fsdp_sharding_strategy="hybrid"` to the train script to do so.
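As a rough illustration (a minimal sketch, not the repo's actual wrapping code) of what those three flags correspond to in PyTorch >= 2.0: hybrid sharding shards parameters within a node and replicates across nodes, and `use_orig_params` exposes the original parameters to the optimizer. `build_model()` below is a hypothetical placeholder for constructing the Flamingo module; launching with torchrun is assumed.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# assumed: launched with torchrun, so rank/world-size env vars are set
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_model()  # hypothetical helper returning the Flamingo nn.Module

fsdp_model = FSDP(
    model,
    # --fsdp_sharding_strategy="hybrid": shard within a node, replicate across nodes
    sharding_strategy=ShardingStrategy.HYBRID_SHARD,
    # --fsdp_use_orig_params: keep original parameter handles for the optimizer
    use_orig_params=True,
    device_id=torch.cuda.current_device(),
)
```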
@anas-awadalla Using the FSDP args mentioned above with MPT-7B, I get this error:
Any ideas?
Thanks @anas-awadalla. Similar to the LAION forward pass, I added these lines, which made it work:
However, this issue only shows up when FSDP is enabled. Additionally, I had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L271-L276):
otherwise I get the error:
I also had to comment out (https://github.com/mlfoundations/open_flamingo/blob/main/open_flamingo/src/flamingo.py#L299):
Otherwise, it threw an error:
Did you have these issues on your end too? I'm using
Hmm, no, we don't run into these. Just to confirm, you are using torch 2.0.1?
Yes, my torch version is 2.0.1+cu117. Do you have a docker container as well with all dependencies? I can try running it inside the container.
@anas-awadalla I started training MPT-7B on the 80GB nodes. However, I see VQA numbers going down as the number of samples seen increases. Did you see something similar? Here is a plot of OK-VQA numbers after 100M LAION+MMC4 samples. I used these args for training the 9B model:
This is how downstream validation performance changes for COCO and VQAv2 for the 9B model. Our experience with VQA performance is that it stays relatively constant apart from an initial increase during training. We do see the behavior you are reporting if the mmc4 image-text similarity threshold is too high (we use 0.24). What value are you using for that? Also, just checking: are you using the full mmc4, so that data repetition is not an issue?
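To make the threshold concrete: in mmc4, each image is matched to a sentence with a CLIP similarity score, and the training pipeline drops pairs below the threshold, so a higher cutoff keeps fewer, more literally aligned pairs. The snippet below is illustrative only; the field names (`image_info`, `matched_sim`) reflect my understanding of the public mmc4 metadata and should be checked against the actual shards.

```python
# Illustrative sketch of applying an image-text similarity threshold (e.g. 0.24)
# to one interleaved mmc4 document. Field names are assumptions, not verified.
def keep_images(doc, sim_threshold=0.24):
    kept = []
    for img in doc.get("image_info", []):
        # each image carries the CLIP similarity to its best-matched sentence
        if img.get("matched_sim", 0.0) >= sim_threshold:
            kept.append(img)
    return kept
```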
Thanks @anas-awadalla for the table. This is quite helpful. In my case, I do use
Ah, OK, that could be the reason, because we do use the full set. Especially since you do hit ~37, and assuming this is zero-shot, this would match what we got. How are the COCO and VQAv2 scores? Are they also degrading?
Thanks, that makes sense. COCO numbers are stable around 60. I didn't measure VQAv2 numbers, but VizWiz VQA, TextVQA, and OK-VQA are degrading.
Hi @anas-awadalla
As described in #124, "Our training took place on 32 80GB A100s. We trained on 5M samples from MMC4 and 10M from LAION 2B."
I am interested in the details of the loss during training, and, if possible, I would like to extend this work to other research fields. Could you please provide instructions for training Open-Flamingo from scratch? It would be of great help to my research.
Thank you very much for your great project!