Gradient accumulation yields worse results than the equivalent batch size #2175
Comments
Hey, this is expected behaviour.
Interesting, I didn't know this. But I don't think it matters; I would be surprised if TRL used FSDP's reduce-scatter for single-GPU training.
Hi, thanks for reporting this.
Sure, it's all in the notebook I linked to in my first post. I ran this notebook on Colab with an A100.
Someone tried in fp32 and it didn't help, so that doesn't seem to be the reason.
Have you tried a full/mixed precision AdamW optimiser?
Hi, are there any updates? Thanks!
I'm writing up a report about this - I think I managed to fix it :)
We have fixed the issue, guys! Tweet: https://twitter.com/UnslothAI/status/1846231235749990699
Nice! Feel like fixing it in TRL too?
The Hugging Face team is already on it! :)
(Somewhat - I'm currently trying to reverse engineer a few of the ways you did it. You guys would be much faster at it, I imagine, if you want to beat us to it ;) This is more than TRL; I think it's ground-up transformers/Trainer work, tbh.)
:)
Just as a fair warning, this will not be an immediate nor a quick fix, since essentially it means every single model's calculation is off when doing gradient accumulation. We are working on figuring out the best solution.
@danielhanchen from the blog
@muellerzr, I believe this only makes sense for padding-based batches; for packing, there is no 0 / pad token in the batch, so the average cross entropy is consistent.
@nahidalam Unfortunately this is not an HF-native issue. The way gradient accumulation has originally been done in many packages, even those that use PyTorch directly, accidentally missed accounting for ignored tokens. Using CE loss directly does not solve the issue, since mean reduction does not work and sum will cause the loss to be scaled incorrectly. @huseinzol05 Packing is also affected, albeit less so, since some people also do training on completions, so it will also make the loss incorrect. @muellerzr If you guys need any help with anything, ping me!
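To make the arithmetic above concrete, here is a minimal PyTorch sketch (not the actual transformers/TRL code; the tensors and token counts are made up) of how per-micro-batch mean reduction drifts from the full-batch mean when micro-batches contain different numbers of non-ignored tokens, and how dividing summed losses by the global token count recovers the full-batch value:

```python
import torch
import torch.nn.functional as F

# Toy logits/labels for one "full" batch split into two micro-batches with
# different numbers of non-ignored tokens (-100 = ignored, as in HF labels).
vocab = 8
torch.manual_seed(0)
logits = torch.randn(6, vocab)                    # 6 tokens total
labels = torch.tensor([1, 2, 3, -100, -100, 5])   # 4 valid tokens

# Full-batch loss: mean over ALL non-ignored tokens.
full = F.cross_entropy(logits, labels, ignore_index=-100)

# Naive gradient accumulation: mean per micro-batch, then average the means.
mb1 = F.cross_entropy(logits[:3], labels[:3], ignore_index=-100)  # 3 valid tokens
mb2 = F.cross_entropy(logits[3:], labels[3:], ignore_index=-100)  # 1 valid token
naive = (mb1 + mb2) / 2

# Corrected accumulation: sum per micro-batch, divide by the global valid-token count.
n_valid = (labels != -100).sum()
fixed = (
    F.cross_entropy(logits[:3], labels[:3], ignore_index=-100, reduction="sum")
    + F.cross_entropy(logits[3:], labels[3:], ignore_index=-100, reduction="sum")
) / n_valid

print(full.item(), naive.item(), fixed.item())  # naive != full, fixed == full
```

When every micro-batch happens to contain the same number of valid tokens (e.g. pure packing with no ignored labels), the naive average coincides with the full-batch mean, which is consistent with the packing discussion above.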
Kudos @danielhanchen on the fix! Neat write-up as well! Back to the OP, I think the issue isn't with the …
Does DDP have the same issue? @danielhanchen
Yes, DDP does; we have already documented this, and a fix is being put in. (I also have an article talking about this in more detail. tl;dr: you can choose a slower option of gathering all of the inputs/counts, which adds a communication step that generally isn't recommended, so it's False by default.)
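As an illustration of the kind of cross-rank gather described above (a sketch only; `global_num_items` is a hypothetical helper, not the actual Trainer code), each rank can count its own non-ignored tokens and all-reduce the counts so every rank divides its summed loss by the same global denominator:

```python
import torch
import torch.distributed as dist

def global_num_items(labels: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Count non-ignored tokens across all DDP ranks.

    This is the extra communication mentioned above: each rank counts its own
    valid tokens, then an all_reduce sums the counts so the summed loss on
    every rank is normalized by the same global token count. Assumes the
    default process group has been initialized when running distributed.
    """
    count = (labels != ignore_index).sum()
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return count
```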
Should this be closed since it's fixed in transformers?
Right, @burtenshaw.
Time to hit that "Close Issue" button, @qgallouedec @burtenshaw! :) I thought the issue was open because of that!
Oops
For a language modeling task, will this be a problem even if all samples in a batch have the exact same sequence length?
I expected a training configuration with per_device_train_batch_size=1 and gradient_accumulation_steps=32 to yield the same (or a similar) result as per_device_train_batch_size=32 and gradient_accumulation_steps=1, but that's not the case: the former is much worse.
I ran several experiments with SmolLM-135M and Llama 3.2 1B, always using the same seed, and the results are consistent with this observation.
Maybe I'm misunderstanding something here?
My training code is in this Colab notebook. I ran this notebook to draw the learning curves above, restarting the notebook between each training run to avoid OOM.
Note that I have the same observations with Qwen2.
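For reference, a minimal sketch of the two configurations being compared, using transformers.TrainingArguments (the output directories and seed value here are placeholders; the linked notebook sets additional hyperparameters that are held fixed between the two runs):

```python
from transformers import TrainingArguments

# Run A: gradient accumulation, effective batch size 1 x 32 = 32.
args_accumulated = TrainingArguments(
    output_dir="run-accumulated",        # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    seed=42,                             # same seed for both runs
)

# Run B: real batch size 32, no accumulation.
args_full_batch = TrainingArguments(
    output_dir="run-full-batch",         # placeholder
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    seed=42,
)
```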