Gradient accumulation yields worse results than the equivalent batch size #2175

Closed
benjamin-marie opened this issue Oct 4, 2024 · 27 comments · Fixed by huggingface/transformers#34198
Labels
⏳ needs more info Additional information or clarification is required to proceed ❓ question Seeking clarification or more information

Comments

@benjamin-marie

I expected a training configuration with per_device_train_batch_size=1 and gradient_accumulation_steps=32 to yield the same (or similar) result as per_device_train_batch_size=32 and gradient_accumulation_steps=1, but that's not the case: the former is much worse.
I ran several experiments with SmolLM-135M and Llama 3.2 1B, always using the same seed, and the results are consistent with this observation.

[learning-curve plots comparing the two configurations]

Maybe I misunderstand something here?

My training code is in this Colab notebook. I ran this notebook to draw the learning curves above, restarting it between training runs to avoid OOM.
Note that I observe the same behavior with Qwen2.
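
For reference, the equivalence being assumed here holds exactly in plain PyTorch whenever every micro-batch contributes the same number of loss terms. A minimal sketch (independent of TRL and the linked notebook; the tiny model and data are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 4)            # stand-in for the real language model
criterion = nn.CrossEntropyLoss()   # mean reduction
x, y = torch.randn(32, 16), torch.randint(0, 4, (32,))

# (a) one batch of 32
model.zero_grad()
criterion(model(x), y).backward()
full_grad = model.weight.grad.clone()

# (b) 32 micro-batches of 1, each loss scaled by 1/32 before backward
model.zero_grad()
for i in range(32):
    (criterion(model(x[i:i+1]), y[i:i+1]) / 32).backward()
accum_grad = model.weight.grad.clone()

# Should print True (up to float rounding): with equal per-micro-batch
# counts, gradient accumulation matches the large batch.
print(torch.allclose(full_grad, accum_grad, atol=1e-5))
```

As later comments explain, the discrepancy appears when micro-batches contain different numbers of non-ignored tokens, which breaks this equality.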

@mayank31398

Hey, this is expected behaviour.
FSDP-1 only allows accumulation in 16-bit precision.
This is not the case for FSDP-2, which allows accumulation in both 16-bit and 32-bit precision.

@mayank31398

Documentation for FSDP-1: [screenshot]
Documentation for FSDP-2: [screenshot]

@benjamin-marie
Author

Interesting, I didn't know this. But I don't think it matters here: I would be surprised if TRL used FSDP's reduce-scatter for single-GPU training.

@qgallouedec
Member

Hi, thanks for reporting this.
Can you share your system info and the code you use for training?

@qgallouedec qgallouedec added ❓ question Seeking clarification or more information ⏳ needs more info Additional information or clarification is required to proceed labels Oct 7, 2024
@benjamin-marie
Author

Sure, it's all in the notebook I linked to in my first post. I ran this notebook on Colab with an A100.

@teknium1

Someone tried fp32 and it didn't help, so that doesn't seem to be the reason:

https://x.com/bnjmn_marie/status/1842464802636980564

@vigneshwaran

Have you tried a full/mixed-precision AdamW optimiser?

@benjamin-marie
Author

Yes:

[screenshot of the training arguments]

This configuration uses fp32 and adamw_torch.

@fzyzcjy
Contributor

fzyzcjy commented Oct 15, 2024

Hi, are there any updates? Thanks!

@danielhanchen
Contributor

I'm writing up a report about this - I think I managed to fix it :)
(Yes, it is in fact a subtle bug!) I'll tweet and post about it in about 8-10 hours!

@shimmyshimmer

We have fixed the issue guys!

Tweet: https://twitter.com/UnslothAI/status/1846231235749990699
Blogpost: https://unsloth.ai/blog/gradient

@geronimi73

> We have fixed the issue guys!

nice! feel like fixing it in TRL too?

@shimmyshimmer

> We have fixed the issue guys!
>
> nice! feel like fixing it in TRL too?

The Hugging Face team is already on it! :)

@muellerzr
Contributor

muellerzr commented Oct 15, 2024

(Somewhat; currently trying to reverse-engineer a few of the ways you did it. You guys would be much faster at it, I imagine, if you want to beat us to it ;) As this is more than TRL, it's ground-up transformers/Trainer work, tbh I think.)

@danielhanchen
Contributor

:)
Wrote a detailed tweet about it: https://x.com/danielhanchen/status/1846235913443262891
Also Reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1g4ego7/llm_training_bug_fixes_gradient_accumulation_was/
Blog post: https://unsloth.ai/blog/gradient
Also @shimmyshimmer is my brother!! :)

@muellerzr
Contributor

Just as a fair warning, this will not be an immediate or quick fix, since essentially it means every single model's loss calculation is off when relying on output.loss, and every single model will need a custom variation of CrossEntropy (and the other affected loss functions) if you do not calculate the loss by hand.

We are working on figuring out the best solution.
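
For concreteness, a small numeric illustration (not the transformers code): when micro-batches contain different numbers of non-ignored tokens, averaging the per-step mean losses is not the same as taking the mean over the whole effective batch.

```python
import torch

# Per-token losses for two micro-batches, with ignored (-100) positions already dropped.
step1 = torch.tensor([2.0, 2.0, 2.0, 2.0])  # 4 valid tokens
step2 = torch.tensor([1.0])                 # 1 valid token

# What one big batch computes: the mean over all 5 valid tokens.
true_mean = torch.cat([step1, step2]).mean()                           # 1.8

# Naive gradient accumulation: mean per micro-batch, averaged over steps.
naive = (step1.mean() + step2.mean()) / 2                              # 1.5

# Corrected accumulation: sum per micro-batch, divide by total valid tokens.
fixed = (step1.sum() + step2.sum()) / (step1.numel() + step2.numel())  # 1.8

print(true_mean.item(), naive.item(), fixed.item())
```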

@nahidalam

@danielhanchen from the blog: "The 2nd theory was there is in fact a bug in the loss calculation, which we find to be the case." Is this bug specific to the CrossEntropy loss calculation in HF TRL? Would it not be an issue if someone is using, say, torch.nn.CrossEntropyLoss directly?

@huseinzol05

@muellerzr, I believe this only makes sense for padding-based batches. With packing there is no 0 / pad token in the batch, so the average cross-entropy is consistent.

@danielhanchen
Contributor

@nahidalam Unfortunately this is not an HF-only issue. The way gradient accumulation was originally implemented in many packages, even those that use PyTorch directly, accidentally missed accounting for ignored tokens. Using CE loss directly does not solve the issue, since mean reduction does not work across accumulation steps, and sum reduction will scale the loss incorrectly.

@huseinzol05 Packing is also affected, albeit less so; since some people also train only on completions, it will still make the loss incorrect.

@muellerzr If you guys need any help on anything, ping me!
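
A sketch of the recipe described above, under illustrative assumptions rather than the exact patch that later landed in transformers: compute each micro-batch's loss with sum reduction, then normalize by the total number of non-ignored tokens in the whole accumulation window.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # label value excluded from the loss (padding / prompt tokens)

def accumulation_backward(logits, labels, total_valid_tokens):
    """Backward one micro-batch: sum the per-token CE loss and divide by the
    valid-token count of the *entire* accumulation window, not this step's."""
    loss_sum = F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=IGNORE_INDEX,
        reduction="sum",
    )
    (loss_sum / total_valid_tokens).backward()

# Usage sketch (names are illustrative, label shifting omitted):
# total_valid_tokens = sum((lbl != IGNORE_INDEX).sum() for _, lbl in micro_batches)
# for inp, lbl in micro_batches:
#     accumulation_backward(model(inp).logits, lbl, total_valid_tokens)
# optimizer.step(); optimizer.zero_grad()
```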

@wongjingping

wongjingping commented Oct 16, 2024

Kudos @danielhanchen on the fix! Neat write-up as well! Back to the OP: I think the issue isn't with the trl library but with the transformers library, because of how SFTTrainer extends Trainer, how the loss is calculated in Trainer's compute_loss, and how it is naively scaled by the number of steps here. I don't have a ton of context, but I imagine the more principled solution would be to fix it within the Trainer.compute_loss function, rather than having SFTTrainer override the compute_loss method. Happy to assist with the transformers fix if anyone from HF would like to take me up on it 😄
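
Schematically, the naive scaling being referred to looks like this (a paraphrase of the pattern, not the actual Trainer source; names are illustrative):

```python
# Inside each accumulation micro-step (schematic):
loss = model(**inputs).loss                           # mean over this micro-batch's valid tokens
(loss / args.gradient_accumulation_steps).backward()  # weights every micro-batch equally,
                                                      # regardless of its valid-token count
```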

@qingjianbuyi

Does DDP have the same issue? @danielhanchen

@muellerzr
Contributor

muellerzr commented Oct 28, 2024

Yes, DDP does. We have already documented this, and a fix is being put in. (I also have an article talking about this in more detail; tl;dr, you can choose a slower option of gathering all of the inputs/counts, which adds a communication step that generally isn't recommended, so it's False by default.)
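
The "gather the counts" option amounts to something like the following sketch, assuming a plain torch.distributed setup (illustrative, not accelerate's actual implementation):

```python
import torch
import torch.distributed as dist

IGNORE_INDEX = -100

def global_valid_token_count(labels):
    """All-reduce the non-ignored token count so every rank normalizes its
    summed loss by the same global denominator (one extra comm per step)."""
    count = (labels != IGNORE_INDEX).sum()
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return count

# On each rank, with loss_sum computed using reduction="sum":
# DDP averages gradients across ranks, so scale back up by world_size:
# (loss_sum * dist.get_world_size() / global_valid_token_count(labels)).backward()
```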

@burtenshaw
Contributor

Should this be closed since it's fixed in transformers?

cc @qgallouedec @lewtun

@qgallouedec
Member

qgallouedec commented Nov 25, 2024

Right @burtenshaw.
Closed by huggingface/transformers#34198

@pminervini

[screenshot showing the issue still marked as open]

Time to hit that "Close Issue" button @qgallouedec @burtenshaw! :) I thought the issue was still open because the bug hadn't been fixed!

@qgallouedec
Member

Oops

@surprisedPikachu007

> @huseinzol05 Packing is also affected, albeit less so; since some people also train only on completions, it will still make the loss incorrect.

For a language modeling task, will this be a problem even if all samples in a batch have exactly the same sequence length?
