Fix GA loss bugs and add unit test #35121
Conversation
Thanks! These solutions make sense, and I ran the tests myself. Left some nits to allow less leeway in the test closeness. cc @ArthurZucker
tests/trainer/test_trainer.py
Outdated
diff_broken = [abs(base - grad) for base, grad in zip(base_loss_callback.losses, broken_loss_callback.losses)]

# all diff truth should be quite close
self.assertLess(max(diff_truth), 0.3, f"Difference {max(diff_truth)} is not within 0.3")
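For context, the diff_truth / diff_broken lists compare per-step losses recorded during training. As a rough illustration only (the callback name and hook are assumptions, not necessarily the PR's exact code), such losses can be collected with a TrainerCallback:

```python
from transformers import TrainerCallback

class StoreLossCallback(TrainerCallback):
    """Collect the training loss reported at each logging step."""

    def __init__(self):
        self.losses = []

    def on_log(self, args, state, control, logs=None, **kwargs):
        # Trainer reports the running loss in the `logs` dict
        if logs is not None and "loss" in logs:
            self.losses.append(logs["loss"])
```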
Let's be a bit more aggressive and do 0.15 (this passes). I still feel that's quite big but I can't figure out why (my tests showed 0.001 should be doable).
Suggested change:
- self.assertLess(max(diff_truth), 0.3, f"Difference {max(diff_truth)} is not within 0.3")
+ self.assertLess(max(diff_truth), 0.15, f"Difference {max(diff_truth)} is not within 0.15")
It is strange: I tested on both Mac and Windows, and max(diff_truth) is 0.144, so 0.15 may fail on some other machine.
Done! I used TinyStories to narrow the gap down to the same value as you. The code is submitted.
tests/trainer/test_trainer.py
Outdated
diff_broken = [abs(base - grad) for base, grad in zip(base_loss_callback.losses, broken_loss_callback.losses)]

# all diff truth should be quite close
self.assertLess(max(diff_truth), 0.3, f"Difference {max(diff_truth)} is not within 0.3")
Suggested change:
- self.assertLess(max(diff_truth), 0.3, f"Difference {max(diff_truth)} is not within 0.3")
+ self.assertLess(max(diff_truth), 0.2, f"Difference {max(diff_truth)} is not within 0.2")
Similarly we can be aggressive here too
I managed to reduce the gap to 1e-4 by padding all input labels to the same length. However, this method did not work for the GPT-2 model. I will continue to explore other solutions.
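To illustrate the padding idea, here is a minimal sketch (names and usage are assumptions, not the PR's code): labels can be padded to a common length with -100, the ignore index of PyTorch's cross-entropy loss, so the padded positions do not change the loss value:

```python
def pad_labels(labels, max_length, pad_value=-100):
    # -100 is ignored by cross-entropy, so padding with it equalizes
    # sequence lengths without affecting the per-example loss
    return labels + [pad_value] * (max_length - len(labels))

# hypothetical usage on a tokenized `datasets.Dataset` with a "labels" column:
# dataset = dataset.map(lambda ex: {"labels": pad_labels(ex["labels"], max_length=128)})
```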
Sorry @muellerzr but this does not solve:
FAILED examples/pytorch/test_pytorch_examples.py::ExamplesTests::test_run_speech_recognition_seq2seq - TypeError: Wav2Vec2Model.forward() got an unexpected keyword argument 'num_items_in_batch'
so I am not sure I understand. Related to #35113 and #35128.
We can't merge with the broken test
@ArthurZucker The Wav2Vec2Model bug occurs because SpeechEncoderDecoderModel accepts variable keyword arguments in its forward signature: transformers/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py, line 457 in c8c8dff.
But it dispatches those arguments to its encoder and decoder, which do not accept variable arguments: transformers/src/transformers/models/speech_encoder_decoder/modeling_speech_encoder_decoder.py, lines 489 to 493 in c8c8dff.
I think the better solution is to modify its decoder to accept variable arguments. I proposed a new commit and the tests succeed.
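To illustrate the dispatch problem in general terms (this is a sketch under assumed names, not the repository's code): when a composite model's forward accepts **kwargs but its sub-modules do not, extra keys such as num_items_in_batch must either be filtered against the sub-module's signature or the sub-module must itself accept variable keyword arguments, which is the route taken here. One generic way to filter:

```python
import inspect

def filter_forward_kwargs(module, kwargs):
    """Keep only the keyword arguments that module.forward actually declares."""
    accepted = set(inspect.signature(module.forward).parameters)
    return {key: value for key, value in kwargs.items() if key in accepted}

# hypothetical use inside a composite forward():
# decoder_kwargs = filter_forward_kwargs(self.decoder, kwargs)
# decoder_outputs = self.decoder(input_ids=decoder_input_ids, **decoder_kwargs)
```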
Finally all unit tests passed. Please check again. @muellerzr @ArthurZucker
Thanks a lot @techkang !
I did not dive deep enough into the test, my bad 🤗
Merging ASAP and doing the patch
set_seed(42)
import datasets

- model_name = "distilgpt2"
+ model_name = "nickypro/tinyllama-110M"
Okay, would be nice if we had a safetensors model here but alright.
* fix GA bugs and add unit test
* narrow down model loss unit test diff gap
* format code to make ruff happy
* send num_items_in_batch argument to decoder
* fix GA loss bug in BertLMHeadModel
* use TinyStories-33M to narrow down diff gap
* fotmat code
* missing .config
* avoid add extra args

---------

Co-authored-by: kangsheng <kangsheng@meituan.com>
What does this PR do?
There are two ways to fix GA loss bugs:
1. Use num_items_in_batch in the loss function defined by the model. In this case, model_accepts_loss_kwargs is True.
2. Use compute_loss_func.
However, the previous unit test only covered the second condition, so I introduced a new unit test to cover the first condition and fixed the bugs along the way.
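As a hedged sketch of the second approach (the shapes and shift logic below are assumptions for a causal-LM setup, not this PR's code), a compute_loss_func passed to the Trainer can sum the token losses and divide by num_items_in_batch, so gradient accumulation does not change the effective loss scale:

```python
import torch.nn.functional as F

def compute_loss_func(outputs, labels, num_items_in_batch):
    logits = outputs.logits
    # shift so that each position predicts the next token
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    # normalize by the number of label tokens in the full (accumulated) batch
    return loss / num_items_in_batch

# trainer = Trainer(model=model, args=training_args, ..., compute_loss_func=compute_loss_func)
```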
Who can review?
@muellerzr @ArthurZucker