Add TFVisionEncoderDecoderModel #14148
Conversation
Thanks for adding this model!
```python
# Add `encoder_hidden_states` to make the cross-attention layers' weights initialized
if self.config.add_cross_attention:
    batch_size, seq_len = input_ids.shape
    shape = (batch_size, seq_len) + (self.config.hidden_size,)
    h = tf.random.uniform(shape=shape)
    dummy["encoder_hidden_states"] = h
```
Why is this part removed?
`TFEncoderDecoderModel.call()` doesn't have an `encoder_hidden_states` parameter, but `encoder_outputs`. Moreover, `encoder_hidden_states` is always passed to the decoder with `encoder_hidden_states = encoder_outputs[0]`. Therefore, there is no need to add `encoder_hidden_states` in `dummy_inputs`.
(I don't remember why I did that before; it was probably required in some intermediate commits.)
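For readers following along, a simplified sketch of the call flow being described (a paraphrased fragment, not the exact source):

```python
# Inside `TFEncoderDecoderModel.call()` (simplified): the encoder runs
# first when no precomputed outputs are passed in.
if encoder_outputs is None:
    encoder_outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)

# The decoder always receives the encoder's last hidden state this way,
# which is why `dummy_inputs` never needs an explicit
# `encoder_hidden_states` entry.
encoder_hidden_states = encoder_outputs[0]

decoder_outputs = self.decoder(
    input_ids=decoder_input_ids,
    encoder_hidden_states=encoder_hidden_states,
)
```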
Looks good to me!
@sgugger, thank you for your review; I have addressed the comments. I am impressed by your ability to spot these details. I feel that sometimes I dive a bit deeper, and found: this one won't be reformatted by the style script, but the following one will work well. The difference is the trailing comma after the last argument. Is this a bug? I can open an issue if so.
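The original snippets aren't shown in this thread, but the described behavior is consistent with black's "magic trailing comma"; a hypothetical illustration (with trivial stand-ins so the snippet is self-contained):

```python
def model(input_ids, attention_mask):
    # Trivial stand-in so the example runs on its own.
    return input_ids, attention_mask

input_ids, attention_mask = [1, 2, 3], [1, 1, 1]

# Without a trailing comma, black may collapse this call onto one line
# if it fits within the line-length limit:
outputs = model(
    input_ids,
    attention_mask
)

# With a trailing comma after the last argument, black keeps the call
# exploded, one argument per line:
outputs = model(
    input_ids,
    attention_mask,
)
```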
```python
    output_hidden_states=output_hidden_states,
    return_dict=return_dict_in_generate,
)
encoder_kwargs = {
```
fine with me!
```diff
 # Expand input ids if num_beams > 1 or num_return_sequences > 1
-if num_return_sequences > 1 or num_beams > 1:
+if len(shape_list(input_ids)) == 2 and (num_return_sequences > 1 or num_beams > 1):
```
That's a bit hacky - vision inputs should also work with `num_beams > 1`, no? But OK for now, until we do the big `generate` refactor. @Rocketknight1
Just as a remark: the code inside this block treats `input_ids` as text-only inputs; for example, `shape_list(input_ids)[-1]` assumes that the last dimension is the sequence dimension.

For vision inputs, `generate` will be called only if it is a vision model used as the encoder in an encoder-decoder model (?). In that case, it is rather the `decoder_input_ids` that need to be processed, and I think this is done in the next block:

```python
if self.config.is_encoder_decoder:
```

It's not clear to me whether a standalone vision model will need to call `generate()`. Maybe @NielsRogge can share some insights here (`ImageGPT`?).
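To make the shape assumption concrete, a minimal sketch (the `shape_list` stand-in mimics the transformers helper; the shapes are illustrative):

```python
import tensorflow as tf

def shape_list(t):
    # Minimal stand-in for the `shape_list` helper in transformers:
    # static dimensions where known, dynamic ones otherwise.
    static = t.shape.as_list()
    dynamic = tf.shape(t)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]

# Text input: rank 2, last dimension is the sequence dimension,
# so `len(shape_list(input_ids)) == 2` lets the expansion block run.
input_ids = tf.ones((2, 7), dtype=tf.int32)
print(len(shape_list(input_ids)))     # 2

# Vision input: rank 4 (batch, channels, height, width); the last
# dimension is not a sequence dimension, so the rank guard skips the
# text-style expansion and only `decoder_input_ids` are expanded later.
pixel_values = tf.random.uniform((2, 3, 224, 224))
print(len(shape_list(pixel_values)))  # 4
```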
```diff
@@ -153,7 +153,7 @@
 @add_start_docstrings(ENCODER_DECODER_START_DOCSTRING)
 class TFEncoderDecoderModel(TFPreTrainedModel):
     r"""
-    :class:`~transformers.TFEncoderDecoder` is a generic model class that will be instantiated as a transformer
+    :class:`~transformers.TFEncoderDecoderModel` is a generic model class that will be instantiated as a transformer
```
Thanks!
This is a great addition @ydshieh! Thanks a lot for the contribution.
From my side it would be great if we could:
- remove all pytorch specific changes that are not needed to get the TF version working (I'll tackle this in a future PR :-) )
- add one slow test that ensures that the model works correctly
Thanks a bunch!
# Conflicts:
#	docs/source/model_doc/auto.rst
#	src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py
Hi @patrickvonplaten, @Rocketknight1, @NielsRogge: I removed the changes to the PT code. I also added a slow test here. This PR is ready for review when you have the time :-)
```python
    return super().from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)

@classmethod
def from_encoder_decoder_pretrained(
```
We should probably add this part:

```python
# src/transformers/models/encoder_decoder/modeling_tf_encoder_decoder.py, line 402 (at 8f6373c)
if kwargs_encoder.get("from_pt", None):
```

here as well, no? Otherwise it'll be difficult to load TF from PT.
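For context, a hedged sketch of what that branch does in `from_encoder_decoder_pretrained` (a paraphrased fragment from memory; the temp-directory round trip and exact variable names are assumptions, not a verbatim copy):

```python
import tempfile

from transformers import TFAutoModel

# Hypothetical paraphrase: load the PyTorch checkpoint into a TF model,
# save it as a TF checkpoint, then reload it so that the weight names pick
# up the composite model's `load_weight_prefix`.
if kwargs_encoder.get("from_pt", None):
    del kwargs_encoder["from_pt"]
    with tempfile.TemporaryDirectory() as tmp_dir:
        encoder = TFAutoModel.from_pretrained(
            encoder_pretrained_model_name_or_path, from_pt=True, **kwargs_encoder
        )
        encoder.save_pretrained(tmp_dir)
        del encoder
        encoder = TFAutoModel.from_pretrained(tmp_dir, **kwargs_encoder)
```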
Yes, great catch! For some reason I missed it in this PR. I will check this PR against the `TFEncoderDecoderModel` code again.
@patrickvonplaten I corrected this part, and also updated the corresponding test (which I had also forgotten in the previous commits).
Thanks a lot!
```diff
@@ -628,14 +629,18 @@ def generate(
     bad_words_ids is None or isinstance(bad_words_ids, list) and isinstance(bad_words_ids[0], list)
 ), "`bad_words_ids` is either `None` or a list of lists of tokens that should not be generated"

+# This block corresponds to the following line in `generation_utils`:
```
This whole function is in dire need of a refactor - I'll try to tackle this with @Rocketknight1 this month. Good for me the way it is now though
This looks more or less ready to be merged to me:
- I think we should add the `from_pt` loading hack to `from_encoder_decoder(...)` here as well
- I think we can delete the file docs/source/model_doc/auto.rst, no?
Yes. I added the (empty) file to the commit by mistake during a git rebase/merge.
```python
max_diff = np.max(np.abs(logits_tf_2.numpy() - logits_tf.numpy()))
self.assertAlmostEqual(max_diff, 0.0, places=3)
```
There was a `# TensorFlow => PyTorch` block which did nothing (and if it did, it would fail, since we can't use `from_pretrained` together with `from_pt` or `from_tf` in this composite model). I removed it - the corresponding part in test_modeling_tf_encoder_decoder.py needs to be removed too, in another PR.
@NielsRogge @Rocketknight1 - could you guys take a look here as well?
Looks good to me now! I'll let @sgugger take a final look here.
```python
    Provide for sequence to sequence training to the decoder. Indices can be obtained using
    [`PreTrainedTokenizer`]. See [`PreTrainedTokenizer.encode`] and [`PreTrainedTokenizer.__call__`] for
    details.
```
The PyTorch model (`VisionEncoderDecoderModel`) automatically creates the `decoder_input_ids` by shifting the `labels`. Is this not the case for the TF one?
No. I started this work before the new change (about `decoder_input_ids`) was done in `VisionEncoderDecoderModel`, and didn't follow it. Would it be possible to leave it to another PR? I can make it, but I'd prefer to do it in a separate PR.
Ok makes sense!
See #14469
See #14139 (for info)
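For reference, the PT behavior under discussion builds `decoder_input_ids` by shifting `labels` one position to the right; a minimal TF sketch of that pattern (an illustrative port, not the code added in the linked PRs):

```python
import tensorflow as tf

def shift_tokens_right(input_ids: tf.Tensor, pad_token_id: int, decoder_start_token_id: int) -> tf.Tensor:
    # Prepend the decoder start token, drop the last label, and replace
    # any -100 (the loss ignore index) with the pad token id.
    start_tokens = tf.fill([tf.shape(input_ids)[0], 1], tf.cast(decoder_start_token_id, input_ids.dtype))
    shifted = tf.concat([start_tokens, input_ids[:, :-1]], axis=-1)
    return tf.where(shifted == -100, tf.cast(pad_token_id, shifted.dtype), shifted)

labels = tf.constant([[5, 6, -100]])
print(shift_tokens_right(labels, pad_token_id=0, decoder_start_token_id=2).numpy())  # [[2 5 6]]
```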
""" | ||
config_class = VisionEncoderDecoderConfig | ||
base_model_prefix = "vision_encoder_decoder" | ||
load_weight_prefix = "tf_vision_encoder_decoder_model_1" |
Where does the _1 come from?
I originally looked at `TFRagModel` as a reference for implementing the TF composite model, and saw:

```python
# tests/test_modeling_tf_rag.py, line 973 (at ac224bb)
load_weight_prefix = "tf_rag_model_1"
```

I copied it and it worked well (fixing the problems I had at that time), and I didn't think about this part in more detail.
I just spent some time checking this again - in fact, we can use `tf_vision_encoder_decoder_model` here, and `tf_encoder_decoder_model` for `TFEncoderDecoderModel`.

Changing this might break some user models - but since these are fairly new models, and not popular yet, maybe it is worth the change. cc @patrickvonplaten, @Rocketknight1, @sgugger, @LysandreJik for their thoughts on this.
(For `tf_rag_model_1`, maybe there was a particular reason it was needed to make something work, though.)
I changed it to `tf_vision_encoder_decoder_model` - better not to continue with the strange `_1`.
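After this change, the class attributes would read as follows (a sketch assembled from the diff above):

```python
class TFVisionEncoderDecoderModel(TFPreTrainedModel):
    config_class = VisionEncoderDecoderConfig
    base_model_prefix = "vision_encoder_decoder"
    # Prefix applied to sub-model weight names when loading checkpoints
    # into the composite model; the trailing `_1` copied from the RAG
    # reference is dropped.
    load_weight_prefix = "tf_vision_encoder_decoder_model"
```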
There are a few typos in the docstrings, but LGTM once they're fixed!
Thanks for all your work on this @ydshieh !
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Applied @sgugger's review suggestions. The failed tests are unrelated.
Thanks again for all your work on this!
What does this PR do?
This PR makes the Vision-Encoder-Text-Decoder family complete by adding `TFVisionEncoderDecoderModel`.

To complete this PR, it requires #13778 to be merged to master first (then rebase). (And if we want to include a real integration test using the recent image-captioning ViT + GPT2 model, we need to wait for #14038 too.)
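As a quick usage illustration once merged (a minimal sketch; the checkpoint names are placeholders for any compatible vision encoder / text decoder pair, not taken from this PR):

```python
import tensorflow as tf

from transformers import TFVisionEncoderDecoderModel

# Compose a TF vision encoder with a TF text decoder; cross-attention is
# added to the decoder configuration automatically.
model = TFVisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k", "gpt2"
)

# Dummy image batch: (batch, channels, height, width).
pixel_values = tf.random.uniform((1, 3, 224, 224))
decoder_input_ids = tf.constant([[model.config.decoder.bos_token_id]])

outputs = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)
print(outputs.logits.shape)  # (1, 1, vocab_size)
```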