
Move TF building to an actual build() method #23760

Merged: 17 commits into main from tf_functional_builds on Jun 6, 2023
Conversation

@Rocketknight1 Rocketknight1 (Member) commented May 25, 2023

This has been a longstanding dream of mine: to move all TF model building into a proper build() method, using symbolic tensors instead of actual dummies. Among other things, this would let us stop our very hacky overriding of save_spec, and would let us build our TF models with zero device FLOPs (although the speedup may be system-dependent, as we do incur some compile time with this approach). It would bring our models much closer to the Keras standard, which would stop Chollet casting curses upon me from afar.

In the past, we've run into serious problems with tensor names moving around when we tried this - I think I've figured out why, though, and I have a couple of ideas to resolve that without lots of hacky edge-case code.

This is an extremely draft PR that will break everything until I finish testing it properly!

Update: Using symbolic tensors turned out to be much slower - it works in most cases, but it increases our test runtime by a factor of ~4, which is probably not acceptable. Instead, I'm going to rework this PR to use a standard build() method with actual dummies. With some optimizations, I believe we can make this work while still preserving most of the benefits of this PR, including not repeating the build unnecessarily and adding the ability to override build() to speed up our slowest models.
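
For illustration, the core pattern looks something like this (a rough sketch, not the actual transformers implementation - the class and layer here are made up):

    import tensorflow as tf

    class SketchTFModel(tf.keras.Model):
        def __init__(self):
            super().__init__()
            self.dense = tf.keras.layers.Dense(4)

        @property
        def dummy_inputs(self):
            # Hypothetical dummy batch; real models derive shapes from their config
            return tf.ones((2, 3), dtype=tf.float32)

        def build(self, input_shape=None):
            if self.built:
                return  # don't repeat the build unnecessarily
            self.built = True  # set first so the call below doesn't re-enter build()
            self(self.dummy_inputs, training=False)

        def call(self, inputs, training=False):
            return self.dense(inputs)

    model = SketchTFModel()
    model.build()  # creates all weights by running the dummies through call()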

@HuggingFaceDocBuilderDev commented May 25, 2023

The documentation is not available anymore as the PR was closed or merged.

@Rocketknight1 Rocketknight1 changed the title from "Try moving all TF building to symbolic tensors" to "Move TF building to an actual build() method" on May 26, 2023
@Rocketknight1 Rocketknight1 force-pushed the tf_functional_builds branch from e3068b1 to aa599e4 on May 30, 2023 13:19
@Rocketknight1 Rocketknight1 requested review from gante and sgugger May 30, 2023 13:28
@Rocketknight1 (Member, Author)

This should be ready to review now! Some tests are failing, but that looks like Hub connection issues.

@sgugger sgugger left a comment (Collaborator)

Thanks for your PR. I think it needs more TF-expert eyes, so tagging @amyeroberts.

The changes in Longformer and LED are very big, so they should go in their own PR to make future git blame easier.

)
out = tf.matmul(x, w, transpose_b=True)
if b is not None:
    out += b
Collaborator:

Is the transpose of b unnecessary here?

Member Author:

Fixed!

@@ -22,6 +23,8 @@

logger = logging.get_logger(__name__)

build_context = threading.local()
Collaborator:

This is not used anywhere here?

Member Author:

Ah, you're right, sorry! This is leftover from an earlier approach I was trying.

Member Author:

Fixed!

@amyeroberts amyeroberts left a comment (Collaborator)

Nice :) Generally this looks like a good clean-up, but I'm not a fan of the current API - it overrides default Keras API behaviour in a non-obvious way. I left a more detailed comment in modeling_tf_utils.py.

One question I have is about the change in conditional logic in the TF modeling code, i.e. removing tf.cond(...) - is this necessary for the new build logic, or just an update made alongside it?

lambda: tf.transpose(pixel_values, perm=(0, 2, 3, 1)),
lambda: pixel_values,
)
if shape_list(pixel_values)[1] == num_channels:
Collaborator:

I don't think the updated check is compatible with graph mode / XLA compilation of models.

Member Author:

shape_list is actually quite smart (because I wasn't the one who wrote it) - it returns the static shape for each dimension when that's known at trace time, and the dynamic shape when it isn't. In an XLA compilation pass, shapes are fully static, so the static shape is always fully known. As a result, shape_list calls just return the static shape in an XLA context and don't introduce data-dependent conditionals.
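
For reference, it's roughly this (paraphrased from memory, so treat it as a sketch rather than the exact source):

    def shape_list(tensor):
        # Prefer static dimensions; fall back to dynamic tf.shape() entries
        dynamic = tf.shape(tensor)
        if tensor.shape == tf.TensorShape(None):
            return dynamic
        static = tensor.shape.as_list()
        return [dynamic[i] if dim is None else dim for i, dim in enumerate(static)]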

Collaborator:

Nice! But also don't put yourself down - I thought you'd written it :) I fear we shall never rid ourselves of shape_list

Comment on lines 1159 to 1165
def build_with_dummies(self, dummies=None):
    if self.built_with_dummies and dummies is None:
        return
    if dummies is None:
        dummies = self.dummy_inputs
    self(dummies, training=False)
    self.built_with_dummies = True
Collaborator:

I think the API here is a bit confusing:

  • Allowing dummies to be essentially any input
  • Rebuilding if dummies is None
  • Having input_shape as an argument for build and then not using it

I would rework this so that build matches the logic of the parent class, and build_with_dummies just uses the model's dummy values. That way build remains generic and build_with_x specifically builds with x, e.g. something like:

    def build(self, input_shape):
        if self.built or call_context().in_call:
            self.built = True
            return

        self(input_shape, training=False)
        self.built = True
        self.built_with_dummies = True

    def build_with_dummies(self):
        if self.built_with_dummies:
            return

        self(self.dummy_inputs, training=False)
        self.built_with_dummies = True
        self.built = True

This would:

  • Change all the calls from model.build() to model.build_with_dummies(). We can then remove all the comments next to the build calls explaining that we're using dummy inputs.
  • Remove the need to call super().build(input_shape) when we want the old build logic.
  • Remove the need to set input_shape to None in all the current build methods.

Also - why have the distinction between built and built_with_dummies?

@Rocketknight1 Rocketknight1 (Member, Author) commented May 30, 2023

Actually, I should explain my reasoning for some of the changes here - you're probably right that I can improve the API, though!

Firstly, the removal of tf.cond is no longer a necessary part of this PR, but it is good practice (Longformer and LED are the only two models in all of Transformers that use it in their modelling code). The reason is the Keras call stack: in the __call__ method of any TF module, Keras appends that layer to the call stack and enters that layer's namespace. This means that if you have self.bert, and that calls self.encoder, and that calls self.attn, Keras will be in the bert/encoder/attn namespace.

Incredibly, though, tf.cond counts as a layer with its own namespace, but only when the tf.cond is not being eagerly evaluated. In my initial PR, I was trying to replace our dummies with symbolic TF tensors, which meant the tf.cond was not evaluated at compile time, but instead had to be compiled as a conditional in the model graph. The result is that all layer weights inside the conditional got encapsulated in a /cond.1/ namespace. This broke compatibility with existing checkpoints.

Removing tf.cond helped, but to be safe I added a manual build to those layers to directly control the weight naming regardless of what the call stack thought it should be. As a result, I could probably revert the tf.cond changes, but I think it's preferable not to - we should keep tf.cond out of modelling code and use if statements instead, which TF can still compile into graph conditionals when it can't resolve the branch at compile time. tf.cond is fine in generation code, where no weight names are created.
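
To make the contrast concrete, here's a hedged sketch based on the pixel_values check from the diff above:

    # Python `if` on shape_list: when the channel dim is static, the branch is
    # resolved at trace time and no cond namespace is ever created
    if shape_list(pixel_values)[1] == num_channels:
        pixel_values = tf.transpose(pixel_values, perm=(0, 2, 3, 1))

    # tf.cond on a symbolic predicate: compiled as a graph conditional, and any
    # layer weights created inside the branches pick up an extra cond namespace
    pixel_values = tf.cond(
        tf.equal(tf.shape(pixel_values)[1], num_channels),
        lambda: tf.transpose(pixel_values, perm=(0, 2, 3, 1)),
        lambda: pixel_values,
    )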

Secondly, the distinction between build() and build_with_dummies() is a bit of an ugly hack - I think I could probably remove build_with_dummies() entirely, but there was a piece of the TF-PT crossloading code that only worked if it could build the model with specific inputs of its choice. I added build_with_dummies() to support that, with a separate built_with_dummies flag to make sure that any repeated calls wouldn't waste more time. However, it would probably make more sense to just manually pass the inputs through the model in those particular crossloading functions and delete the method and the flag. WDYT?

@amyeroberts (Collaborator)

> tf.cond counts as a layer with its own namespace, but only when the tf.cond is not being eagerly evaluated.

😑

In this case, let's rid ourselves of this pseudolayer! I'm pro the if/else changes :)

> it would probably make more sense to just manually pass the inputs through the model in those particular crossloading functions and delete the method and the flag. WDYT?

Yep, that's what I would go for. Would it be possible to still have some of the logic to exit early if already built? Or would that be too tricky to be worth it?

@Rocketknight1 (Member, Author)

I think we could, but it's probably not necessary - the only cases where we build the model with specific inputs are in weird PT-TF crossloading functions, which should always be called during or near model init anyway, so I think it's fine if there's a risk of a little bit of duplicated work there to save on overall code complexity.

@Rocketknight1 Rocketknight1 force-pushed the tf_functional_builds branch from 4386387 to c76c308 on June 2, 2023 14:37
@Rocketknight1 (Member, Author)

@amyeroberts Done! build_with_dummies is no more

@amyeroberts amyeroberts left a comment (Collaborator)

Thanks for iterating - nice cleanup!

Just a small Q about build logic for LED / Longformer

Comment on lines +714 to +719
with tf.name_scope("query_global"):
    self.query_global.build((self.config.hidden_size,))
with tf.name_scope("key_global"):
    self.key_global.build((self.config.hidden_size,))
with tf.name_scope("value_global"):
    self.value_global.build((self.config.hidden_size,))
Collaborator:

Two silly questions:

  • why do these layers need to be built separately here?
  • why no super().build(input_shape) call in the method?

Member Author:

Good question! The answer is that it's hard to construct dummy inputs that reliably touch all of those layers, so they tend to be left unbuilt unless we build them explicitly.

As for the super().build(), I just forgot! The base build() method doesn't really do anything, but you're right that I should probably still call it just in case.
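
Concretely, the corrected method would just end with the parent call - a sketch mirroring the diff above:

    def build(self, input_shape):
        with tf.name_scope("query_global"):
            self.query_global.build((self.config.hidden_size,))
        with tf.name_scope("key_global"):
            self.key_global.build((self.config.hidden_size,))
        with tf.name_scope("value_global"):
            self.value_global.build((self.config.hidden_size,))
        super().build(input_shape)  # mostly just marks the layer as built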

@Rocketknight1 (Member, Author)

Also, this PR looks ready, but I'm going to let it sit for a couple of days to make sure the CI is working again after my last library-breaking PR, then merge it.

@Rocketknight1 (Member, Author)

Change of plans: The CI is working except for OOM errors during building for some of the pipelines, and since this PR cleans up building a bit, we're going to merge it and see if it helps. If it doesn't, I'll open a new PR to see if I can lower the memory usage in the affected models.

@Rocketknight1 Rocketknight1 merged commit 4a55e47 into main Jun 6, 2023
@Rocketknight1 Rocketknight1 deleted the tf_functional_builds branch June 6, 2023 17:30
@@ -69,11 +74,14 @@
if parse(tf.__version__).minor >= 13:
    from keras import backend as K
    from keras.__internal__ import KerasTensor
    from keras.engine.base_layer_utils import call_context
@frostming frostming commented Jun 7, 2023

This will break, since Keras 2.13 has moved the import to keras.src.engine.

See #23663

Member Author:

Thanks for the catch, I'll make the fix ASAP!
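
Presumably the fix is just a version-guarded import path, something like this sketch (the keras.src path is taken from @frostming's comment; the exact fix may differ):

    import tensorflow as tf
    from packaging.version import parse

    if parse(tf.__version__).minor >= 13:
        # Keras 2.13 moved its internals under keras.src
        from keras.src.engine.base_layer_utils import call_context
    else:
        from keras.engine.base_layer_utils import call_context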

lambda: _make_causal_mask(input_shape, past_key_values_length=past_key_values_length),
lambda: _expand_mask(tf.ones((batch_size, seq_len + past_key_values_length)), tgt_len=seq_len),
)
if seq_len > 1:

Hi @Rocketknight1, I'm unfortunately having an issue with this change. When I build a functional Keras model using the Whisper encoder & decoder layers, I can no longer serialize the model, because this change raises the error:

Using a symbolic `tf.Tensor` as a Python `bool` is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.

Here is a minimal reproducible example to raise the error:

from transformers import TFWhisperModel
import tensorflow as tf

whisper = TFWhisperModel.from_pretrained("openai/whisper-tiny")
inp = tf.keras.Input((80, 3000))
stack = whisper.get_encoder()(inp)
decoder_input_ids = tf.ones((tf.shape(inp)[0], 1), dtype=tf.int32) * whisper.config.decoder_start_token_id
stack = whisper.get_decoder()(input_ids=decoder_input_ids, encoder_hidden_states=stack.last_hidden_state)
model = tf.keras.Model(inp, stack)
model.summary()
model.save("whisper-tiny-custom")

What do you think?
I will open an issue for this to be referenced!


I opened a corresponding issue

novice03 pushed a commit to novice03/transformers that referenced this pull request Jun 23, 2023
* A fun new PR where I break the entire codebase again

* A fun new PR where I break the entire codebase again

* Handle cross-attention

* Move calls to model(model.dummy_inputs) to the new build() method

* Seeing what fails with the build context thing

* make fix-copies

* Let's see what fails with new build methods

* Fix the pytorch crossload build calls

* Fix the overridden build methods in vision_text_dual_encoder

* Make sure all our build methods set self.built or call super().build(), which also sets it

* make fix-copies

* Remove finished TODO

* Tentatively remove unneeded (?) line

* Transpose b in deberta correctly and remove unused threading local

* Get rid of build_with_dummies and all it stands for

* Rollback some changes to TF-PT crossloading

* Correctly call super().build()