Resizable image positional embeddings #1695
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1695
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 59f1996 with merge base 4e69db8. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -0,0 +1,104 @@
# Config for single device full finetuning in full_finetune_single_device.py
What's the purpose of adding this? Just so that we support the non-instruct version of the model? I'm a bit confused because I thought one big diff between instruct-tuned and not is the extra trainable special tokens on the text side, which this PR doesn't address.
This is just for testing. I will remove it before the PR is ready.
Regarding the special token, that's a question for @pbontrager. I am not sure.
tile_pos_emb_test_cases = [
    {
        "tgt_num_tiles": 1,
        # [max_num_tiles, max_num_tiles, -1, embed_dim] -> (2, 2, 2, 3)
Sorry, I don't fully follow these comments in each test case.
Just trying to help the reader understand the dimensions provided; I can make it better. The -1 is because the actual pos embedding has dim=1 there, but when I created the tests, I used 2.
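For reference, a minimal sketch of how that shape annotation maps onto a toy tensor (the tensor name and values below are illustrative, not the actual test fixtures):

```python
import torch

# Toy tensor matching the annotation
# [max_num_tiles, max_num_tiles, -1, embed_dim] -> (2, 2, 2, 3).
# The "-1" slot is 1 in the real positional embedding but 2 in these tests.
max_num_tiles, tok_dim, embed_dim = 2, 2, 3
toy_pos_embed = torch.arange(
    max_num_tiles * max_num_tiles * tok_dim * embed_dim, dtype=torch.float32
).reshape(max_num_tiles, max_num_tiles, tok_dim, embed_dim)

print(toy_pos_embed.shape)  # torch.Size([2, 2, 2, 3])
```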
@@ -100,23 +100,314 @@ def __init__(

        self.gate = nn.Parameter(torch.zeros(1))

        self._register_load_state_dict_pre_hook(self._load_state_dict_hook)

    def _load_state_dict_hook(
I don't see any test case for this, which is arguably the important part.
It mostly calls the other functions, but it makes sense to create a unit test for it.
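A rough sketch of what such a test could look like. `TinyPosEmbed` is a made-up stand-in (a simple 1D token embedding rather than the tiled one in this PR), used only to show the pattern being asserted: loading a checkpoint whose positional embedding has a different shape should succeed because the registered pre-hook resizes it.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyPosEmbed(nn.Module):
    """Stand-in module: a token positional embedding resized on load."""

    def __init__(self, n_tokens: int, embed_dim: int) -> None:
        super().__init__()
        self.pos_embed = nn.Parameter(torch.randn(n_tokens, embed_dim))
        self._register_load_state_dict_pre_hook(self._load_state_dict_hook)

    def _load_state_dict_hook(self, state_dict, prefix, *args, **kwargs):
        key = prefix + "pos_embed"
        inpt = state_dict[key]
        tgt_n_tokens, _ = self.pos_embed.shape
        if inpt.shape[0] != tgt_n_tokens:
            # (n_tokens, embed_dim) -> (1, embed_dim, n_tokens) so that
            # F.interpolate can resize along the token dimension.
            resized = F.interpolate(
                inpt.permute(1, 0).unsqueeze(0),
                size=tgt_n_tokens,
                mode="linear",
                align_corners=True,
            )
            state_dict[key] = resized.squeeze(0).permute(1, 0)


def test_load_state_dict_hook_resizes_pos_embed():
    src = TinyPosEmbed(n_tokens=5, embed_dim=8)
    tgt = TinyPosEmbed(n_tokens=10, embed_dim=8)

    # Would raise a size-mismatch error without the pre-hook.
    tgt.load_state_dict(src.state_dict())

    assert tgt.pos_embed.shape == (10, 8)
    # With align_corners=True the first and last tokens are preserved.
    assert torch.allclose(tgt.pos_embed[0], src.pos_embed[0])
    assert torch.allclose(tgt.pos_embed[-1], src.pos_embed[-1])
```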
""" | ||
# inverse n_tokens_per_tile = patch_grid_size**2 + 1, where +1 is the cls token | ||
inpt_n_tokens_per_tile, inpt_embed_dim = inpt_pos_embed.shape | ||
inpt_patch_grid_size = int(math.sqrt(inpt_n_tokens_per_tile - 1)) |
Kind of a nit, but why do you compute it in the method for local pos embeddings and pass it to the method for global pos embeddings?
Will make it consistent. I don't think there is a reason.
Took a closer look: in both methods I am passing "tgt_patch_grid_size":
        inpt_local_pos_embed = self._resize_local_position_embedding(
            local_pos_embed=inpt_local_pos_embed,
            tgt_patch_grid_size=int(math.sqrt(tgt_n_tokens_per_tile - 1)),
        )
        inpt_global_pos_embed = self._resize_global_position_embedding(
            global_pos_embed=inpt_global_pos_embed,
            tgt_max_num_tiles=tgt_max_num_tiles_x,
            tgt_patch_grid_size=int(math.sqrt(tgt_n_tokens_per_tile - 1)),
        )
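For readers following along, here is a rough, self-contained sketch of what resizing the per-tile (local) positional embedding could look like. It illustrates the general interpolation approach, not the PR's exact implementation; the interpolation mode, `align_corners` choice, and CLS-token handling are assumptions.

```python
import math

import torch
import torch.nn.functional as F


def resize_local_pos_embed_sketch(
    local_pos_embed: torch.Tensor, tgt_patch_grid_size: int
) -> torch.Tensor:
    """Interpolate a (n_tokens_per_tile, embed_dim) positional embedding to a
    new patch grid size, keeping the CLS token untouched."""
    n_tokens_per_tile, embed_dim = local_pos_embed.shape
    inpt_patch_grid_size = int(math.sqrt(n_tokens_per_tile - 1))

    # Split off the CLS token; only the patch tokens live on a 2D grid.
    cls_token, patch_tokens = local_pos_embed[:1], local_pos_embed[1:]

    # (grid*grid, embed_dim) -> (1, embed_dim, grid, grid) for F.interpolate.
    patch_tokens = patch_tokens.reshape(
        inpt_patch_grid_size, inpt_patch_grid_size, embed_dim
    ).permute(2, 0, 1).unsqueeze(0)
    patch_tokens = F.interpolate(
        patch_tokens,
        size=(tgt_patch_grid_size, tgt_patch_grid_size),
        mode="bilinear",
        align_corners=True,
    )

    # Back to (tgt_grid*tgt_grid, embed_dim) and re-attach the CLS token.
    patch_tokens = patch_tokens.squeeze(0).permute(1, 2, 0).reshape(-1, embed_dim)
    return torch.cat([cls_token, patch_tokens], dim=0)


resized = resize_local_pos_embed_sketch(torch.randn(5 * 5 + 1, 8), tgt_patch_grid_size=7)
print(resized.shape)  # torch.Size([50, 8]) == (7 * 7 + 1, 8)
```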
        pos_embed = pos_embed.reshape(
            max_num_tiles_x,
            max_num_tiles_y,
            inpt_patch_grid_size,
            inpt_patch_grid_size,
            embed_dim,
        )

        # combine max_num_tiles and patch_grid_size into one dimension
        pos_embed = pos_embed.permute(0, 2, 1, 3, 4).contiguous()
        pos_embed = pos_embed.reshape(
            max_num_tiles_x * inpt_patch_grid_size,
            max_num_tiles_y * inpt_patch_grid_size,
            embed_dim,
        )
Do we really need all 3 of these?
I think so. I don't see another way to get the same shape/order. I will ask metamate :P
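To illustrate the point: the middle permute is what interleaves the tile and patch axes before the second reshape; dropping it would still produce the right shape but scramble the spatial order. A tiny demonstration with made-up axis sizes:

```python
import torch

# 2x2 tiles, 3x3 patch grid, embed_dim=1 (values are just 0..35 for clarity).
max_num_tiles_x = max_num_tiles_y = 2
grid = 3
embed_dim = 1
pos_embed = torch.arange(
    max_num_tiles_x * max_num_tiles_y * grid * grid * embed_dim
).reshape(max_num_tiles_x, max_num_tiles_y, grid, grid, embed_dim)

# With the permute: rows of neighbouring tiles land next to each other,
# giving a true (tiles_x * grid, tiles_y * grid) spatial layout.
with_permute = pos_embed.permute(0, 2, 1, 3, 4).contiguous().reshape(
    max_num_tiles_x * grid, max_num_tiles_y * grid, embed_dim
)

# Without the permute the reshape "works" shape-wise but the spatial
# ordering is wrong, so the two results differ.
without_permute = pos_embed.reshape(
    max_num_tiles_x * grid, max_num_tiles_y * grid, embed_dim
)
assert not torch.equal(with_permute, without_permute)
```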
Looks good! Please add the TODOs from the comments and an additional one to support resizing for the non-tiled positional embedding.
@@ -100,23 +100,322 @@ def __init__(

        self.gate = nn.Parameter(torch.zeros(1))

        self._register_load_state_dict_pre_hook(self._load_state_dict_hook)
Add a TODO here to switch to the public method after 2.5 is stable
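A minimal illustration of the two registration APIs the TODO refers to; the hook body is a no-op placeholder, and the availability of the public method in 2.5 is an assumption based on the comment above:

```python
from torch import nn


def noop_hook(*args, **kwargs) -> None:
    # Placeholder body; the real hook resizes positional embeddings.
    pass


m = nn.Linear(4, 4)
# Private API used in this PR:
m._register_load_state_dict_pre_hook(noop_hook)
# Public equivalent to switch to once 2.5 is the minimum supported version
# (note it may also pass the module as the hook's first argument):
m.register_load_state_dict_pre_hook(noop_hook)
```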
@@ -166,16 +464,127 @@ def __init__(
        )
        self.gate = nn.Parameter(torch.zeros(1))

        # Register load hook to interpolate positional embeddings
        self._register_load_state_dict_pre_hook(self._load_state_dict_hook)
Same comment for TODO
Co-authored-by: Felipe Mello <felipemello@fb.com>
Context
What is the purpose of this PR?
When loading a state dict for image models, the positional embeddings may have a different shape. This PR allows reshaping the embeddings to match the target shape that the model was initialized with.
TLDR of the steps:
Unit tests and docstrings should help understand the numbers
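As a rough illustration of the numbers involved (the sizes below are made up, not the real checkpoint's), the resizing works off the relation n_tokens_per_tile = patch_grid_size**2 + 1:

```python
import math

# Made-up sizes: checkpoint trained with a 5x5 patch grid and 2x2 tiles,
# model initialized with an 8x8 patch grid and 4x4 tiles.
inpt_grid, inpt_tiles, embed_dim = 5, 2, 16
tgt_grid, tgt_tiles = 8, 4

inpt_n_tokens_per_tile = inpt_grid**2 + 1  # 26 (+1 for the CLS token)
tgt_n_tokens_per_tile = tgt_grid**2 + 1    # 65

# Recover the grid size from the token count, as the load hook does:
assert int(math.sqrt(inpt_n_tokens_per_tile - 1)) == inpt_grid

# Local pos embed:  (26, 16)       -> (65, 16)
# Global pos embed: (2, 2, 26, 16) -> (4, 4, 65, 16)
```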
Changelog
Added resizing to match FAIR's implementation here:
Test plan