Vision cross attention mask transform #1141
Conversation
Dr. CI: ✅ No failures as of commit bead59a with merge base 37636a8. See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1141. (Note: links to docs will display an error until the docs builds have been completed.)
it looks great! I left a few comments, mostly around naming, since tiles/patches/image terminology is so confusing in general. I think this class would benefit from a small visual example too.
Regarding the interval parts, IMO we should try to make the operations as explicit as possible. Instead of tok1, tok2, i, vision_mask[0], we should try to give them names like idx_start, idx_end, etc.
participate in cross-attention with an image token will show True in the mask
and follow these rules:
1) Text tokens immediately following the image token up until the next image token
2) Consecutive image tokens attend to all subsequent text tokens
can we make it more visual? Something like this: https://github.com/huggingface/transformers/blob/60bb571e993b7d73257fb64044726b569fef9403/src/transformers/models/llava_next/modeling_llava_next.py#L446
Or a link to the paper + page where they have an image for it
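For reference, a small visual example along these lines might look like the following (the token sequence and indices are made up purely for illustration, not taken from the PR):

# Token sequence with three images, where <img> marks an image special token:
#
#   index:   0       1      2      3       4       5      6
#   token:   <img>   The    cat    <img>   <img>   sat    down
#
# Rule 1: "The" and "cat" immediately follow the image at index 0, so they
#         cross-attend only to that image (up until the next image token).
# Rule 2: the image tokens at indices 3 and 4 are consecutive, so "sat" and
#         "down" cross-attend to both of those images.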
class CrossAttentionMask(Transform):
    """
    Computes the cross-attention mask for text + image inputs. Text tokens that
if this CrossAttention is specific for text + images, should we indicate it in the name? Something like MultimodalCrossAttentionMask or VisionTextCrossAttentionMask?
Yeah that's a good point. Will rename
    text sequence.

    Args:
        num_patches (int): Number of patches per image, excluding class token.
I wonder if we should add a link to modules/VisionTransformer for an in-depth explanation of what num_patches means. For better clarity, would it make sense to rename it num_patches_per_tile, since later we multiply it by n_tiles?
If we say "number of patches per image", it may be confusing, because an image can have a variable number of patches.
later on you say:
single_image_seq_len = n_tiles * self.num_patches + 1
image_seq_len = single_image_seq_len * n_img
So tile != image. Image is a set of tiles.
Ah sorry, probably my shallow understanding of patches/images/tiles. What I intended was num_patches per tile. If it makes sense, I'd like to keep the name consistent with your vision transformer (maybe patch_grid_size or patch_size?), whichever parameter you use to compute the number of patches.
I also assumed that at this point the tiles are padded to the max across all the images. Is this incorrect? Where does the padding happen?
patches_per_tile is a fixed size, and it's calculated as (tile_size // patch_size)**2.
What I did in VisionTransformer was to ask the user to pass tile_size and patch_size, and I calculated it for them. The VisionTransformer has a helper function that saves this value: https://github.com/felipemello1/torchtune/blob/f683812626ad4559464840112ddce516487bea5c/torchtune/modules/vision_transformer.py#L249
Maybe get it from the model, or ask for tile_size and patch_size, to avoid user confusion?
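A rough sketch of the quantities being discussed, assuming tile_size and patch_size are defined as in VisionTransformer (the concrete numbers below are hypothetical, only to make the arithmetic visible):

# Hypothetical sizes for illustration.
tile_size, patch_size = 224, 14
n_tiles, n_img = 4, 2

patches_per_tile = (tile_size // patch_size) ** 2      # fixed per tile
single_image_seq_len = n_tiles * patches_per_tile + 1  # +1 for the CLS token
image_seq_len = single_image_seq_len * n_img           # an image is a set of tiles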
    pass

class Compose(Transform): |
Since torchvision's Compose has a different behavior, I wonder if it makes sense to change Compose to something else, so users don't get confused with tv.Compose. Maybe "ComposeTransforms"?
how about Pipeline?
Just for my own understanding, the main difference with torchvision compose is that we support multiple inputs and multiple outputs here? Can we not just use torchvision compose with a single dict?
I tried naming something Pipeline before, but Kartikay said it would confuse people, because it is also used by other libraries :P. sklearn, I guess?
@ebsmothers our Compose needs to have a slightly different forward signature to unfold dictionary inputs. From torchvision:
def __call__(self, img):
    for t in self.transforms:
        img = t(img)
    return img
but to avoid confusion, I agree we should name it something else. Just haven't figured out what yet.
""" | ||
Returns a list of tuples of the form (start, end) where start is the index | ||
of the current image token and end is the index of the next image token, exclusive. | ||
If the image token attends until the end of the sequence, end will be -1. |
nit: should we add Args:, Returns:, Examples:?
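For illustration, a minimal sketch of the behavior this docstring describes might look like the following (the function name and signature are hypothetical, and this simplified version ignores the consecutive-image handling discussed above):

from typing import List, Tuple

def get_image_intervals(tokens: List[int], image_token_id: int) -> List[Tuple[int, int]]:
    # Positions of all image special tokens in the text sequence.
    image_positions = [i for i, tok in enumerate(tokens) if tok == image_token_id]
    intervals = []
    for n, pos in enumerate(image_positions):
        # end is the index of the next image token (exclusive), or -1 if this
        # image token attends until the end of the sequence.
        end = image_positions[n + 1] if n + 1 < len(image_positions) else -1
        intervals.append((pos, end))
    return intervals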
def __call__(self, *, tokens, images, **kwargs):
    # We are still at sample level pre-collating
    n_img, n_tiles, _, _, _ = images.shape
nit: Maybe add a comment with the other dimensions, so we know what they are, but keep the "_", so we know they are not used?
You said "# We are still at sample level pre-collating"
So is n_img == bsz? If so, for consistency with VisionTransformer, should we rename it?
yeah please add type and shape info for arguments to __call__
    # We are still at sample level pre-collating
    n_img, n_tiles, _, _, _ = images.shape
    text_seq_len = len(tokens)
    single_image_seq_len = n_tiles * self.num_patches + 1
nit: maybe add a comment explaining that +1 is for CLS, if that's the case
    image_num
    * single_image_seq_len : (image_num + 1)
    * single_image_seq_len,
nit: split line differently if linter allows it, as written this is confusing
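One possible split that might read more clearly, assuming the variables from the snippet above and that the row slice comes from the (start, end) interval shown elsewhere in the diff (a sketch only, not a suggestion the linter is guaranteed to accept):

col_start = image_num * single_image_seq_len
col_end = (image_num + 1) * single_image_seq_len
mask[start:end, col_start:col_end] = True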
    * single_image_seq_len,
] = True
kwargs.update({"encoder_mask": mask, "tokens": tokens, "images": images}) |
Why do we need to also update with tokens and images? Isn't this a no-op for those args?
Since they are explicit keyword args, they get unfolded from kwargs and you have to add them back in
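A minimal sketch of why this is needed (illustrative only, not the actual class):

def transform(*, tokens, images, **kwargs):
    # tokens and images are captured by the explicit keyword args, so kwargs no
    # longer contains them; returning kwargs alone would silently drop them.
    kwargs.update({"encoder_mask": None, "tokens": tokens, "images": images})  # mask computation omitted
    return kwargs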
mask[start:end, :] = True
masks.append(mask)
kwargs.update({"encoder_mask": masks, "tokens": tokens, "images": images}) |
Maybe it's just me but I find this whole kwargs pattern kinda unintuitive. Like really we are covertly using this transform to also add its inputs into the dictionary (does that mean that it doesn't already contain them?). Maybe I need to see the callsite, but I don't understand why we can't just have e.g.
sample: Dict[str, Any]
mask_transform = VisionCrossAttentionMask(...)
sample["encoder_mask"] = mask_transform(sample["tokens"], sample["images"])
Alternatively if you want to go more of the compose route:
class VisionCrossAttentionMask:
    def __call__(self, sample: Dict[str, Any]) -> Dict[str, Any]:
        # just key into sample directly in the transform
        sample["encoder_mask"] = construct_mask(sample["tokens"], sample["images"])
        return sample
mask_transform = VisionCrossAttentionMask(...)
sample = mask_transform(sample)
But rn we are kinda halfway in between the two and it feels clunky to me.
cc @pbontrager who's thought about this a lot more
We were between two approaches: the kwargs approach here, and the sample approach you described where you key into the fields you need for the transform. One issue with the latter was the implicit assumption that token ids are always under "tokens", image tensors are always under "images", etc., so we thought we may have to provide these keys as attributes to the transform class. Then the key management gets a bit messy to handle and error-prone.
Although, thinking about this more, the collator will also implicitly assume the keys "tokens", "labels", "images", etc. will be present. If we're consistent with these across the library, and given that the number of these keys in the sample dict will remain quite small (~5-10 max), it might be fine to keep these assumptions, or do something similar to checkpointing with the model key, optim key, etc. as constants.
I don't have a strong opinion either way, although it would be nice to keep a forward signature of just sample.
Yeah imo you're gonna need the keys somewhere anyways, better to just be explicit about it than do this weird back and forth between dict values and standalone arguments.
mostly nits about naming/docs. Feel free to ignore.
Context
Multimodal models that use cross-attention layers in the text LLM backbone to attend to outputs of a vision encoder on the images require a cross-attention mask to ensure the correct text tokens attend to the right images. For multiple, interleaved images in a single sample, we follow the approach in Flamingo (Fig. 7 of https://arxiv.org/pdf/2204.14198) where text tokens after an image attend to that particular image (more details in docstrings). To create this mask, this PR adds a transform class that takes a sample dictionary (i.e., during data preprocessing in one of torchtune's dataset classes), computes this cross-attention mask and adds it to the dictionary.
Handling a variable number of tiles and images across samples will be done in the batch collator in a follow-up PR. For now, we return the masks as a list of tensors, one mask per image, since each image can have a varied n_tiles.
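A sketch of the output shape described here (all values below are hypothetical; the actual transform derives the intervals and sizes from the sample):

import torch

# Hypothetical sample-level values, only to make the sketch concrete.
text_seq_len = 6                 # len(tokens) for this sample
patches_per_tile = 4
intervals = [(0, 3), (3, -1)]    # (start, end) per image; -1 means "until end of sequence"
tiles_per_image = [2, 1]         # n_tiles can vary per image

masks = []
for (start, end), n_tiles in zip(intervals, tiles_per_image):
    image_seq_len = n_tiles * patches_per_tile + 1   # +1 for the CLS token
    end = text_seq_len if end == -1 else end
    mask = torch.zeros(text_seq_len, image_seq_len, dtype=torch.bool)
    mask[start:end, :] = True    # these text positions attend to this image
    masks.append(mask)
# `masks` is a list with one boolean mask per image, matching the description above.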
Changelog
VisionCrossAttentionMask
Test plan
pytest tests/torchtune/modules/transforms/test_transforms.py
Docs