Update Full Finetune for MM #1548
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1548
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit fe1a781 with merge base c5db813.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1548      +/-   ##
==========================================
- Coverage   73.36%   71.25%   -2.11%
==========================================
  Files         287      290       +3
  Lines       14142    14233      +91
==========================================
- Hits        10375    10142     -233
- Misses       3767     4091     +324

☔ View full report in Codecov by Sentry.
I know we need to expose collate_fn for multimodal, but it feels odd that users need to remember to set a collator (or even know what that is) if they want to train multimodal. Can we do an alternative approach, like a multimodal flag, or infer it from the dataset?
@@ -240,10 +242,12 @@ def setup(self, cfg: DictConfig) -> None:

        # sampler and dataloader depend on the tokenizer and loss_fn and should be
        # setup after both of these are initialized
        collate_name = cfg.get("collate_fn", "torchtune.data.padded_collate_sft")
not a consideration for this PR, but I really don't like the proliferation of all these config defaults. It makes the yaml config stray further from being the source of truth because there are hidden parameters. We should figure out some process of adding these to all our configs
If we are planning to make collate_fn configurable from this point on, it might be worth actually updating all our configs to show this. But I'm also not sure if we anticipate this being configured often.
This is a good point. I'd be interested in compiling a list of guidelines for recipes and deciding on what our standard is going forward.
@@ -423,6 +427,7 @@ def _setup_data(
        cfg_dataset: DictConfig,
        shuffle: bool,
        batch_size: int,
        collate_fn: str,
There's no documentation anywhere on how to specify the collate_fn. Since it's not surfaced in our configs, it's quite obscure how to customize this. I know we don't have docstrings for these setup methods, but I would say collate_fn should have a quick explanation, especially since it does some dotpath magic.
-else partial(
-    padded_collate_packed,
-),
+else padded_collate_packed,
should also explain somewhere that the collate_fn is ignored if packed=True
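To make that behavior concrete, here is a minimal sketch of the relevant recipe logic, not the exact shipped code: the local names (collate_name, ds, sampler, batch_size, packed, self._tokenizer), the import paths, and the padding_idx kwarg are assumptions for illustration.

```python
from functools import partial

from torch.utils.data import DataLoader
from torchtune.config._utils import _get_component_from_path  # assumed import path
from torchtune.data import padded_collate_packed  # assumed import path

# Resolve the dotpath string from the config, e.g. "torchtune.data.padded_collate_sft".
collate_fn = _get_component_from_path(collate_name)

dataloader = DataLoader(
    dataset=ds,
    batch_size=batch_size,
    sampler=sampler,
    collate_fn=(
        partial(collate_fn, padding_idx=self._tokenizer.pad_id)  # kwarg name is an assumption
        if not packed
        # when packed=True, the collate_fn from the config is silently ignored
        else padded_collate_packed
    ),
)
```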
I actually don't like this if/else construct here. I think after this we need to do an update and standardization of our padding function offerings.
Yeah I agree, we have a bunch of collate functions now and they are all added in a very ad hoc fashion. I think @RdoubleA's comments still hold though... the change in how we use collate_fn can potentially cause problems for people. Also, I hate to say it, but all our usage of _get_component_from_path is starting to feel like we are just hacking in a registry (not saying we shouldn't do it here btw, I actually think it's the best approach rn).
        for i, message in enumerate(sample[self._column_map["texts"]]):
            user_content = [{"type": "text", "content": message["user"]}]
            if i == 0:
                user_content = img_content + user_content
            messages.append(
                Message(
                    role="user",
not every message you loop through here would be a user message?
Every "message" here is a user/message pair
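For readers following along, a hedged illustration of the assumed sample layout (field values are made up): each entry in the "texts" column is one user/assistant pair, and img_content is prepended only to the first user turn.

```python
sample = {
    "images": [...],  # image data for the sample
    "texts": [
        {"user": "What is shown in the image?", "assistant": "A stop sign."},
        {"user": "What color is it?", "assistant": "Red."},
    ],
}
```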
Just commenting here for convenience, but is the example given in the docstring actually correct? Seems to me like it isn't
    """
    fusion_params = {}
    for k, v in model.named_modules():
could you make the for loop variables more descriptive? it's hard to follow what's happening
This is a direct port of the get_peft_params function, so I'd prefer to keep both of those implementations in sync for now.
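For reference, a hedged sketch of the same loop with more descriptive names (the shipped code keeps k/v to stay in sync with the PEFT version; the exact name-joining logic below is an assumption):

```python
fusion_params = {}
for module_name, module in model.named_modules():
    # Modules opt in by exposing a fusion_params() method that lists their
    # fusion parameter names.
    if hasattr(module, "fusion_params") and callable(module.fusion_params):
        current_fusion_params = module.fusion_params()
        for param_name, param in module.named_parameters(recurse=True):
            if param_name in current_fusion_params:
                full_name = f"{module_name}.{param_name}" if module_name else param_name
                fusion_params[full_name] = param
                current_fusion_params.remove(param_name)
```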
The proliferation of my laziness..
                current_fusion_params.remove(n)
        assert (
            current_fusion_params == []
        ), f"Fusion params {current_adapter_params} not converted"
maybe "Fusion params retrieved but not found in model's named parameters?"
torchtune/utils/_device.py (Outdated)

        elif isinstance(v, torch.Tensor):
            batch[k] = v.to(device)
        else:
            raise AttributeError(
This should be a ValueError or TypeError
+1 to all Rafi's comments
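For reference, a sketch of what the fixed helper could look like. The name batch_to_device and the Tensor branch come from the diff above; the nested-dict branch and the choice of ValueError (rather than TypeError) are assumptions, not the final implementation.

```python
import torch


def batch_to_device(batch: dict, device: torch.device) -> None:
    """Move every tensor in a (possibly nested) batch dict onto `device`, in place."""
    for k, v in batch.items():
        if isinstance(v, dict):
            batch_to_device(v, device)
        elif isinstance(v, torch.Tensor):
            batch[k] = v.to(device)
        else:
            raise ValueError(
                f"Expected all values in the batch to be dicts or Tensors, "
                f"but key '{k}' has type {type(v)}"
            )
```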
@@ -34,8 +34,8 @@ Multimodal datasets
    :toctree: generated/
    :nosignatures:

    llava_instruct_dataset
    the_cauldron_dataset
    multimodal.llava_instruct_dataset
thank u
@@ -37,3 +37,33 @@ def fusion_params(self) -> List[str]:
        return [k for k, v in self.named_parameters()]

    module.fusion_params = functools.partial(fusion_params, module)


def get_fusion_params(model: nn.Module) -> Dict[str, nn.Parameter]:
What's the purpose of this function? Why would you need it?
It allows you to control which parameters you want to freeze. It's similar to get_peft_params. For an example, look in the DeepFusionModel init.
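A hedged usage sketch of what that enables (the import path is an assumption, and the manual requires_grad loop is just one way to apply it):

```python
from torchtune.modules.model_fusion import get_fusion_params  # assumed import path

# Freeze the whole model except the fusion layers, analogous to the
# get_peft_params pattern mentioned above.
fusion_param_names = set(get_fusion_params(model))
for name, param in model.named_parameters():
    param.requires_grad = name in fusion_param_names
```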
How does this deal with the kv cache changes?
(it doesn't)
No huge concerns from me, just a bunch of small comments
    "SFTDataset",
    "hh_rlhf_helpful_dataset",
    "llava_instruct_dataset",
    "multimodal",
Why not just expose the APIs directly in datasets/multimodal/__init__.py? Feels a bit weird to list a folder as a public API like this.
I tried not including multimodal in this init and the import path stopped working
@@ -100,6 +100,7 @@ def __init__(
        self.stop_tokens = self.tokenizer.stop_tokens
        self.max_seq_len = max_seq_len
        self.prompt_template = prompt_template
        self.pad_id = self.tokenizer.pad_id
I know there's not a clear way around it, but I don't love that we have to do this. It'll be a gotcha for anyone trying to write their own transforms
There is a way around this, but it involves inheriting from the tokenizer.
        out = out.masked_scatter(mask, embeds)
        out = out.masked_scatter(~mask, fusion_embeds)
Is this because we need to track the grads?
It's because using the inplace masked_scatter was giving bad gradients
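To spell out the difference (a hedged illustration of the same two lines, not a claim about a specific autograd bug):

```python
# Out-of-place masked_scatter returns a new tensor, leaving the original `out`
# (which autograd may have saved for backward) untouched:
out = out.masked_scatter(mask, embeds)
out = out.masked_scatter(~mask, fusion_embeds)

# The in-place variant that was giving bad gradients here would be:
# out.masked_scatter_(mask, embeds)
# out.masked_scatter_(~mask, fusion_embeds)
```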
        if encoder_trainable:
            trainable_params |= {
                f"encoder.{n}" for n, p in self.encoder.named_parameters()
            }
        if decoder_trainable:
            trainable_params |= {
                f"decoder.{n}" for n, p in self.decoder.named_parameters()
            }
        if fusion_trainable:
            trainable_params |= set(get_fusion_params(self))
        else:
            trainable_params -= set(get_fusion_params(self))
This assumes a default state of (decoder_trainable, encoder_trainable, fusion_trainable) = (False, False, True), right? It's a bit unintuitive to me that we remove only the fusion params explicitly but not the other ones
It's because fusion params are not their own separate module but part of the encoder and decoder: encoder = pretrained_encoder + fusion_params and decoder = pretrained_decoder + fusion_params. So you add the encoder and/or decoder params first. Then you either remove all fusion_params (if fusion_trainable=False) or add them all, since if the encoder/decoder isn't trainable you might otherwise be missing some.
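A small worked illustration of that logic, using hypothetical parameter names:

```python
# Fusion params live *inside* the encoder/decoder, so they ride along with
# whichever of those is marked trainable.
encoder = {"encoder.layer.weight", "encoder.fusion_proj.weight"}
decoder = {"decoder.layer.weight", "decoder.fusion_gate.weight"}
fusion = {"encoder.fusion_proj.weight", "decoder.fusion_gate.weight"}

# decoder_trainable=True, encoder_trainable=False, fusion_trainable=False:
trainable = set() | decoder
trainable -= fusion  # explicitly strip the fusion param that came in via the decoder
assert trainable == {"decoder.layer.weight"}

# encoder/decoder frozen, fusion_trainable=True: the |= branch is what pulls in the
# encoder-side fusion param that no other branch would have added.
trainable = set() | fusion
assert trainable == {"encoder.fusion_proj.weight", "decoder.fusion_gate.weight"}
```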
torchtune/modules/transformer.py (Outdated)

@@ -261,22 +261,27 @@ def forward(

        # A mask of tokens (x) with no encoder_input
        skip_mask = self._skip_mask(encoder_mask)
        if encoder_mask is not None:
            # TODO: remove after PyTorch 2.5 is released
            # This unmasks the skipped rows to avoid NaNs in SPDA Softmax backward
-            # This unmasks the skipped rows to avoid NaNs in SPDA Softmax backward
+            # This unmasks the skipped rows to avoid NaNs in SDPA Softmax backward
Also is this todo only related to encoder_mask? Or can we remove the usage of skip_mask altogether when 2.5 is released? I had assumed it was the latter
We still need skip_mask, we just don't need this extra step of updating the encoder_mask. The update would make skip_mask optional as you wouldn't have to worry about NaN's but you still get different behavior as you're not masking out the ffwd or output in attention.
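A hedged sketch of the pre-2.5 workaround being discussed (the exact tensor ops are assumptions): rows of encoder_mask that are entirely False belong to tokens with no encoder input, and fully masked rows make SDPA's softmax backward produce NaNs, so those rows are temporarily unmasked here while skip_mask zeroes their attention output afterwards.

```python
if encoder_mask is not None:
    # Unmask the rows flagged by skip_mask so softmax backward stays finite.
    encoder_mask = encoder_mask.masked_fill(skip_mask, True)
```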
-logits = self._model(tokens, mask=mask, input_pos=input_pos)
+logits = self._model(**batch)
Now that we're fully in dictionary land here, we should think about doing some key validation down the line (or at least giving very clear documentation on what the expected fields are)
We don't strictly need the dictionary unpacking here since we have a standard input for TransformerDecoder, but I think this is much cleaner and makes future changes easier.
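A hypothetical key-validation sketch for the "down the line" idea above (not part of this PR, and not the recipe's actual contract); it simply checks the batch keys against the model's forward signature before unpacking:

```python
import inspect

allowed = set(inspect.signature(self._model.forward).parameters)
unexpected = set(batch) - allowed
if unexpected:
    raise ValueError(f"Batch contains keys the model forward does not accept: {unexpected}")

logits = self._model(**batch)
```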
Context
What is the purpose of this PR?
Generalize full_finetune_single_device.py to be compatible with Flamingo. The changes in the recipe are small but required a number of bug fixes for multimodal datasets and fusion models. This PR is a grab bag of the bug fixes that were required to get the recipe working with Flamingo.
Test plan
Please make sure to do each of the following if applicable to your PR. (If you're not sure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.)
- pre-commit install
- pytest tests
- pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Example of docstring:
torchtune/torchtune/modules/vision_transformer.py, line 285 (commit 6a7951f)
Example in our docs: https://pytorch.org/torchtune/main/tutorials/qat_finetune.html#applying-qat-to-llama3-models