Multimodal dataset builder + docs #1667
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1667
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit dc7204a with merge base c3ff864.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
```yaml
dataset:
  _component_: torchtune.datasets.multimodal.multimodal_chat_dataset
  source: json
  data_files: data/my_data.json
```
can we add a dummy 2-example file of what my_data.json should look like?
that's shown above?
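For reference, here is a minimal two-record sketch of what `data/my_data.json` could look like, assuming ShareGPT-style records whose keys match the column mapping in the config that follows (`conversations` for the dialogue turns, `image` for the image path relative to `image_dir`); the filenames and text are hypothetical:

```json
[
  {
    "image": "images/0001.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nWhat is shown in this photo?"},
      {"from": "gpt", "value": "A dog catching a frisbee in a park."}
    ]
  },
  {
    "image": "images/0002.jpg",
    "conversations": [
      {"from": "human", "value": "<image>\nDescribe the scene."},
      {"from": "gpt", "value": "A crowded farmers market on a sunny morning."}
    ]
  }
]
```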
```yaml
  dialogue: conversations
  image_path: image
  image_dir: /home/user/dataset/
  image_tag: "<image>"
```
yes, it will lead to unexpected behaviors; the burden is on the user to set this correctly. I don't know if there's an easy way to check for those edge cases.
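One possible mitigation (a hypothetical sketch, not part of this PR) is to fail fast by validating the joined paths before training starts; the function name and keys below are illustrative:

```python
import json
from pathlib import Path

def validate_image_paths(data_file: str, image_dir: str, image_key: str = "image") -> None:
    """Fail fast if any sample references an image missing from image_dir."""
    records = json.loads(Path(data_file).read_text())
    missing = [r[image_key] for r in records
               if image_key in r and not (Path(image_dir) / r[image_key]).is_file()]
    if missing:
        raise FileNotFoundError(
            f"{len(missing)} image(s) not found under {image_dir}, e.g. {missing[:3]}"
        )

# Example: validate_image_paths("data/my_data.json", "/home/user/dataset/")
```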
```diff
@@ -335,14 +336,22 @@ class ShareGPTToMessages(Transform):
         ]

     Args:
-        train_on_input (bool): whether the prompt should remain unmasked. Default: False
+        train_on_input (bool): whether the prompt should remain unmasked. For multimodal datasets, ``train_on_input``
```
can we add why this is always false?
I'm not sure really, I think it's because it's not functionally useful? @pbontrager
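For background, `train_on_input` conventionally controls whether prompt tokens contribute to the loss. A simplified sketch of the usual masking pattern (the ignore value and function are illustrative, not torchtune's actual code):

```python
IGNORE_INDEX = -100  # a common cross-entropy ignore value; the name is illustrative

def build_labels(tokens: list[int], is_prompt: list[bool], train_on_input: bool) -> list[int]:
    """With train_on_input=False, prompt (user input) tokens are masked out of
    the loss, so the model is only trained on the assistant's responses."""
    if train_on_input:
        return list(tokens)  # compute loss on every token, prompt included
    return [IGNORE_INDEX if p else t for t, p in zip(tokens, is_prompt)]
```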
```python
        return {"messages": messages}
```
```python
# TODO: point to Flamingo model transform as an example
```
do we need to address this TODO?
after certain events have transpired
```python
    **load_dataset_kwargs: Dict[str, Any],
) -> SFTDataset:
    """
    Configure a text+image dataset with conversations between user and model assistant.
```
nit: It bothers me a bit that it's called multimodal, but it actually means text+image. Not asking for changes, just sharing.
I concur.
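For illustration, here is a sketch of how the builder might be invoked directly in Python, assuming the YAML keys above map one-to-one onto builder kwargs (the model transform is a placeholder, since the concrete class depends on the model being fine-tuned):

```python
from torchtune.datasets.multimodal import multimodal_chat_dataset

# Placeholder: in a recipe this is the model's tokenizer/transform.
model_transform = ...

ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="json",
    data_files="data/my_data.json",
    dialogue="conversations",
    image_path="image",
    image_dir="/home/user/dataset/",
    image_tag="<image>",
)
```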
very nice! thanks for putting this up. I didn't have time to finish reviewing `multimodal_chat_dataset`. Approving to unblock.
Context
What is the purpose of this PR?
There is currently no way to load a custom multimodal dataset. The current workflow is that users write their own transform and builder - which is already a big lift - but on top of that, our custom component imports don't even work (#1540). So basically, it's impossible to use anything other than llava instruct and cauldron. This sucks.
I added a quick `multimodal_chat_dataset` builder. It assumes conversational data with images, much like llava instruct and ShareGPT4V. To keep changes minimal, I only support ShareGPT format. We can expand to other chat formats or to instruct data later.

This unfortunately required me to generalize LlavaInstructToMessages (cc @joecummings, I agree we shouldn't generalize too early, but in this case I think it's warranted). Again, to keep changes minimal, I simply merged LlavaInstructToMessages with ShareGPTToMessages. They were identical in handling text; the only difference was the image loading logic.

With this builder, I finally have something to point to and discuss in the multimodal dataset docs, and users have an avenue to fine-tune multimodal models on their own data. A bit of extra work, but worthwhile imo.
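To make the merge concrete, here is a rough, framework-agnostic sketch of what the unified transform does - ShareGPT text handling plus optional image loading - with illustrative keys and fields; it is not the PR's actual code:

```python
from pathlib import Path
from PIL import Image

ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def sharegpt_to_messages(sample: dict, image_dir: str | None = None) -> dict:
    """Convert one ShareGPT-style record to a list of role/content messages,
    loading the referenced image (if any) for the first user turn."""
    messages, image_attached = [], False
    for turn in sample["conversations"]:
        message = {"role": ROLE_MAP[turn["from"]], "content": turn["value"]}
        # Image loading is the only multimodal-specific step: attach the PIL
        # image to the first user turn when the sample references one.
        if (image_dir is not None and "image" in sample
                and message["role"] == "user" and not image_attached):
            message["image"] = Image.open(Path(image_dir) / sample["image"])
            image_attached = True
        messages.append(message)
    return {"messages": messages}
```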
Changelog
What are the changes made in this PR?
- `multimodal_chat_dataset` builder

Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
- pre-commit hooks (`pre-commit install`)
- `pytest tests`
- `pytest tests -m integration_test`
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.