Multimodal dataset builder + docs #1667

Merged
RdoubleA merged 5 commits into pytorch:main from mm_dataset_docs on Sep 25, 2024
Conversation

@RdoubleA (Contributor) commented Sep 24, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

There is currently no way to load a custom multimodal dataset. The current workflow is that users write their own transform and builder - which is already a big lift - and on top of that our custom component imports don't even work (#1540). So in practice it's impossible to use anything other than llava instruct and cauldron. This sucks.

I added a quick multimodal_chat_dataset builder. It assumes conversational data with images, much like llava instruct and ShareGPT4V. To keep changes minimal, I only support the ShareGPT format; we can expand to other chat formats or to instruct data later.
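To make the expected data concrete, here is a sketch of what a two-example file in that format might look like (the paths, captions, and "image"/"conversations" key names are illustrative, not a canonical schema; a column_map can remap differently named columns):

import json
from pathlib import Path

# Hypothetical two-example ShareGPT-style file with images.
examples = [
    {
        "image": "images/dog.jpg",
        "conversations": [
            {"from": "human", "value": "<image>What is in this photo?"},
            {"from": "gpt", "value": "A golden retriever playing in a park."},
        ],
    },
    {
        "image": "images/chart.png",
        "conversations": [
            {"from": "human", "value": "<image>Summarize this chart."},
            {"from": "gpt", "value": "Sales rose steadily from January to June."},
        ],
    },
]

Path("data").mkdir(exist_ok=True)
with open("data/my_data.json", "w") as f:
    json.dump(examples, f, indent=2)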

This unfortunately required me to generalize LlavaInstructToMessages (cc @joecummings, I agree we shouldn't generalize too early, but in this case I think it's warranted). Again, to keep changes minimal, I simply merged LlavaInstructToMessages with ShareGPTToMessages. They were identical in handling text; the only difference was the image loading logic.
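Roughly, the merged logic reduces to something like this (a simplified sketch with hypothetical names, not the actual torchtune code):

from PIL import Image

def to_message_content(text, image_path=None, image_dir=None):
    # Sketch of the merged branch: text handling was identical in both old
    # transforms; only this image-loading step differed between them.
    if image_path is None:
        # Text-only sample (the old ShareGPTToMessages behavior).
        return [{"type": "text", "content": text}]
    # Image sample (the old LlavaInstructToMessages behavior): load the image
    # from disk and prepend it to the message content.
    full_path = f"{image_dir}/{image_path}" if image_dir else image_path
    return [
        {"type": "image", "content": Image.open(full_path)},
        {"type": "text", "content": text},
    ]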

With this builder, I finally have something to point to and discuss in the multimodal dataset docs, and users have an avenue to finetune multimodal models on their own data. A bit of extra work, but worthwhile imo.

Changelog

What are the changes made in this PR?

  • Merge ShareGPTToMessages and LlavaInstructToMessages
  • Merge any relevant tests
  • Add multimodal_chat_dataset builder
  • Add doc page on multimodal datasets

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Sep 24, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1667

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit dc7204a with merge base c3ff864:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Sep 24, 2024
dataset:
  _component_: torchtune.datasets.multimodal.multimodal_chat_dataset
  source: json
  data_files: data/my_data.json
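The equivalent in Python might look like the following (a sketch assuming the builder mirrors torchtune's other dataset builders in taking a model transform plus load_dataset kwargs; model_transform is a placeholder for your model's multimodal transform):

from torchtune.datasets.multimodal import multimodal_chat_dataset

# Sketch only: model_transform stands in for the multimodal transform
# (tokenizer + image transform) paired with your model.
model_transform = ...
ds = multimodal_chat_dataset(
    model_transform=model_transform,
    source="json",
    data_files="data/my_data.json",
    split="train",
)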
Contributor

can we add a dummy 2-example file showing what my_data.json should look like?

Contributor Author

that's shown above?

dialogue: conversations
image_path: image
image_dir: /home/user/dataset/
image_tag: "<image>"
Contributor

noob question: will this ever break if the user adds <image> to their prompt, or is <image> a special token, such that the tokenizer understands whether it comes from the user or from the ChatML?

Contributor Author

yes, it will lead to unexpected behaviors; the burden is on the user to set this correctly. I don't know if there's an easy way to check for those edge cases.
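As a standalone illustration of the edge case (not the actual transform code): if content is split on image_tag to interleave images with text, a literal tag typed by the user is indistinguishable from a real placeholder:

def split_on_image_tag(text, image_tag="<image>"):
    # Naive placeholder handling: every occurrence of the tag marks an image slot.
    return text.split(image_tag)

# Intended use: one leading placeholder marking where the image goes.
split_on_image_tag("<image>What is in this photo?")
# -> ['', 'What is in this photo?']

# Edge case: the user literally writes "<image>" in their prompt; the transform
# can't distinguish it from the placeholder and splits in both places.
split_on_image_tag("<image>Explain what the <image> token does.")
# -> ['', 'Explain what the ', ' token does.']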

@@ -335,14 +336,22 @@ class ShareGPTToMessages(Transform):
         ]

     Args:
-        train_on_input (bool): whether the prompt should remain unmasked. Default: False
+        train_on_input (bool): whether the prompt should remain unmasked. For multimodal datasets, ``train_on_input``
Contributor

can we add why this is always false?

Contributor Author

I'm not sure really, I think it's because it's not functionally useful? @pbontrager
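For context on what the flag controls, independent of why it's forced off here: a minimal masking sketch (the ignore index matches torchtune's CROSS_ENTROPY_IGNORE_IDX; everything else is simplified and hypothetical):

CROSS_ENTROPY_IGNORE_IDX = -100  # label value the loss ignores, as in torchtune.data

def build_labels(tokens, is_prompt, train_on_input):
    # With train_on_input=True, prompt tokens keep their labels and contribute
    # to the loss; with False (always the case for multimodal here), they are
    # masked out and only the assistant's tokens are trained on.
    if train_on_input:
        return list(tokens)
    return [CROSS_ENTROPY_IGNORE_IDX if p else t for t, p in zip(tokens, is_prompt)]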


return {"messages": messages}


# TODO: point to Flamingo model transform as an example
Contributor

do we need to address this TODO?

Contributor Author

after certain events have transpired

**load_dataset_kwargs: Dict[str, Any],
) -> SFTDataset:
"""
Configure a text+image dataset with conversations between user and model assistant.
Contributor

nit: It bothers me a bit that it's called multimodal, but it actually means text+image. Not asking for changes, just sharing

Contributor

I concur.

@felipemello1 (Contributor) left a comment

very nice! thanks for putting this up. I didn't have time to finish reviewing multimodal_chat_dataset. Approving to unblock.

torchtune/datasets/multimodal/_multimodal.py
@RdoubleA RdoubleA merged commit 7c59516 into pytorch:main Sep 25, 2024
16 checks passed
@RdoubleA RdoubleA deleted the mm_dataset_docs branch September 25, 2024 00:56
@RdoubleA mentioned this pull request Oct 13, 2024
Labels: CLA Signed
4 participants